Monday, December 28, 2015

Outlier detection using Spark

Introduction

Detecting outliers is very important, e.g. in fraud detection, in medicine and in intrusion detection systems. One important application is to determine whether a particular test result should lead to medical care or extensive clinical testing. All of these applications require scalable outlier detection algorithms and a computational platform, as datasets can be very large and include both numerical and categorical values.

It is hard to define exactly what an outlier is, but informally it is any observation or item that is somehow distinctly different from the other observations or items. There are several algorithms to detect outliers, but let's take a look here at one of the more straightforward ones. The attribute value frequency (AVF) algorithm scores observations based on the frequencies of their attribute values. The observations that receive the lowest scores are assumed to be outliers, with the hyperparameter $k$ determining the number of outliers.

Spark is an easy-to-use distributed computing platform whose Resilient Distributed Datasets (RDDs) allow efficient iterative algorithms. Spark supports the Java, Python and Scala programming languages, and AWS supports creating Spark clusters with only a couple of mouse clicks. For more information on how Spark achieves its efficiency, please see e.g. Spark: Cluster Computing with Working Sets or Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.

To develop the algorithm and test its scalability, I wrote tools to generate arbitrarily sized test datasets. The tools and the Spark program for AVF can be found here. Using a generated dataset, where you control which points are outliers, is a good way to make sure that your algorithm works as intended.
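As an illustration, a hypothetical generator could plant a few rows of rare attribute values among otherwise frequent ones. The file name, value sets and counts below are made up, and the actual tools linked above may work differently:

import csv
import random

n_rows, n_attrs, n_outliers = 100000, 10, 20
frequent_values = ["a", "b", "c", "d"]  # values that occur often

with open("testdata.csv", "w") as fh:
    writer = csv.writer(fh)
    for i in range(n_rows):
        if i < n_outliers:
            # planted outlier: rare values that occur nowhere else
            writer.writerow(["rare-%d-%d" % (i, l) for l in range(n_attrs)])
        else:
            # normal point: frequent values drawn at random
            writer.writerow([random.choice(frequent_values) for l in range(n_attrs)])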

Attribute Value Frequency algorithm

The attribute value frequency algorithm scores each point using

\begin{equation} \mathrm{AVFscore}(x_{i}) = \frac{1}{m}\sum_{l=1}^{m}f(x_{il}), \end{equation} where $m$ is the number of attributes, $x_{il}$ is the $l$-th attribute value of $x_{i}$ and $f(x_{il})$ is the number of times the $l$-th attribute value of $x_{i}$ appears in the dataset. For example, if a point has three attribute values that occur 5, 2 and 8 times in the dataset, its AVF score is $(5+2+8)/3 = 5$.

AVF algorithm pseudocode:
Input: dataset D, number of outliers k
Output: k detected outliers

label all data points as non-outliers
calculate frequency of each attribute value
foreach point x(i)
  AVFscore(i) = compute AVF score for point x(i)
end foreach
return the k points with minimum AVFscore


Using Spark this can be expressed as follows.
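The snippet below is a minimal PySpark sketch of the algorithm, assuming a CSV input of categorical attributes on S3; the input path and the value of k are placeholders, and the actual program linked above may differ in detail:

from pyspark import SparkContext

sc = SparkContext(appName="AVF")
k = 10  # number of outliers to return

# Each point is a list of categorical attribute values.
# Cached because the RDD is traversed twice below.
points = sc.textFile("s3://my-bucket/testdata.csv") \
           .map(lambda line: line.split(",")).cache()

# Count how many times each (attribute index, attribute value) pair occurs.
freqs = points.flatMap(lambda p: [((l, v), 1) for l, v in enumerate(p)]) \
              .reduceByKey(lambda a, b: a + b) \
              .collectAsMap()
freqs_bc = sc.broadcast(freqs)

# AVF score of a point: mean frequency of its attribute values.
def avf_score(p):
    f = freqs_bc.value
    return sum(f[(l, v)] for l, v in enumerate(p)) / float(len(p))

# The k points with the lowest scores are reported as outliers.
outliers = points.map(lambda p: (avf_score(p), p)).takeOrdered(k)

Here the frequency table is collected to the driver and broadcast to the workers, which is fine for categorical data with a moderate number of distinct values; with very high-cardinality attributes a join-based approach would be needed instead.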

Running Spark on AWS

An excellent tutorial on submitting Spark jobs to AWS EMR using the CLI can be found here.

First one needs to create a Spark cluster and then add a step to run the code. The code to be run needs to be uploaded to S3, and it is advisable to also write the results to S3.
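Roughly, the CLI calls look like the following; the cluster name, release label, key name, instance settings, cluster id and S3 paths are placeholders, so check the EMR documentation for current details:

aws emr create-cluster --name "AVF cluster" --release-label emr-4.2.0 \
    --applications Name=Spark --ec2-attributes KeyName=my-key \
    --instance-type m3.xlarge --instance-count 3 --use-default-roles

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
    --steps Type=Spark,Name=AVF,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/avf.py]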

In this case there is no need to install custom software to the cluster. However, using the public DNS name and the ssh key selected when the cluster was created, you can log in directly to the master or slave nodes to install needed software like scikit-learn or OpenCV. Note that you must first allow ssh connections (port 22) from your IP address to the nodes. If you wish to allow connections from everywhere, set the IP address to '0.0.0.0/0'.
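For example, with a placeholder key file and host name (EMR nodes accept the hadoop user by default):

ssh -i my-key.pem hadoop@ec2-xx-xx-xx-xx.compute.amazonaws.com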

Conclusions

Spark allows straightforward implementations of certain classes of outlier detection algorithms. AWS provides an easy-to-use environment for dynamically scaling your computations to arbitrary size. There are some hiccups when setting up the cluster in AWS, but mostly it is an enjoyable experience.