Many algorithms in SAMOA have some action triggered by some condition (e.g., call this function every 1000 events). Some of the engines we run on could benefit from having this behavior exposed, in order to optimize their execution (e.g., by using windowing semantics). One such example is Apache Flink.

The goal of this project is to design an API for event triggering that can be used at the ML level to describe actions to be taken upon conditions, and exposed at the System level in order to be available for optimization by the underlying execution engine.
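To make the idea concrete, such a condition/action API could look roughly like the following sketch. The `Trigger` class and its method names are hypothetical, invented for illustration; they are not part of SAMOA or Flink.

```python
class Trigger:
    """Hypothetical event trigger: runs an action when a condition holds.

    This only illustrates the proposed ML-level API shape; an engine like
    Flink could recognize simple conditions (e.g., counts) and map them to
    its own windowing/trigger machinery.
    """
    def __init__(self, condition, action):
        self.condition = condition  # predicate over (event_count, event)
        self.action = action        # callback invoked when condition holds
        self.event_count = 0

    def on_event(self, event):
        self.event_count += 1
        if self.condition(self.event_count, event):
            self.action(self.event_count, event)

# Example: call a function every 1000 events, as in the text above.
fired_at = []
every_1000 = Trigger(
    condition=lambda n, e: n % 1000 == 0,
    action=lambda n, e: fired_at.append(n),
)
for i in range(2500):
    every_1000.on_event(i)
# fired_at is now [1000, 2000]
```

Because the condition is declared as data rather than buried in algorithm code, an execution engine could inspect it and substitute an optimized native trigger.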

Stochastic Gradient Descent (SGD) is a classical optimization algorithm used in many machine learning algorithms. Currently, there are some parallel implementations of the algorithm, but they have two shortcomings: either they are written in C or C++, or they work only for shared-memory systems (or both) [1,2,3]. In any case, they are unsuitable for the modern big data ecosystem.

There are a few implementations in MapReduce; however, MapReduce and Hadoop are not suitable for dealing with streams of data. The goal of this project is to implement a variant of SGD that will integrate with the current big data ecosystem. One recent advance in delay-tolerant SGD seems well suited for implementation on SAMOA [4].

[1] http://www.csie.ntu.edu.tw/~cjlin/papers/libmf.pd

[2] http://hazy.cs.wisc.edu/hazy/victor/Hogwild/

[3] http://hunch.net/~vw/

[4] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43138.pdf
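For reference, the basic sequential SGD update is simple: take one example, compute the gradient of the loss on that example, and step against it. The following toy sketch fits a linear model this way; it is only an illustration of the update rule, not the delay-tolerant variant of [4] nor a SAMOA implementation.

```python
import random

def sgd_linear_regression(data, lr=0.01, epochs=200, seed=0):
    """Toy sequential SGD for least-squares linear regression.

    data: list of (x, y) pairs with scalar x and y.
    Minimizes (w*x + b - y)^2 one example at a time.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                # stochastic: random example order
        for x, y in data:
            err = w * x + b - y          # prediction error on this example
            w -= lr * err * x            # gradient step on the weight
            b -= lr * err                # gradient step on the bias
    return w, b

# Fit y = 2x + 1 from a few noiseless samples.
samples = [(x, 2 * x + 1) for x in [-2, -1, 0, 1, 2, 3]]
w, b = sgd_linear_regression(samples)
# w approaches 2, b approaches 1
```

The parallel versions cited above differ precisely in how these per-example updates are coordinated (or deliberately left uncoordinated, as in Hogwild! [2]) across workers.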

Hoeffding trees (a.k.a. Very Fast Decision Trees) [1] are decision trees for streaming data. They can be used for classification of unbounded data streams that need to be analyzed very fast. The Vertical Hoeffding Tree is a particular implementation of this algorithm. The VHT is a parallel algorithm that works on distributed streaming environments.

The goal of this project is twofold. First, to extend the current implementation of the VHT to handle “concept drift”, that is, a change in the distribution of the attributes among the classes due to the evolution of the process generating the data [2].

Second, to implement a boosting version of the distributed algorithm [3]. Boosting is a meta-algorithm that trains an ensemble of classifiers. The challenge in parallelizing boosting is that the models are dependent on each other in a linear chain (the output of the first model determines the input to the second model).

[1] http://homes.cs.washington.edu/~pedrod/papers/kdd00.pdf

[2] http://www.lsi.upc.edu/~abifet/R09-9.pdf

[3] http://en.wikipedia.org/wiki/Boosting_(machine_learning)

Regression trees are algorithms for regression, often used to perform classification by simple thresholding. They are the basic building block of one of the most successful modern algorithms for classification, the Gradient Boosted Decision Tree (GBDT) [1]. Several approaches to parallelization have been proposed (e.g., [2]).

The goal of this project is to implement a parallel version of GBDT in SAMOA.

[1] http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf, http://www-stat.stanford.edu/~jhf/ftp/stobst.pdf

[2] http://www.cslu.ogi.edu/~zak/cs506-pslc/sgradboostedtrees.pdf

When dealing with Big Data, the amount of space needed to store it becomes very relevant. There are two main approaches: compression, where we don’t lose anything, or sampling, where we choose the data that is most representative. Using compression, we may take more time and less space, so we can consider it as a transformation from time to space. Using sampling, we lose information, but the gains in space may be of orders of magnitude.

In this project we will use coresets to reduce the complexity of Big Data problems. Coresets are small sets that provably approximate the original data for a given problem [1]. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.

[1] http://people.csail.mit.edu/dannyf/subspace.pdf
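As a degenerate but exact illustration of the merge-reduce scheme: for the 1-mean (centroid) problem, the pair (partial mean, weight) is a coreset of each data chunk, and chunk coresets can be merged pairwise, in parallel. Real coresets for harder problems [1] are small weighted samples with provable error bounds, but the merge structure is the same.

```python
def coreset(chunk):
    """Exact 'coreset' for the 1-mean problem: a single weighted point.

    For harder problems (k-means, subspace approximation), a coreset is a
    small weighted sample that approximates the cost of every candidate
    solution, not just one statistic.
    """
    return (sum(chunk) / len(chunk), len(chunk))

def merge(a, b):
    """Merge two coresets into one; this is the merge-reduce step, and it
    can be applied in any order (e.g., as a parallel reduction tree)."""
    (ma, wa), (mb, wb) = a, b
    w = wa + wb
    return ((ma * wa + mb * wb) / w, w)

data = list(range(100))                     # 100 points of a "stream"
chunks = [data[i:i + 10] for i in range(0, 100, 10)]
coresets = [coreset(c) for c in chunks]     # each chunk independently
merged = coresets[0]
for c in coresets[1:]:
    merged = merge(merged, c)
mean, weight = merged                       # same centroid as the full data
```

Because each chunk is summarized independently and merges are associative, the reduction maps naturally onto a distributed engine.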

Real-time analytics on data streams are needed to manage the data currently generated, at an ever increasing rate, by applications such as: sensor networks, measurements in network monitoring and traffic management, log records or click-streams in web exploring, manufacturing processes, call detail records, email, blogging, twitter posts, and others. In the data stream model, data arrive at high speed, and algorithms that process them must do so under very strict constraints of space and time. An important challenge for data mining algorithm design is to make use of limited resources (time and memory).

In this project we plan to implement streaming structures called sketches to reduce the resources used in stream mining [1,2].

[1] http://blog.aggregateknowledge.com/2011/09/13/streaming-algorithms-and-sketches/

[2] http://github.com/addthis/stream-lib
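One classic example of such a structure is the Count-Min sketch, available for instance in stream-lib [2]: it answers approximate frequency queries over a stream using space independent of the number of distinct items. A minimal, simplified version (the hashing below is illustrative; the theory calls for pairwise-independent hash functions):

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: approximate frequency counts in small,
    fixed space, with one-sided (over-)estimation error."""

    def __init__(self, width=200, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random seed per row; a stand-in for proper hash families.
        self.seeds = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, seed, item):
        return hash((seed, item)) % self.width

    def add(self, item, count=1):
        for row, seed in zip(self.table, self.seeds):
            row[self._index(seed, item)] += count

    def estimate(self, item):
        # The true count never exceeds any row's cell, so take the minimum.
        return min(row[self._index(seed, item)]
                   for row, seed in zip(self.table, self.seeds))

cms = CountMinSketch()
for _ in range(500):
    cms.add("heavy")                 # a frequent item
for i in range(100):
    cms.add("light-%d" % i)          # many rare items
estimate = cms.estimate("heavy")     # never below 500; close to it w.h.p.
```

The width and depth trade space for accuracy and confidence, which is exactly the kind of resource knob that matters under the strict space constraints described above.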