Scalearn


Simple library for statistics, machine learning, natural language processing in Scala


Welcome to Scalearn

This library helps you deal with large amounts of data by making use of statistics, time series analysis, and machine learning. Although these academic domains can seem difficult to most developers, this library targets ordinary Scala developers; accordingly, it focuses on usability rather than precise definitions. I also hope it proves useful to Scala beginners.

Prerequisite

Architecture

Some functions in Scalearn run on Spark, which is built on top of a Hadoop distribution.
The main difference between Hadoop and Spark is that Spark concentrates on tasks that need to load data again and again, such as machine learning algorithms. For such tasks Spark runs much faster (roughly speaking, up to 100 times). Thankfully, this lets you run machine learning algorithms on your distributed system without complicated configuration.

Installation

So far, no sbt installation is available, so please build the library yourself (this is still an early version and needs some polish).

Quick Start


#Machine Learning


Roughly speaking, machine learning has two main directions: supervised learning and unsupervised learning. Supervised learning requires you to collect training data, for instance to construct a classifier, while unsupervised learning does not. Scalearn offers some well-known methods of both kinds, such as k-means, naive Bayes, and so on.


#Natural Language Processing


  • Supervised Classification

  • import scalearn.classification.supervised._
    import scala.collection.mutable.ListBuffer
    
    val file_paths = 
            ListBuffer( 
                ("classA","resource/doc1.txt"),
                ("classA","resource/doc2.txt"),
                ("classA","resource/doc3.txt"),
                ("classB","resource/doc4.txt"),
                ("classB","resource/doc5.txt"),
                ("classB","resource/doc6.txt"),
                ("classC","resource/doc7.txt"),
                ("classC","resource/doc8.txt"),
                ("classC","resource/doc9.txt")
                )
    val nb = NaiveBayes(file_paths)  	
    nb.classify("resource/examine.txt")
    
    val pn = ParallelNaive(file_paths)
    pn.classify("resource/examine.txt")

    As you can see above, both the NaiveBayes and ParallelNaive classes are classifiers that classify incoming data based on past training data. The parameter of both classes is a ListBuffer of pairs (class name the document belongs to, path to the training document). If the amount of data is not too large, it is better to run in memory: use the NaiveBayes class, since it keeps all words, including the training data, in memory. ParallelNaive, on the other hand, only needs to keep the incoming data you want to classify in memory; it is slower than NaiveBayes but can handle much larger data sets. You can also run it on Hadoop thanks to Spark if you want. By default, ParallelNaive uses all the cores on your machine.
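To make the idea concrete, here is a minimal multinomial naive Bayes sketch in plain Scala, working on pre-tokenized documents. This is illustrative only, not Scalearn's implementation: the real NaiveBayes class reads documents from file paths and tokenizes them itself.

```scala
// Model: (log priors per class, log word likelihoods per class, vocabulary)
type Model = (Map[String, Double], Map[String, Map[String, Double]], Set[String])

def train(docs: Seq[(String, Seq[String])]): Model = {
  val vocab = docs.flatMap(_._2).toSet
  val byClass = docs.groupBy(_._1)
  // log prior: fraction of training documents in each class
  val priors = byClass.map { case (c, ds) => c -> math.log(ds.size.toDouble / docs.size) }
  // log likelihood per word, with add-one (Laplace) smoothing
  val likes = byClass.map { case (c, ds) =>
    val counts = ds.flatMap(_._2).groupBy(identity).map { case (w, ws) => w -> ws.size }
    val total = counts.values.sum + vocab.size
    c -> vocab.iterator.map(w => w -> math.log((counts.getOrElse(w, 0) + 1.0) / total)).toMap
  }
  (priors, likes, vocab)
}

def classify(m: Model, tokens: Seq[String]): String = {
  val (priors, likes, vocab) = m
  // pick the class with the highest posterior log probability
  priors.keys.maxBy(c => priors(c) + tokens.filter(vocab).map(likes(c)).sum)
}
```

Keeping everything in log space avoids underflow when documents are long, which is presumably why an in-memory classifier like this stays cheap until the vocabulary itself no longer fits.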

    ※ Documents (data) must be in Japanese for now.

  • Unsupervised Classification => Clustering

  • import scalearn.classification.unsupervised.Clustering
    import scala.collection.mutable.ArrayBuffer
    
    val data = ArrayBuffer(
                    Vector(3.2,4.64,2.21,6.7,4.5),
                    Vector(1.1,2.2,32.3,4.4,5.5),
                    Vector(9.9,8.8,7.73,6.65,5.5),
                    Vector(9.5,4.8,3.7,2.6,53.5),
                    Vector(3.2,4.64,2.21,6.7,4.5),
                    Vector(1.1,2.2,3.3,4.4,5.533),
                    Vector(3.2,4.64,2.21,6.7,43.5),
                    Vector(1.1,2.22,3.322,4.42,5.5),
                    Vector(9.9,82.8,7.7,6.6,5.5)
                    )
                    
    val result: IndexedSeq[VectorCluster] = Clustering.kmeans(2,data,200)

    When you have no idea what the ideal classification result looks like, it is time for unsupervised classification such as clustering. The example above shows how to use the k-means algorithm, one of the clustering methods. In this function the data must be an ArrayBuffer[Vector[Double]], so first import the mutable ArrayBuffer. Then import the Clustering object to call its methods. That's it! The rest of your task is just to call it with proper parameters. Note that the first parameter determines the number of prototypes (cluster centers) chosen from your data; you cannot classify a prototype itself, since it keeps changing during the iterations. In that sense, this parameter determines not only how many classes (or clusters) you create but also how many data points you lose.
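For intuition about what happens inside such a call, here is a minimal, hypothetical k-means sketch in plain Scala (not Scalearn's implementation): alternate between assigning each point to its nearest center and moving each center to the mean of its cluster.

```scala
import scala.util.Random

def kmeans(k: Int, data: Seq[Vector[Double]], iters: Int, seed: Long = 0L): Map[Int, Seq[Vector[Double]]] = {
  def dist(a: Vector[Double], b: Vector[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  // start from k randomly chosen data points
  var centers: IndexedSeq[Vector[Double]] = new Random(seed).shuffle(data).take(k).toIndexedSeq
  var clusters = Map.empty[Int, Seq[Vector[Double]]]
  for (_ <- 1 to iters) {
    // assignment step: group each point under the index of its nearest center
    clusters = data.groupBy(p => centers.indices.minBy(i => dist(p, centers(i))))
    // update step: move each center to the mean of its cluster (keep it if the cluster is empty)
    centers = centers.indices.map { i =>
      clusters.get(i)
        .map(cl => cl.transpose.map(dim => dim.sum / cl.size).toVector)
        .getOrElse(centers(i))
    }
  }
  clusters
}
```

Note the fixed seed: like any k-means, the result depends on the random initialization, which is why the library call above also takes an iteration limit rather than promising a unique answer.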


    #Descriptive Statistics


    If you are interested in statistics, take a look at the "scalearn.statistics" package, since that is what you need. Even compared to other statistical environments, Scala has a variety of simple but powerful functions for processing data; in that sense, you can preprocess raw data yourself with Scala's plain functions such as map, filter, and flatMap. Of course, when the data you have to handle is natural language, Scalearn helps you not only analyze it but also preprocess it.
    The description below is the typical flow of a statistical analysis in Scalearn:

    1. Get raw data
    2. Preprocess the raw data
    3. Convert the raw data into an instance of one of the data classes Scalearn offers, such as data, dase, infds, tsda, tsds, and so on
    4. Obtain statistical values by calling properties and methods

    Well, let's look at some examples.

    First of all, you need to import the "scalearn.statistics" package and the Converter object into your code. They allow you to obtain statistical values easily.

    import scalearn.statistics._
    import scalearn.statistics.Converter._

    and assume you already have raw data (let's just say "raw" to distinguish it from a data instance):

    val raw1:Vector[Double] = Vector(1,2,3,4,5,6)
    val raw2 = Vector(1.0,2,3,4,5)
    val raw3 = Vector(1,2,3,4,5).map(elt=>elt.toDouble)

    OK, let's get started. To begin with, create a data instance (if you need to make a set of data, create an instance of the "dase" class by passing data instances as parameters):

    val data1:data = data(raw1)
    val data2:data = Vector(1,2,3,4,5).toda
    
    val dase1:dase = dase(data1,data2,data1)
    val dase2:dase = new dase(Vector(data1,data2))
    val dase3:dase = Vector(raw1,raw2,raw3).tods

    Get statistical value for data1

    val mean:Double = data1.mean
    
    val standard_deviation:Double = data1.sd
    
    val length:Int = data1.size
    
    val deviation:Vector[Double] = data1.dv
    
    val axis_of_time:Vector[Double] = data1.time
    
    val regression_slope_intercept:(Double,Double) = data1.reg
    
    val regression_line:Double=>Double = data1.regline
    
    val residual:data = data1.resi
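For reference, here is a plain-Scala sketch of what `reg` and `regline` presumably compute (an assumption on my part, not Scalearn's actual code): an ordinary least-squares fit of the values against their implicit time axis 0, 1, 2, …

```scala
// Hypothetical sketch: least-squares slope and intercept of the values
// against the time index 0, 1, 2, ...
def reg(xs: Vector[Double]): (Double, Double) = {
  val n = xs.size
  val t = (0 until n).map(_.toDouble)
  val (tBar, xBar) = (t.sum / n, xs.sum / n)
  val slope = t.zip(xs).map { case (ti, xi) => (ti - tBar) * (xi - xBar) }.sum /
              t.map(ti => (ti - tBar) * (ti - tBar)).sum
  (slope, xBar - slope * tBar)               // (slope, intercept)
}

// regline turns the fitted parameters into a callable line
def regline(xs: Vector[Double]): Double => Double = {
  val (slope, intercept) = reg(xs)
  t => slope * t + intercept
}
```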

    Perhaps you worry about the cost of creating a data instance, since it has a lot of properties; don't worry too much, because Scala initializes most of the properties only when you actually use them, thanks to the "lazy" keyword. Unless you call them, nothing is computed.
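A tiny demonstration of the pattern (illustrative only; this is not Scalearn's actual data class): with `lazy val`, an expensive statistic is computed on first access and then cached.

```scala
class LazyData(values: Vector[Double]) {
  var evaluations = 0                    // counts how many times mean is computed
  lazy val mean: Double = {
    evaluations += 1                     // runs only on the first access
    values.sum / values.size
  }
}
```

Constructing a LazyData therefore costs almost nothing; repeated reads of `mean` evaluate the body exactly once.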

    Then, when you have to handle several kinds of data at once, it's time to use the "dase" class:

    val mean:Vector[Double] = dase1.mean
    
    val standard_deviation:Vector[Double] = dase1.sd
    
    val combinations:Vector[Vector[Double]] = dase1.combi
    
    val covariance:Vector[Double] = dase1.covar
    
    val pearson_correlation:Vector[Double] = dase1.pears
    
    val spearman_correlation:Vector[Double] = dase1.spears
    
    val euclidean_distance:Vector[Double] = dase1.eucli
    
    val regression_parameters:Vector[(Double,Double)] = dase1.reg
    
    val regression_line:Vector[Double=>Double] = dase1.regline

    Scalearn treats data as a sample obtained from a population and assumes that we want to learn about the population (not the sample itself), so "sd" is the estimated standard deviation of the population, i.e. the square root of the unbiased variance. (If you are not familiar with statistics, don't worry and just ignore this; the difference is tiny when the sample size is big enough.)

    If you want the variance and standard deviation of the sample itself:

    val biased_variance = data1.samplevari
    val sample_standard_deviation = data1.samplesd
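The two conventions side by side, as a plain-Scala sketch of the distinction described above: divide by n to describe the sample itself, by n − 1 for the unbiased estimate of the population variance.

```scala
// biased ("sample") variance: divide by n
def sampleVariance(xs: Vector[Double]): Double = {
  val m = xs.sum / xs.size
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}

// unbiased variance: divide by n - 1
def unbiasedVariance(xs: Vector[Double]): Double = {
  val m = xs.sum / xs.size
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}
```

As the sample grows, n and n − 1 converge, which is why the distinction only matters for small samples.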

    #Statistical Inference


    Quite often, we have to decide whether samples were taken from the same population or not; in other words, we should figure out the range of accidental error. The "scalearn.statistics" package offers understandable tools for this, used in almost the same way as before: just create an infds class instance (infds stands for a dase used in statistical inference).

    To begin with, let's make an instance of infds. The infds class takes data instances as parameters, so you need to create them first;
    alternatively, you can implicitly convert a Vector[Vector[Double]] into an infds instance:

    val infer_dase1:infds = infds(data1,data2,data3)
    
    val infer_dase2:infds = Vector(raw1,raw2,raw3).toinf
    val grand_mean :Double = infer_dase1.grandmean
    
    val grand_size :Int = infer_dase1.grandsize
    
    val grand_sum :Double = infer_dase1.grandsum
    
    val effects :Vector[Double] = infer_dase1.effects
    
    val tpair_vector :Vector[(Double,Boolean)] = infer_dase1.tpair
    
    val twelch_vector :Vector[(Double,Boolean)] = infer_dase1.twelch
    
    val anova_result :(Boolean,Double) = infer_dase1.anova
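For orientation, here is a plain-Scala sketch of the Welch t statistic, which is presumably what `twelch` computes before making its significance decision (an assumption; the real property also returns a Boolean verdict). Welch's test compares two sample means without assuming the two populations have equal variances.

```scala
def welchT(a: Vector[Double], b: Vector[Double]): Double = {
  def mean(xs: Vector[Double]) = xs.sum / xs.size
  def vari(xs: Vector[Double]) = {                 // unbiased variance
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }
  // t = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2)
  (mean(a) - mean(b)) / math.sqrt(vari(a) / a.size + vari(b) / b.size)
}
```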

    #Time Series Analysis


    Create an instance of the "tsda" class ( stands for time series data ) or the "tsds" class ( stands for time series dase ). Any of the forms below is available.

    val ts1:tsda = tsda(1.0,2,3,4)
    val ts2:tsda = new tsda(Vector(1.0,2,3,4,5))
    val ts3:tsda = data1.ts
    
    val tsds1:tsds = tsds(ts1,ts2,ts3)
    val tsds2:tsds = new tsds(Vector(ts1,ts2,ts3))
    val tsds3:tsds = dase3.ts

    Transformations such as detrending and differencing return tsda instances, so they can be chained:

    val detrended:tsda = ts1.detrending
    val differenced:tsda = ts1.differencing
    ts1.differencing.differencing.detrending

    Since the tsda class inherits from the data class, you can freely call the properties and methods of the data class on it.
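Plain-Scala sketches of the two transformations used above (illustrative, not Scalearn's implementation): differencing replaces the series with successive differences, and detrending subtracts the fitted least-squares trend line.

```scala
// differencing: x'(t) = x(t+1) - x(t), so the result is one element shorter
def differencing(xs: Vector[Double]): Vector[Double] =
  xs.zip(xs.tail).map { case (a, b) => b - a }

// detrending: subtract the least-squares line fitted against the time index
def detrending(xs: Vector[Double]): Vector[Double] = {
  val n = xs.size
  val t = (0 until n).map(_.toDouble)
  val (tBar, xBar) = (t.sum / n, xs.sum / n)
  val slope = t.zip(xs).map { case (ti, xi) => (ti - tBar) * (xi - xBar) }.sum /
              t.map(ti => (ti - tBar) * (ti - tBar)).sum
  val intercept = xBar - slope * tBar
  xs.zipWithIndex.map { case (x, i) => x - (slope * i + intercept) }
}
```

Both return a series again, which is why calls like `ts1.differencing.differencing.detrending` can be chained in the library.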

    For more examples, we recommend taking a look at the tests in the test directory.