Scalearn


Simple library for statistics, machine learning, natural language processing in Scala


Welcome to Scalearn

This library helps you deal with large amounts of data by making use of statistics, time series analysis, and machine learning. Although these academic domains can seem difficult to most developers, this library targets ordinary Scala developers; accordingly, it focuses on usability rather than precise definitions. I also hope it proves useful to Scala beginners.

Prerequisite

Architecture

Some functions in Scalearn run on Spark, which is built on top of a Hadoop distribution.
The main difference between Hadoop and Spark is that Spark concentrates on tasks that need to load data again and again, such as machine learning algorithms. For such tasks Spark runs much faster (roughly speaking, up to 100 times). Thankfully, this lets you run machine learning algorithms on your distributed system without complicated configuration.

Installation

So far, no sbt installation is available, so please build the library yourself (this is still an early version and needs some polish).

Quick Start


#Machine Learning


Roughly speaking, machine learning has two main directions: supervised learning and unsupervised learning. Supervised learning requires you to collect training data, for instance to construct a classifier, while unsupervised learning does not. Scalearn offers some well-known methods of both kinds, such as k-means, naive Bayes, and so on.


#Natural Language Processing


  • Supervised Classification

  • import scalearn.classification.supervised._
    import scala.collection.mutable.ListBuffer
    
    val file_paths = 
            ListBuffer( 
                ("classA","resource/doc1.txt"),
                ("classA","resource/doc2.txt"),
                ("classA","resource/doc3.txt"),
                ("classB","resource/doc4.txt"),
                ("classB","resource/doc5.txt"),
                ("classB","resource/doc6.txt"),
                ("classC","resource/doc7.txt"),
                ("classC","resource/doc8.txt"),
                ("classC","resource/doc9.txt")
                )
    val nb = NaiveBayes(file_paths)  	
    nb.classify("resource/examine.txt")
    
    val pn = ParallelNaive(file_paths)
    pn.classify("resource/examine.txt")

    As you can see above, both the NaiveBayes and ParallelNaive classes are classifiers that classify incoming data based on past training data. The parameter of both classes is a ListBuffer of pairs (class name the document belongs to, path to the training document). If the amount of data is not too large, it is better to run in memory: use the NaiveBayes class, since it keeps all words, including the training data, in memory. ParallelNaive, on the other hand, only needs to keep the incoming data you want to classify in memory; it is slower than NaiveBayes but can handle much larger data sets. You can also run it on Hadoop thanks to Spark if you want. By default, ParallelNaive uses all the cores on your machine.
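To make the idea concrete, here is a minimal multinomial naive Bayes sketch in plain Scala, working on pre-tokenized documents. This is illustrative only, not Scalearn's implementation: the real NaiveBayes class reads documents from file paths and tokenizes them itself.

```scala
// Model: (log priors per class, log word likelihoods per class, vocabulary)
type Model = (Map[String, Double], Map[String, Map[String, Double]], Set[String])

def train(docs: Seq[(String, Seq[String])]): Model = {
  val vocab = docs.flatMap(_._2).toSet
  val byClass = docs.groupBy(_._1)
  // log prior: fraction of training documents in each class
  val priors = byClass.map { case (c, ds) => c -> math.log(ds.size.toDouble / docs.size) }
  // log likelihood per word, with add-one (Laplace) smoothing
  val likes = byClass.map { case (c, ds) =>
    val counts = ds.flatMap(_._2).groupBy(identity).map { case (w, ws) => w -> ws.size }
    val total = counts.values.sum + vocab.size
    c -> vocab.iterator.map(w => w -> math.log((counts.getOrElse(w, 0) + 1.0) / total)).toMap
  }
  (priors, likes, vocab)
}

def classify(m: Model, tokens: Seq[String]): String = {
  val (priors, likes, vocab) = m
  // pick the class with the highest posterior log probability
  priors.keys.maxBy(c => priors(c) + tokens.filter(vocab).map(likes(c)).sum)
}
```

Keeping everything in log space avoids underflow when documents are long, which is presumably why an in-memory classifier like this stays cheap until the vocabulary itself no longer fits.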

    ※ Documents (data) must be in Japanese for now.

  • Unsupervised Classification => Clustering

  • import scalearn.classification.unsupervised.Clustering
    import scala.collection.mutable.ArrayBuffer
    
    val data = ArrayBuffer(
                    Vector(3.2,4.64,2.21,6.7,4.5),
                    Vector(1.1,2.2,32.3,4.4,5.5),
                    Vector(9.9,8.8,7.73,6.65,5.5),
                    Vector(9.5,4.8,3.7,2.6,53.5),
                    Vector(3.2,4.64,2.21,6.7,4.5),
                    Vector(1.1,2.2,3.3,4.4,5.533),
                    Vector(3.2,4.64,2.21,6.7,43.5),
                    Vector(1.1,2.22,3.322,4.42,5.5),
                    Vector(9.9,82.8,7.7,6.6,5.5)
                    )
                    
    val result: IndexedSeq[VectorCluster] = Clustering.kmeans(2,data,200)

    When you have no idea what the ideal classification result looks like, it is time for unsupervised classification such as clustering. The example above shows how to use the k-means algorithm, one of the clustering methods. In this function the data must be an ArrayBuffer[Vector[Double]], so first import the mutable ArrayBuffer. Then import the Clustering object to call its methods. That's it! The rest of your task is just to call it with proper parameters. Note that the first parameter determines the number of prototypes (cluster centers) chosen from your data; you cannot classify a prototype itself, since it keeps changing during the iterations. In that sense, this parameter determines not only how many classes (or clusters) you create but also how many data points you lose.
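For intuition about what happens inside such a call, here is a minimal, hypothetical k-means sketch in plain Scala (not Scalearn's implementation): alternate between assigning each point to its nearest center and moving each center to the mean of its cluster.

```scala
import scala.util.Random

def kmeans(k: Int, data: Seq[Vector[Double]], iters: Int, seed: Long = 0L): Map[Int, Seq[Vector[Double]]] = {
  def dist(a: Vector[Double], b: Vector[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  // start from k randomly chosen data points
  var centers: IndexedSeq[Vector[Double]] = new Random(seed).shuffle(data).take(k).toIndexedSeq
  var clusters = Map.empty[Int, Seq[Vector[Double]]]
  for (_ <- 1 to iters) {
    // assignment step: group each point under the index of its nearest center
    clusters = data.groupBy(p => centers.indices.minBy(i => dist(p, centers(i))))
    // update step: move each center to the mean of its cluster (keep it if the cluster is empty)
    centers = centers.indices.map { i =>
      clusters.get(i)
        .map(cl => cl.transpose.map(dim => dim.sum / cl.size).toVector)
        .getOrElse(centers(i))
    }
  }
  clusters
}
```

Note the fixed seed: like any k-means, the result depends on the random initialization, which is why the library call above also takes an iteration limit rather than promising a unique answer.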


    #Descriptive Statistics


    If you are interested in statistics, take a look at the "scalearn.statistics" package, since that is what you need. Even compared to other statistical environments, Scala has a variety of simple but powerful functions for processing data; in that sense, you can preprocess raw data yourself with Scala's plain functions such as map, filter, and flatMap. Of course, when the data you have to handle is natural language, Scalearn helps you not only analyze it but also preprocess it.
    The description below is the typical flow of a statistical analysis in Scalearn:

    1. Get raw data
    2. Preprocess the raw data
    3. Convert the raw data into an instance of one of the data classes Scalearn offers, such as data, dase, infds, tsda, tsds, and so on
    4. Obtain statistical values by calling properties and methods

    Well, let's look at some examples.

    First of all, you need to import the "scalearn.statistics" package and the Converter object into your code. They allow you to obtain statistical values easily.

    import scalearn.statistics._
    import scalearn.statistics.Converter._

    and assume you already have raw data (let's just say "raw" to distinguish it from a data instance):

    val raw1:Vector[Double] = Vector(1,2,3,4,5,6)
    val raw2 = Vector(1.0,2,3,4,5)
    val raw3 = Vector(1,2,3,4,5).map(elt=>elt.toDouble)

    OK, let's get started. To begin with, create a data instance (if you need to make a set of data, create an instance of the "dase" class by passing data instances as parameters):

    val data1:data = data(raw1)
    val data2:data = Vector(1,2,3,4,5).toda
    
    val dase1:dase = dase(data1,data2,data1)
    val dase2:dase = new dase(Vector(data1,data2))
    val dase3:dase = Vector(raw1,raw2,raw3).tods

    Get statistical value for data1

    val mean:Double = data1.mean
    
    val standard_deviation:Double = data1.sd
    
    val length:Int = data1.size
    
    val deviation:Vector[Double] = data1.dv
    
    val axis_of_time:Vector[Double] = data1.time
    
    val regression_slope_intercept:(Double,Double) = data1.reg
    
    val regression_line:Double=>Double = data1.regline
    
    val residual:data = data1.resi
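For reference, here is a plain-Scala sketch of what `reg` and `regline` presumably compute (an assumption on my part, not Scalearn's actual code): an ordinary least-squares fit of the values against their implicit time axis 0, 1, 2, …

```scala
// Hypothetical sketch: least-squares slope and intercept of the values
// against the time index 0, 1, 2, ...
def reg(xs: Vector[Double]): (Double, Double) = {
  val n = xs.size
  val t = (0 until n).map(_.toDouble)
  val (tBar, xBar) = (t.sum / n, xs.sum / n)
  val slope = t.zip(xs).map { case (ti, xi) => (ti - tBar) * (xi - xBar) }.sum /
              t.map(ti => (ti - tBar) * (ti - tBar)).sum
  (slope, xBar - slope * tBar)               // (slope, intercept)
}

// regline turns the fitted parameters into a callable line
def regline(xs: Vector[Double]): Double => Double = {
  val (slope, intercept) = reg(xs)
  t => slope * t + intercept
}
```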

    Perhaps you worry about the cost of creating a data instance, since it has a lot of properties; don't worry too much, because Scala initializes most of the properties only when you actually use them, thanks to the "lazy" keyword. Unless you call them, nothing is computed.
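A tiny demonstration of the pattern (illustrative only; this is not Scalearn's actual data class): with `lazy val`, an expensive statistic is computed on first access and then cached.

```scala
class LazyData(values: Vector[Double]) {
  var evaluations = 0                    // counts how many times mean is computed
  lazy val mean: Double = {
    evaluations += 1                     // runs only on the first access
    values.sum / values.size
  }
}
```

Constructing a LazyData therefore costs almost nothing; repeated reads of `mean` evaluate the body exactly once.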

    Then, when you have to handle several kinds of data at once, it's time to use the "dase" class:

    val mean:Vector[Double] = dase1.mean
    
    val standard_deviation:Vector[Double] = dase1.sd
    
    val combinations:Vector[Vector[Double]] = dase1.combi
    
    val covariance:Vector[Double] = dase1.covar
    
    val pearson_correlation:Vector[Double] = dase1.pears
    
    val spearman_correlation:Vector[Double] = dase1.spears
    
    val euclidean_distance:Vector[Double] = dase1.eucli
    
    val regression_parameters:Vector[(Double,Double)] = dase1.reg
    
    val regression_line:Vector[Double=>Double] = dase1.regline

    Scalearn treats data as a sample obtained from a population and assumes that we want to learn about the population (not the sample itself), so "sd" is the estimated standard deviation of the population, i.e. the square root of the unbiased variance. (If you are not familiar with statistics, don't worry and just ignore this; the difference is tiny when the sample size is big enough.)

    If you want the variance and standard deviation of the sample itself:

    val biased_variance = data1.samplevari
    val sample_standard_deviation = data1.samplesd
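The two conventions side by side, as a plain-Scala sketch of the distinction described above: divide by n to describe the sample itself, by n − 1 for the unbiased estimate of the population variance.

```scala
// biased ("sample") variance: divide by n
def sampleVariance(xs: Vector[Double]): Double = {
  val m = xs.sum / xs.size
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}

// unbiased variance: divide by n - 1
def unbiasedVariance(xs: Vector[Double]): Double = {
  val m = xs.sum / xs.size
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}
```

As the sample grows, n and n − 1 converge, which is why the distinction only matters for small samples.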

    #Statistical Inference


    Quite often, we have to decide whether samples were taken from the same population or not; in other words, we should figure out the range of accidental error. The "scalearn.statistics" package offers understandable tools for this, used in almost the same way as before: just create an infds class instance (infds stands for a dase used in statistical inference).

    To begin with, let's make an instance of infds. The infds class takes data instances as parameters, so you need to create them first;
    alternatively, you can implicitly convert a Vector[Vector[Double]] into an infds instance:

    val infer_dase1:infds = infds(data1,data2,data3)
    
    val infer_dase2:infds = Vector(raw1,raw2,raw3).toinf
    val grand_mean :Double = infer_dase1.grandmean
    
    val grand_size :Int = infer_dase1.grandsize
    
    val grand_sum :Double = infer_dase1.grandsum
    
    val effects :Vector[Double] = infer_dase1.effects
    
    val tpair_vector :Vector[(Double,Boolean)] = infer_dase1.tpair
    
    val twelch_vector :Vector[(Double,Boolean)] = infer_dase1.twelch
    
    val anova_result :(Boolean,Double) = infer_dase1.anova
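For orientation, here is a plain-Scala sketch of the Welch t statistic, which is presumably what `twelch` computes before making its significance decision (an assumption; the real property also returns a Boolean verdict). Welch's test compares two sample means without assuming the two populations have equal variances.

```scala
def welchT(a: Vector[Double], b: Vector[Double]): Double = {
  def mean(xs: Vector[Double]) = xs.sum / xs.size
  def vari(xs: Vector[Double]) = {                 // unbiased variance
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }
  // t = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2)
  (mean(a) - mean(b)) / math.sqrt(vari(a) / a.size + vari(b) / b.size)
}
```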

    #Time Series Analysis


    Create an instance of the "tsda" class ( stands for time series data ) or the "tsds" class ( stands for time series dase ). Any of the forms below is available.

    val ts1:tsda = tsda(1.0,2,3,4)
    val ts2:tsda = new tsda(Vector(1.0,2,3,4,5))
    val ts3:tsda = data1.ts
    
    val tsds1:tsds = tsds(ts1,ts2,ts3)
    val tsds2:tsds = new tsds(Vector(ts1,ts2,ts3))
    val tsds3:tsds = dase3.ts

    Transformations such as detrending and differencing return tsda instances, so they can be chained:

    val detrended:tsda = ts1.detrending
    val differenced:tsda = ts1.differencing
    ts1.differencing.differencing.detrending

    Since the tsda class inherits from the data class, you can freely call the properties and methods of the data class on it.
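Plain-Scala sketches of the two transformations used above (illustrative, not Scalearn's implementation): differencing replaces the series with successive differences, and detrending subtracts the fitted least-squares trend line.

```scala
// differencing: x'(t) = x(t+1) - x(t), so the result is one element shorter
def differencing(xs: Vector[Double]): Vector[Double] =
  xs.zip(xs.tail).map { case (a, b) => b - a }

// detrending: subtract the least-squares line fitted against the time index
def detrending(xs: Vector[Double]): Vector[Double] = {
  val n = xs.size
  val t = (0 until n).map(_.toDouble)
  val (tBar, xBar) = (t.sum / n, xs.sum / n)
  val slope = t.zip(xs).map { case (ti, xi) => (ti - tBar) * (xi - xBar) }.sum /
              t.map(ti => (ti - tBar) * (ti - tBar)).sum
  val intercept = xBar - slope * tBar
  xs.zipWithIndex.map { case (x, i) => x - (slope * i + intercept) }
}
```

Both return a series again, which is why calls like `ts1.differencing.differencing.detrending` can be chained in the library.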

    For more examples, we recommend taking a look at the tests in the test directory.