Welcome to Scalearn
This library helps you deal with large amounts of data by making use of statistics, time series analysis, and machine learning. Although such academic domains may seem difficult for most developers, this library targets ordinary Scala developers; accordingly, it focuses on usability rather than precise definitions. Besides, I hope it is useful for Scala beginners.
Prerequisite
Architecture
Some functions in Scalearn run on Spark, which is built on top of a Hadoop distribution
=> spark
The main difference between Hadoop and Spark is that Spark concentrates on tasks that need to load data again and again, such as machine learning algorithms. For such tasks Spark runs much faster (roughly up to 100 times). Thankfully, it lets you implement machine learning algorithms running on your distributed system without complicated configuration.
Installation
So far, no sbt installation is available, so please build it on your own (this is still an early version and needs some polish).
Quick Start
#Machine Learning
Roughly speaking, machine learning has two different directions: one is supervised learning, and the other is unsupervised learning. Supervised learning requires you to collect training data in order to construct, for instance, a classifier, while unsupervised learning does not. Scalearn offers some famous methods of both kinds, such as k-means, naive Bayes, and so on.
#Natural Language Processing
import scalearn.classification.supervised._
import scala.collection.mutable.ListBuffer

val file_paths =
  ListBuffer(
    ("classA","resource/doc1.txt"),
    ("classA","resource/doc2.txt"),
    ("classA","resource/doc3.txt"),
    ("classB","resource/doc4.txt"),
    ("classB","resource/doc5.txt"),
    ("classB","resource/doc6.txt"),
    ("classC","resource/doc7.txt"),
    ("classC","resource/doc8.txt"),
    ("classC","resource/doc9.txt")
  )
val nb = NaiveBayes(file_paths)
nb.classify("resource/examine.txt")
val pn = ParallelNaive(file_paths)
pn.classify("resource/examine.txt")
As you can see above, both the NaiveBayes and ParallelNaive classes are classifiers that classify incoming new data based on past training data. The parameter of both classes is a ListBuffer of pairs (class name the document belongs to, path to the training document). If the amount of data is not very large, it is better to run in memory: use the NaiveBayes class as a classifier, because it keeps all words, including the training data, in memory. On the other hand, ParallelNaive only needs to keep in memory the incoming data you want to classify; it is slower than NaiveBayes, yet it can handle a large amount of data. You can also run it on Hadoop, thanks to Spark, if you want. By default, ParallelNaive utilizes all the cores on your machine.
※ Documents (data) must be in Japanese so far.
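Scalearn's internals are not shown here, but the idea behind a multinomial naive Bayes classifier can be sketched in plain Scala. Everything below (the function name, the toy tokenized documents) is made up for illustration; in Scalearn the tokens would come from Japanese text files:

```scala
// A minimal multinomial naive Bayes sketch with add-one (Laplace) smoothing.
// training: (class label, tokens of one document); query: tokens to classify.
def trainAndClassify(
    training: Seq[(String, Seq[String])],
    query: Seq[String]): String = {
  val byClass = training.groupBy(_._1)
  val vocab   = training.flatMap(_._2).toSet
  byClass.maxBy { case (_, docs) =>
    val tokens = docs.flatMap(_._2)
    val counts = tokens.groupBy(identity).map { case (t, ts) => t -> ts.size }
    // log P(class) + sum over query tokens of log P(token | class)
    val prior = math.log(docs.size.toDouble / training.size)
    prior + query.map { t =>
      math.log((counts.getOrElse(t, 0) + 1.0) / (tokens.size + vocab.size))
    }.sum
  }._1
}

val docs = Seq(
  ("classA", Seq("spark", "hadoop", "cluster")),
  ("classB", Seq("mean", "variance", "regression"))
)
trainAndClassify(docs, Seq("variance", "mean"))   // → "classB"
```

This is only the textbook algorithm; the real classes also handle tokenization and, in ParallelNaive's case, distribute the counting.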
import scalearn.classification.unsupervised.Clustering
import scala.collection.mutable.ArrayBuffer
val data = ArrayBuffer(
  Vector(3.2,4.64,2.21,6.7,4.5),
  Vector(1.1,2.2,32.3,4.4,5.5),
  Vector(9.9,8.8,7.73,6.65,5.5),
  Vector(9.5,4.8,3.7,2.6,53.5),
  Vector(3.2,4.64,2.21,6.7,4.5),
  Vector(1.1,2.2,3.3,4.4,5.533),
  Vector(3.2,4.64,2.21,6.7,43.5),
  Vector(1.1,2.22,3.322,4.42,5.5),
  Vector(9.9,82.8,7.7,6.6,5.5)
)
val result: IndexedSeq[VectorCluster] = Clustering.kmeans(2,data,200)
When you have no idea what the ideal result of classification should be, it's time to use unsupervised classification such as clustering. The example above shows how to use the k-means algorithm, one of the clustering methods. The data must have the form ArrayBuffer[Vector[Double]] in this function, so you need to import mutable ArrayBuffer first of all. Then import the Clustering object to call its method. That's it!! The rest of your task is just to call it with proper parameters. As a note, the parameter numberOfPrototype (the first argument) determines the number of data points treated as prototypes, and you cannot classify a prototype itself, since it changes again and again; in that sense, this parameter determines not only how many classes (or clusters) you create but also how many data points you lose.
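Scalearn's kmeans implementation is not reproduced here, but the core loop (assign each point to the nearest centroid, then recompute the centroids) can be sketched in plain Scala; the function name, initialization strategy, and sample points below are illustrative only:

```scala
// A minimal k-means sketch: fixed iteration count, Euclidean distance,
// naive "first k points" initialization.
def kmeansSketch(k: Int, data: Vector[Vector[Double]], iters: Int): Vector[Vector[Double]] = {
  def dist(a: Vector[Double], b: Vector[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  def centroid(ps: Vector[Vector[Double]]): Vector[Double] =
    ps.transpose.map(col => col.sum / col.size)

  var centroids = data.take(k)
  for (_ <- 1 to iters) {
    // Group every point under its nearest current centroid …
    val clusters = data.groupBy(p => centroids.minBy(c => dist(p, c)))
    // … then move each centroid to the mean of its cluster (or keep it if empty).
    centroids = centroids.map(c => clusters.get(c).map(centroid).getOrElse(c))
  }
  centroids
}

val pts = Vector(Vector(0.0, 0.0), Vector(0.1, 0.0), Vector(9.0, 9.0), Vector(9.1, 9.0))
kmeansSketch(2, pts, 20)   // two centroids near (0.05, 0.0) and (9.05, 9.0)
```

The real method additionally packages the result as IndexedSeq[VectorCluster] so you can see which points landed in which cluster.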
#Descriptive Statistics
If you are interested in statistics, take a look at the "scalearn.statistics" package, since it is what you need. Even compared to other statistical environments, Scala has a variety of simple but powerful functions for processing data; in that sense, you can preprocess raw data on your own with Scala's plain functions such as map, filter, and flatMap. Of course, when you have to handle natural language as data, Scalearn helps you not only analyze it but also preprocess it.
The description below shows the common flow of statistical analysis in Scalearn:
- Get raw data
- Preprocess the raw data
- Convert the raw data into an instance of one of the data classes Scalearn offers, such as data, dase, infds, tsda, tsds, and so on
- Obtain statistical values by calling properties and methods
Well, let's look at some examples.
First of all, you need to import the "scalearn.statistics" package and the "Converter" object into your code. They allow you to obtain statistical values easily.
import scalearn.statistics._
import scalearn.statistics.Converter._
and assume you already have raw data (let's just call it "raw" to distinguish it from a data instance)
val raw1:Vector[Double] = Vector(1,2,3,4,5,6)
val raw2 = Vector(1.0,2,3,4,5)
val raw3 = Vector(1,2,3,4,5).map(elt=>elt.toDouble)
OK, let's get started. To begin with, create a data instance (if you need to make a set of data, generate an instance of the "dase" class by passing parameters [ data* ])
val data1:data = data(raw1)
val data2:data = Vector(1,2,3,4,5).toda
val dase1:dase = dase(data1,data2,data1)
val dase2:dase = new dase(Vector(data1,data2))
val dase3:dase = Vector(raw1,raw2,raw3).tods
Get statistical value for data1
val mean:Double = data1.mean
val standard_deviation:Double = data1.sd
val length:Int = data1.size
val deviation:Vector[Double] = data1.dv
val axis_of_time:Vector[Double] = data1.time
val regression_slope_intercept:(Double,Double) = data1.reg
val regression_line:Double=>Double = data1.regline
val residual:data = data1.resi
Perhaps you worry about the cost of creating a data instance, since it possesses a lot of properties; don't worry too much, because Scala initializes most of the properties only when you actually use them, thanks to the "lazy" keyword. Unless you call them, nothing is computed.
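This lazy-initialization behaviour can be observed in plain Scala; the class below is just an illustration, not part of Scalearn:

```scala
// `lazy val` defers the computation until first access, then caches the result.
class Stats(xs: Vector[Double]) {
  var computed = false               // only here so we can observe when evaluation happens
  lazy val mean: Double = {
    computed = true
    xs.sum / xs.size
  }
}

val s = new Stats(Vector(1.0, 2.0, 3.0))
assert(!s.computed)                  // nothing computed yet
assert(s.mean == 2.0)                // first access triggers the computation
assert(s.computed)                   // … and now it has run (exactly once)
```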
Then, when you have to handle several kinds of data, it's time to use the "dase" class (dase stands for data set)
val mean:Vector[Double] = dase1.mean
val standard_deviation:Vector[Double] = dase1.sd
val combinations:Vector[Vector[Double]] = dase1.combi
val covariance:Vector[Double] = dase1.covar
val pearson_correlation:Vector[Double] = dase1.pears
val spearman_correlation:Vector[Double] = dase1.spears
val euclidean_distance:Vector[Double] = dase1.eucli
val regression_parameters:Vector[(Double,Double)] = dase1.reg
val regression_line:Vector[Double=>Double] = dase1.regline
Scalearn treats data as a sample obtained from a population and assumes that we want to know about the population (not the sample itself), so "sd" is the estimated standard deviation of the population, i.e. the square root of the unbiased variance (if you are not familiar with statistics, don't worry and just ignore this, since the difference is tiny when the sample size is big enough).
If you want the variance and standard deviation of the sample itself
val biased_variance = data1.samplevari
val sample_standard_deviation = data1.samplesd
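The distinction between the unbiased (divide by n-1) and sample (divide by n) versions can be checked with a few lines of plain Scala; Scalearn's own classes are not used here:

```scala
// Unbiased variance divides the sum of squared deviations by n - 1;
// the sample (biased) variance divides by n.
val xs = Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
val mean        = xs.sum / xs.size                          // 5.0
val sqDevs      = xs.map(x => (x - mean) * (x - mean)).sum  // 32.0
val sampleVar   = sqDevs / xs.size                          // 4.0
val unbiasedVar = sqDevs / (xs.size - 1)                    // ≈ 4.571
val sampleSd    = math.sqrt(sampleVar)                      // 2.0
```

As the note above says, the two converge as the sample grows, since n/(n-1) approaches 1.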
#Statistical Inference
Quite often, we have to decide whether samples were taken from the same population or not; in other words, we should figure out the range of accidental error. The "scalearn.statistics" package offers understandable tools for this, used in almost the same way as before; just create an infds class instance (which stands for a data set for statistical inference).
To begin with, let's make an instance of infds. The infds class takes data instances as parameters, so you need to create them first; alternatively, you can convert a Vector[Vector[Double]] into an infds instance implicitly
val data3:data = data(raw3)
val infer_dase1:infds = infds(data1,data2,data3)
val infer_dase2:infds = Vector(raw1,raw2,raw3).toinf
val grand_mean :Double = infer_dase1.grandmean
val grand_size :Int = infer_dase1.grandsize
val grand_sum :Double = infer_dase1.grandsum
val effects :Vector[Double] = infer_dase1.effects
val tpair_vector :Vector[(Double,Boolean)] = infer_dase1.tpair
val twelch_vector :Vector[(Double,Boolean)] = infer_dase1.twelch
val anova_result :(Boolean,Double) = infer_dase1.anova
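As a reference for what twelch computes, Welch's t statistic for two samples can be written out in plain Scala. This is the standard textbook formula, not Scalearn's code, and the function name is illustrative:

```scala
// Welch's t statistic: t = (m1 - m2) / sqrt(v1/n1 + v2/n2),
// where v is the unbiased variance of each sample.
def welchT(a: Vector[Double], b: Vector[Double]): Double = {
  def meanVar(xs: Vector[Double]): (Double, Double) = {
    val m = xs.sum / xs.size
    (m, xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1))
  }
  val (m1, v1) = meanVar(a)
  val (m2, v2) = meanVar(b)
  (m1 - m2) / math.sqrt(v1 / a.size + v2 / b.size)
}

welchT(Vector(1.0, 2.0, 3.0), Vector(4.0, 5.0, 6.0))   // ≈ -3.674
```

The library's twelch presumably also compares the statistic against a critical value, which is why its result carries a Boolean alongside the Double.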
#Time Series Analysis
- Create an instance of the "tsda" class (which stands for time series data) or the "tsds" class (which stands for time series data set). Any of the forms below works.
val ts1:tsda = tsda(1.0,2,3,4)
val ts2:tsda = new tsda(Vector(1.0,2,3,4,5))
val ts3:tsda = data1.ts
val tsds1:tsds = tsds(ts1,ts2,ts3)
val tsds2:tsds = new tsds(Vector(ts1,ts2,ts3))
val tsds3:tsds = dase3.ts
val detrended:tsda = ts1.detrending
val differenced:tsda = ts1.differencing
ts1.differencing.differencing.detrending
Since tsda class inherits data class, you can call properties and methods of data class freely.
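To make the two transformations concrete, here is a plain-Scala sketch of first differencing and linear (least-squares) detrending; Scalearn's tsda class is not used, and the function names are illustrative:

```scala
// First differencing: y_t = x_t - x_{t-1}; removes a constant linear trend.
def differencing(xs: Vector[Double]): Vector[Double] =
  xs.sliding(2).map(w => w(1) - w(0)).toVector

// Linear detrending: fit x_t ≈ slope * t + intercept by least squares,
// then keep the residuals.
def detrending(xs: Vector[Double]): Vector[Double] = {
  val t     = xs.indices.map(_.toDouble)
  val tMean = t.sum / t.size
  val xMean = xs.sum / xs.size
  val slope = t.zip(xs).map { case (ti, xi) => (ti - tMean) * (xi - xMean) }.sum /
              t.map(ti => (ti - tMean) * (ti - tMean)).sum
  val intercept = xMean - slope * tMean
  xs.zip(t).map { case (xi, ti) => xi - (slope * ti + intercept) }
}

differencing(Vector(1.0, 2.0, 4.0, 7.0))   // → Vector(1.0, 2.0, 3.0)
```

Detrending a perfectly linear series yields residuals of (approximately) zero, which is exactly why chaining these operations, as in the library example above, is a common way to make a series stationary.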
For more examples, I recommend taking a look at the tests in the test directory.