Machine learning examples using Spark MLlib and Databricks

aosama/MachineLearningSamples

MachineLearningSamples

This repo hosts a variety of examples based on Apache Spark MLlib.

Databricks Notebooks

Scala IDE Based Examples

A vanilla decision tree example.

How to get a stratified sample so that the train and test datasets are both sampled across all possible values.
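
The idea behind a stratified split can be sketched in a few lines of plain Python. This is only an illustration of the technique, not the code used in the example itself: it does not use Spark (whose DataFrame API has its own per-stratum sampling, `stat.sampleBy`), and the function name `stratified_split`, the toy rows, and the 70/30 ratio are all assumptions made for the sketch.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.7, seed=42):
    """Split rows into (train, test) so each label keeps about the same share in both."""
    rng = random.Random(seed)

    # Group the rows by label so every label value is sampled separately.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical toy data: 80 rows of one class and 20 of the other.
rows = [(i, "<=50K") for i in range(80)] + [(i, ">50K") for i in range(80, 100)]
train, test = stratified_split(rows, label_of=lambda r: r[1])
```

Because each label is split on its own, the rare class cannot end up missing from either side, which a single random split over all rows does not guarantee.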

How to index and encode categorical features.

How to handle multiple categorical and continuous features on a real-life data set. Uses the Census Income data set.
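
As a rough picture of what "index and encode" means when categorical and continuous columns are mixed, the sketch below does it by hand in plain Python. It is not the Spark MLlib pipeline from the examples: the helper names (`fit_index`, `one_hot`, `assemble`) and the sample rows are hypothetical, and unlike Spark's `OneHotEncoder`, which drops the last category by default, this version keeps one slot per category.

```python
from collections import Counter


def fit_index(values):
    """Map each category to an integer, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def assemble(rows, categorical, continuous):
    """Build one flat feature vector per row: one-hot blocks, then continuous values."""
    indexers = {c: fit_index([r[c] for r in rows]) for c in categorical}
    features = []
    for r in rows:
        vec = []
        for c in categorical:
            vec += one_hot(indexers[c][r[c]], len(indexers[c]))
        vec += [float(r[c]) for c in continuous]
        features.append(vec)
    return indexers, features


# Hypothetical rows with a few Census Income style columns.
rows = [
    {"workclass": "Private", "sex": "Male", "age": 39, "hours_per_week": 40},
    {"workclass": "State-gov", "sex": "Female", "age": 50, "hours_per_week": 13},
    {"workclass": "Private", "sex": "Female", "age": 38, "hours_per_week": 40},
]
indexers, features = assemble(rows, ["workclass", "sex"], ["age", "hours_per_week"])
```

Indexing by frequency means the most common category gets index 0, so the mapping depends on the data it was fitted on and has to be reused unchanged when encoding the test set.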

Data Sets References

The first line of the adult.test file was removed so that the file can be loaded into Spark.
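
For anyone starting from a fresh download, the same clean-up can be reproduced with a short script. This is a minimal sketch rather than the procedure actually used for this repo; the function name `drop_first_line` and the output file name in the usage comment are assumptions.

```python
def drop_first_line(src, dst):
    """Copy src to dst without its first line and return the number of lines written."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        next(fin, None)  # discard the first line; an empty file is tolerated
        written = 0
        for line in fin:
            fout.write(line)
            written += 1
    return written


# Hypothetical usage, writing a cleaned copy next to the original:
# drop_first_line("adult.test", "adult.test.clean")
```

Writing to a separate file keeps the original download intact, so the step can be repeated or checked later.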

Census Income data set citation: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
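
The coded variables above can be turned back into readable labels with a small lookup, sketched here in plain Python. The dictionaries restate only the codes listed above; the function names, the record layout (a dict keyed by `X2`, `X3`, ...), and the choice to flag any unlisted code as undocumented instead of guessing its meaning are assumptions of this sketch. The months between August and April are inferred from the September-to-April ordering.

```python
GENDER = {1: "male", 2: "female"}
EDUCATION = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARITAL_STATUS = {1: "married", 2: "single", 3: "others"}

# X6 is September 2005 and X11 is April 2005; the months in between are inferred.
MONTHS = ["September", "August", "July", "June", "May", "April"]


def label(table, code):
    """Look up a code, flagging values that the variable list does not describe."""
    return table.get(code, f"undocumented code {code}")


def repayment_status(code):
    """Describe one repayment status code from X6-X11."""
    if code == -1:
        return "pay duly"
    if 1 <= code <= 8:
        unit = "month" if code == 1 else "months"
        return f"payment delay for {code} {unit}"
    if code == 9:
        return "payment delay for nine months and above"
    return f"undocumented code {code}"


def decode(record):
    """Translate the coded fields X2-X4 and X6-X11 of one record into labels."""
    return {
        "gender": label(GENDER, record["X2"]),
        "education": label(EDUCATION, record["X3"]),
        "marital_status": label(MARITAL_STATUS, record["X4"]),
        "repayment": {
            month: repayment_status(record[f"X{6 + i}"])
            for i, month in enumerate(MONTHS)
        },
    }


# Hypothetical record containing only the coded fields.
record = {"X2": 2, "X3": 2, "X4": 1,
          "X6": 2, "X7": 2, "X8": -1, "X9": -1, "X10": 0, "X11": 9}
decoded = decode(record)
```

Flagging unlisted codes keeps them visible during data exploration, so they can be handled deliberately before the column is indexed or encoded.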
