Prep Buddy


Data Preparation Library for Spark

A Scala / Java / Python library for cleaning, transforming and performing other preparation tasks on large datasets on Apache Spark.

It is currently maintained by a team of developers from ThoughtWorks.

Post questions and comments to the Google group, or email them directly to [email protected]

Docs are available at http://data-commons.github.io/prep-buddy, or check out the Scaladocs.

Our aim is to provide a set of algorithms for cleaning and transforming very large datasets,
inspired by predecessors such as OpenRefine, Pandas and scikit-learn.


Usage!

To use this library, add a Maven dependency on prep-buddy to your project:

<dependency>
    <groupId>com.thoughtworks.datacommons</groupId>
    <artifactId>prep-buddy</artifactId>
    <version>0.5.1</version>
</dependency>

For other build tools, check Maven Repository.
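For sbt users, the same coordinates translate to a one-line dependency. This is a sketch based on the Maven coordinates above; check Maven Repository for the artifact matching your Scala version, since the cross-versioning scheme is an assumption here:

```scala
// build.sbt -- sbt equivalent of the Maven dependency above (a sketch;
// verify the artifact name for your Scala version on Maven Repository)
libraryDependencies += "com.thoughtworks.datacommons" % "prep-buddy" % "0.5.1"
```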

Python

If you don't have pip, install it first. Then:

pip install prep-buddy

To use pyspark on the command line, download the JAR:

pyspark --jars [PATH-TO-JAR]
spark-submit --driver-class-path [PATH-TO-JAR] [YOUR-PYTHON-FILE]
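To illustrate the kind of cleansing task a script submitted this way would perform, here is a minimal sketch of mean-value imputation for a numeric CSV column. It uses only the Python standard library, not prep-buddy's API or Spark RDDs (see the Scaladocs for the library's actual operations); the data and column names are made up for the example:

```python
# A sketch of mean-value imputation for a numeric CSV column --
# the kind of cleansing prep-buddy performs at scale on Spark.
# Stdlib only; the prep-buddy / RDD equivalent is in the Scaladocs.
import csv
import io

raw = """id,age
1,23
2,
3,31
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Compute the mean over the non-missing values.
present = [float(r["age"]) for r in rows if r["age"] != ""]
mean_age = sum(present) / len(present)

# Replace missing entries with the mean.
for r in rows:
    if r["age"] == "":
        r["age"] = str(mean_age)

print([r["age"] for r in rows])  # -> ['23', '27.0', '31']
```

In a real prep-buddy job the same logic would run distributed over an RDD of CSV lines rather than an in-memory list.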

This library is currently built for Spark 1.6.x, but is also compatible with 1.4.x.

Dependencies

The library depends on a few other libraries.

  • Apache Commons Math for general math and statistics functionality.
  • Apache Spark for all the distributed computation capabilities.
  • OpenCSV for parsing CSV files.

Download

Documentation Wiki

Contributing

  • Create a pull request.
