spark-fast-tests

A fast Apache Spark testing framework with beautifully formatted error messages!

For example, the assertSmallDatasetEquality method can be used to compare two Datasets (or two DataFrames).

val sourceDF = Seq(
  (1),
  (5)
).toDF("number")

val expectedDF = Seq(
  (1, "word"),
  (5, "word")
).toDF("number", "word")

assertSmallDatasetEquality(sourceDF, expectedDF)
// throws a DatasetSchemaMismatch exception

The assertSmallDatasetEquality method can also be used to compare Datasets.

val sourceDS = Seq(
  Person("bob", 1),
  Person("alice", 5)
).toDS

val expectedDS = Seq(
  Person("frank", 10),
  Person("lucy", 5)
).toDS

assertLargeDatasetEquality(sourceDS, expectedDS)
// throws an exception because the Datasets have different data

The DatasetComparer has assertSmallDatasetEquality and assertLargeDatasetEquality methods to compare either Datasets or DataFrames.

If you only need to compare DataFrames, you can use DataFrameComparer with the associated assertSmallDataFrameEquality and assertLargeDataFrameEquality methods. Under the hood, DataFrameComparer uses the assertSmallDatasetEquality and assertLargeDatasetEquality.

Setup

Add the sbt-spark-package plugin so you can install Spark Packages.

Then add these lines to your build.sbt file to install Spark SQL and spark-fast-tests:

spDependencies += "MrPowers/spark-fast-tests:0.4.0"
sparkComponents ++= Seq("sql")

Usage

The spark-fast-tests project doesn't provide a SparkSession object in your test suite, so you'll need to make one yourself.

import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  lazy val spark: SparkSession = {
    SparkSession.builder().master("local").appName("spark session").getOrCreate()
  }

}

The DatasetComparer trait defines the assertSmallDatasetEquality method. Extend your spec file with the SparkSessionTestWrapper trait to create DataFrames and the DatasetComparer trait to make DataFrame comparisons.

class DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {

  import spark.implicits._

    it("aliases a DataFrame") {

      val sourceDF = Seq(
        ("jose"),
        ("li"),
        ("luisa")
      ).toDF("name")

      val actualDF = sourceDF.select(col("name").alias("student"))

      val expectedDF = Seq(
        ("jose"),
        ("li"),
        ("luisa")
      ).toDF("student")

      assertSmallDatasetEquality(actualDF, expectedDF)

    }

  }

}

To compare large DataFrames that are partitioned across different nodes in a cluster, use the assertLargeDatasetEquality method.

assertLargeDatasetEquality(actualDF, expectedDF)

assertSmallDatasetEquality is faster for test suites that run on your local machine. assertLargeDatasetEquality should only be used for DataFrames that are split across nodes in a cluster.

Alternatives

The spark-testing-base project has more features (e.g. streaming support) and is compiled to support a variety of Scala and Spark versions.

You might want to use spark-fast-tests instead of spark-testing-base in these cases:

You want to run tests in parallel (you need to set parallelExecution in Test := false with spark-testing-base)
You don't want to include hive as a project dependency
You don't want to restart the SparkSession after each test file executes so the suite runs faster

Additional Goals

Use memory efficiently so Spark test runs don't crash
Provide readable error messages
Easy to use in conjunction with other test suites
Give the user control of the SparkSession

Spark Versions

spark-fast-tests supports Spark 2.x. There are no plans to retrofit the project to work with Spark 1.x.

Contributing

Open an issue or send a pull request to contribute. Anyone that makes good contributions to the project will be promoted to project maintainer status.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
project		project
src		src
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

spark-fast-tests

Setup

Usage

Alternatives

Additional Goals

Spark Versions

Contributing

About

Releases

Packages

Languages

License

xdreamcode/spark-fast-tests

Folders and files

Latest commit

History

Repository files navigation

spark-fast-tests

Setup

Usage

Alternatives

Additional Goals

Spark Versions

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages