
spark-fast-tests

A fast Apache Spark testing framework with beautifully formatted error messages!

For example, the assertSmallDatasetEquality method can be used to compare two Datasets (or two DataFrames).

val sourceDF = Seq(
  (1),
  (5)
).toDF("number")

val expectedDF = Seq(
  (1, "word"),
  (5, "word")
).toDF("number", "word")

assertSmallDatasetEquality(sourceDF, expectedDF)
// throws a DatasetSchemaMismatch exception

The same methods also work on typed Datasets. This example uses assertLargeDatasetEquality:

// Assumed definition for this example: case class Person(name: String, age: Int)
val sourceDS = Seq(
  Person("bob", 1),
  Person("alice", 5)
).toDS

val expectedDS = Seq(
  Person("frank", 10),
  Person("lucy", 5)
).toDS

assertLargeDatasetEquality(sourceDS, expectedDS)
// throws an exception because the Datasets have different data

The DatasetComparer has assertSmallDatasetEquality and assertLargeDatasetEquality methods to compare either Datasets or DataFrames.
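
In a spec, you will usually want to assert that a mismatch is raised rather than let it fail the test. The sketch below does that with ScalaTest's intercept. It assumes the import path com.github.mrpowers.spark.fast.tests, and it catches the general Exception type because the only behavior relied on here is that a mismatch throws. The spec name and session settings are illustrative.

```scala
import com.github.mrpowers.spark.fast.tests.DatasetComparer // assumed package name
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSpec

class MismatchSpec extends FunSpec with DatasetComparer {

  // Local session for the example; any existing test session works too.
  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("mismatch spec").getOrCreate()

  import spark.implicits._

  it("fails when the schemas differ") {
    val sourceDF = Seq(1, 5).toDF("number")
    val expectedDF = Seq((1, "word"), (5, "word")).toDF("number", "word")

    // The extra "word" column makes the comparison throw.
    intercept[Exception] {
      assertSmallDatasetEquality(sourceDF, expectedDF)
    }
  }

  it("fails when the rows differ") {
    val sourceDF = Seq(1, 5).toDF("number")
    val expectedDF = Seq(1, 6).toDF("number")

    // Same schema, different values in the second row.
    intercept[Exception] {
      assertSmallDatasetEquality(sourceDF, expectedDF)
    }
  }
}
```

If you want the test to pin down the failure type, replace Exception with the specific exception class once you have confirmed its name and package in the version you depend on.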

If you only need to compare DataFrames, you can use DataFrameComparer with the associated assertSmallDataFrameEquality and assertLargeDataFrameEquality methods. Under the hood, DataFrameComparer uses the assertSmallDatasetEquality and assertLargeDatasetEquality methods.
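
Here is a minimal sketch of a spec that mixes in DataFrameComparer instead of DatasetComparer. The import path is an assumption, and the spec name, column values, and the upper-casing transformation are made up for illustration.

```scala
import com.github.mrpowers.spark.fast.tests.DataFrameComparer // assumed package name
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}
import org.scalatest.FunSpec

class DataFrameSpec extends FunSpec with DataFrameComparer {

  // Local session for the example.
  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("dataframe spec").getOrCreate()

  import spark.implicits._

  it("upper-cases the name column") {
    val sourceDF = Seq("jose", "li").toDF("name")

    // Transformation under test.
    val actualDF = sourceDF.select(upper(col("name")).alias("name"))

    val expectedDF = Seq("JOSE", "LI").toDF("name")

    // DataFrame-only counterpart of assertSmallDatasetEquality.
    assertSmallDataFrameEquality(actualDF, expectedDF)
  }
}
```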

Setup

Add the sbt-spark-package plugin so you can install Spark Packages.

Then add these lines to your build.sbt file to install Spark SQL and spark-fast-tests:

spDependencies += "MrPowers/spark-fast-tests:0.4.0"
sparkComponents ++= Seq("sql")

Usage

The spark-fast-tests project doesn't provide a SparkSession object in your test suite, so you'll need to make one yourself.

import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  lazy val spark: SparkSession = {
    SparkSession.builder().master("local").appName("spark session").getOrCreate()
  }

}
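
Because the wrapper calls getOrCreate, every suite in the same JVM that mixes in the trait should end up with the same session instead of starting a new one. The sketch below checks that. FirstSuite, SecondSuite, and SharedSessionCheck are hypothetical names used only for this illustration.

```scala
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {
  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("spark session").getOrCreate()
}

// Two stand-ins for separate test suites (hypothetical names).
object FirstSuite extends SparkSessionTestWrapper
object SecondSuite extends SparkSessionTestWrapper

object SharedSessionCheck {
  def main(args: Array[String]): Unit = {
    // getOrCreate hands back an existing session when there is one,
    // so both references should point at the same object.
    println(FirstSuite.spark eq SecondSuite.spark)
    FirstSuite.spark.stop()
  }
}
```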

The DatasetComparer trait defines the assertSmallDatasetEquality method. Mix the SparkSessionTestWrapper trait into your spec class to create DataFrames, and the DatasetComparer trait to compare them.

import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.apache.spark.sql.functions.col
import org.scalatest.FunSpec

class DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {

  // Brings toDF and toDS into scope
  import spark.implicits._

  it("aliases a DataFrame") {

    val sourceDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("name")

    val actualDF = sourceDF.select(col("name").alias("student"))

    val expectedDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("student")

    assertSmallDatasetEquality(actualDF, expectedDF)

  }

}

To compare large DataFrames that are partitioned across different nodes in a cluster, use the assertLargeDatasetEquality method.

assertLargeDatasetEquality(actualDF, expectedDF)

assertSmallDatasetEquality is faster for test suites that run on your local machine. assertLargeDatasetEquality should only be used for DataFrames that are split across nodes in a cluster.
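
One way to picture the trade-off is the rough sketch below. It is not the library's source. It assumes the small variant collects both Datasets to the driver, and that the large variant compares rows on the executors by position. EqualitySketch and its methods are hypothetical names.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import scala.reflect.ClassTag

// Hypothetical helper, not part of spark-fast-tests.
object EqualitySketch {

  // Small-data strategy: pull every row to the driver and compare in memory.
  // Cheap for a handful of rows, but the data has to fit on the driver.
  def smallEquals[T](actual: Dataset[T], expected: Dataset[T]): Boolean =
    actual.collect().sameElements(expected.collect())

  // Large-data strategy: key each row by its position and join the two sides,
  // so the comparison runs on the executors. This costs a shuffle.
  def largeEquals[T: ClassTag](actual: Dataset[T], expected: Dataset[T]): Boolean = {
    def indexed(ds: Dataset[T]): RDD[(Long, T)] =
      ds.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }

    val left = indexed(actual)
    val right = indexed(expected)

    // A full outer join keeps positions that exist on only one side,
    // so a difference in row count also counts as a mismatch.
    left.fullOuterJoin(right).filter {
      case (_, (Some(a), Some(b))) => a != b
      case _                       => true
    }.isEmpty()
  }
}
```

Under these assumptions, the collect-based path avoids a shuffle entirely, which is consistent with the advice to prefer the small variant for suites that run locally.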

Alternatives

The spark-testing-base project has more features (e.g. streaming support) and is compiled to support a variety of Scala and Spark versions.

You might want to use spark-fast-tests instead of spark-testing-base in these cases:

  • You want to run tests in parallel (you need to set parallelExecution in Test := false with spark-testing-base)
  • You don't want to include hive as a project dependency
  • You don't want to restart the SparkSession after each test file executes so the suite runs faster
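
For reference, the setting from the first bullet would sit in build.sbt roughly as follows. This is a fragment in the sbt 0.13-style syntax quoted in that bullet, not a complete build file.

```scala
// build.sbt (fragment)

// Needed when testing with spark-testing-base: run test suites one at a time.
parallelExecution in Test := false

// With spark-fast-tests the line above can be left out, because sbt runs
// test suites in parallel by default.
```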

Additional Goals

  • Use memory efficiently so Spark test runs don't crash
  • Provide readable error messages
  • Be easy to use alongside other test suites
  • Give the user control of the SparkSession

Spark Versions

spark-fast-tests supports Spark 2.x. There are no plans to retrofit the project to work with Spark 1.x.

Contributing

Open an issue or send a pull request to contribute. Anyone who makes good contributions to the project will be promoted to project maintainer status.
