Using Hash table based indexes for optimising joins in Apache Spark

anish749/spark-indexed-dedup

Using HashSet based indexes in Apache Spark

An Apache Spark based solution to partition and de-duplicate data while incrementally loading it into a table.

The code and the problem statement are described in a lot more detail in this blog post.

The problem statement

A table exists in Hive (or any other destination system), and data is loaded into it every hour (or every day). The source systems can generate duplicate data, but the final table should be de-duplicated while loading. We assume that the data is immutable but can be delivered more than once, so we need logic to filter out these duplicates before appending the data to our master store.
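
As a rough sketch of this kind of incremental de-duplication (not code from this repository), the snippet below left anti joins the incoming batch against the keys already present in the destination table before appending. The `event_id` key column and the table name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object IncrementalDedup {

  // Append only the incoming rows whose key is not already present in the
  // destination table. Duplicates within the batch itself are dropped first.
  def appendWithoutDuplicates(
      spark: SparkSession,
      incoming: DataFrame,
      destinationTable: String,            // hypothetical table name
      keyColumn: String = "event_id"): Unit = {

    val existingKeys = spark.table(destinationTable).select(keyColumn)

    val deduped = incoming
      .dropDuplicates(keyColumn)                       // duplicates within the batch
      .join(existingKeys, Seq(keyColumn), "left_anti") // rows already loaded earlier

    deduped.write.mode("append").insertInto(destinationTable)
  }
}
```

The left anti join typically shuffles both sides on the key, which is the cost that a hash-set based index (sketched after the sample use case below) tries to avoid.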

Assumption: The data is partitioned and records are not repeated across partitions. This makes the problem simpler to optimise, since it helps in reducing shuffles. For other use cases, the whole dataset can just as well be treated as a single partition.

Sample use case: I used this on click-stream data with an at-least-once delivery guarantee. Since the data is immutable, the source timestamp doesn't change, and I partition the data based on this source timestamp.
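
For illustration only (this is not the repository's actual API), the sketch below shows the general hash-set index idea under the assumptions above: collect the keys already loaded into the partitions touched by the incoming batch into an in-memory set, broadcast it to the executors, and filter the new data against it instead of shuffle-joining against the full table. The `event_id` and `event_date` column names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object HashSetIndexDedup {

  // Filter an incoming batch against a broadcast set of already-loaded keys,
  // restricted to the partitions the batch actually touches.
  def filterAgainstIndex(
      spark: SparkSession,
      incoming: DataFrame,
      destinationTable: String): DataFrame = {

    // Records never repeat across partitions, so only the partitions
    // present in this batch need to be indexed.
    val touchedPartitions = incoming
      .select("event_date").distinct().collect().map(_.getString(0))

    // Build the in-memory index: a hash-based set of keys already loaded
    // into those partitions.
    val existingKeys: Set[String] = spark.table(destinationTable)
      .where(col("event_date").isin(touchedPartitions: _*))
      .select("event_id")
      .collect()
      .map(_.getString(0))
      .toSet

    val index = spark.sparkContext.broadcast(existingKeys)

    incoming
      .dropDuplicates("event_id")
      .filter(row => !index.value.contains(row.getAs[String]("event_id")))
  }
}
```

A sketch like this only works while the keys of the touched partitions fit comfortably in memory on the driver and executors; beyond that, a shuffle join (or a more compact structure such as a bloom filter) would be the fallback.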

A more in-depth explanation is here.

Examples

An example of using the externally exposed APIs is here.
