Adding HDFS support for data generation #188
Conversation
@@ -115,16 +116,96 @@ def generate_data_local(args, range_start, range_end, tool_path):
    # show summary
    subprocess.run(['du', '-h', '-d1', data_dir])


def clean_temp_data(temp_data_path):
    cmd = ['hadoop', 'fs', '-rm', '-r', '-skipTrash', temp_data_path]
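For context, a minimal sketch of how such a helper presumably finishes the call (the `subprocess.run` invocation and `check=True` are assumptions based on the surrounding code, not a verbatim copy of the PR):

```python
import subprocess

def clean_temp_data(temp_data_path):
    # Remove the intermediate HDFS directory, bypassing the trash.
    # Relies on the 'hadoop' CLI being on PATH (see the discussion below).
    cmd = ['hadoop', 'fs', '-rm', '-r', '-skipTrash', temp_data_path]
    subprocess.run(cmd, check=True)
```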
Note that beyond the subpar user-perceived delays from shelling out to launch heavy JVMs, we have hit limitations in the past where the hadoop CLI is not available. If we document that this script can be launched via spark-submit, then we can use Py4J instead: NVIDIA/spark-rapids#10599
On the other hand, why do we need to wrap a Java program in a Python CLI to begin with?
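To illustrate the suggestion (this is not code from the PR): when the script runs under spark-submit, the Hadoop FileSystem API is reachable through the session's Py4J gateway, so no separate hadoop process has to be spawned. A rough sketch, assuming a SparkSession is available:

```python
from pyspark.sql import SparkSession

def clean_temp_data_via_py4j(temp_data_path):
    # Hypothetical alternative to shelling out to 'hadoop fs -rm -r -skipTrash':
    # only works when a JVM with the Hadoop classes is attached, e.g. under spark-submit.
    spark = SparkSession.builder.getOrCreate()
    jvm = spark._jvm
    hadoop_conf = spark._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path(temp_data_path)
    fs = path.getFileSystem(hadoop_conf)
    fs.delete(path, True)  # recursive delete, no external hadoop CLI needed
```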
The Java program is a mapper that is triggered only when generating data for HDFS; for local data generation, the Python wrapper does not launch a MapReduce job.
For the missing hadoop CLI, there is an upfront check in the Python program that prints an "install hadoop cli" message, purely for clarity.
Here the Hadoop job only creates a limited set of directories (8 in total, one per NDS-H table) and moves the NDS-H generated data into the required folders.
Currently this is not triggered via spark-submit.
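For illustration only, the directory shuffling described above amounts to something like the following (the table list, function name, and exact hadoop commands are assumptions, not taken from the PR):

```python
import subprocess

# The 8 NDS-H (TPC-H derived) tables, one target directory each.
NDS_H_TABLES = ['customer', 'lineitem', 'nation', 'orders',
                'part', 'partsupp', 'region', 'supplier']

def move_generated_data(temp_data_path, data_dir):
    for table in NDS_H_TABLES:
        # Create the per-table directory, then move the generated files into it.
        subprocess.run(['hadoop', 'fs', '-mkdir', '-p', f'{data_dir}/{table}'],
                       check=True)
        subprocess.run(['hadoop', 'fs', '-mv', f'{temp_data_path}/{table}*',
                        f'{data_dir}/{table}/'], check=True)
```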
This is more of a tech debt. When this project was initialized, the direction was to use Python and avoid the approach DB is using. I argued at the time that we could also use Scala, but that did not win out.
As far as I can recall, to avoid the chained "python-hdfs" calls, the best option was to leverage the
_FILTER = [Y|N] -- output data to stdout
argument to pipe the generator's text output to stdout and then pipe it directly into a Spark DataFrame (this is also what DB does). That way, no Hadoop job is needed to produce the distributed dataset.
Unfortunately, the latest TPC-DS v3.2.0 disabled this argument, and the direction was to use the latest version and try our best not to modify it.
Thus it becomes what it is now.
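A hypothetical sketch of that stdout-piping approach (the generator path, flags, schema handling, and parallelism of 8 are assumptions for illustration, and as noted above the `_FILTER` option no longer exists in the latest kit):

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stdout-pipe-sketch").getOrCreate()

def generate_chunk(index, _rows):
    # Each task runs one generator child with the (now removed) _FILTER=Y option
    # so rows go to stdout, and parses them directly -- no temp files, no Hadoop job.
    cmd = ['./dsdgen', '-SCALE', '100', '-PARALLEL', '8',
           '-CHILD', str(index + 1), '-_FILTER', 'Y']
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        yield line.rstrip('\n').rstrip('|').split('|')

rdd = spark.sparkContext.parallelize(range(8), 8).mapPartitionsWithIndex(generate_chunk)
df = spark.createDataFrame(rdd)  # real code would attach the table schema here
```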
As per Allen's comment, I can pick this up as a separate issue later to figure out whether there is an alternative solution that avoids chaining hadoop commands from Python.
This PR contains the following changes -