Adding HDFS support for data generation #188
Conversation
@@ -115,16 +116,96 @@ def generate_data_local(args, range_start, range_end, tool_path):
    # show summary
    subprocess.run(['du', '-h', '-d1', data_dir])


def clean_temp_data(temp_data_path):
    cmd = ['hadoop', 'fs', '-rm', '-r', '-skipTrash', temp_data_path]
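For context, a minimal sketch of how such a helper presumably finishes the call (the `subprocess.run` invocation and `check=True` are assumptions based on the surrounding code, not a verbatim copy of the PR):

```python
import subprocess

def clean_temp_data(temp_data_path):
    # Remove the intermediate HDFS directory, bypassing the trash.
    # Relies on the 'hadoop' CLI being on PATH (see the discussion below).
    cmd = ['hadoop', 'fs', '-rm', '-r', '-skipTrash', temp_data_path]
    subprocess.run(cmd, check=True)
```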
Note that beyond the subpar user-perceived delays from shelling out to launch heavy JVMs, we have hit limitations in the past where the hadoop CLI is not available. If we document that this script can be launched via spark-submit, then we can use Py4J instead: NVIDIA/spark-rapids#10599
On the other hand, why do we need to wrap a Java program in a Python CLI to begin with?
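To illustrate the suggestion (this is not code from the PR): when the script runs under spark-submit, the Hadoop FileSystem API is reachable through the session's Py4J gateway, so no separate hadoop process has to be spawned. A rough sketch, assuming a SparkSession is available:

```python
from pyspark.sql import SparkSession

def clean_temp_data_via_py4j(temp_data_path):
    # Hypothetical alternative to shelling out to 'hadoop fs -rm -r -skipTrash':
    # only works when a JVM with the Hadoop classes is attached, e.g. under spark-submit.
    spark = SparkSession.builder.getOrCreate()
    jvm = spark._jvm
    hadoop_conf = spark._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path(temp_data_path)
    fs = path.getFileSystem(hadoop_conf)
    fs.delete(path, True)  # recursive delete, no external hadoop CLI needed
```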
The Java program is a mapper that is triggered only when generating data for HDFS; for local data generation, the Python wrapper does not launch a MapReduce job.
For the missing hadoop CLI, there is an upfront check in the Python program that prints an "install hadoop cli" message, purely for clarity.
Here the Hadoop job only creates a limited set of directories (8 in total, one per NDS-H table) and moves the NDS-H generated data into the required folders.
Currently this is not triggered via spark-submit.
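For illustration only, the directory shuffling described above amounts to something like the following (the table list, function name, and exact hadoop commands are assumptions, not taken from the PR):

```python
import subprocess

# The 8 NDS-H (TPC-H derived) tables, one target directory each.
NDS_H_TABLES = ['customer', 'lineitem', 'nation', 'orders',
                'part', 'partsupp', 'region', 'supplier']

def move_generated_data(temp_data_path, data_dir):
    for table in NDS_H_TABLES:
        # Create the per-table directory, then move the generated files into it.
        subprocess.run(['hadoop', 'fs', '-mkdir', '-p', f'{data_dir}/{table}'],
                       check=True)
        subprocess.run(['hadoop', 'fs', '-mv', f'{temp_data_path}/{table}*',
                        f'{data_dir}/{table}/'], check=True)
```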
This is more of a tech debt. When this project was initialized, the direction was to use Python and avoid the approach DB is using. I argued at the time that we could also use Scala, but that did not win out.
As far as I can recall, to avoid the chained "python-hdfs" calls, the best option was to leverage the
_FILTER = [Y|N] -- output data to stdout
argument to pipe the generator's text output to stdout and then pipe it directly into a Spark DataFrame (this is also what DB does). That way, no Hadoop job is needed to produce the distributed dataset.
Unfortunately, the latest TPC-DS v3.2.0 disabled this argument, and the direction was to use the latest version and try our best not to modify it.
Thus it becomes what it is now.
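A hypothetical sketch of that stdout-piping approach (the generator path, flags, schema handling, and parallelism of 8 are assumptions for illustration, and as noted above the `_FILTER` option no longer exists in the latest kit):

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stdout-pipe-sketch").getOrCreate()

def generate_chunk(index, _rows):
    # Each task runs one generator child with the (now removed) _FILTER=Y option
    # so rows go to stdout, and parses them directly -- no temp files, no Hadoop job.
    cmd = ['./dsdgen', '-SCALE', '100', '-PARALLEL', '8',
           '-CHILD', str(index + 1), '-_FILTER', 'Y']
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        yield line.rstrip('\n').rstrip('|').split('|')

rdd = spark.sparkContext.parallelize(range(8), 8).mapPartitionsWithIndex(generate_chunk)
df = spark.createDataFrame(rdd)  # real code would attach the table schema here
```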
As per Allen's comment, I can pick this up as a separate issue later to figure out whether there is an alternative solution that avoids chaining hadoop commands from Python.
This PR contains the following changes -