Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ray version of the html2parquet transform #666

Merged
merged 22 commits into from
Oct 7, 2024
Merged

Added ray version of the html2parquet transform #666

merged 22 commits into from
Oct 7, 2024

Conversation

sungeunan-ibm
Copy link
Collaborator

Why are these changes needed?

Added ray version of the html2parquet transform

Related issue number (if any).

Issue #635

@shahrokhDaijavad
Copy link
Member

Great. Thank you very much @sungeunan-ibm for finishing this. It looks like the conflict in yml is the result of 1) The move of this transform from universal to the language folder and 2) The fact that a new hap transform was added yesterday. But, the main reviewer for the correct implementation of ray is @daw3rd.

@sungeunan-ibm
Copy link
Collaborator Author

Thank you @shahrokhDaijavad! @daw3rd Could you advise on how I should resolve these conflicts? Should I simply regenerate the file by running the following commands?

cd .github/workflows
rm test-language-html2parquet.yml 
make

daw3rd
daw3rd previously requested changes Oct 4, 2024
@@ -1,7 +1,11 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
<<<<<<<< HEAD:.github/workflows/test-language-html2parquet.yml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you check this in? I thought git would not allow it. Anyway. I recommend the following.

  1. merge the latest dev branch into your branch
  2. run make in this directory and you should get a proper version of this file.

@@ -1,7 +1,11 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
<<<<<<<< HEAD:.github/workflows/test-language-html2parquet.yml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here


minio-start: .minio-start

kind-load-image:: .transforms.kind-load-image
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the following rules to match those in other transforms from the latest dev branch. For example,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Makefile you are referring to is located in the ray folder, while the example provided is directly under the ededup folder. I checked the Makefile for the ededup transform within the ray folder, and it matches the same structure.


The transform will produce the following statsd metrics:

| metric name | Description |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not part of the output parquet file. I think you need to tell the reader where these can be viewed. Also, please link from here to the python readme for annotations and configuration.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not entirely sure where these metrics can be viewed. In the Ray implementation, I borrowed the pdf2parquet code, but I’m not very familiar with Prometheus metrics. These metrics are generated from the Html2ParquetRayTransform class, and they can be viewed through logging. After some investigation, it appears that you can access these metrics in the Prometheus dashboard under the relevant namespace for your application. However, my experience with this is limited. I would greatly appreciate your assistance in adding more detailed descriptions!

Sungeun An added 4 commits October 4, 2024 11:52
Signed-off-by: Sungeun An <[email protected]>
Signed-off-by: Sungeun An <[email protected]>
Signed-off-by: Sungeun An <[email protected]>
transforms/language/html2parquet/ray/pyproject.toml Outdated Show resolved Hide resolved
transforms/language/html2parquet/ray/Makefile Outdated Show resolved Hide resolved
transforms/language/html2parquet/ray/Makefile Outdated Show resolved Hide resolved
transforms/language/html2parquet/ray/Makefile Outdated Show resolved Hide resolved
transforms/language/html2parquet/ray/Makefile Outdated Show resolved Hide resolved
transforms/language/html2parquet/ray/Makefile Outdated Show resolved Hide resolved
transforms/language/html2parquet/ray/Makefile Outdated Show resolved Hide resolved
transforms/language/html2parquet/ray/Makefile Outdated Show resolved Hide resolved
package name for ray cannot be the same as package name for python
Setting up the vent for ray testing, is different than python testing
here we need to test ray modules (not python)
Need to publish ray image (not python)
Need to build ray image (not python)
Need to test ray image (not python)
Not sure if we are using this but it should still be ray sample as the default goes to python sample
@@ -102,6 +102,7 @@ jobs:
- name: Test KFP libs (shared and v2) and run a workflow
timeout-minutes: 120
run: |

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this file being changed by this PR?

@@ -53,7 +53,7 @@ docker-save-image::
@# Help: Recursively make $@ in all subdirs
$(MAKE) RULE=$@ .recurse

.PHONY: workflow-venv
.PHONY: workflow-vent
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no kfp support yet. Why is this in here ?

Need a base image to build the docker image for ray
@touma-I touma-I requested review from touma-I and daw3rd October 4, 2024 23:27
@touma-I touma-I dismissed daw3rd’s stale review October 7, 2024 15:53

We will update the transforms/language/html2parquet/ray/README.md in second PR.

@touma-I touma-I merged commit 6f06e9a into dev Oct 7, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants