Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add merge transform #1064

Draft
wants to merge 17 commits into
base: dev
Choose a base branch
from
Draft

Add merge transform #1064

wants to merge 17 commits into from

Conversation

roytman
Copy link
Member

@roytman roytman commented Feb 25, 2025

Why are these changes needed?

This transform merges two or more tables, assuming that while the tables have different sets of columns, their rows
contain the same data. It facilitates embarrassingly parallel data processing by merging the results.

Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
@roytman roytman changed the title Merge Add merge transform Feb 25, 2025
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
Signed-off-by: Alexey Roytman <[email protected]>
@roytman roytman requested a review from revit13 February 26, 2025 09:30
contain the same data. It facilitates **embarrassingly parallel** data processing by merging the results.

The transform receives a list of directories (merge_input_dirs) from which the tables to be merged are located.
One of these tables serves as the main table and is provided as a regular table for other transforms. The transform
Copy link
Collaborator

@revit13 revit13 Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the example it seems that the main table is taken from the input_folder and not the merge_input_dirs

Signed-off-by: Alexey Roytman <[email protected]>
@revit13
Copy link
Collaborator

revit13 commented Feb 27, 2025

LGTM. btw you also need to update transforms/pyproject.toml

@roytman roytman requested a review from touma-I February 28, 2025 16:55
@shahrokhDaijavad
Copy link
Member

@roytman and @revit13, I discussed this with @touma-I today. This is definitely a useful transform that we will merge. Can you please add at least one Python notebook and if you can, a second Ray notebook to this PR? The notebooks are very simple and you can see any of the similar ones we have for other language/universal transforms.

@roytman roytman marked this pull request as draft March 2, 2025 06:34
@roytman
Copy link
Member Author

roytman commented Mar 9, 2025

@roytman and @revit13, I discussed this with @touma-I today. This is definitely a useful transform that we will merge. Can you please add at least one Python notebook and if you can, a second Ray notebook to this PR? The notebooks are very simple and you can see any of the similar ones we have for other language/universal transforms.

@shahrokhDaijavad , @touma-I , I checked the existing notebooks, and I have some questions.
The Python notebooks have a step ***** Setup runtime parameters for this transform
and the next step, is called ***** Use python runtime to invoke the transform

Actually, the execution is done in the first step, and the second one is empty.
Is it the desired behavior?

@shahrokhDaijavad
Copy link
Member

@roytman Which notebook are you looking at? A typical one to look at is this one: https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/gneissweb_classification/gneissweb_classification.ipynb

@shahrokhDaijavad
Copy link
Member

You are right, @roytman. Definitely, rep_removal is not a good example, by not having a cell that shows the table of input parameters. It was written under the time pressure of delivering the Gneissweb transforms. I will create an issue to improve it. Doc-id is not too bad. In any case, I still suggest making it like the notebook for gneissweb_classification.

@roytman
Copy link
Member Author

roytman commented Mar 9, 2025

thank you @shahrokhDaijavad

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants