Add merge transform #1064
Signed-off-by: Alexey Roytman <[email protected]>
contain the same data. It facilitates **embarrassingly parallel** data processing by merging the results.
The transform receives a list of directories (merge_input_dirs) in which the tables to be merged are located.
One of these tables serves as the main table and is provided as a regular table for other transforms. The transform
From the example, it seems that the main table is taken from the input_folder and not from the merge_input_dirs.
LGTM. btw you also need to update transforms/pyproject.toml
@roytman and @revit13, I discussed this with @touma-I today. This is definitely a useful transform that we will merge. Can you please add at least one Python notebook and, if you can, a second Ray notebook to this PR? The notebooks are very simple, and you can model them on any of the similar ones we have for other language/universal transforms.
@shahrokhDaijavad, @touma-I, I checked the existing notebooks, and I have some questions. Actually, the execution is done in the first step, and the second one is empty.
@roytman Which notebook are you looking at? A typical one to look at is this one: https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/gneissweb_classification/gneissweb_classification.ipynb
You are right, @roytman. rep_removal is definitely not a good example, since it lacks a cell that shows the table of input parameters. It was written under the time pressure of delivering the Gneissweb transforms. I will create an issue to improve it. Doc-id is not too bad. In any case, I still suggest making it like the notebook for gneissweb_classification.
thank you @shahrokhDaijavad |
Why are these changes needed?
This transform merges two or more tables, assuming that while the tables have different sets of columns, their rows
contain the same data. It facilitates embarrassingly parallel data processing by merging the results.
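
To illustrate the idea described above, here is a minimal sketch in plain Python. It is not the code in this PR: the `merge_tables` helper, the dict-of-columns table representation, and the sample column names are all hypothetical, and it assumes that rows in every table are aligned by position.

```python
from typing import Dict, List

# Hypothetical in-memory table: column name -> list of column values.
Table = Dict[str, List[object]]


def merge_tables(main: Table, others: List[Table]) -> Table:
    """Append the columns of `others` to `main`.

    Assumption: all tables describe the same rows in the same order,
    so columns can be combined positionally.
    """
    n_rows = len(next(iter(main.values()))) if main else 0
    merged: Table = {name: list(col) for name, col in main.items()}
    for table in others:
        for name, col in table.items():
            if len(col) != n_rows:
                raise ValueError(
                    f"column {name!r} has {len(col)} rows, expected {n_rows}"
                )
            if name in merged:
                # A shared column must carry identical data; keep one copy.
                if merged[name] != list(col):
                    raise ValueError(f"column {name!r} differs between tables")
                continue
            merged[name] = list(col)
    return merged


# Two independent (parallel) steps each added one column to the same rows.
main = {"doc_id": [1, 2, 3], "text": ["a", "b", "c"]}
lang = {"doc_id": [1, 2, 3], "lang": ["en", "de", "en"]}
score = {"doc_id": [1, 2, 3], "score": [0.9, 0.4, 0.7]}

merged = merge_tables(main, [lang, score])
print(list(merged))
```

Each parallel step only needs to emit its own new columns; the merge step then recombines them with the main table.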