Build a new transform to automate crawling and then convert to parquet #751

Open
1 of 2 tasks
shahrokhDaijavad opened this issue Oct 29, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request simplify-DPK

Comments

@shahrokhDaijavad
Member

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

This transform would use the DPK-connector library in the repo (https://github.com/IBM/data-prep-kit/tree/dev/data-connector-lib), which is now available as a stand-alone pip install. The goal is to wrap it as a "data ingestion" transform whose parquet output can easily be fed into other DPK transforms.
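
As a rough illustration of the intended output shape (not the connector's actual API), the sketch below packs already-downloaded pages into columns and writes them as one parquet file. The column names, the `DownloadedPage` type, and both helper functions are hypothetical. The only third-party calls are `pyarrow.table` and `pyarrow.parquet.write_table`, imported lazily so the packing step runs without pyarrow.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DownloadedPage:
    """Hypothetical record for one crawled page (not a DPK type)."""
    url: str
    content_type: str
    body: bytes


def pages_to_columns(pages: Iterable[DownloadedPage]) -> dict[str, list]:
    """Pack pages into a column-oriented dict, one list per column."""
    columns: dict[str, list] = {"url": [], "content_type": [], "contents": [], "size": []}
    for page in pages:
        columns["url"].append(page.url)
        columns["content_type"].append(page.content_type)
        columns["contents"].append(page.body)
        columns["size"].append(len(page.body))
    return columns


def write_parquet(columns: dict[str, list], path: str) -> None:
    """Write the columns to a single parquet file (requires pyarrow)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table(columns), path)


# Stand-in data; in the real transform these would come from the crawler.
pages = [
    DownloadedPage("https://example.com/", "text/html", b"<html>home</html>"),
    DownloadedPage("https://example.com/a.pdf", "application/pdf", b"%PDF-1.7"),
]
columns = pages_to_columns(pages)
# write_parquet(columns, "crawl_output.parquet")  # uncomment where pyarrow is installed
```

Keeping the URL as a column means downstream transforms can still trace each row to its source page without depending on file names.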

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shahrokhDaijavad added the enhancement (New feature or request) label Oct 29, 2024
@touma-I self-assigned this Nov 5, 2024
@touma-I
Collaborator

touma-I commented Nov 5, 2024

@touma-I will get the code from the developers, adapt it to a DPK transform, and submit a PR.

@Bytes-Explorer
Collaborator

Request for consideration as we build this transform:

  1. A simple API call that needs minimal lines of code
  2. Document the API parameters and how to use them in the README
  3. Request from Sujee on storing data

Image

@touma-I
Collaborator

touma-I commented Nov 12, 2024

@Bytes-Explorer Can you explain why you need a structured/nested directory layout? What is the motivation for it, and how will it be used? There are two reasons why we should NOT do it:

  1. Websites tend to have a poorly defined structure, and their link graphs are often cyclic.
  2. The framework is not set up to handle nested folders.

Please provide additional detail, or we can skip this requirement for now.
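
The two concerns above can be illustrated with a small sketch, assuming a flat output folder is acceptable: a visited set keeps a cyclic link graph from being walked forever, and a hash-based file name avoids nested folders. The `flat_name` and `crawl_order` helpers and the in-memory `links` graph are hypothetical, not part of DPK or the connector.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit


def flat_name(url: str) -> str:
    """Map a URL to a flat file name: host plus a short hash of the full URL."""
    host = urlsplit(url).netloc.replace(":", "_")
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return f"{host}_{digest}"


def crawl_order(seed: str, links: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk of a link graph that visits each URL once, even with cycles."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order


# Toy link graph with a cycle: / -> /a -> /b -> /
links = {
    "https://example.com/": ["https://example.com/a"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/"],
}
order = crawl_order("https://example.com/", links)
names = [flat_name(u) for u in order]
```

If a nested view is still wanted later, it could be rebuilt from a stored URL column rather than from the folder layout.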

@Bytes-Explorer
Collaborator

@touma-I That requirement is from @sujee. Let's discuss it on the call. It is recorded in this issue so that requests from our users are not lost on Slack.


3 participants