Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: Astra DB Source Connector Support #3212

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

erichare
Copy link

This Pull Request adds support for Astra DB as a source connector. The idea being that some collections in Astra DB may not be vector collections. Here, we can pull structured data, and use the destination connector to produce a structured table with vector support. CC @potter-potter

@potter-potter
Copy link
Contributor

@erichare I'll check this out this week. Thanks!

@erichare
Copy link
Author

Hi @potter-potter , just wondering would it be helpful to you if i merged in the changes from #3179 into this PR? I can do so and close the other PR if so!

@property
def filename(self):
return (
Path(self.processor_config.output_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to the wrong folder. It needs to go to the download folder so that the Partitioner can pick it up. So something like this:

    @property
    def filename(self):
        return (
            Path(self.read_config.download_dir)
            / self.connector_config.collection_name
            / f"{self.metadata['_id']}.txt"
        ).resolve()

@SourceConnectionError.wrap
@requires_dependencies(["astrapy"], extras="astra")
@BaseSingleIngestDoc.skip_if_file_exists
def get_file(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this should go to the download folder, I think we usually do something like this. Maybe check out how the mongodb is doing this.

        self.filename.parent.mkdir(parents=True, exist_ok=True)
        with open(self.filename, "w") as f:

@BaseSingleIngestDoc.skip_if_file_exists
def get_file(self):
with open(self.filename, "w") as f:
f.write(self.metadata["content"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a "content" key in the data that I was downloading from the collection we are using.

Also, can you have each field's value write to a separate line like we do in Mongodb? Or were you mimicing a different source connector.

For mongodb connector the download .txt file just looks like this:

1901
Kansas Saloon Smashers
American
Unknown
nan
unknown
https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers

@potter-potter
Copy link
Contributor

@erichare This looks like it's pretty close to being done.

@erichare
Copy link
Author

erichare commented Jun 24, 2024

@potter-potter pushed some updates just now based on the feedback! Thanks so much.

EDIT: Also just wanted to check if you want me to just merge in the changes from #3179 into this one, and close 3179? just let me know.


@property
def _output_filename(self):
return Path(self.processor_config.output_dir) / f"{self.filename}.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting saved in wrong directory due to self.filename being a path. Should be something like:

        return (Path(self.processor_config.output_dir)
            / self.connector_config.collection_name
            / f"{self.metadata['_id']}.json"
            )

@potter-potter
Copy link
Contributor

@erichare You can merge in the changes from 3179.

@erichare
Copy link
Author

@erichare You can merge in the changes from 3179.

Thanks @potter-potter ! I've done that, made the update to the output paths, and did some little cleanup of the imports. Let me know if this looks good or if there's anything else to address. Thanks so much for the reviews.

@potter-potter
Copy link
Contributor

@erichare This all looks good. I'll take over and get it into PR today or tomorrow. Nice work!

@erichare
Copy link
Author

@erichare This all looks good. I'll take over and get it into PR today or tomorrow. Nice work!

Thanks @potter-potter !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants