@vringar vringar commented Jun 15, 2020

This function should give the user a general overview of the `crawl_history` and what kind of data loss to expect.
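For readers without access to the branch, here is a minimal sketch of what such an overview could compute. It is not the code from this PR: the function name `summarize_crawl_history`, the representation of rows as dicts, and the non-"ok" status values in the sample data are all assumptions made for illustration. Only the `command_status` column and the "ok" value come from this conversation.

```python
from collections import Counter


def summarize_crawl_history(rows):
    """Hypothetical helper: count command outcomes in crawl_history rows.

    `rows` is assumed to be an iterable of dicts with a "command_status" key.
    """
    status_counts = Counter(row["command_status"] for row in rows)
    total = sum(status_counts.values())
    # Every command that did not finish with "ok" is treated as potential data loss.
    not_ok = total - status_counts.get("ok", 0)
    return {
        "total_commands": total,
        "status_counts": dict(status_counts),
        "not_ok_fraction": not_ok / total if total else 0.0,
    }


# Made-up sample data; "timeout" and "error" are assumed status values.
sample_rows = [
    {"visit_id": 1, "command_status": "ok"},
    {"visit_id": 2, "command_status": "timeout"},
    {"visit_id": 3, "command_status": "ok"},
    {"visit_id": 4, "command_status": "error"},
]
summary = summarize_crawl_history(sample_rows)
```

The `not_ok_fraction` field gives a rough upper bound on how much of the crawl may be affected, which is the kind of number a user would want before starting an analysis.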

@vringar vringar force-pushed the display_crawl_history branch from 1b2bee5 to 8e88f65 Compare June 15, 2020 10:23
@englehardt

@vringar do you have a sample notebook I can use to check the outputs of these? I know we don't have tests (which is okay for now), but that would help me check whether everything is what I'd expect.

@englehardt englehardt left a comment

.

vringar commented Jul 6, 2020

> @vringar do you have a sample notebook I can use to check the outputs of these? I know we don't have tests (which is okay for now), but that would help me check whether everything is what I'd expect.

This function is used in the dataquality notebook on Databricks.

@vringar vringar force-pushed the display_crawl_history branch 2 times, most recently from 3220d0f to d8eb4bb Compare July 20, 2020 14:38
@vringar vringar changed the base branch from load_table_enhancement to master April 9, 2021 10:27
Stefan Zabka and others added 2 commits April 9, 2021 12:38
Downloading files via the SparkContext was much slower than
downloading via boto (which is what S3Dataset does).
So now both classes use the same method, as PySparkS3Dataset
inherits from S3Dataset.

This parameter allows for filtering out VisitIds that are part of
`incompleted_visits` or that had a command with a command_status other than
"ok", since users probably shouldn't consider them for analysis.

This filtering functionality is extracted into the TableFilter class to
be reused by other Datasets.
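
The filtering rule described in this commit message can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the TableFilter implementation from this PR: the function names `usable_visit_ids` and `filter_table`, the `visit_id` key, and the dict-based rows are hypothetical, and the "timeout" status in the sample data is made up.

```python
def usable_visit_ids(crawl_history_rows, incomplete_visit_ids):
    """Hypothetical helper: return the visit IDs considered safe for analysis.

    A visit is excluded if it is listed as incomplete, or if any of its
    commands has a command_status other than "ok".
    """
    all_ids = {row["visit_id"] for row in crawl_history_rows}
    failed_ids = {
        row["visit_id"]
        for row in crawl_history_rows
        if row["command_status"] != "ok"
    }
    return all_ids - failed_ids - set(incomplete_visit_ids)


def filter_table(rows, keep_ids):
    """Hypothetical helper: keep only rows whose visit_id is in keep_ids."""
    return [row for row in rows if row["visit_id"] in keep_ids]


# Made-up sample data for illustration only.
crawl_history = [
    {"visit_id": 1, "command_status": "ok"},
    {"visit_id": 1, "command_status": "ok"},
    {"visit_id": 2, "command_status": "ok"},
    {"visit_id": 2, "command_status": "timeout"},
    {"visit_id": 3, "command_status": "ok"},
]
incomplete = [3]
table = [{"visit_id": 1}, {"visit_id": 2}, {"visit_id": 3}]

keep = usable_visit_ids(crawl_history, incomplete)
filtered = filter_table(table, keep)
```

Computing the set of usable IDs once and applying it to each table separately mirrors the stated goal of reusing the same filter across Datasets.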
@vringar vringar force-pushed the display_crawl_history branch 2 times, most recently from cb8a25f to 00a3d47 Compare April 9, 2021 11:57
@vringar vringar force-pushed the display_crawl_history branch from 00a3d47 to 247adea Compare April 9, 2021 12:10
@vringar vringar requested a review from englehardt April 9, 2021 12:10