DRAFT add Data Health Checker #1574

Draft
wants to merge 2 commits into
base: main

@benheckmann benheckmann commented Jun 28, 2023

Description

First draft of a data health checker, as discussed in #854. The checker receives a path to the data in CSV or qlib format (the qlib format is not implemented yet), converts the data to a DataFrame, and performs basic checks for data completeness and correctness.

I am not too familiar with the qlib data handling yet, so I am hoping to get some first feedback on whether this goes in the right direction.
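To make the kind of completeness check described above more concrete, here is a minimal sketch using pandas. It is not the code in this PR: the helper names `check_required_columns` and `check_missing_data` are hypothetical, and only the list of required columns is taken from the checker's error message quoted below.

```python
import pandas as pd

# Required column names, as listed in the checker's error message.
REQUIRED_COLUMNS = ["open", "high", "low", "close", "volume"]


def check_required_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


def check_missing_data(df: pd.DataFrame) -> list:
    """Return the required columns present in df that contain at least one NaN."""
    present = [col for col in REQUIRED_COLUMNS if col in df.columns]
    return [col for col in present if df[col].isna().any()]


# Made-up frame mimicking the #854 case, where "volume" was named "vol".
example = pd.DataFrame(
    {
        "open": [10.0, 10.2, None],
        "high": [10.5, 10.6, 10.4],
        "low": [9.8, 10.0, 10.1],
        "close": [10.1, 10.3, 10.2],
        "vol": [1000, 1200, 900],
    }
)
missing_columns = check_required_columns(example)
columns_with_gaps = check_missing_data(example)
```

Returning lists of column names rather than raising keeps the checks composable, so a caller can log one message per file and aggregate counts for a summary afterwards.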

Motivation and Context

See #854. In that issue, a user received an unhelpful error message because their data did not adhere to the expected format (specifically, the "volume" column was named "vol"). Running this checker on the data from #854 would give the user:

[...]
ERROR:root:002645.SZ.csv: Missing columns ['volume'] of required columns ['open', 'high', 'low', 'close', 'volume'].
WARNING:root:002645.SZ.csv: Missing 'factor' column, trading unit will be disabled.

Summary of data health check (4220 files checked):
-----------------------
Problem                   Count  Affected columns
MISSING_REQUIRED_COLUMN   4220   {'volume'}
MISSING_DATA              0      -
LARGE_STEP_CHANGE         14     {'low', 'open', 'close', 'high'}
MISSING_FACTOR            4220   {'factor'}

Note: the large step change check uses two configurable thresholds (one for price and one for volume) and only looks for step changes in the OHLCV columns.
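As an illustration of the two-threshold idea, a step-change check could look roughly like the sketch below. This is an assumption-laden example, not the PR's implementation: the function name `find_large_step_changes`, the default threshold values, and the choice of relative (rather than absolute) change between consecutive rows are all hypothetical.

```python
import pandas as pd

# Hypothetical defaults; the PR only states that both thresholds are configurable.
PRICE_THRESHOLD = 0.5   # flag relative price changes above 50%
VOLUME_THRESHOLD = 3.0  # flag relative volume changes above 300%

OHLCV_COLUMNS = ["open", "high", "low", "close", "volume"]


def find_large_step_changes(
    df: pd.DataFrame,
    price_threshold: float = PRICE_THRESHOLD,
    volume_threshold: float = VOLUME_THRESHOLD,
) -> set:
    """Return the OHLCV columns whose row-to-row relative change exceeds its threshold."""
    affected = set()
    for col in OHLCV_COLUMNS:
        if col not in df.columns:
            continue  # missing columns are reported by a separate check
        threshold = volume_threshold if col == "volume" else price_threshold
        # Relative change versus the previous row; the first row is NaN and never flagged.
        # A previous value of zero yields inf, which counts as a large step.
        rel_change = df[col].diff().abs() / df[col].shift().abs()
        if (rel_change > threshold).any():
            affected.add(col)
    return affected


# Made-up data: "close" roughly doubles between the second and third rows.
example = pd.DataFrame(
    {
        "open": [10.0, 10.2, 10.3, 10.4],
        "close": [10.0, 10.1, 20.5, 20.6],
        "volume": [1000, 1500, 1200, 1100],
    }
)
affected_columns = find_large_step_changes(example)
```

Keeping separate thresholds reflects that volume routinely moves by large factors from day to day, while a price jump of the same relative size more likely indicates a data problem such as an unadjusted split.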

How Has This Been Tested?

No tests yet, as this is only a first draft.

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py from the parent directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

  1. Pipeline test:
  2. Your own tests:

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

@github-actions bot added the "waiting for triage" label (Cannot auto-triage, wait for triage.) on Jun 28, 2023
@benheckmann
Author

@microsoft-github-policy-service agree

@Fivele-Li
Contributor

Add unit tests in qlib.test
