Create anomaly detection workflow #3340
Dataset Quality Checker Proposal

This is a rough idea for a basic dataset quality checker. The goal is to see how technically challenging it would be, while helping us spot important anomalies we can actually act on. We're looking for things like fat-finger errors, out-of-range values, or extreme differences between countries. We're not trying to catch more complex issues like regime shifts or changes in definitions; those are a bit too advanced for now.

This could be relatively easy to try out on a few datasets to see how well it works. Ideally, we'd have a list of real anomalies to test it against, but if not, we can just artificially mess up the data.
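As a rough illustration of what such basic checks could look like (a minimal sketch only; the long format and the `country`, `year`, and `value` column names are assumptions, not our actual schema):

```python
import pandas as pd


def basic_quality_checks(df: pd.DataFrame, value_col: str = "value") -> pd.DataFrame:
    """Flag simple anomalies: out-of-range values and extreme differences between countries.

    Assumes a long-format table with `country`, `year` and a numeric value column.
    """
    checks = []

    # Fat-finger / out-of-range candidates: values far outside the overall distribution.
    low, high = df[value_col].quantile([0.001, 0.999])
    out_of_range = df[(df[value_col] < low) | (df[value_col] > high)]
    checks.append(out_of_range.assign(check="out_of_range"))

    # Extreme differences between countries: compare each country's median to the global median.
    medians = df.groupby("country")[value_col].median()
    global_median = medians.median()
    mad = (medians - global_median).abs().median() or 1.0
    extreme = medians[(medians - global_median).abs() > 10 * mad]
    checks.append(
        df[df["country"].isin(extreme.index)].assign(check="extreme_vs_other_countries")
    )

    return pd.concat(checks, ignore_index=True)
```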
Proposal A

Summary

We create a new dedicated page in wizard for anomaly detection. To begin with, this will be a table with one row per indicator, taken from the list of indicators in datasets that have changed in a given PR.

Changes in wizard

We need a new dedicated page for anomaly detection.

New backend code for anomaly detection in wizard

We need code to execute anomaly detection at the individual indicator level, and when comparing indicator versions.

Changes in indicator upgrader (optional)

We may also need to look into how to handle variable mappings. In the easiest version, we just store a json file locally (see the sketch below), but in the long term it would be good to be able to "restore" mappings in indicator upgrader, and store mappings in the grapher database.

Distribution of tasks

We thought we could roughly divide the work as follows:
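On the variable-mapping point above, a minimal sketch of the "store a json file locally" option (the file path and the old-id-to-new-id structure are illustrative assumptions; the real mappings would eventually live in the grapher database):

```python
import json
from pathlib import Path

# Hypothetical local cache of variable mappings produced by indicator upgrader.
MAPPINGS_FILE = Path("~/.owid/anomalist/variable_mappings.json").expanduser()


def load_mappings() -> dict[str, int]:
    """Load previously stored mappings, or an empty dict if none exist yet."""
    if MAPPINGS_FILE.exists():
        return json.loads(MAPPINGS_FILE.read_text())
    return {}


def save_mapping(old_to_new_ids: dict[int, int]) -> None:
    """Persist a mapping from old variable ids to new variable ids."""
    MAPPINGS_FILE.parent.mkdir(parents=True, exist_ok=True)
    existing = load_mappings()
    existing.update({str(k): v for k, v in old_to_new_ids.items()})
    MAPPINGS_FILE.write_text(json.dumps(existing, indent=2))
```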
To add to Pablo's summary:
Workflow

How I see the data manager workflow:
Challenges
Possible structure of AD output

Step 2 generates an output summary of the anomalies, which is then used in the streamlit app (step 3). We should agree on its format:

```yaml
anomalies:
  - indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_1"
    - description: "Sudden spike in ..."
    - description: "Sudden drop in ..."
    - description: "Missing values in period ..."
  - indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_2"
    - description: "Values way higher in ..."
      indicator_old_slug: "grapher/bla/2023-01-01/bla/bla#indicator_2"
```

Note: some anomalies could be just for one indicator, and others could be for that indicator relative to its old version (see indicator_old_slug).

Optional: We could try to pinpoint country & years for a single anomaly (i.e. 'when that anomaly happened'), though that could be either 'a single country-year', 'a single country for a year period', 'multiple countries and years', etc. So I am unsure about the format at the moment.
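To make the draft a bit more concrete, here is a minimal sketch of how step 2 could build and serialize that summary (it normalizes the pseudo-YAML above into an explicit per-indicator anomaly list, which is an assumption about the final format, and it assumes PyYAML is available):

```python
from dataclasses import dataclass, field

import yaml


@dataclass
class Anomaly:
    description: str
    # Only set when the anomaly is relative to the indicator's previous version.
    indicator_old_slug: str | None = None


@dataclass
class IndicatorAnomalies:
    indicator_slug: str
    anomalies: list[Anomaly] = field(default_factory=list)


def to_yaml(indicators: list[IndicatorAnomalies]) -> str:
    """Serialize the step-2 output so the streamlit app (step 3) can load it."""
    payload = {
        "anomalies": [
            {
                "indicator_slug": ind.indicator_slug,
                "anomalies": [
                    {k: v for k, v in vars(a).items() if v is not None}
                    for a in ind.anomalies
                ],
            }
            for ind in indicators
        ]
    }
    return yaml.safe_dump(payload, sort_keys=False)
```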
From our discussion:

First version

The UI and workflow / Lucas
Anomalies within an indicator / Mojmir

We will probably use a ChatGPT-based approach. A major challenge is packing the dataset into the 128k context window. We could:
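For instance (a hypothetical sketch, not a decided approach), we could compress each indicator into compact per-country summary statistics before sending it to the model, which keeps the prompt well under the context limit:

```python
import pandas as pd


def summarize_for_llm(df: pd.DataFrame, value_col: str = "value") -> str:
    """Compress a long-format indicator (country, year, value) into a short text summary.

    The idea is to send the model only aggregate statistics and the latest value
    per country, instead of the raw table.
    """
    lines = []
    for country, group in df.groupby("country"):
        group = group.sort_values("year")
        last = group.iloc[-1]
        lines.append(
            f"{country}: n={len(group)}, min={group[value_col].min():.3g}, "
            f"max={group[value_col].max():.3g}, mean={group[value_col].mean():.3g}, "
            f"latest={last[value_col]:.3g} ({int(last['year'])})"
        )
    return "\n".join(lines)
```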
Anomalies relative to a baseline / Pablo

We can experiment with an approach based on percentage change.

Later options

If all of this goes well, we could consider:
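A minimal sketch of the percentage-change idea (the column names and the 20% threshold are illustrative assumptions, not agreed parameters):

```python
import pandas as pd


def compare_to_baseline(
    new: pd.DataFrame,
    old: pd.DataFrame,
    value_col: str = "value",
    threshold: float = 0.2,
) -> pd.DataFrame:
    """Flag country-years whose value changed by more than `threshold` (relative) between versions."""
    merged = new.merge(old, on=["country", "year"], suffixes=("_new", "_old"))
    merged["pct_change"] = (
        merged[f"{value_col}_new"] - merged[f"{value_col}_old"]
    ) / merged[f"{value_col}_old"].abs()
    # Return the largest relative deviations first.
    return merged[merged["pct_change"].abs() > threshold].sort_values(
        "pct_change", key=lambda s: s.abs(), ascending=False
    )
```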
Today we had a long discussion about the pending tasks and how to organize ourselves to work on this project. From our meeting, we concluded that there are three main parts:
Some other important points:
Other pending discussion points
The first version of The Anomalist has been merged into master and shared with the team. Pablo gave a short demo on how to use it. We are currently gathering enhancements and improvements in a separate issue: #3436. I'm unsure whether this issue should remain open. Thoughts, @pabloarosado?
Thanks @lucasrodes, I think we can close this "big project" issue for now, and keep the tracking issue open.
Summary
We should perform better data quality checks, and ideally have tools to help us identify data anomalies, as part of the normal flow of our data work.
Problem
We currently perform sanity checks via assertions, with ad-hoc code outside of ETL, or via visual inspection in the Indicator Upgrader, explorer tools, or chart diff. But many data issues still remain in our data, and users often point them out.
In most cases, the issues were "not our fault", since they were already present in the original data. However, we should at least be aware of these issues, contact data providers early on, and fix them when possible.
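For context, the kind of ad-hoc assertion we write today looks roughly like this (a generic sketch, not an excerpt from any specific ETL step; the column names are hypothetical):

```python
import pandas as pd


def sanity_check(tb: pd.DataFrame) -> None:
    """Typical hand-written checks sprinkled into individual steps today."""
    assert tb["population"].notnull().all(), "Population has missing values."
    assert (tb["population"] > 0).all(), "Population must be positive."
    assert tb["year"].between(1800, 2100).all(), "Year outside expected range."
```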
Impact
These data issues can lead to the following undesired outcomes:
Scope
List of PRs
Open issues and enhancements
We were discussing issues on Slack, but from now on (since we'll have to wrap up next week), we can keep adding issues here:
#3436