Freeform data normalization #221
Replies: 4 comments 8 replies
Here's a rough mockup of what a manual normalization UI could look like.
For inspiration I've been testing out the BERT keyword extractor built with Streamlit (a tool for building data-centric admin apps in Python): https://bert-keyword-extractor.streamlit.app/ Here it's not exactly an NER or keyword extraction issue but more of a classification problem: you want to know whether the current text content "belongs" to one or more predefined labels. You also want to be able to add labels on the fly, and probably have "online" training so the algorithm improves every time you manually normalize some data.
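
To make the "add labels on the fly" and "online" parts concrete, here is a minimal sketch in plain Python. It is not BERT and not what the linked app does: it swaps in a crude token-overlap score just to show the shape of the loop (manually normalize an answer, learn from it, get suggestions for the next one). The class name `OnlineLabeler`, the `threshold` parameter and the `select_styling` label are all made up for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class OnlineLabeler:
    """Toy multi-label suggester that improves as answers are manually normalized.

    Hypothetical helper: it scores a label by the share of the answer's tokens
    already seen under that label. A real system would use a proper model.
    """

    def __init__(self):
        # label -> counts of tokens seen in answers manually given that label
        self.token_counts = defaultdict(Counter)

    def learn(self, text, labels):
        """Record one manual normalization; unknown labels are created on the fly."""
        for label in labels:
            self.token_counts[label].update(tokenize(text))

    def suggest(self, text, threshold=0.5):
        """Return labels whose known tokens cover at least `threshold` of the answer."""
        tokens = set(tokenize(text))
        if not tokens:
            return []
        scores = {}
        for label, counts in self.token_counts.items():
            overlap = sum(1 for token in tokens if token in counts)
            scores[label] = overlap / len(tokens)
        matching = [label for label, score in scores.items() if score >= threshold]
        return sorted(matching, key=lambda label: -scores[label])


labeler = OnlineLabeler()
labeler.learn("Date/time pickers are often finicky", ["input_type_date"])
labeler.learn("Styling select elements is hard", ["select_styling"])
print(labeler.suggest("finicky date pickers"))
```

Every manual normalization is just another `learn` call, so suggestions get better as more answers are processed by hand.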
We can do this. Note, however, that the ChatGPT API collects data. As far as I know we have no confidential data, so this shouldn't be a problem (correct me if I'm wrong). Before that, we have to preprocess the data, since we cannot just feed all of it to the API. I also believe that the existing normalization process is enough for some questions. We can check what already works with the data we have before anything else.
Screen.Recording.2023-10-02.at.07.04.52.AM.mp4
It works!
The State of HTML survey has an unusually high number of freeform questions, many of which use the TextList input which itself accepts multiple responses.
As an example, the first freeform question ("What are your pain points around HTML forms?") has already received over 3000 answers.
Current Normalization Process
The goal of the normalization process is to match a freeform text string ("Date/time pickers are often finicky") to a canonical ID (input_type_date).
Here's a recap of how our current system works.
1. We start with a question such as forms_pain_points, which specifies its match tag(s) in its metadata (in this case html_pain_points).
2. We find the entities tagged html_pain_points and look up each one's regex pattern.
3. If a pattern matches the response, we add that entity's id to the question's normalized field.
If there is no entity yet, we need to manually create it and then run the normalization process again.
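
The recap above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the `Entity` and `Question` classes, the `normalize` function, the regex patterns and the `other_entity` entry are all hypothetical, and only the ids forms_pain_points, html_pain_points and input_type_date come from the description.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    id: str              # canonical ID, e.g. "input_type_date"
    tags: list[str]      # tags used to decide which questions it applies to
    patterns: list[str]  # regex patterns that identify this entity


@dataclass
class Question:
    id: str
    match_tags: list[str]  # taken from the question's metadata


def normalize(answer, question, entities):
    """Return the ids of entities that share a tag with the question
    and have at least one pattern matching the freeform answer."""
    matched = []
    for entity in entities:
        if not set(entity.tags) & set(question.match_tags):
            continue  # entity is not relevant to this question
        if any(re.search(p, answer, re.IGNORECASE) for p in entity.patterns):
            matched.append(entity.id)
    return matched


# Hypothetical data: the pattern below is made up for this example.
ENTITIES = [
    Entity("input_type_date", ["html_pain_points"], [r"date(/time)?\s*pickers?"]),
    Entity("other_entity", ["other_tag"], [r"pickers?"]),  # filtered out by tag
]
QUESTION = Question("forms_pain_points", ["html_pain_points"])

print(normalize("Date/time pickers are often finicky", QUESTION, ENTITIES))
print(normalize("Form inputs of the kind you use to book a flight", QUESTION, ENTITIES))
```

The second call returns an empty list: nothing in that string is catchable by a reasonable pattern, even though a human would map it to input_type_date.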
Limitations
In many cases there is no regex that can realistically match a string. To make up an example, the string "Form inputs of the kind you use to book a flight" should probably match input_type_date, but doesn't contain an easily matchable character string.
Strategy 1: Manual Matching
We could imagine a manual interface that lets you type in the id to match on your own for any freeform response.
Issues
id is meant to be reusable
Strategy 2: Something Else
We could also do something else entirely, such as using ML or the ChatGPT API to classify the data.
Issues