Freeform data normalization #221

SachaG · 2023-09-29T05:01:00Z

SachaG
Sep 29, 2023
Maintainer

The State of HTML survey has an unusually high number of freeform questions, many of which use the TextList input, which itself accepts multiple responses.

As an example, the first freeform question ("What are your pain points around HTML forms?") has already received over 3000 answers.

Current Normalization Process

The goal of the normalization process is to match a freeform text string ("Date/time pickers are often finicky") to a canonical ID (input_type_date).

Here's a recap of how our current system works.

1. We are parsing question forms_pain_points, which specifies its match tag(s) in its metadata (in this case html_pain_points).
2. We find an unmatched string ("Date/time pickers are often finicky").
3. We go through every entity that has the tag html_pain_points and look up its regex pattern.
4. If the regex pattern is a match, we add the entity's id to the question's normalized field.

If there is no entity yet, we need to manually create it and then run the normalization process again.
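
As a rough illustration of the matching loop described above, here is a minimal Python sketch. The `Entity` class, the `normalize` function, and the regex patterns are all hypothetical and made up for this example; they are not the project's actual code or entity definitions.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    tags: list[str]
    pattern: str  # regex source (hypothetical patterns, for illustration only)


# Hypothetical entities; the real ones live in the project's YAML files.
ENTITIES = [
    Entity("input_type_date", ["html_pain_points"], r"date(/time)?\s*(picker|input)s?"),
    Entity("form_validation", ["html_pain_points"], r"validat(ion|ing|e)"),
    Entity("css_nesting", ["css_features"], r"nesting"),
]


def normalize(answer: str, match_tags: list[str], entities=ENTITIES) -> list[str]:
    """Return the ids of all entities that share a tag with the question
    and whose regex pattern matches the freeform answer."""
    matched = []
    for entity in entities:
        # Skip entities that carry none of the question's match tags.
        if not set(entity.tags) & set(match_tags):
            continue
        if re.search(entity.pattern, answer, flags=re.IGNORECASE):
            matched.append(entity.id)
    return matched


print(normalize("Date/time pickers are often finicky", ["html_pain_points"]))
```

An answer that matches no pattern comes back as an empty list, which corresponds to the "no entity yet" case where a new entity has to be created by hand.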

Limitations

In many cases there is no regex that can realistically match a string. To make up an example, the string "Form inputs of the kind you use to book a flight" should probably match input_type_date, but doesn't contain an easily matchable character string.

Strategy 1: Manual Matching

We could imagine an interface that lets you manually type in the matching id for any freeform response.

Issues

This would involve a lot of manual work.
It would still require separately adding the entity to the YAML files if the id is meant to be reusable.
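
To make the manual strategy a bit more concrete, here is a small Python sketch of how manually entered matches could be stored. Everything in it (`ManualMatches`, `response_key`, the idea of keying on a whitespace- and case-normalized string) is an assumption for illustration, not an existing part of the system. It also flags ids that have no entity yet, which is the second issue listed above.

```python
def response_key(text: str) -> str:
    """Collapse whitespace and case so identical answers share one manual match."""
    return " ".join(text.lower().split())


class ManualMatches:
    """Hypothetical store of manual matches, keyed by normalized response text."""

    def __init__(self, known_entity_ids):
        self.known_entity_ids = set(known_entity_ids)
        self._matches = {}

    def add(self, text: str, entity_id: str) -> bool:
        """Record a manual match.

        Returns False when the id has no entity yet, meaning it would still
        need to be added to the YAML files to be reusable."""
        self._matches.setdefault(response_key(text), set()).add(entity_id)
        return entity_id in self.known_entity_ids

    def lookup(self, text: str) -> list[str]:
        """Return the manually assigned ids for this response, if any."""
        return sorted(self._matches.get(response_key(text), set()))


matches = ManualMatches(known_entity_ids={"input_type_date"})
matches.add("Form inputs of the kind you use to book a flight", "input_type_date")
print(matches.lookup("form inputs of the kind you use to book a FLIGHT"))
```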

Strategy 2: Something Else

We could also do something else entirely, such as using ML or the ChatGPT API to classify the data.

Issues

More upfront development work
If this is not done through our system, we will not be able to show the data in the public report as easily

SachaG · 2023-09-29T05:10:22Z

SachaG
Sep 29, 2023
Maintainer Author

Here's a rough mockup of what a manual normalization UI could look like.

1 reply

SachaG Sep 29, 2023
Maintainer Author

eric-burel · 2023-09-29T07:01:45Z

eric-burel
Sep 29, 2023
Maintainer

For inspiration, I've been testing out the BERT keyword extractor on Streamlit (a tool for building data-centric admin apps in Python): https://bert-keyword-extractor.streamlit.app/
It was not great out of the box, but I'll keep digging in this direction. NER/categorization algorithms can be trained on labeled samples, so you will always need a UI anyway, and then ML can progressively learn to prefill it.

Here it's not exactly a NER or keyword extraction issue but more of a classification problem: you want to know whether the current text content "belongs" to one or multiple predefined labels. You want to be able to add labels on the fly, and probably "online" training, so the algorithm improves every time you manually normalize some data.
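
To illustrate what "add labels on the fly" and "online" improvement could look like, here is a deliberately naive Python sketch. It is not BERT, NER, or any trained model: it just scores token overlap against previously labeled answers. The class name, stopword list, and threshold are all assumptions chosen for the example.

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")
# Tiny hypothetical stopword list, just enough for the example.
STOPWORDS = {"a", "an", "the", "are", "is", "of", "to", "and", "often", "in", "for", "on", "with"}


def tokenize(text: str) -> list[str]:
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


class OnlineLabeler:
    """Naive multi-label suggester that learns from each manual normalization."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.token_counts = defaultdict(Counter)  # label -> tokens seen so far

    def learn(self, text: str, labels: list[str]) -> None:
        """Update incrementally; unseen labels are created on the fly."""
        for label in labels:
            self.token_counts[label].update(tokenize(text))

    def suggest(self, text: str) -> list[str]:
        """Suggest every label whose known tokens cover enough of the text."""
        tokens = set(tokenize(text))
        if not tokens:
            return []
        suggestions = []
        for label, counts in self.token_counts.items():
            overlap = sum(1 for t in tokens if t in counts) / len(tokens)
            if overlap >= self.threshold:
                suggestions.append(label)
        return sorted(suggestions)


labeler = OnlineLabeler()
labeler.learn("Date/time pickers are often finicky", ["input_type_date"])
print(labeler.suggest("date pickers are broken on mobile"))
```

A real classifier would replace the overlap score, but the `learn`/`suggest` split shows the workflow: the UI calls `learn` after each manual match and `suggest` to prefill the next one.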

5 replies

ShaineRosewel Sep 29, 2023

If it is a classification problem, we need labelled training data. But I'm thinking we do not want this right now, since labelling will take time. Pretrained models are not expected to perform well, but we can explore what could work. The last time I checked, there was one language model trained on Stack Overflow data; it could be useful.

eric-burel Sep 29, 2023
Maintainer

@ShaineRosewel I've added you to the datascience repo. I am trying to get better at ML, so that's a good opportunity to try things.

ShaineRosewel Sep 29, 2023

Cool! Seems like I have not been notified? Is it with this username that you sent the invite?

eric-burel Sep 29, 2023
Maintainer

You can check https://github.com/Devographics/datascience/ (normally you'll have access).
It just contains the code for the CSV "cleaner". You can't use it without the normalized CSV data exported from Mongo, but it shows how we can use Python via Docker and a VS Code container (I'd like to keep Python code in a container so it doesn't start biting people with its package install issues and version mismatch venom...), plus an experiment related to clustering. I am going to experiment with classification there too.

ShaineRosewel Sep 29, 2023

okay got that thanks!

ShaineRosewel · 2023-09-29T08:04:53Z

ShaineRosewel
Sep 29, 2023

Strategy 2: Something Else

We could also do something else entirely, such as using ML or the ChatGPT API to classify the data.

We can do this. The ChatGPT API collects data, however. I'm thinking we have no confidential data anyway, so it should not be a problem (correct me if I am wrong). Before that, we have to preprocess the data, since we cannot just feed it all to the API. I also believe that the normalization process is enough for some questions. We can check what could work with the data that we have before anything else.
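
As one possible reading of the preprocessing step mentioned above, here is a minimal Python sketch that collapses whitespace, deduplicates answers case-insensitively, and splits the unique strings into batches. The `preprocess` function and the batch size are assumptions for illustration; what the API would actually need may differ.

```python
from collections import Counter


def preprocess(answers, batch_size=50):
    """Deduplicate freeform answers and group the unique ones into batches.

    Returns (batches, counts), where counts records how many times each
    cleaned answer occurred, so results can be mapped back to all responses."""
    counts = Counter()
    for answer in answers:
        key = " ".join(answer.lower().split())  # collapse whitespace and case
        if key:  # drop empty answers
            counts[key] += 1
    unique = list(counts)  # first-seen order
    batches = [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
    return batches, counts


batches, counts = preprocess(["Date pickers", "date  pickers ", "", "Validation"], batch_size=1)
print(batches, counts["date pickers"])
```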

2 replies

eric-burel Sep 29, 2023
Maintainer

Here is an article about few-shot classification using ChatGPT, with a comparison to another model, "RoBERTa". They seem to work with a few hundred samples: https://towardsdatascience.com/text-classification-challenge-with-extra-small-datasets-fine-tuning-versus-chatgpt-6348fecea357 (sorry if you hit the paywall)
The insane thing with ChatGPT is that it would take our data almost as is, while as far as I know, with a more traditional approach we would first need to learn a correct representation for the data (I guess word embeddings in our case), then apply some classifier for each label (if the embeddings are good we may not even need ML at this step? I don't know much about text classification).

Regarding privacy, the only risk is that people give personal data in their input, but we can probably easily detect emails, for instance, and scrub them out. Also, I don't know about the cost of this.
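
A minimal sketch of the email scrubbing idea, assuming a simple regex is good enough for survey answers. The pattern below is a common approximation, not a full RFC 5322 matcher, and the function name and placeholder text are made up for the example.

```python
import re

# Approximate email pattern; intentionally simple and not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(text: str, placeholder: str = "[email removed]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


print(scrub_emails("Contact me at jane.doe@example.com please"))
```

Other kinds of personal data (names, phone numbers) would need their own checks; this only covers the email case mentioned above.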

ShaineRosewel Sep 29, 2023

ChatGPT in itself works without the need to create embeddings. One disadvantage is that its training data only goes up to Sept 2021. Hence, in our case, since we are dealing with the 'latest', some words might be outside its knowledge. We can test this with the data.
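
For reference, here is a hypothetical sketch of how a few-shot classification prompt could be assembled from already-normalized answers. It only builds the prompt text and makes no API call; the wording of the prompt, the function name, and the label list are assumptions, and how well a model responds to it would need to be tested with real data.

```python
def build_few_shot_prompt(labels, examples, answer):
    """Build a plain-text few-shot prompt.

    labels:   candidate entity ids
    examples: (answer text, list of ids) pairs that were already normalized
    answer:   the new freeform answer to classify"""
    lines = [
        "Classify the survey answer into zero or more of these ids: " + ", ".join(labels) + ".",
        "Reply with a comma-separated list of ids, or 'none'.",
        "",
    ]
    for text, ids in examples:
        lines.append(f"Answer: {text}")
        lines.append("Ids: " + (", ".join(ids) if ids else "none"))
        lines.append("")
    lines.append(f"Answer: {answer}")
    lines.append("Ids:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    labels=["input_type_date", "form_validation"],  # hypothetical ids
    examples=[("Date/time pickers are often finicky", ["input_type_date"])],
    answer="Form inputs of the kind you use to book a flight",
)
print(prompt)
```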

SachaG · 2023-10-01T22:07:54Z

SachaG
Oct 1, 2023
Maintainer Author

[Video attachment: Screen.Recording.2023-10-02.at.07.04.52.AM.mp4]

It works!

0 replies
