OpenAI Data Classification

I spun this up quickly as an example to see how well the OpenAI davinci engine can classify and identify different levels of PII data.

The use case here is multi-faceted. When scanning a file for PII, I'd like more than a boolean view of whether it contains PII: it's more useful to see what type of PII is present and the level of combined PII risk associated with the contents / payload / response. Different classification levels can call for different treatments, so being able to sample, detect, and then apply a treatment or control becomes useful.
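As a rough illustration of mapping classification levels to treatments, here is a minimal sketch. The level names (None, Low, Medium, High), the treatment names, and the rule that the combined risk is simply the highest individual risk are all assumptions made for this example; they are not taken from classify.py.

```python
from enum import IntEnum


class PIIRisk(IntEnum):
    """Hypothetical risk levels; higher values mean more sensitive data."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical treatments keyed by risk level (assumed for illustration).
TREATMENTS = {
    PIIRisk.NONE: "allow",
    PIIRisk.LOW: "log",
    PIIRisk.MEDIUM: "mask",
    PIIRisk.HIGH: "block",
}


def combined_risk(findings):
    """Return the highest risk among the findings, or NONE if there are none.

    findings maps a PII type (e.g. "name") to its PIIRisk level.
    """
    return max(findings.values(), default=PIIRisk.NONE)


def treatment_for(findings):
    """Pick the treatment that matches the combined risk of a payload."""
    return TREATMENTS[combined_risk(findings)]
```

For example, a payload with a low-risk name and a high-risk credit card number would get the "block" treatment, while a payload with no findings would be allowed through. A real policy might weigh combinations of PII types rather than taking the maximum.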

It is interesting to see how AI models think about and consider PII data. It's also interesting to see how well they can classify and identify different types of PII data. I'm sure there are many other use cases for this type of classification.

The original version of this used the classification model but has been updated to use the completion model. The completion model is a bit more flexible and allows for more complex queries and responses. The classification model is more limited in that it only allows for a single classification per query. Part of the analysis is whether the completion model is suited to this type of 'PII detection' task.
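To make the completion-based approach concrete, here is a minimal sketch of how a classification prompt could be built and the reply parsed. The prompt wording, the label set, the helper names, and the "davinci" engine default are assumptions for illustration and may differ from what classify.py does. The API call uses `openai.Completion.create` from the 0.x Python client and is kept in its own function so the other helpers can run without a key.

```python
# Assumed label set for this sketch; not taken from classify.py.
LABELS = ("None", "Low", "Medium", "High")


def build_prompt(text):
    """Build a completion prompt that ends where the label should go."""
    labels = ", ".join(LABELS)
    return (
        f"Classify the PII risk of the text as one of: {labels}.\n\n"
        f"Text: {text}\nRisk:"
    )


def parse_label(completion_text):
    """Map the raw completion text to a known label, or "Unknown"."""
    words = completion_text.strip().split()
    first = words[0].strip(".,:;").capitalize() if words else ""
    return first if first in LABELS else "Unknown"


def classify(text, complete):
    """Classify text using any callable that maps a prompt to completion text."""
    return parse_label(complete(build_prompt(text)))


def openai_complete(prompt, engine="davinci"):
    """Call the completion endpoint (openai 0.x client, key read from the environment)."""
    import openai  # imported lazily so the helpers above work without the client

    response = openai.Completion.create(
        engine=engine,
        prompt=prompt,
        max_tokens=3,
        temperature=0,  # keep the output as repeatable as possible
        stop="\n",
    )
    return response["choices"][0]["text"]
```

With a key configured, `classify("hello", openai_complete)` would send one request. Passing a stub callable instead makes it possible to check the prompt and parsing logic offline.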

It is also interesting to swap out the davinci engine for others such as curie and see how they perform.
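One way to compare engines is to run the same labelled, simulated samples through each and record the share of correct labels. The sketch below assumes a hypothetical `classify(text, engine)` callable that returns a label string; the function name and the sample format are illustrative only.

```python
def compare_engines(samples, engines, classify):
    """Return the fraction of correctly labelled samples per engine.

    samples:  list of (text, expected_label) pairs, simulated data only
    engines:  list of engine names, e.g. ["davinci", "curie"]
    classify: hypothetical callable (text, engine) -> label
    """
    if not samples:
        raise ValueError("at least one sample is required")
    scores = {}
    for engine in engines:
        correct = sum(
            1 for text, expected in samples if classify(text, engine) == expected
        )
        scores[engine] = correct / len(samples)
    return scores
```

Accuracy alone hides the direction of the errors, so counting false positives and false negatives separately would be a natural extension.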

Note: this is for simulated PII use only, not for real-world use.

Instructions

Clone or fork the repo
Insert your super secret OpenAI API key in the .env file
Source the env file with source .env
Run with python classify.py
Check the prediction!

Results

Overall, some PII fares better than others when it comes to classification. Davinci seems to err on the side of caution, which is good for catching PII but also produces quite a few false positives. It does a good job on the obvious None scenarios like "hello", and it seems to understand credit card structures, names, and phone numbers quite well. It stumbles on SSN structures, in particular on telling a valid area/group coding from completely random numbers. Perhaps it could be trained / fine-tuned in a more focused way?
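For reference, the structural SSN rules are simple enough to check deterministically: the area number (first three digits) cannot be 000, 666, or 900-999, the group number (middle two digits) cannot be 00, and the serial number (last four digits) cannot be 0000. The sketch below is not part of classify.py; it is one possible way to cross-check a model's SSN predictions, and it assumes the dashed AAA-GG-SSSS format only.

```python
import re

# Dashed format only (an assumption of this sketch): area-group-serial.
SSN_PATTERN = re.compile(r"(\d{3})-(\d{2})-(\d{4})")


def is_plausible_ssn(candidate):
    """Return True if the string is structurally valid as an SSN.

    This checks structure only; it cannot tell whether a number was issued.
    """
    match = SSN_PATTERN.fullmatch(candidate.strip())
    if not match:
        return False
    area, group, serial = (int(part) for part in match.groups())
    # Area numbers 000, 666, and 900-999 are never assigned.
    if area == 0 or area == 666 or area >= 900:
        return False
    # Group 00 and serial 0000 are never assigned.
    return group != 0 and serial != 0
```

A check like this could flag cases where the model labels a structurally impossible number as an SSN, which is one source of the false positives described above.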

Requirements

Tested with Python 3.9.7 and the openai Python client library version 0.23.0

kris-hansen/openai-data-classification
