The purpose of this workshop is for you to get hands on experience with how we at UNIwise use data for machine learning purposes. You will learn about the types of data we handle, the challenges we face, and ponder how to overcome them.
- Clone the repo with
git clone [email protected]:UNIwise/ml-workshop.git
or download the repo - Check you have python installed by running
python -v
in your favourite terminal.
If you do not have it installed:- Linux:
sudo apt-get install python3
- Windows: Download and install python from https://www.python.org/downloads/
- MacOS: If you have homebrew installed, run
brew install python
.
Otherwise download and install python from https://www.python.org/downloads/. Or install homebrew by following this guide https://mac.install.guide/homebrew/3.html
- Linux:
- Navigate to the root of the project and run
pip install -r requirements.txt
to install the required packages. - Run the example to make sure everything works:
- Run
python example/visualize.py
to visualize the data. - Run
python example/predict.py
to predict & visualize outliers.
- Run
Now you're ready to go!
To begin you should add your code to the predict_outliers
function in predict.py
, but feel free to add other functions or play with the visualization if you wish.
To predict with your implementation run python <difficulty>/predict.py
. E.g python hard/predict.py
to predict outliers on the hard dataste using your model.
To visualize run different variables run python <difficulty>/visualize.py <status_key>
, e.g. python hard/visualize.py FACIAL_RECOGNITION
to visualize the facial recognition variable on your predictions on the hard dataset.
For the datasets with only one variable (easy and example) just run python <difficulty>/visualize.py
.
The challenge of this workshop is detecting students acting suspicious during an exam i.e. outlier detection. In outlier detection we do not have labels or 'ground truths' to guide the training of a model, nor the evaluation of it. It should therefore be noted that outliers are not necessarily cheaters, but their data just looks different compared to their peers. At UNIwise we never call students cheaters, but we provide insights for the institutions to draw their own conclusions.
This repository contains three different datasets in increasing complexity and one used as an example, where we have supplied a simple model for outlier detection. We recommend starting with the easy dataset, but depending on your level of experience and/or fighting spirit, feel free to start with any dataset. Note that while results are cool, we would also love to talk with you about your considerations, potential problems you see with your methods or how you would handle hypothetical issues we might suggest.
In the example dataset we have generated data for 10 students doing a 3-hour exam.
The data is structured in the following manner:
time | user_id | status_key | value |
---|---|---|---|
2023-02-10 12:00:00 | 0 | CHARACTERS_TYPED | 6543 |
2023-02-10 12:05:00 | 0 | CHARACTERS_TYPED | 7534 |
2023-02-10 12:00:00 | 1 | CHARACTERS_TYPED | 3213 |
With columns:
time
: The timestamp of the datapoint.
user_id
: A unique student identifier.
status_key
: Variable the following column belongs to. In Example status_key
is always CHARACTERS_TYPED
, representing the character count up until this time.
value
: The value status_key
had at this time.
This dataset is similar to the example dataset in that it only contains one measurement, CHARACTERS_TYPED
, but this dataset contains data for more students.
The data is structured in the following manner:
time | user_id | status_key | value |
---|---|---|---|
2023-02-10 12:00:00 | 0 | CHARACTERS_TYPED | 6543 |
2023-02-10 12:05:00 | 0 | CHARACTERS_TYPED | 7534 |
2023-02-10 12:00:00 | 1 | CHARACTERS_TYPED | 3213 |
With columns:
time
: The timestamp of the datapoint.
user_id
: A unique student identifier.
status_key
: Variable the following column belongs to. In Easy status_key
is always CHARACTERS_TYPED
, representing the character count up until this time.
value
: The value status_key
had at this time.
For this dataset there are multiple possible values for status_key
. Now students can be considered outliers just according to one variable or a combination of multiple.
The data is structured in the following manner:
time | user_id | status_key | value |
---|---|---|---|
2023-02-10 12:00:00 | 0 | CHARACTERS_TYPED | 6543 |
2023-02-10 12:05:00 | 0 | CHARACTERS_TYPED | 7534 |
2023-02-10 12:00:00 | 0 | FACIAL_RECOGNITION | 95 |
2023-02-10 12:05:00 | 0 | FACIAL_RECOGNITION | 93 |
2023-02-10 12:00:00 | 0 | VOICE_DETECTION | 4 |
2023-02-10 12:05:00 | 0 | VOICE_DETECTION | 2 |
2023-02-10 12:00:00 | 1 | CHARACTERS_TYPED | 3213 |
With columns:
time
: The timestamp of the datapoint.
user_id
: A unique student identifier.
status_key
: Variable the following column belongs to. CHARACTERS_TYPED
represents the character count up until this time, FACIAL_RECOGNITION
represents a percentage match to a reference image of the student using facial recognition and VOICE_DETECTION
represents a count of sentences of spoken words since last observation.
value
: The value status_key
had at this time.
Here you will have to be creative and use the new features to help you figure out, if more people should be considered outliers based on the extra variables.
This dataset contains the same variables as the medium dataset, but this time, students from multiple exams are included and the exams were held at different times.
This means that comparing variables across exams is going to pose some problems. What is considered an outlier CHARACTERS_TYPED
value in one exam at a given time, may be different than the value at an equivalent time in a different exam.
time | user_id | exam_id | status_key | value |
---|---|---|---|---|
2023-02-10 12:00:00 | 0 | 0 | CHARACTERS_TYPED | 6543 |
2023-02-10 12:05:00 | 0 | 0 | CHARACTERS_TYPED | 7534 |
2023-02-10 12:00:00 | 0 | 0 | FACIAL_RECOGNITION | 95 |
2023-02-10 12:05:00 | 0 | 0 | FACIAL_RECOGNITION | 93 |
2023-02-10 12:00:00 | 0 | 0 | VOICE_DETECTION | 4 |
2023-02-10 12:05:00 | 0 | 0 | VOICE_DETECTION | 2 |
2023-02-10 12:00:00 | 1 | 0 | CHARACTERS_TYPED | 3213 |
2023-02-10 12:00:00 | 0 | 1 | CHARACTERS_TYPED | 3654 |
With columns:
time
: The timestamp of the datapoint.
user_id
: A unique student identifier for a given exam.
exam_id
: A unique exam identifier.
status_key
: Variable the following column belongs to. CHARACTERS_TYPED
represents the character count up until this time, FACIAL_RECOGNITION
represents a percentage match to a reference image of the student using facial recognition and VOICE_DETECTION
represents a count of sentences of spoken words since last observation.
value
: The value status_key
had at this time.
We do not expect you to solve these questions, but they are important things to consider if one were to deploy a model for finding outlier students in real exams.
- How to handle student starting late? Or ending early?
- What if some students were given extra time?
- What if not all exams have the same
status_key
s available? - How would you explain to an invigilator why a student is considered an outlier?