LLM-Detect-AI-Generated-Text

ARCHIVE CONTENTS

prepare_data.py : combine data from different sources
tokenizer_data.py : transform the data with TF-IDF
model_ensemble.py : create ensemble model
train_predict.py : code to rebuild models from scratch and generate predictions
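
As a rough illustration of what a TF-IDF transform with a `min_df` cutoff does, the sketch below computes the weights in plain Python. It is not the code in `tokenizer_data.py`: the whitespace tokenizer, the helper name `tfidf`, and the textbook formula tf * ln(N / df) are all assumptions made for illustration, and libraries such as scikit-learn use a smoothed, normalized variant.

```python
import math
from collections import Counter

def tfidf(docs, min_df=1):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep only terms that occur in at least `min_df` documents.
    vocab = {term for term, count in df.items() if count >= min_df}
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        total = sum(counts.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog sat", "the bird flew"]
vectors = tfidf(docs, min_df=2)
print(vectors[0])  # only "the" and "sat" survive the min_df=2 cutoff
```

With `min_df=2`, terms that occur in a single document (such as "cat") are dropped from the vocabulary, which is why a very small debugging test set needs `min_df=1` to keep any terms at all.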

HARDWARE: (The following specs were used to create the original solution)

Ubuntu 20.04.6 LTS
CPU, 30 GB RAM

SOFTWARE (python packages are detailed separately in `requirements.txt`):

Python 3.10.13

DATA SETUP (assumes the Kaggle API is installed)

Below are the shell commands used in each step, as run from the top-level directory:

mkdir -p data/
cd data/
kaggle competitions download -c llm-detect-ai-generated-text
unzip llm-detect-ai-generated-text.zip
kaggle datasets download -d thedrcat/daigt-v2-train-dataset
unzip daigt-v2-train-dataset.zip
kaggle datasets download -d alejopaullier/argugpt
unzip argugpt.zip
kaggle datasets download -d kagglemini/train-00000-of-00001-f9daec1515e5c4b9
unzip train-00000-of-00001-f9daec1515e5c4b9.zip
kaggle datasets download -d pbwic036/commonlit-data
unzip commonlit-data.zip
kaggle datasets download -d wcqyfly/argu-train
unzip argu-train.zip
cd ..

Train and Predict

If test.csv contains fewer than 5 rows, min_df is set to 1 and the model is not trained; this mode is only for debugging. When test.csv contains more than 5 rows, min_df is set to 2, and the model is trained and generates predictions.

python train_predict.py
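
The row-count switch described above can be sketched as follows. This is only an illustration, not the logic in `train_predict.py`: the function name `choose_run_mode` and the returned dictionary are hypothetical, and treating exactly 5 rows as a full run is an assumption, since the description does not cover that case.

```python
def choose_run_mode(n_test_rows, debug_threshold=5):
    """Pick the TF-IDF `min_df` and whether to train, based on test-set size."""
    if n_test_rows < debug_threshold:
        # Tiny test set: debugging run, keep every term and skip training.
        return {"min_df": 1, "train": False}
    # Larger test set: drop terms seen in fewer than 2 documents and train.
    # Assumption: a test set of exactly `debug_threshold` rows lands here too.
    return {"min_df": 2, "train": True}

print(choose_run_mode(3))      # debugging configuration
print(choose_run_mode(10000))  # full training configuration
```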

Alternatively, fork the following notebook and run it to get a submission:
https://www.kaggle.com/code/wcqyfly/fork-of-fork-of-fork-of-llm-daigt-analyse-e-db6333

Note that because the test set contains fewer than 3 samples, running the whole notebook directly raises an error. After submission, when the test set is replaced with the hidden test set, the code runs correctly and produces the result.

The following notebook is used to combine data from different sources:

https://www.kaggle.com/code/wcqyfly/notebook95c85fa3c6
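
For readers who want a feel for what combining sources involves, here is a minimal sketch using only the standard library. It is not the notebook's code: the column names `text` and `label`, the helper `combine_sources`, and the exact-duplicate filter are assumptions for illustration, and the real datasets may use different columns and cleaning rules.

```python
import csv
import io

def combine_sources(sources):
    """Merge rows from several CSV sources, dropping duplicate texts."""
    seen = set()
    combined = []
    for name, handle in sources:
        for row in csv.DictReader(handle):
            text = row["text"].strip()
            if not text or text in seen:
                continue  # skip empty essays and exact duplicates
            seen.add(text)
            combined.append({"text": text, "label": int(row["label"]), "source": name})
    return combined

# Two tiny in-memory stand-ins for downloaded CSV files (hypothetical data).
source_a = io.StringIO("text,label\nAn essay written by a student.,0\nA generated essay.,1\n")
source_b = io.StringIO("text,label\nA generated essay.,1\nAnother generated essay.,1\n")

rows = combine_sources([("source_a", source_a), ("source_b", source_b)])
print(len(rows))  # the essay shared by both sources is kept only once
```

Recording the source name on each row makes it easier to check later how much each dataset contributes to the combined training set.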

