NLP-ADBench is a comprehensive benchmarking tool designed for Anomaly Detection in Natural Language Processing (NLP). It not only establishes a benchmark but also introduces the NLPAD datasets—8 curated and transformed datasets derived from existing NLP classification datasets. These datasets are specifically tailored for NLP anomaly detection tasks and presented in a unified standard format to support and advance research in this domain.
To ensure a robust evaluation, NLP-ADBench includes results from 19 algorithms applied to the 8 NLPAD datasets, categorized into two groups:
- 3 end-to-end algorithms that directly process raw text data to produce anomaly detection outcomes.
- 16 embedding-based algorithms, created by applying 8 traditional anomaly detection methods to text embeddings generated by two models (a sketch of this two-stage pipeline follows this list):
  - BERT's `bert-base-uncased` (BERT)
  - OpenAI's `text-embedding-3-large` (OpenAI)
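The embedding-based pipeline is easy to reproduce end to end. The following is a minimal sketch, not the benchmark's exact code: it embeds a few illustrative texts with `bert-base-uncased` using mean pooling, then scores the vectors with PyOD's Isolation Forest, standing in for any of the 8 classical detectors.

```python
# Minimal sketch of the embedding-based approach: embed raw text with BERT,
# then score the vectors with a classical anomaly detector (PyOD's IForest here).
# Illustrative only; not the benchmark's exact implementation.
import torch
from transformers import AutoModel, AutoTokenizer
from pyod.models.iforest import IForest

texts = ["a perfectly ordinary sentence",
         "another unremarkable sentence",
         "zx!! qqq ??? gibberish outlier"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    embeddings = (hidden * mask).sum(1) / mask.sum(1)  # mean pooling

detector = IForest()  # any of the 8 classical detectors could fill this slot
detector.fit(embeddings.numpy())
scores = detector.decision_function(embeddings.numpy())  # higher = more anomalous
print(scores)
```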
The datasets required for this project can be downloaded from the following Hugging Face links:
- NLPAD Datasets: the datasets introduced in the NLP-ADBench paper.
- Pre-Extracted Embeddings: for the embedding-based algorithms, we have already extracted the embeddings, so you can download and use them directly.
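If you prefer to script the download, something like the sketch below works with the `huggingface_hub` client; the repo ID is a placeholder, so substitute the actual dataset repository from the links above.

```python
# Hypothetical download sketch using the huggingface_hub client.
# "ORG/NLPAD" is a placeholder repo ID -- replace it with the actual
# repository from the links above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="ORG/NLPAD", repo_type="dataset")
print(f"Datasets downloaded to: {local_path}")
```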
If you find this work useful, please cite our paper:
Paper Link: https://arxiv.org/abs/2412.04784
```bibtex
@article{li2024nlp,
  title={NLP-ADBench: NLP Anomaly Detection Benchmark},
  author={Li, Yuangang and Li, Jiaqi and Xiao, Zhuo and Yang, Tiankai and Nian, Yi and Hu, Xiyang and Zhao, Yue},
  journal={arXiv preprint arXiv:2412.04784},
  year={2024}
}
```
Follow these steps to set up the development environment using the provided Conda environment file:
1. Install Anaconda or Miniconda: Download and install Anaconda or Miniconda from the official website.
2. Create the Environment: Using the terminal, navigate to the directory containing the `environment.yml` file and run:

   ```bash
   conda env create -f environment.yml
   ```

3. Activate the Environment: Activate the newly created environment using:

   ```bash
   conda activate nlpad
   ```
Download the pre-extracted embeddings from the Hugging Face link above and place them in the `feature` folder inside the `./benchmark` directory of this project.
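A quick sanity check that the files landed where the benchmark scripts expect them (the path comes from the step above; filenames vary by dataset):

```python
# Sanity check: list the embedding files under ./benchmark/feature
from pathlib import Path

feature_dir = Path("./benchmark/feature")
for path in sorted(feature_dir.iterdir()):
    print(path.name)
```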
Run the following commands from the `./benchmark` directory of the project:

- To run a benchmark on data embedded with BERT's `bert-base-uncased` model:

  ```bash
  python [algorithm_name]_benchmark.py bert
  ```

- To run a benchmark on data embedded with OpenAI's `text-embedding-3-large` model:

  ```bash
  python [algorithm_name]_benchmark.py gpt
  ```
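To sweep every algorithm for one embedding type, a small driver script can iterate over the benchmark scripts. This is a convenience sketch, not part of the repository; it assumes the scripts all follow the `[algorithm_name]_benchmark.py` naming convention shown above and is run from the `./benchmark` directory.

```python
# Convenience sketch: run every *_benchmark.py in the current directory
# on one embedding type. Assumes the naming convention above.
import subprocess
from pathlib import Path

embedding = "bert"  # or "gpt" for the OpenAI embeddings
for script in sorted(Path(".").glob("*_benchmark.py")):
    print(f"Running {script.name} with {embedding} embeddings...")
    subprocess.run(["python", script.name, embedding], check=True)
```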