NLP-ADBench is a comprehensive benchmarking tool designed for Anomaly Detection in Natural Language Processing (NLP). It not only establishes a benchmark but also introduces the NLPAD datasets—8 curated and transformed datasets derived from existing NLP classification datasets. These datasets are specifically tailored for NLP anomaly detection tasks and presented in a unified standard format to support and advance research in this domain.
To ensure a robust evaluation, NLP-ADBench includes results from 19 algorithms applied to the 8 NLPAD datasets, categorized into two groups:
- 3 end-to-end algorithms that directly process raw text data to produce anomaly detection outcomes.
- 16 embedding-based algorithms, created by applying 8 traditional anomaly detection methods to text embeddings generated by two models (a sketch of this two-stage pipeline follows this list):
  - BERT's `bert-base-uncased` (BERT)
  - OpenAI's `text-embedding-3-large` (OpenAI)
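The embedding-based pipeline is easy to reproduce end to end. The following is a minimal sketch, not the benchmark's exact code: it embeds a few illustrative texts with `bert-base-uncased` using mean pooling, then scores the vectors with PyOD's Isolation Forest, standing in for any of the 8 classical detectors.

```python
# Minimal sketch of the embedding-based approach: embed raw text with BERT,
# then score the vectors with a classical anomaly detector (PyOD's IForest here).
# Illustrative only; not the benchmark's exact implementation.
import torch
from transformers import AutoModel, AutoTokenizer
from pyod.models.iforest import IForest

texts = ["a perfectly ordinary sentence",
         "another unremarkable sentence",
         "zx!! qqq ??? gibberish outlier"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    embeddings = (hidden * mask).sum(1) / mask.sum(1)  # mean pooling

detector = IForest()  # any of the 8 classical detectors could fill this slot
detector.fit(embeddings.numpy())
scores = detector.decision_function(embeddings.numpy())  # higher = more anomalous
print(scores)
```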
The datasets required for this project can be downloaded from the following Hugging Face links:
- NLPAD Datasets: the datasets introduced in the NLP-ADBench paper.
- Pre-Extracted Embeddings: for the embedding-based algorithms, we have already extracted the embeddings, so you can download and use them directly.
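If you prefer to script the download, something like the sketch below works with the `huggingface_hub` client; the repo ID is a placeholder, so substitute the actual dataset repository from the links above.

```python
# Hypothetical download sketch using the huggingface_hub client.
# "ORG/NLPAD" is a placeholder repo ID -- replace it with the actual
# repository from the links above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="ORG/NLPAD", repo_type="dataset")
print(f"Datasets downloaded to: {local_path}")
```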
If you find this work useful, please cite our paper:
Paper Link: https://arxiv.org/abs/2412.04784
```bibtex
@article{li2024nlp,
  title={NLP-ADBench: NLP Anomaly Detection Benchmark},
  author={Li, Yuangang and Li, Jiaqi and Xiao, Zhuo and Yang, Tiankai and Nian, Yi and Hu, Xiyang and Zhao, Yue},
  journal={arXiv preprint arXiv:2412.04784},
  year={2024}
}
```
Follow these steps to set up the development environment using the provided Conda environment file:
1. Install Anaconda or Miniconda: Download and install Anaconda or Miniconda from the official website.
2. Create the Environment: Using the terminal, navigate to the directory containing the `environment.yml` file and run:

   ```bash
   conda env create -f environment.yml
   ```

3. Activate the Environment: Activate the newly created environment using:

   ```bash
   conda activate nlpad
   ```
Download the pre-extracted embeddings from the Hugging Face link above and place them in the `feature` folder inside the `./benchmark` directory of this project.
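A quick sanity check that the files landed where the benchmark scripts expect them (the path comes from the step above; filenames vary by dataset):

```python
# Sanity check: list the embedding files under ./benchmark/feature
from pathlib import Path

feature_dir = Path("./benchmark/feature")
for path in sorted(feature_dir.iterdir()):
    print(path.name)
```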
Run the following commands from the `./benchmark` directory of the project:

- To run a benchmark on data embedded with BERT's `bert-base-uncased` model:

  ```bash
  python [algorithm_name]_benchmark.py bert
  ```

- To run a benchmark on data embedded with OpenAI's `text-embedding-3-large` model:

  ```bash
  python [algorithm_name]_benchmark.py gpt
  ```
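To sweep every algorithm for one embedding type, a small driver script can iterate over the benchmark scripts. This is a convenience sketch, not part of the repository; it assumes the scripts all follow the `[algorithm_name]_benchmark.py` naming convention shown above and is run from the `./benchmark` directory.

```python
# Convenience sketch: run every *_benchmark.py in the current directory
# on one embedding type. Assumes the naming convention above.
import subprocess
from pathlib import Path

embedding = "bert"  # or "gpt" for the OpenAI embeddings
for script in sorted(Path(".").glob("*_benchmark.py")):
    print(f"Running {script.name} with {embedding} embeddings...")
    subprocess.run(["python", script.name, embedding], check=True)
```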