DTECT (Dynamic Topic Explorer & Context Tracker) is an end-to-end, open-source system designed to streamline the entire process of dynamic topic modeling.
- Interactive Demo: Try DTECT live on Hugging Face Spaces!
- Demo Video: Watch a walkthrough of DTECT's features.
Here is an example of how to use the DTECT preprocessing pipeline for a custom dataset:
```python
import os

from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

from backend.datasets.preprocess import Preprocessor

# Directory containing the raw docs.jsonl file
dataset_dir = '../data/Sample_data/'
stop_words = stopwords.words('english')

preprocessor = Preprocessor(
    docs_jsonl_path=os.path.join(dataset_dir, 'docs.jsonl'),
    output_folder=os.path.join(dataset_dir, 'processed'),
    use_partition=False,    # no train/test partition
    min_count_bigram=5,     # minimum frequency for bigram detection
    threshold_bigram=20,    # scoring threshold for bigram merging
    remove_punctuation=True,
    lemmatize=True,
    stopword_list=stop_words,
    min_chars=3,            # drop tokens shorter than 3 characters
    min_words_docs=3,       # drop documents with fewer than 3 tokens
)

preprocessor.preprocess()
```
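The pipeline above reads a `docs.jsonl` file from the dataset directory. As a minimal sketch of what that input might look like (the `text` and `timestamp` field names are assumptions for illustration, not DTECT's documented schema; consult the repository docs for the authoritative format), such a file could be generated like this:

```python
import json

# Minimal sketch of a docs.jsonl input file. The "text" and "timestamp"
# field names are assumptions for illustration, not the documented schema.
sample_docs = [
    {"text": "Topic models uncover latent themes in large corpora.", "timestamp": 2020},
    {"text": "Dynamic topic models track how those themes shift over time.", "timestamp": 2021},
]

with open("../data/Sample_data/docs.jsonl", "w", encoding="utf-8") as f:
    for doc in sample_docs:
        f.write(json.dumps(doc) + "\n")
```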
The following snippet shows the training and evaluation pipeline using the CFDTM model:
```python
import torch

from backend.datasets import dynamic_dataset
from backend.models.CFDTM.CFDTM import CFDTM
from backend.models.dynamic_trainer import DynamicTrainer
from backend.evaluation.eval import TopicQualityAssessor

# Load the preprocessed dataset
data = dynamic_dataset.DynamicDataset('../data/Sample_data/processed')

# Fall back to CPU if no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize the model
model = CFDTM(
    vocab_size=data.vocab_size,
    num_times=data.num_times,
    num_topics=20,
    pretrained_WE=data.pretrained_WE,              # pretrained word embeddings
    train_time_wordfreq=data.train_time_wordfreq,  # per-timestamp word frequencies
).to(device)

# Train the model; returns the top words per topic for each timestamp
trainer = DynamicTrainer(model, data)
top_words, _ = trainer.train()

# Reshape outputs: timestamps -> topics -> lists of words
top_words_list = [[topic.split() for topic in timestamp] for timestamp in top_words]
train_corpus = [doc.split() for doc in data.train_texts]

# Assess dynamic topic quality (NPMI coherence over the training corpus)
assessor = TopicQualityAssessor(
    topics=top_words_list,
    train_texts=train_corpus,
    topn=10,
    coherence_type='c_npmi',
)
summary = assessor.get_dtq_summary()
```
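Once training and evaluation finish, the results can be inspected directly from the variables defined above: `top_words_list` is nested as timestamps → topics → words, and `summary` holds whatever structure `get_dtq_summary()` returns, so printing it is format-agnostic. A minimal sketch:

```python
# Show the first three topics at each timestamp (top 10 words each).
for t, topics in enumerate(top_words_list):
    print(f"Timestamp {t}:")
    for k, topic in enumerate(topics[:3]):
        print(f"  Topic {k}: {' '.join(topic[:10])}")

# Dynamic topic quality summary returned by the assessor.
print(summary)
```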
```
├── app
│   └── ui.py
├── assets
├── backend
│   ├── datasets
│   ├── evaluation
│   ├── inference
│   ├── llm
│   ├── llm_utils
│   └── models
├── data
│   └── Sample_data
├── environment.yml
├── LICENSE
├── main.py
└── requirements.txt
```
We list below the datasets, codebases, and evaluation resources referenced or integrated into DTECT:
- Evaluating Dynamic Topic Models: https://github.com/CharuJames/Evaluating-Dynamic-Topic-Models
We would like to acknowledge the following open-source projects that were instrumental in the development of DTECT:
- 🔍 TopMost Toolkit: https://github.com/bobxwu/TopMost
  📌 Reference: Xiaobao Wu, Fengjun Pan, and Anh Tuan Luu. 2024. Towards the TopMost: A Topic Modeling System Toolkit. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 31–41, Bangkok, Thailand. Association for Computational Linguistics.
- 📦 OCTIS: https://github.com/MIND-Lab/OCTIS
  📌 Reference: Silvia Terragni, Elisabetta Fersini, Bruno Giovanni Galuzzi, Pietro Tropeano, and Antonio Candelieri. 2021. OCTIS: Comparing and Optimizing Topic Models is Simple!. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 263–270, Online. Association for Computational Linguistics.
If you use DTECT in your work, please cite:

```bibtex
@misc{adhya2025dtect,
      title={{DTECT}: Dynamic Topic Explorer \& Context Tracker},
      author={Suman Adhya and Debarshi Kumar Sanyal},
      year={2025},
      eprint={2507.07910},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07910},
}
```