This is DOPA METER, a tool suite that subsumes a wide range of established text-analysis metrics under one roof. It is based on Python 3 and spaCy. (It runs best under Linux, preferably Ubuntu, and is only partly supported under Microsoft Windows.)
The system has a modular architecture and follows a multilingual approach. It is designed in a decentralized manner: features from various sources can be computed in different places, and the partial results are merged afterwards.
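The decentralized design can be pictured as computing feature scores separately per source and merging the partial tables afterwards. The sketch below only illustrates that idea in plain Python; the helper `merge_partial_results`, the document names, and the feature names are hypothetical and not part of DOPA METER's API.

```python
from collections import defaultdict


def merge_partial_results(*partials):
    """Merge per-document feature dicts that were computed in different places.

    Each partial result maps a document id to a dict of feature scores.
    Hypothetical helper for illustration; not DOPA METER's actual API.
    """
    merged = defaultdict(dict)
    for partial in partials:
        for doc_id, features in partial.items():
            # Each partial result contributes additional feature columns
            # to the row of the same document.
            merged[doc_id].update(features)
    return dict(merged)


# Assumed scenario: surface features computed in one place, token counts in another.
surface_part = {"doc_1.txt": {"avg_token_len": 4.8}, "doc_2.txt": {"avg_token_len": 5.1}}
token_part = {"doc_1.txt": {"n_tokens": 120}, "doc_2.txt": {"n_tokens": 98}}

table = merge_partial_results(surface_part, token_part)
print(table["doc_1.txt"])
```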
Three components form the basis:
- Text corpora as the input, which can be grouped into collections, as well as a preprocessing pipeline,
- Feature Hub: a set of features that compute counts and metrics of text corpora, and
- A three-part analytics section:
  - Summarization: simple reports for whole corpora and single documents,
  - Comparison: simple comparisons (e.g., vocabulary, $n$-grams) via intersections and differences,
  - Aggregation: clustering by k-means and t-SNE with DBSCAN
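The comparison mode can be pictured with plain set operations. The following is a minimal sketch, not DOPA METER's implementation: the helpers `ngrams` and `compare` are hypothetical names, the two token lists are made-up examples, and whitespace splitting stands in for spaCy-based tokenization.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def compare(tokens_a, tokens_b, n=1):
    """Compare two token lists via set intersection and set differences."""
    a, b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Toy corpora; a real run would use tokens produced by the preprocessing pipeline.
corpus_a = "the patient was admitted to the clinic".split()
corpus_b = "the patient left the clinic".split()

result = compare(corpus_a, corpus_b, n=2)
print(sorted(result["shared"]))  # [('the', 'clinic'), ('the', 'patient')]
```

With `n=1` the same function compares vocabularies instead of bigrams.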
- Installation
  - Install Python 3
  - Install spaCy language modules and other external resources via
    python install_languages.py lang_install.json
  - This works for German and English and for all spaCy-compatible languages or language modules.
  - Warnings:
    - Constituency metrics use the Berkeley Neural Parser; check whether your device is CUDA-compatible.
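One way to check CUDA compatibility before enabling constituency metrics is sketched below. It assumes the parser runs on PyTorch, as recent releases of the Berkeley Neural Parser do; the helper name `cuda_available` is hypothetical and not part of DOPA METER.

```python
def cuda_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch  # assumption: the parser backend is PyTorch
    except ImportError:
        # Without PyTorch no GPU support can be detected.
        return False
    return torch.cuda.is_available()


print("CUDA available:", cuda_available())
```

If this prints `False`, parsing falls back to the CPU, which is typically much slower on large corpora.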
- Starting DOPA METER
  - Configure your text corpora: each corpus is a directory containing individual text files.
  - Configure your config.json
    - Example configuration files
    - Very simple example:
{
    "corpora": {
        "name_corpus": {
            "path_text_data": "/path/of/your/corpus/files/",
            "language": "de",
            "collection": "one"
        },
        "name_other_corpus": {
            "path_text_data": "/path/of/your/corpus/files/",
            "language": "de",
            "collection": "two"
        },
        "name_one_more_corpus": {
            "path_text_data": "/path/of/your/corpus/files/",
            "language": "de",
            "collection": "two"
        }
    },
    "settings": {
        "tasks": ["features", "counts", "corpus_characteristics"],
        "store_sources": false,
        "file_format_features": ["csv"],
        "file_format_dicts": "txt"
    },
    "output": {
        "path_features": "/define/a/path/of/your/features",
        "path_summary": "/define/a/path/of/your/summary",
        "path_counts": "/define/a/path/of/your/counts"
    },
    "features": {
        "token_characteristics": "default",
        "surface": "default"
    }
}
  - Open a terminal, change to the root directory of DOPA METER, and type
    python main.py config.json
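For many corpora it can be convenient to generate config.json with a script instead of editing it by hand. The sketch below reuses only the keys and values from the simple example configuration; the helpers `build_config` and `missing_corpus_dirs` are hypothetical, and placing the three output paths under one common directory is an assumption made for brevity.

```python
import json
from pathlib import Path


def build_config(corpora, output_dir):
    """Assemble a configuration dict with the keys of the simple example.

    `corpora` maps a corpus name to a (directory, language, collection) tuple.
    """
    out = Path(output_dir)
    return {
        "corpora": {
            name: {"path_text_data": str(path), "language": lang, "collection": coll}
            for name, (path, lang, coll) in corpora.items()
        },
        "settings": {
            "tasks": ["features", "counts", "corpus_characteristics"],
            "store_sources": False,
            "file_format_features": ["csv"],
            "file_format_dicts": "txt",
        },
        "output": {
            # Assumption: all outputs live below one common directory.
            "path_features": str(out / "features"),
            "path_summary": str(out / "summary"),
            "path_counts": str(out / "counts"),
        },
        "features": {"token_characteristics": "default", "surface": "default"},
    }


def missing_corpus_dirs(config):
    """Return the names of corpora whose text directory does not exist."""
    return [
        name
        for name, corpus in config["corpora"].items()
        if not Path(corpus["path_text_data"]).is_dir()
    ]


config = build_config(
    {"name_corpus": ("/path/of/your/corpus/files/", "de", "one")},
    output_dir="/define/a/path/of/your",
)
print("Missing corpus directories:", missing_corpus_dirs(config))

# Write the file that is then passed to main.py.
Path("config.json").write_text(json.dumps(config, indent=2), encoding="utf-8")
```

Checking the corpus directories before the run catches placeholder paths early, before any language model is loaded.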
- Installation
- Input and Data Preparation
- Functionality and Definition of Tasks
- Feature Hub
- Analytics
- Configuration and Run
DOPA METER was presented at EMNLP 2023 (System Demonstrations).
Please use the following citation:
@inproceedings{lohr-hahn-2023-dopa,
title = "{DOPA} {METER} {--} A Tool Suite for Metrical Document Profiling and Aggregation",
author = "Lohr, Christina and Hahn, Udo",
editor = "Feng, Yansong and Lefever, Els",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-demo.18",
pages = "218--228",
abstract = "We present DOPA METER, a tool suite for the metrical investigation of written language, that provides diagnostic means for its division into discourse categories, such as registers, genres, and style. The quantitative basis of our system are 120 metrics covering a wide range of lexical, syntactic, and semantic features relevant for language profiling. The scores can be summarized, compared, and aggregated using visualization tools that can be tailored according to the users{'} needs. We also showcase an application scenario for DOPA METER.",
}
DOPA METER is provided as open source under the MIT License.
This work was supported by the Friedrich Schiller University Jena (JULIE Lab and FUSION group) and Leipzig University (IMISE), as well as by the BMBF within the projects SMITH (grants 01ZZ1803G and 01ZZ1803A) and GeMTeX as part of the Medical Informatics Initiative Germany.