A simple workflow for processing text. It applies a series of sequential transformations to a set of raw, unprocessed documents.
The order of transformations, as of 25/05/2023:
- Preprocess: expand contractions (e.g. "can't" becomes "cannot") and fix encoding issues
- Obtain a document-term matrix with user-specified n-grams
- Embed text with Sentence-Transformers
- Reduce embedding dimensions with UMAP
- Cluster the reduced-dimension embeddings with HDBSCAN
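
The steps above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual implementation. The function names, the tiny `CONTRACTIONS` map, the model name `all-MiniLM-L6-v2`, and the UMAP/HDBSCAN parameter values are all assumptions chosen for the example. Encoding repair is approximated here with standard-library Unicode normalization, which is a stand-in for whatever the real workflow uses. The last function imports `sentence-transformers`, `umap-learn`, and `hdbscan` lazily, so the first two steps run without them.

```python
import re
import unicodedata
from collections import Counter

# Tiny illustrative contraction map (an assumption; a real workflow needs a fuller list).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "don't": "do not"}
_CONTRACTION_RE = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)


def preprocess(text):
    """Normalize Unicode, straighten curly apostrophes, and expand contractions."""
    text = unicodedata.normalize("NFKC", text).replace("\u2019", "'")
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def document_term_matrix(docs, n=1):
    """Count word n-grams per document; returns (vocabulary, rows of counts)."""
    counts = []
    for doc in docs:
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        counts.append(Counter(grams))
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for terms a document does not contain.
    return vocab, [[c[term] for term in vocab] for c in counts]


def embed_reduce_cluster(docs, model_name="all-MiniLM-L6-v2"):
    """Embed, reduce, and cluster; returns one cluster label per document."""
    # Imported lazily: these third-party packages are only needed for this step.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    embeddings = SentenceTransformer(model_name).encode(docs)
    # n_components and min_cluster_size are illustrative values, not tuned ones.
    reduced = umap.UMAP(n_components=5, random_state=42).fit_transform(embeddings)
    return hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)
```

A typical call would clean the documents first, for example `docs = [preprocess(d) for d in raw_docs]`, then pass `docs` to `document_term_matrix(docs, n=2)` and to `embed_reduce_cluster(docs)`. UMAP and HDBSCAN both need a reasonable number of documents to produce meaningful output, so the last step is not suited to a handful of sentences.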