Survey answers in Spanish are converted to vectors using word embeddings and then categorized using k-means clustering.
The project also handles multi-word expressions by removing stopwords and averaging the vectors of the remaining words.
The example targets Spanish, but it can easily be adapted to other languages.
The number of clusters is chosen by locating the point of maximum curvature in the clustering-error curve (the elbow method).
This project was created for people looking for a way to group or categorize words, or even multi-word expressions, by their meaning. Many tools and services can compute statistics or draw diagrams from data, but they mostly work with numbers. When it comes to words or texts, those tools are less useful, since they offer no way to visualize them in 2D/3D space based on their usage or meaning. This repository helps perform the following operations on texts:
- Visualizing texts:
  - Visualizing single words using the word-embedding vectors of a language;
  - Visualizing multi-word texts by averaging the vectors of the words they contain (stopwords are removed for better output quality);
- Finding the optimal number of groups/clusters/categories to split words/texts into based on their meaning, using the Within-Cluster Sum of Squares (WCSS) to find where the curve levels off (the elbow method);
- Grouping/clustering texts using the k-means clustering algorithm;
- Visualizing the grouped texts in different colors, using Matplotlib.
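The averaging of word vectors for multi-word texts can be sketched as below. This is a minimal illustration, not the repository's code: the function name `text_to_vector`, the toy 3-dimensional vectors, and the tiny stopword set are all invented for the example, and tokenization is assumed to be a simple whitespace split.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only.
TOY_VECTORS = {
    "casa": np.array([1.0, 0.0, 0.0]),
    "grande": np.array([0.0, 1.0, 0.0]),
    "perro": np.array([0.0, 0.0, 1.0]),
}
# Tiny hypothetical stopword set; the project reads its list from a file.
TOY_STOPWORDS = {"la", "el", "es", "muy"}


def text_to_vector(text, vectors, stopwords):
    """Average the vectors of the non-stopword tokens that have an embedding.

    Returns None when no token of the text has a vector.
    """
    tokens = [t for t in text.lower().split() if t not in stopwords]
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)


# Only "casa" and "grande" survive stopword removal and contribute to the mean.
vec = text_to_vector("La casa es muy grande", TOY_VECTORS, TOY_STOPWORDS)
print(vec)
```

Returning `None` for texts with no known words is one possible design choice; it lets the caller decide whether to skip such answers or report them.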
Programming language used:
These are the major Python libraries used:
- scikit-learn: A set of Python modules for machine learning
- gensim: Python framework for fast Vector Space Modelling
- Matplotlib: Visualization with Python
- kneed: Knee-point detection in Python
- NumPy: The fundamental package for scientific computing with Python
- First, the code loads the list of words/texts from the given file, input/answers.txt (called answers here because the texts were answers to a particular survey), and obtains a vector for each text. An example diagram would look like this:
- Then, the code finds the optimal number of clusters to split the given texts into, using the elbow method. For our example it would look like this:
- Lastly, the code categorizes the list of texts into groups by their meaning. The final result would look like this:
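The elbow and clustering steps above can be sketched roughly as follows. This is a hedged illustration rather than the repository's implementation: the data are synthetic 2-D points standing in for text vectors, the helper names `wcss_curve` and `elbow_point` are made up, and the elbow is picked with a simple geometric heuristic (the point farthest from the line joining the ends of the WCSS curve) used here as a stand-in for the kneed library listed above.

```python
import numpy as np
from sklearn.cluster import KMeans


def wcss_curve(X, k_values, seed=0):
    """Fit k-means once per k and collect the WCSS (scikit-learn's inertia_)."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


def elbow_point(k_values, wcss):
    """Return the k farthest from the straight line joining the curve's ends.

    A simple geometric stand-in for a knee detector such as kneed.
    """
    x = np.asarray(k_values, dtype=float)
    y = np.asarray(wcss, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance.
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    distances = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return k_values[int(np.argmax(distances))]


# Synthetic stand-in for text vectors: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

k_values = list(range(1, 9))
wcss = wcss_curve(X, k_values)
best_k = elbow_point(k_values, wcss)

# Final clustering with the chosen number of clusters.
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print(best_k, np.bincount(labels))
```

With real text vectors the WCSS curve is usually less sharp than with these blobs, so the chosen k is worth checking against a plot of the curve.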
To use this code, you need a basic understanding of how to run Python code and a machine with Python installed. You should also install the libraries mentioned above. There are two ways to run the code:
- Either clone the repo by running the command below, and then run survey-clustering.py:
  git clone https://github.com/elmurod1202/survey-clustering.git
- Or download only the survey-clustering.py file (or survey-clustering-minimum.py if you want minimal working code without graphical visualisations) and make some small changes, such as where to read the files from and where to store the results. That's it.
IMPORTANT: This code uses a Spanish word-embedding vector file that is not included here due to its large size. Please download the file into the src/ folder from the link: Spanish word vectors (3.4 GB)
This code is intended for Spanish, but it can be adapted to many other languages just by changing two files in the src/ folder:
- src/embeddings-l-model.vec : replace the Spanish word vectors file with a word-vector file for the target language;
- src/spanish-stopwords.txt : replace the Spanish stopwords file with a stopwords file for the target language.
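To show how the two language-specific files might be consumed, here is a small dependency-light sketch. It rests on assumptions not confirmed by this README: that the stopwords file holds one word per line, and that the .vec file uses the common word2vec text layout (a "count dimension" header line followed by one "word v1 v2 ..." line per word). The function names `load_stopwords` and `load_vec_file` are hypothetical. For a file of several gigabytes, gensim's `KeyedVectors.load_word2vec_format` is likely the more practical choice for this format.

```python
from pathlib import Path

import numpy as np

# File locations named in this README; point them at another language's files to adapt.
VECTORS_PATH = "src/embeddings-l-model.vec"
STOPWORDS_PATH = "src/spanish-stopwords.txt"


def load_stopwords(path):
    """Read one stopword per line (assumed layout), ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def load_vec_file(path, limit=None):
    """Read a word2vec-style text file into a {word: vector} dict.

    Assumes a "count dim" header line, then "word v1 ... v_dim" per line.
    `limit` caps the number of words read, which helps with very large files.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        _count, dim = map(int, handle.readline().split())
        for line in handle:
            if limit is not None and len(vectors) >= limit:
                break
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            values = [float(p) for p in parts[1:]]
            vectors[parts[0]] = np.asarray(values, dtype=np.float32)
    return vectors
```

Calling `load_vec_file(VECTORS_PATH, limit=100000)` and `load_stopwords(STOPWORDS_PATH)` would then give the two inputs the rest of the pipeline needs, whatever the language of the files.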
Distributed under the GNU GENERAL PUBLIC LICENSE. See LICENSE.txt for more information.
Big shoutouts to Luis for bringing this problem to the table.
We are grateful to the following resources and tutorials for making this repository possible: