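This repository contains the code needed to run the experiments for the MS thesis "Desarrollo de una medida de similaridad para Sistemas de Recomendación en sitios de Community Question Answering. Análisis desde un enfoque Big Data y usando un método de ensamble de clustering" ("Development of a similarity measure for Recommender Systems in Community Question Answering sites. An analysis from a Big Data approach using a clustering ensemble method"). See the thesis PDF document.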
The method, Ensemble method for community Question Answering sites based on cLustering, is abbreviated as "EQuAL"
and is developed on a Python/PySpark Big Data architecture. It is based on Evidence Accumulation
Clustering (EAC) and is used to generate similarity matrices that can serve as the input of a
Recommender System (RS) in a Community Question Answering (CQA) site. The code runs on a distributed
big data architecture and combines several text similarity algorithms through a
clustering ensemble method.
This process needs to be executed only once and ensures the data quality of the input data set. It performs the following pre-processing tasks:
- Text to lowercase.
- Remove math formulas.
- Replace numbers with letters.
- Remove special characters.
python3 preProcessQuestions.py -quora data/quora_duplicate_questions.tsv -results data/question_pairs.csv
Parameter | Description | Required | Default Value |
---|---|---|---|
input | Raw data set path | true | -- |
results | Results data set path | true | -- |
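The pre-processing tasks listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code in preProcessQuestions.py: the function name `preprocess_question`, the `[math]...[/math]` formula delimiters, and the digit-to-word mapping used for "replace numbers with letters" are all assumptions.

```python
import re

# Hypothetical digit-to-word mapping; the repository may use a different scheme.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def preprocess_question(text: str) -> str:
    """Apply the four pre-processing tasks to a single question."""
    # 1. Text to lowercase.
    text = text.lower()
    # 2. Remove math formulas (assumed to be wrapped in [math]...[/math]).
    text = re.sub(r"\[math\].*?\[/math\]", " ", text, flags=re.DOTALL)
    # 3. Replace each ASCII digit with its spelled-out word.
    text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    # 4. Remove special characters (anything that is not a-z or whitespace).
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess_question("What is [math]x^2[/math] in 2D?"))
```

The order of the steps matters in this sketch: formulas are removed before digits are spelled out, so numbers inside a formula do not leak into the cleaned text.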
This process, besides generating the necessary question pair samples used by the main process, calculates the distance for each question pair. This is useful for two reasons:
- The distance for each individual technique is later compared to the main process output.
- The exact same samples generated by this process will be used by the main process. This will enable a fair comparison between methods.
python3 computeDistance.py -t bow -q data/question_pairs.csv
Parameter | Description | Required | Default Value |
---|---|---|---|
t | Comparison technique. choices=['bow', 'tfidf', 'gtfidf', 'w2v', 'ft', 'sem'] | true | -- |
q | Question pair data set path. | true | -- |
w | Number of parallel processes | false | 8 |
b | Batch size: number of question pair distances calculated and written in one batch. | false | 100000 |
k | Total number of runs. | false | -- |
n | Size of the question subset to process. | false | 0 -> All the questions |
results_path | Path where the result files will be saved. | false | -- |
previous | Previous results file path (to resume an unfinished experiment). | false | -- |
sample | Path to sample file (if not provided, creates a new sample) | false | -- |
sample_file_name | First part of the sample file name (the string _runnumber.csv will be appended) | false | -- |
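As an illustration of what a per-pair distance looks like, the sketch below computes a cosine distance over bag-of-words counts. It assumes that the `bow` technique corresponds to this kind of measure and that questions are tokenized on whitespace; the actual computeDistance.py may differ on both points, and `bow_cosine_distance` is a hypothetical name.

```python
import math
from collections import Counter


def bow_cosine_distance(question1: str, question2: str) -> float:
    """Cosine distance (1 - cosine similarity) between two bag-of-words vectors."""
    counts1 = Counter(question1.split())
    counts2 = Counter(question2.split())
    # Dot product over the terms that both questions share.
    dot = sum(counts1[t] * counts2[t] for t in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(v * v for v in counts1.values()))
    norm2 = math.sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        # An empty question shares nothing with any other question.
        return 1.0
    return 1.0 - dot / (norm1 * norm2)


if __name__ == "__main__":
    print(bow_cosine_distance("how do i learn python", "how can i learn python fast"))
```

A distance of 0 means the two questions use exactly the same words in the same proportions, and 1 means they share no words.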
This is the main process of the method. It is based on three steps:
- Similarity matrix calculation for each of the underlying similarity algorithms.
- The generation of a partition set using k-medoids clustering technique.
- The construction of a co-association matrix from clustering labels.
The script that performs the aforementioned steps is clusterEnsemble.py.
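The third step can be illustrated with a small, single-machine sketch. In Evidence Accumulation Clustering, entry (i, j) of the co-association matrix is the fraction of clustering runs in which items i and j were assigned to the same cluster. The function below is a plain-Python assumption of that idea; the repository computes it on PySpark, and `coassociation_matrix` is a hypothetical name.

```python
from typing import List, Sequence


def coassociation_matrix(partitions: Sequence[Sequence[int]]) -> List[List[float]]:
    """Build a co-association matrix from a list of clustering label vectors.

    Each element of `partitions` holds the cluster label of every item
    for one clustering run; all runs must cover the same items.
    """
    n_items = len(partitions[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i in range(n_items):
            for j in range(n_items):
                # Count the runs in which items i and j share a cluster.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_runs = len(partitions)
    return [[count / n_runs for count in row] for row in counts]


if __name__ == "__main__":
    # Two clustering runs over three questions.
    for row in coassociation_matrix([[0, 0, 1], [0, 1, 1]]):
        print(row)
```

Values close to 1 indicate pairs of questions that were grouped together consistently across runs, which is what makes the matrix usable as a similarity measure.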
python3 /Users/ftesone/dev/big_data_text_similarity/clusterEnsemble.py -techniques bow,tfidf,w2v,ft,sem -questions_path "/Users/ftesone/Documents/Tesis/experiments/ensembles/inputs/w2v/100_10" -results_path "/Users/ftesone/Documents/Tesis/experiments/ensembles/results" -sample_size 100 -samples_number 10 -k 5 -clustering_runs 100 -in_progress_experiment_path "" -calc_distances_enabled -clustering_enabled -ensemble_enabled
Parameter | Description | Required | Default Value |
---|---|---|---|
techniques | Comma-separated list of techniques. Full list: bow,gtfidf,w2v,ft,sem. | true | -- |
questions_path | Path where the input sample data set of CQA question pairs is located. | true | -- |
sample_size | Sample size. Used to build the input path and for some data set generation. | false | 0 |
samples_number | Number of samples taken by the current experiment. | false | 1 |
results_path | Path where the result files will be saved. | false | -- |
k | Number of clusters (medoids). | true | 1 |
clustering_runs | Number of k-medoids clustering runs. | false | 1 |
in_progress_experiment_path | Path of an experiment in progress. If empty, a new folder is created for the results. | false | False |
calc_distances_enabled | Enables step 1.** | false | False |
clustering_enabled | Enables step 2.** | false | False |
ensemble_enabled | Enables step 3.** | false | False |
start_from_sample_num | Sample number to start from, in case samples_number > 1. | false | 1 |
** These parameters are used when the parameter in_progress_experiment_path is set. For instance, it is possible to reuse the similarity matrices from another execution and change the number of clusters to generate a different output:
-in_progress_experiment_path "samples_size_100_count_10_k_20_runs_100_202006070134" -clustering_enabled -ensemble_enabled
This process generates a folder for each experiment, so that several experiments with different parameters can be kept organized throughout the investigation.
Each folder is named samples_size_<sample_size>_count_<samples_number>_k_<clusters_number>_runs_<clustering_runs>_<timestamp>, for example:
- Sample size: 1000
- Number of samples: 10
- Number of clusters: 10
- Clustering runs: 100
samples_size_1000_count_10_k_10_runs_100_202007121800
├── 1000_1
│ ├── coassociation_matrix
│ ├── distances
│ │ ├── bow
│ │ ├── ft
│ │ ├── gtfidf
│ │ ├── sem
│ │ └── w2v
│ ├── input
│ ├── labels
│ └── pairs
├── 1000_2
│ ├── coassociation_matrix
│ ├── distances
│ │ ├── bow
│ │ ├── ft
│ │ ├── gtfidf
│ │ ├── sem
│ │ └── w2v
│ ├── input
│ ├── labels
│ └── pairs
...
└── 1000_10
  ├── coassociation_matrix
  ├── distances
  │ ├── bow
  │ ├── ft
  │ ├── gtfidf
  │ ├── sem
  │ └── w2v
  ├── input
  ├── labels
  └── pairs
The sub-directories contain:
- coassociation_matrix: the final result after the clustering ensemble step.
- distances: the similarity matrix of each underlying technique.
- input: the individual questions taken into account by the experiment.
- labels: the clustering results. Each question is assigned to a cluster. This is the input of the clustering ensemble step.
- pairs: the sample that was the input of the current experiment.
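The naming scheme described above can be reproduced with a short helper, which is handy when locating the results of a given experiment from another script. This is a sketch under assumptions: the helper names are hypothetical, and the timestamp format `%Y%m%d%H%M` is inferred from the example folder name rather than taken from the code.

```python
from datetime import datetime
from typing import List


def experiment_folder_name(sample_size: int, samples_number: int, k: int,
                           clustering_runs: int, when: datetime) -> str:
    """Build the experiment folder name from the run parameters."""
    # Assumed format: year, month, day, hour and minute with no separators.
    timestamp = when.strftime("%Y%m%d%H%M")
    return (f"samples_size_{sample_size}_count_{samples_number}"
            f"_k_{k}_runs_{clustering_runs}_{timestamp}")


def sample_folder_names(sample_size: int, samples_number: int) -> List[str]:
    """List the per-sample sub-folders, numbered from 1."""
    return [f"{sample_size}_{i}" for i in range(1, samples_number + 1)]


if __name__ == "__main__":
    print(experiment_folder_name(1000, 10, 10, 100, datetime(2020, 7, 12, 18, 0)))
    print(sample_folder_names(1000, 10))
```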
This process takes the result of the main process (the co-association matrix) and keeps only the pairs that exist in the input question-pair file. It compares the pairs one by one (input file against co-association matrix) using the threshold that gives the best results. The result of this comparison is reported as a confusion matrix.
python3 confusion_matrix_ensembles.py -runs 10 -sample_size 100 -experiment_path "ensembles/results/samples_size_100_count_10_k_5_runs_100_202006061939"
Parameter | Description | Required | Default Value |
---|---|---|---|
runs | Number of clustering runs. Useful to build the path of the input files. | true | -- |
sample_size | Sample size with which the co-association file was built. | true | 0 -> all the questions |
experiment_path | Path of the main process output | true | -- |
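The thresholded comparison can be sketched as follows. The sketch assumes that a pair is predicted as a duplicate when its co-association value is greater than or equal to the threshold, and that the input file provides a true/false duplicate label per pair; `confusion_matrix_at` is a hypothetical name and confusion_matrix_ensembles.py may apply a different rule.

```python
from typing import Dict, Sequence


def confusion_matrix_at(scores: Sequence[float], is_duplicate: Sequence[bool],
                        threshold: float) -> Dict[str, int]:
    """Count TP, FP, TN and FN for question pairs at a given threshold."""
    result = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, truth in zip(scores, is_duplicate):
        # Assumed decision rule: high co-association means "duplicate".
        predicted = score >= threshold
        if predicted and truth:
            result["tp"] += 1
        elif predicted and not truth:
            result["fp"] += 1
        elif not predicted and not truth:
            result["tn"] += 1
        else:
            result["fn"] += 1
    return result


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.7, 0.4]          # co-association values per pair
    labels = [True, False, False, True]    # duplicate flags from the input file
    print(confusion_matrix_at(scores, labels, threshold=0.5))
```

Evaluating this function over a range of thresholds is one way to find the value that gives the best results for a given experiment.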