Ensemble method for community Question Answering sites based on cLustering

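This repository contains the code needed to run the experiments for the MS thesis "Desarrollo de una medida de similaridad para Sistemas de Recomendación en sitios de Community Question Answering. Análisis desde un enfoque Big Data y usando un método de ensamble de clustering" ("Development of a similarity measure for Recommender Systems in Community Question Answering sites. Analysis from a Big Data approach using a clustering ensemble method"). See the thesis PDF document.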

The method, Ensemble method for community Question Answering sites based on cLustering (abbreviated "EQuAL"), is developed on a Python/PySpark Big Data architecture. It is based on Evidence Accumulation Clustering (EAC) and is used to generate similarity matrices that can serve as the input of a Recommender System (RS) in a Community Question Answering (CQA) site. The code runs on a distributed Big Data architecture and combines several text similarity algorithms through a clustering ensemble method.

Data pre-processing

This process needs to be executed only once. It ensures the data quality of the input data set by performing the following pre-processing tasks:

Text to lowercase.
Remove math formulas.
Replace numbers with letters.
Remove special characters.
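
The four tasks above can be sketched as a single function. This is a minimal illustration and not the code in preProcessQuestions.py: the `preprocess` name is hypothetical, math formulas are assumed to be wrapped in `[math]...[/math]` tags, and "replace numbers with letters" is assumed to mean spelling out each digit as a word.

```python
import re

# Assumption: formulas are delimited by [math]...[/math] tags.
MATH_RE = re.compile(r"\[math\].*?\[/math\]", re.IGNORECASE | re.DOTALL)

# Assumption: each digit is replaced by its English word.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def preprocess(text):
    """Apply the four pre-processing tasks to one question (hypothetical helper)."""
    text = text.lower()                      # 1. text to lowercase
    text = MATH_RE.sub(" ", text)            # 2. remove math formulas
    text = re.sub(r"[0-9]",                  # 3. replace numbers with letters
                  lambda m: " " + DIGIT_WORDS[int(m.group())] + " ",
                  text)
    text = re.sub(r"[^a-z\s]", " ", text)    # 4. remove special characters
    return " ".join(text.split())            # collapse repeated whitespace


cleaned = preprocess("What is 2 + [math]x^2[/math]?")
```

The order matters in this sketch: formulas are removed before digits are rewritten, so digits inside a formula are dropped rather than spelled out.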

Execution

python3 preProcessQuestions.py -quora data/quora_duplicate_questions.tsv -results data/question_pairs.csv

Parameters

Parameter	Description	Required	Default Value
input	Raw data set path	true	--
results	Results data set path	true	--

Samples generation

This process, besides generating the necessary question pair samples used by the main process, calculates the distance for each question pair. This is useful for two reasons:

The distance for each individual technique is later compared to the main process output.
The exact same samples generated by this process will be used by the main process. This will enable a fair comparison between methods.
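
As an illustration of what a per-pair distance can look like, the sketch below computes a bag-of-words cosine distance for one question pair. It is an assumption about the `bow` technique, not the code in computeDistance.py: the `bow_cosine_distance` name, the whitespace tokenization, and the choice of cosine distance are all hypothetical.

```python
import math
from collections import Counter


def bow_cosine_distance(question1, question2):
    """Cosine distance between the bag-of-words vectors of two questions."""
    bag1 = Counter(question1.split())
    bag2 = Counter(question2.split())
    # Dot product over the terms that both questions share.
    dot = sum(bag1[term] * bag2[term] for term in bag1.keys() & bag2.keys())
    norm1 = math.sqrt(sum(count * count for count in bag1.values()))
    norm2 = math.sqrt(sum(count * count for count in bag2.values()))
    if norm1 == 0 or norm2 == 0:
        return 1.0  # an empty question is treated as maximally distant
    return 1.0 - dot / (norm1 * norm2)


same = bow_cosine_distance("how do i learn python", "how do i learn python")
different = bow_cosine_distance("how do i learn python", "best pizza recipe")
```

Identical questions give a distance close to 0, and questions with no shared terms give a distance of 1.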

Execution

python3 computeDistance.py -t bow -q data/question_pairs.csv

Parameters

Parameter	Description	Required	Default Value
t	Comparison technique. choices=['bow', 'tfidf', 'gtfidf', 'w2v', 'ft', 'sem']	true	--
q	Question pair data set path.	true	--
w	Number of parallel processes	false	8
b	Batch size. Number of question pairs distances calculated and written in one batch.	false	100000
k	Total number of runs.	false	--
n	Size of the question subset to process.	false	0 -> All the questions
results_path	Path where the result files will be saved.	false	--
previous	Previous results file path (to resume an unfinished experiment).	false	--
sample	Path to sample file (if not provided, creates a new sample)	false	--
sample_file_name	First part of the sample file name (the string _runnumber.csv will be appended)	false	--

Main Process

This is the main process of the method. It is based on three steps:

Similarity matrix calculation for each of the underlying similarity algorithms.
The generation of a partition set using k-medoids clustering technique.
The construction of a co-association matrix from clustering labels.

The script that performs the aforementioned steps is clusterEnsemble.py.
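
The last two steps can be sketched on a toy distance matrix. This is a simplified, single-machine illustration under stated assumptions and not the code in clusterEnsemble.py: the `k_medoids` and `coassociation_matrix` functions are hypothetical, the k-medoids variant is a basic alternating assign/update loop, and the co-association value of a pair is taken to be the fraction of clustering runs in which both items share a cluster.

```python
import random


def k_medoids(dist, k, seed=0, max_iter=100):
    """Cluster items from a precomputed distance matrix; returns one label per item."""
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)  # random initial medoids
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each item joins the cluster of its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: the new medoid minimizes the total distance to its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
            else:
                new_medoids.append(
                    min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return labels


def coassociation_matrix(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[count / len(partitions) for count in row] for row in counts]


# Toy stand-in for step 1: distances between four points on a line.
points = [0, 1, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]

# Step 2: several k-medoids runs with different seeds. Step 3: accumulate evidence.
partitions = [k_medoids(dist, k=2, seed=s) for s in range(10)]
co = coassociation_matrix(partitions)
```

In this toy case the first two points always end up together and never with the last two, so the co-association matrix has two clear blocks.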

Execution

python3 /Users/ftesone/dev/big_data_text_similarity/clusterEnsemble.py -techniques bow,tfidf,w2v,ft,sem -questions_path "/Users/ftesone/Documents/Tesis/experiments/ensembles/inputs/w2v/100_10" -results_path "/Users/ftesone/Documents/Tesis/experiments/ensembles/results" -sample_size 100 -samples_number 10 -k 5 -clustering_runs 100 -in_progress_experiment_path "" -calc_distances_enabled -clustering_enabled -ensemble_enabled

Parameters

Parameter	Description	Required	Default Value
techniques	Comma-separated list of techniques. Full list: bow,gtfidf,w2v,ft,sem.	true	--
questions_path	Path where the input sample data set of CQA question pairs is located.	true	--
sample_size	Sample size. It is used to build the input path and to generate some data sets.	false	0
samples_number	Number of samples that are going to be taken by the current experiment.	false	1
results_path	Path where the result files will be saved.	false	--
k	Number of clusters (medoids).	true	1
clustering_runs	Number of k-medoids clustering runs.	false	1
in_progress_experiment_path	If null, creates a new folder for the results.	false	False
calc_distances_enabled	Enables step 1.**	false	False
clustering_enabled	Enables step 2.**	false	False
ensemble_enabled	Enables step 3.**	false	False
start_from_sample_num	Sample number to start from, in case samples_number > 1.	false	1

** These parameters are used if the parameter in_progress_experiment_path is set. For instance, it is possible to re-use the similarity matrices from another execution and change the number of clusters to generate a different output. For example:

-in_progress_experiment_path "samples_size_100_count_10_k_20_runs_100_202006070134" -clustering_enabled -ensemble_enabled

Results

This process generates one folder per experiment, so that several experiments with different parameters can be kept organized throughout the whole investigation process.

Each folder looks like samples_size_<sample_size>_count_<samples_number>_k_<clusters_number>_runs_<clustering_runs>_<timestamp>, for example:

Sample size: 1000
Number of samples: 10
Clusters number: 10
Clustering runs: 100

samples_size_1000_count_10_k_10_runs_100_202007121800
├── 1000_1
│   ├── coassociation_matrix
│   ├── distances
│   │   ├── bow
│   │   ├── ft
│   │   ├── gtfidf
│   │   ├── sem
│   │   └── w2v
│   ├── input
│   ├── labels
│   └── pairs
├── 1000_2
│   ├── coassociation_matrix
│   ├── distances
│   │   ├── bow
│   │   ├── ft
│   │   ├── gtfidf
│   │   ├── sem
│   │   └── w2v
│   ├── input
│   ├── labels
│   └── pairs
...
└── 1000_10
    ├── coassociation_matrix
    ├── distances
    │   ├── bow
    │   ├── ft
    │   ├── gtfidf
    │   ├── sem
    │   └── w2v
    ├── input
    ├── labels
    └── pairs

The sub-directories contain:

coassociation_matrix: the final result after the clustering ensemble step.
distances: each of the underlying technique's similarity matrix.
input: each of the individual questions taken into account by the experiment.
labels: clustering results. Each question is assigned to a cluster. It's the clustering ensemble input.
pairs: sample which was the input of the current experiment.
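
The folder naming pattern described above can be reproduced with a small helper. This is a sketch only: the `experiment_folder_name` function is hypothetical, and the timestamp format (year, month, day, hour, minute) is inferred from the example name `samples_size_1000_count_10_k_10_runs_100_202007121800`.

```python
from datetime import datetime


def experiment_folder_name(sample_size, samples_number, k, clustering_runs, now=None):
    """Build an experiment folder name following the pattern shown in this section."""
    if now is None:
        now = datetime.now()
    # Assumption: the timestamp is formatted as YYYYMMDDHHMM.
    timestamp = now.strftime("%Y%m%d%H%M")
    return (f"samples_size_{sample_size}_count_{samples_number}"
            f"_k_{k}_runs_{clustering_runs}_{timestamp}")


name = experiment_folder_name(1000, 10, 10, 100, now=datetime(2020, 7, 12, 18, 0))
```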

Validation process

This process takes the result of the main process (the co-association matrix) and keeps only the pairs that exist in the input question-pair file. It compares the pairs one by one (input file against co-association matrix) using the threshold that gives the best results. The result of this comparison is reported as a confusion matrix.
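
The comparison can be sketched as follows. This is an illustration under assumptions and not the code in confusion_matrix_ensembles.py: the `confusion_matrix` function, the `(i, j, is_duplicate)` pair layout, and the threshold value of 0.5 are hypothetical, and a pair is predicted as duplicate when its co-association value reaches the threshold.

```python
def confusion_matrix(pairs, coassociation, threshold=0.5):
    """Count agreements between labeled pairs and thresholded co-association values.

    pairs: iterable of (i, j, is_duplicate) taken from the input question-pair file.
    coassociation: square matrix with values in [0, 1].
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for i, j, is_duplicate in pairs:
        predicted = coassociation[i][j] >= threshold
        if predicted and is_duplicate:
            counts["tp"] += 1
        elif predicted and not is_duplicate:
            counts["fp"] += 1
        elif not predicted and not is_duplicate:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Toy data: three questions and three labeled pairs.
toy_coassociation = [[1.0, 0.9, 0.1],
                     [0.9, 1.0, 0.2],
                     [0.1, 0.2, 1.0]]
toy_pairs = [(0, 1, 1), (0, 2, 0), (1, 2, 1)]
result = confusion_matrix(toy_pairs, toy_coassociation, threshold=0.5)
```

Only the pairs present in the input file are visited, which mirrors the filtering step described above.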

Execution

python3 confusion_matrix_ensembles.py -runs 10 -sample_size 100 -experiment_path "ensembles/results/samples_size_100_count_10_k_5_runs_100_202006061939"

Parameters

Parameter	Description	Required	Default Value
runs	Number of clustering runs. Useful to build the path of the input files.	true	--
sample_size	Sample size with which the co-association file was built	true	0 -> all the questions
experiment_path	Path of the main process output	true	--
