Departamento-Sistemas-UTNFRRO/big_data_text_similarity

Ensemble method for community Question Answering sites based on cLustering

This repository contains the code needed to run the experiments for the MS thesis "Desarrollo de una medida de similaridad para Sistemas de Recomendación en sitios de Community Question Answering. Análisis desde un enfoque Big Data y usando un método de ensamble de clustering" (see the thesis PDF document).

This method, Ensemble method for community Question Answering sites based on cLustering (abbreviated "EQuAL"), is developed on a Python/PySpark Big Data architecture. It is based on Evidence Accumulation Clustering (EAC) and is used to generate similarity matrices that can serve as the input of a Recommender System (RS) in a Community Question Answering (CQA) site. The code runs on a distributed big data architecture and combines several text similarity algorithms through a clustering ensemble method.

Data pre-processing

This process needs to be executed only once and ensures the quality of the input data set. It performs the following pre-processing tasks:

  1. Text to lowercase.
  2. Remove math formulas.
  3. Replace numbers with letters.
  4. Remove special characters.
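
The four tasks above can be sketched as a single function. This is a minimal illustration, not the code in preProcessQuestions.py: the function name, the assumption that formulas are wrapped in [math]...[/math] tags, the digit-by-digit spelling of numbers, and the restriction to ASCII letters are all hypothetical choices made for the example.

```python
import re

# Assumption: formulas are delimited by [math]...[/math] tags in the raw text.
MATH_PATTERN = re.compile(r"\[math\].*?\[/math\]", re.DOTALL)
# Assumption: "replace numbers with letters" means spelling out each digit.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def preprocess_question(text: str) -> str:
    """Apply the four pre-processing tasks to one question (illustrative only)."""
    text = text.lower()                                  # 1. text to lowercase
    text = MATH_PATTERN.sub(" ", text)                   # 2. remove math formulas
    text = re.sub(r"[0-9]",                              # 3. replace digits with words
                  lambda m: f" {DIGIT_WORDS[int(m.group())]} ", text)
    text = re.sub(r"[^a-z\s]", " ", text)                # 4. remove special characters
    return " ".join(text.split())                        # collapse repeated whitespace


print(preprocess_question("What is [math]x^2[/math] when x = 3?"))
```

The order matters in this sketch: formulas are removed before digits are spelled out, so numbers inside a formula do not leak into the cleaned text.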

Execution

python3 preProcessQuestions.py -quora data/quora_duplicate_questions.tsv -results data/question_pairs.csv

Parameters

Parameter | Description | Required | Default Value
--------- | ----------- | -------- | -------------
input | Raw data set path | true | --
results | Results data set path | true | --



Samples generation

Besides generating the question pair samples used by the main process, this process calculates the distance for each question pair. This is useful for two reasons:

  1. The distance for each individual technique is later compared to the main process output.
  2. The exact same samples generated by this process will be used by the main process. This will enable a fair comparison between methods.
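
As an illustration of a per-pair distance, the sketch below computes a cosine distance over bag-of-words counts. It is only an assumption about what a technique such as bow could look like; the distance actually used by computeDistance.py may differ, and the function name is hypothetical.

```python
import math
from collections import Counter


def bow_cosine_distance(question_a: str, question_b: str) -> float:
    """Cosine distance between the word-count vectors of two questions."""
    counts_a = Counter(question_a.split())
    counts_b = Counter(question_b.split())
    # Dot product over the words of the first question (missing words count as 0).
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty question is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


print(bow_cosine_distance("how do i learn python", "how can i learn python fast"))
```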

Execution

python3 computeDistance.py -t bow -q data/question_pairs.csv

Parameters

Parameter | Description | Required | Default Value
--------- | ----------- | -------- | -------------
t | Comparison technique. choices=['bow', 'tfidf', 'gtfidf', 'w2v', 'ft', 'sem'] | true | --
q | Question pair data set path. | true | --
w | Number of parallel processes. | false | 8
b | Batch size: number of question pair distances calculated and written in one batch. | false | 100000
k | Total number of runs. | false | --
n | Size of the question subset that will be processed. | false | 0 (all the questions)
results_path | Path where the result files will be saved. | false | --
previous | Previous results file path (to resume an unfinished experiment). | false | --
sample | Path to the sample file (if not provided, a new sample is created). | false | --
sample_file_name | First part of the sample file name (the string _runnumber.csv will be appended). | false | --



Main Process

This is the main process of the method. It is based on three steps:

  1. Similarity matrix calculation for each of the underlying similarity algorithms.
  2. The generation of a partition set using k-medoids clustering technique.
  3. The construction of a co-association matrix from clustering labels.

The script that performs the aforementioned steps is clusterEnsemble.py.
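
Steps 2 and 3 can be sketched in plain Python as follows. This is a simplified, single-machine illustration under stated assumptions, not the PySpark code in clusterEnsemble.py: the function names are hypothetical, the k-medoids variant is a basic alternating assign/update loop, and the co-association value is taken to be the fraction of partitions in which two questions share a cluster.

```python
import random


def k_medoids(dist, k, rng, max_iter=100):
    """Cluster points from a square distance matrix; returns one label per point."""
    n = len(dist)
    medoids = rng.sample(range(n), k)  # random initial medoids
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid minimizes the total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if the cluster became empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def co_association(partitions):
    """Fraction of partitions in which each pair of points shares a cluster."""
    n = len(partitions[0])
    runs = len(partitions)
    return [[sum(labels[i] == labels[j] for labels in partitions) / runs
             for j in range(n)] for i in range(n)]


def ensemble(distance_matrices, k, clustering_runs, seed=0):
    """Run k-medoids several times per distance matrix and accumulate the evidence."""
    rng = random.Random(seed)
    partitions = [k_medoids(dist, k, rng)
                  for dist in distance_matrices
                  for _ in range(clustering_runs)]
    return co_association(partitions)


# Toy example: four points on a line forming two well-separated groups.
points = [0, 1, 10, 11]
toy_dist = [[abs(a - b) for b in points] for a in points]
print(ensemble([toy_dist], k=2, clustering_runs=5))
```

In this sketch each distance matrix would correspond to one underlying technique (step 1), so pairs that are grouped together consistently across techniques and runs end up with a co-association value close to 1.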

Execution

python3 /Users/ftesone/dev/big_data_text_similarity/clusterEnsemble.py -techniques bow,tfidf,w2v,ft,sem -questions_path "/Users/ftesone/Documents/Tesis/experiments/ensembles/inputs/w2v/100_10" -results_path "/Users/ftesone/Documents/Tesis/experiments/ensembles/results" -sample_size 100 -samples_number 10 -k 5 -clustering_runs 100 -in_progress_experiment_path "" -calc_distances_enabled -clustering_enabled -ensemble_enabled

Parameters

Parameter | Description | Required | Default Value
--------- | ----------- | -------- | -------------
techniques | Comma-separated list of techniques. Full list: bow,gtfidf,w2v,ft,sem. | true | --
questions_path | Path where the input sample data set of CQA question pairs is located. | true | --
sample_size | Sample size. Used to build the input path and to generate some data sets. | false | 0
samples_number | Number of samples that will be taken by the current experiment. | false | 1
results_path | Path where the result files will be saved. | false | --
k | Number of clusters (medoids). | true | 1
clustering_runs | Number of k-medoids clustering runs. | false | 1
in_progress_experiment_path | Folder of an existing experiment to continue. If empty, a new folder is created for the results. | false | False
calc_distances_enabled | Enables step 1 (similarity matrix calculation).** | false | False
clustering_enabled | Enables step 2 (k-medoids clustering).** | false | False
ensemble_enabled | Enables step 3 (co-association matrix construction).** | false | False
start_from_sample_num | Sample number to start from, in case samples_number > 1. | false | 1

** These parameters are used if the parameter in_progress_experiment_path is set. It is possible, for instance, to reuse the similarity matrices from another execution and change the number of clusters to generate a different output:

-in_progress_experiment_path "samples_size_100_count_10_k_20_runs_100_202006070134" -clustering_enabled -ensemble_enabled

Results

This process generates one folder per experiment, so that several experiments with different parameters can be kept organized throughout the whole research process.

Each folder looks like samples_size_<sample_size>_count_<samples_number>_k_<clusters_number>_runs_<clustering_runs>_<timestamp>, for example:

  • Sample size: 1000
  • Number of samples: 10
  • Clusters number: 10
  • Clustering runs: 100
samples_size_1000_count_10_k_10_runs_100_202007121800
├── 1000_1
│   ├── coassociation_matrix
│   ├── distances
│   │   ├── bow
│   │   ├── ft
│   │   ├── gtfidf
│   │   ├── sem
│   │   └── w2v
│   ├── input
│   ├── labels
│   └── pairs
├── 1000_2
│   ├── coassociation_matrix
│   ├── distances
│   │   ├── bow
│   │   ├── ft
│   │   ├── gtfidf
│   │   ├── sem
│   │   └── w2v
│   ├── input
│   ├── labels
│   └── pairs
...
└── 1000_10
    ├── coassociation_matrix
    ├── distances
    │   ├── bow
    │   ├── ft
    │   ├── gtfidf
    │   ├── sem
    │   └── w2v
    ├── input
    ├── labels
    └── pairs

The sub-directories contain:

  • coassociation_matrix: the final result after the clustering ensemble step.
  • distances: each of the underlying technique's similarity matrix.
  • input: each of the individual questions taken into account by the experiment.
  • labels: clustering results. Each question is assigned to a cluster. It's the clustering ensemble input.
  • pairs: the sample that was the input of the current experiment.
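
The folder naming convention described above can be reproduced with a small helper. This is a hedged sketch: the function names are hypothetical, and the timestamp layout (year, month, day, hour, minute) is inferred from the example folder name rather than taken from the code.

```python
from datetime import datetime


def experiment_folder_name(sample_size, samples_number, k, clustering_runs,
                           timestamp=None):
    """Build the experiment folder name; timestamp format is an assumption."""
    timestamp = timestamp or datetime.now()
    return (f"samples_size_{sample_size}_count_{samples_number}"
            f"_k_{k}_runs_{clustering_runs}_{timestamp:%Y%m%d%H%M}")


def sample_folder_name(sample_size, sample_number):
    """Name of the per-sample sub-folder, e.g. 1000_1."""
    return f"{sample_size}_{sample_number}"


print(experiment_folder_name(1000, 10, 10, 100, datetime(2020, 7, 12, 18, 0)))
print(sample_folder_name(1000, 1))
```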



Validation process

This process takes the result of the main process (the co-association matrix) and keeps only the pairs that exist in the input question-pair file. It compares the pairs one by one (input file against co-association matrix) using the threshold that gives the best results. The result of this comparison is reported as a confusion matrix.
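
The comparison can be sketched as follows. This is an illustration under assumptions, not the code in confusion_matrix_ensembles.py: the function name, the (i, j, is_duplicate) tuple layout of the labeled pairs, and the rule that a pair is predicted as similar when its co-association value reaches the threshold are all hypothetical.

```python
def confusion_matrix(labeled_pairs, coassociation, threshold):
    """Count agreement between labeled pairs and thresholded co-association values."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for i, j, is_duplicate in labeled_pairs:
        # Assumption: a pair is predicted similar when its value reaches the threshold.
        predicted = coassociation[i][j] >= threshold
        if predicted and is_duplicate:
            counts["tp"] += 1
        elif predicted and not is_duplicate:
            counts["fp"] += 1
        elif not predicted and is_duplicate:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy co-association matrix for three questions and three labeled pairs.
toy_matrix = [[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.6],
              [0.2, 0.6, 1.0]]
toy_pairs = [(0, 1, True), (0, 2, False), (1, 2, False)]
print(confusion_matrix(toy_pairs, toy_matrix, threshold=0.5))
```

A threshold could then be chosen by evaluating this function over a range of candidate values and keeping the one that gives the best results on the labeled pairs.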

Execution

python3 confusion_matrix_ensembles.py -runs 10 -sample_size 100 -experiment_path "ensembles/results/samples_size_100_count_10_k_5_runs_100_202006061939"

Parameters

Parameter | Description | Required | Default Value
--------- | ----------- | -------- | -------------
runs | Number of clustering runs. Used to build the path of the input files. | true | --
sample_size | Sample size with which the co-association file was built. | true | 0 (all the questions)
experiment_path | Path of the main process output. | true | --
