SWLab-ICUFF/DatasetDescriptions

Presentation

This repository contains descriptions of Linked Data datasets using the VoID vocabulary, together with data prepared for empirical experiments that evaluate dataset ranking models. The directory "dataset" contains the dataset descriptions serialized as an N-Quads RDF file, and the directory "prepared_test_data" contains the test data for the experiments serialized as *.csv files.

The dataset descriptions include linksets, classes, properties and topic categories. They mash up data from DataHub, dataset dumps, VoID files and DBpedia. DBpedia Spotlight was used to recognize named entities in the literal values of the datasets and to link the datasets to a list of DBpedia topic categories. In DBpedia, named entities are linked directly to categories through the predicate dcterms:subject, and each topic category is subsumed by others through the predicate skos:broader. A category c is a topic category of a dataset iff DBpedia contains a property path {e dcterms:subject/skos:broader* c.} from a named entity e of the dataset to c.
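The property path in the definition above can be read as a graph traversal: take one dcterms:subject step from an entity, then follow skos:broader zero or more times. The following is a minimal sketch of that reading over in-memory dictionaries, not the procedure used to build the descriptions. The function name `topic_categories` and the toy identifiers are invented for illustration.

```python
from collections import deque

def topic_categories(entities, subject, broader):
    """Return every category reachable via dcterms:subject/skos:broader*."""
    # One dcterms:subject step: categories attached directly to the entities.
    frontier = deque(c for e in entities for c in subject.get(e, ()))
    seen = set(frontier)
    # Zero or more skos:broader steps (the * in the property path).
    while frontier:
        category = frontier.popleft()
        for parent in broader.get(category, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

# Toy data, invented for illustration only (not taken from DBpedia).
subject = {"ex:Rio_de_Janeiro": ["cat:Cities_in_Brazil"]}
broader = {
    "cat:Cities_in_Brazil": ["cat:Brazil"],
    "cat:Brazil": ["cat:South_America"],
}
result = topic_categories(["ex:Rio_de_Janeiro"], subject, broader)
```

The `seen` set also keeps the traversal from looping if the category hierarchy contains cycles.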

Files in the "prepared_test_data" directory are organized in two subdirectories. The first, called "bayesian-social_network", contains files for evaluating ranking models that use algorithms based on Social Network Analysis and Bayesian Classifiers. The second, named "cos-j48-jrip", contains files for evaluating ranking models that use algorithms based on cosine similarity and on the classifiers JRip and J48. Each of these two directories is, in turn, organized in subdirectories that split the test data according to the types of features used in the experiments. The directory "5L" contains test data using dataset representations with five linksets, the directory "12C" contains test data using dataset representations with twelve topic categories, "5L12C" contains test data using dataset representations with five linksets and twelve topic categories, and so on.
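The naming scheme of the feature subdirectories can be decoded mechanically. The sketch below assumes every name is a sequence of count and letter-code pairs, as in the three examples given. Only the codes L and C are explained above, so any other code is returned unchanged rather than guessed. The helper name `parse_feature_dir` is hypothetical.

```python
import re

# Only these two codes are explained in the text; other codes are left as-is.
KNOWN_CODES = {"L": "linksets", "C": "topic categories"}

def parse_feature_dir(name):
    """Split a name such as '5L12C' into (count, feature type) pairs."""
    pairs = re.findall(r"(\d+)([A-Za-z]+)", name)
    return [(int(count), KNOWN_CODES.get(code, code)) for count, code in pairs]

features = parse_feature_dir("5L12C")
```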

The *.csv files in "cos-j48-jrip" are split into three series, indexed by i in {1, 2, 3}, each of which contains three types of files: Testi.csv, Traningi.csv and Relevantsi.csv. Testi.csv contains the target datasets for which the datasets in ToBeRankedi.csv should be ranked. Relevantsi.csv contains the relevance degree rel in {0, 1, 2, 3} of each dataset in ToBeRankedi.csv with respect to the target datasets in Testi.csv: the first column is the id of a dataset in Testi.csv, the second column is the id of a dataset in ToBeRankedi.csv and the third column is the relevance degree rel.
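The three-column layout of the relevance files can be loaded as follows. This is a minimal sketch that assumes the file is comma-separated and has no header row, neither of which is stated above. The ids in the sample and the function names `load_relevance` and `ideal_ranking` are invented for illustration.

```python
import csv
import io
from collections import defaultdict

def load_relevance(handle):
    """Map each target dataset id to {candidate dataset id: relevance degree}."""
    relevance = defaultdict(dict)
    # Assumed layout: target id, candidate id, rel (comma-separated, no header).
    for target_id, candidate_id, rel in csv.reader(handle):
        relevance[target_id][candidate_id] = int(rel)
    return relevance

def ideal_ranking(relevance, target_id):
    """Candidates for one target, most relevant first (ties broken by id)."""
    judged = relevance[target_id]
    return sorted(judged, key=lambda c: (-judged[c], c))

# Invented sample rows standing in for a relevance file.
sample = io.StringIO("t1,d1,0\nt1,d2,3\nt1,d3,2\nt2,d1,1\n")
rel = load_relevance(sample)
best_first = ideal_ranking(rel, "t1")
```

To read a real file, pass an open file object instead of the `io.StringIO` sample, for example `open(path, newline="")`.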

The three series of the files Testi.csv, ToBeRankedi.csv and Relevantsi.csv are combinations, in a 3-fold cross-validation approach, of the 1113 datasets selected from DataHub.
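One way to picture the 3-fold arrangement is sketched below. It assumes that in each series one fold supplies the target datasets and the remaining two folds supply the datasets to be ranked. How the folds were actually drawn is not described here, so the interleaved split and the function name `three_fold_series` are illustrative assumptions only.

```python
def three_fold_series(dataset_ids, k=3):
    """Build k (targets, to_be_ranked) pairs from k disjoint folds."""
    # Assumed split: interleave ids into k folds of near-equal size.
    folds = [dataset_ids[i::k] for i in range(k)]
    series = []
    for i in range(k):
        targets = folds[i]
        # All datasets outside the held-out fold are candidates for ranking.
        to_be_ranked = [d for j, fold in enumerate(folds) if j != i for d in fold]
        series.append((targets, to_be_ranked))
    return series

# Placeholder ids 0..1112 stand in for the 1113 real dataset ids.
series = three_fold_series(list(range(1113)))
```

Under this assumption each series has 371 target datasets and 742 datasets to be ranked, since 1113 divides evenly by 3.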

The directory "bayesian-social_network" contains exactly the same data, except that the TF-IDF value of each feature is replaced with the feature id itself. These files are used to evaluate ranking models based on Bayesian Classifiers and Social Network Analysis, which do not use TF-IDF.