Presentation

This repository contains descriptions of Linked Data datasets expressed in the VoID vocabulary, together with prepared data for running the empirical experiments that evaluate dataset ranking models. The directory "dataset" contains the dataset descriptions serialized as an N-Quads RDF file, and the directory "prepared_test_data" contains the test data for the experiments serialized as *.csv files.

The dataset descriptions include linksets, classes, properties and topic categories. They combine data from DataHub, dataset dumps, VoID files and DBpedia. DBpedia Spotlight was used to recognize named entities in the literal values of the datasets and to link the datasets to a list of DBpedia topic categories. In DBpedia, named entities are directly linked to categories through the predicate dcterms:subject, and each topic category is subsumed by others through the predicate skos:broader. A category c is a topic category of a dataset iff there exists a property path {e dcterms:subject/skos:broader* c.} in DBpedia from a named entity e of the dataset to c.
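The property-path definition above amounts to a reachability check: start from the categories attached to the dataset's entities and follow skos:broader edges zero or more times. The sketch below illustrates this on an in-memory graph. The function name `topic_categories`, the dictionaries `subject` and `broader`, and all `ex:` identifiers are hypothetical and made up for illustration; they are not taken from this repository or from DBpedia.

```python
from collections import deque

def topic_categories(entities, subject, broader):
    """Return every category reachable via dcterms:subject/skos:broader*.

    subject: maps an entity to the categories given by dcterms:subject.
    broader: maps a category to the categories given by skos:broader.
    """
    # Zero skos:broader steps: the categories attached directly to entities.
    frontier = deque(c for e in entities for c in subject.get(e, ()))
    seen = set(frontier)
    # One or more skos:broader steps: breadth-first traversal.
    # The "seen" set also guards against cycles in the category graph.
    while frontier:
        category = frontier.popleft()
        for parent in broader.get(category, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

# Hypothetical toy graph, for illustration only.
subject = {"ex:entity1": ["ex:CatA"]}
broader = {
    "ex:CatA": ["ex:CatB", "ex:CatC"],
    "ex:CatC": ["ex:CatD"],
}
cats = topic_categories(["ex:entity1"], subject, broader)
```

Here `cats` contains ex:CatA (reached with zero skos:broader steps) as well as ex:CatB, ex:CatC and ex:CatD.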

Files in the "prepared_test_data" directory are organized in two subdirectories. The first, "bayesian-social_network", contains files for evaluating ranking models that use algorithms based on Social Network Analysis and Bayesian classifiers. The second, "cos-j48-jrip", contains files for evaluating ranking models that use algorithms based on cosine similarity and on the classifiers JRip and J48. Each of these two directories is in turn organized in subdirectories that split the test data according to the types of features used in the experiments. The directory "5L" contains test data using dataset representations with five linksets, the directory "12C" contains test data using dataset representations with twelve topic categories, "5L12C" contains test data using dataset representations with five linksets and twelve topic categories, and so on.
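The subdirectory names follow a count-plus-letter pattern, so they can be decoded mechanically. The following is a minimal sketch, assuming every name is a sequence of `<number><letters>` pairs. Only the letters "L" and "C" are documented above; the helper `parse_feature_dir` and the table `KNOWN_FEATURES` are hypothetical, and any other letter code is returned unchanged rather than guessed.

```python
import re

# Only these two codes are documented in this README; other codes pass through.
KNOWN_FEATURES = {"L": "linksets", "C": "topic categories"}

def parse_feature_dir(name):
    """Split a directory name such as "5L12C" into {feature: count}."""
    parts = re.findall(r"(\d+)([A-Za-z]+)", name)
    # Reject names that are not fully covered by <number><letters> pairs.
    if not parts or "".join(n + f for n, f in parts) != name:
        raise ValueError(f"unrecognized directory name: {name!r}")
    return {KNOWN_FEATURES.get(f, f): int(n) for n, f in parts}

features = parse_feature_dir("5L12C")
```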

The *.csv files in "cos-j48-jrip" are split into three series, indexed by i in {1, 2, 3}, each containing three types of files: Test_i.csv, ToBeRanked_i.csv and Relevants_i.csv. Test_i.csv contains the target datasets with respect to which the datasets in ToBeRanked_i.csv should be ranked. Relevants_i.csv contains the relevance degree rel in {0, 1, 2, 3} of each dataset in ToBeRanked_i.csv with respect to the target datasets in Test_i.csv: the first column is the id of a dataset in Test_i.csv, the second column is the id of a dataset in ToBeRanked_i.csv, and the third column is the relevance degree rel.
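As an illustration of the three-column layout of Relevants_i.csv, the sketch below reads such a file into a nested dictionary. It assumes comma-separated values and no header row, neither of which is stated above; the function `load_relevants` and the ids `t1`, `t2`, `d1`, `d2` are hypothetical.

```python
import csv
import io
from collections import defaultdict

def load_relevants(handle):
    """Map each target dataset id to {candidate dataset id: rel}."""
    relevance = defaultdict(dict)
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        target_id, candidate_id = row[0].strip(), row[1].strip()
        rel = int(row[2])
        if rel not in (0, 1, 2, 3):
            raise ValueError(f"relevance degree out of range: {rel}")
        relevance[target_id][candidate_id] = rel
    return dict(relevance)

# Hypothetical sample rows: target id, candidate id, rel.
sample = io.StringIO("t1,d1,3\nt1,d2,0\nt2,d1,1\n")
relevance = load_relevants(sample)

# Candidates for target t1, ordered from most to least relevant.
ideal_t1 = sorted(relevance["t1"], key=relevance["t1"].get, reverse=True)
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="")`, where `path` points to one of the Relevants_i.csv files.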

The three series of files Test_i.csv, ToBeRanked_i.csv and Relevants_i.csv are combinations, following a 3-fold cross-validation approach, of the 1113 datasets selected from DataHub.
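One common way to build such series is sketched below. It assumes that, in each series, one fold supplies the target datasets and the remaining two folds supply the datasets to be ranked; this assignment, the round-robin split and the function `three_fold_series` are assumptions made for illustration, and the split actually used to produce the files may differ.

```python
def three_fold_series(dataset_ids, k=3):
    """Yield (test_ids, to_be_ranked_ids) for each of the k series."""
    # Round-robin assignment of ids to k folds (an assumed splitting rule).
    folds = [dataset_ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # All folds except the i-th one form the datasets to be ranked.
        to_be_ranked = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield test, to_be_ranked

# Placeholder ids 0..1112 standing in for the 1113 datasets.
ids = list(range(1113))
series = list(three_fold_series(ids))
```

With 1113 datasets and three folds, each series under this assumption has 371 target datasets and 742 datasets to be ranked.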

The directory "bayesian-social_network" contains exactly the same data, except that the TF-IDF values of the features are replaced with the feature ids themselves. These files are used to evaluate ranking models based on Bayesian classifiers and Social Network Analysis, which do not use TF-IDF.
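The difference between the two representations can be sketched as follows. The TF-IDF variant used here (raw term frequency times log(N/df)) is an assumption, since the exact weighting behind the prepared files is not stated above; the functions `tfidf_rows` and `id_rows` and the ids `d1`, `d2`, `f1`, `f2` are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(datasets):
    """Weight each feature of each dataset by tf * log(N / df) (assumed variant)."""
    n = len(datasets)
    # Document frequency: number of datasets in which a feature occurs.
    df = Counter(f for feats in datasets.values() for f in set(feats))
    rows = {}
    for ds, feats in datasets.items():
        tf = Counter(feats)
        rows[ds] = {f: tf[f] * math.log(n / df[f]) for f in tf}
    return rows

def id_rows(datasets):
    """Keep only the feature ids, with no weights."""
    return {ds: sorted(set(feats)) for ds, feats in datasets.items()}

# Hypothetical input: dataset id -> list of feature ids (repeats allowed).
datasets = {"d1": ["f1", "f1", "f2"], "d2": ["f2"]}
weighted = tfidf_rows(datasets)
unweighted = id_rows(datasets)
```

In this toy input, f2 occurs in every dataset and therefore gets weight zero under the assumed formula, while the id-only representation still lists it.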

SWLab-ICUFF/DatasetDescriptions
