This project contains code for summarizing large amounts of linked open data.
- The `./setup/` folder contains files for setting up the library. The `settings.config` file sets compiler flags and specifies the path to a Boost installation. Make sure this path refers to a valid installation of Boost. The `setup_experiments.sh` file can be run to set up the library (see the example below).
  - Note that `setup_experiments.sh` can take a `-y` parameter to skip all user input by answering `y` to everything.
  - If the path to the Boost installation does not exist, `setup_experiments.sh` will ask whether it should install Boost automatically.
- Now `setup_experiments.sh` should have created a folder named after the hash of the current git commit. This hash is also printed by `setup_experiments.sh` during setup. This folder contains compiled code (for C++) along with the source code (for both C++ and Python) in the `<hash>/code/` folder. The `<hash>/scripts/` folder contains shell scripts (along with corresponding config files) that in turn can create job scripts to run via a slurm system. These job scripts can also be run locally as regular shell scripts.
  - These scripts can all also be run with a `-y` parameter.
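A minimal sketch of a fresh setup, assuming it is run from the repository root (the exact prompts depend on your `settings.config`):

```sh
# Check the Boost path and compiler flags before running setup.
"${EDITOR:-nano}" ./setup/settings.config

# Run the setup; -y answers y to all prompts (e.g. installing Boost).
./setup/setup_experiments.sh -y
```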
- While setting up, if anaconda was available, an anaconda environment should have been created in `<hash>/code/python/.conda`. This environment should contain a working Python setup that has our Python code available as a library. Alternatively, our Python code can be installed as a library in any environment by running `pip install -e .` in the `<hash>/code/python/` folder (see the example below).
- With our Python library properly installed, there are several ways to run our code:
  - In the `<hash>/scripts/` folder, set the settings in the config files and run any of the scripts to run parts of the experiment. The `<hash>/scripts/run_all.sh` script will run the full bisimulation pipeline, along with optionally plotting results and serializing the output to RDF (see `run_all.config`).
  - With our Python library installed, you can import a Python interface for running our code via `from summary_loader.summary_interface import SummaryInterface`. This interface allows for specifying a dataset, changing experiment settings, and running experiments.
  - The compiled C++ code and Python scripts in `<hash>/code/` can also be run directly.
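The manual installation route, assuming an existing Python environment:

```sh
# Install our Python code as an editable library into the active environment.
cd <hash>/code/python/
pip install -e .

# Sanity check: the experiment interface should now be importable.
python -c "from summary_loader.summary_interface import SummaryInterface"
```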
The compiled C++ programs are located in `<hash>/code/bin/`. A copy of their source code is available in `<hash>/code/src/`.
`preprocessor`: This program takes in an n-triples graph and splits off the IRIs from the topology. Literal values are encoded as one global blank node, and some values are treated differently based on the settings. An example invocation is shown after the flag list.
- Parameters
  - The first positional parameter is the path to the n-triples file to be processed.
- Flags
  - `--skipRDFlists`: This flag specifies whether RDF lists should be ignored. This may be useful, as RDF lists, due to their chain-like structure, can lead to very deep summaries.
  - `--skip_literals`: This flag specifies whether triples with literal values should be ignored.
  - `--types_to_predicates`: This flag specifies whether triples of the form `<subject> <rdf:type> <object>` should be encoded as `<subject> <object> "_:rdfTypeNode"`.
  - `--laundromat`: This flag should be set only for the LOD Laundromat dataset, since it uses a (rudimentary) trig file.
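A hypothetical invocation, assuming the program is run from `<hash>/code/bin/` and that `./data/my_dataset.nt` is the dataset to process:

```sh
# Split the IRIs from the topology, ignoring RDF lists and
# encoding rdf:type objects as predicates.
./preprocessor ./data/my_dataset.nt --skipRDFlists --types_to_predicates
```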
`bisimulator`: This program computes the partition refinement over the vertex set of the input graph. It also generates the "refines" edges between subsequent partitions of the refinement process. An example invocation is shown below.
- Parameters
  - The first positional parameter is the mode in which the program runs. Currently only `run_k_bisimulation_store_partition_condensed_timed` is properly implemented.
  - The second positional parameter is the binary graph representation (generated by `preprocessor`) that is to be used as input.
- Flags
  - `--output`: This optional flag allows one to change the output directory (e.g. `--output=./path/to/output/directory/`).
  - `--typed_start`: This flag specifies whether the bisimulation should start by splitting on `rdf:type` or whether it should start with all vertices partitioned together.
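A hypothetical invocation; the binary graph path and output directory are placeholders:

```sh
# Run the condensed, timed k-bisimulation on the preprocessed graph,
# splitting on rdf:type first and writing to a chosen output directory.
./bisimulator run_k_bisimulation_store_partition_condensed_timed \
    ./experiments/my_dataset/binary_graph \
    --typed_start --output=./experiments/my_dataset/
```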
`create_condensed_summary_graph_from_partitions`: This program takes in the computed partitions and creates the data edges between subsequent partitions. It also creates some binary mapping files, so that each block gets a unique identifier (as opposed to reusing freed identifiers for different blocks) and each block has a known interval for when it exists. It also computes explicitly which singletons were created when splitting blocks (this was only implicitly encoded by the `bisimulator`).
- Parameters
  - The first positional parameter specifies what directory to read outputs from. This should be the same as the output directory of the `bisimulator`.
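Continuing the hypothetical example, the program reads from the `bisimulator` output directory:

```sh
# Create the data edges and block mapping files from the stored partitions.
./create_condensed_summary_graph_from_partitions ./experiments/my_dataset/
```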
The Python library is located in `<hash>/code/python/`. The source code for this library is found in `<hash>/code/python/summary_loader/`.
`graph_stats.py`: This program plots several statistics about the bisimulation process and its output (i.e. the partitions and the edges between them). An example invocation is shown below.
- Parameters
  - The first positional parameter specifies an experiment directory. This should be the same directory as the output of the `bisimulator`.
- Flags
  - `-v`: This flag indicates that the process should be more verbose. This could lead to very large dictionaries being printed.
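A hypothetical invocation, assuming an environment with our library installed and `./experiments/my_dataset/` as the experiment directory:

```sh
# Plot statistics about the bisimulation process; -v prints extra detail.
python <hash>/code/python/summary_loader/graph_stats.py ./experiments/my_dataset/ -v
```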
`serialize_to_ntriples.py`: This Python file serializes the multi summary graph into the RDF n-triples format. An example invocation is shown below.
- Parameters
  - The first positional parameter specifies an experiment directory. This should be the same directory as the output of the `bisimulator`.
  - The second positional parameter specifies how IRIs for summary block nodes are created. It should be one of `id_set`, `iri_set`, or `hash`. The `hash` setting is recommended, as it prevents extremely large IRIs from being produced.
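A hypothetical invocation using the recommended `hash` IRI scheme:

```sh
# Serialize the multi summary graph to RDF n-triples.
python <hash>/code/python/summary_loader/serialize_to_ntriples.py ./experiments/my_dataset/ hash
```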
`summary_interface.py`: This Python script defines an interface that can be used to work with the experiments. It can be used to run or queue experiments and to check on the status of an experiment (e.g. whether the multi summary is finished).
Except for the `run_all.sh` script, all scripts have settings for setting up a slurm script. We describe the non-slurm parameters below.
`run_all.sh`: This script is meant to run the other scripts sequentially. It will always run the `preprocessor.sh`, `bisimulator.sh`, and `summary_graphs_creator.sh` scripts. Depending on its settings, `results_plotter.sh` and `serializer.sh` can also be run. An example invocation is shown below.
- Parameters
  - The first positional parameter specifies a path to an n-triples dataset.
- Flags
  - `-y`: Setting this flag automatically answers all requested user input with `y`.
- Settings
  - `plot_statistics` (default: `true`): This setting specifies whether the `results_plotter.sh` script should be executed.
  - `serialize_to_ntriples` (default: `true`): This setting specifies whether the `serializer.sh` script should be executed.
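A hypothetical end-to-end run from the `<hash>/scripts/` folder (the dataset path is a placeholder):

```sh
# Run the full pipeline non-interactively; whether plotting and
# serialization also run is controlled by run_all.config.
./run_all.sh ./data/my_dataset.nt -y
```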
`preprocessor.sh`: This script first creates a directory `<hash>/<dataset name>/` for the experiment. It then sets up a slurm-compatible shell script that runs the `preprocessor` program (directly or via slurm) in the specified directory. A sketch of its settings is shown below.
- Parameters
  - The first positional parameter specifies a path to an n-triples dataset.
- Flags
  - `-y`: Setting this flag automatically answers all requested user input with `y`.
- Settings
  - `skipRDFlists` (default: `false`): This setting specifies whether the flag should be set to ignore RDF lists.
  - `skip_literals`: This setting specifies whether the flag should be set to ignore literals.
  - `laundromat` (default: `false`): This setting sets the flag required for the LOD Laundromat dataset.
  - `types_to_predicates` (default: `false`): This setting sets the flag for encoding RDF-type objects as predicates.
  - `use_lz4` (default: `false`): This setting should be set to `true` when dealing with a compressed `.nt.lz4` file.
  - `lz4_command` (default: `/usr/local/lz4`): This should specify a path to the `lz4` command, if it is required.
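The settings live in the script's config file in `<hash>/scripts/`; the shell-style `key=value` syntax below is an assumption, so check the actual config file for the exact format:

```sh
# Hypothetical preprocessor config contents, one assignment per setting.
skipRDFlists=false
skip_literals=false
laundromat=false
types_to_predicates=true
use_lz4=false
lz4_command=/usr/local/lz4
```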
`bisimulator.sh`: This script takes in an experiment directory and sets up a slurm-compatible shell script that runs the `bisimulator` program (directly or via slurm) in the specified directory.
- Parameters
  - The first parameter specifies a directory to read the preprocessed graph from. It will also use this directory to write its output to.
- Flags
  - `-y`: Setting this flag automatically answers all requested user input with `y`.
- Settings
  - `bisimulation_mode` (default: `run_k_bisimulation_store_partition_condensed_timed`)
  - `typed_start` (default: `true`)
`summary_graphs_creator.sh`: This script takes in an experiment directory and sets up a slurm-compatible shell script that runs the `create_condensed_summary_graph_from_partitions` program (directly or via slurm) in the specified directory.
- Parameters
  - The first parameter specifies a directory to read the bisimulation output (refined partition) from. It will also use this directory to write its output to.
- Flags
  - `-y`: Setting this flag automatically answers all requested user input with `y`.
`results_plotter.sh`: This script takes in an experiment directory and sets up a slurm-compatible shell script that runs the `graph_stats.py` program (directly or via slurm) in the specified directory.
- Parameters
  - The first parameter specifies a directory to read the multi summary from. It will also use this directory to write its output to.
- Flags
  - `-y`: Setting this flag automatically answers all requested user input with `y`.
`serializer.sh`: This script takes in an experiment directory and sets up a slurm-compatible shell script that runs the `serialize_to_ntriples.py` program (directly or via slurm) in the specified directory.
- Parameters
  - The first parameter specifies a directory to read the multi summary from. It will also use this directory to write its output to.
- Flags
  - `-y`: Setting this flag automatically answers all requested user input with `y`.
- Settings
  - `iri_type` (default: `hash`): This setting specifies how IRIs for summary block nodes are created. It should be one of `id_set`, `iri_set`, or `hash`. The `hash` setting is recommended, as it prevents extremely large IRIs from being produced.
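For reference, a hypothetical manual run of the pipeline that `run_all.sh` automates, from the `<hash>/scripts/` folder (the dataset name and paths are placeholders):

```sh
# preprocessor.sh creates the <hash>/my_dataset/ experiment directory;
# the later scripts all read from and write to that same directory.
./preprocessor.sh ./data/my_dataset.nt -y
./bisimulator.sh <hash>/my_dataset/ -y
./summary_graphs_creator.sh <hash>/my_dataset/ -y
./results_plotter.sh <hash>/my_dataset/ -y
./serializer.sh <hash>/my_dataset/ -y
```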