
Scalable summarization

This project contains code for summarizing large volumes of linked open data.

Setup and run

  1. The ./setup/ folder contains files for setting up the library. The settings.config file sets compiler flags and specifies the path to a Boost installation; make sure this path refers to a valid Boost installation. Run setup_experiments.sh to set up the library.
    • Note that setup_experiments.sh accepts a -y flag, which skips all user prompts by answering y to each.
    • If the path to the Boost installation does not exist, setup_experiments.sh will offer to install Boost automatically.
  2. setup_experiments.sh creates a folder named after the hash of the current git commit (this hash is also printed during setup). The folder contains the compiled code (for C++) along with the source code (for both C++ and Python) in the <hash>/code/ folder. The <hash>/scripts/ folder contains shell scripts (along with corresponding config files) that in turn create job scripts to run via a Slurm system. These job scripts can also be run locally as regular shell scripts.
    • All of these scripts also accept the -y flag.
  3. If Anaconda was available during setup, an Anaconda environment was created in <hash>/code/python/.conda. This environment contains a working Python setup with our Python code available as a library. Alternatively, our Python code can be installed as a library in any environment by running pip install -e . in the <hash>/code/python/ folder.
  4. With our Python library properly installed, there are several ways to run our code (see the example session after this list):
    • In the <hash>/scripts/ folder, set the settings in the config files and run any of the scripts to execute parts of the experiment. The <hash>/scripts/run_all.sh script runs the full bisimulation pipeline, along with optionally plotting results and serializing the output to RDF (see run_all.config).
    • With our Python library installed, you can import a Python interface for running our code via from summary_loader.summary_interface import SummaryInterface. This interface allows for specifying a dataset, changing experiment settings, and running experiments (see the sketch after this list).
    • The compiled C++ code and Python scripts in <hash>/code/ can also be run directly.
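
As a reference, a minimal end-to-end session might look as follows, assuming the commands are run from the repository root. The dataset path and the abc1234 hash are placeholders; the actual hash is printed by setup_experiments.sh.

```bash
# Run the setup, answering all prompts with y.
./setup/setup_experiments.sh -y

# setup_experiments.sh prints the commit hash it used;
# "abc1234" below is a placeholder for that hash.
cd abc1234

# Optionally install the Python code as a library into the active environment.
pip install -e ./code/python/

# Run the full bisimulation pipeline on a dataset (path is a placeholder).
./scripts/run_all.sh /path/to/dataset.nt
```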
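
The Python interface can be driven as in the sketch below. Only the import line is taken from this README; the constructor and the commented-out calls are hypothetical placeholders for an API that, per the description above, supports specifying a dataset, changing settings, and running experiments.

```bash
# Drive the experiments from Python instead of the shell scripts.
python - <<'EOF'
from summary_loader.summary_interface import SummaryInterface  # import path from this README

# Everything below is a hypothetical sketch; see summary_interface.py
# for the actual method names.
interface = SummaryInterface()                   # constructor signature assumed
# interface.set_dataset("/path/to/dataset.nt")   # hypothetical
# interface.run_experiment()                     # hypothetical
EOF
```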

The code

C++

The compiled C++ programs are located in <hash>/code/bin/. A copy of their source code is available in <hash>/code/src/. An example invocation chain follows the list below.

  • preprocessor: This program takes an N-Triples graph and splits the IRIs off from the topology. Literal values are encoded as one global blank node, and some values are treated differently depending on the settings.
    • Parameters
      • The first positional parameter is the path to the N-Triples file to be processed.
    • Flags
      • --skipRDFlists: This flag specifies whether RDF lists should be ignored. This may be useful, as RDF lists, due to their chain-like structure, can lead to very deep summaries.
      • --skip_literals: This flag specifies whether triples with literal values should be ignored.
      • --types_to_predicates: This flag specifies whether triples of the form <subject> <rdf:type> <object> should be encoded as <subject> <object> "_:rdfTypeNode".
      • --laundromat: This flag should be set only for the LOD Laundromat dataset, since that dataset uses a (rudimentary) TriG file.
  • bisimulator: This program computes the partition refinement over the vertex set of the input graph. It also generates the "refines" edges between subsequent partitions of the refinement process.
    • Parameters
      • The first positional parameter is the mode in which the program runs. Currently, only run_k_bisimulation_store_partition_condensed_timed is properly implemented.
      • The second positional parameter is the binary graph representation (generated by preprocessor) that is to be used as input.
    • Flags
      • --output: This optional flag allows changing the output directory (e.g. --output=./path/to/output/directory/).
      • --typed_start: This flag specifies whether the bisimulation should start by splitting on rdf:type or with all vertices partitioned together.
  • create_condensed_summary_graph_from_partitions: This program takes the computed partitions and creates the data edges between subsequent partitions. It also creates some binary mapping files, so that each block has a unique identifier (as opposed to reusing freed identifiers for different blocks) and a known interval during which it exists. It also computes explicitly which singletons were created when splitting blocks (this was only implicitly encoded by the bisimulator).
    • Parameters
      • The first positional parameter specifies the directory to read outputs from. This should be the same as the output directory of the bisimulator.
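
As a rough sketch, the three programs chain as follows, assuming they are run from inside the <hash> directory. All file paths are placeholders, and the boolean use of --typed_start shown here is an assumption; only the positional parameters and flag names are taken from this README.

```bash
# 1. Split the IRIs off from the topology of an N-Triples graph.
./code/bin/preprocessor /path/to/dataset.nt

# 2. Compute the partition refinement over the preprocessed binary graph
#    (the preprocessor's output file name may differ from this placeholder).
./code/bin/bisimulator run_k_bisimulation_store_partition_condensed_timed \
    /path/to/binary_graph --output=./experiment_dir/ --typed_start

# 3. Create the condensed summary graph from the computed partitions.
./code/bin/create_condensed_summary_graph_from_partitions ./experiment_dir/
```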

Python

The Python library is located in <hash>/code/python/. The source code for this library is found in <hash>/code/python/summary_loader/. Example invocations follow the list below.

  • graph_stats.py: This program plots several statistics about the bisimulation process and its output (i.e. the partitions and the edges between them).
    • Parameters
      • The first positional parameter specifies an experiment directory. This should be the same directory as the output of the bisimulator.
    • Flags
      • -v This flag indicates that the process should be more verbose. This could lead to very large dictionaries being printed.
  • serialize_to_ntriples.py: This Python file serializes the multi summary graph into the RDF N-Triples format.
    • Parameters
      • The first positional parameter specifies an experiment directory. This should be the same directory as the output of the bisimulator.
      • The second positional parameter specifies how IRIs for summary block nodes are created. It should be one of id_set, iri_set, or hash. The hash setting is recommended, as it prevents extremely large IRIs from being produced.
  • summary_interface.py: This Python script defines an interface that can be used to work with the experiments. It can be used to run or queue experiments and check on the status of an experiment (e.g. whether the multi summary is finished).
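
The two standalone scripts above might be invoked as in the sketch below, assuming a working Python environment and the <hash>/code/python/ folder as working directory; the experiment directory is a placeholder.

```bash
# Plot statistics about the bisimulation process and its output (verbose).
python summary_loader/graph_stats.py ./experiment_dir/ -v

# Serialize the multi summary graph to N-Triples with hashed block IRIs.
python summary_loader/serialize_to_ntriples.py ./experiment_dir/ hash
```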

Bash scripts

Except for the run_all.sh script, all scripts have settings for setting up a Slurm script. We describe only the non-Slurm parameters below; a sample config sketch and a manual invocation sequence follow the list.

  • run_all.sh: This script runs the other scripts sequentially. It always runs the preprocessor.sh, bisimulator.sh and summary_graphs_creator.sh scripts. Depending on its settings, it can also run results_plotter.sh and serializer.sh.
    • Parameters
      • The first positional parameter specifies a path to an N-Triples dataset.
    • Flags
      • -y Setting this flag automatically answers all requested user input with y.
    • Settings
      • plot_statistics (default: true) This setting specifies whether the results_plotter.sh script should be executed.
      • serialize_to_ntriples (default: true) This setting specifies whether the serializer.sh script should be executed.
  • preprocessor.sh: This script first creates a directory <hash>/<dataset name>/ for the experiment. It then sets up a Slurm-compatible shell script that runs the preprocessor program (directly or via Slurm) in the specified directory.
    • Parameters
      • The first positional parameter specifies a path to an N-Triples dataset.
    • Flags
      • -y Setting this flag automatically answers all requested user input with y.
    • Settings
      • skipRDFlists (default: false) This setting specifies whether the flag should be set to ignore RDF lists.
      • skip_literals This setting specifies whether the flag should be set to ignore literals.
      • laundromat (default: false) This setting sets the flag required for the LOD Laundromat dataset.
      • types_to_predicates (default: false) This setting sets the flag for encoding rdf:type objects as predicates.
      • use_lz4 (default: false) This setting should be set to true when dealing with a compressed .nt.lz4 file.
      • lz4_command (default: /usr/local/lz4) This setting should specify the path to the lz4 command, if it is required.
  • bisimulator.sh: This script takes in an experiment directory and sets up a Slurm-compatible shell script that runs the bisimulator program (directly or via Slurm) in the specified directory.
    • Parameters
      • The first parameter specifies a directory to read the preprocessed graph from. It will also use this directory to write its output to.
    • Flags
      • -y Setting this flag automatically answers all requested user input with y.
    • Settings
      • bisimulation_mode (default: run_k_bisimulation_store_partition_condensed_timed)
      • typed_start (default: true)
  • summary_graphs_creator.sh: This script takes in an experiment directory and sets up a Slurm-compatible shell script that runs the create_condensed_summary_graph_from_partitions program (directly or via Slurm) in the specified directory.
    • Parameters
      • The first parameter specifies a directory to read the bisimulation output (refined partition) from. It will also use this directory to write its output to.
    • Flags
      • -y Setting this flag automatically answers all requested user input with y.
  • results_plotter.sh: This script takes in an experiment directory and sets up a Slurm-compatible shell script that runs the graph_stats.py program (directly or via Slurm) in the specified directory.
    • Parameters
      • The first parameter specifies a directory to read the multi summary from. It will also use this directory to write its output to.
    • Flags
      • -y Setting this flag automatically answers all requested user input with y.
  • serializer.sh: This script takes in an experiment directory and sets up a Slurm-compatible shell script that runs the serialize_to_ntriples.py program (directly or via Slurm) in the specified directory.
    • Parameters
      • The first parameter specifies a directory to read the multi summary from. It will also use this directory to write its output to.
    • Flags
      • -y Setting this flag automatically answers all requested user input with y.
    • Settings
      • iri_type (default: hash) This setting specifies how IRIs for summary block nodes are created. It should be one of id_set, iri_set, or hash. The hash setting is recommended, as it prevents extremely large IRIs from being produced.
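
The settings listed above live in the per-script config files. The exact file format is not specified in this README; a plausible sketch of the preprocessor settings, assuming shell-variable syntax, would be:

```bash
# Hypothetical contents of the preprocessor config file: the setting names
# come from this README, but the file format is an assumption.
skipRDFlists=false         # do not ignore RDF lists
skip_literals=false        # keep triples with literal values
laundromat=false           # not the LOD Laundromat dataset
types_to_predicates=false  # keep rdf:type triples as-is
use_lz4=false              # input is an uncompressed .nt file
lz4_command=/usr/local/lz4 # only consulted when use_lz4 is true
```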
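
Put together, the scripts form the pipeline that run_all.sh automates. A hypothetical manual run, assumed to be invoked from the <hash>/scripts/ directory with placeholder paths, could look like this:

```bash
# Preprocess the dataset; this creates the <hash>/<dataset name>/ directory.
./preprocessor.sh -y /path/to/dataset.nt

# Pass that experiment directory to the remaining scripts.
./bisimulator.sh -y ../my_dataset/
./summary_graphs_creator.sh -y ../my_dataset/

# Optional: plot statistics and serialize the multi summary to N-Triples.
./results_plotter.sh -y ../my_dataset/
./serializer.sh -y ../my_dataset/
```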
