Each of the folders cordis, sdss and oncomx holds the files (seed data, synth data and dev data) for the corresponding dataset. Each folder also contains a tables.json file, which describes the database schema as a JSON structure: table names, column names, column data types and primary/foreign key relationships.

The following is an example of the file structure:

  • dev.json --> the manually generated development dataset
  • seed.json --> the manually generated seed dataset
  • synth.json --> the synthetically generated dataset using the seed query templates
  • tables.json --> a json representation of the schema containing:
    • the database name ("db_id"),
    • free text table names for NLP pipelines ("table_names") e.g. "Stellar spectral line indices" vs "spplines"
    • original table names ("table_names_original") i.e. the table names as they are in the database
    • free text column names for NLP pipelines ("column_names")
    • original column names ("column_names_original") i.e. the column names as they are in the database
    • column data types ("column_types"): time, text or number
    • foreign key relationships ("foreign_keys")
    • primary keys ("primary_keys")
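A minimal sketch of how the keys listed above could be read in Python. It assumes a Spider-style layout, which is not confirmed here: each entry of "column_names_original" is a `[table_index, column_name]` pair (with table index -1 for a `*` placeholder), "column_types" is aligned with the column list, and "foreign_keys" holds pairs of column indices. The `example` schema and the helper names are hypothetical and only illustrate the shape; adapt them if the actual tables.json differs.

```python
import json
from collections import defaultdict


def load_schemas(path):
    """Load tables.json; assumed to hold one schema dict or a list of them."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return data if isinstance(data, list) else [data]


def columns_by_table(schema):
    """Group (column name, column type) pairs under their original table name."""
    tables = schema["table_names_original"]
    grouped = defaultdict(list)
    pairs = zip(schema["column_names_original"], schema["column_types"])
    for (table_idx, column_name), column_type in pairs:
        if table_idx < 0:  # assumed "*" placeholder that belongs to no table
            continue
        grouped[tables[table_idx]].append((column_name, column_type))
    return dict(grouped)


def foreign_key_pairs(schema):
    """Resolve foreign keys (assumed column-index pairs) to 'table.column' names."""
    tables = schema["table_names_original"]
    columns = schema["column_names_original"]

    def qualified(column_idx):
        table_idx, column_name = columns[column_idx]
        return f"{tables[table_idx]}.{column_name}"

    return [(qualified(src), qualified(dst)) for src, dst in schema["foreign_keys"]]


# Hypothetical schema, made up for illustration (not taken from the benchmark).
example = {
    "db_id": "example_db",
    "table_names": ["projects", "project members"],
    "table_names_original": ["projects", "project_members"],
    "column_names": [[-1, "*"], [0, "id"], [0, "title"],
                     [1, "project id"], [1, "start date"]],
    "column_names_original": [[-1, "*"], [0, "id"], [0, "title"],
                              [1, "project_id"], [1, "start_date"]],
    "column_types": ["text", "number", "text", "number", "time"],
    "foreign_keys": [[3, 1]],
    "primary_keys": [1],
}

print(columns_by_table(example))
print(foreign_key_pairs(example))
```

With a real file, `load_schemas("sdss/tables.json")` would return the list of schema dicts to pass to these helpers, assuming the file sits directly inside the dataset folder.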

The three PostgreSQL databases used for this benchmark can be found at the following links: CORDIS, SDSS, OncoMX

PostgreSQL specification:

  • DBMS: PostgreSQL (ver. 9.5.20)
  • Case sensitivity: plain=lower, delimited=exact
  • Driver: PostgreSQL JDBC Driver (ver. 42.5.0, JDBC4.2)
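The case sensitivity entry reflects standard PostgreSQL behaviour: plain (unquoted) identifiers are folded to lowercase, while delimited (double-quoted) identifiers keep their exact case. A small sketch of what that means when building queries from "table_names_original" or "column_names_original"; the helper names are hypothetical and the identifier used is made up for illustration.

```python
def fold_plain_identifier(name: str) -> str:
    """How PostgreSQL treats an unquoted identifier: folded to lowercase."""
    return name.lower()


def quote_identifier(name: str) -> str:
    """Delimit an identifier so its exact case is kept.

    Embedded double quotes are escaped by doubling them.
    """
    return '"' + name.replace('"', '""') + '"'


# Made-up mixed-case name, not taken from the benchmark schemas.
print(fold_plain_identifier("SpecObj"))  # specobj
print(quote_identifier("SpecObj"))       # "SpecObj"
```

Quoting is only needed for names that contain uppercase letters or special characters; an all-lowercase name resolves the same way with or without quotes.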