
Directory Conventions

Tim L edited this page Oct 7, 2013 · 148 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

As described in Conversion process phase: name, csv2rdf4lod organizes third party data according to who provided it (source), what they were talking about (dataset), and when they said it (version). Short identifiers for each of these three aspects are combined to create the URI for the gathered dataset. This organizational scheme allows a data aggregator and curator to bring order to the ad hoc ways that data providers may offer their data.

To be consistent, we organize the physical filesystem directory according to the same logical organization: by source, dataset, and version. The filesystem is constructed during the Conversion process phase: retrieve.

Since the datasets that we gather are organized in the filesystem according to source, dataset, and version, the shell scripts in csv2rdf4lod-automation expect this same structure.

The logical, physical, and operational organization of the aggregated data is consistently oriented around the three essential aspects: the source, dataset, and version identifiers of the dataset being collected, retrieved, converted, and published.

See Conversion process phase: retrieve for a walk-through of creating a directory structure to retrieve a third party's dataset.

What we'll cover

The filesystem directory structure that csv2rdf4lod uses to organize 1) data retrieved from third parties, 2) any modifications that an aggregator may perform, and 3) the RDF conversion outputs.

Following these conventions lets developers orient themselves to what others have already done, and facilitates collaboration among developers who are curating the same data sources. The helper scripts in csv2rdf4lod-automation also assume this directory structure when they perform their activities.

Let's get to it!

To illustrate the directory convention for gathering, manipulating, and publishing third parties' data, we'll exercise the cr-pwd-type.sh script from the deepest directory back to the conversion root.

Running $CSV2RDF4LOD_HOME/bin/cr-pwd-type.sh will tell you what type of csv2rdf4lod directory you are in. At the deepest level, the conversion cockpit is the place where a specific dataset is collected, manipulated, converted, and published; the higher directories simply organize conversion cockpits.

/opt/logd/data/source/worldbank-org/world-development-indicators/version/2011-Jul-29$ cr-pwd-type.sh ; cd ..
cr:conversion-cockpit

/opt/logd/data/source/worldbank-org/world-development-indicators/version$ cr-pwd-type.sh ; cd ..
cr:directory-of-versions

/opt/logd/data/source/worldbank-org/world-development-indicators$ cr-pwd-type.sh ; cd .. 
cr:dataset

/opt/logd/data/source/worldbank-org$ cr-pwd-type.sh ; cd ..
cr:source

/opt/logd/data/source$ cr-pwd-type.sh ; cd .. 
cr:data-root

/opt/logd/data$ cr-pwd-type.sh
Not recognized; see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Directory-Conventions

/opt/logd/data$ cr-pwd-type.sh --types
cr:data-root cr:source cr:directory-of-datasets cr:dataset cr:directory-of-versions cr:conversion-cockpit

See Conversion process phase: retrieve for a tutorial on creating the directory structure from the ground up and retrieving a dataset in preparation for converting it to RDF.

Writing code to work within the directory conventions

This section contains technical notes for how to write automation scripts to work within the directory structure.

Running $CSV2RDF4LOD_HOME/bin/util/is-pwd-a.sh will return yes if the current directory is of the given type or no if it is not. It also lists the possible types with --types.

$CSV2RDF4LOD_HOME/bin/util/pwd-not-a.sh is its complement: it prints a consistent error message when the current directory is not of the given type, and it also lists the possible types with --types.

Adding custom retrieval code to the directory structure

Unfortunately, many data providers do not make it straightforward to obtain their data, rendering a direct URL request inadequate.

If our source identifier, dataset identifier, and version identifier were SSS, DDD, and VVV, respectively, the directory structure becomes:

what-you-want/source/SSS/DDD/version/VVV/source/their.csv
what-you-want/source/SSS/DDD/version/VVV/source/their.csv.pml.ttl

If it took more than a URL request to get their.csv, some custom code might be required. In this case, we recommend setting up shop at:

what-you-want/source/SSS/DDD/src/
what-you-want/source/SSS/DDD/bin/
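The layout above can be created in one go. A minimal sketch, using the placeholder identifiers SSS, DDD, and VVV from this page and a scratch directory so it can be run anywhere:

```shell
# Lay out the retrieval skeleton: the versioned source/ area for the
# retrieved file, plus src/ and bin/ for custom retrieval code.
root=$(mktemp -d)
mkdir -p "$root"/what-you-want/source/SSS/DDD/version/VVV/source
mkdir -p "$root"/what-you-want/source/SSS/DDD/src \
         "$root"/what-you-want/source/SSS/DDD/bin
find "$root/what-you-want" -type d | sort
```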

Then, DDD/version/2source.sh (see Automated creation of a new Versioned Dataset) can invoke the custom scrapers in DDD/bin or DDD/src when it automatically obtains the source organization's data.

Simple checking

Some automation scripts only make sense to run within certain types of directories. For other scripts, it may make sense to do different things according to the directory type from which it is invoked.

Each directory level below the conversion root has a csv2rdf4lod directory type (from root to deepest):

  • cr:data-root (e.g. source/)
  • cr:source (e.g. source/hub-healthdata-gov)
  • cr:directory-of-datasets
  • cr:dataset
  • cr:directory-of-versions
  • cr:conversion-cockpit (e.g. source/hub-healthdata-gov/hospital-compare/version/2012-Jul-17)

(cr:bone and cr:dev are also valid tests)
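The mapping from path shape to directory type can be sketched as a shell function. This is a hypothetical helper for illustration only (the real checks are cr-pwd-type.sh and is-pwd-a.sh); note that cr:directory-of-datasets cannot be told apart by path shape alone, so it is omitted here:

```shell
# Hypothetical helper: infer a csv2rdf4lod directory type from the shape
# of a path, mimicking what cr-pwd-type.sh reports.
guess_cr_type() {
   case "$1" in
      */source/*/*/version/*) echo 'cr:conversion-cockpit' ;;
      */source/*/*/version)   echo 'cr:directory-of-versions' ;;
      */source/*/*)           echo 'cr:dataset' ;;
      */source/*)             echo 'cr:source' ;;
      */source)               echo 'cr:data-root' ;;
      *)                      echo 'not recognized' ;;
   esac
}

guess_cr_type /opt/logd/data/source/worldbank-org/world-development-indicators/version/2011-Jul-29
```

The case branches are ordered from deepest to shallowest, so the most specific pattern wins, just as in the transcript above.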

$CSV2RDF4LOD_HOME/bin/cr-dataset-uri.sh shows a simple boilerplate that can be used to check that a script is running in the expected type of directory. This abstracts away the actual location, consolidating the logic in a single place. Calling $CSV2RDF4LOD_HOME/bin/util/pwd-not-a.sh will print consistent error messages for the given expected directory types.

see='https://github.com/timrdf/csv2rdf4lod-automation/wiki/CSV2RDF4LOD-not-set'
CSV2RDF4LOD_HOME=${CSV2RDF4LOD_HOME:?"not set; source csv2rdf4lod/source-me.sh or see $see"}

# cr:data-root cr:source cr:directory-of-datasets cr:dataset cr:directory-of-versions cr:conversion-cockpit
ACCEPTABLE_PWDs="cr:dataset cr:directory-of-versions cr:conversion-cockpit"
if [ "`${CSV2RDF4LOD_HOME}/bin/util/is-pwd-a.sh $ACCEPTABLE_PWDs`" != "yes" ]; then
   ${CSV2RDF4LOD_HOME}/bin/util/pwd-not-a.sh $ACCEPTABLE_PWDs
   exit 1
fi

TEMP="_"`basename $0``date +%s`_$$.tmp

For example, since $CSV2RDF4LOD_HOME/bin/cr-dataset-uri.sh only works in directories of type cr:dataset, cr:directory-of-versions, or cr:conversion-cockpit, it uses $CSV2RDF4LOD_HOME/bin/util/pwd-not-a.sh to provide the following error output:

bash-3.2$ cr-dataset-uri.sh 

  Working directory does not appear to be a dataset. You can run this from a dataset.
  (e.g. $whatever/source/mySOURCE/myDATASET/).

  Working directory does not appear to be a directory of versions. You can run this from a directory of versions.
  (e.g. $whatever/source/mySOURCE/myDATASET/version/).

  Working directory does not appear to be a conversion cockpit.
  You can run this from a conversion cockpit.
  (e.g. $whatever/source/mySOURCE/myDATASET/version/VVV/).

$CSV2RDF4LOD_HOME/bin/util/cr-trim-logs.sh is an initial example for how to use the is-pwd-a.sh pattern to recursively process the entire data skeleton.

sourceID=`is-pwd-a.sh  cr:bone --id-of source`
datasetID=`is-pwd-a.sh cr:bone --id-of dataset`
versionID=`is-pwd-a.sh cr:bone --id-of version`
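Because the identifiers sit at fixed positions in the path, the same three values can also be read positionally. A hypothetical, self-contained stand-in for the --id-of calls, shown on the example cockpit path from earlier on this page:

```shell
# Pull the source, dataset, and version identifiers out of a
# conversion-cockpit path (source/SSS/DDD/version/VVV) by position.
cockpit=/opt/logd/data/source/worldbank-org/world-development-indicators/version/2011-Jul-29

versionID=$(basename "$cockpit")                  # VVV
datasetPath=$(dirname "$(dirname "$cockpit")")    # .../source/SSS/DDD
datasetID=$(basename "$datasetPath")              # DDD
sourceID=$(basename "$(dirname "$datasetPath")")  # SSS

echo "$sourceID $datasetID $versionID"
```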

Note that the shell scripts provided in csv2rdf4lod-automation are not required in order to use csv2rdf4lod.jar. So if you don't like these directory conventions, feel free to start from scratch and name, retrieve, tweak, convert, and publish your own way; everything ultimately revolves around invoking the jar.

Directory-type sensitive processing

(See this list, too.)

Parts of the pattern:

  • pwd-not-a.sh
  • is-pwd-a.sh
  • cr-pwd.sh

(Note: anything with ANCHOR in it needs to be updated to the new boilerplate)

Recursive processing

Recursive automation scripts are directory-type sensitive, but they also call themselves when invoked at a particular directory type. (Starred * scripts are exemplars of the pattern; fewer (non-zero) pluses + means the script was developed more recently.)

(See this list, too.)

Any script that invokes $0 $* is likely to be recursive.
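The effect of that recursion, descending from the data root to every conversion cockpit, can also be flattened into a single find over the source/SSS/DDD/version/VVV shape. A hypothetical sketch, run against a scratch skeleton so it is self-contained:

```shell
# Visit every conversion cockpit under a data root. Cockpits sit exactly
# four levels below source/ and always under a version/ directory.
root=$(mktemp -d)
mkdir -p "$root/source/worldbank-org/world-development-indicators/version/2011-Jul-29"

find "$root/source" -mindepth 4 -maxdepth 4 -type d -path '*/version/*' |
while read -r cockpit; do
   echo "would process cockpit: $cockpit"
done
```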

Using dryrun.sh in a script that doesn't dry run by default:

dryRun="false"
if [ "$1" == "-n" ]; then
   dryRun="true"
   dryrun.sh $dryrun beginning
   shift
fi
...
dryrun.sh $dryrun ending

Using dryrun.sh in a script that dry runs by default:

write="no"
if [[ "$1" == "-w" || "$1" == "--write" ]]; then
   write="yes"
   shift
else
   dryrun.sh yes beginning
fi
...
if [ "$write" != "yes" ]; then
   dryrun.sh yes ending
fi

Scripts that work from where they are:

bin/dataset/pr-neighborlod.sh handles symlinks and builds up PATH and CLASSPATH:

[ -n "`readlink $0`" ] && this=`readlink $0` || this=$0   # resolve $0 if it is a symlink
HOME=$(cd ${this%/*/*/*} && pwd)                          # three directory levels above the script
export PATH=$PATH`$HOME/bin/install/paths.sh`
export CLASSPATH=$CLASSPATH`$HOME/bin/install/classpaths.sh`
HOME=$(cd ${0%/*} && echo ${PWD%/*})                      # parent of the script's directory
me=$(cd ${0%/*} && echo ${PWD})/`basename $0`             # absolute path to this script

spo-balance.sh:

VSR_HOME=$(cd ${0%/*} && echo ${PWD%/*})
me=$(cd ${0%/*} && echo ${PWD})/`basename $0`

bin/util/install-csv2rdf4lod-dependencies.sh:

this=$(cd ${0%/*} && echo $PWD/${0##*/})                   # absolute path to this script
base=${this%/bin/util/install-csv2rdf4lod-dependencies.sh} # strip the script's known suffix
base=${base%/*}                                            # ...and one more directory level

Setting PATH and CLASSPATH without depending on CSV2RDF4LOD_HOME being set:

# from e.g. bin/cr-retrieve.sh
HOME=$(cd ${0%/*} && echo ${PWD%/*})
export PATH=$PATH`$HOME/bin/util/cr-situate-paths.sh`
export CLASSPATH=$CLASSPATH`$HOME/bin/util/cr-situate-classpaths.sh`
CSV2RDF4LOD_HOME=${CSV2RDF4LOD_HOME:?$HOME}

# from e.g. bin/util/rdf2nt.sh (one extra /*)
HOME=$(cd ${0%/*/*} && echo ${PWD%/*})
export PATH=$PATH`$HOME/bin/util/cr-situate-paths.sh`
export CLASSPATH=$CLASSPATH`$HOME/bin/util/cr-situate-classpaths.sh`
CSV2RDF4LOD_HOME=${CSV2RDF4LOD_HOME:?$HOME}
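The snippets above lean on the same few parameter expansions. An illustrative demo on a sample path (the path itself is made up):

```shell
# ${var##*/} keeps only the last component (like basename);
# ${var%/*} strips one trailing component; ${var%/*/*} strips two.
path=/opt/csv2rdf4lod-automation/bin/util/rdf2nt.sh

echo "${path##*/}"   # rdf2nt.sh
echo "${path%/*}"    # /opt/csv2rdf4lod-automation/bin/util
echo "${path%/*/*}"  # /opt/csv2rdf4lod-automation/bin
```

This is why rdf2nt.sh, which lives one directory deeper than cr-retrieve.sh, needs "one extra /*" in its expansion to land on the same installation root.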

What is next
