Commit

Merge pull request #4 from elixir-europe/simplify-data-discovery-schema
Simplify data discovery schema & Better distribution package
gcornut committed Aug 28, 2018
2 parents 022465e + 8de8cf7 commit a4b7832
Showing 10 changed files with 152 additions and 226 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -17,7 +17,6 @@ __pycache__/
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
9 changes: 4 additions & 5 deletions HOW_TO_RELEASE.md
@@ -1,6 +1,5 @@
1. Publish on master
2. (Re-)build binary package (```./build-centos6/build.sh```)
3. Create tag & release at https://github.com/elixir-europe/plant-brapi-etl-data-lookup-gnpis/releases/new
1. (Re-)build binary package (```./build-centos6/build.sh```)
2. Commit the binary package in `dist`
3. Release on [GitHub](https://github.com/elixir-europe/plant-brapi-etl-data-lookup-gnpis/releases/new)
* Use semantic versioning to name the tag and the release (X.Y.Z)
* Add changelogs in release description
* Upload the binary package file (```./build-centos6/dist-etl.tar.gz```) to the release
* Add changelog in release description
34 changes: 25 additions & 9 deletions README.md
@@ -5,11 +5,35 @@ Elixir plant Breeding API JSON ETL
- **T**ransform extracted data (into Elasticsearch bulk json, into JSON-LD, into RDF)
- **L**oad JSON into Elasticsearch or RDF into a virtuoso

## I. Script requirements
## I. Execution

### From linux binary distribution

You can find a binary distribution of the ETL package in [dist/plant-brapi-etl-data-lookup-gnpis.tar.gz](dist/plant-brapi-etl-data-lookup-gnpis.tar.gz).

First you will need to extract the archive:
```sh
$ tar xzf plant-brapi-etl-data-lookup-gnpis.tar.gz
```

And then you can simply call the main program:
```sh
$ ./etl/main
```

### From source code

Requirements:
- Python version 3.6.x
- Python dependencies (pip install -r requirements.txt)


The `main.py` script can be used to launch the full BrAPI to elasticsearch or BrAPI to virtuoso ETL. To get the usage help run the following command:

```sh
$ python3 main.py
```

## II. Configuration

### ETL process configurations
@@ -43,14 +67,6 @@ The BrAPI endpoint must implement the required calls (also listed in `./config/e
- /brapi/v1/studies/{id}
- /brapi/v1/studies/{id}/germplasm

## III. Execution

The `main.py` script can be used to launch the full BrAPI to elasticsearch or BrAPI to virtuoso ETL. To get the usage help run the following command:

```sh
python3 main.py
```

See [`README-elasticsearch.md`](README-elasticsearch.md) for specific details on BrAPI to elasticsearch ETL.

See [`README-virtuoso.md`](README-virtuoso.md) for specific details on BrAPI to virtuoso ETL.
16 changes: 5 additions & 11 deletions build-centos6/README.md
@@ -1,6 +1,6 @@
# Build self contained package

The folder contains scripts and configuration to build the ETL as a
This folder contains scripts and configuration to build the ETL as a
self contained binary package (with embedded dependencies and python VM)
for CentOS 6.

@@ -16,20 +16,14 @@ installed before running the build script**).
```

This script will:
- (if first time) Build a CentOS 6 Vagrant VM
- (On the first time) Build a CentOS 6 Vagrant VM
- Download & Install CentOS 6
- Install python
- Install dependencies and `pyinstaller`
- Run `pyinstaller` to build a binary distribution on `../etl/main.py`
- Build the binary distribution with `pyinstaller` on `../etl/main.py`
- Package the binaries and configuration files into an archive in [../dist](../dist)

The binary distribution will be available in the `dist-etl.tar.gz`
archive containing the configuration folders (`config`, `sources`).

Once un-packaged, you can run the ETL with:

```sh
./etl/main
```
See [../README.md](../README.md) to check how to use the binary distribution.

You can use `vagrant destroy` to delete the Vagrant VM if you want to
re-generate it from scratch or if you need to free disk space.
30 changes: 18 additions & 12 deletions build-centos6/build.sh
@@ -2,19 +2,25 @@

cd $(dirname $0)

PYTHON_VERSION=$(cat ../.python-version)
DIST_NAME=dist-etl
DIST_PKG="$DIST_NAME".tar
ls "$DIST_PKG"* | xargs -n1 rm -rf
# Should match the version in the Vagrantfile provisioning
PYTHON_VERSION=3.6.4

DIST_DIR=../dist
DIST_NAME=plant-brapi-etl-data-lookup-gnpis

TAR_FILE=${DIST_NAME}.tar
TGZ_FILE=${DIST_NAME}.tar.gz

find ${DIST_DIR} -name "${DIST_NAME}*" | xargs -n1 rm -rf

# Start VM
vagrant up

# Run in VM
COMMANDS=$(cat <<EOF
echo "Preparing environment to python $PYTHON_VERSION...";
echo "Preparing environment to python ${PYTHON_VERSION}...";
cd /vagrant/build-centos6;
pyenv shell $PYTHON_VERSION;
pyenv shell ${PYTHON_VERSION};
echo "Updating Python project dependencies...";
pip install -r ../requirements.txt;
@@ -25,21 +31,21 @@ COMMANDS=$(cat <<EOF
echo "Building a tar.gz ...";
mv dist/main etl;
tar cf "$DIST_PKG" etl;
tar cf ${TAR_FILE} etl;
EOF
)
eval "vagrant ssh -c '$COMMANDS'"
eval "vagrant ssh -c '${COMMANDS}'"

# Copy the distribution package from the VM to local FS
vagrant ssh-config > /tmp/vagrant-ssh-config.txt
scp -F /tmp/vagrant-ssh-config.txt -r default:/vagrant/build-centos6/"$DIST_PKG" ./
scp -F /tmp/vagrant-ssh-config.txt -r default:/vagrant/build-centos6/${TAR_FILE} ${DIST_DIR}/

# Stop VM
vagrant halt &

# Add config files
tar rf "$DIST_PKG" ../config
tar rf "$DIST_PKG" ../sources
gzip "$DIST_PKG"
tar rf ${DIST_DIR}/${TAR_FILE} ../config
tar rf ${DIST_DIR}/${TAR_FILE} ../sources
gzip ${DIST_DIR}/${TAR_FILE}

wait
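
Note on the packaging step: the end of `build.sh` appends `config` and `sources` to the tar copied out of the VM, then gzips the result. The following Python sketch mirrors those last three commands for illustration only; it is not part of this commit, and the function name `package_dist` is hypothetical. It assumes the tar file is uncompressed, as it is in the script.

```python
import gzip
import shutil
import tarfile
from pathlib import Path
from typing import Sequence


def package_dist(tar_path: Path, extra_dirs: Sequence[Path]) -> Path:
    """Append directories to an uncompressed tar, then gzip it (hypothetical helper)."""
    # Mode "a" appends to an existing uncompressed archive, similar to `tar rf`.
    with tarfile.open(tar_path, "a") as tar:
        for directory in extra_dirs:
            # Store each directory under its own name (e.g. "config", "sources").
            tar.add(directory, arcname=directory.name)

    # Compress to <name>.tar.gz and drop the plain tar, as `gzip` does by default.
    tgz_path = tar_path.with_name(tar_path.name + ".gz")
    with open(tar_path, "rb") as src, gzip.open(tgz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    tar_path.unlink()
    return tgz_path
```

Called as `package_dist(Path("../dist/plant-brapi-etl-data-lookup-gnpis.tar"), [Path("../config"), Path("../sources")])`, this would leave a single `.tar.gz` in the `dist` folder, which is what the README now points to.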
110 changes: 35 additions & 75 deletions config/elasticsearch/datadiscovery_mapping.json
@@ -2,7 +2,11 @@
"datadiscovery": {
"dynamic": false,
"_source": {
"includes": ["@id", "@type", "schema:*"]
"includes": [
"@id",
"@type",
"schema:*"
]
},
"properties": {
"@type": {
@@ -33,100 +37,56 @@
"type": "string",
"index": "not_analyzed"
},

"germplasm": {
"type": "object",
"properties": {
"facet": {
"type": "object",
"properties": {
"commonCropName": {
"cropName": {
"type": "string",
"index": "not_analyzed",
"fields": {
"suggest": {
"type": "string",
"index": "not_analyzed"
},
"species": {
"type": "string",
"index": "not_analyzed"
"index": "analyzed",
"search_analyzer": "search_suggester",
"analyzer": "index_suggester"
}
}
},
"search": {
"type": "object",
"properties": {
"cropName": {
"type": "string",
"index": "not_analyzed",
"fields": {
"suggest": {
"type": "string",
"index": "analyzed",
"search_analyzer": "search_suggester",
"analyzer": "index_suggester"
}
}
},
"germplasmList": {
"germplasmList": {
"type": "string",
"index": "not_analyzed",
"fields": {
"suggest": {
"type": "string",
"index": "not_analyzed",
"fields": {
"suggest": {
"type": "string",
"index": "analyzed",
"search_analyzer": "search_suggester",
"analyzer": "index_suggester"
}
}
},
"accession": {
"type": "string",
"index": "not_analyzed",
"fields": {
"suggest": {
"type": "string",
"index": "analyzed",
"search_analyzer": "search_suggester",
"analyzer": "index_suggester"
}
}
"index": "analyzed",
"search_analyzer": "search_suggester",
"analyzer": "index_suggester"
}
}
}
}
},

"trait": {
"type": "object",
"properties": {
"facet": {
"type": "object",
"properties": {
"observationVariableIds": {
},
"accession": {
"type": "string",
"index": "not_analyzed",
"fields": {
"suggest": {
"type": "string",
"index": "not_analyzed"
"index": "analyzed",
"search_analyzer": "search_suggester",
"analyzer": "index_suggester"
}
}
}
}
},

"study": {
"trait": {
"type": "object",
"properties": {
"facet": {
"type": "object",
"properties": {
"dataSet": {
"type": "string",
"index": "not_analyzed"
},
"location": {
"type": "string",
"index": "not_analyzed"
}
}
"observationVariableIds": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
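
Note on the mapping pattern: the fields kept in this mapping follow one shape, a `not_analyzed` string with an analyzed `suggest` sub-field that uses the `index_suggester` and `search_suggester` analyzers. The Python sketch below walks a mapping's `properties` and lists the fields that carry such a sub-field. It is an illustration only; `suggest_fields` and `EXAMPLE_PROPERTIES` are hypothetical names, and the nesting in the example fragment is assumed rather than copied from the final file.

```python
from typing import Dict, Iterator


def suggest_fields(properties: Dict, prefix: str = "") -> Iterator[str]:
    """Yield dotted paths of every field that declares a `suggest` sub-field."""
    for name, spec in properties.items():
        path = prefix + name
        if "suggest" in spec.get("fields", {}):
            yield path + ".suggest"
        # Recurse into nested object properties, if any.
        yield from suggest_fields(spec.get("properties", {}), path + ".")


# Hypothetical fragment: field names come from the diff, the nesting is assumed.
EXAMPLE_PROPERTIES = {
    "germplasm": {
        "type": "object",
        "properties": {
            "cropName": {
                "type": "string",
                "index": "not_analyzed",
                "fields": {
                    "suggest": {
                        "type": "string",
                        "index": "analyzed",
                        "search_analyzer": "search_suggester",
                        "analyzer": "index_suggester",
                    }
                },
            }
        },
    },
    "trait": {
        "type": "object",
        "properties": {
            "observationVariableIds": {"type": "string", "index": "not_analyzed"}
        },
    },
}

print(sorted(suggest_fields(EXAMPLE_PROPERTIES)))
```

A check like this could be run against `datadiscovery_mapping.json` after loading it with `json.load` to confirm which fields remain available for suggestions once the schema is simplified.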