Project state for the TREC-PM 2019 Submission
This is the exact state of the project in which the TREC-PM 2019 submissions were created.
Note that this state can only include the code and the LtR models. The ElasticSearch indices and the document Postgres database are, of course, missing. Also missing are the FastText embeddings used to create document embeddings for LtR features. Those can be recreated as follows:
- Run the BANNER gene tagger from jcore-projects, version >= 2.4, on the Medline/PubMed 2019 baseline.
- Extract the document text from those documents with at least one tagged gene in them. This should be around 8 million documents. The text is the title plus the abstract text (e.g. by using the JCoRe PubMed reader and the JCoRe To TXT consumer in the `DOCUMENT` mode). No postprocessing was applied (it should be done for better models but wasn't done for the embeddings we used).
- Create `FastText` word embeddings with a dimension of 300. We used the `.bin` output for LtR features.
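
The embedding step above could be run with the fastText command-line tool roughly as sketched below. Only the dimension of 300 and the use of the `.bin` output are documented here; the `skipgram` model type, the file names and the `train_embeddings` helper are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: file names are hypothetical, and skipgram is an assumption
# (only the dimension of 300 is documented).
INPUT="gene_documents.txt"        # extracted title + abstract text
OUTPUT="pubmed_gene_embeddings"   # fastText writes $OUTPUT.bin and $OUTPUT.vec

train_embeddings() {
  # $1: input text file, $2: output prefix
  fasttext skipgram -input "$1" -output "$2" -dim 300
}

# Only train if the fasttext binary and the input file are actually present.
if command -v fasttext >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  train_embeddings "$INPUT" "$OUTPUT"
else
  echo "fasttext binary or $INPUT not found; nothing trained" >&2
fi
```

The resulting `.bin` file is the one to use for the LtR features.
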
The databases can be re-created using the components in the `uima` subdirectory.
All UIMA pipelines have been created and run with the JCoRe Pipeline Components in version `0.4.0`.
- Install `ElasticSearch 5.4` and `Postgres >= 9.6`. `Postgres 9.6.13` was used for the experiments.
- Change into the `uima` directory on the command line and execute `./gradlew install-uima-components`. This must run through successfully in order to complete the following steps. Note that Gradle is only used for scripting; the projects are all built with Maven. Thus, check the Maven output for success or failure messages. Gradle may report success despite Maven failing.
- Run the `pm-to-xmi-db-pipeline` and the `ct-to-xmi-db-pipeline` with the `JCoRe Pipeline Runner`. Before you actually run those, check the `pipelinerunner.xml` configuration files in both projects for the number of threads being used. Adapt them to the capabilities of your system, if necessary.
- Configure the `preprocessing` and `preprocessing_ct` pipelines with the `JCoRe Pipeline Builder` to activate nearly all components (the exceptions are explained below). Some are deactivated in this release. Note that there are some components specific to `BANNER` gene tagging and `FLAIR` gene tagging. Use the `BANNER` components; Flair hasn't been used in our submitted runs. You might also leave the `LingScope` and `MutationFinder` components off because those haven't been used either. Configure the `uima/costosys.xml` file in all pipelines to point to your Postgres database. Run the components. They will write the annotation data into the Postgres database. We used multiple machines for this, employing the SLURM scheduler (not required). All in all we had 96 CPU cores available. Processing time was in the hours, much less than a day for PubMed. The processing will accordingly take longer or shorter depending on the resources at your disposal.
- Configure the `pubmed-indexer` and `ct-indexer` projects to work with your ElasticSearch index using the `JCoRe Pipeline Builder`. Execute `mvn package` in both pipeline directories to build the indexing code, which is packaged as a `jar` and automatically put into the `lib` directory of the pipelines. Run the components.
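
Because Gradle may report success even when a Maven build fails, it can help to capture the output of the `install-uima-components` step and search it for Maven's failure banner. The following is a minimal sketch to be run from the `uima` directory; the log file name and the `maven_failed` helper are hypothetical, and it assumes Maven's standard `BUILD FAILURE` line shows up in the Gradle output.

```shell
#!/bin/sh
# Sketch: detect a failed Maven build in a captured Gradle log.
# Assumption: Maven's standard "BUILD FAILURE" banner appears in the output.
LOG="install-uima-components.log"   # hypothetical log file name

maven_failed() {
  # Exits with 0 if the log file given as $1 contains a Maven failure banner.
  grep -q "BUILD FAILURE" "$1"
}

# Only run the build when the Gradle wrapper is present (i.e. inside uima/).
if [ -x ./gradlew ]; then
  ./gradlew install-uima-components 2>&1 | tee "$LOG"
  if maven_failed "$LOG"; then
    echo "Maven reported a failure; see $LOG" >&2
  fi
fi
```

If the check reports a failure, fix the failing Maven project before continuing with the pipeline steps.
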
If all steps have been performed successfully, the indices should now be present in your ElasticSearch instance. To run the experiments, also configure the `<repository root>/config/costosys.xml` file to point to your database. Then run the `at.medunigraz.imi.bst.trec.LiteratureArticlesExperimenter` and `at.medunigraz.imi.bst.trec.ClinicalTrialsExperimenter` classes.
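
One possible way to start the two classes from the command line is the `exec:java` goal of the Exec Maven Plugin. This is only a sketch: it assumes that the repository root is a Maven project and that both classes are on its main classpath, neither of which is stated above. The `run_experimenter` helper is hypothetical. Running the classes from an IDE works as well.

```shell
#!/bin/sh
# Sketch only. Assumptions: the repository root is a Maven project and the
# experimenter classes are on its main classpath.
run_experimenter() {
  # $1: fully qualified name of the class to run
  mvn compile exec:java -Dexec.mainClass="$1"
}

# Only start the runs from a directory that has a pom.xml and with mvn installed.
if [ -f pom.xml ] && command -v mvn >/dev/null 2>&1; then
  run_experimenter at.medunigraz.imi.bst.trec.LiteratureArticlesExperimenter
  run_experimenter at.medunigraz.imi.bst.trec.ClinicalTrialsExperimenter
fi
```
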