Input files for pgxmine #5

Open
rykovan opened this issue Oct 28, 2022 · 4 comments

Comments

@rykovan
Contributor

rykovan commented Oct 28, 2022

@jakelever could you please elaborate on how to prepare input files for pgxmine? There is a pubmed_26736037.bioc.xml file in the "example" folder, but it is not clear how you obtained it. I tried the BioText project, but none of its output files contain that PubMed ID. What am I doing wrong?

$ snakemake --cores 1 downloaded.flag
$ snakemake --cores 1 converted.flag
$ snakemake --cores 1 pubtator_downloaded.flag
$ snakemake --cores 1 pubtator.flag
$ cd biocxml
$ grep '26736037' *.xml #nothing
$ cd ../pubtator
$ grep '26736037' *.xml #nothing
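For reference, the grep check above can also be done structurally. This is a minimal sketch, assuming the files follow the standard BioC layout (`<document>` elements with an `<id>` child); the function name `files_containing_pmid` is hypothetical, and the extra lookup of an infon with key "pmid" is an assumption about how full-text documents might record their PubMed ID.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def files_containing_pmid(directory, pmid):
    """Return names of BioC XML files in `directory` holding a document for `pmid`."""
    hits = []
    for path in sorted(Path(directory).glob("*.xml")):
        with open(path, "rb") as handle:
            # iterparse streams the file, so large collections are not loaded whole
            for _event, elem in ET.iterparse(handle, events=("end",)):
                if elem.tag != "document":
                    continue
                doc_id = (elem.findtext("id") or "").strip()
                # Assumption: the PMID may also be stored in an infon with key "pmid"
                infon_pmids = {
                    (infon.text or "").strip()
                    for infon in elem.iter("infon")
                    if infon.get("key") == "pmid"
                }
                found = doc_id == pmid or pmid in infon_pmids
                elem.clear()  # release the parsed document before moving on
                if found:
                    hits.append(path.name)
                    break
    return hits
```

Calling `files_containing_pmid("biocxml", "26736037")` would return an empty list when no file holds that document, which matches what grep reported here.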

I'm trying to run pgxmine with one of the files output by BioText, but the result is empty:

$ ls -l example1
pubmed_test.bioc.xml -> ../../biotext/pubtator/pmc_baseline.oa_comm_xml.PMC008xxxxxx.baseline.2022-09-03_36.bioc.xml
$ python findPGxSentences.py --inBioc example1/pubmed_test.bioc.xml \
    --filterTermsFile pgx_filter_terms.txt \
    --outBioc example1/pubmed_test.sentences.bioc.xml

$ python getRelevantMeSH.py --inBioc example1/pubmed_test.bioc.xml \
    --outJSONGZ example1/pubmed_test.mesh.json.gz

$ python createKB.py \
    --trainingFiles data/annotations.variant_star_rs.bioc.xml,data/annotations.variant_other.bioc.xml \
    --inBioC example1/pubmed_test.sentences.bioc.xml \
    --selectedChemicals data/selected_chemicals.json \
    --dbsnp data/dbsnp_selected.tsv \
    --variantStopwords stopword_variants.txt \
    --genes data/gene_names.tsv \
    --relevantMeSH example1/pubmed_test.mesh.json.gz  \
    --outKB example1/pubmed_test.kb.tsv

$ python filterAndCollate.py \
    --inData example1 \
    --outUnfiltered example1/mini_unfiltered.tsv \
    --outCollated example1/mini_collated.tsv \
    --outSentences example1/mini_sentences.tsv

Output:

+ python findPGxSentences.py --inBioc example1/pubmed_test.bioc.xml --filterTermsFile pgx_filter_terms.txt --outBioc example1/pubmed_test.sentences.bioc.xml
Found 0 candidate sentences
+ python getRelevantMeSH.py --inBioc example1/pubmed_test.bioc.xml --outJSONGZ example1/pubmed_test.mesh.json.gz
Loaded PMIDs from corpus file...
Searching for MeSH terms in:  ['Adolescent', 'Adult', 'Aged', 'Birth Cohort', 'Child', 'Child, Preschool', 'Infant', 'Infant, Newborn', 'Middle Aged', 'Pediatrics', 'Young Adult']

Found 0 PubMed ID(s) with relevant MeSH terms
+ python createKB.py --trainingFiles data/annotations.variant_star_rs.bioc.xml,data/annotations.variant_other.bioc.xml --inBioC example1/pubmed_test.sentences.bioc.xml --selectedChemicals data/selected_chemicals.json --dbsnp data/dbsnp_selected.tsv --variantStopwords stopword_variants.txt --genes data/gene_names.tsv --relevantMeSH example1/pubmed_test.mesh.json.gz --outKB example1/pubmed_test.kb.tsv
Loaded chemical, gene and variant data
Loaded mesh PMIDs for pediatric/adult terms
Creating classifier for star_rs
Predicted 0 association(s) for star_rs variants
Creating classifier for other
Predicted 0 association(s) for other variants
+ python filterAndCollate.py --inData example1 --outUnfiltered example1/mini_unfiltered.tsv --outCollated example1/mini_collated.tsv --outSentences example1/mini_sentences.tsv
Found 1 PubMed files
Found 0 PMC files
0 records filtered to 0 sentences and collated to 0 chemical/variant associations
Written to example1/mini_sentences.tsv and example1/mini_collated.tsv
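
When every step reports zero, it can help to look at what the input file actually contains. The sketch below counts documents, passages and annotation types in one BioC XML file. It assumes annotations carry their entity type in an infon with key "type", as PubTator-style BioC output commonly does; which annotation types pgxmine needs is not stated in this thread, and `summarize_bioc` is a hypothetical helper, not part of pgxmine.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_bioc(path):
    """Return (documents, passages, {annotation type: count}) for one BioC XML file."""
    counts = Counter()
    annotation_types = Counter()
    with open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "annotation":
                # Assumption: the entity type lives in <infon key="type">
                for infon in elem.findall("infon"):
                    if infon.get("key") == "type":
                        annotation_types[(infon.text or "").strip()] += 1
            elif elem.tag == "passage":
                counts["passages"] += 1
            elif elem.tag == "document":
                counts["documents"] += 1
                elem.clear()  # free memory once the document is counted
    return counts["documents"], counts["passages"], dict(annotation_types)
```

A file with documents but an empty annotation dictionary would suggest it has not been through the annotation step, which would be one possible explanation for zero candidate sentences.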
@rykovan
Contributor Author

rykovan commented Oct 30, 2022

I tried running it with Snakemake as described in the README, but it creates empty (header-only) pgxmine/test_working/pgxmine_* files:

$ MODE=test snakemake --cores 1

Running it in full mode also produces empty files:

$ MODE=full BIOTEXT=../biotext/biocxml snakemake --cores 10
$ cat working/pgxmine_sentences.tsv | wc -l
1
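
The same header-only check can be run over all output files at once. A minimal sketch, assuming each output is a TSV whose first line is the header; `data_row_counts` is a hypothetical helper and the default pattern simply mirrors the pgxmine_* file names mentioned above.

```python
from pathlib import Path


def data_row_counts(directory, pattern="pgxmine_*.tsv"):
    """Map each matching TSV file name to its number of rows after the header."""
    result = {}
    for path in sorted(Path(directory).glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            non_empty = sum(1 for line in handle if line.strip())
        # Subtract the header line; a completely empty file counts as 0 rows
        result[path.name] = max(non_empty - 1, 0)
    return result
```

With the run above, `data_row_counts("working")` would report 0 for pgxmine_sentences.tsv, since `wc -l` showed only the header line.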

@jakelever
Owner

Hey, I've added some documentation on how the single test file was created. I've also merged the example and test_data folders so there is a single test file used by run_example and the Snakemake script. Give it another try.

Your full run shouldn't be producing empty results (and should take a very long time to run). I've put in an error check to make sure that the expected input files are present, so hopefully that will help too.
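
To illustrate the kind of input check described above, here is a rough sketch. It is not the check that was added to the repository; the function name `require_bioc_inputs` and the `*.bioc.xml` pattern are assumptions based on the file names used earlier in this thread.

```python
from pathlib import Path


def require_bioc_inputs(directory, pattern="*.bioc.xml"):
    """Return the matching input files, or raise if the directory has none."""
    folder = Path(directory)
    if not folder.is_dir():
        raise FileNotFoundError(f"Input directory not found: {folder}")
    paths = sorted(folder.glob(pattern))
    if not paths:
        # Failing early avoids a run that finishes with header-only outputs
        raise FileNotFoundError(f"No files matching {pattern} in {folder}")
    return paths
```

Failing loudly at this point makes a wrong BIOTEXT path show up as an error rather than as empty output files.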

@rykovan
Contributor Author

rykovan commented Nov 4, 2022

Thank you for your effort! Do you know why the CI/CD test is failing?

@rykovan
Contributor Author

rykovan commented Nov 11, 2022

@jakelever any updates on this? Apparently something is not working properly.
