Pubmed to JSON

Convert PubMed XML to JSON and store it in ArangoDB

This processes the PubMed XML files and stores each PubMed article record twice, once as its original XML string and once as converted JSON, in two separate ArangoDB collections (xml, json). Keeping the XML makes it easier to re-process specific PubMed records into JSON later. Longer term, I'll probably drop the XML storage.
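As a rough illustration of the XML-plus-JSON idea, here is a minimal sketch using only the standard library. It is not the converter used in this repo: the function names `element_to_dict` and `article_to_records`, the `@`/`#text` conventions, the record layout and the sample article are all assumptions made for the example. The only ArangoDB-specific detail is `_key`, the attribute ArangoDB uses as a document key.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed record used only for this example.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">12345</PMID>
    <Article>
      <ArticleTitle>An example title</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def element_to_dict(elem):
    """Recursively convert an XML element into dicts, lists and strings.

    Attributes are stored under "@name", element text under "#text".
    Tail text (mixed content) is ignored in this sketch.
    """
    node = {f"@{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # A repeated tag becomes a list of values.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        # Leaf element without attributes: just return its text.
        return text
    if text:
        node["#text"] = text
    return node


def article_to_records(article_xml):
    """Build one record for the xml collection and one for the json collection."""
    root = ET.fromstring(article_xml)
    pmid = root.findtext("MedlineCitation/PMID")
    xml_record = {"_key": pmid, "xml": article_xml}  # raw string, kept for re-processing
    json_record = {"_key": pmid, "doc": element_to_dict(root)}
    return xml_record, json_record
```

Because both records share the same key, a single article can be looked up in the xml collection and converted again without re-reading the source file.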

With 10 processes running on a 32-core Xeon Ubuntu server with 96 GB of RAM, I get about 1800 docs per second loaded. That would probably go up to 3000+ if the XML for each PubMed record were not stored.
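To put those rates in perspective, the small helper below turns a record count and a load rate into hours. The function name `load_hours` and the record count are hypothetical placeholders, not figures from this project; only the two rates come from the paragraph above.

```python
def load_hours(total_docs, docs_per_second):
    """Return the hours needed to load total_docs at a steady rate."""
    return total_docs / docs_per_second / 3600


# Hypothetical record count, chosen only to make the arithmetic concrete.
TOTAL_DOCS = 10_000_000

with_xml = load_hours(TOTAL_DOCS, 1800)     # rate measured while storing XML and JSON
without_xml = load_hours(TOTAL_DOCS, 3000)  # estimated rate without XML storage

print(f"storing XML + JSON: {with_xml:.2f} h")
print(f"storing JSON only:  {without_xml:.2f} h")
```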

Setup

Install Poetry
Install ArangoDB
Copy sample.env to .env and update the env vars
Run poetry install
Set up download of the PubMed XML files to (I use lftp to mirror the files locally)
Run main.py to start processing the baseline files (using multiprocessing) and then the update files one at a time
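The last step above (baseline files in parallel, then update files one at a time) can be sketched roughly as follows. This is not the code in main.py: `process_file`, `update_order` and `load_all` are hypothetical names, and applying update files in ascending file-name order is an assumption, made because later update files may revise records from earlier ones.

```python
from multiprocessing import Pool


def process_file(path):
    """Hypothetical stand-in for parsing one PubMed XML file and loading its records."""
    return path


def update_order(paths):
    """Return update files in ascending file-name order (an assumption of this sketch)."""
    return sorted(paths)


def load_all(baseline_files, update_files, processes=10):
    """Load baseline files in parallel, then update files sequentially."""
    # Baseline files do not depend on each other, so they can be
    # spread across a pool of worker processes.
    with Pool(processes=processes) as pool:
        done = pool.map(process_file, baseline_files)

    # Update files are handled one at a time, after the baseline is finished.
    for path in update_order(update_files):
        done.append(process_file(path))
    return done
```

When running this as a script, call `load_all(...)` from inside an `if __name__ == "__main__":` block so that worker processes can import the module safely.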

belbio/pubmed2json
