UMLS thesaurus #1206
Replies: 9 comments
-
In general I'd like to do more to support medical text processing -- so, maybe! However, at first glance it seems that the UMLS data isn't easy to license? This makes me much less inclined to include native functionality for it in the main library. What do you need, mostly? If you want to annotate texts with concepts from UMLS, I think the
-
Update: Now that spaCy v2.0 is out, this might be a good use case for a custom pipeline extension! https://spacy.io/usage/processing-pipelines#custom-components
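For readers wondering what such a component could look like, here is a minimal sketch. It assumes the spaCy v3 API (`Language.factory`, `PhraseMatcher.add(key, [docs])`); v2.0 registered components differently, by passing the function directly to `nlp.add_pipe`. The `CONCEPTS` table, the `concept_lookup` component name, the `concept_id` extension and the placeholder IDs are all hypothetical and stand in for a properly licensed UMLS resource.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Hypothetical term -> concept ID table. The IDs are placeholders, not real UMLS CUIs.
CONCEPTS = {
    "heart attack": "CUI_PLACEHOLDER_1",
    "aspirin": "CUI_PLACEHOLDER_2",
}

# Custom span attribute that carries the concept ID of each match.
Span.set_extension("concept_id", default=None, force=True)


@Language.factory("concept_lookup")
def create_concept_lookup(nlp, name):
    # Match on lowercase token text so "Aspirin" and "aspirin" both hit.
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    for term in CONCEPTS:
        matcher.add(term, [nlp.make_doc(term)])

    def concept_lookup(doc):
        spans = []
        for match_id, start, end in matcher(doc):
            span = Span(doc, start, end, label="CONCEPT")
            # The match key is the term itself, so it maps back to the table.
            span._.concept_id = CONCEPTS[nlp.vocab.strings[match_id]]
            spans.append(span)
        # doc.ents must not overlap; keep the longest span where matches collide.
        doc.ents = filter_spans(spans)
        return doc

    return concept_lookup


nlp = spacy.blank("en")
nlp.add_pipe("concept_lookup")
doc = nlp("The patient took aspirin after a heart attack.")
for ent in doc.ents:
    print(ent.text, ent._.concept_id)
```

Exact phrase matching like this is only a starting point; a real UMLS lookup would also need normalization, ambiguity handling and far more terms.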
-
Currently I am thinking about writing a component for spaCy that could write annotations to a UIMA XML format. Would this be feasible, and does it make sense? I would like to combine spaCy with Apache cTakes components (e.g. UMLS dictionary lookup), but I am not sure how compatible the annotations are ... An alternative approach would be to use cTakes to create databases from UMLS and re-implement the cTakes database lookup module as a spaCy module.
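To make the database-lookup idea concrete, here is a rough sketch that loads rows in the UMLS `MRCONSO.RRF` layout into SQLite and looks up a term per language. It is not how cTakes does it. It assumes the standard pipe-delimited `MRCONSO.RRF` column order (CUI first, LAT second, STR fifteenth), which should be checked against the documentation of the release in use. The sample rows and their identifiers are made up, and `build_index` and `lookup` are hypothetical helper names.

```python
import sqlite3

# Made-up rows in the MRCONSO.RRF layout (pipe-delimited, trailing pipe).
# Identifiers and source names are placeholders, not real UMLS content.
SAMPLE_ROWS = [
    "CUI0001|ENG|P|L1|PF|S1|Y|A1||||SRC|PT|X1|Heart attack|0|N||",
    "CUI0001|DUT|P|L2|PF|S2|Y|A2||||SRC|PT|X1|Hartaanval|0|N||",
    "CUI0002|ENG|P|L3|PF|S3|Y|A3||||SRC|PT|X2|Aspirin|0|N||",
]

# Zero-based positions of the fields used here (assumed standard column order).
CUI, LAT, STR = 0, 1, 14


def build_index(rows):
    """Load (lowercased term, language, CUI) triples into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE concept (term TEXT, lang TEXT, cui TEXT)")
    conn.execute("CREATE INDEX idx_term ON concept (term, lang)")
    for row in rows:
        fields = row.rstrip("\n").split("|")
        conn.execute(
            "INSERT INTO concept VALUES (?, ?, ?)",
            (fields[STR].lower(), fields[LAT], fields[CUI]),
        )
    conn.commit()
    return conn


def lookup(conn, term, lang="ENG"):
    """Return the distinct concept IDs whose string equals `term` in `lang`."""
    cursor = conn.execute(
        "SELECT DISTINCT cui FROM concept WHERE term = ? AND lang = ?",
        (term.lower(), lang),
    )
    return [row[0] for row in cursor]


conn = build_index(SAMPLE_ROWS)
print(lookup(conn, "Hartaanval", lang="DUT"))
```

Filtering on the language column is what would make the same table usable for Dutch text, provided the licensed release contains Dutch strings for the concepts of interest.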
-
We're actually about to start extending our tool, BioMedICUS, to integrate its pipeline components with spaCy. We are definitely open to collaboration in this endeavor, especially with respect to UMLS integration and decoupling from UIMA XMI CAS uglitude.
-
Currently I am using QuickUMLS for concept extraction, which itself uses spaCy. QuickUMLS is tightly integrated with its dependencies leveldb and simstring, which are not very Windows-compatible. Creating a QuickUMLS extension component for spaCy would require rewriting QuickUMLS. For now I have created a dockerized version of QuickUMLS with JSON over TCP in/out, and I will send QuickUMLS a PR soon. I am doing the same for pyContextNLP at the moment.

Besides using QuickUMLS, I started mapping spaCy's output to an XML format, to couple spaCy to other NLP libraries and programming languages. I first had a look at UIMA XMI CAS, but picked the NAF document format (XML). Here is my WIP of a dockerized version of spaCy with TCP output in NAF. The NAF XML document can be mapped to the CAS namespace within Java (cTakes namespace example). I stopped my attempt to couple spaCy to the cTakes dictionary lookup when I found out that the cTakes dictionary lookup requires a chunker, which I would need for the Dutch language. QuickUMLS was my quick alternative ;).

@GregSilverman Writing the modules as custom pipeline extensions for spaCy would make them more accessible but less reusable.
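As an illustration of the JSON-over-TCP pattern (not the actual dockerized QuickUMLS code), here is a small standard-library sketch: one JSON object per line in, one JSON object per line out. `extract_concepts` is a hypothetical stand-in for the real extractor, and the message fields (`text`, `concepts`, `concept_id`, `start`) and the placeholder ID are assumptions made for this example.

```python
import json
import socket
import socketserver
import threading


def extract_concepts(text):
    """Hypothetical stand-in for the real extractor (e.g. a QuickUMLS call)."""
    terms = {"aspirin": "CUI_PLACEHOLDER_2"}  # placeholder ID, not a real CUI
    lowered = text.lower()
    return [
        {"term": term, "concept_id": cid, "start": lowered.find(term)}
        for term, cid in terms.items()
        if term in lowered
    ]


class JsonLineHandler(socketserver.StreamRequestHandler):
    """Reads newline-delimited JSON requests and answers each with one JSON line."""

    def handle(self):
        for line in self.rfile:
            request = json.loads(line)
            response = {"concepts": extract_concepts(request["text"])}
            self.wfile.write((json.dumps(response) + "\n").encode("utf-8"))


def request_concepts(host, port, text):
    """Client side: send one request line and read one response line."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps({"text": text}) + "\n").encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())


# Demo: serve on a free local port in a background thread, send one request, stop.
server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), JsonLineHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address
result = request_concepts(host, port, "Patient takes aspirin daily.")
print(result)
server.shutdown()
server.server_close()
```

Inside a container the server would bind to a fixed, published port instead of an ephemeral one, and any language with sockets and a JSON parser could act as the client.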
-
@putssander, at this point we are evaluating options. I also thought about using Docker. I'm currently deep in a similar project using a Kubernetes cluster and am just wrapping up local testing of the BioMedICUS -> ElasticSearch piece (we plan on running BioMedICUS, Clamp, cTakes and MetaMap all in parallel once we deploy this remotely). See NLP-ADAPT... Please excuse the lack of proper documentation (it's all in our private Slack channel and will eventually be a full wiki). We should definitely stay in touch regarding this project.
-
@putssander, I discussed this with my colleagues today and will give it a whirl. I am going to make a microservice in Docker that does specific tagged pattern matching in spaCy (based on a gold-standard, manually annotated set of documents). I'll then take the output of that and convert it to properly annotated CAS xmi format (how? at this point, no idea!) and then combine these annotated artifacts with the artifacts selected from the other NLP engines running in NLP-ADAPT, using a tool one of our developers created, called AMICUS. We're planning on evaluating the spaCy results against the TagEx pipeline component in BioMedICUS.
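For the evaluation step, a comparison against a gold-standard set could be sketched as exact-match precision, recall and F1 over annotation spans. This is only one possible scoring scheme, not the method BioMedICUS or AMICUS uses. The span tuples `(start, end, label)`, the label names and the `score_spans` helper are made up for illustration.

```python
def score_spans(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: character offsets plus a label.
gold = {(0, 7, "DRUG"), (20, 32, "PROBLEM"), (40, 45, "DRUG")}
predicted = {(0, 7, "DRUG"), (20, 32, "DRUG")}

precision, recall, f1 = score_spans(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: the second predicted span has the right offsets but the wrong label, so it counts as both a false positive and a miss. Overlap-based or label-agnostic scoring would be alternative design choices.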
-
@GregSilverman Let me know if you make progress with your spaCy integration. I noticed from the AMICUS repository that one of your developers already has extensive knowledge of the CAS xmi approach. I am not sure whether CAS xmi is required as an intermediate format. From the official UIMA website I understand it should be possible to combine the languages within the UIMA C++ framework (not sure how much CAS xmi serialization is done there). I don’t want to dive too deep into UIMA; I prefer to use SCDF Kubernetes + NAF as I described in an earlier comment. Here is another approach to integrating spaCy with UIMA, by exposing spaCy over REST. If you want to try out my approach, let me know and I can help you get it running.
-
Also see scispaCy's UMLS linker: https://github.com/allenai/scispacy#umlsentitylinker-alpha-feature
-
Any thoughts on the possibility of integrating spaCy with the UMLS thesaurus https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/ for classification of medical texts?