Does client send .tei.xml files preferentially? #4

Open
jameshowison opened this issue Jul 21, 2023 · 5 comments
Labels
enhancement New feature or request implemented The issue/feature has been implemented

Comments

@jameshowison

Looking at https://github.com/softcite/tutorials/blob/master/process_all_of_plos.md, it seems that one wants to convert to tei.xml files first, before sending the files to the server.

Looking at the code, though, client.py:197 says "only PDF for the moment" ... is that still true? Reading further it seems that the client prefers to send pdf over the tei.xml? But the annotate method prefers the tei.xml? In the tutorial linked above you use repo-in and I think pass a folder that only has tei files in it?

Context of the question is where I'm up to in the https://github.com/howisonlab/screenit-softcite repo :)

@kermitt2
Collaborator

Hi @jameshowison !

I didn't update the comments in the code, sorry. The client supports PDF and XML as input. The client has no preference: if it finds a PDF in the input repository it calls the PDF processing endpoint, if it finds a TEI file (just checking the extension .tei.xml, nothing more sophisticated) it calls the TEI endpoint, and for other XML it calls the XML endpoint, which uses Pub2TEI to identify and transform the right XML format (among all the supported ones).
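To make the routing rule above concrete, here is a minimal sketch of extension-based dispatch. It is not the client's actual code: the function name `choose_endpoint` and the labels `"pdf"`, `"tei"` and `"xml"` are hypothetical placeholders, since the real endpoint names are not given in this thread.

```python
from typing import Optional


def choose_endpoint(filename: str) -> Optional[str]:
    """Return a hypothetical endpoint label for a file, based only on its extension."""
    if filename.endswith(".pdf"):
        return "pdf"
    # .tei.xml must be tested before the generic .xml suffix,
    # otherwise every TEI file would be routed to the XML endpoint.
    if filename.endswith(".tei.xml"):
        return "tei"
    if filename.endswith(".xml"):
        return "xml"
    # Anything else is ignored in this sketch.
    return None
```

The only subtle point is the order of the checks: a `.tei.xml` file also ends in `.xml`, so the more specific suffix has to be tested first.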

It's possible to pass a repository with a mixture of PDF, TEI XML and other XML formats; the client will make the right calls to the server accordingly. TEI will be fastest, then PDF. Other XML will be slower because each file requires a one-shot application of the XSLT from Pub2TEI, which takes 2-3 seconds just to load. This is why it's largely preferable to transform all the XML in batch with Pub2TEI before software mention extraction: the XSLT is then loaded just once for all files.
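As a rough back-of-the-envelope illustration of why batch conversion matters, the sketch below counts only the XSLT loading time. The 2.5 s figure is an assumed midpoint of the 2-3 seconds quoted above, the function name is hypothetical, and transformation and extraction time are ignored.

```python
# Assumed midpoint of the 2-3 s load time mentioned above; not a measured value.
XSLT_LOAD_SECONDS = 2.5


def xslt_load_overhead(n_files: int, batch: bool) -> float:
    """Seconds spent only on loading the XSLT for n_files XML documents."""
    if n_files <= 0:
        return 0.0
    # One-shot mode reloads the stylesheet for every file; batch mode loads it once.
    loads = 1 if batch else n_files
    return loads * XSLT_LOAD_SECONDS


one_shot = xslt_load_overhead(10_000, batch=False)  # 25000.0 seconds
batched = xslt_load_overhead(10_000, batch=True)    # 2.5 seconds
```

Under these assumptions, 10,000 non-TEI XML files would cost about 25,000 seconds (close to 7 hours) of pure loading overhead when processed one by one, against a single load for a batch run.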

@kermitt2
Collaborator

I updated the comments/readme with 7bec989

I forgot to add the following to answer your question: the client processes all the PDF and XML files it finds. This means that if the same document has both a PDF and an XML full text at the same place in the input repository, the client will process both in parallel, and the last one to update the JSON software annotation file wins. So far I have only really considered the scenario where we have one full-text file per document.

So probably the best would be to process only one version of the full text, preferably in this order for quality reasons: tei, xml, pdf.
We could also add a parameter to manage the preferences.
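Until such a parameter exists, the selection can be done before calling the client. The following is a minimal sketch, assuming that files for the same document share a directory and a base name and differ only in extension. The names `PREFERENCE`, `split_name` and `select_preferred` are hypothetical and are not part of the client.

```python
from pathlib import Path

# Preference order suggested above: TEI first, then other XML, then PDF.
PREFERENCE = (".tei.xml", ".xml", ".pdf")


def split_name(path):
    """Return (document key, extension) for a supported file, else None."""
    p = Path(path)
    for ext in PREFERENCE:  # .tei.xml is tested before .xml on purpose
        if p.name.endswith(ext):
            return str(p.parent / p.name[: -len(ext)]), ext
    return None


def select_preferred(paths):
    """Keep one file per document, choosing the most preferred format."""
    best = {}
    for path in paths:
        parts = split_name(path)
        if parts is None:
            continue  # unsupported extension, skip it
        key, ext = parts
        rank = PREFERENCE.index(ext)
        if key not in best or rank < best[key][0]:
            best[key] = (rank, path)
    return sorted(path for _, path in best.values())


files = ["a/doc1.pdf", "a/doc1.tei.xml", "a/doc2.pdf", "a/doc3.xml", "a/doc3.pdf"]
selected = select_preferred(files)
# selected == ["a/doc1.tei.xml", "a/doc2.pdf", "a/doc3.xml"]
```

The selected files could then be copied or linked into a clean input directory, so that each document is processed only once and no annotation file is overwritten by a second run.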

@jameshowison
Author

jameshowison commented Jul 22, 2023 via email

@kermitt2
Collaborator

Btw I got things working on my laptop (without GPU, M1 Mac troubles) and it was super fast.

Imagine what it would be with a GPU :D

@kermitt2
Collaborator

I have updated software_mentions_client to version 0.1.9 on PyPI to prefer TEI XML files over PDF when both are available.

@kermitt2 kermitt2 added the implemented The issue/feature has been implemented label Aug 1, 2023