Does client send .tei.xml files preferentially? #4
Comments
Hi @jameshowison! Sorry, I hadn't updated the comments in the code. The client supports PDF and XML as input, and it has no preference: if it finds a PDF in the input repository, it calls the PDF processing endpoint; if it finds a TEI file (just checking the extension), it calls the corresponding TEI endpoint. It's possible to pass a repository with a mixture of PDF, TEI XML and other XML formats; the client will make the right calls to the server accordingly. TEI will be fastest, then PDF; other XML will be slower because it requires a one-shot application of the XSLT from Pub2TEI, which takes 2-3 seconds just to load. This is why it's largely preferable to transform all the XML in batch with Pub2TEI before software mention extraction: the XSLT is then loaded just once for all files.
I updated the comments/readme with 7bec989.
I forgot to add the following to answer your question: the client processes all the PDF and XML files it finds, which means that if the same document has both a PDF and an XML full text at the same place in the input repository, it will process both in parallel, and the last one to update the JSON software annotation file wins. So far I have only really considered the scenario where there is one full-text file per document. So the best option would probably be to process only one version of the full text, preferably in this order for quality reasons: TEI, XML, PDF.
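To make the suggested preference concrete, here is a minimal sketch of how one full-text file per document could be selected before anything is sent to the server. It is not the client's actual code: the function names are hypothetical, and it assumes that TEI files end in `.tei.xml` and that the files for one document share a directory and base name.

```python
from collections import defaultdict
from pathlib import Path

# Preference order suggested in the thread: TEI first, then other XML, then PDF.
PREFERENCE = ("tei", "xml", "pdf")


def classify(path):
    """Return 'tei', 'xml', 'pdf', or None, looking only at the file name."""
    name = Path(path).name.lower()
    if name.endswith(".tei.xml"):  # assumption: TEI files use this extension
        return "tei"
    if name.endswith(".xml"):
        return "xml"
    if name.endswith(".pdf"):
        return "pdf"
    return None


def document_key(path, kind):
    """Directory plus base name, with the full-text extension removed."""
    p = Path(path)
    suffix = ".tei.xml" if kind == "tei" else "." + kind
    return str(p.parent / p.name[: -len(suffix)])


def select_preferred(paths):
    """Keep one full-text file per document, following PREFERENCE."""
    by_doc = defaultdict(dict)
    for path in paths:
        kind = classify(path)
        if kind is not None:  # ignore files that are not full texts
            by_doc[document_key(path, kind)].setdefault(kind, path)
    selected = []
    for key in sorted(by_doc):
        candidates = by_doc[key]
        for kind in PREFERENCE:
            if kind in candidates:
                selected.append((candidates[kind], kind))
                break
    return selected


if __name__ == "__main__":
    files = ["in/a/doc1.pdf", "in/a/doc1.tei.xml", "in/a/doc2.xml", "in/a/doc2.pdf"]
    for path, kind in select_preferred(files):
        print(kind, path)
```

Because only one file per document survives the selection, two versions of the same document can no longer race to write the same JSON annotation file. A parameter to manage the preferences could then amount to passing a different order in place of `PREFERENCE`.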
Ah, got it. That was my read of the code as well. Yeah, by default
preferentially doing tei.xml makes sense :) flag to manually choose between
Btw I got things working on my laptop (without GPU, M1 Mac troubles) and it
was super fast.
On Sat, Jul 22, 2023 at 05:14 Patrice Lopez wrote:
[...]
We could also add a parameter to manage the preferences.
Imagine what it would be with a GPU :D
I have updated software_mentions_client to version
Looking at https://github.com/softcite/tutorials/blob/master/process_all_of_plos.md, it seems that one wants to convert to tei.xml files first, before sending the files to the server.
Looking at the code, though, client.py:197 says "only PDF for the moment" ... is that still true? Reading further, it seems that the client prefers to send PDF over tei.xml? But the `annotate` method prefers the tei.xml? In the tutorial linked above you use repo-in and, I think, pass a folder that only has TEI files in it?
Context of the question is where I'm up to in the https://github.com/howisonlab/screenit-softcite repo :)