Does client send .tei.xml files preferentially? #4

jameshowison · 2023-07-21T20:39:58Z

Looking at https://github.com/softcite/tutorials/blob/master/process_all_of_plos.md seems that one wants to convert to tei.xml files first, before sending the file to the server.

Looking at the code, though, client.py:197 says "only PDF for the moment" ... is that still true? Reading further it seems that the client prefers to send pdf over the tei.xml? But the annotate method prefers the tei.xml? In the tutorial linked above you use repo-in and I think pass a folder that only has tei files in it?

Context of the question is where I'm up to in the https://github.com/howisonlab/screenit-softcite repo :)


kermitt2 · 2023-07-22T09:56:47Z

Hi @jameshowison !

I didn't update the comments in the code, sorry. The client supports PDF and XML as input. The client has no preference: if it finds a PDF in the input repository, it calls the PDF processing endpoint; if it finds a TEI file (just checking the extension .tei.xml, nothing more sophisticated), it calls the TEI endpoint; and for other XML it calls the XML endpoint, which uses Pub2TEI to identify and transform the right XML formats (among all the supported ones).

It's possible to pass a repository with a mixture of PDF, TEI XML and other XML formats; the client will make the right calls to the server accordingly. TEI will be fastest, then PDF; other XML will be slower because it requires a one-shot application of the XSLT from Pub2TEI, which takes 2-3 seconds just to load. This is why it's largely preferable to transform all the XML in batch with Pub2TEI before software mention extraction, because the XSLT will then be loaded just once for all files.
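The extension-based routing described above can be sketched roughly as follows. This is only an illustration, not the client's actual code: the function name `pick_endpoint` and the labels `"pdf"`, `"tei"` and `"xml"` are hypothetical stand-ins for the three server endpoints.

```python
from typing import Optional


def pick_endpoint(filename: str) -> Optional[str]:
    """Return a label for the endpoint a file would be sent to.

    The labels are hypothetical placeholders, not the real endpoint names.
    """
    if filename.endswith(".pdf"):
        return "pdf"
    # Check ".tei.xml" before ".xml", since every TEI file name also ends in ".xml".
    if filename.endswith(".tei.xml"):
        return "tei"
    if filename.endswith(".xml"):
        return "xml"  # other XML, converted server-side with Pub2TEI
    return None  # not a supported input, skipped


for name in ["paper.pdf", "paper.tei.xml", "paper.nxml.xml", "notes.txt"]:
    print(name, "->", pick_endpoint(name))
```

The only subtle point is the order of the checks: testing the plain `.xml` suffix first would send TEI files to the slower generic XML route.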

kermitt2 · 2023-07-22T10:13:51Z

I updated the comments/readme with 7bec989

I forgot to add the following to answer your question: the client processes all the PDF and XML files it finds, which means that if the same document has both a PDF and an XML full text at the same place in the input repository, it will process both in parallel, and the last one to update the JSON software annotation file wins. So far I have only really considered the scenario where we have one full-text file per document.

So probably the best would be to process only one version of the full text, preferably in this order for quality reasons: tei, xml, pdf.
We could also add a parameter to manage the preferences.
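Until such a parameter exists, one way to avoid the "last writer wins" situation is to filter the input so that only one version per document is kept, in the order tei, xml, pdf. The sketch below is a hypothetical pre-processing helper, not part of the client; it assumes that the versions of one document sit in the same directory and share a base name (for example `a.pdf` and `a.tei.xml`).

```python
from collections import defaultdict
from pathlib import Path

# Preference order from the comment above: tei, then other xml, then pdf.
# ".tei.xml" must come before ".xml" because TEI names also end in ".xml".
PREFERENCE = (".tei.xml", ".xml", ".pdf")


def split_name(path):
    """Return (document key, extension), or None for unsupported files."""
    p = Path(path)
    for ext in PREFERENCE:
        if p.name.endswith(ext):
            return str(p.parent / p.name[: -len(ext)]), ext
    return None


def select_preferred(paths):
    """Keep a single file per document, following PREFERENCE."""
    by_doc = defaultdict(dict)
    for path in paths:
        parts = split_name(path)
        if parts is None:
            continue  # ignore files that are not full texts
        key, ext = parts
        by_doc[key][ext] = path
    return [
        next(files[ext] for ext in PREFERENCE if ext in files)
        for files in by_doc.values()
    ]


files = ["in/a.pdf", "in/a.tei.xml", "in/b.pdf", "in/c.xml", "in/c.pdf"]
print(select_preferred(files))
```

The selected files could then be copied or linked into a separate input directory, so the client only ever sees one full text per document.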

jameshowison · 2023-07-22T14:11:43Z

Ah, got it. That was my read of the code as well. Yeah, by default preferentially doing tei.xml makes sense :) flag to manually choose between

Btw I got things working on my laptop (without GPU, M1 Mac troubles) and it was super fast.


kermitt2 · 2023-07-22T14:38:20Z

> Btw I got things working on my laptop (without GPU, M1 Mac troubles) and it was super fast.

Imagine what it would be with a GPU :D

kermitt2 · 2023-07-27T22:55:58Z

I have updated software_mentions_client to version 0.1.9 on PyPI to prefer TEI XML files over PDF when both are available.

kermitt2 self-assigned this Jul 22, 2023

kermitt2 added the enhancement New feature or request label Jul 22, 2023

kermitt2 mentioned this issue Jul 27, 2023

software.json files created but no mentions (and no metadata)? howisonlab/screenit-softcite#6

Open

kermitt2 added the implemented The issue/feature has been implemented label Aug 1, 2023
