When the metadata for an article includes references, but only in unstructured form, refextract should be run in the workflow after the individual spider (pipeline.py?).
At the moment refextract is only called if a fulltext is attached, but this won't be the case for all records. And in some cases, even with a fulltext, it is better to start from a list of individual unstructured references than from the complete PDF, where refextract first has to find that list.
As already said through email, I don't think this should be done in hepcrawl (contents of the email follow).
There are several cases that can arise:
1. The publisher makes available a full structured reference
2. The publisher makes available a list of unstructured references
3. The publisher does not make any reference available in the metadata
In case 1., we don't need refextract as Hepcrawl can do the
conversion from the publisher's reference format to ours, whereas in
case 3. there is nothing Hepcrawl can do besides providing the PDF.
So case 2. remains, but I think it would be better to have Hepcrawl
populate the raw references in the record, and run refextract (or in
the future maybe Grobid) in the workflow as is done currently to extract
references from PDF. We should have a task there that does reference
extraction from raw references in case they have been provided but
there are no parsed references. In this way, we cleanly separate the
task of translating between metadata formats (Hepcrawl) and parsing
references (refextract) and in the future we can easily swap refextract
for Grobid when it is mature enough.
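
To make the proposed workflow task concrete, here is a minimal sketch of the idea, not an actual implementation. The record layout (`references` entries holding `raw_refs` and a parsed `reference`), the function name `parse_raw_references`, and `toy_parser` are all assumptions made for illustration. The parser is passed in as a plain callable so that refextract could later be swapped for Grobid without touching the task.

```python
from typing import Any, Callable, Dict, List

# A parser takes one raw reference string and returns zero or more
# structured references. In a real workflow this would wrap refextract
# (or Grobid); here it is only a hypothetical interface.
Parser = Callable[[str], List[Dict[str, Any]]]


def parse_raw_references(record: Dict[str, Any], parser: Parser) -> Dict[str, Any]:
    """Return a copy of the record in which raw-only references are parsed.

    References that already have a structured 'reference' (case 1) or that
    carry no raw text are passed through unchanged.
    """
    updated = []
    for ref in record.get("references", []):
        if ref.get("reference") or not ref.get("raw_refs"):
            updated.append(ref)
            continue
        raw_text = ref["raw_refs"][0]["value"]
        parsed = parser(raw_text)
        if parsed:
            # Keep the raw text next to every structured reference found.
            updated.extend(
                {"raw_refs": ref["raw_refs"], "reference": item} for item in parsed
            )
        else:
            # Nothing could be parsed: keep the raw reference as it was.
            updated.append(ref)
    return {**record, "references": updated}


def toy_parser(text: str) -> List[Dict[str, Any]]:
    """Stand-in parser (assumption): stores the whole string as 'misc'."""
    text = text.strip()
    return [{"misc": [text]}] if text else []
```

Hepcrawl would then only fill `raw_refs` (case 2.), and the workflow would call something like `parse_raw_references(record, toy_parser)` with a real parser in place of the toy one.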