Using refextract for unstructured references #156

fschwenn · 2017-07-07T07:13:33Z

When the metadata for an article include references, but only in unstructured form, refextract should be used in the workflow after the individual spider (pipeline.py?).

At the moment refextract is only called if a fulltext is attached, but this won't be the case for all records. And in some cases, even with a fulltext, it's better to start from a list of individual unstructured references than from the complete PDF, where refextract first has to find that list.

michamos · 2017-07-10T07:26:05Z

As already said by email, I don't think this should be done in hepcrawl (contents of the email follow).

There are several cases that can arise:

1. The publisher makes available a full structured reference
2. The publisher makes available a list of unstructured references
3. The publisher does not make any reference available in the metadata

In case 1., we don't need refextract as Hepcrawl can do the
conversion from the publisher's reference format to ours, whereas in
case 3. there is nothing Hepcrawl can do besides providing the PDF.

So case 2. remains, but I think it would be better to have Hepcrawl
populate the raw references in the record, and run refextract (or in
the future maybe Grobid) in the workflow as is done currently to extract
references from PDF. We should have a task there that does reference
extraction from raw references in case they have been provided but
there are no parsed references. In this way, we cleanly separate the
task of translating between metadata formats (Hepcrawl) and parsing
references (refextract) and in the future we can easily swap refextract
for Grobid when it is mature enough.
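
A minimal sketch of what such a workflow task could look like. Everything here is an assumption for illustration: the function name `parse_raw_references`, the record layout (a `references` list whose items carry a `raw_ref` string and, once parsed, a `reference` dict) and the `parse` callable are hypothetical, not the actual workflow or schema. The parser is passed in as an argument so that refextract could later be swapped for Grobid without touching the task.

```python
from typing import Callable, List


def parse_raw_references(record: dict, parse: Callable[[str], List[dict]]) -> dict:
    """Return a copy of `record` with raw-only references parsed.

    Hypothetical record layout: record["references"] is a list of dicts,
    each with an optional "raw_ref" string (as provided by the publisher)
    and an optional "reference" dict (the structured form).

    `parse` is any callable mapping a raw reference string to a list of
    structured references. It stands in for refextract or Grobid.
    """
    updated = []
    for ref in record.get("references", []):
        raw = ref.get("raw_ref")
        # Only parse when raw text was provided and nothing is parsed yet.
        if raw and not ref.get("reference"):
            parsed = parse(raw)
            if parsed:
                # One raw string may yield several structured references.
                updated.extend({"raw_ref": raw, "reference": p} for p in parsed)
                continue
        # Already structured, or nothing to parse: keep the entry unchanged.
        updated.append(ref)
    return {**record, "references": updated}
```

With this shape, Hepcrawl only fills in `raw_ref`, and the task is a no-op for case 1. (references already structured) and case 3. (no references at all).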
