Authors' data improvement #18

gtarasconi · 2019-12-05T20:04:16Z


Feature description

First of all, congratulations on the wonderful work!
I'd just like to highlight some possible 'easy' improvements in data quality:

'Et Al' in authors' names (see e.g. npl_publn_id = 24384)
some 'Xp999999' strings in names (see 114551)
newline/non-ASCII characters in text (see 227120)

I could prepare some code in the coming weeks (months?) and share it if you think it could be useful.

--

cverluise · 2019-12-05T20:32:06Z

Maintainer

Hello @gtarasconi,

thanks for your remarks.

It's nice to have such a precise report of these errors.

Issue tracker 🖲

These errors can be partly solved on a case-by-case basis. For those cases, our approach is the following:

the scicit.issues module is used to flag issues at the row level
then, scicit.npl_citation.solve_issues() implements the fix, if any
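
To make the flag-then-fix idea concrete, here is a minimal sketch covering the three errors reported above. It is not the actual scicit code: the function names, flag names, and the `authors`/`title` row fields are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for the reported errors (not taken from scicit.issues).
ET_AL = re.compile(r"\bet\.?\s+al\b\.?", re.IGNORECASE)
XP_ID = re.compile(r"\bxp\s?\d{6,}\b", re.IGNORECASE)


def flag_issues(row):
    """Return the list of issue flags raised by one row (a plain dict)."""
    flags = []
    authors = " ".join(row.get("authors", []))
    if ET_AL.search(authors):
        flags.append("et_al_in_authors")
    if XP_ID.search(authors):
        flags.append("xp_number_in_authors")
    title = row.get("title", "")
    if "\n" in title or "\r" in title:
        flags.append("newline_in_text")
    if not title.isascii():
        # Flag only: non-ASCII characters are often legitimate (e.g. accents).
        flags.append("non_ascii_in_text")
    return flags


def solve_issues(row):
    """Return a copy of the row with the fixable flagged issues repaired."""
    flags = flag_issues(row)
    fixed = dict(row)
    if "et_al_in_authors" in flags or "xp_number_in_authors" in flags:
        cleaned = []
        for name in row.get("authors", []):
            name = XP_ID.sub("", ET_AL.sub("", name))
            name = " ".join(name.split()).strip(" ,;")
            if name:  # drop entries that were nothing but "Et Al" or an XP number
                cleaned.append(name)
        fixed["authors"] = cleaned
    if "newline_in_text" in flags:
        # Collapse newlines and repeated whitespace into single spaces.
        fixed["title"] = " ".join(row["title"].split())
    return fixed
```

Keeping the flagging step separate from the fixing step means a row can be flagged even when no automatic fix exists yet, which is the case for non-ASCII text in this sketch.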

Feel free to have a look at it and build on it.

We will be glad to receive your pull request and collaborate with you!

Improve Grobid model 🎯

Another approach is to train the Grobid model (the parsing library we rely on) on systematic errors (see issue #14 for example). The idea is then to parse and consolidate the flagged citations once again. That's certainly something we will do in 2020.

We might well create a labelling app to crowd-source this important task. Any suggestions are welcome.

Other ideas

Also, since you seem to have a particular interest in authors, you might be interested in getting ORCID identifiers. We have not added them yet, but they are available in Crossref. Note that Crossref bulk dumps are available online (see https://github.com/greenelab/crossref) and the baseline schema to ingest the database into BigQuery is available in schema/.
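
As a rough illustration of pulling ORCID identifiers out of Crossref data, the sketch below reads one work record already loaded as a Python dict. It assumes the record follows the Crossref works JSON layout, where each entry of `author` may carry an `ORCID` URL; the helper name `extract_orcids` and the sample record are hypothetical, and nothing here reflects how the project ingests Crossref.

```python
def extract_orcids(work):
    """Return name/ORCID triples for the authors of one Crossref work record.

    Assumes `work` is a dict in the Crossref works JSON layout; authors
    without an ORCID field are skipped.
    """
    found = []
    for author in work.get("author", []):
        orcid_url = author.get("ORCID")
        if not orcid_url:
            continue
        found.append(
            {
                "given": author.get("given"),
                "family": author.get("family"),
                # Keep only the bare identifier, dropping the URL prefix.
                "orcid": orcid_url.rstrip("/").rsplit("/", 1)[-1],
            }
        )
    return found


# Made-up sample record, for illustration only.
sample_work = {
    "DOI": "10.0000/example",
    "author": [
        {"given": "Josiah", "family": "Carberry",
         "ORCID": "http://orcid.org/0000-0002-1825-0097"},
        {"given": "Jane", "family": "Doe"},
    ],
}
orcids = extract_orcids(sample_work)
```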

Hope it helps,

Cheers


cverluise · 2020-10-22T09:39:32Z

Maintainer

Hello,
authors' data are much improved in v03 (dev). When there is a DOI match, the authors' data (like any other reported data) now come from the Crossref database (editor quality).
Not closing because the issue is still relevant for non-matched bibliographical references.
Cheers
