Normalize monomial names to spot more duplicates #892
Note that duplicates which are in very different classifications would still be considered different!
Note 2: quite a few names in the XR are linked to the incertae sedis taxon a7c5566c-247f-4e91-908d-0d636d158772 which is a good thing to get rid of!
Just updated the files and added a new column.
Notes for me to remember how the files were generated:
create table name (
ID text,
parentID text,
status text,
rank text,
scientific_name text,
authorship text,
normalised_name text,
type text,
nidx_id int,
nidx_canon_id int,
taxgroup text,
classification text,
primary key (ID)
);
create index on name(status) where status in ('accepted', 'provisionally accepted');
create index on name(parentID);
create index on name(nidx_id);
create index on name(scientific_name);
create table name_xr (like name including all);
\copy name from ../col12/NameUsageOut.tsv with null as 'null'
\copy name_xr from ../colXR/NameUsageOut.tsv with null as 'null'
\copy ( select * from ( select id, num, normalised_name, type, taxgroup, rank, scientific_name, authorship, status, cnt, array_length(ARRAY( SELECT DISTINCT unagg FROM unnest(agg) as unagg), 1) as distinct_names, array_length(ARRAY( SELECT DISTINCT untg FROM unnest(tgagg) as untg), 1) as distinct_groups, nidx_id, nidx_canon_id, classification from ( select name.*, row_number() over w as num, array_agg(scientific_name) over w as agg, array_agg(taxgroup) over w as tgagg, count(*) over w as cnt from name WHERE status in ('accepted', 'provisionally accepted') WINDOW w AS (partition by nidx_id) ) n where cnt > 1) as dupe order by distinct_names desc, cnt, nidx_id, num ) to 'col12-dupes.tsv' WITH HEADER
\copy ( select * from ( select id, num, normalised_name, type, taxgroup, rank, scientific_name, authorship, status, cnt, array_length(ARRAY( SELECT DISTINCT unagg FROM unnest(agg) as unagg), 1) as distinct_names, array_length(ARRAY( SELECT DISTINCT untg FROM unnest(tgagg) as untg), 1) as distinct_groups, nidx_id, nidx_canon_id, classification from ( select name_xr.*, row_number() over w as num, array_agg(scientific_name) over w as agg, array_agg(taxgroup) over w as tgagg, count(*) over w as cnt from name_xr WHERE status in ('accepted', 'provisionally accepted') WINDOW w AS (partition by nidx_id) ) n where cnt > 1) as dupe order by distinct_names desc, cnt, nidx_id, num ) to 'colXR-dupes.tsv' WITH HEADER
And here is colXR-dupes.tsv.zip, which groups all names by their canonical names index id. This catches a lot more duplicates and is what the matching uses to select candidates.
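The grouping described above can be sketched outside the database as well. The following is a minimal illustration in Python, not the code that produced the files: it assumes rows shaped like the `name` table, and the function name `find_duplicates`, the report fields and the sample rows (IDs and index values) are all made up for the example. Unlike the SQL, it skips rows without an index id.

```python
from collections import defaultdict

ACCEPTED = {"accepted", "provisionally accepted"}


def find_duplicates(rows, key="nidx_canon_id"):
    """Group accepted names by a names index id and report groups with more than one member.

    Hypothetical helper: `key` may be "nidx_id" or "nidx_canon_id".
    """
    groups = defaultdict(list)
    for row in rows:
        # Assumption: rows without an index id are ignored.
        if row["status"] in ACCEPTED and row[key] is not None:
            groups[row[key]].append(row)

    report = []
    for idx, members in groups.items():
        if len(members) > 1:
            report.append({
                "key": idx,
                "cnt": len(members),
                "distinct_names": len({m["scientific_name"] for m in members}),
                "ids": [m["ID"] for m in members],
            })
    # Same ordering idea as the SQL: most distinct spellings first, then group size, then id.
    report.sort(key=lambda g: (-g["distinct_names"], g["cnt"], g["key"]))
    return report


# Made-up sample rows; only the two genus spellings come from this thread.
rows = [
    {"ID": "a", "status": "accepted", "scientific_name": "Brommella", "nidx_canon_id": 1},
    {"ID": "b", "status": "accepted", "scientific_name": "Bromella", "nidx_canon_id": 1},
    {"ID": "c", "status": "synonym", "scientific_name": "Bromella", "nidx_canon_id": 1},
    {"ID": "d", "status": "accepted", "scientific_name": "Abies", "nidx_canon_id": 2},
]
for group in find_duplicates(rows):
    print(group)
```

Passing `key="nidx_id"` reproduces the partitioning used in the `\copy` statements, while the canonical id groups more loosely.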
The botanical code defines parahomonyms in §53.2 which are to be treated as homonyms:
The zoological code unfortunately does allow a single character difference to be accepted as a different (genus) name. I can see Brommella and Bromella accepted in COL, along with other names which are very alike. Often these are in clearly disparate groups and we can keep them apart during merges. But some might be very close, and we would wrongly merge information. This is very rare and can be fixed in x-configs, but it might be too invasive? On the other hand, we have many, many misspelled duplicates. But we might want to require an authorship for those to be considered the same. Something we don't often have for genus or higher names...
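One possible reading of the authorship requirement, sketched in Python. This is only an illustration of the idea, not how the matching is implemented: the function `may_merge`, the record layout and the naive authorship comparison (trimmed, case-insensitive equality) are assumptions, and the authorship strings and normalised key in the usage lines are placeholders.

```python
def may_merge(a: dict, b: dict) -> bool:
    """Return True if two name records could be treated as the same name.

    Hypothetical rule: differently spelled names that share a normalised key
    are merged only when both carry the same authorship.
    """
    if a["normalised_name"] != b["normalised_name"]:
        return False
    if a["scientific_name"] == b["scientific_name"]:
        return True  # identical spelling, no authorship needed
    auth_a = (a.get("authorship") or "").strip().lower()
    auth_b = (b.get("authorship") or "").strip().lower()
    # Spelling variants need a non-empty, matching authorship on both sides.
    return bool(auth_a) and auth_a == auth_b


# Placeholder records: the normalised key and authorships are invented.
x = {"scientific_name": "Brommella", "normalised_name": "bromela", "authorship": "Author, 1900"}
y = {"scientific_name": "Bromella", "normalised_name": "bromela", "authorship": None}
print(may_merge(x, y))  # the variant without authorship is kept apart
```

Under such a rule, genus and higher names that lack authorship would stay unmerged unless spelled identically, which is the trade-off raised above.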
New results uploaded here and linked at the top:
Only bi/trinomial names are currently normalised during matching to avoid wrongly merging very close genus names.
This was also done to prevent the stemming from breaking the suffixes of higher ranks.
Not doing any of the misspelling normalisations (double letters, i/y, silent h) leads to many misspelled duplicates in the XR. Some are also present in the base release.
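To make the three misspelling normalisations concrete, here is a minimal Python sketch. It is not the names index implementation: the function name `normalise_uninomial` and the exact rules are assumptions (y is folded into i, an h directly after r or t is treated as silent, and runs of a repeated letter are collapsed).

```python
import re


def normalise_uninomial(name: str) -> str:
    """Build a hypothetical comparison key for a uninomial name."""
    key = name.strip().lower()
    # i/y: treat y as i (assumed rule).
    key = key.replace("y", "i")
    # Silent h: drop an h that directly follows r or t (assumed rule).
    key = re.sub(r"(?<=[rt])h", "", key)
    # Double letters: collapse any run of the same character to one.
    key = re.sub(r"(.)\1+", r"\1", key)
    return key


# The two accepted genus names mentioned in this thread end up with the same key.
print(normalise_uninomial("Brommella") == normalise_uninomial("Bromella"))
```

The same example shows the risk discussed in the comments: two names that the zoological code treats as distinct would be seen as duplicates under such a key.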
As changing the normalisation of names in the names index will have a serious impact, I did a preview of all the uninomial names in the latest COL24.12 release and the January XRelease. Both results contain only accepted names which are considered to have duplicates according to the new rules.
From what I can see this looks very good and there appears to be no wrongly merged name in the lists.
If @camiplata, @DianRHR and @olafbanki could confirm, I would change it for real in production.