Normalize monomial names to spot more duplicates #892

Open
mdoering opened this issue Jan 6, 2025 · 7 comments

mdoering commented Jan 6, 2025

Currently only binomial and trinomial names are normalised during matching, to avoid wrongly merging genus names that are very close to each other.
This also keeps the stemming from breaking the suffixes of higher-rank names.

Skipping all of the misspelling normalisations (double letters, i/y, silent h) for uninomials leads to many misspelled duplicates in the XR. Some are also present in the base release.
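To make the three normalisations concrete, here is a minimal Python sketch. It is not the backend's actual implementation: the function name `normalise_uninomial` and the exact rules (drop an h after a consonant, treat y as i, collapse repeated letters) are assumptions chosen only to illustrate the idea.

```python
import re

def normalise_uninomial(name: str) -> str:
    """Hypothetical matching key for a uninomial; illustrative rules only."""
    key = name.strip().lower()
    # silent h: drop an h that directly follows a consonant (e.g. "rh", "th")
    key = re.sub(r"(?<=[bcdfgjklmnpqrstvwxz])h", "", key)
    # i/y: treat y as i
    key = key.replace("y", "i")
    # double letters: collapse any run of the same character to one
    key = re.sub(r"(.)\1+", r"\1", key)
    return key

# Spelling variants end up with the same key and would be grouped together.
print(normalise_uninomial("Ismariidae") == normalise_uninomial("Ismaridae"))
```

Under these assumed rules a key is only a grouping aid: two names that share a key are candidates for being the same name, not proof of it.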

As changing the normalisation of names in the names index will have a serious impact, I generated a preview for all uninomial names in the latest COL24.12 release and the January XRelease. Both results contain only accepted names that are considered to have duplicates under the new rules.

From what I can see this looks very good and there appears to be no wrongly merged name in the lists.
If @camiplata, @DianRHR and @olafbanki could confirm I would change it for real in production.

mdoering added the feedback (User feedback) and xrelease (extended release) labels Jan 6, 2025

mdoering commented Jan 6, 2025

Note that duplicates which are in very different classifications would still be considered different!


mdoering commented Jan 6, 2025

Note 2: quite a few names in the XR are linked to the incertae sedis taxon a7c5566c-247f-4e91-908d-0d636d158772, which would be a good thing to get rid of!


mdoering commented Jan 6, 2025

Just updated the files and added a new column, distinct_names, which counts the number of unique scientific_name values per index group. Groups with a value higher than one therefore contain spelling variation that only the new normalisation brings together.
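As a reading aid for that column, the following Python sketch computes the same two numbers per index group: `cnt` (rows in the group) and `distinct_names` (unique spellings). The function name and the sample ids are made up for illustration; only the column names come from the files.

```python
from collections import defaultdict

def distinct_names_per_group(rows):
    """rows: iterable of (nidx_id, scientific_name) pairs.

    Returns {nidx_id: (cnt, distinct_names)} for groups with more than one row.
    """
    groups = defaultdict(list)
    for nidx_id, scientific_name in rows:
        groups[nidx_id].append(scientific_name)
    return {
        nidx_id: (len(names), len(set(names)))
        for nidx_id, names in groups.items()
        if len(names) > 1  # same filter as "cnt > 1" in the export
    }

# Hypothetical ids: group 1 has two spellings, group 2 is an exact duplicate,
# group 3 is a single name and is therefore left out.
sample = [(1, "Ismariidae"), (1, "Ismaridae"), (2, "Aus"), (2, "Aus"), (3, "Bus")]
print(distinct_names_per_group(sample))
```

Groups where `distinct_names` is 1 are exact-spelling duplicates that any normalisation would find; only the groups above 1 show the effect of the new rules.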


mdoering commented Jan 6, 2025

Notes for me to remember how the files were generated:

create table name (
	ID text,
	parentID text,
	status text,
	rank text,
	scientific_name text,
	authorship text,
	normalised_name text,
	type text,
	nidx_id int,
	nidx_canon_id int,
	taxgroup text,
	classification text,
	primary key (ID)
);
create index on name(status) where status in ('accepted', 'provisionally accepted');
create index on name(parentID);
create index on name(nidx_id);
create index on name(scientific_name);

create table name_xr (like name including all);

 \copy name from ../col12/NameUsageOut.tsv with null as 'null'
 \copy name_xr from ../colXR/NameUsageOut.tsv with null as 'null'

\copy ( select * from (	select id, num, normalised_name, type, taxgroup, rank, scientific_name, authorship, status, cnt, array_length(ARRAY( SELECT DISTINCT unagg FROM unnest(agg) as unagg), 1) as distinct_names, array_length(ARRAY( SELECT DISTINCT untg FROM unnest(tgagg) as untg), 1) as distinct_groups, nidx_id, nidx_canon_id, classification from ( select name.*, row_number() over w as num, array_agg(scientific_name) over w as agg, array_agg(taxgroup) over w as tgagg, count(*) over w as cnt from name WHERE status in ('accepted', 'provisionally accepted') WINDOW w AS (partition by nidx_id) ) n where cnt > 1) as dupe order by distinct_names desc, cnt, nidx_id, num ) to 'col12-dupes.tsv' WITH HEADER

\copy ( select * from (	select id, num, normalised_name, type, taxgroup, rank, scientific_name, authorship, status, cnt, array_length(ARRAY( SELECT DISTINCT unagg FROM unnest(agg) as unagg), 1) as distinct_names, array_length(ARRAY( SELECT DISTINCT untg FROM unnest(tgagg) as untg), 1) as distinct_groups, nidx_id, nidx_canon_id, classification from ( select name_xr.*, row_number() over w as num, array_agg(scientific_name) over w as agg, array_agg(taxgroup) over w as tgagg, count(*) over w as cnt from name_xr WHERE status in ('accepted', 'provisionally accepted') WINDOW w AS (partition by nidx_id) ) n where cnt > 1) as dupe order by distinct_names desc, cnt, nidx_id, num ) to 'colXR-dupes.tsv' WITH HEADER


mdoering commented Jan 6, 2025

And here is colXR-dupes.tsv.zip, which groups all names by their canonical names index id. This catches a lot more and is what the matching uses to select candidates. Ismariidae and Ismaridae Thomson, 1858 are in this one, see #889.

mdoering added a commit to CatalogueOfLife/backend that referenced this issue Jan 7, 2025

mdoering commented Jan 7, 2025

The botanical code defines parahomonyms in §53.2 which are to be treated as homonyms:

53.2. When two or more names of genera or species based on different types are so similar that they are likely to be confused (because they are applied to related taxa or for any other reason) they are to be treated as homonyms (see also Art. 61.5). If established practice has been to treat two similar names as homonyms, this practice is to be continued if it is in the interest of nomenclatural stability.

Unfortunately, the zoological code does allow a single-character difference to be accepted as a different (genus) name.

I can see Brommella and Bromella accepted in COL, along with other names that are very alike. Often these are in clearly disparate groups and we can keep them apart during merges. But some might be very close, and then we would wrongly merge information. This is very rare and can be fixed in x-configs, but it might be too invasive? On the other hand, we have many, many misspelled duplicates. We might want to require an authorship for those to be considered the same, something we often don't have for genus or higher-rank names...
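One way to picture this trade-off is a merge rule with a switch for the authorship requirement. The sketch below is only an illustration: the `Usage` record, the `may_merge` function, the invented names in the usage lines, and the use of differing `taxgroup` values as the test for "disparate groups" are all assumptions, not the backend's logic. Authorship is compared as a plain case-insensitive string, which is stricter than real authorship matching would be.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Usage:
    scientific_name: str
    authorship: Optional[str]
    taxgroup: Optional[str]
    nidx_id: int  # shared by all spellings that normalise to the same key

def may_merge(a: Usage, b: Usage, require_authorship: bool = False) -> bool:
    """Hypothetical rule: may two uninomial usages be treated as one name?"""
    if a.nidx_id != b.nidx_id:
        return False
    # keep names apart when both groups are known and they differ
    if a.taxgroup and b.taxgroup and a.taxgroup != b.taxgroup:
        return False
    # identical spelling in a compatible group: merge as before
    if a.scientific_name == b.scientific_name:
        return True
    # spelling variants: optionally demand matching authorship on both sides
    if require_authorship:
        if not a.authorship or not b.authorship:
            return False
        return a.authorship.casefold() == b.authorship.casefold()
    return True

# Invented names and values, for illustration only.
with_author = Usage("Exemplaria", "Author, 1900", "Insects", 1)
no_author = Usage("Exemplarria", None, "Insects", 1)
print(may_merge(with_author, no_author))
print(may_merge(with_author, no_author, require_authorship=True))
```

With the switch on, a variant that lacks authorship is never merged, which is exactly the cost mentioned above: many genus and higher-rank duplicates would stay unmerged.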


mdoering commented Jan 7, 2025

New results are uploaded here and linked at the top:

col12-dupes.tsv.zip
colXR-dupes.tsv.zip
