Normalize monomial names to spot more duplicates #892

mdoering · 2025-01-06T14:44:01Z

Currently only bi/trinomial names are normalised during matching, to avoid wrongly merging very similar genus names.
This was also done to prevent the stemming from breaking the suffixes of higher ranks.

Not applying any of the misspelling normalisations (double letters, i/y, silent h) to monomials leads to many misspelled duplicates in the XR. Some are also present in the base release.
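To make the three normalisations concrete, here is a minimal sketch in Python. It is not the names index implementation: the function name `normalise_monomial`, the rule order, and the choice to drop an `h` only after `c`, `r` or `t` are all assumptions made for illustration.

```python
import re


def normalise_monomial(name: str) -> str:
    """Hypothetical spelling-tolerant key for a uninomial (illustration only)."""
    key = name.strip().lower()
    # Silent h: assumed here to mean an h directly after c, r or t.
    key = re.sub(r"(?<=[crt])h", "", key)
    # i/y: treat both letters as the same character.
    key = key.replace("y", "i")
    # Double letters: collapse any run of a repeated character to one.
    key = re.sub(r"(.)\1+", r"\1", key)
    return key


print(normalise_monomial("Brommella"))  # bromela
print(normalise_monomial("Bromella"))   # bromela
```

Two spellings that differ only in a doubled letter end up with the same key, which is the desired effect for misspellings and also the risk for genuinely different genus names.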

As changing the normalisation of names in the names index will have a serious impact, I generated a preview for all the uninomial names in the latest COL24.12 release and the January XRelease. Both result files contain only accepted names which are considered to have duplicates according to the new rules.

From what I can see, this looks very good and there appear to be no wrongly merged names in the lists.
If @camiplata, @DianRHR and @olafbanki could confirm, I would change it for real in production.

mdoering · 2025-01-06T14:45:06Z

Note that duplicates which are in very different classifications would still be considered different!

mdoering · 2025-01-06T14:47:07Z

Note 2: Quite a few names in the XR are linked to the incertae sedis taxon a7c5566c-247f-4e91-908d-0d636d158772, which would be a good thing to get rid of!

mdoering · 2025-01-06T18:50:15Z

I just updated the files and added a new column, distinct_names, which counts the number of unique scientific_name values per index group. Groups with a value higher than one therefore contain some spelling variation that is matched only because of the new normalisation.

mdoering · 2025-01-06T18:52:06Z

Notes for me to remember how the files were generated:

create table name (
	ID text,
	parentID text,
	status text,
	rank text,
	scientific_name text,
	authorship text,
	normalised_name text,
	type text,
	nidx_id int,
	nidx_canon_id int,
	taxgroup text,
	classification text,
	primary key (ID)
);
create index on name(status) where status in ('accepted', 'provisionally accepted');
create index on name(parentID);
create index on name(nidx_id);
create index on name(scientific_name);

create table name_xr (like name including all);

 \copy name from ../col12/NameUsageOut.tsv with null as 'null'
 \copy name_xr from ../colXR/NameUsageOut.tsv with null as 'null'

\copy ( select * from (	select id, num, normalised_name, type, taxgroup, rank, scientific_name, authorship, status, cnt, array_length(ARRAY( SELECT DISTINCT unagg FROM unnest(agg) as unagg), 1) as distinct_names, array_length(ARRAY( SELECT DISTINCT untg FROM unnest(tgagg) as untg), 1) as distinct_groups, nidx_id, nidx_canon_id, classification from ( select name.*, row_number() over w as num, array_agg(scientific_name) over w as agg, array_agg(taxgroup) over w as tgagg, count(*) over w as cnt from name WHERE status in ('accepted', 'provisionally accepted') WINDOW w AS (partition by nidx_id) ) n where cnt > 1) as dupe order by distinct_names desc, cnt, nidx_id, num ) to 'col12-dupes.tsv' WITH HEADER

\copy ( select * from (	select id, num, normalised_name, type, taxgroup, rank, scientific_name, authorship, status, cnt, array_length(ARRAY( SELECT DISTINCT unagg FROM unnest(agg) as unagg), 1) as distinct_names, array_length(ARRAY( SELECT DISTINCT untg FROM unnest(tgagg) as untg), 1) as distinct_groups, nidx_id, nidx_canon_id, classification from ( select name_xr.*, row_number() over w as num, array_agg(scientific_name) over w as agg, array_agg(taxgroup) over w as tgagg, count(*) over w as cnt from name_xr WHERE status in ('accepted', 'provisionally accepted') WINDOW w AS (partition by nidx_id) ) n where cnt > 1) as dupe order by distinct_names desc, cnt, nidx_id, num ) to 'colXR-dupes.tsv' WITH HEADER
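For readers who find the one-line queries hard to follow, the grouping logic can be sketched in Python as below. This is a paraphrase under assumptions, not the code that produced the files: the function name `duplicate_groups` and the dict-based rows are hypothetical, and the sketch returns one summary entry per group, whereas the SQL emits one row per name with the group counts repeated.

```python
from collections import defaultdict

ACCEPTED = {"accepted", "provisionally accepted"}


def duplicate_groups(rows):
    """Summarise groups of accepted names that share a names index id.

    Each row is a dict with at least the keys nidx_id, status,
    scientific_name and taxgroup (mirroring the table columns).
    """
    groups = defaultdict(list)
    for row in rows:
        if row["status"] in ACCEPTED:
            groups[row["nidx_id"]].append(row)

    report = []
    for nidx_id, members in groups.items():
        if len(members) > 1:  # corresponds to "where cnt > 1"
            report.append({
                "nidx_id": nidx_id,
                "cnt": len(members),
                "distinct_names": len({m["scientific_name"] for m in members}),
                "distinct_groups": len({m["taxgroup"] for m in members}),
            })
    # Most spelling variation first, then smaller groups, then by index id.
    report.sort(key=lambda g: (-g["distinct_names"], g["cnt"], g["nidx_id"]))
    return report
```

A group with distinct_names above one is a set of accepted names that only match because of the normalisation, and distinct_groups above one flags groups spanning more than one taxgroup.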

mdoering · 2025-01-06T21:31:53Z

And here is colXR-dupes.tsv.zip, which groups all names by their canonical names index id. This catches a lot more and is what the matching uses to select candidates. Ismariidae and Ismaridae Thomson, 1858 are in this one, see #889.

mdoering · 2025-01-07T11:08:33Z

The botanical code defines parahomonyms in §53.2 which are to be treated as homonyms:

53.2. When two or more names of genera or species based on different types are so similar that they are likely to be confused (because they are applied to related taxa or for any other reason) they are to be treated as homonyms (see also Art. 61.5). If established practice has been to treat two similar names as homonyms, this practice is to be continued if it is in the interest of nomenclatural stability.

The zoological code unfortunately does allow a single character difference to be accepted as a different (genus) name.

I can see Brommella and Bromella accepted in COL, along with other names which are very much alike. Often these are in clearly disparate groups and we can keep them apart during merges. But some might be closely related, and then we would wrongly merge information. This is very rare and can be fixed in x-configs, but it might be too invasive? On the other hand, we have many, many misspelled duplicates. We might want to require a matching authorship for those to be considered the same, which is something we often don't have for genus or higher names...
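The trade-off could be expressed as a decision rule along the following lines. This is only a sketch of the idea floated above, not how matching currently works: the `Usage` class, the `same_name` function and the exact string comparison of authorships are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    scientific_name: str
    normalised_name: str
    authorship: Optional[str]
    taxgroup: Optional[str]


def same_name(a: Usage, b: Usage) -> bool:
    """Hypothetical rule for treating two monomials as duplicates."""
    if a.normalised_name != b.normalised_name:
        return False
    # Names in clearly different groups are kept apart.
    if a.taxgroup and b.taxgroup and a.taxgroup != b.taxgroup:
        return False
    # Identical spelling needs no extra evidence.
    if a.scientific_name == b.scientific_name:
        return True
    # Spelling variants only count as the same with a matching authorship.
    return bool(a.authorship) and a.authorship == b.authorship
```

Under such a rule, spelling variants without authorship stay separate, which protects pairs like Brommella and Bromella but also leaves many misspelled higher-rank duplicates undetected.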

mdoering · 2025-01-07T11:27:28Z

New results are uploaded here and linked at the top:

col12-dupes.tsv.zip
colXR-dupes.tsv.zip

mdoering added feedback User feedback xrelease extended release labels Jan 6, 2025

mdoering assigned camiplata and DianRHR Jan 6, 2025

mdoering mentioned this issue Jan 6, 2025

Duplicate family- to be block anr report to xrelease source #889

Closed

mdoering added this to the XRelease Public milestone Jan 6, 2025

mdoering added this to Software Development Jan 6, 2025

mdoering moved this to Todo in Software Development Jan 6, 2025

mdoering self-assigned this Jan 6, 2025

mdoering added a commit to CatalogueOfLife/backend that referenced this issue Jan 7, 2025

Test normilisation of monomial names, see CatalogueOfLife/data#892

961e9a8
