8243 improve language controlled vocab #10481

landreev · 2024-04-10T18:49:15Z

What this PR does / why we need it:

This PR updates the Citation metadata block sanitizing the Controlled Vocabulary for the field "language". It replaces the draft PR #10197 made late last year.

The nature of the cleanup is as follows:
The CVV list is brought up to the following standard:

all the languages from he ISO 639-1 list are in it;
all the corresponding 2-letter codes from ISO 639-1 are in the vocab as "alternates" (in practice this means that, for example, both <dc:language>English</dc:language> and <dc:language>en</dc:language> in oai_dc are importable);
similarly, the 3-letter codes from ISO 639-2 and ISO 639-3 for the languages above should also be there as alternates. in other words, <dc:language>eng</dc:language> is also valid for the purposes of import;
if multiple 2- or 3-letter codes exist for the same language, all should be in the CVV as alternates as well;
for some languages, alternative variants of the language names may now be considered preferable, these extra names have been added as well. For example: language Slovene 146 slv sl Slovenian;
as discussed in Sanitize languages controlled vocabulary values #10197, no changes have been made to the "main" CVV values (such as the name Slovene in the last example). This is to accommodate our model of metadata block updates which relies on matching on that name. If it ever becomes necessary to change any of the main names for whatever practical reason, we will have to do that via a Flyway update. The plan is to avoid that, at least for now; EDIT: We actually had a solution for this problem all along! The CVV entries in the block have a field reserved for the "identifier", that will match the value to its replacement, even with the main value changed. We apparently never used this field for the languages, but as of this PR, it is now populated (using the first 3-letter code available for the language as such permanent identifier). This means that going forward, we'll be able to change "Slovene" to "Slovenian", etc., if it ever becomes necessary.
the PR addresses the issue with some of the languages in our block where multiple alternative names have been listed in the main CVV itself. For example:
language Navajo, Navaho 109 nav nv. For such cases both forms have been separately added as alternates, making either Navajo or Navaho importable on their own, as in:
language Navajo, Navaho 109 nav nv Navajo Navaho
there is a separate issue discussing further expanding the list of supported languages; there are thousands of entries on the full ISO 639-3 list, so it doesn't seem practical to just add all of them to the block, so some alternative mechanisms would be needed - either handle it on a case by case basis on demand, or via a Remote CV (?). That issue is NOT being addressed in this PR.

Which issue(s) this PR closes:

Closes #8243

Please note the issue is old and much of the information in the opening description already obsolete. The description above should be accurate.

Special notes for your reviewer:

Suggestions on how to test this:

I'm not sure what QA for this should be like, tbh. One could try and actually test that the extra language designations added in this PR are now importable; take a test json file - like the standard dataset-create-new-all-default-fields.json and try entries like

{
    "typeName": "language",
    "multiple": true,
    "typeClass": "controlledVocabulary",
    "value": [
       "Navajo",
       "Slovenian",
       "kir",
       ... etc.
    ]
}

... but a better test would likely be to look at the database entries for the DatasetFieldType "language" in the controlledvocabularyvalue and controlledvocabalternate tables, before and after the metadatablock update, and confirm that
a. the number of entries and the values in controlledvocabularyvalue are identical before and after
b. more values have been added to controlledvocabalternate, and all the values from before are still there
c. that the end result (after) is identical to what's in these tables when the new block is installed on a brand new, empty database.

Not sure if it's worth going to such trouble - but figured I'd leave it here.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

…nto a local branch; #8243

…er conversation with requestor. #8243

… - I'm leaving the main name intact (so that the block update will still works), but adding both versions as extra alternative names, so that either is importable. #8243

…he ISO 639-1 and -2 codes for Nuosu. #8243

…Wikipedia) #8243

…the order in which they are listed in the current ISO 639-3 table) #8243

pdurbin

Should we add a change to Malayalam? Please see my comment in this review.

pdurbin · 2024-05-02T16:20:59Z

scripts/api/data/metadatablocks/citation.tsv

-	language	Malay		100	may	msa	ms
+	language	Malay		100	msa	may	ms
 	language	Malayalam		101	mal	ml


There's a conversation going on where Youn Noh at Yale is proposing these changes:

ORIG: language Malayalam 101 mal ml NEW: language Malayalam mal 4016 mal mal ml

Should we get this change in as well?

Please see https://groups.google.com/g/dataverse-community/c/rGkaHuzA2x4/m/GeyOF1kpAgAJ

@pdurbin The change above doesn't make much sense, so no, we certainly don't want it as is. I'm going to add a comment in the google group thread reiterating an earlier suggestion for the author to take a closer look at the opening comment in this PR, since it answers some of the questions asked there.

younnoh · 2024-05-06T14:45:43Z

iqss_pr_10481.zip contains a sample script for citation.tsv in its original form and as modified by the script. The ISO 639-3 code tables are available for download from SIL; the input file from them is included in the attachment as well. (Please let me know if there are any issues with the attachment.) I think this clarifies the change we would like to see. I used the ISO 639-3 language code as the value for identifier and adhered to the existing column order in columns following displayOrder for mappings to previous versions of ISO 639. As I understand it, the mappings are needed to maintain compatibility with metadata standards such as OAI DC that use those versions. I had thought about excluding the rows from the ISO table where Scope and Language have a value of "S" but ended up leaving them to remain faithful to the standard. (IMPORTANT: We need to switch to ISO 639-3 to meet the request of a professor of linguistics who has graciously served as one of our beta testers. The languages needed by her research group are not in ISO 639-2.)

landreev · 2024-05-06T14:48:05Z

@younnoh Thank you, I'll take a look.

landreev · 2024-05-06T14:53:41Z

@younnoh If you haven't yet, please read carefully the opening comment for this PR that explains its scope and limitations.

younnoh · 2024-05-06T16:50:32Z

"there is a separate issue discussing further expanding the list of supported languages" - Could you please provide a link?
Is it within the scope of this PR or (1) to change the value in the identifier column to the language code? I believe this is standard practice (e.g., see languages here - https://id.loc.gov/). Names or codes from previous versions could be mapped via lookup tables.
Could you please identify any additional problems with the changes I've proposed with regard to this PR or (1)? Thank you!

landreev · 2024-05-06T20:13:51Z

@younnoh The issue in question is #8578. It's been sitting untouched for a while, but please note that it's been recently tagged as "sprint ready", which means it is being scheduled to be revisited and worked on sometime soon (probably as soon as this PR is merged), and we'll use that issue to figure out what to do with/how to handle the full ISO 639-3 list.

I'm in the middle of something now, but I'll try to get back to you with more info/answers to your other questions ASAP.

One quick thing: please note that in the context of this issue, or the other issue referenced, we are talking about making changes to the "official" version of citation.tsv that we distribute with the application. But the beauty of keeping our "metadata blocks" configurable, and not hard-coded in the application, is that you don't really have to use the configuration that we supply. Within the core dev. team we've had somewhat of a consensus in that we don't want to add the full 639-3 list to the CV, in part because we don't believe that most users will want that pulldown Language menu on the Edit Dataset page to have 8K entries in it. But, if that's something your users do want to have, nobody stops you from adding these CVVs to the tsv file and adding them on your own instance.

younnoh · 2024-05-07T15:43:05Z

Thank you for the link to the issue and the update on your future sprint. I would like to raise the points below for your consideration as you decide how to proceed.

For a number of researchers, I think that the language field describes a key aspect of their research commensurate in importance to (for example) coordinates in geospatial metadata.
Not only to keep abreast with current standards, but also to add diversity by including underrepresented languages and linguistic communities, I think it would be beneficial to use ISO 639-3.
If resources permit, I think the usability issue presented by a long list of terms could be addressed through an autocomplete feature that would be generally useful for other controlled vocabularies that are not readily scrollable on a range of devices.
While the flexibility of having customizable metadata blocks is a wonderful feature of Dataverse, as for everyone, the resources here are limited. I would prefer to try to get our DevOps and Software Engineering teams involved in community work on Dataverse integrations and experimental metadata blocks (e.g., for computational workflows). I also think it might be a better use of collective resources to distribute the change to the controlled vocabulary through an official release, thereby allowing community members to leverage their resources where they can make a unique contribution, for example, by providing access to feedback from researchers working on a wide range of research topics.

landreev · 2024-05-07T16:30:09Z

Is it within the scope of this PR or (1) to change the value in the identifier column to the language code?

Short answer: yes, I'll do this.
Longer/more embarrassing answer: I didn't know/had no memory about that "identifier" column, and what its purpose was. When I said, in the opening comment:

as discussed in #10197, no changes have been made to the "main" CVV values (such as the name Slovene in the last example). This is to accommodate our model of metadata block updates which relies on matching on that name...

- I just wasn't aware that we had this identifier, for the very purpose of solving this problem: to able to change the primary CVV when updating a metadata block. In my defense, nobody on our dev. team here corrected me on that.
This doesn't change the fact that, for the purposes of this PR, we still can't touch the main values (as opposed to the alternates), since the identifiers are not there yet. But it does make sense to add them in this PR, so that this isn't going to be a problem going forward.
And yes, using the 3 letter ISO codes as the identifiers makes perfect sense too.

pdurbin · 2024-05-07T17:40:28Z

@younnoh just a quick question on this point:

If resources permit, I think the usability issue presented by a long list of terms could be addressed through an autocomplete feature that would be generally useful for other controlled vocabularies that are not readily scrollable on a range of devices.

It's already possible to type a few characters as in the example below. Are you looking for something else?

younnoh · 2024-05-07T18:07:58Z

Thanks, Phil. That's exactly what I meant. (I just requested greater permissions here to be able to do more testing before I comment.)

landreev · 2024-05-07T19:01:01Z

@younnoh @pdurbin
The way the existing menu works - with the contents of the menu filtered/reduced as the user starts typing - you start with the pulldown fully populated; since this conversation was in the context of supporting a CV with 8K+ values, I still feel somewhat strongly that we don't want the default version of that menu to contain that many entries, for all Dataverse users. The way this is implemented with External Vocabs may be a better fit for this use case.

pdurbin · 2024-05-09T13:16:12Z

@younnoh no problem. I'm not sure what you mean by permissions but you should be able to test the language field over at https://demo.dataverse.org and see the same UI (and get the same UX) as the screenshot above.

By the way, I started a thread on autocomplete here: https://dataverse.zulipchat.com/#narrow/stream/434038-ux-wg/topic/autocomplete.20for.20language/near/437500258

younnoh · 2024-05-09T13:37:13Z

@pdurbin Thanks for the reminder about the demo Dataverse and the link to the thread on autocomplete. Autocomplete works well from the login page for the demo when you want to select your institution. (Sorry for any confusion caused by my previous comment. I just got around to creating an account for my institution's test Dataverse instance and will use it do further testing.)

qqmyers · 2024-05-09T13:50:55Z

FWIW: The autocomplete in the author affiliation field is from an external vocabulary script using the https://ror.org vocab/api. The autocomplete for the lang field is Dataverse's internal one for CVVs listed in the metadata blocks.

younnoh · 2024-05-09T15:29:29Z

Is there a date by which it will be determined if (and when) ISO 639-3 will be adopted as citation metadata as part of an official release or update? I need to get back to my team on when we'll know to wait or to load as custom. Thanks!

pdurbin · 2024-05-09T20:14:05Z

@younnoh no, but please keep an eye on #8578. It sounds like we might re-title it to be about ISO 639-3 in general.

landreev · 2024-05-15T22:39:47Z

Sorry, I got distracted/forced to work on some urgent things instead. But I will finish it shortly (just wanted to populate the Identifier field in the CVV, otherwise leaving the PR as is; I believe it is ready to be merged otherwise).

I suggest we take the discussion of whether, or how to support the full extended 639-3 list to #8578. Somebody else from the team is looking into it/some activity is happening there.

coveralls · 2024-05-15T23:16:50Z

coverage: 20.587% (-0.003%) from 20.59%
when pulling 93fb1f5 on 8243-improve-language-controlled-vocab
into da3dd95 on develop.

…dates easier. Used the first 3-letter code as the identifier for each of the 185 supported languages. #8243

landreev · 2024-05-16T01:42:19Z

OK, populated the Identifier field for the supported languages. Moving back to "ready for QA"; will try to coax somebody to merge it soon-ish, maybe even tomorrow.

qqmyers · 2024-05-22T14:56:54Z

QA: Verified that after the update all cvv's except "Not applicable" now have an id, that additional alternates have been added, that deleting from the db and re-loading the block recreates the same number of values in both tables, spot checked for specific values, verified that adding vals works in the UI, that editMetadata api works with the strValue of the cvv.

I'll go ahead and merge.

Note the one failed test is due to the know confirmEmail issue fixed in a different PR.

landreev added 9 commits February 12, 2024 10:46

pushing the language controlled vocab additions suggested in #10197 i…

8796a1d

…nto a local branch; #8243

Merge branch 'develop' into 8243-improve-language-controlled-vocab

56f3b4b

Alternative long spellings for the Slovene and Kyrgyz languages. #8243

4f8a74c

Merge branch 'develop' into 8243-improve-language-controlled-vocab

e3aa0e8

Dropping the "macrolanguage" from the names of 2 language families, p…

f2f1ce1

…er conversation with requestor. #8243

Addresses the cases of languages with the names like "Navajo, Navaho"…

5acb13a

… - I'm leaving the main name intact (so that the block update will still works), but adding both versions as extra alternative names, so that either is importable. #8243

Fixed the tabs in the entries for Catalan and Haitian Creole; added t…

6324477

…he ISO 639-1 and -2 codes for Nuosu. #8243

One extra alternative name for Nuosu (from the ISO 893-2 articles on …

091ad21

…Wikipedia) #8243

a few minor cosmetic improvements (rearranging the 3 letter codes in …

d7a727f

…the order in which they are listed in the current ISO 639-3 table) #8243

landreev mentioned this pull request Apr 11, 2024

Sanitize languages controlled vocabulary values #10197

Closed

landreev added 2 commits April 12, 2024 13:12

a release note. #8243

8072d91

Merge branch 'develop' into 8243-improve-language-controlled-vocab

a84d577

landreev marked this pull request as ready for review April 12, 2024 17:19

stevenwinship self-assigned this Apr 12, 2024

stevenwinship approved these changes Apr 12, 2024

View reviewed changes

stevenwinship removed their assignment Apr 12, 2024

cmbz added GREI 2 Consistent Metadata Size: 3 A percentage of a sprint. 2.1 hours. labels May 2, 2024

pdurbin reviewed May 2, 2024

View reviewed changes

landreev mentioned this pull request May 7, 2024

Figure out whether, or how to support the extended ISO 639-3 list of languages #8578

Closed

landreev self-assigned this May 8, 2024

Merge branch 'develop' into 8243-improve-language-controlled-vocab

93fb1f5

Populated the identifier column in the Language CV, to make future up…

1073333

…dates easier. Used the first 3-letter code as the identifier for each of the 185 supported languages. #8243

landreev removed their assignment May 16, 2024

qqmyers self-assigned this May 22, 2024

qqmyers merged commit 6895034 into develop May 22, 2024
1 of 2 checks passed

qqmyers removed their assignment May 22, 2024

pdurbin added this to the 6.3 milestone May 23, 2024

cmbz mentioned this pull request Jun 4, 2024

GREI 2: HDV Task - Improve Dataverse Biomedical Metadata Support IQSS/dataverse-pm#174

Open

13 tasks

landreev mentioned this pull request Jul 22, 2024

Support the full ISO 639-3 list of languages #10578

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8243 improve language controlled vocab #10481

8243 improve language controlled vocab #10481

landreev commented Apr 10, 2024 •

edited

Loading

pdurbin left a comment

pdurbin May 2, 2024

landreev May 6, 2024

younnoh commented May 6, 2024

landreev commented May 6, 2024

landreev commented May 6, 2024

younnoh commented May 6, 2024 •

edited

Loading

landreev commented May 6, 2024

younnoh commented May 7, 2024

landreev commented May 7, 2024

pdurbin commented May 7, 2024

younnoh commented May 7, 2024

landreev commented May 7, 2024

pdurbin commented May 9, 2024

younnoh commented May 9, 2024

qqmyers commented May 9, 2024

younnoh commented May 9, 2024

pdurbin commented May 9, 2024

landreev commented May 15, 2024

coveralls commented May 15, 2024

landreev commented May 16, 2024 •

edited

Loading

qqmyers commented May 22, 2024

8243 improve language controlled vocab #10481

8243 improve language controlled vocab #10481

Conversation

landreev commented Apr 10, 2024 • edited Loading

pdurbin left a comment

Choose a reason for hiding this comment

pdurbin May 2, 2024

Choose a reason for hiding this comment

landreev May 6, 2024

Choose a reason for hiding this comment

younnoh commented May 6, 2024

landreev commented May 6, 2024

landreev commented May 6, 2024

younnoh commented May 6, 2024 • edited Loading

landreev commented May 6, 2024

younnoh commented May 7, 2024

landreev commented May 7, 2024

pdurbin commented May 7, 2024

younnoh commented May 7, 2024

landreev commented May 7, 2024

pdurbin commented May 9, 2024

younnoh commented May 9, 2024

qqmyers commented May 9, 2024

younnoh commented May 9, 2024

pdurbin commented May 9, 2024

landreev commented May 15, 2024

coveralls commented May 15, 2024

landreev commented May 16, 2024 • edited Loading

qqmyers commented May 22, 2024

landreev commented Apr 10, 2024 •

edited

Loading

younnoh commented May 6, 2024 •

edited

Loading

landreev commented May 16, 2024 •

edited

Loading