Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8243 improve language controlled vocab #10481

Merged
merged 13 commits into from
May 22, 2024

Conversation

landreev
Copy link
Contributor

@landreev landreev commented Apr 10, 2024

What this PR does / why we need it:

This PR updates the Citation metadata block sanitizing the Controlled Vocabulary for the field "language". It replaces the draft PR #10197 made late last year.

The nature of the cleanup is as follows:
The CVV list is brought up to the following standard:

  • all the languages from he ISO 639-1 list are in it;
  • all the corresponding 2-letter codes from ISO 639-1 are in the vocab as "alternates" (in practice this means that, for example, both <dc:language>English</dc:language> and <dc:language>en</dc:language> in oai_dc are importable);
  • similarly, the 3-letter codes from ISO 639-2 and ISO 639-3 for the languages above should also be there as alternates. in other words, <dc:language>eng</dc:language> is also valid for the purposes of import;
  • if multiple 2- or 3-letter codes exist for the same language, all should be in the CVV as alternates as well;
  • for some languages, alternative variants of the language names may now be considered preferable, these extra names have been added as well. For example: language Slovene 146 slv sl Slovenian;
  • as discussed in Sanitize languages controlled vocabulary values #10197, no changes have been made to the "main" CVV values (such as the name Slovene in the last example). This is to accommodate our model of metadata block updates which relies on matching on that name. If it ever becomes necessary to change any of the main names for whatever practical reason, we will have to do that via a Flyway update. The plan is to avoid that, at least for now; EDIT: We actually had a solution for this problem all along! The CVV entries in the block have a field reserved for the "identifier", that will match the value to its replacement, even with the main value changed. We apparently never used this field for the languages, but as of this PR, it is now populated (using the first 3-letter code available for the language as such permanent identifier). This means that going forward, we'll be able to change "Slovene" to "Slovenian", etc., if it ever becomes necessary.
  • the PR addresses the issue with some of the languages in our block where multiple alternative names have been listed in the main CVV itself. For example:
    language Navajo, Navaho 109 nav nv. For such cases both forms have been separately added as alternates, making either Navajo or Navaho importable on their own, as in:
    language Navajo, Navaho 109 nav nv Navajo Navaho
  • there is a separate issue discussing further expanding the list of supported languages; there are thousands of entries on the full ISO 639-3 list, so it doesn't seem practical to just add all of them to the block, so some alternative mechanisms would be needed - either handle it on a case by case basis on demand, or via a Remote CV (?). That issue is NOT being addressed in this PR.

Which issue(s) this PR closes:

Closes #8243

Please note the issue is old and much of the information in the opening description already obsolete. The description above should be accurate.

Special notes for your reviewer:

Suggestions on how to test this:

I'm not sure what QA for this should be like, tbh. One could try and actually test that the extra language designations added in this PR are now importable; take a test json file - like the standard dataset-create-new-all-default-fields.json and try entries like

{
    "typeName": "language",
    "multiple": true,
    "typeClass": "controlledVocabulary",
    "value": [
       "Navajo",
       "Slovenian",
       "kir",
       ... etc.
    ]
}

... but a better test would likely be to look at the database entries for the DatasetFieldType "language" in the controlledvocabularyvalue and controlledvocabalternate tables, before and after the metadatablock update, and confirm that
a. the number of entries and the values in controlledvocabularyvalue are identical before and after
b. more values have been added to controlledvocabalternate, and all the values from before are still there
c. that the end result (after) is identical to what's in these tables when the new block is installed on a brand new, empty database.

Not sure if it's worth going to such trouble - but figured I'd leave it here.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@landreev landreev marked this pull request as ready for review April 12, 2024 17:19
@stevenwinship stevenwinship self-assigned this Apr 12, 2024
@stevenwinship stevenwinship removed their assignment Apr 12, 2024
@cmbz cmbz added GREI 2 Consistent Metadata Size: 3 A percentage of a sprint. 2.1 hours. labels May 2, 2024
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a change to Malayalam? Please see my comment in this review.

Comment on lines 241 to 242
language Malay 100 may msa ms
language Malay 100 msa may ms
language Malayalam 101 mal ml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a conversation going on where Youn Noh at Yale is proposing these changes:

ORIG:
	language	Malayalam		101	mal	ml

NEW:
	language	Malayalam	mal	4016	mal	mal	ml

Should we get this change in as well?

Please see https://groups.google.com/g/dataverse-community/c/rGkaHuzA2x4/m/GeyOF1kpAgAJ

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdurbin The change above doesn't make much sense, so no, we certainly don't want it as is. I'm going to add a comment in the google group thread reiterating an earlier suggestion for the author to take a closer look at the opening comment in this PR, since it answers some of the questions asked there.

@younnoh
Copy link

younnoh commented May 6, 2024

iqss_pr_10481.zip contains a sample script for citation.tsv in its original form and as modified by the script. The ISO 639-3 code tables are available for download from SIL; the input file from them is included in the attachment as well. (Please let me know if there are any issues with the attachment.) I think this clarifies the change we would like to see. I used the ISO 639-3 language code as the value for identifier and adhered to the existing column order in columns following displayOrder for mappings to previous versions of ISO 639. As I understand it, the mappings are needed to maintain compatibility with metadata standards such as OAI DC that use those versions. I had thought about excluding the rows from the ISO table where Scope and Language have a value of "S" but ended up leaving them to remain faithful to the standard. (IMPORTANT: We need to switch to ISO 639-3 to meet the request of a professor of linguistics who has graciously served as one of our beta testers. The languages needed by her research group are not in ISO 639-2.)

@landreev
Copy link
Contributor Author

landreev commented May 6, 2024

@younnoh Thank you, I'll take a look.

@landreev
Copy link
Contributor Author

landreev commented May 6, 2024

@younnoh If you haven't yet, please read carefully the opening comment for this PR that explains its scope and limitations.

@younnoh
Copy link

younnoh commented May 6, 2024

  1. "there is a separate issue discussing further expanding the list of supported languages" - Could you please provide a link?
  2. Is it within the scope of this PR or (1) to change the value in the identifier column to the language code? I believe this is standard practice (e.g., see languages here - https://id.loc.gov/). Names or codes from previous versions could be mapped via lookup tables.
  3. Could you please identify any additional problems with the changes I've proposed with regard to this PR or (1)? Thank you!

@landreev
Copy link
Contributor Author

landreev commented May 6, 2024

@younnoh The issue in question is #8578. It's been sitting untouched for a while, but please note that it's been recently tagged as "sprint ready", which means it is being scheduled to be revisited and worked on sometime soon (probably as soon as this PR is merged), and we'll use that issue to figure out what to do with/how to handle the full ISO 639-3 list.

I'm in the middle of something now, but I'll try to get back to you with more info/answers to your other questions ASAP.

One quick thing: please note that in the context of this issue, or the other issue referenced, we are talking about making changes to the "official" version of citation.tsv that we distribute with the application. But the beauty of keeping our "metadata blocks" configurable, and not hard-coded in the application, is that you don't really have to use the configuration that we supply. Within the core dev. team we've had somewhat of a consensus in that we don't want to add the full 639-3 list to the CV, in part because we don't believe that most users will want that pulldown Language menu on the Edit Dataset page to have 8K entries in it. But, if that's something your users do want to have, nobody stops you from adding these CVVs to the tsv file and adding them on your own instance.

@younnoh
Copy link

younnoh commented May 7, 2024

Thank you for the link to the issue and the update on your future sprint. I would like to raise the points below for your consideration as you decide how to proceed.

  1. For a number of researchers, I think that the language field describes a key aspect of their research commensurate in importance to (for example) coordinates in geospatial metadata.
  2. Not only to keep abreast with current standards, but also to add diversity by including underrepresented languages and linguistic communities, I think it would be beneficial to use ISO 639-3.
  3. If resources permit, I think the usability issue presented by a long list of terms could be addressed through an autocomplete feature that would be generally useful for other controlled vocabularies that are not readily scrollable on a range of devices.
  4. While the flexibility of having customizable metadata blocks is a wonderful feature of Dataverse, as for everyone, the resources here are limited. I would prefer to try to get our DevOps and Software Engineering teams involved in community work on Dataverse integrations and experimental metadata blocks (e.g., for computational workflows). I also think it might be a better use of collective resources to distribute the change to the controlled vocabulary through an official release, thereby allowing community members to leverage their resources where they can make a unique contribution, for example, by providing access to feedback from researchers working on a wide range of research topics.

@landreev
Copy link
Contributor Author

landreev commented May 7, 2024

  1. Is it within the scope of this PR or (1) to change the value in the identifier column to the language code?

Short answer: yes, I'll do this.
Longer/more embarrassing answer: I didn't know/had no memory about that "identifier" column, and what its purpose was. When I said, in the opening comment:

as discussed in #10197, no changes have been made to the "main" CVV values (such as the name Slovene in the last example). This is to accommodate our model of metadata block updates which relies on matching on that name...

- I just wasn't aware that we had this identifier, for the very purpose of solving this problem: to able to change the primary CVV when updating a metadata block. In my defense, nobody on our dev. team here corrected me on that.
This doesn't change the fact that, for the purposes of this PR, we still can't touch the main values (as opposed to the alternates), since the identifiers are not there yet. But it does make sense to add them in this PR, so that this isn't going to be a problem going forward.
And yes, using the 3 letter ISO codes as the identifiers makes perfect sense too.

@pdurbin
Copy link
Member

pdurbin commented May 7, 2024

@younnoh just a quick question on this point:

If resources permit, I think the usability issue presented by a long list of terms could be addressed through an autocomplete feature that would be generally useful for other controlled vocabularies that are not readily scrollable on a range of devices.

It's already possible to type a few characters as in the example below. Are you looking for something else?

Screenshot 2024-05-07 at 1 38 45 PM

@younnoh
Copy link

younnoh commented May 7, 2024

Thanks, Phil. That's exactly what I meant. (I just requested greater permissions here to be able to do more testing before I comment.)

@landreev
Copy link
Contributor Author

landreev commented May 7, 2024

@younnoh @pdurbin
The way the existing menu works - with the contents of the menu filtered/reduced as the user starts typing - you start with the pulldown fully populated; since this conversation was in the context of supporting a CV with 8K+ values, I still feel somewhat strongly that we don't want the default version of that menu to contain that many entries, for all Dataverse users. The way this is implemented with External Vocabs may be a better fit for this use case.

@landreev landreev self-assigned this May 8, 2024
@pdurbin
Copy link
Member

pdurbin commented May 9, 2024

@younnoh no problem. I'm not sure what you mean by permissions but you should be able to test the language field over at https://demo.dataverse.org and see the same UI (and get the same UX) as the screenshot above.

By the way, I started a thread on autocomplete here: https://dataverse.zulipchat.com/#narrow/stream/434038-ux-wg/topic/autocomplete.20for.20language/near/437500258

@younnoh
Copy link

younnoh commented May 9, 2024

@pdurbin Thanks for the reminder about the demo Dataverse and the link to the thread on autocomplete. Autocomplete works well from the login page for the demo when you want to select your institution. (Sorry for any confusion caused by my previous comment. I just got around to creating an account for my institution's test Dataverse instance and will use it do further testing.)

@qqmyers
Copy link
Member

qqmyers commented May 9, 2024

FWIW: The autocomplete in the author affiliation field is from an external vocabulary script using the https://ror.org vocab/api. The autocomplete for the lang field is Dataverse's internal one for CVVs listed in the metadata blocks.

@younnoh
Copy link

younnoh commented May 9, 2024

Is there a date by which it will be determined if (and when) ISO 639-3 will be adopted as citation metadata as part of an official release or update? I need to get back to my team on when we'll know to wait or to load as custom. Thanks!

@pdurbin
Copy link
Member

pdurbin commented May 9, 2024

@younnoh no, but please keep an eye on #8578. It sounds like we might re-title it to be about ISO 639-3 in general.

@landreev
Copy link
Contributor Author

Sorry, I got distracted/forced to work on some urgent things instead. But I will finish it shortly (just wanted to populate the Identifier field in the CVV, otherwise leaving the PR as is; I believe it is ready to be merged otherwise).

I suggest we take the discussion of whether, or how to support the full extended 639-3 list to #8578. Somebody else from the team is looking into it/some activity is happening there.

@coveralls
Copy link

Coverage Status

coverage: 20.587% (-0.003%) from 20.59%
when pulling 93fb1f5 on 8243-improve-language-controlled-vocab
into da3dd95 on develop.

…dates easier. Used the first 3-letter code as the identifier for each of the 185 supported languages. #8243
@landreev
Copy link
Contributor Author

landreev commented May 16, 2024

OK, populated the Identifier field for the supported languages. Moving back to "ready for QA"; will try to coax somebody to merge it soon-ish, maybe even tomorrow.

@landreev landreev removed their assignment May 16, 2024
@qqmyers qqmyers self-assigned this May 22, 2024
@qqmyers
Copy link
Member

qqmyers commented May 22, 2024

QA: Verified that after the update all cvv's except "Not applicable" now have an id, that additional alternates have been added, that deleting from the db and re-loading the block recreates the same number of values in both tables, spot checked for specific values, verified that adding vals works in the UI, that editMetadata api works with the strValue of the cvv.

I'll go ahead and merge.

Note the one failed test is due to the know confirmEmail issue fixed in a different PR.

@qqmyers qqmyers merged commit 6895034 into develop May 22, 2024
1 of 2 checks passed
@qqmyers qqmyers removed their assignment May 22, 2024
@pdurbin pdurbin added this to the 6.3 milestone May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GREI 2 Consistent Metadata Size: 3 A percentage of a sprint. 2.1 hours.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request/Idea: Sanitize languages controlled vocabulary values
7 participants