0514 ml usage tags #516

Open: wants to merge 17 commits into base: master

glenrobson
Member

For consideration & further discussion related to #514

Moved to the cookbook repo to build preview. Original Pull request #515

@glenrobson glenrobson mentioned this pull request Jul 12, 2024
@glenrobson glenrobson linked an issue Jul 12, 2024 that may be closed by this pull request
]
},
"rights": "http://creativecommons.org/licenses/by-sa/3.0/",
"requiredStatement": [
Member Author

I don't think requiredStatement can be an array; I think it needs to be a JSON object. If you remove the [] around requiredStatement, it should pass validation and start deploying the preview.

Ah got it, sorry for the extra [], removed

Member Author

Sorry, I'm being dim. It's not valid JSON now without the [], and looking at the Presentation API:

https://iiif.io/api/presentation/3.0/#requiredstatement

I think it's only possible to have one required statement. For this example, do you want to remove the attribution one?

Ok, thank you, just sent through another updated version with the attribution statement as a second value, hope that’s alright.

Member

This exemplifies the problem with the approach. Now you have two values with one label, but the label only applies to the first. This should be called out in the recipe description, and that this is a field intended for humans to read rather than machines to process. Even the best intentioned machine agent won't know what to do with this... at which point, just merge the two parts into one statement.
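
A minimal sketch of the "merge the two parts into one statement" suggestion, assuming the Presentation API 3.0 shape of requiredStatement as a single JSON object with `label` and `value` language maps. The helper name `merge_required_statement` and the sample strings are hypothetical and are not taken from the recipe.

```python
import json


def merge_required_statement(label, parts, lang="en"):
    """Build one requiredStatement object from several text parts.

    Hypothetical helper: joins the parts into a single value string so
    that the one label applies to the whole statement.
    """
    merged = " ".join(part.strip() for part in parts if part.strip())
    return {
        "label": {lang: [label]},
        "value": {lang: [merged]},
    }


# Placeholder strings for illustration, not the recipe's actual text.
statement = merge_required_statement(
    "Usage statement",
    [
        "Usage text for human readers.",
        "Picture taken by the IIIF Technical Coordinator.",
    ],
)
print(json.dumps(statement, indent=2))
```

The result is one object with exactly one value per language, so a client never has to guess which value the label belongs to.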

@alliomeria alliomeria Jul 12, 2024

Thanks for the feedback @azaroth42.

My reading of the specs is indeed that the requiredStatement property is really only for human-readable and displayable statements (the recipe states 'for humans', but could be more verbose about this in the general description text as well). I'm thinking of this initial recipe as complementary to, and a first step towards, the more (intentionally) machinable rights. I have a note about the WIP here:

* URIs to be pursued for machineable interactions, pending further discussions within the IIIF and wider repository communities during Summer 2024 and onwards.

(FWIW, also have a brief note about the actionability in general)

This is a bit out of scope for this particular recipe, but it might be worth having a larger discussion about requiredStatement only permitting a single object. I can see the use case for multiple objects/statements (regardless of these potential tags), such as repositories that often have both standardized and local rights statements, where it would be helpful to users to see the text for the standardized statement in full alongside the more local statements. Or where multiple language labels and paired value statements could be useful.

Remove errant brackets
Object structure change for validation
@alliomeria

Hi @glenrobson, looking at the build validation errors right now. Will follow up and try to correct those in just a few moments.

- Mirador
- UV
topic:
- text
Member Author

The topics can only be one of the following

  • basic
  • property
  • note
  • structure
  • annotation
  • image
  • AV
  • realWorldObject
  • geo-recipes
  • content-state

I think I would go for note at the moment.
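
As a rough illustration of the constraint above, a recipe's `topic` values could be checked against the allowed list before pushing. This is a hypothetical check, not the cookbook's actual validation code; `ALLOWED_TOPICS` and `invalid_topics` are names made up for this sketch.

```python
# Allowed topic values, copied from the list in the review comment.
ALLOWED_TOPICS = {
    "basic", "property", "note", "structure", "annotation",
    "image", "AV", "realWorldObject", "geo-recipes", "content-state",
}


def invalid_topics(topics):
    """Return the topics that are not in the allowed list, in input order."""
    return [t for t in topics if t not in ALLOWED_TOPICS]


print(invalid_topics(["text"]))  # the value the recipe originally used
print(invalid_topics(["note"]))  # the suggested replacement
```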

Got it, just changed to note, thanks Glen.

Correct topic for validation; update line references as needed for manifest.json changes
@@ -12,7 +12,7 @@
 "<p>Picture taken by the <a href=\"https://github.com/glenrobson\">IIIF Technical Coordinator</a></p>"
 ]
 },
-"rights": "http://creativecommons.org/licenses/by-sa/3.0/",
+"rights": "https://www.wikidata.org/wiki/Q127518037",

Cool!

Context update (:
@alliomeria

Ah, I see I'm not getting the Validation right for @context. Did I misinterpret the extension mechanism?

@alliomeria

alliomeria commented Jul 17, 2024

Follow up notes: thinking the Policy Extension Registry for listing that Wikidata Q ref is not actually actionable as written (does not set true context.json; this would not either).

Also, it seems the validator itself has set parameters for rights:
https://github.com/IIIF/presentation-validator/blob/c3283776ef60161e40677ac96632fa507602c0aa/schema/iiif_3_0.json#L248-L267

Can anyone point me to an example of valid usage of URIs in 'rights' that are not CC or RightsStatements.org? Any Local Contexts, Traditional Knowledge references, for example?

@DiegoPino

DiegoPino commented Jul 17, 2024

@alliomeria I agree. The JSON schema is fixed on those 3 base URLs, which differs from what the human-readable specs say.
JSON Schema syntax allows a conditional oneOf (or any listings) based on, e.g., another value found somewhere else, which could allow any external URL under certain conditions, e.g. if a different context was provided. But that leads to a question: the Presentation API (3.0) specs suggest the extension mechanism for URLs outside of the RightsStatements.org/CC domains, but it is not clear how the extension mechanism would have any effect at all on a "value" of an existing base property, or how that could be resolved in validation at all. From my understanding, the extension mechanism allows the use of other JSON keys/properties, as any additional @context would according to JSON-LD, but a mapping to another vocab/ontology does not necessarily define a "different" value for an existing key, especially a key like "rights", which is already mapped to a very permissive (good) dcterms:rights in terms of what goes there.

Maybe there is space for interpretation in the specs for this?
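
To make the mismatch concrete, here is a small sketch of a prefix check similar in spirit to what the validator schema enforces for `rights`. The three prefixes below are an assumption standing in for the "3 base URLs" mentioned above; the authoritative patterns are in the linked `iiif_3_0.json`, and the function name is hypothetical.

```python
# ASSUMPTION: these prefixes stand in for the base URLs the validator
# schema accepts; check the linked iiif_3_0.json for the real patterns.
ASSUMED_RIGHTS_PREFIXES = (
    "http://creativecommons.org/licenses/",
    "http://creativecommons.org/publicdomain/",
    "http://rightsstatements.org/vocab/",
)


def rights_passes_assumed_schema(uri):
    """Return True if the URI starts with one of the assumed prefixes."""
    return isinstance(uri, str) and uri.startswith(ASSUMED_RIGHTS_PREFIXES)


for uri in (
    "http://creativecommons.org/licenses/by-sa/3.0/",
    "https://www.wikidata.org/wiki/Q127518037",
):
    print(uri, rights_passes_assumed_schema(uri))
```

Under this assumption the Creative Commons URI passes and the Wikidata URI does not, regardless of what the manifest's @context declares, which is the gap between the schema and the prose spec described in this comment.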

@azaroth42
Member

I think the specification is somewhere between misleading and flat out wrong here. We need a registry of additional rights URIs with an explanation of what a client should do when it encounters them.

Created IIIF/api#2309 to this end.

Thanks @alliomeria for pushing into this somewhat unknown territory! :)

@DiegoPino

DiegoPino commented Jul 17, 2024

@alliomeria @azaroth42 to allow this recipe to validate against 3.0 while 4.0 figures this out, could an alternative additional JSON/JSON-LD property coming from e.g. schema.org (I'm thinking specifically of usageInfo) be used for the Wikidata ML tags? That way rights could still be CC-based while on 3.0, and [usageInfo](https://schema.org/usageInfo) could serve as a stub in the meantime. Might be a stretch (sorry, don't want to derail this great effort), and I don't know if schema.org is even valid in the IIIF specs as a registered extension.

Just an end-of-the-day idea, but it might cover the machinable part, since I am pretty sure some crawlers like Google do know how to map/parse a complete @context and thus its properties and values.

@azaroth42
Member

You could create a JSON-LD context document that defines a new property for IIIF (either de novo or by mapping from an existing ontology like schema) for sure. It would then fall into the extensions part of the spec directly and you could put whatever values you wanted in it.
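
A sketch of what that could look like, assuming a context document that maps a `usageInfo` property to schema.org. The context URL, the manifest `id`, and the label are hypothetical placeholders, and whether this passes the current validator is untested. The manifest is trimmed for brevity; a real Manifest also needs `items`.

```python
import json

# Hypothetical location where the extension context would be hosted.
EXTENSION_CONTEXT_URL = "https://example.org/contexts/usage-info.json"

# Context document defining the new property by mapping it to schema.org.
extension_context = {
    "@context": {
        "usageInfo": {
            "@id": "https://schema.org/usageInfo",
            "@type": "@id",  # values are URIs, not plain strings
        }
    }
}

# Trimmed manifest: rights keeps a CC URI, usageInfo carries the Wikidata tag.
manifest = {
    "@context": [
        EXTENSION_CONTEXT_URL,
        # With multiple contexts, the IIIF context goes last in the array.
        "http://iiif.io/api/presentation/3/context.json",
    ],
    "id": "https://example.org/iiif/manifest.json",
    "type": "Manifest",
    "label": {"en": ["Example manifest"]},
    "rights": "http://creativecommons.org/licenses/by-sa/3.0/",
    "usageInfo": "https://www.wikidata.org/wiki/Q127518037",
}

print(json.dumps(extension_context, indent=2))
print(json.dumps(manifest, indent=2))
```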

@alliomeria

alliomeria commented Jul 18, 2024

> I think the specification is somewhere between misleading and flat out wrong here. We need a registry of additional rights URIs with an explanation of what a client should do when it encounters them.
>
> Created IIIF/api#2309 to this end.
>
> Thanks @alliomeria for pushing into this somewhat unknown territory! :)

Thanks for pushing this into a discussion of potentially revising the specs for 4.0 (*1.4.0/1.3.0 = Archipelago versioning 🤓 ), @azaroth42. :) I was a bit puzzled trying to piece together how to work with the extensions mechanism as currently described.

For 3.0, I appreciate your suggestions, @DiegoPino, for a potentially actionable way to address this right now. Happy to give that mode an attempt...

Diego, Rob, Glen, or anyone watching this issue, what else might you suggest as ways to work within the current specs to provide valid, actionable rights statement (broader sense of this terminology) declarations?

Also, what exactly is the functional result of a client hitting the current version of rights URIs from CC/RightsStatements.org?

@alliomeria

Looking around a bit more, came across this: https://github.com/IIIF/api/blob/main/source/registry/rights/index.md

Maybe timely for proposing filling out for CC/RightsStatements.org, and inclusion of alternate rights statements, including Local Contexts notices, Traditional Knowledge labels, and others (like these 🙃 )?

Or perhaps the Rights Registry is an unused/abandoned area?

@glenrobson
Member Author

Discussed in cookbook meeting. Suggested way forward:

If this gets through TRC, look at adding other rights statements to the registry

Return to normal @context
Update statements related to Wikidata, rights, registries
To sidestep validation error, per Glen
remove indentation
@kirschbombe
Contributor

Just FYI on the Registry pages: we've been working on adding Registry pages and moving things from the annex to the Registry as we make updates. I haven't worked on it in a bit, but probably have a branch with the in-progress changes. If the group comes together on agreed URIs to add, let me know and I will add them to my draft. I can also prioritize work to update the Rights registry page.

both lines (:
@kirschbombe
Contributor

Here's the draft PR for the Registry page with a preview link: IIIF/api#2248

Bump down JSON snippets for `rights`
- Adjust order of rights & requiredStatement
- Add more detailed information about the URIs, potential current & potential future machinable, and notes detailing the Example shown
@alliomeria

Based on the feedback received on this PR and during today’s cookbook call, I made a few additional updates to the recipe. Check it out here: https://preview.iiif.io/cookbook/0514-ml-usage-tags/recipe/0514-ml-usage-tags/

Thank you very much to everyone who shared helpful feedback and recommendations so far, I really appreciate your time and consideration. Looking forward to continuing to work with the community to discuss this and potentially move further along through the official pipelines.

@alliomeria

Hello everyone watching this pull request 👋

During yesterday’s IIIF AI + ML Group Meeting, I had the opportunity to present again on this proposal for ML/AI Usage statements in IIIF Manifests. In the presentation follow up discussions, Ellen Van Keer shared an interesting question/comment about how these statements might work within the context of the recently enacted EU AI Act. Specifically, Ellen was concerned that EU organizations might not be able to apply these usage statements if they were not the primary copyright holder for a given object/resource.

Ellen, thank you so much for raising this important topic during the call. If possible, could you please provide references supporting the concern that institutions may not be able to apply any kind of 'opt-out' or usage statement unless they are the primary copyright holder for a work? I'm also curious how this might apply to other usage/rights statements already at play currently.

From what I am reading stateside, I am not seeing the text for opt-out mechanisms or usage statements described in that particular way. In the law itself (EN version here), I see this text: "Where the rights to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightsholders if they want to carry out text and data mining over such works."

In legal analyses and practical discussions, such as the ones noted here and a few more below this message (sorry for the many links, trying to go through due diligence), it seems like the EU AI Act and TDM Directive can be interpreted differently and are still not yet fully defined.

From this source:

"The EU AI Act contains a provision that equates AI/machine learning with “text-and-datamining” (TDM) under the EU Text and Data Mining Directive.[1] Consequently, “machine learning” is allowed, provided that:

  • the person programming the machine-learning functionality has had lawful access to the content for the purpose of text and data extraction; and
  • the owner of the copyright and related rights and/or the database owner have not expressly reserved the extraction of text and data (the so-called opt-out mechanism).

The EU AI Act is expected to enter into force in 2024 and will fully apply 24 months thereafter. However, the TDM exception under the EU Text and Data Mining Directive already exists. Therefore, the TDM exception for machine learning can already be enforced in anticipation of the EU AI Act’s interpretation.”

Would anyone be willing to share their perspective on the potential procedures or enforcement mechanisms at play for the EU AI Act and TDM Provision in terms of opt-out requests, or the applicability of any kind of usage statements? Is this an area where variance between institutions and local policies is anticipated?

In any case, I think that these ML/AI Usage Statements could still be useful, and even provide an actionable mechanism for helping an institution comply with an "opt out" request that was "expressly reserved in an appropriate manner".

Thanks again for bringing up this important related potential factor, Ellen. And thanks to everyone who has been taking this proposal into consideration and sharing feedback. I really appreciate everyone's time and expertise.

Additional Links:

@alliomeria

Hello everyone who may be watching this repo/issue, checking in to see if anyone might have time to add some follow-up comments to the issue discussion noted here. It would be great to have more perspective about the EU AI Act & TDM Provision considerations. Thanks for your time! (I also gave a IIIF Slack ping, so apologies for the redundancy in messages related to this.)

@veesalu

veesalu commented Aug 20, 2024

I believe Ellen referred to the DSM (Digital Single Market Strategy) which also applies to digital repositories of cultural heritage institutions in the EU.

The copyrights and licenses question is rather firmly regulated and only the rights holder has the right to assign licenses or access/usage terms to the works in copyright. CHIs in the EU can't legally assign licenses for works for which we don't own the rights.

There are different ways for getting material into the collections, but for the National Library of Estonia that is based on the Legal Deposit Copy Act. A publication must be submitted to NLE and we have the obligation to preserve it long term and make it available in accordance with the Copyright Act. During the act of deposit and based on the Copyright Act, the rights holder assigns licenses and/or terms (in our case either CC or RightsStatements) based on their wishes and intentions. We also don't have the right to change access and usage terms of orphan and out-of-commerce works based on our own judgement; we use EUIPO's portals in order to get the grounds to make them available.

Now, I guess it depends on how the ML usage tags are defined in the landscape of access and rights. The first idea that also came to my mind during the call was that we could use the tags for our own publications and to others we could add "check with the rights holder". Like Ellen, I doubt we could legally bindingly apply the tags for other rights holders' works if the ML usage tags are defined as licenses or usage conditions similar to CC or RightsStatements.

Regarding the TDM, data and text mining of works in copyright can only be done on the premises of NLE, and outside researchers can only leave the premises with cleaned and worked-on data. Also, the research must be done in a "motivated" amount, meaning they can't collect and use our entire collection. In order for a researcher or research institution to get access to the data, they need to file an application in which they present their research, justify the need for the data and explain what they do with it. So, the TDM simply doesn't mean that anyone who says that they do research has automatic access to the data. There was a legal analysis carried out for our digital lab, there's a summary in English -> https://digilab.rara.ee/wp-content/uploads/2023/03/Virtual-LAB_eng_oigusanaluus.pdf

Another issue that I have been thinking about (and have yet to reach a conclusion on, also for myself) is that in Europe, we are in the make-it-accessible-and-reuse-freely stage. The European Commission is geared towards accessibility, popularisation and re-use of cultural heritage. It's quite impossible to get funding for infrastructure or software if you don't plan "smart solutions". At the NLE we are currently applying for funding for three (two EC funded) AI/data science projects and we are building our own ML solution for automatic cataloguing. We have the European Collaborative Cloud for Cultural Heritage and Common European Data Space for Cultural Heritage, which both aim to reduce duplication of data and improve collaboration. I'm not sure how easy a sell the ML usage restrictions could be in Brussels. But this already is a whole other discussion in itself.

@alliomeria

Thank you for your thoughtful follow up veesalu, and for providing a link to the NLE's analysis that informs your institution's approach to TDM. It's really interesting to read through the perspective and the context you're all working with. Good luck with your pursuits of AI/data science projects and ML cataloging assistance tools. Looking forward to reading about your outcomes down the road. :)

Related to this topic, I wanted to note that CC is moving towards being receptive to the idea that perhaps creators should be able to have additional options within the CC licensing framework, loosely termed as "preference signals" in this blog piece where they discuss their early explorations on this concept: https://creativecommons.org/2024/07/24/preferencesignals/. (I will be reaching out to CC to ask about this, maybe there's some space for collaboration at a shared table.)

I understand that applying nuance within the frameworks of open sharing culture can be challenging. That said, I still think there are ways we can better attune our practices to the complexities around the considerations artists, authors, and other creators and content caretakers are facing in the modern AI/ML internet landscape.

@veesalu

veesalu commented Sep 13, 2024

That was very interesting reading, thanks!

I lean towards agreeing that tags are of use; the issue is that in the EU only the rightsholders can opt in or opt out, and in most cases we are not the rightsholders. So implementing those is not just an in-house project of deciding that from now on we do it like that; it needs a bigger change in processes in general. I'm a little bit on the fence about this issue, because, on the one hand, I feel that we should make as much of CH accessible as possible, but on the other hand, we need to consider reasonable infrastructure loads, etc.

Another issue in using or not using CH data in ML is that right now LLMs don't really speak small languages (like Estonian), and the text corpora held in our collections are valuable material for training the models.

I do agree that this is a discussion that we should continue.

@alliomeria

alliomeria commented Sep 16, 2024

This is definitely a topic and practice area with a lot of nuanced considerations at play, for cultural heritage and other fields that may end up using ML technology. I think the true usefulness of ML-assisted tools really remains to be seen in many ways, and I hope that there will be actual comparative studies conducted in CH/GLAM for analyzing the effectiveness of ML tools compared with traditional practices and other technical approaches. I also hope that we can have an impact on how particular ML technical applications are developed for our field.

Thanks again for sharing your time and feedback related to this issue.

Development

Successfully merging this pull request may close these issues.

ML/AI Usage Tags Recipe
6 participants