Improve transformation of who attributes #28

Merged
merged 6 commits into from
Sep 2, 2024

@cmil cmil commented Aug 27, 2024

This PR aims to make the transformation of who attributes and castList elements in the theatre-classique sources into FreDraCor particDesc elements more robust. Instead of trying to fix the mistakes and oddities in the sources programmatically, it attempts to implement the rules inherent in them. From now on, we expect the following of the sources:

  • who attributes are correctly spelled
  • speakers are identified by words or phrases, which may include spaces, dashes, numbers and accented characters; all of these are normalised when transformed into XML IDs
  • multiple speakers in the same who attribute are separated by a comma, an underscore or a forward slash (an " et " is not considered a separator)
  • the same speaker is always identified by the same word or phrase
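
The rules above can be illustrated with a minimal Python sketch. It is not the transformation code from this PR; the function names `split_who` and `normalise_id` and the exact normalisation scheme (lowercasing, stripping accents, joining words with dashes) are assumptions made for illustration only.

```python
import re
import unicodedata

# Separators named in the PR description: comma, underscore, forward slash.
# " et " is deliberately absent, so it stays part of the speaker phrase.
SEPARATORS = re.compile(r"[,_/]")


def normalise_id(label: str) -> str:
    """Turn a speaker word or phrase into an ID-like token (hypothetical scheme)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", label.strip())
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse every run of non-alphanumeric characters into a single dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", unaccented).strip("-")
    return slug.lower()


def split_who(who: str) -> list[str]:
    """Split a who attribute value into normalised speaker tokens."""
    parts = (part.strip() for part in SEPARATORS.split(who))
    return [normalise_id(part) for part in parts if part]
```

With this sketch, `split_who("LE COMTE, DON DIÈGUE")` yields two tokens, while `split_who("PIERROT et CHARLOTTE")` yields a single one, which is why such pairs have to be rewritten with an explicit separator in the sources.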

Deviations from these rules need to be corrected on the dracor branch of the theatre-classique repo. In fact, most of the TEI changes in this PR already result from changes on the dracor branch. See https://github.com/dracor-org/theatre-classique/compare/dracor for details.

resolve #15
resolve #20

@cmil cmil self-assigned this Aug 27, 2024
@cmil cmil force-pushed the 20-speakers branch 2 times, most recently from 590a57e to 474ef0f Compare August 31, 2024 08:06
Previously we only used the speaker tags when no who attribute values
were found at all. This has not been the case recently, because all
documents that include 'sp' elements also provide who
attributes. Sometimes, however, these are empty.
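
A per-element fallback of this kind might look like the sketch below. It assumes that the speaker tag is consulted whenever an individual who attribute is missing or empty, which the commit message implies but does not spell out. The function name `speaker_labels`, the un-namespaced element names and the trimming of trailing punctuation are illustrative assumptions, not code from this PR.

```python
import xml.etree.ElementTree as ET


def speaker_labels(sp_elements):
    """Collect one label per <sp>, preferring @who and falling back to <speaker>."""
    labels = []
    for sp in sp_elements:
        who = (sp.get("who") or "").strip()
        if who:
            labels.append(who)
            continue
        # @who is missing or empty: fall back to the text of the speaker tag.
        speaker = sp.find("speaker")
        if speaker is not None:
            text = "".join(speaker.itertext()).strip().rstrip(".,:")
            if text:
                labels.append(text)
    return labels


# Hypothetical sample, not taken from the theatre-classique sources.
sample = ET.fromstring(
    '<body>'
    '<sp who="ALCESTE"><speaker>ALCESTE.</speaker></sp>'
    '<sp who=""><speaker>PHILINTE.</speaker></sp>'
    '<sp><l>no speaker at all</l></sp>'
    '</body>'
)
labels = speaker_labels(sample.iter("sp"))
```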
We now recognize the underscore as an additional separator of individual speaker IDs
when multiple speakers occur in an sp. This requires fixing the use of underscores
in other cases on the dracor branch.

We also use a more sophisticated approach to transform IDs starting with a number: the number is now moved to the end of the ID.
This should be done on the dracor branch of the sources.
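
XML IDs must not begin with a digit, so a leading number has to go somewhere. The following sketch shows one way to move it to the end of the ID. The function name `move_leading_number`, the dash used to join the parts and the handling of separators are assumptions; the actual rules in this PR may differ.

```python
import re

# A leading run of digits, optional separators, then a rest that starts
# with something other than a digit or separator.
LEADING_NUMBER = re.compile(r"^(\d+)[-_ ]*([^\d\s_-].*)$")


def move_leading_number(raw_id: str) -> str:
    """Move a leading number to the end of the ID (hypothetical scheme)."""
    match = LEADING_NUMBER.match(raw_id)
    if match is None:
        # No leading number, or nothing but digits: leave the value alone.
        return raw_id
    number, rest = match.groups()
    return f"{rest}-{number}"
```

For example, `move_leading_number("2-soldat")` returns `"soldat-2"`, and an ID without a leading number is returned unchanged.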
@cmil cmil merged commit 9d0ab6b into main Sep 2, 2024
2 checks passed
@cmil cmil deleted the 20-speakers branch September 2, 2024 23:31
Successfully merging this pull request may close these issues.

unresolved pair characters with 'et' in many French plays
Optimize matching of castItems to particDesc