Merging phrases in speech-db #457

Open
2 tasks done
fbanados opened this issue Sep 5, 2024 · 10 comments
Labels: enhancement, question, requires-linguist-work

Comments

@fbanados
Member

fbanados commented Sep 5, 2024

Sometimes there are multiple entries for the same phrase. There should be an interface to merge them, and a script to automatically merge entries whose transcription and translation are the same.
(Split from #444)

  • Exact same phrases (transcription+translations) should be merged automatically
  • An interface for manually merging phrases would be useful.
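As a rough illustration of the automatic case, here is a minimal sketch that groups entries by exact (transcription, translation) pairs. It works on plain dictionaries rather than the real speech-db models; the keys `id`, `transcription`, and `translation` and the sample values are assumptions made only for this example.

```python
from collections import defaultdict


def find_exact_duplicates(phrases):
    """Group phrase ids that share the same (transcription, translation) pair."""
    groups = defaultdict(list)
    for phrase in phrases:
        key = (phrase["transcription"], phrase["translation"])
        groups[key].append(phrase["id"])
    # Only pairs seen more than once are candidates for automatic merging.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up placeholder data, not taken from the database.
phrases = [
    {"id": 1, "transcription": "aaa", "translation": "xxx"},
    {"id": 2, "transcription": "aaa", "translation": "xxx"},
    {"id": 3, "transcription": "aaa", "translation": "yyy"},
]
duplicates = find_exact_duplicates(phrases)
```

Entries 1 and 2 end up in one group, while entry 3 stays out because its translation differs; entries like that would be left for the manual interface.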
@fbanados
Member Author

fbanados commented Sep 5, 2024

@fbanados
Member Author

fbanados commented Sep 5, 2024

This example raises interesting issues, as analysis and translations are different:
[Image: Screenshot 2024-09-05 at 4 24 19 PM]

@fbanados
Member Author

fbanados commented Sep 5, 2024

Question: What is the full set of fields that defines two phrases as "the same" here, as in "it would be OK to automatically merge them and pick any of them"? I'm wondering in particular about the fields in the database beyond transcription and translation:

  • field_transcription
  • analysis
  • comment
  • status
  • semantic class (RW)
  • modifier

I'm currently requiring all fields to be the same, but that is too strict in some cases. I am inclined to disregard differences in field_transcription (which arise, e.g., when a change to the transcription has since made the two entries identical), modifier (the person who last touched the entry), and semantic class (RW needs to be regenerated anyway). For the others I don't know; this would require a linguist's decision.
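To make the question concrete, here is a small sketch of that comparison: two entries count as "the same" when they agree on every field outside an ignore list. The dictionary representation, the key name `semantic_class`, and the sample status values are assumptions for illustration only; whether analysis, comment, and status belong in the ignore list is exactly the open linguist decision.

```python
# Fields whose differences are disregarded (the inclination described above).
IGNORED_FIELDS = {"field_transcription", "modifier", "semantic_class"}


def differing_fields(a, b, ignored=IGNORED_FIELDS):
    """Return the sorted names of non-ignored fields on which a and b disagree."""
    names = (a.keys() | b.keys()) - ignored
    return sorted(name for name in names if a.get(name) != b.get(name))


def safe_to_auto_merge(a, b):
    """True when the two entries agree on every field that is not ignored."""
    return not differing_fields(a, b)


# Made-up placeholder entries.
a = {"transcription": "aaa", "translation": "xxx",
     "field_transcription": "aab", "modifier": "user1", "status": "new"}
b = {"transcription": "aaa", "translation": "xxx",
     "field_transcription": "aaa", "modifier": "user2", "status": "new"}
c = dict(b, status="reviewed")

same_ab = safe_to_auto_merge(a, b)   # differ only in ignored fields
diff_ac = differing_fields(a, c)     # differ in status as well
```

Returning the list of differing fields, rather than just a boolean, would also make it easy to report why a pair was skipped.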

fbanados added the enhancement, question, and requires-linguist-work labels Sep 5, 2024
fbanados self-assigned this Sep 5, 2024
@aarppe
Collaborator

aarppe commented Sep 5, 2024

I had originally been thinking that if the transcription (in its latest state, so not necessarily the field transcription) and the translation (excluding spaces at the edges) are exactly the same, then the entries could be merged. For the other fields, if a value occurs for only one entry and not the other, or is exactly the same for both entries, then one could use that value. For other fields that do not match, one could combine the values in the merged entry.

But would this result in ambiguous cases?
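One way to read the rule above is sketched below, under stated assumptions: entries are plain dictionaries, empty strings and `None` count as "not present", and conflicting values are combined by joining them with `" | "`. The separator and the helper names are hypothetical, not part of speech-db.

```python
def merge_key(entry):
    """Entries are mergeable when transcription and edge-stripped translation match."""
    return (entry["transcription"], entry["translation"].strip())


def merge_field(values):
    """Use the single distinct non-empty value, or combine several distinct ones."""
    present = []
    for value in values:
        if value not in (None, "") and value not in present:
            present.append(value)
    if not present:
        return None
    if len(present) == 1:
        return present[0]
    return " | ".join(present)  # assumed way of "combining" mismatched values


def merge_entries(entries):
    keys = {merge_key(entry) for entry in entries}
    if len(keys) != 1:
        raise ValueError("entries differ in transcription or translation")
    transcription, translation = keys.pop()
    merged = {"transcription": transcription, "translation": translation}
    for entry in entries:
        for name in entry:
            if name not in merged:
                merged[name] = merge_field([e.get(name) for e in entries])
    return merged


# Made-up placeholder entries.
a = {"transcription": "aaa", "translation": " xxx ", "comment": "checked", "analysis": "A"}
b = {"transcription": "aaa", "translation": "xxx", "comment": "", "analysis": "B"}
merged = merge_entries([a, b])
```

In this reading no case is left undecided: a field present on one side is carried over, and a genuine mismatch (here `analysis`) survives as a combined value that a linguist can review later.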

@fbanados
Member Author

I don't think it results in ambiguous cases. I was designing an interface for automatically listing all possible candidates, but that will be unnecessary once all the ambiguous cases are dealt with. Also, I would not be surprised if listing every candidate made the interface take too long to load and time out, so I think it's better to do the automatic merging separately. In general, automatic merging can be done with a manage.py command, and we can keep the interface just for search and merge.
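A sketch of what the batch step could look like when kept apart from the interface. This is plain Python, not the actual command: in a Django project, logic like this would typically be called from the `handle()` method of a management command. The function names, the `dry_run` flag, the `apply_merge` callback, and the choice to keep the lowest id are all assumptions for illustration.

```python
from collections import defaultdict


def plan_merges(phrases):
    """Return (keep_id, drop_ids) pairs for entries with identical keys."""
    groups = defaultdict(list)
    for phrase in phrases:
        key = (phrase["transcription"], phrase["translation"].strip())
        groups[key].append(phrase["id"])
    plan = []
    for ids in groups.values():
        if len(ids) > 1:
            keep_id = min(ids)  # assumption: any entry may be kept, so keep the lowest id
            plan.append((keep_id, sorted(i for i in ids if i != keep_id)))
    return plan


def run_batch(phrases, dry_run=True, apply_merge=None):
    """Print the plan on a dry run; otherwise hand each merge to apply_merge."""
    plan = plan_merges(phrases)
    for keep_id, drop_ids in plan:
        if dry_run or apply_merge is None:
            print(f"would merge {drop_ids} into {keep_id}")
        else:
            apply_merge(keep_id, drop_ids)
    return plan


# Made-up placeholder data; the callback just records what would be merged.
phrases = [
    {"id": 1, "transcription": "aaa", "translation": "xxx"},
    {"id": 2, "transcription": "aaa", "translation": "xxx "},
    {"id": 3, "transcription": "bbb", "translation": "yyy"},
]
applied = []
preview = run_batch(phrases, dry_run=True)
result = run_batch(phrases, dry_run=False,
                   apply_merge=lambda keep, drop: applied.append((keep, drop)))
```

Running the dry run first gives a list that can be reviewed before anything is changed on the server.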

@fbanados
Member Author

Code is ready. Whether to run the Django command on the server is to be discussed via email.

@aarppe
Collaborator

aarppe commented Sep 19, 2024

Adding the linguist-administrator role is needed for linguists to undertake merging of individual items. That would be useful for checking the behavior with individual entries before, or instead of, running the merge wholesale computationally.

aarppe reopened this Sep 19, 2024
@aarppe
Collaborator

aarppe commented Nov 8, 2024

@fbanados I don't think we need to delay this any further: I've been able to observe that merging individual entries works properly, so we can proceed with computationally merging all the entries whose transcriptions and translations are exactly the same.

@fbanados
Member Author

fbanados commented Nov 8, 2024

I will make a database backup before merging.

@fbanados
Member Author

fbanados commented Nov 8, 2024

Entries merged.
