Mark up text for automated string extraction #1477

Open
21 of 28 tasks
emmajclegg opened this issue Jul 12, 2024 · 23 comments
Labels
Multilingual Interface ODS Issue initiated by ODS

Comments

@emmajclegg
Collaborator

emmajclegg commented Jul 12, 2024

This is a first step to translating IATI Publisher's interface into French and Spanish, following the approach discussed here: #1420

YI will prepare text for automated extraction from IATI Publisher, ODS will review and get it translated, then YI will reintegrate text back into IATI Publisher.

Tasks

  • Understanding the requirement for automated string extraction
  • Create an API for updating the locale language #1560
  • Create API for sending the translated texts to the FE #1562
  • Create UI for updating the system language
  • Add placeholders in public pages and add English texts from those pages in PHP files with proper folder structure
  • Write a script to read the language files and download an XLS file with 3 columns for English, French, Spanish
  • Upload the translated files in the language files by reading the XLS file
  • Test the public pages for translation
  • Add placeholders in remaining pages, common buttons and notifications and add English texts from those pages in a JSON file
  • Upload the translated files in the language files by reading the XLS file
  • Complete the translation of the system
  • Document how to add translation for future development and changes

Test extraction by modules

@emmajclegg emmajclegg added the ODS Issue initiated by ODS label Jul 12, 2024
@emmajclegg emmajclegg self-assigned this Jul 12, 2024
@emmajclegg emmajclegg changed the title (Placeholder) Mark up text for automated string extraction Mark up text for automated string extraction Jul 15, 2024
@emmajclegg emmajclegg added Translation Work to provide a multi-lingual IATI Publisher interface and removed Translation Work to provide a multi-lingual IATI Publisher interface labels Jul 29, 2024
@emmajclegg emmajclegg added this to the Multilingual interface milestone Jul 29, 2024
@emmajclegg emmajclegg assigned praweshsth and unassigned emmajclegg Jul 30, 2024
@robredpath
Collaborator

Following our conversation this week, I wanted to share an outline of the process that you'll need to follow.

It's important that this is an automated process that is part of your standard workflow, so that any changes to text are detected and translated quickly in the future.

In my experience, this is usually achieved by marking up the text in some way. See, for example, this Django template from one of our projects, which wraps each sentence in {% blocktrans %} tags which signal to Django's i18n module that the string should be included in the translation process. Our workflow uses .pot/.po files.

I can see that there are already some files with lists of strings that look like a translation mechanism, which may be how this is already starting to be implemented? My instinct is that this is potentially quite a fragile and high-effort way to work, but it's up to you!

Either way, once the system is in place, then each time we make an update, the process is:

  • Extract strings
  • Package strings that have changed in a .pot file
  • Send to translators
  • Wait
  • Receive translated strings as a .po file
  • Re-integrate strings into the software

Apart from a few manual steps to authorise the translation and review the software output to make sure that nothing went wrong, this is an entirely manual process.

I don't think that IATI Publisher necessarily has to have a .pot/po file - based process, but if you were to build one then it would be very close to what we're going to need once we work out the file details with the translation company.

Does that make sense? I'm very happy to provide any more detail if it's useful.

@emmajclegg
Collaborator Author

Thanks @robredpath - I assume you meant "Apart from a few manual steps to authorise the translation..., this is an entirely automated process" ? Otherwise, no questions from me

@praweshsth praweshsth modified the milestones: Multilingual interface, August 2024 Aug 5, 2024
@praweshsth praweshsth assigned Sanilblank and unassigned praweshsth Aug 5, 2024
@Sanilblank
Collaborator

Hello @robredpath
Based on the template you have provided, it seems that you are referring to the localization mechanism of the Django framework. When we initially started the translation process in IATI Publisher, we used a similar approach. Laravel, the framework used for the backend, provides a similar localization mechanism, and if Laravel's templating engine had been used for the frontend as well, the template would look very similar to the one you shared. However, since we are using Vue.js, displaying the text on the frontend is a little more complex, even though Vue uses a similar templating structure.
Similar to your use of .pot/.po files, we (in Laravel) use files that contain arrays of data to store the translations, split across different files as required. The process and mechanism are quite similar, so I don't think we need to incorporate .pot/.po files directly into the system, as that would require a lot more research into how it can be achieved. The translations will be stored in files within the system. Providing those files directly to you for translation could cause confusion about how to process them, as they are quite technical, so we were thinking of writing a script that takes all the strings and places them in an Excel file. We would provide that file to you, you would add the translated strings and send it back, and a second script would take those translations and put them in the format required by the system.
If you require .pot/.po files for the strings to be translated, we will need a bit of time to research how the files can be created. The rest of the process will be similar, i.e. a script will generate the file which will be sent to you, you will add the translations and send the file back to us, and a script will take the translations and put them into the system.
We are a bit confused on the use of the word 'automated' in your title and description.
By automated, I am assuming you mean that a file will contain all the English strings present in the system and, if any changes occur in the file, a script will detect the change and then generate the required Excel or .pot/.po file which will be sent to you for translation. Is my understanding correct? If it is not, could you provide a bit more explanation for this part?
I hope I have made things clear here. If anything seems confusing, I would be glad to go a bit deeper into the explanation.
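
As a rough illustration of the export script proposed in this comment, the Python sketch below flattens nested translation data into one row per string with English, French and Spanish columns. It is not the IATI Publisher implementation: it assumes the language files can be loaded as nested dictionaries (as with array-based language files), it writes CSV instead of Excel to avoid extra dependencies, and the function names and sample strings are hypothetical.

```python
import csv
import io


def flatten(tree, prefix=""):
    """Flatten a nested dict of strings into {"a.b.c": "text"} form."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def export_rows(en, fr, es):
    """Return a header plus one row per English string: key, en, fr, es."""
    en_flat, fr_flat, es_flat = flatten(en), flatten(fr), flatten(es)
    rows = [["key", "en", "fr", "es"]]
    for key in sorted(en_flat):
        rows.append([key, en_flat[key], fr_flat.get(key, ""), es_flat.get(key, "")])
    return rows


def to_csv(rows):
    """Serialise rows to CSV text; a spreadsheet library could write .xlsx instead."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data, not taken from IATI Publisher.
english = {"footer": {"contact": "Contact us", "privacy": "Privacy policy"}}
french = {"footer": {"contact": "Contactez-nous"}}
spanish = {}

print(to_csv(export_rows(english, french, spanish)))
```

Using the English file as the list of keys means a missing French or Spanish entry shows up as an empty cell, which is what a translator would need to fill in.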

cc. @praweshsth @PG-Momik

@emmajclegg
Collaborator Author

Thanks for the information here @Sanilblank . @robredpath is away until Aug 29th unfortunately, but I will see if anyone else in our team can help with the file format question in the meantime.

We are a bit confused on the use of the word 'automated' in your title and description.
By automated, I am assuming you mean that a file will contain all the English strings present in the system and, if any changes occur in the file, a script will detect the change and then generate the required Excel or .pot/.po file which will be sent to you for translation. Is my understanding correct?

Yes, that's correct to my understanding. We remain in control of how often and when we run the re-translation, but your system should be capable of detecting what English text has and hasn't changed since the last translation.

By the extraction and re-integration of text into IATI Publisher being done in an automated way, we mean via a script as opposed to any manual copy and pasting.
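
One way to detect what English text has and hasn't changed since the last translation is to keep a snapshot of the English strings that were sent to the translators and compare it with the current strings. The Python sketch below shows that comparison only. The snapshot format, function name and sample keys are assumptions for illustration.

```python
def diff_strings(previous, current):
    """Compare {key: English text} snapshots from the last translation and now."""
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    changed = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"added": added, "changed": changed, "removed": removed}


# Hypothetical sample data.
previous = {"login.title": "Sign in", "login.help": "Forgot password?"}
current = {
    "login.title": "Sign in to IATI Publisher",
    "login.help": "Forgot password?",
    "login.sso": "Use single sign-on",
}

result = diff_strings(previous, current)
print(result)
```

Keys that appear in none of the three lists are unchanged and keep their existing translation.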

@Sanilblank
Collaborator

@emmajclegg Thanks for the clarification.
I have another question regarding the extraction part. The system will generate the Excel (or other format) file containing the English strings to be extracted, which will be sent to you. We will need a way to tell you which strings have been added or changed since the last translation. We could do this in a couple of ways. The first is to include in the file only the strings that have been added or updated (and so require translation). The second is to include all the strings along with their previous translations, which would also let you update translations that were already done. The second approach would give you more flexibility, but it may be more difficult for you to see which texts actually require translating.
If you have any other ideas on this, we are open to hearing them. Please have a look and confirm which process we should move forward with.
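
The two options above are not mutually exclusive. A possible middle ground, sketched below in Python, is to export every string with a status column so the rows needing attention are easy to filter, and to derive the "changed only" file from the same rows. The status names, column names and sample strings are hypothetical, and the sketch assumes the English text each translation was made from has been recorded.

```python
def build_rows(current_en, translated_from, translations):
    """
    current_en:      {key: English text now}
    translated_from: {key: English text the existing translation was made from}
    translations:    {key: existing translated text}
    """
    rows = []
    for key, text in current_en.items():
        if key not in translations:
            status = "new"
        elif translated_from.get(key) != text:
            status = "changed"
        else:
            status = "ok"
        rows.append(
            {"key": key, "en": text,
             "translation": translations.get(key, ""), "status": status}
        )
    # Rows needing attention come first, then alphabetical by key.
    order = {"new": 0, "changed": 1, "ok": 2}
    rows.sort(key=lambda r: (order[r["status"]], r["key"]))
    return rows


def changed_only(rows):
    """First approach: keep only the rows that need translating."""
    return [r for r in rows if r["status"] != "ok"]


# Hypothetical sample data.
current_en = {"nav.home": "Home", "nav.help": "Help centre", "nav.logout": "Log out"}
translated_from = {"nav.home": "Home", "nav.help": "Help"}
translations = {"nav.home": "Accueil", "nav.help": "Aide"}

rows = build_rows(current_en, translated_from, translations)
for row in rows:
    print(row["status"], row["key"])
```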

cc. @praweshsth @PG-Momik

@emmajclegg
Collaborator Author

Hi @Sanilblank - I don't want to give wrong information on this so will check with @robredpath once he's back (Aug 29th) and update here.

@robredpath
Collaborator

Hi @Sanilblank! Thanks for this - it's really useful to understand what you're thinking.

The exact format of the files doesn't really matter too much for us - I suggested .pot/.po files as they're fairly standard in our other applications and are straightforward to work with, but an .xlsx file would also be fine. The main thing is that the process is automated and repeatable.

By "automated" what I mean is that we expect the list of strings to be generated directly from the source code by software, without any manual steps - and for the translated strings to similarly be re-integrated automatically. This means that the process is easily repeatable, so that a small update can be made easily and large updates aren't too much of a problem.

We don't expect the automation to require zero human contact, but we want to make sure that everything gets translated as part of the regular updating process for the software: every time a form or button changes, or we add some new explanatory text, it should be translated promptly.

By way of example, for our documentation platform we run one command to generate the .pot files that we send to the translators, and then we check in the translated files to git and re-run the build process to generate the multi-lingual website. This gives us a very high level of repeatability and consistency, and it's easy for us to do which encourages us to do it often - even for very small changes.

In our documentation work we send the whole documentation site each time, and the translation platform figures out what's changed, and gets that translated. We then re-import the whole translated file back in. Our experience is that it's easier that way, rather than trying to manage lists of things that have changed. Ultimately, it is up to you, but that's our experience and recommendation.

Hope that helps - do let me know if you have any further questions

@emmajclegg emmajclegg modified the milestones: August 2024, September 2024 Sep 2, 2024
@Sanilblank
Collaborator

Hi @robredpath
I think I understand what you are saying, and I feel that we are on the same page regarding the automation process. As mentioned previously, we will write a script that checks the translations maintained in the system and generates an Excel file containing all texts, whether already translated or requiring translation, which will be sent to you. You will perform the translations as required and send the file back to us, and we will use a script to take the translations and insert them into the system.
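
The re-integration step described here could look roughly like the Python sketch below, which turns returned spreadsheet rows back into one nested structure per language. It is an illustration under assumptions: rows are dictionaries with "key", "fr" and "es" columns (as a CSV reader would produce), keys use dot notation, and blank cells are skipped on the assumption that the application falls back to English for missing keys.

```python
def unflatten(flat):
    """Turn {"a.b": "text"} back into {"a": {"b": "text"}}."""
    tree = {}
    for path, text in flat.items():
        node = tree
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = text
    return tree


def import_rows(rows, locales=("fr", "es")):
    """Build one nested translation tree per locale from spreadsheet rows."""
    flat = {locale: {} for locale in locales}
    for row in rows:
        for locale in locales:
            text = (row.get(locale) or "").strip()
            if text:  # skip blank cells (assumed English fallback)
                flat[locale][row["key"]] = text
    return {locale: unflatten(strings) for locale, strings in flat.items()}


# Hypothetical sample rows.
returned = [
    {"key": "footer.contact", "en": "Contact us", "fr": "Contactez-nous", "es": "Contacto"},
    {"key": "footer.privacy", "en": "Privacy policy", "fr": "", "es": "Privacidad"},
]

print(import_rows(returned))
```

The resulting structures would still need writing out in whatever file format the system reads, which is outside this sketch.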

cc. @praweshsth @PG-Momik

@Sanilblank
Collaborator

Hi @robredpath
The above parts discussed between us are very clear now. For sending the data held in the backend to the frontend, we researched online and found two methods.

  1. The entire data is loaded in the app.blade.php file and saved as global data. The FE then uses the data to show the texts throughout the system. This increases the load on each page, so we could put the data in cache and use it everywhere. Still, the amount of data stored in cache would be very high, so this process may not be feasible. Also, when the user changes the language, the cached data is deleted.
  2. The BE will have APIs for sending the translated texts to the FE. When a page loads, the APIs for the required translated texts will be called. When an API for a certain set of translations is called, the result will be stored in the backend cache in Redis so that no processing is required next time. When a user changes the language, the cached data will be deleted. This process does not send all the translated text to the FE immediately, which will help to reduce the load. We are leaning towards using this method.
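
The caching behaviour of the second method can be sketched as follows. This Python example uses an in-memory dictionary as a stand-in for Redis, and the class, loader function and group names are hypothetical. It only shows the pattern: load a group of translations on first request, serve it from cache afterwards, and clear the cache when required.

```python
class TranslationCache:
    """In-memory stand-in for the Redis cache described in method 2."""

    def __init__(self, loader):
        self._loader = loader  # callable(locale, group) -> dict of strings
        self._store = {}

    def get(self, locale, group):
        """Return one group of translations, loading it only on first use."""
        key = (locale, group)
        if key not in self._store:
            self._store[key] = self._loader(locale, group)
        return self._store[key]

    def clear(self):
        """Drop all cached groups, e.g. when the language is changed."""
        self._store.clear()


calls = []


def load_group(locale, group):
    """Hypothetical loader; a real one would read the language files."""
    calls.append((locale, group))
    return {"save": {"en": "Save", "fr": "Enregistrer"}[locale]}


cache = TranslationCache(load_group)
cache.get("fr", "buttons")
cache.get("fr", "buttons")  # served from the cache, loader not called again
print(len(calls))
```

Because this sketch keys each entry by both locale and group, entries for different languages do not collide, so clearing on language change may be optional in a design like this one.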

This message is just to update you on our findings and on how we are proceeding with this feature.

cc. @praweshsth @PG-Momik

@emmajclegg emmajclegg added this to the December 2024 milestone Dec 3, 2024
@emmajclegg
Collaborator Author

To summarise from this morning's call,

We don't mind what order text is extracted from different modules of the system - @PG-Momik suggested choosing a "simple" module to start. To reconfirm, only user-facing interface text and messages will need translating, nothing that only super-admins can see.

The public-facing pages and registration workflow text was extracted by @Sanilblank last month. I'm summarising our feedback below from the email conversation:

  • can common strings that appear multiple times in the IATI Publisher interface be reused in the code, so that we are not translating them multiple times? This will help ensure similar text is translated consistently across the tool and reduce the time for translation. Note, if common strings appear as part of a wider sentence, then we should not split the sentence up.
  • let's avoid complex HTML in the extracted text as far as possible. Simple HTML, like basic links or styling tags, is fine but anything more complicated could affect translation. Removing line breaks was suggested as a minimum way to simplify the more complicated HTML examples.
  • some sentences were split up if they contained an email address, for example. We have a glossary of terms that the translators should not translate, so it is preferable to leave email addresses / names / other variables in the extracted text to make translation of the entire sentence possible.
  • whitespace, punctuation or other symbols at the beginning of extracted text can easily get lost in translation so we suggest avoiding these.
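
Some of the feedback points above could be checked automatically before a sheet is sent for review. The Python sketch below flags strings that start with whitespace or a symbol, contain line breaks, or use HTML tags outside a small allow-list. The allow-list and the warning wording are assumptions, not an agreed rule set.

```python
import re

# Assumed allow-list of "simple" tags (basic links and styling).
SIMPLE_TAGS = {"a", "b", "strong", "i", "em", "br", "span"}
TAG_NAME = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)")


def lint_string(text):
    """Return a list of warnings for one extracted string."""
    warnings = []
    # "<" and ":" are allowed first so HTML tags and :var placeholders pass.
    if text and not text[0].isalnum() and text[0] not in "<:":
        warnings.append("starts with whitespace or a symbol")
    if "\n" in text:
        warnings.append("contains a line break")
    complex_tags = sorted({t.lower() for t in TAG_NAME.findall(text)} - SIMPLE_TAGS)
    if complex_tags:
        warnings.append("complex HTML: " + ", ".join(complex_tags))
    return warnings


# Hypothetical sample strings.
for sample in ["Contact us", " Contact us", "<div>\n<p>Hi</p></div>"]:
    print(repr(sample), lint_string(sample))
```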

We expect to go through several test-runs of the text extraction, review, translation & reintegration process to resolve small problems that come up. This will be necessary before we release the French & Spanish interface to end users.

@BibhaT @PG-Momik @Sanilblank - I'm aware this is a big task and I'm worried that we don't have a good handle on timelines (considering it was something we were aiming to complete by the end of the year). Aside from bugs and user support issues, this translation work takes priority over any new work in the "proposed user story / task list".

Any questions, please let me know.

cc' @robredpath

@Sanilblank
Collaborator

Hello @emmajclegg ,
As Momik must have mentioned yesterday, I was on leave for some days because of health issues and did not have a chance to properly work on the translation tasks.
From the recent comment it is clear what changes need to be implemented based on the sample of the public pages translation which was sent previously.
The translation work is now mostly a manual task: copying the texts from different sections, pasting them into the language files and replacing the original text with placeholders. Since most of the remaining work is manual, it is difficult to estimate exactly how long it will take to complete the entire system translation.
Since you have mentioned that this task takes priority, we will give it more time and work towards finishing the tasks quickly and effectively. Also, if estimates are required, we could finish one module, such as settings or organizations, and then base an estimate on the time taken to complete that module.
Thank you,
Sanil Manandhar

cc. @PG-Momik @BibhaT

@emmajclegg
Collaborator Author

Thanks @Sanilblank - I appreciate you've been off recently, that's no problem.

I don't need an exact estimation for this at this point - the main thing is I'd like us to be making visible progress on it, rather than letting it potentially drag on for months.

Just let me know when you've picked the module to work on first and when roughly I should be expecting to receive a text file to review (as it helps me plan). I expect we'll want to test run the entire extraction, translation, reintegration process on a single module first which, as you say, will help us all understand time and effort required for the remaining ones.

@emmajclegg
Collaborator Author

emmajclegg commented Jan 7, 2025

To summarise where I think we got to on the questions from today's call:

  • IATI Standard element names - if these appear as part of a sentence in IATI Publisher, we will extract and translate the whole sentence (I think this makes sense for users' understanding). If the element names are standalone in the interface, we will still extract the strings but are unlikely to translate these into French and Spanish in the short term. We may translate them eventually, so want to keep this as an option.

  • IATI codelists - ODS are likely to translate IATI-maintained codelists into French & Spanish, but not externally maintained ones (e.g. OECD DAC lists). If and when we do get the codelists translated, we want to make sure IATI Publisher can display these. We worked on syncing codelists in issue #1407 (Check use of updated IATI codelists), so I believe IATI Publisher does detect automatically when IATI Standard codelists change?

  • Import templates & PDF guidance files - we should prioritise translation of the IATI Publisher interface. Import templates and guidance are being looked at in separate issues, so are likely to change and translation can come later.

Any other questions, or anything to add, just let us know @PG-Momik @BibhaT

cc' @robredpath

@emmajclegg emmajclegg modified the milestones: December 2024, January 2025 Jan 7, 2025
@PG-Momik
Collaborator

PG-Momik commented Jan 8, 2025

@emmajclegg Adding to this: on the call I mentioned that I wasn't sure whether codelists other than OrganizationRegistrationAgency were being synced. I've confirmed that the other codelists are being synced as well. 👍

@PG-Momik
Collaborator

PG-Momik commented Jan 8, 2025

@emmajclegg A sheet has been shared with the current extracted contents for the completed modules. Please have a look.

cc: @BibhaT

@emmajclegg
Collaborator Author

emmajclegg commented Jan 8, 2025

Ok thanks a lot @PG-Momik - I'll have a look over the next few days, prioritising the sheets labelled green in the extracted text spreadsheet.

One question - I see that a few standalone IATI-specific strings like "IATI organisation identifier", "publisher ID" and "default language" are appearing in the sheets multiple times, though the key field is similar in each case. Can you clarify if there's already been an attempt at deduplication here? I'm wondering how we avoid translating these important strings multiple times.
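
A simple way to list candidate duplicates for review is to group keys by their English text after normalising case and spacing. The Python sketch below does this. The keys and strings are hypothetical examples, and the output is only a list of candidates: a person would still need to confirm that each group really means the same thing in every place it is used.

```python
from collections import defaultdict


def find_duplicates(strings):
    """Group keys whose English text matches after normalising case and spacing."""
    groups = defaultdict(list)
    for key, text in strings.items():
        normalised = " ".join(text.split()).casefold()
        groups[normalised].append(key)
    return {text: sorted(keys) for text, keys in groups.items() if len(keys) > 1}


# Hypothetical sample data.
extracted = {
    "settings.publisher_id": "Publisher ID",
    "register.publisher_id": "publisher  ID",
    "settings.default_language": "Default language",
}

duplicates = find_duplicates(extracted)
print(duplicates)
```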

cc'ing @robredpath for info. Rob - let me know what you think we need to run a first test of the translation and reintegration loop. "adminHeader" is the simplest sheet in that extracted text workbook, if useful as an easy example. Otherwise I'll let you know when there's a few sheets ready with de-duplicated and reviewed English text.

@emmajclegg
Collaborator Author

@PG-Momik - to update, I've looked over the remaining green (i.e. nearly finished) sheets in the extracted text spreadsheet and have left a few more comments.

I haven't edited any of the English text yet as it sounds like it makes sense for YI to re-extract an updated version of the spreadsheet before I do that (to save me re-doing the review before @robredpath sends the text for translation).

Happy to discuss any questions tomorrow.

@emmajclegg
Collaborator Author

@PG-Momik - thanks for sharing the latest spreadsheet of extracted text (Extracted Sheet - Jan 17). I assume this was for me to have a look.

Again, I've left a few comments to check where certain text appears in the interface and flag a few areas where it could be simplified.

  • Cells highlighted orange - I'm questioning whether these contain user-facing text, as they look more like log/error messages.
  • Cells highlighted yellow - suspected duplication (I will have left a comment in the sheet)

Happy to discuss more on Wednesday (I'm not around tomorrow, Tues 21st)

@emmajclegg
Collaborator Author

@PG-Momik @BibhaT - I'm summarising points from today's call below - anything that doesn't look right, let me know. (cc' @robredpath )

Restricting translation to user-facing text:

  • It is not always clear to the developers marking up text for extraction whether it is user facing or not. YI would need extra time to check in some cases
  • I advised that we remove text from Extracted Sheet - Jan 17 that we're 100% sure is not user-facing in the IATI Publisher UI (e.g. API messages). I think it's fine for YI to spend a few extra hours checking cases that aren't clear (as we want to minimise the text we need to translate), but it's not worth spending days on it.
  • To avoid unnecessary translation, it makes sense to remove generic Laravel messages on the General sheet from extraction if they are not actively being used in IATI Publisher

Simplifying common messages that contain element names:

  • We discussed different approaches to simplifying cases where the same text appears often, just with a different IATI element name in it - e.g. cells C28-C148 of the "activity_detail" sheet.
  • In these cases, we can either 1) remove the element name from the text to make it generic, or 2) include the element name in the text as a :var variable
  • Approach 1 would be sufficient for the data entry form messages on the activity_detail sheet.
  • Validation messages on the General sheet, however, are more specific to IATI elements and attributes. Approach 2 (introducing variables) may help reduce the amount of text here, but it is unclear whether this is worth the effort. As a minimum, I was planning to make sure the validation messages are user-friendly by removing @ symbols from element/attribute names to make the text more readable to users.
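
Approach 2 can be illustrated with a short Python sketch of `:var`-style placeholders: one generic message is stored once and the element name is substituted at display time. The sketch also checks that a translation has kept the same placeholders as its English source, which is a common failure point when variables are introduced. The function names and sample messages are hypothetical, and the placeholder pattern is a simplification.

```python
import re

# Simplified pattern: a colon followed by a name, e.g. ":element".
PLACEHOLDER = re.compile(r":([A-Za-z_]\w*)")


def render(template, **values):
    """Replace each :name placeholder with its value; unknown names are left as-is."""
    return PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )


def placeholders(text):
    """Return the set of placeholder names used in a string."""
    return set(PLACEHOLDER.findall(text))


def same_placeholders(source, translation):
    """Check that a translation kept exactly the placeholders of its source."""
    return placeholders(source) == placeholders(translation)


# Hypothetical sample messages.
english = "The :element element is required."
french = "L'élément :element est obligatoire."

print(render(english, element="title"))
print(same_placeholders(english, french))
```

With this approach the long run of near-identical messages collapses to one row in the sheet, at the cost of translators needing to keep `:element` intact.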

Testing the translation and text re-integration process:

  • We would like to test the translation and re-integration process soon, to identify any problems and get feedback from the translators
  • The easiest modules to start with would be "public", "adminHeader" and "footer"
  • YI will let ODS know when these modules are ready (i.e. unlikely to change further in the spreadsheet), then I will edit the English text where needed (in a copy of the spreadsheet, and without adding/deleting rows), and share with Rob for translation.

@emmajclegg
Collaborator Author

@PG-Momik - to update, it's looking increasingly likely that we'll remove or significantly reduce the text on IATI Publisher's public facing pages, based on conversation in #1543, so I wouldn't spend a lot of time tidying up the "Public" sheet text in the meantime.

We can confirm next week and discuss if other sheets in the Extracted Text spreadsheet would be more sensible to use for initial testing of the translation process.

@PG-Momik
Collaborator

Hello @emma, as mentioned on the previous call, the Header, Footer and Public sheets won't need a lot of tidying up and are okay for review + translation, since their content is unlikely to change. I don't think the changes in #1543 affect the content of the Public sheet. We can discuss it on the next call.


8 participants