Integrate code from April to November 2023 (#38)
* improve test coverage

* update aiohttp

* improve online translator

* upload current work on french and english translators

* upload current work on french and english translators

* fix test

* fine tune nllb helper even further

* add some more jobs, fix some bugs in nllb translation

* transform api.translation.v2.function.definition into separate modules grouped by method

* fix unit tests following refactoring

* add rabbitmq workers

* add rabbitmq async_put to read_write cache layer

* fix nonetype errors

* rm unused script

* upload current work

* put rabbitmq in api

* fix rabbitmq lib

* fix remove_enrichment_artefacts, improve irc bot to use rabbitmq

* refactor translation_v2 to externalise publisher part, add rabbitmq.py lib

* fix: prefix in translator, fix producer

* adjust consumers and producer, init webconsumer

* initialise inside separate thread, handle channel closed errors

* apply some fixes on rmq workers and producers

* fix requirements for python 3.10

* fix config by adding new keys

* add postprocessor to accept from a given source wiki

* fix: source wiki is in first element

* fix: minor fixes and enhancements on rmq functions

* fix: add classical syriac language label

* do not call same function twice if the arguments are the same

* add json to allow build_language_statistics to run

* also filter out the {{l}}-labelled lists

* fix: enwikt entry processor: skip language section if a language code was not found

* save changes on old laptop

* fix: duplicated files by error

* fix: add ability to set queue in webconsumer

* fix: light fixes

* fix: temp_triage_2 -> translated

* fix: add haproxy config

* fix: use haproxy as load balancer

* fix: add postgrest in /bin

* fix: conf -> ini

* fix: rearrange install/test script to add the installation of haproxy and supervisor

* bugfix: bug on definition found by unit test, let's see if it's actually genuine

* fix: change supervisor "root" placeholder to "user"

* add ctranslator in docker containers

* add rabbitmq entry translator

* rearrange supervisor script

* add jenkins

* add nllb backend override

* fix conf to add a user jenkins instance

* add 1 more entry translator worker

* copy postgrest

* supervisor: add container + tweak autostart options

* Revert "add nllb backend override"

This reverts commit 6693736.

* solve conflicts

* rm old parameters

* invert condition

* invert condition: do so as sudo

* reload supervisor without -y

* fix: renderer tests were not run

* fix: more unit tests on section delete

* fix: disable debug at this time, use 'edit' queue for create_page_from_list.py

* fix: filter output for nllb translation bugs

* feat: translate individual words

* fix: do not translate if word already in malagasy

* fix: use whitelist for translation

* fix: wrong variable

* fix: re-enable one-word translations

* fix: strip whitespaces for whitelist checks

* build(deps): bump redis from 3.5.3 to 4.4.4

Bumps [redis](https://github.com/redis/redis-py) from 3.5.3 to 4.4.4.
- [Release notes](https://github.com/redis/redis-py/releases)
- [Changelog](https://github.com/redis/redis-py/blob/master/CHANGES)
- [Commits](redis/redis-py@3.5.3...v4.4.4)

---
updated-dependencies:
- dependency-name: redis
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Publish developments from April-July 2023 (#34)

* improve test coverage

* update aiohttp

* improve online translator

* upload current work on french and english translators

* upload current work on french and english translators

* fix test

* fine tune nllb helper even further

* add some more jobs, fix some bugs in nllb translation

* transform api.translation.v2.function.definition into separate modules grouped by method

* fix unit tests following refactoring

* add rabbitmq workers

* add rabbitmq async_put to read_write cache layer

* fix nonetype errors

* rm unused script

* upload current work

* put rabbitmq in api

* fix rabbitmq lib

* fix remove_enrichment_artefacts, improve irc bot to use rabbitmq

* refactor translation_v2 to externalise publisher part, add rabbitmq.py lib

* fix: prefix in translator, fix producer

* adjust consumers and producer, init webconsumer

* initialise inside separate thread, handle channel closed errors

* apply some fixes on rmq workers and producers

* fix requirements for python 3.10

* fix config by adding new keys

* add postprocessor to accept from a given source wiki

* fix: source wiki is in first element

* fix: minor fixes and enhancements on rmq functions

* fix: add classical syriac language label

* do not call same function twice if the arguments are the same

* add json to allow build_language_statistics to run

* remove the wiki worker user's name (:

---------

Co-authored-by: Rado Andrianjanahary <>
Co-authored-by: Rado <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Rado Andrianjanahary <>
Co-authored-by: Rado <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
3 people committed Dec 10, 2023
1 parent 0a0e385 commit 50f0a23
Showing 27 changed files with 100,933 additions and 344 deletions.
2 changes: 2 additions & 0 deletions api/config.py
@@ -42,5 +42,7 @@ def get(self, key, section='global'):
                 self.specific_config_parser.get(section, key)
             except configparser.NoSectionError:
                 return self.default_config_parser.get(section, key)
+            except KeyError:
+                raise KeyError(f'No key {key} in section {section}')
         else:
             return self.default_config_parser.get(section, key)
12 changes: 12 additions & 0 deletions api/decorator.py
@@ -108,6 +108,18 @@ def wrap(*args, **kwargs):
     return wrap


+def reraise_exceptions(exceptions, new_exception_type):
+    def wrapper_catch_exceptions(f):
+        def wrapper(*args, **kwargs):
+            try:
+                return f(*args, **kwargs)
+            except exceptions as exc:
+                raise new_exception_type from exc
+
+        return wrapper
+    return wrapper_catch_exceptions
+
+
 def catch_exceptions(*exceptions):
     def wrapper_catch_exceptions(f):
         def wrapper(*args, **kwargs):
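A minimal usage sketch for the `reraise_exceptions` decorator added in this hunk. The decorator body is copied from the diff; `ConfigError`, `read_port` and `describe` are hypothetical names introduced only for illustration and do not appear in the commit.

```python
def reraise_exceptions(exceptions, new_exception_type):
    # Copied from the hunk: wrap f so that any of `exceptions`
    # is re-raised as `new_exception_type`, chained with `from`.
    def wrapper_catch_exceptions(f):
        def wrapper(*args, **kwargs):
            try:
                return f(*args, **kwargs)
            except exceptions as exc:
                raise new_exception_type from exc

        return wrapper
    return wrapper_catch_exceptions


class ConfigError(Exception):
    """Hypothetical domain-level error, not part of the commit."""


@reraise_exceptions((KeyError, ValueError), ConfigError)
def read_port(settings):
    # Hypothetical helper: KeyError if 'port' is missing,
    # ValueError if the value is not numeric.
    return int(settings['port'])


def describe(settings):
    # Return the port, or the name of the original exception type.
    try:
        return read_port(settings)
    except ConfigError as error:
        # The original exception stays reachable through __cause__.
        return type(error.__cause__).__name__
```

Because the new exception is raised with `from exc`, callers can catch one exception type while the traceback still shows the underlying cause. Note that the wrapper does not use `functools.wraps`, so the decorated function's `__name__` becomes `wrapper`.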
4 changes: 4 additions & 0 deletions api/entryprocessor/wiki/base.py
@@ -2,6 +2,7 @@
 import os
 import re
 from typing import List
+from api.config import BotjagwarConfig

 data_file = os.getcwd() + '/conf/entryprocessor/'

@@ -21,6 +22,9 @@ def __init__(self, test=False, verbose=False):
         self.content = None
         self.Page = None
         self.verbose = verbose
+
+        self.configuration = BotjagwarConfig('wiktionary_processor')
+        self.debug = self.configuration.get('debug').lower() == 'true'
         self.text_set = False

     def process(self, page=None):
13 changes: 11 additions & 2 deletions api/entryprocessor/wiki/en.py
@@ -189,8 +189,14 @@ def get_all_entries(
                     last_language_code = self.lang2code(language_name)
                     last_part_of_speech = None # Reset part of speech for the language section
                 except KeyError:
-                    # print(f"Could not determine code: {language_name}")
-                    pass
+                    if self.debug:
+                        print(f"Could not determine code: {language_name}")
+
+                    last_language_code = None
+
+            # Skip the language if a code couldn't be found to avoid assignment of the entry to the wrong language code.
+            if not last_language_code:
+                continue

             # Add to the lines per language
             if last_language_code in lines_by_language:
@@ -228,6 +234,9 @@ def get_all_entries(
                     else:
                         definitions[last_language_code][last_part_of_speech] = [definition]

+            if self.debug:
+                print(f"{line_number} [{last_part_of_speech}|{last_language_code}] : {line}")
+
         # entries may be definition-less or definition formatting is inconsistent
         for language_code in definitions:
             if get_additional_data and language_code in lines_by_language:
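The behaviour this hunk introduces (reset the language code when the lookup fails, then skip lines until a known language section starts) can be sketched in isolation. This is a simplified illustration, not the project's `get_all_entries`: `LANGUAGE_CODES` stands in for `lang2code`, and `group_lines_by_language` and the header regex are assumptions made for the example.

```python
import re

# Hypothetical name-to-code table standing in for lang2code().
LANGUAGE_CODES = {'English': 'en', 'French': 'fr'}


def group_lines_by_language(wikitext):
    lines_by_language = {}
    last_language_code = None
    for line in wikitext.split('\n'):
        # Assumed header shape: a level-2 section such as ==English==.
        header = re.fullmatch(r'==([^=]+)==', line.strip())
        if header:
            # An unknown name resets the code to None instead of
            # silently keeping the previous section's code.
            last_language_code = LANGUAGE_CODES.get(header.group(1).strip())
            continue
        if not last_language_code:
            # Skip lines that belong to a section with no known code.
            continue
        lines_by_language.setdefault(last_language_code, []).append(line)
    return lines_by_language
```

Without the reset, lines under an unrecognised language header would be filed under whichever language came before it, which is the misassignment the commit message describes.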
9 changes: 5 additions & 4 deletions api/importer/wiktionary/en.py
@@ -64,15 +64,15 @@ def retrieve_subsection(wikipage_, regex):
         wikipage_ = wikipage_[pos1:]

         # More often than we'd like to admit,
-        # the section level for the given sub-section is one level deeper than expected.
-        # As a consequence, a '=<newline>' can appear before the sub-section content.
+        # the section level for the given subsection is one level deeper than expected.
+        # As a consequence, a '=<newline>' can appear before the subsection content.
         # That often happens for references, derived terms, synonyms, etymologies and part of speech.
         # We could throw an Exception,
         # but there are 6.5M pages and God knows how many more cases to handle;
         # so we don't: let's focus on the job while still keeping it simple.
         # Hence, the hack below can help the script fall back on its feet while still doing its job
         # of fetching the subsection's content.
-        # I didn't look for sub-sections that are actually 2 levels or more deeper than expected.
+        # I didn't look for subsections that are actually 2+ levels deeper than expected.
         # Should there be any of that, copy and adapt the condition.
         # I didn't do it here because -- I.M.H.O -- Y.A.G.N.I right now.
         # My most sincere apologies to perfectionists.
@@ -151,6 +151,7 @@ class ReferencesImporter(SubsectionImporter):
         '[[category:', # Category section caught
         '==', # Section caught
         '{{c|', # Categorisation templates
+        '{{l|', # List element templates
         '{{comcatlite|' # Commons category
     ]

@@ -323,7 +324,7 @@ def get_data(self, wikipage: str, language: str, page_name: str = '') -> List[Tr
         :return:
         """

-        # Main regex to retrieve a given translation. Most of entries use this format
+        # Main regex to retrieve a given translation. Most entries use this format
         regex = r'\{\{t[\+]?\|([A-Za-z]{2,3})\|(.*?)\}\}'

         translations = {}
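To illustrate what the translation regex in the last hunk captures, here is a small sketch. The pattern is copied verbatim from the diff; the `sample` wikitext is made up for the example and is not taken from a real entry.

```python
import re

# Regex copied verbatim from the hunk above.
regex = r'\{\{t[\+]?\|([A-Za-z]{2,3})\|(.*?)\}\}'

# Hypothetical wikitext sample, for illustration only.
sample = (
    "* French: {{t+|fr|chat|m}}\n"
    "* Malagasy: {{t|mg|saka}}\n"
    "* Old English: {{t|ang|catt|m}}"
)

# Each match is a (language code, remaining template arguments) pair.
matches = re.findall(regex, sample)
for code, arguments in matches:
    print(code, arguments)
```

The first group is the two- or three-letter language code. The second group is everything up to the closing braces, so extra template parameters such as a gender marker stay attached to the term and would need to be split off separately.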