@bitextor

Bitextor Team

Translation memories generator

Pinned

  1. bitextor Public

    Bitextor generates translation memories from multilingual websites

    Python · 291 stars · 43 forks

  2. bicleaner Public

    Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.

    Python · 155 stars · 22 forks

  3. bifixer Public

    Tool to fix bitexts and tag near-duplicates for removal

    Python · 30 stars · 3 forks

  4. biroamer Public

    Utility that will help you to ROAM (Random Omit Anonymize and Mix) your parallel corpus.

    Python · 10 stars · 2 forks

  5. pdf-extract Public

    PDF parser and converter to HTML

    Java · 85 stars · 14 forks

  6. warc2text Public

    Extracts plain text, language identification and more metadata from WARC records

    C++ · 21 stars · 5 forks

Repositories

Showing 10 of 29 repositories
  • monocleaner Public
    Python 7 GPL-3.0 1 1 0 Updated Mar 18, 2025
  • bicleaner-hardrules Public

    Pre-filtering step for bicleaner

    Python 4 GPL-3.0 2 1 0 Updated Mar 18, 2025
  • biroamer Public

    Utility that will help you to ROAM (Random Omit Anonymize and Mix) your parallel corpus.

    Python 10 GPL-3.0 2 0 1 Updated Mar 3, 2025
  • warc2text Public

    Extracts plain text, language identification and more metadata from WARC records

    C++ 21 MIT 5 7 3 Updated Mar 3, 2025
  • bifixer Public

    Tool to fix bitexts and tag near-duplicates for removal

    Python 30 GPL-3.0 3 0 0 Updated Feb 5, 2025
  • cld2 Public Forked from CLD2Owners/cld2

    Compact Language Detector 2

    C++ 0 Apache-2.0 133 0 0 Updated Feb 4, 2025
  • monocleaner-data Public

    Monocleaner models repository

    1 GPL-3.0 0 0 0 Updated Jan 8, 2025
  • scrawl Public

    Playwright-based web crawler

    Python 1 GPL-3.0 0 0 0 Updated Nov 14, 2024
  • bitextor Public

    Bitextor generates translation memories from multilingual websites

    Python 291 GPL-3.0 43 3 4 Updated Nov 11, 2024
  • pdf-extract Public

    PDF parser and converter to HTML

    Java 85 GPL-3.0 14 4 1 Updated Oct 3, 2024