grammr
is a language learning tool that aims to make understanding the grammar of a variety
of languages easier.
- 🌍 Translate texts
- 📖 Get translations for every word in the text in-context
- 🔎 Get a grammatical analysis for every word in the text
- 💡 Conjugate and decline words
- 📚 Create flashcards to memorize words and phrases
- 🌈 Many more features coming! Check out the Roadmap for more information.
👉🏼 Try it out.
While there are too many great language learning apps to count, standardized grammatical references for languages can be hard to come by. Dictionaries and inflection tables for individual languages can be found by googling; however, so far there is no standardized tool that provides them across a variety of languages.
The aim of this project is to create a unified API to provide several tools for language learners across multiple languages:
- Translations of texts, including the literal translations of individual words
- Morphological analysis of words, including their part-of-speech tags and features
- Inflection (conjugation and declension) of words
- Anki Flashcard export to make learning easier
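To make the idea concrete, an analysis response from such a unified API might look like the following. This is a sketch only: the field names follow the domain language described later in this README, but the exact response shape is my assumption, not the actual API contract.

```python
import json

# Hypothetical response of an analysis endpoint, combining a semantic
# translation, literal per-token translations, and morphology.
response = json.loads("""
{
  "semantic_translation": {
    "source_phrase": "Ich bin ein Student",
    "target_phrase": "I am a student"
  },
  "tokens": [
    {"text": "Ich", "translation": "I", "pos": "PRON",
     "features": {"Case": "Nom", "Number": "Sing"}},
    {"text": "bin", "translation": "am", "pos": "AUX",
     "features": {"Number": "Sing", "Person": "1"}}
  ]
}
""")

assert response["semantic_translation"]["target_phrase"] == "I am a student"
```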
I have more ideas for this project, which you can find in the Features and Roadmap sections.
This project, therefore, does not take a didactic approach to learning languages, and should not be compared to an app like Duolingo; it rather aims to be a comprehensive, open reference tool that can be arbitrarily extended for different languages.
- Translate sentences across arbitrary languages
- Get literal translations for words
- Could use some improvements: instead of relying solely on LLMs for translation, translations could be fetched from Wiktionary or a related project and used as a reference for the LLMs. That way, all meanings of a word could be covered, and it would guard against wonky LLM responses.
- Inflection tables (supported: 🇷🇺🇮🇹🇫🇷🇪🇸🇵🇹)
- Anki flashcard export
- Flashcard creation
- Flashcard editing
- Flashcard export
- Named-entity recognition and explanation of terms
- Translations of phrases and sayings (which often do not translate literally and may or may not have equivalents in different languages)
⏰ More will be added as the project progresses. Check back later!
Beyond the features mentioned above, I'm also looking to make the technology stack more robust. I believe this application should be able to scale as its complexity grows, so I'm investing a lot of time and energy into keeping the architecture as clean as possible.
Features and improvements will include:
- A proper CI/CD pipeline, including automated deployment (halfway there)
- Adding several more languages, which means creating additional sidecars for morphological analysis.
- Building a frontend. Possibly web-app, possibly cross-platform apps.
- Deleting the Telegram bot entirely from the core service. It was useful for prototyping, but should not be tied to the main application.
- Versioning and release management
- Creating multiple environments for testing and development
- Benchmarking against quality regression of prompts
Listed here are projects that I am either using or would consider integrating into this project.
- spaCy: Morphological analysis for a variety of languages.
- pymorphy3: Fork of the currently unmaintained pymorphy2. Provides inflections for Russian.
Interesting related projects:
- textile: Inspiration for the UI
- mathigatti/spanish_inflections: Inflections for Spanish
- DuyguA/DEMorphy: Inflections for German. Technically, this library only provides morphological analysis, but it does use a comprehensive lexicon under the hood, so creating inflections should be possible.
- TimoBechtel/satzbau: Creation of natural language German texts, including declension and conjugation.
- verbecc: Verb conjugation for a variety of Romance languages.
I tried to make running the project yourself as straightforward as possible. What you'll need:
- A Telegram bot token. You can get one by talking to the BotFather.
- An OpenAI API key. You can get one by signing up here.
Given those, you can run the project with a local configuration or by deploying the packaged Helm chart. My recommendation is using an `.envrc` file so you always have your environment variables handy, but do whatever works best for you.
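A minimal `.envrc` might look like the following. Note that `DATASOURCE_PASSWORD` is the variable read by the deploy script and `OPENAI_API_KEY` is the conventional name expected by OpenAI clients, while the Telegram variable name here is my assumption:

```shell
# Hypothetical .envrc — replace the placeholder values with your own secrets.
export TELEGRAM_BOT_TOKEN="changeme"   # assumed variable name; token from the BotFather
export OPENAI_API_KEY="changeme"       # conventional OpenAI client variable
export DATASOURCE_PASSWORD="changeme"  # used to create the Postgres secret when deploying
```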
By running `./scripts/run.sh`, a Postgres container and the Python sidecar running spaCy will be launched. The Spring application is run via `mvn spring-boot:run`.
If you choose to run `./scripts/deploy.sh`, a Helm chart will be created and deployed to your Kubernetes cluster. This requires a running Kubernetes cluster (you could try using Minikube or k3s) and Helm to be installed. You should set a `DATASOURCE_PASSWORD` environment variable, which will be used to create a secret in the cluster for authentication with Postgres.
PRs are welcome! I'm happy to help you get started with the project, so feel free to reach out.
You can run unit tests with `./scripts/unit-test.sh` and integration tests with `./scripts/integration-test.sh`. Alternatively, run both with `./scripts/qa.sh`. This also ensures the most accurate test coverage report is generated, which you can access at `target/site/jacoco/index.html`.
Different projects use different terms to describe similar concepts. For instance, while pymorphy2 refers to the root form of a word as the `lexeme`, spaCy refers to it as the `lemma`. I've tried to standardize these terms in a unified domain language that is used across the application and in the APIs it exposes.
The design is a work in progress, but I do think it offers some genuine value.
- `Token` refers to a single word of a phrase; it holds the word's literal translation and morphology.
- `Phrase` refers to a collection of `Tokens` that form a sentence.
- A `SemanticTranslation` is a `Phrase` that is a translation of another `Phrase`. The meaning of the original phrase is preserved as well as possible in the translation. It consists of a `source_phrase` and a `target_phrase`.
- A `LiteralTranslation` is a direct, literal translation of a phrase. This can help users better understand how phrases are constructed. It consists of the `source_phrase` and a collection of `TokenTranslations`, which are key-value pairs of tokens from the source phrase and their directly translated counterparts. For example, "Ich bin ein Student" would contain four translated tokens: `(Ich, I)`, `(bin, am)`, `(ein, a)`, `(Student, student)`.
- `Tokens` are aggregated through the process of literally translating and grammatically analyzing phrases. These processes return sets of `TokenTranslations` and `TokenMorphology` respectively, which are coalesced into `Tokens` that ultimately make up a `Phrase`. Therefore, an `Analysis` consists of a set of `Tokens` holding their literal translation and grammatical morphology, as well as the semantic translation of the phrase.
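The coalescing step described above can be sketched in a few lines of Python. The class and field names follow the domain language, but the real implementation lives in the Spring application, so treat this as an illustration rather than the actual code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    text: str
    translation: str  # drawn from the TokenTranslation set
    morphology: str   # drawn from the TokenMorphology set (simplified to a string here)

def coalesce(translations: dict[str, str], morphologies: dict[str, str]) -> list[Token]:
    """Merge per-word translations and morphology into the Tokens of a Phrase."""
    return [
        Token(text=word,
              translation=translations[word],
              morphology=morphologies.get(word, ""))
        for word in translations
    ]

# "Ich bin ein Student" -> four translated tokens, as in the example above
translations = {"Ich": "I", "bin": "am", "ein": "a", "Student": "student"}
morphologies = {"Ich": "PRON", "bin": "AUX", "ein": "DET", "Student": "NOUN"}
tokens = coalesce(translations, morphologies)
assert tokens[0] == Token("Ich", "I", "PRON")
```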
I've decided to use the Universal Dependencies as the reference for my part-of-speech and feature definitions, as well as for the domain language defined above.
- `Part of Speech` (POS) is a grammatical category of words that have similar grammatical properties. Read more: https://universaldependencies.org/u/pos/index.html
- `Features` are morphological and syntactic properties of words. They are used to distinguish between different grammatical categories of words. Read more: https://universaldependencies.org/u/feat/index.html
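Universal Dependencies encodes features as `Name=Value` pairs joined by `|` (e.g. `Case=Nom|Number=Sing`, with `_` standing for an empty feature set). A small helper to parse that notation might look like this; the parser is my own sketch, not part of grammr:

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Parse a Universal Dependencies FEATS string like 'Case=Nom|Number=Sing'.

    UD uses '_' to denote an empty feature set.
    """
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

assert parse_feats("Case=Nom|Number=Sing") == {"Case": "Nom", "Number": "Sing"}
assert parse_feats("_") == {}
```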