Model different versions of grammatical number in word forms #57
Comments
The varying number of forms per table cell motivated the List-based model of
Considering this basic idea, I suggest to add
There's an additional pitfall here: The word form processor must process the singular forms and the plural forms differently, as the gender index numbers do not necessarily correspond to the plural form: In the Dschungel example, the singular forms are 1=MASC, 2=NEUT, 3=FEM, but the plural forms are 1=MASC/NEUT, 2=FEM. So apart from a grammatical-number-specific parsing process, we need to define how to set the plural genders for case 1. I see the following options:
1. use MASC,
2. introduce a MASC_NEUT type,
3. use null,
4. duplicate the form, one with MASC, one with NEUT.
I currently lean towards 1, but let's discuss. Below is a sample conversion for the 20 word forms for Dschungel (using option 1):
What do you think? |
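The number-specific lookup described in this comment could be sketched roughly as follows, using option 1 (MASC for the merged MASC/NEUT plural form). All type and method names here are hypothetical and not part of JWKTL; only the two index-to-gender tables are taken from the Dschungel example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical types for illustration only; they are not part of the JWKTL API.
enum SketchGender { MASC, NEUT, FEM }
enum SketchNumber { SINGULAR, PLURAL }

final class GenderIndexResolver {
    private final List<SketchGender> singularGenders;
    private final List<SketchGender> pluralGenders;

    GenderIndexResolver(List<SketchGender> singularGenders, List<SketchGender> pluralGenders) {
        this.singularGenders = singularGenders;
        this.pluralGenders = pluralGenders;
    }

    // Looks up the gender for a 1-based raw form index, using a separate
    // table per grammatical number. Empty if the index is out of range.
    Optional<SketchGender> resolve(SketchNumber number, int rawIndex) {
        List<SketchGender> table =
                number == SketchNumber.SINGULAR ? singularGenders : pluralGenders;
        if (rawIndex < 1 || rawIndex > table.size()) {
            return Optional.empty();
        }
        return Optional.of(table.get(rawIndex - 1));
    }

    // Tables from the Dschungel example; plural index 1 (MASC/NEUT) is
    // mapped to MASC, i.e. option 1 from the list of options.
    static GenderIndexResolver forDschungel() {
        return new GenderIndexResolver(
                Arrays.asList(SketchGender.MASC, SketchGender.NEUT, SketchGender.FEM),
                Arrays.asList(SketchGender.MASC, SketchGender.FEM));
    }
}
```

With this sketch, singular index 2 resolves to NEUT while plural index 2 resolves to FEM, which is the mismatch the comment warns about.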
I'm not quite sure what you mean here:
What are "Genus lines"?
If I understand you correctly, you suggest adding a gender property to the word form and resolving it during parsing? Then one "column" in the declension table will be defined by number + gender, correct? I am not quite sure this will work in all cases. There may be cases where the same gender + number combination has different declensions. I don't have an example at hand, but I think this is possible. I'll have to experiment to find out. If it works, this is definitely a good way to go. |
I'm not quite sure I understand you correctly. Do you mean that
And you know this how - because you're a native speaker, not from the data, correct? If so, then I would say this is an error in the data. We can't (easily) tell which genus a plural form belongs to. I think I'd first check how many cases like this we have. I think we can do this by checking whether number of genera == number of plurals.
And then decide what we take as the genus for each case. But we're not there yet. |
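The proposed check (number of genera == number of plurals) might look like the sketch below. `EntrySummary` and `GenusPluralCheck` are hypothetical helpers, not JWKTL classes, and the sketch assumes the per-entry counts have already been extracted from the inflection table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-entry summary; the counts are assumed to come from a
// previous pass over the parsed inflection table.
final class EntrySummary {
    final String lemma;
    final int genusCount;   // number of "Genus n" declarations
    final int pluralCount;  // number of plural columns

    EntrySummary(String lemma, int genusCount, int pluralCount) {
        this.lemma = lemma;
        this.genusCount = genusCount;
        this.pluralCount = pluralCount;
    }
}

final class GenusPluralCheck {
    // Returns the lemmas whose genus count differs from their plural count,
    // i.e. the entries where a plural cannot be matched to a genus by index.
    static List<String> mismatches(List<EntrySummary> entries) {
        List<String> result = new ArrayList<>();
        for (EntrySummary entry : entries) {
            if (entry.genusCount != entry.pluralCount) {
                result.add(entry.lemma);
            }
        }
        return result;
    }
}
```

For Dschungel, with three genera and two plurals as described in this thread, the entry would be reported as a mismatch.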
So here's a plan. I'll try the following:
What do you think? |
I checked Wiktionary a bit, and found that the "Genus 1=m"-like lines only apply to the singular forms (basically only to choose the correct article). Since the articles of plural forms are regular, there is no connection to the "genus index"/raw form number for plurals.
Yes. It is "der Dschungel" (MASC, SING), "die Dschungel" (FEM, SING), "das Dschungel" (NEUT, SING), "die Dschungel" (MASC + NEUT, PL), and "die Dschungeln" (FEM, PL) for the nominative case. It's not really incorrect in Wiktionary, as the "genus" is only used for the articles (and the plural article is "die" in all cases). But of course this raises some problems when analyzing the data. See https://de.wiktionary.org/wiki/Vorlage:Deutsch_Substantiv_%C3%9Cbersicht and the source code for the templates to see how inflection tables are rendered. So my conclusions so far are:
|
Ok, thank you for the clarifications. I fully support adding gender to word forms. I'll file an issue and start working on it (at least for singular forms). As for plural forms I don't think I will be able to implement this straight away. |
One more question: is it OK if I only implement this for German? |
Yes.
|
I have a simple idea here: what if we introduce something like a "gender index"? To distinguish |
Please check #58. |
@chmeyer Seems I've encountered a problem. Please see the following word: https://de.wiktionary.org/wiki/Fels It has two singulars, both masculine. With three alternatives in
In this case it will not be possible to correctly group even singular word forms knowing only their respective gender. All singular forms will have
So it seems the solution we were pursuing would not be universally sufficient. It still makes sense to add gender to word forms (and I'm pretty far along with the implementation). I'm not sure how common this case is, though; I have not gathered any statistics yet. But it really seems that just adding gender won't quite solve my problem, which is grouping word forms per "declension" (i.e. detecting columns in the declension table):
What I'm thinking about right now is maybe introducing an additional structure like
In this way we will not expose the internals ( What do you think? I could sketch the API in yet another branch. |
That's indeed a problem. From the wiki code, it will not be possible to reliably assign the plural form, as it applies only to one of the forms. Regarding your
Given the entire discussion, we should maybe get back to your original proposal of just making the index accessible in the forms? This would be both easy and versatile, although not overly comfortable. I am thinking about a
For the Fels example, we would thus generate 14 word forms with the following properties:
What do you think? |
Addition: I don't like "rawFormIndex" (or similar) too much, as this makes it difficult to handle for users. That's why I used "inflectionGroup" which seems a bit more logical to me. |
I'm totally fine with |
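To illustrate why an "inflectionGroup" index helps where gender alone does not, here is a minimal sketch. The classes are hypothetical stand-ins rather than the JWKTL API, and the form texts used in the usage note are placeholders, not the actual Fels forms.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a word form; not the JWKTL IWiktionaryWordForm.
final class GroupedWordForm {
    final String text;
    final String number;        // "SINGULAR" or "PLURAL"
    final String gender;        // e.g. "MASC"
    final int inflectionGroup;  // 1-based column index within the number

    GroupedWordForm(String text, String number, String gender, int inflectionGroup) {
        this.text = text;
        this.number = number;
        this.gender = gender;
        this.inflectionGroup = inflectionGroup;
    }
}

final class ColumnGrouping {
    // Groups by number + gender: two masculine singulars end up in one column.
    static Map<String, List<GroupedWordForm>> byGender(List<GroupedWordForm> forms) {
        Map<String, List<GroupedWordForm>> columns = new LinkedHashMap<>();
        for (GroupedWordForm form : forms) {
            String key = form.number + " " + form.gender;
            columns.computeIfAbsent(key, k -> new ArrayList<>()).add(form);
        }
        return columns;
    }

    // Groups by number + inflectionGroup: each table column stays separate.
    static Map<String, List<GroupedWordForm>> byInflectionGroup(List<GroupedWordForm> forms) {
        Map<String, List<GroupedWordForm>> columns = new LinkedHashMap<>();
        for (GroupedWordForm form : forms) {
            String key = form.number + " " + form.inflectionGroup;
            columns.computeIfAbsent(key, k -> new ArrayList<>()).add(form);
        }
        return columns;
    }
}
```

Given two placeholder forms that are both SINGULAR and MASC but carry inflection groups 1 and 2, `byGender` yields a single column while `byInflectionGroup` yields two, which is the situation described for Fels.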
Context: I am using JWKTL to work with declension tables for German nouns.
There's a feature I need (and can implement), but I'd first like to discuss what would be the best way to model it.
Basically, I want to be able to produce the declension of a given German noun. Input: `Antwort`, output: `die Antwort, Genitiv der Antwort, Dativ der Antwort, Akkusativ die Antwort`, or something along those lines. So essentially this boils down to generating the full declension table or its columns.
For most cases (ca. 90%) this is pretty straightforward. Two grammatical numbers, four grammatical cases: 8 word forms. Sometimes a few forms are missing, sometimes there are two versions for one number/case, but it's pretty trivial.
But in some cases it gets more complicated. Some words may have several genders, and sometimes there are different singular and plural forms. The most extreme example is Eponym with two genders (`m`, `n`), two singular and two plural declensions and up to 3 variations per number/case, giving a total of 28 word forms.
But apart from that extreme example, the case with several grammatical numbers is rare, around 4%.
To process such cases, I need to know which words belong to the same "number". Let us take `Dschungel` for example:
To create this declension table I have to know not just the basic grammatical number (`SINGULAR` or `PLURAL`). I have to know whether it's `Singular 1` or `Singular 2` etc. Then I can group word forms into a column of the declension table.
However, at the moment JWKTL (quite logically) only models the grammatical number as `SINGULAR` or `PLURAL`. I can't tell whether a form came from `Singular 1` or `Singular 2`, which is my problem.
I would like to add this information to `IWiktionaryWordForm`, but I am not sure which would be the preferred way to model this. My suggestion would be to simply add a string `rawGrammaticalNumber` property. There's already something similar in `IWiktionaryEntry.getRawHeadwordLine()`, so the concept should not be completely out of place.
Still, I'd like to hear your opinion on this before I actually implement it.
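As a rough illustration of how a client could consume such a raw string, the sketch below splits a label like `Singular 2` into the basic number and a column index. `RawNumberLabel` is a hypothetical helper, not part of JWKTL, and two things are assumptions: that the raw value has exactly this "Singular n" / "Plural n" shape, and that a label without a number is treated as index 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes raw labels of the form "Singular", "Singular 2",
// "Plural", "Plural 1" and so on.
final class RawNumberLabel {
    private static final Pattern LABEL =
            Pattern.compile("(Singular|Plural)(?:\\s+(\\d+))?");

    final boolean plural;
    final int index;

    private RawNumberLabel(boolean plural, int index) {
        this.plural = plural;
        this.index = index;
    }

    // Parses the raw label; a missing index defaults to 1 (an assumption).
    static RawNumberLabel parse(String raw) {
        Matcher matcher = LABEL.matcher(raw.trim());
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unexpected number label: " + raw);
        }
        boolean plural = matcher.group(1).equals("Plural");
        int index = matcher.group(2) == null ? 1 : Integer.parseInt(matcher.group(2));
        return new RawNumberLabel(plural, index);
    }
}
```

Word forms could then be grouped into table columns by the pair (plural, index) instead of by the basic grammatical number alone.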