[2025-03-31] Non working words #10

Open
neurlang opened this issue Mar 18, 2025 · 6 comments
Labels: bug (Something isn't working), dataset

Comments

@neurlang
Owner

| Language | Word |
| --- | --- |
| Tamil | பொɯaːk |
| Bengali | ৫টি |
| Bengali | ইঞ্জিনিয়ারিং |

@martinarisk

martinarisk commented Mar 19, 2025

| Language | Word |
| --- | --- |
| Italian | |
| Vietnamese | quỹ |
| Vietnamese | nhất |
| Vietnamese | Ấn-Hy |
| Vietnamese | khắc |
| Vietnamese | nhất |
| Vietnamese | Bất |
| Vietnamese | Mỹ |
| Vietnamese | suất |
| Vietnamese | cấp |
| Vietnamese | (any foreign word containing the letter w) |

@neurlang
Owner Author

The rules for these words don't exist in language.json, so this is either an analysis2 problem or a bad --rowlossimportance hyperparameter (the rules are not learned because the default value, --rowlossimportance = 5, is too low).
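
If the cause is that rare characters are under-represented, one quick check is to count how often each character occurs in a word list and flag the infrequent ones. The following is only a minimal sketch: the function name `rare_characters`, the sample word list, and the threshold are hypothetical, and the code does not reflect how study_language.sh, analysis2, or --rowlossimportance actually work.

```python
from collections import Counter


def rare_characters(words, min_count):
    """Return {character: count} for characters seen fewer than min_count times.

    `words` is any iterable of strings; it stands in for a training word list
    (hypothetical input, not the project's real data format).
    """
    counts = Counter(ch for word in words for ch in word)
    return {ch: n for ch, n in counts.items() if n < min_count}


if __name__ == "__main__":
    # Tiny illustrative list: "\u00e0" (a with grave accent) appears only once.
    sample = ["fa", "fa", "f\u00e0"]
    print(rare_characters(sample, min_count=2))
```

Characters reported by such a check would be candidates for the "rare characters" that a low hyperparameter value fails to cover.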

@neurlang neurlang added analysis2 study_language.sh script language.json and removed dataset labels Mar 20, 2025
@neurlang neurlang added this to the 0.5.1 milestone Mar 20, 2025
neurlang added a commit that referenced this issue Mar 21, 2025
Needed to increase the hyperparameter to cover more rare characters

Fixes:

- Italian: fà
- Tamil: பொ

See issue #10
@neurlang neurlang added dataset and removed analysis2 study_language.sh script language.json labels Mar 21, 2025
@neurlang neurlang modified the milestones: 0.5.1, 0.6.0 Mar 21, 2025
@neurlang
Owner Author

Italian and Tamil are fixed.
Bengali and Vietnamese need a data fix, so they are moving to 0.6.0 (full retrain).

@neurlang
Owner Author

| Language | Word |
| --- | --- |
| Hebrew | אֵלֶיךָ |
| Hebrew | הַשֶּׁמֶשׁ |

@neurlang neurlang self-assigned this Mar 31, 2025
@neurlang neurlang changed the title from "Non working words" to "[2025-03-31] Non working words" Mar 31, 2025
@neurlang
Owner Author

neurlang commented Mar 31, 2025

Arabic: enable sorting of diacritics, and figure out which diacritics sound the same.
MalayArab: same.
عندما
بِسْمِ اللَّهِ الرَّحْمَنِ
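
One possible reading of "sorting of diacritics" is to put each run of combining marks into a fixed order, so that the same word typed with its marks in a different order maps to a single spelling. The sketch below assumes that reading and orders marks by their Unicode canonical combining class; the function name `sort_diacritics` is hypothetical and this is not taken from the project's code.

```python
import unicodedata


def sort_diacritics(text: str) -> str:
    """Reorder every run of combining marks by canonical combining class.

    Base characters (combining class 0) stay in place; the marks that follow
    a base character are sorted stably, so marks of equal class keep their
    original relative order.
    """
    out = []
    marks = []
    for ch in text:
        if unicodedata.combining(ch) != 0:
            marks.append(ch)  # collect marks attached to the current base
        else:
            out.extend(sorted(marks, key=unicodedata.combining))
            marks = []
            out.append(ch)
    out.extend(sorted(marks, key=unicodedata.combining))  # flush trailing marks
    return "".join(out)


if __name__ == "__main__":
    # LAM + SHADDA + FATHA versus LAM + FATHA + SHADDA
    a = "\u0644\u0651\u064e"
    b = "\u0644\u064e\u0651"
    print(sort_diacritics(a) == sort_diacritics(b))
```

Python's `unicodedata.normalize("NFD", text)` applies the same canonical reordering, but it also decomposes precomposed letters, which may or may not be wanted here. Deciding which diacritics sound the same is a separate, language-specific mapping that this sketch does not attempt.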

@neurlang
Owner Author

neurlang commented Mar 31, 2025

Javanese: nothing is returned because the model is trained on Javanese script (Aksara Jawa), the traditional Javanese writing system.
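
A caller could detect this case before phonemizing by checking whether the input contains any character from the Unicode Javanese block (U+A980 to U+A9DF). This is a minimal sketch under that assumption; `is_javanese_script` is a hypothetical helper, not part of the project.

```python
JAVANESE_BLOCK_START = 0xA980  # first code point of the Unicode Javanese block
JAVANESE_BLOCK_END = 0xA9DF    # last code point of the Unicode Javanese block


def is_javanese_script(text: str) -> bool:
    """Return True if at least one character lies in the Javanese block."""
    return any(JAVANESE_BLOCK_START <= ord(ch) <= JAVANESE_BLOCK_END for ch in text)


if __name__ == "__main__":
    print(is_javanese_script("\ua98f"))  # JAVANESE LETTER KA
    print(is_javanese_script("kula"))    # Latin-script input
```

Input for which the check returns False would need a model trained on that script, or a transliteration step, before it can produce output.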

@neurlang neurlang added the bug Something isn't working label Mar 31, 2025

2 participants