Uby MySQL dump does not handle UTF-8 properly #172
Comments
That may depend on how the encoding in the MySQL database has been configured and whether that configuration is included in the MySQL dump. A properly configured MySQL should support UTF-8.
We always used this configuration for the creation of the database (before starting the actual import): CREATE SCHEMA
Hm. Do the tables that were then created by Hibernate also reflect the encoding and collation? In principle, you can set these parameters individually for each table.
The tables are created via Hibernate in LMFDBUtils.
Does the MySQL dump file include lines with reference to UTF-8? E.g.
Yes: C:\Users\Judith>more uby_open_0_7_0_nonfree.sql
I think we had that problem as well with WebAnno until we set the default server encoding (and updated the documentation accordingly):
It seems that it can alternatively be specified on the command line when importing. Cf. https://makandracards.com/makandra/595-dumping-and-importing-from-to-mysql-in-an-utf-8-safe-way
But I believe that if you only specify it on the command line instead of in my.cnf, you'll also have to set the encoding in the JDBC connection string. Cf. http://stackoverflow.com/questions/13234433/utf8-garbled-when-importing-into-mysql
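To illustrate why the encoding has to agree on the server, the import, and the JDBC connection, here is a minimal Python sketch of what happens when UTF-8 bytes are read back as Latin-1. It does not talk to MySQL; the function name `misdecode_as_latin1` is made up for this example, and Latin-1 is only an assumed stand-in for a mismatched connection charset.

```python
# Illustration only: simulates an encoding mismatch, not MySQL itself.
WORD = "s\u00e4gen"  # "sägen", written with an escape so the source encoding cannot matter


def misdecode_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode the same bytes as Latin-1 (the mismatch)."""
    return text.encode("utf-8").decode("latin-1")


def round_trip_utf8(text: str) -> str:
    """Encode and decode with the same charset; the text survives unchanged."""
    return text.encode("utf-8").decode("utf-8")


if __name__ == "__main__":
    print(round_trip_utf8(WORD))      # unchanged
    print(misdecode_as_latin1(WORD))  # the umlaut turns into two garbage characters
```

If all sides use UTF-8, the round trip is lossless; garbling of this kind would point to an encoding mismatch rather than to a collation issue.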
@judithek Unfortunately, that is the expected behavior for the utf8_general_ci collation in MySQL. If you want to differentiate between sägen and sagen in your query, the collation must be utf8_bin. Do this and you'll see it:
Note, however, that if you use the utf8_bin collation, only case-sensitive searches are possible... :(
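The difference between the two collations can be sketched in Python. This is only a rough approximation, not MySQL's actual collation tables: `fold_general_ci` is a hypothetical helper that strips combining accents and lowercases, which mimics how utf8_general_ci treats "ä" like "a" and ignores case, while `equal_bin` compares raw UTF-8 bytes the way a binary collation does.

```python
import unicodedata


def fold_general_ci(text: str) -> str:
    """Approximate an accent- and case-insensitive comparison key (assumption, not MySQL's real table)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks such as the diaeresis on "ä".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def equal_general_ci(a: str, b: str) -> bool:
    """Compare the way an accent/case-insensitive collation roughly would."""
    return fold_general_ci(a) == fold_general_ci(b)


def equal_bin(a: str, b: str) -> bool:
    """Compare raw UTF-8 bytes, as a binary collation does."""
    return a.encode("utf-8") == b.encode("utf-8")


if __name__ == "__main__":
    for left, right in [("s\u00e4gen", "sagen"), ("Sagen", "sagen")]:
        print(left, right, equal_general_ci(left, right), equal_bin(left, right))
```

Under these assumptions the insensitive comparison matches both pairs and the binary comparison matches neither, which is the trade-off described above: telling sägen from sagen also means telling Sagen from sagen.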
@reckart @betoboullosa Thanks a lot for your comments and insights! @betoboullosa IMO case-sensitive searches make perfect sense for a lexical resource.
@judithek Yes, case sensitivity is OK for a lexical resource. It might be a problem when searching for words in sentences, for example.
When querying a UBY MySQL dump for German lemmas with umlauts, the umlauts are not handled correctly:
e.g., querying for "sägen" returns all entries for both "sägen" and "sagen".
This can be reproduced in the Uby web browser, on the command line with plain MySQL, and using the Uby API.
This issue does not occur with the H2 database (another point in favor of using H2).