Uby MySQL dump does not handle UTF-8 properly #172

Open
judithek opened this issue Nov 24, 2016 · 11 comments
@judithek
Member

When querying a UBY MySQL dump for German lemmas with umlauts, the umlauts are not handled correctly: e.g. querying for "sägen" returns all entries for both "sägen" and "sagen".

This can be reproduced in the Uby web browser, on the command line with plain MySQL, and via the Uby-API.

This issue does not occur with the H2 database (another point in favor of using H2).

judithek added the bug label Nov 24, 2016
@reckart
Member

reckart commented Nov 24, 2016

That may depend on how the encoding in the MySQL database has been configured and whether that configuration is included in the MySQL dump. A properly configured MySQL should support UTF-8.
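
For reference, a minimal sketch of how those settings can be inspected on a given server/connection (standard MySQL statements; the actual values depend on the installation):

-- Character sets in effect for server, client and connection
SHOW VARIABLES LIKE 'character_set%';
-- Corresponding collations
SHOW VARIABLES LIKE 'collation%';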

@judithek
Member Author

We always used this configuration for the creation of the database (before starting the actual import):

CREATE SCHEMA DATABASE_NAME DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci ;
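
To double-check that this default actually took effect, the schema settings can be read back from information_schema (a sketch; DATABASE_NAME is the placeholder from above):

SELECT default_character_set_name, default_collation_name
FROM information_schema.SCHEMATA
WHERE schema_name = 'DATABASE_NAME';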

@reckart
Member

reckart commented Nov 24, 2016

Hm. Do the tables that were then created by Hibernate also reflect the encoding and collation? In principle, you can set these parameters individually for each table.
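
A sketch of how that could be checked (standard MySQL; DATABASE_NAME and SomeTable are placeholders):

-- Collation of every table in the schema
SELECT table_name, table_collation
FROM information_schema.TABLES
WHERE table_schema = 'DATABASE_NAME';

-- Full DDL of a single table, including its charset/collation
SHOW CREATE TABLE SomeTable;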

@judithek
Member Author

The tables are created via Hibernate in LMFDBUtils:
https://github.com/dkpro/dkpro-uby/blob/master/de.tudarmstadt.ukp.uby.persistence.transform-asl/src/main/java/de/tudarmstadt/ukp/lmf/transform/LMFDBUtils.java

  • this is not done for each table individually (a per-table sketch follows below).
  • apparently H2 has no problems with this way of creating tables, while MySQL does.
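
For completeness, if individual tables did not pick up the utf8 default, they could in principle be converted afterwards (a sketch only, not something that has been run against a Uby dump; SomeTable is a placeholder):

-- Converts both the table default and the existing character columns
ALTER TABLE SomeTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;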

@reckart
Member

reckart commented Nov 24, 2016

Does the MySQL dump file include lines with reference to UTF-8? E.g.

/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `my-table` (
...
) ENGINE=InnoDB AUTO_INCREMENT=12001 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

@judithek
Member Author

judithek commented Nov 24, 2016

Yes:

C:\Users\Judith>more uby_open_0_7_0_nonfree.sql

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

/*!40000 DROP DATABASE IF EXISTS `uby_open_0_7_0`*/;

CREATE DATABASE /*!32312 IF NOT EXISTS*/ `uby_open_0_7_0` /*!40100 DEFAULT CHARACTER SET utf8 */;

USE `uby_open_0_7_0`;
DROP TABLE IF EXISTS `ArgumentRelation`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `ArgumentRelation` (
  `argumentRelationId` bigint(20) NOT NULL,
  `relType` varchar(255) DEFAULT NULL,
  `relName` varchar(255) DEFAULT NULL,
  `target` varchar(255) DEFAULT NULL,
  `semanticArgumentId` varchar(255) DEFAULT NULL,
  `idx` int(11) DEFAULT NULL,
  PRIMARY KEY (`argumentRelationId`),
  KEY `argumentrelation_target_IDX` (`target`),
  KEY `FK3A19F3F9D07C6108` (`semanticArgumentId`),
  CONSTRAINT `FK3A19F3F9D07C6108` FOREIGN KEY (`semanticArgumentId`) REFERENCES `SemanticArgument` (`semanticArgumentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

LOCK TABLES `ArgumentRelation` WRITE;
/*!40000 ALTER TABLE `ArgumentRelation` DISABLE KEYS */;
.....

@reckart
Member

reckart commented Nov 24, 2016

I think we had that problem as well with WebAnno until we set the default server encoding (and updated the documentation accordingly):

Make sure your MySQL server is configured for UTF-8. Check that the following line is present in /etc/mysql/my.cnf:

character-set-server = utf8

Source: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/WebAnno%20(GitHub)%20(master)/de.tudarmstadt.ukp.clarin.webanno$webanno-doc/doclinks/2/#_prepare_database

It seems this can alternatively be specified on the command line when importing. Cf: https://makandracards.com/makandra/595-dumping-and-importing-from-to-mysql-in-an-utf-8-safe-way

But I believe that if you only specify it on the command line instead of in my.cnf, then you'll also have to set the encoding in the JDBC connection string. Cf: http://stackoverflow.com/questions/13234433/utf8-garbled-when-importing-into-mysql
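
Roughly, those two alternatives would look like this (a sketch based on the linked pages; user, database and file names are placeholders):

# Import with an explicit client character set instead of relying on my.cnf
mysql --default-character-set=utf8 -u USER -p DATABASE_NAME < uby_dump.sql

# Force UTF-8 on the JDBC side via the connection string
jdbc:mysql://localhost/DATABASE_NAME?useUnicode=true&characterEncoding=UTF-8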

@betoboullosa
Member

@judithek Unfortunately, that is the expected behavior for the utf8_general_ci collation in MySQL. If you want to differentiate between "sägen" and "sagen" in your query, the collation must be utf8_bin.

Do this and you'll see it:

create table test_unicodeci (a varchar(20)) character set utf8 collate utf8_unicode_ci;
create table test_bin (a varchar(20)) character set utf8 collate utf8_bin;

insert into test_unicodeci values ("sagen");
insert into test_unicodeci values ("sägen");

insert into test_bin values ("sagen");
insert into test_bin values ("sägen");

select * from test_unicodeci where a = "sägen";
+--------+
| a      |
+--------+
| sagen  |
| sägen  |
+--------+

select * from test_bin where a = "sägen";
+--------+
| a      |
+--------+
| sägen  |
+--------+

@betoboullosa
Member

Note, however, that if you use the utf8_bin collation, only case-sensitive searches are possible... :(

select * from test_unicodeci where a = "Sägen";
+--------+
| a      |
+--------+
| sagen  |
| sägen  |
+--------+

select * from test_bin where a = "Sägen";
Empty set (0.00 sec)
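
One possible middle ground is to keep the case-insensitive column collation and force a binary comparison only in the queries where umlauts must be distinguished (a sketch using the test table from above; note that such a comparison then also becomes case-sensitive):

select * from test_unicodeci where a COLLATE utf8_bin = "sägen";
+--------+
| a      |
+--------+
| sägen  |
+--------+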

@judithek
Member Author

@reckart @betoboullosa thanks a lot for your comments and insights!

@betoboullosa IMO case-sensitive searches make perfect sense for a lexical resource.

@betoboullosa
Member

@judithek Yes, case sensitivity is OK for a lexical resource. It might be a problem when searching for words in sentences, for example.
