Uby MySQL dump does not handle UTF-8 properly #172

Open
judithek opened this issue Nov 24, 2016 · 11 comments
@judithek
Member

When querying a UBY MySQL dump for German lemmas with umlauts, the umlauts are not handled correctly: e.g. querying for "sägen" returns all entries for both "sägen" and "sagen".

This can be reproduced in the Uby web browser, on the command line with plain MySQL, and via the Uby-API.

This issue does not occur with the H2 database (another point in favor of using H2).

judithek added the bug label Nov 24, 2016
@reckart
Member

reckart commented Nov 24, 2016

That may depend on how the encoding in the MySQL database has been configured and whether that configuration is included in the MySQL dump. A properly configured MySQL should support UTF-8.
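
For reference, a minimal sketch of how those settings can be inspected on a given server/connection (standard MySQL statements; the actual values depend on the installation):

-- Character sets in effect for server, client and connection
SHOW VARIABLES LIKE 'character_set%';
-- Corresponding collations
SHOW VARIABLES LIKE 'collation%';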

@judithek
Member Author

We always used this configuration for the creation of the database (before starting the actual import):

CREATE SCHEMA DATABASE_NAME DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci ;
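
To double-check that this default actually took effect, the schema settings can be read back from information_schema (a sketch; DATABASE_NAME is the placeholder from above):

SELECT default_character_set_name, default_collation_name
FROM information_schema.SCHEMATA
WHERE schema_name = 'DATABASE_NAME';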

@reckart
Member

reckart commented Nov 24, 2016

Hm. Do the tables that were then created by Hibernate also reflect the encoding and collation? In principle, you can set these parameters individually for each table.
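
A sketch of how that could be checked (standard MySQL; DATABASE_NAME and SomeTable are placeholders):

-- Collation of every table in the schema
SELECT table_name, table_collation
FROM information_schema.TABLES
WHERE table_schema = 'DATABASE_NAME';

-- Full DDL of a single table, including its charset/collation
SHOW CREATE TABLE SomeTable;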

@judithek
Member Author

The tables are created via Hibernate in LMFDBUtils:
https://github.com/dkpro/dkpro-uby/blob/master/de.tudarmstadt.ukp.uby.persistence.transform-asl/src/main/java/de/tudarmstadt/ukp/lmf/transform/LMFDBUtils.java

  • this is not done for each table individually (a per-table sketch follows below).
  • apparently H2 has no problems with this way of creating tables, while MySQL does.
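
For completeness, if individual tables did not pick up the utf8 default, they could in principle be converted afterwards (a sketch only, not something that has been run against a Uby dump; SomeTable is a placeholder):

-- Converts both the table default and the existing character columns
ALTER TABLE SomeTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;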

@reckart
Member

reckart commented Nov 24, 2016

Does the MySQL dump file include lines with reference to UTF-8? E.g.

/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `my-table` (
...
) ENGINE=InnoDB AUTO_INCREMENT=12001 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

@judithek
Member Author

judithek commented Nov 24, 2016

Yes:

C:\Users\Judith>more uby_open_0_7_0_nonfree.sql

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

/*!40000 DROP DATABASE IF EXISTS `uby_open_0_7_0`*/;

CREATE DATABASE /*!32312 IF NOT EXISTS*/ `uby_open_0_7_0` /*!40100 DEFAULT CHARACTER SET utf8 */;

USE `uby_open_0_7_0`;
DROP TABLE IF EXISTS `ArgumentRelation`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `ArgumentRelation` (
  `argumentRelationId` bigint(20) NOT NULL,
  `relType` varchar(255) DEFAULT NULL,
  `relName` varchar(255) DEFAULT NULL,
  `target` varchar(255) DEFAULT NULL,
  `semanticArgumentId` varchar(255) DEFAULT NULL,
  `idx` int(11) DEFAULT NULL,
  PRIMARY KEY (`argumentRelationId`),
  KEY `argumentrelation_target_IDX` (`target`),
  KEY `FK3A19F3F9D07C6108` (`semanticArgumentId`),
  CONSTRAINT `FK3A19F3F9D07C6108` FOREIGN KEY (`semanticArgumentId`) REFERENCES `SemanticArgument` (`semanticArgumentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

LOCK TABLES `ArgumentRelation` WRITE;
/*!40000 ALTER TABLE `ArgumentRelation` DISABLE KEYS */;
.....

@reckart
Member

reckart commented Nov 24, 2016

I think we had that problem as well with WebAnno until we set the default server encoding (and updated the documentation accordingly):

Make sure your MySQL server is configured for UTF-8. Check that the following line is present in /etc/mysql/my.cnf:

character-set-server = utf8

Source: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/WebAnno%20(GitHub)%20(master)/de.tudarmstadt.ukp.clarin.webanno$webanno-doc/doclinks/2/#_prepare_database

It seems this can alternatively be specified on the command line when importing. Cf: https://makandracards.com/makandra/595-dumping-and-importing-from-to-mysql-in-an-utf-8-safe-way

But I believe that if you only specify it on the command line instead of in my.cnf, then you'll also have to set the encoding in the JDBC connection string. Cf: http://stackoverflow.com/questions/13234433/utf8-garbled-when-importing-into-mysql
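
Roughly, those two alternatives would look like this (a sketch based on the linked pages; user, database and file names are placeholders):

# Import with an explicit client character set instead of relying on my.cnf
mysql --default-character-set=utf8 -u USER -p DATABASE_NAME < uby_dump.sql

# Force UTF-8 on the JDBC side via the connection string
jdbc:mysql://localhost/DATABASE_NAME?useUnicode=true&characterEncoding=UTF-8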

@betoboullosa
Member

@judithek Unfortunately, that is the expected behavior for the utf8_general_ci collation in MySQL. If you want to differentiate between "sägen" and "sagen" in your query, the collation must be utf8_bin.

Do this and you'll see it:

create table test_unicodeci (a varchar(20)) character set utf8 collate utf8_unicode_ci;
create table test_bin (a varchar(20)) character set utf8 collate utf8_bin;

insert into test_unicodeci values ("sagen");
insert into test_unicodeci values ("sägen");

insert into test_bin values ("sagen");
insert into test_bin values ("sägen");

select * from test_unicodeci where a = "sägen";
+--------+
| a      |
+--------+
| sagen  |
| sägen  |
+--------+

select * from test_bin where a = "sägen";
+--------+
| a      |
+--------+
| sägen  |
+--------+

@betoboullosa
Member

Note, however, that if you use the utf8_bin collation, only case-sensitive searches are possible... :(

select * from test_unicodeci where a = "Sägen";
+--------+
| a      |
+--------+
| sagen  |
| sägen  |
+--------+

select * from test_bin where a = "Sägen";
Empty set (0.00 sec)
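
One possible middle ground is to keep the case-insensitive column collation and force a binary comparison only in the queries where umlauts must be distinguished (a sketch using the test table from above; note that such a comparison then also becomes case-sensitive):

select * from test_unicodeci where a COLLATE utf8_bin = "sägen";
+--------+
| a      |
+--------+
| sägen  |
+--------+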

@judithek
Member Author

@reckart @betoboullosa thanks a lot for your comments and insights!

@betoboullosa IMO case-sensitive searches make perfect sense for a lexical resource.

@betoboullosa
Member

@judithek Yes, case sensitivity is OK for a lexical resource. It might be a problem when searching for words in sentences, for example.
