Add support for utf8mb3 #3007

Open: wants to merge 1 commit into base: master


@danielbeardsley danielbeardsley commented Sep 3, 2024

Yes, it's an old encoding that mysql is phasing out. However, some DBs out there still use it and this library shouldn't crash in unexpected ways. Some servers (like ours) have some default connection settings that are still set to utf8mb3 even though all our columns / tables are in utf8mb4.

Ironically, if you run a query that has no results (REPLACE, DELETE, ...) then the metadata in the empty resultset is set to the server's default charset. If that happens to be utf8mb3, this library crashes.
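To illustrate the failure mode described above, here is a minimal sketch, not the library's actual code. The table `encodingByCharsetId` and the functions `decode`, `decodeColumn` and `tryDecode` are hypothetical names. The ids 8, 33 and 63 are the ones quoted later in this thread, and the decoder is a stand-in that only accepts Node's built-in encodings. It shows how a charset id with no table entry turns into the `'undefined'` encoding error.

```javascript
// Hypothetical, trimmed-down charset-id -> encoding table (not the library's real file).
// Ids 8, 33 and 63 are the ones quoted in this thread; 192 is deliberately absent.
const encodingByCharsetId = {
  8: 'latin1',
  33: 'cesu8',
  63: 'binary',
};

// Hypothetical stand-in decoder: accepts only Node's built-in encodings.
function decode(buffer, encoding) {
  if (!Buffer.isEncoding(encoding)) {
    throw new Error(`Encoding not recognized: '${encoding}'`);
  }
  return buffer.toString(encoding);
}

// Looks up the encoding for the charset id reported in the result set metadata.
function decodeColumn(buffer, charsetId) {
  const encoding = encodingByCharsetId[charsetId]; // undefined for unknown ids
  return decode(buffer, encoding);
}

// Returns the decoded string, or the error message if decoding throws.
function tryDecode(charsetId) {
  try {
    return decodeColumn(Buffer.from('ok'), charsetId);
  } catch (err) {
    return err.message;
  }
}

console.log(tryDecode(8));   // ok
console.log(tryDecode(192)); // Encoding not recognized: 'undefined'
```

Under these assumptions the lookup itself never fails; the error only surfaces once the missing value reaches the decoder, which is why the crash looks unrelated to the query that triggered it.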

Closes #1398

Co-Author @davidrans

Yes, it's an old encoding that mysql is phasing out. However, some DBs
out there still use it. Some servers (like ours) have some default
connection setting that is still set to utf8mb3 even though all our
columns / tables are in utf8mb4.

Ironically, if you run a query that has *no results* (REPLACE, DELETE,
...) then the metadata in the empty result set is set to the server's
default charset. If that happens to be utf8mb3, this library crashes.

codecov bot commented Sep 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.13%. Comparing base (30064f4) to head (264dfdd).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3007   +/-   ##
=======================================
  Coverage   88.13%   88.13%           
=======================================
  Files          71       71           
  Lines       12889    12890    +1     
  Branches     1352     1353    +1     
=======================================
+ Hits        11360    11361    +1     
  Misses       1529     1529           
Flag Coverage Δ
compression-0 88.13% <100.00%> (+<0.01%) ⬆️
compression-1 88.13% <100.00%> (+<0.01%) ⬆️
tls-0 87.55% <100.00%> (+<0.01%) ⬆️
tls-1 87.89% <100.00%> (+<0.01%) ⬆️


@wellwelwel
Collaborator

wellwelwel commented Sep 3, 2024

Thanks, @danielbeardsley 🙋🏻‍♂️

Closes #2640

Could you explain how these changes affect #2640?

@danielbeardsley
Author

Could you explain how these changes affect

I realize there likely could be several different causes of that error, but that's the error I got when I had this same utf8mb3 problem in single-connection mode. While using a pool connection, I received the error from #1398.

You're welcome to drop that "Closes" if you think it should stay open.

@sidorares
Owner

let's drop #2640, I'm not convinced it's directly related
thanks for the PR @danielbeardsley!

@sidorares
Owner

I'm trying to find some references to confirm mysql charset name <-> code <-> iconv charset name mapping

looking at https://github.com/mysql/mysql-server/blob/596f0d238489a9cf9f43ce1ff905984f58d227b6/sql/protocol_classic.cc#L406

  MySQL has a very flexible character set support as documented in
  [Character Set Support](http://dev.mysql.com/doc/refman/5.7/en/charset.html).
  The list of character sets and their IDs can be queried as follows:

<pre>
  SELECT id, collation_name FROM information_schema.collations ORDER BY id;
  +----+-------------------+
  | id | collation_name    |
  +----+-------------------+
  |  1 | big5_chinese_ci   |
  |  2 | latin2_czech_cs   |
  |  3 | dec8_swedish_ci   |
  |  4 | cp850_general_ci  |
  |  5 | latin1_german1_ci |
  |  6 | hp8_english_ci    |
  |  7 | koi8r_general_ci  |
  |  8 | latin1_swedish_ci |
  |  9 | latin2_general_ci |
  | 10 | swe7_swedish_ci   |
  +----+-------------------+
</pre>

  The following table shows a few common character sets.

  Number |  Hex  | Character Set Name
  -------|-------|-------------------
       8 |  0x08 | @ref my_charset_latin1 "latin1_swedish_ci"
      33 |  0x21 | @ref my_charset_utf8mb3_general_ci "utf8mb3_general_ci"
      63 |  0x3f | @ref my_charset_bin "binary"
  • utf8mb3_general_ci has code 33, which we currently map to cesu8

@sidorares
Owner

for context, cesu8: https://en.wikipedia.org/wiki/CESU-8
also see discussion in #374 (comment)

@davidrans

davidrans commented Sep 4, 2024

This is what I get on our DB. utf8mb3_unicode_ci maps to 192 which is where I got that number:

mysql> show collation WHERE charset = 'utf8mb3';
+-----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation                   | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+-----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb3_bin                 | utf8mb3 |  83 |         | Yes      |       1 | PAD SPACE     |
| utf8mb3_croatian_ci         | utf8mb3 | 213 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_czech_ci            | utf8mb3 | 202 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_danish_ci           | utf8mb3 | 203 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_esperanto_ci        | utf8mb3 | 209 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_estonian_ci         | utf8mb3 | 198 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_general_ci          | utf8mb3 |  33 | Yes     | Yes      |       1 | PAD SPACE     |
| utf8mb3_general_mysql500_ci | utf8mb3 | 223 |         | Yes      |       1 | PAD SPACE     |
| utf8mb3_german2_ci          | utf8mb3 | 212 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_hungarian_ci        | utf8mb3 | 210 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_icelandic_ci        | utf8mb3 | 193 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_latvian_ci          | utf8mb3 | 194 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_lithuanian_ci       | utf8mb3 | 204 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_persian_ci          | utf8mb3 | 208 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_polish_ci           | utf8mb3 | 197 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_romanian_ci         | utf8mb3 | 195 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_roman_ci            | utf8mb3 | 207 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_sinhala_ci          | utf8mb3 | 211 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_slovak_ci           | utf8mb3 | 205 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_slovenian_ci        | utf8mb3 | 196 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_spanish2_ci         | utf8mb3 | 206 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_spanish_ci          | utf8mb3 | 199 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_swedish_ci          | utf8mb3 | 200 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_tolower_ci          | utf8mb3 |  76 |         | Yes      |       1 | PAD SPACE     |
| utf8mb3_turkish_ci          | utf8mb3 | 201 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_unicode_520_ci      | utf8mb3 | 214 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_unicode_ci          | utf8mb3 | 192 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_vietnamese_ci       | utf8mb3 | 215 |         | Yes      |       8 | PAD SPACE     |
+-----------------------------+---------+-----+---------+----------+---------+---------------+

utf8mb3_unicode_ci is the one our db is using:

mysql> SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_server | utf8mb3 |
+----------------------+---------+
1 row in set (0.00 sec)

+------------------+--------------------+
| Variable_name    | Value              |
+------------------+--------------------+
| collation_server | utf8mb3_unicode_ci |
+------------------+--------------------+
1 row in set (0.00 sec)

@sidorares
Owner

@danielbeardsley could you check if there is any missing charset id in addition to 192? I believe everything from your table needs to map to utf8; the entries only differ in collation / case sensitivity, which does not apply to the driver
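As a rough sketch of that idea (not the actual patch), every collation id from the `SHOW COLLATION` output above can be pointed at one encoding name. The names `UTF8MB3_COLLATION_IDS`, `UTF8MB3_ENCODING` and `utf8mb3EncodingById` are hypothetical. The value `'utf8'` is an assumption taken from the comment above; the thread also notes that id 33 is currently mapped to cesu8, so the right name is still open.

```javascript
// Collation ids copied from the SHOW COLLATION output in this thread:
// 33, 76, 83, 223, plus the contiguous range 192..215.
const UTF8MB3_COLLATION_IDS = [33, 76, 83, 223];
for (let id = 192; id <= 215; id++) {
  UTF8MB3_COLLATION_IDS.push(id);
}

// Assumption: a single encoding name serves every utf8mb3 collation, because
// collations change sorting and comparison, not how bytes are decoded.
const UTF8MB3_ENCODING = 'utf8';

// Hypothetical lookup table: collation id -> encoding name.
const utf8mb3EncodingById = new Map(
  UTF8MB3_COLLATION_IDS.map((id) => [id, UTF8MB3_ENCODING])
);

console.log(utf8mb3EncodingById.size);     // 28
console.log(utf8mb3EncodingById.get(192)); // utf8
```

The 28 entries match the 28 rows in the table above, which is a cheap sanity check that no id was dropped.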

@danielbeardsley
Author

could you check if there is any missing charset id in addition to 192

I'm not sure I understand. Missing from where? The lib/constants/encoding_charset.js file only has 40 or so out of 300, so yes.

Oh wait, you mean utilize tools/generate... to print the missing mappings.

Here, I included their ids too:
{"dec8":3,"eucjpms":97,"geostd8":92,"hp8":6,"keybcs2":37,"swe7":10}

Are you suggesting I add these with their ids to encoding_charset.js? I'm not entirely sure I understand this system and what I'm adding.
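For readers following along, a list of missing mappings like the one above could be produced roughly as sketched here. This is not the repository's generator script: `findMissingCharsets`, `serverRows` and `knownCharsets` are hypothetical names, the rows are a hand-written sample using ids that appear in this thread, and which charsets count as "known" is an assumption made for the example.

```javascript
// Hypothetical helper: report charsets the server knows but the driver's
// table does not, keeping the first id seen for each charset name.
function findMissingCharsets(serverRows, knownCharsets) {
  const missing = {};
  for (const { id, charset } of serverRows) {
    const alreadyListed = Object.prototype.hasOwnProperty.call(missing, charset);
    if (!knownCharsets.has(charset) && !alreadyListed) {
      missing[charset] = id;
    }
  }
  return missing;
}

// Sample rows shaped like (id, character set name); ids taken from this thread.
const serverRows = [
  { id: 1, charset: 'big5' },
  { id: 3, charset: 'dec8' },
  { id: 6, charset: 'hp8' },
  { id: 8, charset: 'latin1' },
  { id: 10, charset: 'swe7' },
];

// Assumption for the example: only these charsets already have a mapping.
const knownCharsets = new Set(['big5', 'latin1']);

console.log(JSON.stringify(findMissingCharsets(serverRows, knownCharsets)));
// {"dec8":3,"hp8":6,"swe7":10}
```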

Successfully merging this pull request may close these issues.

Error: Encoding not recognized: 'undefined'
4 participants