Add support for specific fixed encoding in ANSI functions #45

Open
pali opened this issue Jan 2, 2020 · 10 comments

@pali

pali commented Jan 2, 2020

Currently the ANSI functions use the char* type for passing string arguments. On Linux builds, the value of a char* is interpreted as encoded according to the current locale settings, more precisely according to what was passed to the setlocale(LC_CTYPE, ...) call. By default, when the application does not call setlocale at all, the "C" locale (7-bit ASCII) is in effect and the locale environment variables are ignored.

But some ODBC drivers expect that char* values in ANSI functions are always encoded in UTF-8, independently of the current locale settings (setlocale()).

So it would be nice if the iODBC manager provided an API to set an explicit encoding to be used for every conversion from char* to SQLWCHAR* and vice versa. That would give better support for drivers that expect a fixed encoding (e.g. UTF-8) in ANSI functions.
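
The default locale behavior described above can be checked without changing any process state. The following is a minimal sketch, assuming a POSIX system (nl_langinfo is POSIX, not ISO C); the helper names are illustrative only and are not part of iODBC:

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Name of the LC_CTYPE locale currently in effect.
 * Passing NULL only queries; it does not modify the locale. */
const char *current_ctype_locale(void)
{
    return setlocale(LC_CTYPE, NULL);
}

/* Non-zero if char* strings are interpreted as UTF-8 under the
 * current locale, i.e. if the locale codeset is "UTF-8". */
int ansi_strings_are_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

In a program that never calls setlocale, current_ctype_locale() reports "C" and ansi_strings_are_utf8() returns 0, whatever LANG or LC_ALL say in the environment.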

@TallTed
Contributor

TallTed commented Jan 6, 2020

@smalinin @pkleef @openlink -- Please have a look at this.

@pali
Author

pali commented Jan 6, 2020

Just to note, one such ODBC driver that always uses UTF-8 char* for the ANSI ODBC functions is
sqliteodbc from http://www.ch-werner.de/sqliteodbc/
So you cannot use iODBC with this sqliteodbc driver in ANSI mode when the system locale codeset is not UTF-8.

@smalinin
Collaborator

smalinin commented Jan 6, 2020

@pali

  • The sqliteodbc driver supports Unicode ODBC calls, so if you want to work with Unicode data it is better to use the Unicode ODBC calls; that is more portable.
  • We currently use OS functions for conversion between ANSI and Unicode, so switching to something like an iconv library would not be a simple update.
  • If you just need UTF-8 support for the ODBC driver, the iODBC sources in the develop branch support multiple Unicode encodings for the ODBC Unicode calls, so you could use the ODBC Unicode calls with UTF8, UTF16 or UCS4 (with all ODBC drivers, whether UTF8/UTF16/UCS4). I could write up the details of how to use this, if you need them.

@pali
Author

pali commented Jan 6, 2020

> • The sqliteodbc driver supports Unicode ODBC calls, so if you want to work with Unicode data it is better to use the Unicode ODBC calls; that is more portable.

SQLite works internally in UTF-8, and therefore the preferred way to use it is via the ANSI ODBC API. The Unicode ODBC API means working in UTF-16 mode. The preferred Unicode encoding on Linux is UTF-8 (and always was UTF-8), therefore most applications use UTF-8.

So your suggestion is basically to convert strings from the application's native UTF-8 encoding to UTF-16, then pass the UTF-16 strings to the iODBC manager via the Unicode API, which passes them on to the SQLite ODBC driver via the Unicode API. The SQLite ODBC driver's Unicode API then converts the UTF-16 string back to UTF-8 and passes it to the driver's ANSI API, which then passes it to the SQLite database implementation.

So basically two useless conversions are involved, UTF-8 --> UTF-16 and UTF-16 --> UTF-8. I think it is always better to pass the UTF-8 string directly and avoid useless conversions on different layers.
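
To make the round trip concrete, here is a minimal sketch of the two conversions (UTF-8 to UTF-16 and back). It is written from scratch for illustration and is not the code used by iODBC or sqliteodbc; it rejects malformed sequences but, for brevity, does not reject overlong UTF-8 forms. The output of the second step is byte-for-byte identical to the input of the first, which is why the pair is pure overhead:

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-8 -> UTF-16. Returns the number of 16-bit units written (without the
 * terminator), or (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;
    if (cap == 0) return (size_t)-1;
    while (*s) {
        uint32_t cp;
        int extra;
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;
        s++;
        while (extra-- > 0) {
            /* A NUL byte also fails this test, so we never read past the end. */
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return (size_t)-1;
        if (cp >= 0x10000) {                 /* needs a surrogate pair */
            if (n + 2 >= cap) return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= cap) return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        }
    }
    dst[n] = 0;
    return n;
}

/* UTF-16 -> UTF-8. Returns the number of bytes written (without the
 * terminator), or (size_t)-1 on malformed input or insufficient space. */
size_t utf16_to_utf8(const uint16_t *src, char *dst, size_t cap)
{
    size_t n = 0;
    if (cap == 0) return (size_t)-1;
    while (*src) {
        uint32_t cp = *src++;
        if (cp >= 0xD800 && cp <= 0xDBFF) {  /* high surrogate: need a low one */
            if (*src < 0xDC00 || *src > 0xDFFF) return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;               /* stray low surrogate */
        }
        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + len >= cap) return (size_t)-1;
        if (len == 1) {
            dst[n++] = (char)cp;
        } else if (len == 2) {
            dst[n++] = (char)(0xC0 | (cp >> 6));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            dst[n++] = (char)(0xE0 | (cp >> 12));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            dst[n++] = (char)(0xF0 | (cp >> 18));
            dst[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}
```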

> • We currently use OS functions for conversion between ANSI and Unicode, so switching to something like an iconv library would not be a simple update.

I understand that adding and using another encoding library may not be simple. That is why I opened this feature request: it would be nice to avoid re-encoding when it is not needed.

> • If you just need UTF-8 support for the ODBC driver, the iODBC sources in the develop branch support multiple Unicode encodings for the ODBC Unicode calls, so you could use the ODBC Unicode calls with UTF8, UTF16 or UCS4. I could write up the details of how to use this, if you need them.

I quickly looked at this code and, if I understood correctly, the Unicode API has a switch to supply UTF-8 strings via SQLWCHAR*. But this is not widely supported. Most ODBC drivers expect either UTF-16 or UTF-32 buffers in the Unicode SQLWCHAR* API, not UTF-8. That is also because UTF-8 strings are null-terminated strings stored in the char* type, which on Unix systems is in most cases mapped to the ANSI API.

@pali
Author

pali commented Jan 6, 2020

Anyway, SQLite ODBC is not the only driver that works in this mode. I mentioned it as a good example: most developers know it, it can be easily tested (to check how it works), and it is open source, so anybody can check how it is really implemented.

But I have another example of an ODBC driver that falls into this fixed-encoding category: the Vertica ODBC driver. Vertica is a commercial, proprietary database, and its ODBC driver also ignores locale settings (*). So e.g. when the locale is set to Latin1, it still expects the ODBC manager to pass UTF-8 strings. Because the driver is proprietary, this behavior cannot be changed, and it is not even documented. ODBC is the only way to use Vertica from a C/C++ application.

(*) One exception: when the locale is set to the 7-bit ASCII "C" or "POSIX" locale, the driver respects it and works only in 7-bit mode.

@smalinin
Collaborator

smalinin commented Jan 6, 2020

> I quickly looked at this code and, if I understood correctly, the Unicode API has a switch to supply UTF-8 strings via SQLWCHAR*. But this is not widely supported. Most ODBC drivers expect either UTF-16 or UTF-32 buffers in the Unicode SQLWCHAR* API, not UTF-8. That is also because UTF-8 strings are null-terminated strings stored in the char* type, which on Unix systems is in most cases mapped to the ANSI API.

  • In the DSN or driver attributes, you specify whether the ODBC driver uses UTF8, UTF16 or UCS4 for Unicode calls.
  • In the iODBC *.ini file, you specify that all applications use UTF8, UTF16 or UCS4 for Unicode calls, or you set the OS environment variable ODBC_APP_UNICODE_TYPE=UTF8 (or UTF16/UCS4).

After this, the iODBC driver manager will convert all Unicode data between the application's Unicode code page and the driver's Unicode code page, so you could (for example) make UTF8 Unicode calls with any Unicode ODBC driver, whether it uses UTF8, UTF16 or UCS4.
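
As a rough illustration of the environment-variable route, an application could set ODBC_APP_UNICODE_TYPE itself before touching ODBC. This is only a sketch: the variable name and its three values come from the comment above, but the helper name is hypothetical, and the point at which iODBC actually reads the variable is an assumption (setting it before the first ODBC call is the conservative choice).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of iODBC: choose the application-side
 * Unicode encoding by setting ODBC_APP_UNICODE_TYPE.
 * Returns 0 on success, -1 for an unknown type or if setenv fails. */
int select_app_unicode_type(const char *type)
{
    /* The three values named in the comment above. */
    static const char *const allowed[] = { "UTF8", "UTF16", "UCS4" };
    size_t i;

    for (i = 0; i < sizeof allowed / sizeof allowed[0]; i++) {
        if (strcmp(type, allowed[i]) == 0)
            return setenv("ODBC_APP_UNICODE_TYPE", type, 1) == 0 ? 0 : -1;
    }
    return -1;
}
```

Note that setenv changes the environment of the whole process, so this belongs in early single-threaded start-up code.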

@smalinin
Collaborator

smalinin commented Jan 6, 2020

@pali
Also, I think we could add an iODBC setting/flag to call
setlocale (LC_ALL, "");
in the iODBC code when the environment is created for the first time. That should resolve your problem with setlocale; I think it would be the easiest way.

@pali
Author

pali commented Jan 6, 2020

> Also, I think we could add an iODBC setting/flag to call
> setlocale (LC_ALL, "");
> in the iODBC code when the environment is created for the first time. That should resolve your problem with setlocale; I think it would be the easiest way.

It is really not a good idea to call setlocale in a library! setlocale changes the locale for all threads of the running application and may therefore break it. The application does not expect a library loaded in one thread to change the locale of another thread. Normally the application itself calls setlocale at the beginning with appropriate settings. Also, LC_ALL changes e.g. numeric conversions and thereby the behavior of C functions such as scanf/printf.
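
For comparison, POSIX.1-2008 offers per-thread locales (newlocale/uselocale), which let a library switch LC_CTYPE for the calling thread only and leave the process-wide locale alone. Nobody in this thread proposed this; the sketch below only illustrates the difference from setlocale. The function names are hypothetical, and it assumes a platform that declares these calls in <locale.h> (e.g. glibc).

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stddef.h>

/* Run fn(arg) with LC_CTYPE taken from locale `name`, for the calling
 * thread only. All other categories are copied from the locale currently
 * in effect. Returns 0 on success, -1 if the locale cannot be created. */
int with_thread_ctype(const char *name, void (*fn)(void *), void *arg)
{
    locale_t base, loc, prev;

    base = duplocale(uselocale((locale_t)0));   /* copy of the current locale */
    if (base == (locale_t)0)
        return -1;
    loc = newlocale(LC_CTYPE_MASK, name, base); /* consumes base on success */
    if (loc == (locale_t)0) {
        freelocale(base);
        return -1;
    }
    prev = uselocale(loc);   /* affects this thread only */
    fn(arg);
    uselocale(prev);         /* restore whatever was active before */
    freelocale(loc);
    return 0;
}

/* Example callback: records whether a thread-local locale is active. */
void record_is_scoped(void *arg)
{
    *(int *)arg = (uselocale((locale_t)0) != LC_GLOBAL_LOCALE);
}
```

Calling with_thread_ctype("C", record_is_scoped, &flag) sets flag to 1 inside the callback, while setlocale(LC_ALL, NULL) in the caller still reports the unchanged global locale afterwards.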

@pali
Author

pali commented Jan 7, 2020

> • We currently use OS functions for conversion between ANSI and Unicode, so switching to something like an iconv library would not be a simple update.

To all functions that do conversion, iODBC already passes a DM_CONV structure, which contains information on how to do the Unicode API encoding.

Would it really be hard to extend this DM_CONV structure to also contain information on how to do the ANSI API encoding? E.g. there could be a boolean flag saying whether to use OS functions for the ANSI <--> Unicode conversion, plus an iconv object for some other fixed encoding (e.g. UTF-8). Or an even simpler variant without iconv: a flag that selects either the OS functions or fixed UTF-8 encoding. As iODBC already implements conversion between UTF-8, UTF-16 and UTF-32, iconv is not needed when the "fixed encoding" for ANSI mode is set to UTF-8.

But this is just what I concluded from inspecting the current iODBC code.
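
The simpler variant (a flag that selects OS functions or fixed UTF-8) could look roughly like the sketch below. Everything here is hypothetical: conv_sketch_t merely stands in for the real DM_CONV, whose actual fields are not shown in this thread, and the UTF-8 decoder is a simplified illustration (it does not reject overlong forms), not iODBC's existing converter. The locale path assumes wchar_t holds UCS-4 code points, as it does on Linux/glibc.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical: how char* data in the ANSI API should be interpreted. */
typedef enum { ANSI_CP_LOCALE, ANSI_CP_UTF8 } ansi_codepage_t;

/* Hypothetical stand-in for DM_CONV; only the proposed new field is shown. */
typedef struct {
    ansi_codepage_t ansi_cp;
} conv_sketch_t;

/* Decode NUL-terminated UTF-8 into code points.
 * Returns the count written, or (size_t)-1 on error. */
static size_t utf8_to_ucs4(const char *src, uint32_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;
    while (*s) {
        uint32_t cp;
        int extra;
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;
        s++;
        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80) return (size_t)-1; /* also stops at NUL */
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }
        if (n >= cap) return (size_t)-1;
        dst[n++] = cp;
    }
    return n;
}

/* Convert an ANSI-API string to code points according to the flag. */
size_t ansi_to_ucs4(const conv_sketch_t *conv, const char *src,
                    uint32_t *dst, size_t cap)
{
    size_t need, i;
    wchar_t *tmp;

    if (conv->ansi_cp == ANSI_CP_UTF8)
        return utf8_to_ucs4(src, dst, cap);   /* fixed encoding, no locale */

    /* Locale-dependent path via the OS function (current behavior). */
    need = mbstowcs(NULL, src, 0);
    if (need == (size_t)-1 || need > cap) return (size_t)-1;
    tmp = malloc((need + 1) * sizeof *tmp);
    if (tmp == NULL) return (size_t)-1;
    mbstowcs(tmp, src, need + 1);
    for (i = 0; i < need; i++)
        dst[i] = (uint32_t)tmp[i];
    free(tmp);
    return need;
}
```

With ansi_cp set to ANSI_CP_UTF8, the result no longer depends on whether or how the application called setlocale, which is the behavior drivers such as sqliteodbc expect.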

@pali
Author

pali commented Jan 16, 2020

I see that some ODBC drivers support an IANAAppCodePage attribute for specifying the (fixed) encoding of the ANSI functions. The value of IANAAppCodePage is not a string but a number (the MIBenum) from the mapping table at https://www.iana.org/assignments/character-sets/character-sets.xhtml

So e.g. IANAAppCodePage=4 means Latin1 encoding and IANAAppCodePage=106 means UTF-8.
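
A driver manager that accepted such a number would need to map it to an encoding name, for example one that could be handed to iconv. The following is a minimal sketch with a deliberately tiny table; the function name is hypothetical and only three well-known MIBenum values from the IANA registry are included:

```c
#include <stddef.h>

/* One row of the (partial) IANA MIBenum -> charset name mapping. */
typedef struct {
    int mib;
    const char *name;
} mib_entry_t;

static const mib_entry_t mib_table[] = {
    {   3, "US-ASCII"   },
    {   4, "ISO-8859-1" },   /* Latin1 */
    { 106, "UTF-8"      },
};

/* Hypothetical helper: return the charset name for an IANAAppCodePage
 * value, or NULL if the value is not in this partial table. */
const char *mib_to_charset(int mib)
{
    size_t i;
    for (i = 0; i < sizeof mib_table / sizeof mib_table[0]; i++) {
        if (mib_table[i].mib == mib)
            return mib_table[i].name;
    }
    return NULL;
}
```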
