
caPutJsonLog may send invalid JSON #9

Open
dirk-zimoch opened this issue Sep 13, 2021 · 3 comments

Comments

@dirk-zimoch
Contributor

When writing non-UTF-8 characters to a field that is logged, e.g. German äöü or µ or ° from a terminal in ISO-8859-1 encoding, the non-UTF-8 bytes (0x80 ... 0xff) are copied into the JSON string without any checks or conversions. The resulting invalid JSON may be impossible for a JSON-compliant logger to parse.

What to do?
a) Convert to UTF-8, assuming some encoding, e.g. ISO-8859-1, which is relatively easy; other encodings are more tricky.
b) Skip those bytes.
c) Replace them with some "special" character (a minimal sketch of this option follows below).
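
For illustration, option (c) could look something like the following minimal sketch in C. Nothing here exists in caPutJsonLog today; the function name and buffer contract are made up. It passes valid UTF-8 sequences through and replaces each offending byte with U+FFFD:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of option (c): copy src to dst, keeping valid UTF-8
 * sequences and replacing each offending byte with U+FFFD
 * (0xEF 0xBF 0xBD). dst must hold up to 3 * strlen(src) + 1 bytes.
 * Note: this only checks lead/continuation byte patterns; it does
 * not reject overlong forms or surrogates. */
static void sanitizeUtf8(char *dst, const char *src)
{
    static const char repl[] = "\xEF\xBF\xBD";
    const unsigned char *s = (const unsigned char *) src;

    while (*s) {
        size_t len;

        if (s[0] < 0x80)                len = 1;   /* ASCII */
        else if ((s[0] & 0xE0) == 0xC0) len = 2;
        else if ((s[0] & 0xF0) == 0xE0) len = 3;
        else if ((s[0] & 0xF8) == 0xF0) len = 4;
        else                            len = 0;   /* invalid lead byte */

        if (len > 1) {                  /* verify continuation bytes */
            size_t i;
            for (i = 1; i < len; i++) {
                if ((s[i] & 0xC0) != 0x80) {
                    len = 0;
                    break;
                }
            }
        }

        if (len) {                      /* valid sequence, copy as-is */
            memcpy(dst, s, len);
            dst += len;
            s += len;
        }
        else {                          /* replace one bad byte */
            memcpy(dst, repl, 3);
            dst += 3;
            s += 1;
        }
    }
    *dst = '\0';
}
```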

@mdavidsaver
Contributor

a) Convert to UTF-8, assuming some encoding ...

I take it you mean at runtime? Is it realistic to instead convert string constants in the relevant source files to UTF-8? It seems to me that this would be the only realistic way to escape from the N+1 nature of character encoding standards.

@anjohnson
Member

It looks like the Pandora's Box of the IOC's character encoding is getting opened, whether we like it or not.

  1. We should not attempt to write our own character encoding conversion code for EPICS, and we should not publish software that only works for some EPICS users (e.g. software that only supports ISO-8859-1). I'm not saying that we have to support any/all encodings though; we can require sites to do some localization tasks to implement their specific character encoding(s).
  2. This discussion will lead to many questions, e.g. should the IOC support one or many character encodings (and if many, how to tell which string uses which encoding)?
  3. If we (have to) pick one specific encoding for the IOC it should be UTF-8. Up to now we have left the encoding of IOC strings undefined, and it's been possible (AFAIK) to use any fixed 8-bit encoding across a whole EPICS site. It's not clear if or for how long we can sustain that position.
  4. It seems that PSI is currently using ISO-8859-1 for their IOC strings; other sites could be using other encodings. If we choose to make the IOC use UTF-8 but sites want to (continue to) use other encodings for strings sent over the network, we might be able to let them provide their own encoding and decoding routines as external plug-ins (a possible shape for such a hook is sketched after this list).
  5. We should probably move this discussion to core-talk.
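
For point 4, a plug-in hook might look something like the following. This is a purely hypothetical sketch; no such interface exists in Base, and all the names are made up:

```c
#include <stddef.h>

/* Hypothetical site-provided codec: each routine converts into a
 * caller-supplied buffer and returns the output length in bytes,
 * or -1 on error. */
typedef struct epicsCharsetCodec {
    const char *name;                       /* e.g. "ISO-8859-1" */
    int (*toUtf8)(char *dst, size_t dstSize,
                  const char *src, size_t srcLen);
    int (*fromUtf8)(char *dst, size_t dstSize,
                    const char *src, size_t srcLen);
} epicsCharsetCodec;

/* A site would register its codec at IOC startup: */
int epicsCharsetRegister(const epicsCharsetCodec *codec);
```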

@dirk-zimoch
Contributor Author

Michael, the problem is not string constants but user-provided run-time strings set with caput from a terminal with an encoding other than UTF-8.
Andrew, one should preferably use a library like iconv, but I don't know whether that exists for architectures other than GNU/Linux. In the case of ISO-8859-1 the conversion is so simple that it can be done without a library (see the sketch below), but other encodings require lots of lookup tables, and maybe worse.
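
For example, because ISO-8859-1 code points coincide with Unicode U+0000..U+00FF, the whole conversion fits in a few lines of C (illustrative sketch only; the function name is made up):

```c
/* Sketch of the ISO-8859-1 -> UTF-8 case: Latin-1 bytes equal the
 * Unicode code points U+0000..U+00FF, so bytes >= 0x80 expand to
 * exactly two UTF-8 bytes. dst must hold up to 2 * strlen(src) + 1
 * bytes. */
static void latin1ToUtf8(char *dst, const char *src)
{
    const unsigned char *s = (const unsigned char *) src;

    for (; *s; s++) {
        if (*s < 0x80) {
            *dst++ = (char) *s;                   /* ASCII unchanged */
        }
        else {
            *dst++ = (char) (0xC0 | (*s >> 6));   /* 0xC2 or 0xC3 */
            *dst++ = (char) (0x80 | (*s & 0x3F));
        }
    }
    *dst = '\0';
}
```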
It is not so much about what we use at PSI (though in fact we do have several ° chars in our db templates), but about what anyone may do and what breaks if they do it. EPICS did not care about encoding so far, but JSON does. Thus if we want to build JSON from EPICS, we have to care about it: either reject non-UTF-8 strings or convert them, either when importing them into the db or when exporting them to JSON.
