
caPutJsonLog may send invalid JSON #9

Open
dirk-zimoch opened this issue Sep 13, 2021 · 3 comments

Comments

@dirk-zimoch
Contributor

When writing non-UTF-8 characters to a field that is logged, e.g. German äöü or µ or ° from a terminal in ISO-8859-1 encoding, the non-UTF-8 bytes (0x80 ... 0xff) are copied into the JSON string without any checks or conversions. The resulting invalid JSON may be impossible for a JSON-compliant logger to parse.

What to do?
a) Convert to UTF-8, assuming some encoding, e.g. ISO-8859-1, which is relatively easy; other encodings are more tricky.
b) Skip those bytes.
c) Replace them with some "special" character (a minimal sketch of this option follows below).
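
For illustration, option (c) could look something like the following minimal sketch in C. Nothing here exists in caPutJsonLog today; the function name and buffer contract are made up. It passes valid UTF-8 sequences through and replaces each offending byte with U+FFFD:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of option (c): copy src to dst, keeping valid UTF-8
 * sequences and replacing each offending byte with U+FFFD
 * (0xEF 0xBF 0xBD). dst must hold up to 3 * strlen(src) + 1 bytes.
 * Note: this only checks lead/continuation byte patterns; it does
 * not reject overlong forms or surrogates. */
static void sanitizeUtf8(char *dst, const char *src)
{
    static const char repl[] = "\xEF\xBF\xBD";
    const unsigned char *s = (const unsigned char *) src;

    while (*s) {
        size_t len;

        if (s[0] < 0x80)                len = 1;   /* ASCII */
        else if ((s[0] & 0xE0) == 0xC0) len = 2;
        else if ((s[0] & 0xF0) == 0xE0) len = 3;
        else if ((s[0] & 0xF8) == 0xF0) len = 4;
        else                            len = 0;   /* invalid lead byte */

        if (len > 1) {                  /* verify continuation bytes */
            size_t i;
            for (i = 1; i < len; i++) {
                if ((s[i] & 0xC0) != 0x80) {
                    len = 0;
                    break;
                }
            }
        }

        if (len) {                      /* valid sequence, copy as-is */
            memcpy(dst, s, len);
            dst += len;
            s += len;
        }
        else {                          /* replace one bad byte */
            memcpy(dst, repl, 3);
            dst += 3;
            s += 1;
        }
    }
    *dst = '\0';
}
```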

@mdavidsaver
Contributor

a) Convert to UTF-8, assuming some encoding ...

I take it you mean at runtime? Is it realistic to instead convert string constants in the relevant source files to UTF-8? It seems to me that this would be the only realistic way to escape from the N+1 nature of character encoding standards.

@anjohnson
Member

It looks like the Pandora's Box of the IOC's character encoding is getting opened, whether we like it or not.

  1. We should not attempt to write our own character encoding conversion code for EPICS, and we should not publish software that only works for some EPICS users (e.g. software that only supports ISO-8859-1). I'm not saying that we have to support any/all encodings though; we can require sites to do some localization tasks to implement their specific character encoding(s).
  2. This discussion will lead to many questions, e.g. should the IOC support one or many character encodings (and if many, how to tell which string uses which encoding)?
  3. If we (have to) pick one specific encoding for the IOC it should be UTF-8. Up to now we have left the encoding of IOC strings undefined, and it's been possible (AFAIK) to use any fixed 8-bit encoding across a whole EPICS site. It's not clear if or for how long we can sustain that position.
  4. It seems that PSI is currently using ISO-8859-1 for their IOC strings; other sites could be using other encodings. If we choose to make the IOC use UTF-8 but sites want to (continue to) use other encodings for strings sent over the network, we might be able to let them provide their own encoding and decoding routines as external plug-ins (a possible shape for such a hook is sketched after this list).
  5. We should probably move this discussion to core-talk.
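
For point 4, a plug-in hook might look something like the following. This is a purely hypothetical sketch; no such interface exists in Base, and all the names are made up:

```c
#include <stddef.h>

/* Hypothetical site-provided codec: each routine converts into a
 * caller-supplied buffer and returns the output length in bytes,
 * or -1 on error. */
typedef struct epicsCharsetCodec {
    const char *name;                       /* e.g. "ISO-8859-1" */
    int (*toUtf8)(char *dst, size_t dstSize,
                  const char *src, size_t srcLen);
    int (*fromUtf8)(char *dst, size_t dstSize,
                    const char *src, size_t srcLen);
} epicsCharsetCodec;

/* A site would register its codec at IOC startup: */
int epicsCharsetRegister(const epicsCharsetCodec *codec);
```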

@dirk-zimoch
Contributor Author

Michael, the problem is not string constants but user-provided run-time strings set with caput from a terminal with an encoding other than UTF-8.
Andrew, one should preferably use a library like iconv, but I don't know whether that exists for architectures other than GNU/Linux. In the case of ISO-8859-1 the conversion is so simple that it can be done without a library (see the sketch below), but other encodings require lots of lookup tables, and maybe worse.
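
For example, because ISO-8859-1 code points coincide with Unicode U+0000..U+00FF, the whole conversion fits in a few lines of C (illustrative sketch only; the function name is made up):

```c
/* Sketch of the ISO-8859-1 -> UTF-8 case: Latin-1 bytes equal the
 * Unicode code points U+0000..U+00FF, so bytes >= 0x80 expand to
 * exactly two UTF-8 bytes. dst must hold up to 2 * strlen(src) + 1
 * bytes. */
static void latin1ToUtf8(char *dst, const char *src)
{
    const unsigned char *s = (const unsigned char *) src;

    for (; *s; s++) {
        if (*s < 0x80) {
            *dst++ = (char) *s;                   /* ASCII unchanged */
        }
        else {
            *dst++ = (char) (0xC0 | (*s >> 6));   /* 0xC2 or 0xC3 */
            *dst++ = (char) (0x80 | (*s & 0x3F));
        }
    }
    *dst = '\0';
}
```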
It is not so much about what we use at PSI (though in fact we do have several ° chars in our db templates), but about what anyone may do and what breaks if they do it. EPICS did not care about encoding so far, but JSON does. Thus if we want to build JSON from EPICS, we have to care about it: either reject non-UTF-8 strings or convert them, either when importing them into the db or when exporting them to JSON.
