MY_DATA - HTML doesn't handle an accented character in column name #432
Comments
Thanks Martin!
The same happens with accents in the data.
Thanks Martin, if the table rows are not being shown due to a character in the table title we should prioritise this to be fixed in the milestone after next (which should come out really soon).
@microbit-carlos Knowing little about such things, I just discovered what the problem is... The browser expects UTF-8 because of
Ah, thanks for checking Martin! Do you know how the characters with UTF-8 value end up being
Not really! AIUI, it happens when a character is not valid, I think because the charset makes the browser assume that the source data is UTF-8. I don't know if the script can get the raw data before it's interpreted as UTF-8. The script relies on finding FFFD (usually corresponding to an FF in the source) to terminate the string. MicroBitLog could avoid the problem by replacing non-ASCII characters, with ? for example. Assuming the script can't get at the source data, I guess the logged data needs to be UTF-8 for non-ASCII characters to be retrievable. I wonder what code page MakeCode uses? If the charset is changed to a code page, all characters are valid and the script needs some other way to find the end of the data.
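The truncation described in this comment can be reproduced offline. What follows is only a minimal Python sketch of the behaviour, assuming (as the comment suggests) that unused log space reads as 0xFF bytes and that the page script stops at the first U+FFFD. The function names and the sample log row are hypothetical; the real MicroBitLog is C++ and the real page script is JavaScript.

```python
REPLACEMENT = "\ufffd"  # what a UTF-8 decoder emits for an invalid byte


def visible_log(raw: bytes) -> str:
    """Model the page script: decode as UTF-8, stop at the first U+FFFD."""
    text = raw.decode("utf-8", errors="replace")
    end = text.find(REPLACEMENT)
    return text if end == -1 else text[:end]


def ascii_only(raw: bytes) -> bytes:
    """Hypothetical sanitiser: swap every non-ASCII byte for '?'."""
    return bytes(b if b < 0x80 else ord("?") for b in raw)


# Hypothetical log contents: a header containing a single-byte E8 (e grave),
# one data row, then 0xFF padding standing in for the unused log space.
data = b"Temp\xe8rature\n21\n"
padding = b"\xff" * 4

truncated = visible_log(data + padding)
sanitised = visible_log(ascii_only(data) + padding)
print(repr(truncated))
print(repr(sanitised))
```

In this model the unsanitised log is cut off just before the accented character, while the sanitised log survives up to the padding, at the cost of losing the accent.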
Right, so if I understand that right, it's possible that it is MakeCode that is encoding non-ASCII characters to 0xFFFD, and in that case we could update MicroBitLog to check for non-ASCII characters and replace them, is that right? I think the part that I find a bit weird is that if MakeCode didn't support UTF-8 encoding, then it'd likely encode non-ASCII characters to a single byte instead of two bytes. Or, if it uses any different type of encoding, which is able to output the same two bytes to identify "bad data" (
Okay, so trying the original programme (https://makecode.microbit.org/_U3ibMVU94dfH) with MakeCode live and opening the HTML file with a hex editor and But the HTML doesn't render, even if it does have data: Similarly, with the other programme created by @martinwork https://makecode.microbit.org/_93aHvuLvCJwq the @martinwork how did you read the 0xFFFD characters?
We are typing at the same time! The UTF-8 for e grave is C3A8. The logged data in the MY_DATA.HTM file from the first example above contains single-byte E8 characters. I think in the process of loading the HTM file, the browser is interpreting this single-byte text in the light of When the MY_DATA script scans the outerHTML string, it finds FFFD. In most cases micro:bit can only handle ASCII. I guess perhaps MicroBitLog could pass through UTF-8, but I'm not sure how the UTF-8 could be created.
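The byte values mentioned here are easy to check. A small Python sketch, using Latin-1 as a stand-in for whichever single-byte encoding actually produced the E8 (that part is an assumption; CP-1252 gives the same byte for this character), and a made-up surrounding string:

```python
e_grave = "\u00e8"  # the character "e grave"

utf8_bytes = e_grave.encode("utf-8")     # two bytes in UTF-8
single_byte = e_grave.encode("latin-1")  # one byte; CP-1252 agrees here

print(utf8_bytes.hex())
print(single_byte.hex())

# A lone E8 is not valid UTF-8, so a decoder that replaces errors
# (as browsers do) substitutes U+FFFD for it.
as_utf8 = (b"caf" + single_byte + b"!").decode("utf-8", errors="replace")
print([hex(ord(c)) for c in as_utf8])
```

The last line shows `0xfffd` in the position where the single-byte character was, which is what the MY_DATA script then sees in the string.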
Yes, sorry, you are right Martin, I trusted a bad UTF8-to-hex converter, which I think was actually using extended ASCII instead (as in that case Right, so something in MakeCode or CODAL is either using extended ASCII, CP-1252, or similar ( After that, as you mentioned, the browser reads the HTML file as UTF-8 and when it encounters the invalid byte
Yeah, for things like displaying text CODAL only deals with ASCII, but strings that are processed outside of the device (like serial messages or data to display in an HTML file or CSV file), where CODAL doesn't need to parse the string, those should be fine in whatever encoding, no?
Ah, yes I did not find the MakeCode implementation.
I was immediately thinking "yes", but I found previous related posts... All the above and linked issues seem interesting, though I haven't looked very carefully yet. I don't know what is happening here yet...
Right, yes, thanks for finding those! I think mostly the issue is that MakeCode doesn't do UTF-8 (or 16 or whatever) encoding, so users are struggling to get strings working when the encoding is not done "correctly".
My comment here was mostly along the lines of "if MakeCode encoded all strings as UTF-8 (like MicroPython does), there shouldn't be any problem passing those strings along, as long as we didn't try to scroll them on the display or similar". Based on the comments from microsoft/pxt-microbit#2372 (comment), the main drawback would be code size; however, trying to enable it in a project as shown in microsoft/pxt#6988 didn't seem to work for me. Doing this quick test it does look like MakeCode is doing something along the lines of encoding to UTF-16 and only keeping the LSB (e.g. By creating a simple programme with Using character Which makes sense if encoded like this: An easier example with surrogates that can be seen on serial would be
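The "UTF-16, keep only the LSB" hypothesis can be modelled in a few lines. This Python sketch is only an illustration of that guess, not MakeCode's actual implementation; the helper name is hypothetical, and the emoji is an arbitrary surrogate-pair character chosen here, not the example from the comment.

```python
def utf16_low_bytes(text: str) -> bytes:
    """Encode to UTF-16 and keep only the low byte of each 16-bit unit."""
    units = text.encode("utf-16-le")  # little-endian: low byte comes first
    return units[0::2]                # every second byte, starting at the low one


# U+00E8 is the single code unit 0x00E8, so only E8 survives.
print(utf16_low_bytes("\u00e8").hex())

# U+1F600 is the surrogate pair D83D DE00, so the low bytes are 3D and 00.
print(utf16_low_bytes("\U0001F600").hex())
```

Under this model an accented Latin-1 character comes out as the single byte seen in the log file, and a character outside the Basic Multilingual Plane comes out as two unrelated bytes.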
The utf8 option didn't seem to work for me either. I guess there might be problems any time a string is passed to C++ in CODAL or an extension and not simply passed on. The example https://makecode.microbit.org/_UJTCHTRPc2Jf does work better than it appeared. CoolTerm sees the expected single byte characters sent from micro:bit, plus the padding spaces MakeCode adds, but MakeCode rejects it with console message "invalid utf8 serial data", and displays nothing in the console.
Arising from support ticket https://support.microbit.org/helpdesk/tickets/75447 (private)
The HTML display, Download and Copy are all truncated just before the accented character.
https://makecode.microbit.org/_U3ibMVU94dfH