[Bug] ps recognizes UTF8 as UTF16 #4697

Semnodime · 2024-11-03T00:53:41Z

Work environment

rizin 0.8.0 @ linux-x86-64
commit: 73d85d2

Expected behavior

Detect and display string (hex f0 9f 9f aa f0 9f 9f aa 00, decoded 🟪🟪) as UTF8

Actual behavior

The string is detected as UTF16BE (and it would be parsed incorrectly even if it actually were UTF16, but that is a separate bug).
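For illustration, the mismatch can be reproduced outside rizin. The sketch below uses only Python's standard codecs (not rizin's decoder) on the eight bytes from the report that precede the NUL terminator. It shows that they are well-formed UTF-8 for two purple squares, while the same bytes read as UTF-16BE yield four unrelated BMP code points.

```python
# Bytes from the report, up to (not including) the NUL terminator.
raw = bytes.fromhex("f09f9faaf09f9faa")

# Strict UTF-8 decoding: two 4-byte sequences, each encoding U+1F7EA.
as_utf8 = raw.decode("utf-8")

# The same bytes read as big-endian 16-bit units: F09F 9FAA F09F 9FAA.
# None of these units is a surrogate, so the decode "succeeds" with wrong text.
as_utf16be = raw.decode("utf-16-be")

utf8_points = [f"U+{ord(c):04X}" for c in as_utf8]
utf16be_points = [f"U+{ord(c):04X}" for c in as_utf16be]

print(utf8_points)
print(utf16be_points)
```

Because the UTF-16BE reading contains no surrogate halves, a decoder cannot reject it on structural grounds alone. This is presumably part of why a guess can go wrong here.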

Steps to reproduce the behavior

ELF AMD64

[0x0007ed51]> pxc
- offset -   0 1  2 3  4 5  6 7  8 9  A B  C D  E F  0123456789ABCDEF  comment
0x0007ed51  f09f 9faa f09f 9faa 0025 6868 75ef b88f  .........%hhu...  ; data.0007ed51  ; str.hhu
[0x0007ed51]> psj
{"string":"\u00f0\u009f\u009f\u00aa\u00f0\u009f\u009f\u00aa%\u0068\u0068\u0075\u00ef\u00b8\u008f\u00e2\u0083\u00a3\u0000\u0059\u006f\u0075\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0021\u0000\u002f\u0062\u0069\u006e\u002f\u0073\u0068\u0000Y\u006f\u0075\u0020\u006d\u0061\u0079\u0020\u0068\u0061\u0076\u0065\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0070\u0075\u007a\u007a\u006c\u0065\u0020\u0062\u0075\u0074\u0020\u0079\u006f\u0075\u0020\u0064\u0069\u0064\u0020\u006e\u006f\u0074\u0020\u0073\u006f\u006c\u0076\u0065\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0020\u003b\u0029","offset":519505,"section":".rodata","length":122,"type":"utf16be"}
[0x0007ed51]> ps+j
{"string":"\u009f\u009f\u00aa\u00f0\u009f\u009f\u00aa\u0000\u0025\u0068\u0068\u0075\u00ef\u00b8\u008f\u00e2\u0083\u00a3Y\u006f\u0075\u0020\u0073\u006f\u006c\u0076\u0065\u0064\u0020\u0074\u0068\u0065\u0020\u0063\u0068\u0061\u006c\u006c\u0065\u006e\u0067\u0065\u0021/\u0062\u0069\u006e\u002f\u0073\u0068","offset":519505,"section":".rodata","length":50,"type":"utf16be"}
[0x0007ed51]> ps
龪龪%桨痯뢏ꌀ奯甠獯汶敤⁴桥\xe2\x81\xa3桡汬敮来℀⽢楮⽳栀Y潵\xe2\x81\xad慹\xe2\x81\xa8慶攠獯汶敤⁴桥⁰畺穬攠扵琠祯甠摩搠湯琠獯汶攠瑨攠捨慬汥湧攠㬩
[0x0007ed51]> ps+
龟꫰龟ꨀ╨桵迢莣Y潵\xe2\x81\xb3潬癥搠瑨攠捨慬汥湧攡/扩港獨


wargio · 2024-11-03T11:53:57Z

I believe this is due to the encoding guess. You can enforce UTF-8 by setting str.search.encoding=utf8:

[0x00000000]> e str.search.encoding
guess
[0x00000000]> e str.search.encoding=?
ascii
8bit
utf8
utf16le
utf32le
utf16be
utf32be
guess

wargio · 2024-11-03T11:55:06Z

Also, since those chars are emoji, I strongly suspect we do not handle them correctly when guessing.

Rot127 · 2025-01-24T06:35:31Z

The string detection metrics are a little off in general.
Some of the metrics used by the functions in str_search.c are not described anywhere, and some of them are definitely not correct (although they probably work for the context they are in).
And rz_str_guess_encoding_from_buffer doesn't check for ibm037 and non-Unicode encodings, and it has the problem mentioned above.
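As a rough illustration of one possible ordering for a guesser, the sketch below validates the NUL-terminated prefix as strict UTF-8 before any 16-bit or 32-bit heuristic is considered. This is not how rz_str_guess_encoding_from_buffer works. The function name `guess_encoding` and the `"other"` fallback label are hypothetical, and a real implementation would still need the UTF-16/UTF-32/EBCDIC checks that are omitted here.

```python
def guess_encoding(buf: bytes) -> str:
    """Hypothetical guesser: try ASCII, then strict UTF-8, then give up.

    Only the bytes before the first NUL are inspected.
    """
    head = buf.split(b"\x00", 1)[0]
    if not head:
        # Empty prefix (e.g. a leading NUL): nothing to validate here.
        return "other"
    try:
        head.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Python's strict decoder rejects overlong forms and surrogates,
        # so random binary data rarely passes this check.
        head.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        pass
    # Placeholder for the omitted UTF-16/UTF-32/EBCDIC heuristics.
    return "other"


# First line of the hexdump from the report.
sample = bytes.fromhex("f09f9faaf09f9faa0025686875efb88f")
print(guess_encoding(sample))
```

The design choice is that multi-byte UTF-8 has a rigid lead/continuation structure, so a successful strict decode of non-ASCII data is strong evidence, whereas almost any even-length byte run is decodable as UTF-16.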

Rot127 · 2025-01-27T16:53:21Z

To add to this: the problem of string encoding detection is also a nice fit for the knowledge base, since different encodings have overlapping characters. Being able to seamlessly switch, define the expected encoding once, or detect the expected encoding according to some statistics would be nice to have. But this in itself is a single module on top of the knowledge base, I think.

notxvilka modified the milestones: 0.8.0, 0.9.0 Jan 20, 2025

Rot127 added the bug label Jan 24, 2025

This was referenced Feb 1, 2025

Update Unicode tables. #4872

Closed

Unicode/EBCDIC decode fixes and validator functions #4874

Open
