Revisit UTF-8 validation #136

Open
essen opened this issue Feb 26, 2024 · 2 comments

Comments

@essen
Member

essen commented Feb 26, 2024

The code in cowlib/src/cow_ws.erl (lines 581 to 588 in cc04201)

%% Based on the Flexible and Economical UTF-8 Decoder algorithm by
%% Bjoern Hoehrmann <[email protected]> (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/).
%%
%% The original algorithm has been unrolled into all combinations of values for C and State
%% each with a clause. The common clauses were then grouped together.
%%
%% This function returns 0 on success, 1 on error, and 2..8 on incomplete data.
validate_utf8(<<>>, State) -> State;
was written a decade ago, and the VM has changed a lot since then. The JSON PR in OTP validates UTF-8 in a different way that may be faster: erlang/otp#8111
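To make the three outcomes described in the quoted comment concrete (success, error, incomplete data), here is a minimal sketch of a validator whose result can be carried from one fragment to the next. It is not the cow_ws code and does not use the DFA: it relies on Erlang's built-in `/utf8` binary matching and carries the pending bytes rather than a numeric state. The module name `utf8_chunks` and the functions `validate/2`, `scan/1`, `is_prefix/1` and `is_cont_tail/1` are hypothetical names chosen for illustration.

```erlang
-module(utf8_chunks).
-export([validate/2]).

%% Hypothetical sketch, NOT the cow_ws implementation.
%% Pending holds the bytes of a code point left unfinished by the
%% previous chunk (use <<>> for the first chunk).
-spec validate(binary(), binary()) -> ok | error | {incomplete, binary()}.
validate(Chunk, Pending) ->
    scan(<<Pending/binary, Chunk/binary>>).

%% Consume one valid code point at a time. The /utf8 match fails on
%% overlong encodings, surrogates and truncated sequences.
scan(<<>>) -> ok;
scan(<<_/utf8, Rest/binary>>) -> scan(Rest);
scan(Tail) when byte_size(Tail) < 4 ->
    %% Fewer than 4 bytes remain: either a valid but unfinished
    %% sequence, or garbage.
    case is_prefix(Tail) of
        true -> {incomplete, Tail};
        false -> error
    end;
scan(_) -> error.

%% True when the bytes are a proper prefix of a valid UTF-8 sequence.
is_prefix(<<C>>) when C >= 16#C2, C =< 16#F4 -> true;
is_prefix(<<16#E0, C>>) when C >= 16#A0, C =< 16#BF -> true;
is_prefix(<<16#ED, C>>) when C >= 16#80, C =< 16#9F -> true;
is_prefix(<<L, C>>) when L >= 16#E1, L =< 16#EF, L =/= 16#ED,
                         C >= 16#80, C =< 16#BF -> true;
is_prefix(<<16#F0, C, T/binary>>) when C >= 16#90, C =< 16#BF ->
    is_cont_tail(T);
is_prefix(<<16#F4, C, T/binary>>) when C >= 16#80, C =< 16#8F ->
    is_cont_tail(T);
is_prefix(<<L, C, T/binary>>) when L >= 16#F1, L =< 16#F3,
                                   C >= 16#80, C =< 16#BF ->
    is_cont_tail(T);
is_prefix(_) -> false.

%% Zero or one continuation byte after the first two bytes of a
%% four-byte sequence.
is_cont_tail(<<>>) -> true;
is_cont_tail(<<C>>) when C >= 16#80, C =< 16#BF -> true;
is_cont_tail(_) -> false.
```

A caller would pass the `Tail` from an `{incomplete, Tail}` result as `Pending` for the next fragment, and treat `{incomplete, _}` on the final fragment as an error. Carrying bytes instead of a small integer state costs an extra binary concatenation per fragment, which is one reason a DFA state is attractive.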

@codeadict

For extra info, there is also this discussion about adding a C BIF to the BEAM using this algorithm: https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/

@essen
Member Author

essen commented Mar 8, 2024

OK, I took a long look at all the UTF-8 validation discussion that I had missed (including erlang/otp#6576, which is fairly interesting). Thank you.

As far as SIMD goes, I am open to believing it could be a better alternative, but that remains to be proven for use within Erlang. Note that some strings can be overly long, so the implementation would need to account for that. This might make it less effective than initially hoped.

The Elixir PR adding a fast_ascii option sounds good, but as far as Cowboy is concerned, users who want to skip this validation (because it will be done when decoding JSON, for example) should use a binary frame. Users who do use text frames are more likely to send more than just ASCII. At least that's what I've experienced.
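For readers unfamiliar with the idea, an ASCII fast path can be sketched as follows. This is only an illustration of the general technique, not the Elixir PR's code and not anything in Cowboy. The module name `utf8_fast_ascii` and the function `is_valid/1` are hypothetical. The sketch reads 56 bits at a time on the assumption that this keeps the integer a small (non-bignum) value on 64-bit VMs.

```erlang
-module(utf8_fast_ascii).
-export([is_valid/1]).

%% Hypothetical sketch of an ASCII fast path in front of a generic
%% UTF-8 check. Not the Elixir or Cowboy implementation.
-spec is_valid(binary()) -> boolean().
%% Fast path: take 7 bytes at once. If no byte has its high bit set,
%% all 7 are ASCII and can be skipped together.
is_valid(<<W:56, Rest/binary>>) when W band 16#80808080808080 =:= 0 ->
    is_valid(Rest);
%% Slow path: consume exactly one valid code point.
is_valid(<<_/utf8, Rest/binary>>) ->
    is_valid(Rest);
is_valid(<<>>) ->
    true;
%% Anything else is an invalid or truncated sequence.
is_valid(_) ->
    false.
```

For example, `utf8_fast_ascii:is_valid(<<"plain ascii text">>)` returns `true` through the fast path, while `utf8_fast_ascii:is_valid(<<16#C3>>)` returns `false`. The fast path only pays off when long ASCII runs are common, which matches the point above: text frames that carry mostly non-ASCII data would spend most of their time in the slow clause.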

So for now the ticket is about refreshing the algorithm implementation rather than switching to a different algorithm. But it's possible that I missed something; I haven't actually started working on this and it is not yet a priority.
