-
Notifications
You must be signed in to change notification settings - Fork 347
Optimize 'json_parse_string' using ARM Neon. #816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ext/json/ext/parser/parser.c
Outdated
// Benchmarking this on an M1 Macbook Air shows that this is faster than | ||
// using the neon_match_mask function (see the generator.c code) and bit | ||
// operations to find the first match. | ||
if (vmaxvq_u8(needs_escape)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is why you didn't share the between the two? Because I'd much prefer share it for maintainability reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I went with this because it was faster.. at least with the activitypub.json
tests. Testing a bit more tonight.. it seems like it's dataset dependent. Additionally, I did go a bit farther than this by also using the same neon_next_match
and keeping track of the mask
in the state
variable. That didn't perform as well and I decided to keep it simple and when with the original code in the PR.
A bit of a middle-ground is the code below:
uint64_t mask = neon_match_mask(needs_escape);
if (mask) {
state->cursor += trailing_zeros64(mask) >> 2;
return 1;
}
I get:
== Parsing activitypub.json (58160 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
after 912.000 i/100ms
Calculating -------------------------------------
after 9.415k (± 0.6%) i/s (106.21 μs/i) - 47.424k in 5.037101s
Comparison:
before: 8419.2 i/s
after: 9415.2 i/s - 1.12x faster
== Parsing twitter.json (567916 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
after 91.000 i/100ms
Calculating -------------------------------------
after 917.805 (± 0.8%) i/s (1.09 ms/i) - 4.641k in 5.056969s
Comparison:
before: 808.1 i/s
after: 917.8 i/s - 1.14x faster
== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
after 41.000 i/100ms
Calculating -------------------------------------
after 410.991 (± 1.5%) i/s (2.43 ms/i) - 2.091k in 5.089051s
Comparison:
before: 418.4 i/s
after: 411.0 i/s - same-ish: difference falls within error
== Parsing ohai.json (32444 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
after 1.063k i/100ms
Calculating -------------------------------------
after 10.708k (± 0.7%) i/s (93.38 μs/i) - 54.213k in 5.062878s
Comparison:
before: 10124.3 i/s
after: 10708.4 i/s - 1.06x faster
Sometimes the ohai.json
isn't any faster at all.. it depends on the run, though it's never slower. I'm happy to use this as it's more consistent with the generator code.
As of commit
|
ext/json/ext/simd/simd.h
Outdated
#if (defined(__GNUC__ ) || defined(__clang__)) | ||
#define FORCE_INLINE __attribute__((always_inline)) | ||
#else | ||
#define FORCE_INLINE | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This macro is now defined in 3 files. If we move this above the #ifdef JSON_ENABLE_SIMD
so it's always available we can remove this from the generator/generator.c
and parser/parser.c
.
SSE2 Support. I can merge that to include it in this PR or I can create a separate PR after this is merged if you prefer. The benchmarks look pretty good. |
You can have both in the same PR. What I'd like to figure out, is a way to share the search function, because the logic is exactly the same, and the code non-trivial. But that's tricky because the implementation in |
I think I might know a way to do that, though I'm not entirely sure what the API will be. However, the idea is to create some sort of
Then we can refactor the search code to use the iterator.
In the parser there might be an implementation something like:
As I type this I'm not entirely sure that's the correct level of abstraction, but hopefully it gets the idea across. We can hopefully have the same interface but the implementations within the Parser and Generator can depend on their own local data structures. |
I think so. What worry me here is that all these function pointer will prevent inlining, but perhaps compilers are smarter than I think. What I had in mind was to move the Fbuffer outside struct generator_search_state {
search_state se;
Fbuffer buffer;
} Which would allow to at least share the parts that don't touch the buffer. But I never got the time+energy to attempt this yet. |
I tried this tonight and both gcc and clang didn't inline the function pointers. I went another route with direct function calls with function pointer arguments. This seems to work as expected though it's not complete. I need to update the generator a bit to handle the case when there are fewer than 16 bytes left but enough that it makes sense to still use SIMD. I don't know if I love it.. but it does seem to work. https://github.com/samyron/json/pull/6/files // in simd.h
// Note: I don't like the name of this function.
static inline int FORCE_INLINE neon_vector_scan(void *state, int (*has_next_vector)(void *, size_t), const char *(*ptr)(void *),
void (*advance_by)(void *, size_t), void (*set_match_mask)(void *, uint64_t))
{
while (has_next_vector(state, sizeof(uint8x16_t))) {
uint8x16_t chunk = vld1q_u8((const unsigned char *)ptr(state));
// Trick: c < 32 || c == 34 can be factored as c ^ 2 < 33
// https://lemire.me/blog/2025/04/13/detect-control-characters-quotes-and-backslashes-efficiently-using-swar/
const uint8x16_t too_low_or_dbl_quote = vcltq_u8(veorq_u8(chunk, vdupq_n_u8(2)), vdupq_n_u8(33));
uint8x16_t has_backslash = vceqq_u8(chunk, vdupq_n_u8('\\'));
uint8x16_t needs_escape = vorrq_u8(too_low_or_dbl_quote, has_backslash);
uint64_t mask = neon_match_mask(needs_escape);
if (mask) {
set_match_mask(state, mask);
return 1;
}
advance_by(state, sizeof(uint8x16_t));
}
return 0;
}
// in parser.c
static inline FORCE_INLINE int has_next_vector(void *state, size_t width)
{
JSON_ParserState *s = state;
return s->cursor + width <= s->end;
}
static inline FORCE_INLINE const char *ptr(void *state)
{
return ((JSON_ParserState *) state)->cursor;
}
static inline FORCE_INLINE void advance_by(void *state, size_t count)
{
((JSON_ParserState *) state)->cursor += count;
}
static inline FORCE_INLINE void set_match_mask(void *state, uint64_t mask)
{
advance_by(state, trailing_zeros64(mask) >> 2);
}
static inline FORCE_INLINE bool string_scan_neon(JSON_ParserState *state)
{
if (neon_vector_scan(state, has_next_vector, ptr, advance_by, set_match_mask)) {
return true;
}
if (state->cursor < state->end) {
return string_scan_basic(state);
}
return 0;
}
// in generator.c
static inline FORCE_INLINE int has_next_vector(void *state, size_t width)
{
search_state *search = (search_state *) state;
return search->ptr + width <= search->end;
}
static inline FORCE_INLINE const char *ptr(void *state)
{
search_state *search = (search_state *) state;
return search->ptr;
}
static inline FORCE_INLINE void advance_by(void *state, size_t count)
{
((search_state *) state)->ptr += count;
}
static inline FORCE_INLINE void set_match_mask(void *state, uint64_t mask)
{
((search_state *) state)->matches_mask = mask;
}
static inline unsigned char search_escape_basic_neon(search_state *search)
{
if (RB_UNLIKELY(search->has_matches)) {
// There are more matches if search->matches_mask > 0.
if (search->matches_mask > 0) {
return neon_next_match(search);
} else {
// neon_next_match will only advance search->ptr up to the last matching character.
// Skip over any characters in the last chunk that occur after the last match.
search->has_matches = false;
search->ptr = search->chunk_end;
}
}
if (neon_vector_scan(search, has_next_vector, ptr, advance_by, set_match_mask)) {
search->has_matches = true;
search->chunk_base = search->ptr;
search->chunk_end = search->ptr + sizeof(uint8x16_t);
return neon_next_match(search);
}
// TODO HANDLE THIS BETTER
// There are fewer than 16 bytes left.
unsigned long remaining = (search->end - search->ptr);
if (remaining >= SIMD_MINIMUM_THRESHOLD) {
char *s = copy_remaining_bytes(search, sizeof(uint8x16_t), remaining);
uint64_t mask = neon_rules_update(s);
if (!mask) {
// Nothing to escape, ensure search_flush doesn't do anything by setting
// search->cursor to search->ptr.
fbuffer_consumed(search->buffer, remaining);
search->ptr = search->end;
search->cursor = search->end;
return 0;
}
search->matches_mask = mask;
search->has_matches = true;
search->chunk_end = search->end;
search->chunk_base = search->ptr;
return neon_next_match(search);
}
if (search->ptr < search->end) {
return search_escape_basic(search);
}
search_flush(search);
return 0;
} |
I thought my original attempt at using structures made it unnecessarily hard on the compiler. I was effectively using inheritance to define the base structure with the functions then in the parser and generator child struct's to encapsulate the state. With the functions I mentioned above, I tried the code below in order to simplify passing all of the function pointers to
// simd.h
typedef struct _string_scan_iter {
int (*has_next_vector)(void *, size_t);
const char *(*ptr)(void *);
void (*advance_by)(void *, size_t);
void (*set_match_mask)(void *, uint64_t);
} NeonStringScanIterator;
// Renamed neon_vector_scan to string_scan_simd_neon
static inline int FORCE_INLINE string_scan_simd_neon(void *state, NeonStringScanIterator *iter)
{
...
}
// Using this in the parser and the generator:
NeonStringScanIterator iter = {
.has_next_vector = has_next_vector,
.ptr = ptr,
.advance_by = advance_by,
.set_match_mask = set_match_mask
};
if (string_scan_simd_neon(search, &iter)) {
...
} |
A bit of a reset. Going back to the code in this branch, For Neon specifically, since we compile the support directly we can actually avoid using function pointers entirely. With the assumption that x86-64 chips may not have SSE2, then we'll still need some sort of conditional or function pointer along with the runtime ISA detection to choose the scalar or vectorized code. |
All x86-64 chips have SSE2; it comes as part of the base ISA. |
The SIMD code is now be shared by the parser and generator. One major difference is the generator uses function pointers to determine which SIMD implementation to use. In the parser, the Neon code is compiled directly into the |
void (*set_match_mask)(void *, uint64_t); | ||
} NeonStringScanIterator; | ||
|
||
static inline FORCE_INLINE uint64_t compute_chunk_mask_neon(const char *ptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not set on this function name.
ext/json/ext/simd/simd.h
Outdated
return neon_match_mask(needs_escape); | ||
} | ||
|
||
static inline FORCE_INLINE int string_scan_simd_neon(NeonStringScanIterator *iter, void *state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not set on this function name.
I'll try to find some time to attempt a solution without function pointers to see if it fare any better. |
I should note that on |
Real world parsing benchmarks on x86-64 using the SSE2 code on a
|
There are no more function pointers / iterator structure. The code now uses pointers as output variables to modify the necessary state and pass the match |
72f5448
to
87a3a7f
Compare
This PR uses ARM Neon instructions to optimize the
json_parse_string
function in the parser.I did refactor the
simd.h
from the generator toext/json/ext/simd/simd.h
and reference it from both the generator and the parser.Similar to the generator, this can be disabled using the
--disable-parser-use-simd
flag.Benchmarks
Machine: Macbook Air M1
master
branch at commit41d89748fab7343bfd59e55bc171b97d61ab2bb8
neon-simd-parser
commite884e5025613db8bf3259286c17ecfa9f48aa537