Skip to content

Conversation

kenballus
Copy link

Fixes #47.

Contribution

@ioquatix
Copy link
Member

ioquatix commented Sep 9, 2025

This looks good to me, but I want to draw your attention to https://github.com/socketry/protocol-http1/blob/main/test/protocol/http1/parser.rb which is testing these patterns to ensure they match in linear time.

@kenballus
Copy link
Author

This looks good to me, but I want to draw your attention to https://github.com/socketry/protocol-http1/blob/main/test/protocol/http1/parser.rb which is testing these patterns to ensure they match in linear time.

Interesting :)

Assuming the test works, I'm not sure why it's passing. Worst-case on this regex should be $\Theta(n^2)$ with a naive algorithm. Maybe the regex engine here is using the \z anchor in the pattern to optimize this case?

@kenballus
Copy link
Author

Okay, clearly I don't understand regexes as well as I thought I did :)

Regexp.linear_time? definitely returns true for this pattern, but I remain unsure as to why.

@ioquatix
Copy link
Member

ioquatix commented Sep 9, 2025

Modern Ruby implementations use memoization to avoid exponential backtracking (trading memory for computation), so it can depend on the implementation, which is why linear_time? is a function. It may be that some Ruby implementations return false for the same regex. However, one of the reasons why I changed this code in the first place was actually because I found the regex was not linear.

@kenballus
Copy link
Author

Modern Ruby implementations use memoization to avoid exponential backtracking (trading memory for computation), so it can depend on the implementation, which is why linear_time? is a function.

I ended up going down that rabbithole last night and reading the paper referenced in the Ruby docs. Pretty cool!

However, one of the reasons why I changed this code in the first place was actually because I found the regex was not linear.

As it is, the current pattern clearly needs to change, given that the final OWS is useless (the preceding [^\r\n\0] always eats its lunch).

Anyway, the only pattern I can think of that doesn't use non-greedy matching would be something like this (imo, gross):

FIELD_VALUE = /|[^ \t\0\r\n]|[^ \t\0\r\n][^\0\r\n]*[^ \t\0\r\n]/.freeze

That is, 3 cases:

  • empty
  • a single non-whitespace character
  • a non-whitespace character, any number of potentially-whitespace characters, then another non-whitespace character

Would this be preferable?

@ioquatix
Copy link
Member

TBH, as long as 1inear_time? is true, I think it's okay, but it might be nice to evaluate the performance vs rstrip! before we make a final decision?

@kenballus
Copy link
Author

We want to strip only spaces and tabs, but rstrip will also strip vertical tab and form feed (and also CR, LF, and NUL, but these wouldn't match the regex anyway.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Whitespace is not stripped from the right side of header values, as the RFCs require
2 participants