Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove String#casecmp part #123

Open
AlexWayfer opened this issue Apr 14, 2017 · 6 comments
Open

Remove String#casecmp part #123

AlexWayfer opened this issue Apr 14, 2017 · 6 comments

Comments

@AlexWayfer
Copy link

The String#casecmp method does not work with Unicode (even in Ruby 2.4.1), this is written in the documentation.

Example:

'Привет'.casecmp('привет') # => -1

There is a method String#casecmp?, which works with Unicode.

Example:

'Привет'.casecmp?('привет') # => true

But String.casecmp? is slower:

Warming up --------------------------------------
String#downcase + ==   233.440k i/100ms
      String#casecmp   274.247k i/100ms
     String#casecmp?   219.906k i/100ms
Calculating -------------------------------------
String#downcase + ==      5.746M (± 1.7%) i/s -     28.947M in   5.039252s
      String#casecmp      6.942M (± 1.8%) i/s -     34.829M in   5.019073s
     String#casecmp?      4.517M (± 2.6%) i/s -     22.650M in   5.017864s

Comparison:
      String#casecmp:  6941676.9 i/s
String#downcase + ==:  5745893.8 i/s - 1.21x  slower
     String#casecmp?:  4517314.3 i/s - 1.54x  slower
Code

require 'benchmark/ips'

SLUG = 'ABCD'

def slow
  SLUG.downcase == 'abcd'
end

def fast
  SLUG.casecmp('abcd') == 0
end

def another
  SLUG.casecmp?('abcd')
end

Benchmark.ips do |x|
  x.report('String#downcase + ==') { slow }
  x.report('String#casecmp')       { fast }
  x.report('String#casecmp?')      { another }
  x.compare!
end

So, String#downcase + == is good compromise.

@kbrock
Copy link
Contributor

kbrock commented Apr 27, 2018

maybe a compromise here is to have String#casecmp (no unicode support)

so we have the comparison, but the problem is front and center?

@AlexWayfer
Copy link
Author

AlexWayfer commented Apr 27, 2018

maybe a compromise here is to have String#casecmp (no unicode support)

so we have the comparison, but the problem is front and center?

What? I think guides should promotes international (Unicode) ways, not ASCII-only (read as English-only).

@nirvdrum
Copy link
Collaborator

Many use cases are ASCII-only. I think it's still valid to retain benchmarks for those situations. Ruby even provides String#ascii_only? to help guard such cases.

As an addendum, every Ruby implementation optimizes for ASCII. It's not a value statement. It's just that ASCII characters are byte wide, which makes optimizing for them very easy. UTF-16 and UTF-32 are also somewhat optimized given they have fixed width encodings. UTF-8 complicates things by virtue of variable width characters. As a simple test, benchmark String#length with a string with all ASCII characters and the same string with the last character replaced by a variable-width one; you'll see a massive difference.

I'm sensitive to ensuring code works with all character sets. But to the extent that this is a set of benchmarks that are context-sensitive, I think it makes sense to lay out that context. Really, all the String benchmarks should be enhanced to run with ASCII-only strings and strings with multi-byte characters in UTF-8.

@mateusdeap
Copy link
Member

I agree with @kbrock and @nirvdrum. There's no good reason to restrict things to UTF-8 because all things should be international. They should follow the actual use cases.

@AlexWayfer do you think you can edit your PR to keep the old casecmp entry and add the case that is compatible with UTF-8?

@AlexWayfer
Copy link
Author

There's no good reason to restrict things to UTF-8 because all things should be international.

The strings in Ruby are UTF-8 by default.

Do you know about Ruby's # encoding: UTF-8 magic comment? It's not here anymore because of the changed default.

I think it's a very good reason.

They should follow the actual use cases.

And rules about different non-clear use cases should not exist, at least be turned off by default (if we're talking about linting).

@AlexWayfer do you think you can edit your PR to keep the old casecmp entry and add the case that is compatible with UTF-8?

I can, but I don't want. For what? The suggested "by international community" rule doesn't work with a lot of lanuagues, hungreds probably. The significant majority of our world.

I don't want to suggest to use a method which doesn't work correctly in most languages (strings are about texts, texts are written in languages) for a little performance. And I don't want to receive such suggestions.

In the same project in one case you'll be warned and it'll work and you're fine. In another, with another language, probably dynamically (what is difficult to investigate and analyze) it will be broken (just false for no reason, when it's obviously true).

@mateusdeap doi you think you can create a PR to Ruby to fix the casecmp behaviour for Unicode (the Ruby's default)? Then I'm gonna edit my PR. Otherwise… it just doesn't work, "there is no good reason to suggest things to performance because some of them are in ASCII".

@nirvdrum
Copy link
Collaborator

nirvdrum commented Nov 5, 2022

To clear up confusion, String#casecmp works fine with UTF-8 strings. What it ignores is non-ASCII characters. You can check if a string consists only of ASCII data via String#ascii_only?. This boils down to Ruby's Unicode case mapping support. String#casecmp should arguably take the case mapping options that String#downcase or String#upcase do. It implicitly works like String#downcase(:ascii).

I just want to clear up that "ASCII" here doesn't literally mean the US-ASCII encoding, but rather UTF-8 with characters that are drawn from ASCII (as UTF-8 is ASCII-compatible).

I have a few other parting thoughts:

  • I don't think this repo should be a considered recommendation for anything. If you want to enforce one method over the other, Rubocop would be a better tool for doing that. As far as I can tell, this project's benchmarks are intended to show the performance difference between calls under different contexts. In most cases, it's a curiosity at best and maybe gives an indication to Ruby developers where things could be sped up.
  • On that note, there should be no difference in using String#casecmp? and String#casecmp with ASCII-only data, which would eliminate the need to choose. Currently, there is, so this benchmark has value in that sense. This would be a good target for any of the various Ruby JITs to optimize.
  • Like it or not, there's a huge body of ASCII text out there -- it's a big part of why UTF-8 even exists. CRuby embraces that and optimizes for ASCII-only text throughout most String operations. While I agree the default operation should be the one that works for all characters, both exist for different use cases. To the best of my knowledge, String#casecmp is not deprecated in Ruby.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants