Performance considerations when generating parser for a long list of words #85

venkatd · 2020-10-30T16:57:04Z

venkatd
Oct 30, 2020

I have started prototyping a feature that parses natural time such as "tomorrow at 3pm". Thanks to petit parser, this feature is simpler than it would have been otherwise :)

Now I'm looking to add timezone/location support such as "tomorrow at 3pm california time".

Are there any performance considerations if we are looking at a list of ~1000-2000 locations? I'm guessing we would have something like:
final location = locations.map((l) => stringIgnoringCase(l.name)).toChoiceParser();

Are parsers optimized for this sort of matching? There'll be some sort of compact representation? We will be running this parser on every keystroke in a TextField but the text field content we parse won't get that large (probably no more than 500 characters).

Forgive my ignorance on how parsers work because petit parser is my first time using a parser in production!

Side note: It seems like I'll need to sort the words by length descending to ensure something like "india" doesn't prematurely match on "indiana"?
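
The ordering concern in the side note can be illustrated without any parser library. The sketch below is plain Dart; `firstPrefixMatch` and `longestFirst` are hypothetical helper names, and the loop only imitates how an ordered choice tries its alternatives one after another.

```dart
// Returns the first candidate that is a case-insensitive prefix of [input],
// imitating an ordered choice that stops at the first alternative that fits.
String? firstPrefixMatch(String input, List<String> candidates) {
  final lowered = input.toLowerCase();
  for (final candidate in candidates) {
    if (lowered.startsWith(candidate.toLowerCase())) return candidate;
  }
  return null;
}

// Copies [names] and sorts the copy longest-first, so that no name can be
// shadowed by another name that is one of its own prefixes.
List<String> longestFirst(Iterable<String> names) =>
    names.toList()..sort((a, b) => b.length.compareTo(a.length));
```

With the list in the order `['india', 'indiana']`, the input "Indiana time" stops at "india"; after `longestFirst`, "indiana" is tried first and wins.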


renggli · 2020-10-30T20:38:21Z

renggli
Oct 30, 2020
Maintainer

The code is not automatically optimized to match a large number of choices as described in the example. PetitParser will literally do a case-insensitive prefix match for every single location in your list. Though I don't think this is necessarily a problem and I wouldn't worry about optimizing at this point. Get it working correctly first, and then — if necessary — profile and improve the performance as needed. I see various ways of effectively optimizing:

Parse the location like a generic variable name and do a separate lookup in the parse action afterwards (might not work without other tricks, if locations are not clearly separated),
Try to build a smarter parse graph (i.e. a prefix tree),
Create a custom parser sub-class, or
Do a combination of the above 3.
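
As a rough illustration of the first option, the sketch below reads one generic word and only then looks it up in a map. It is plain Dart rather than PetitParser code; `Location`, `wordPattern`, `buildIndex`, and `lookupAt` are hypothetical names, and the time zone strings are sample data. As noted above, this breaks down for names that are not clearly separated (for example names containing spaces) unless other tricks are added.

```dart
// Hypothetical record type for a known location.
class Location {
  const Location(this.name, this.timeZone);
  final String name;
  final String timeZone;
}

// Matches a run of ASCII letters: the "generic variable name" step.
final wordPattern = RegExp(r'[A-Za-z]+');

// Builds the index once, keyed by lower-cased name, so that each lookup is a
// single hash probe instead of a scan over every location.
Map<String, Location> buildIndex(Iterable<Location> locations) =>
    {for (final l in locations) l.name.toLowerCase(): l};

// Reads one word starting at [position] and looks it up. Returns null when
// there is no word at that position or the word is not a known location.
Location? lookupAt(String input, int position, Map<String, Location> index) {
  final match = wordPattern.matchAsPrefix(input, position);
  if (match == null) return null;
  return index[match.group(0)!.toLowerCase()];
}
```

The cost per attempt then depends on the length of the word read, not on the number of locations in the list.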


venkatd · 2020-10-30T20:55:28Z

venkatd
Oct 30, 2020
Author

Thanks for the ideas. I might later give a combination of 2+3 a try even if just as an exercise.

I'm seeing https://github.com/petitparser/dart-petitparser/blob/master/petitparser/lib/src/parser/predicate/predicate.dart

Would that be a good example parser subclass with similar goals?

And how do parseOn and fastParseOn get used? I am having trouble understanding the difference. In a previous issue you said:

"fastParseOn is an optimized version of parseOn, that does not produce a result, but can provide a significant speed boost in some cases"

But then why is parseOn implemented for the predicate parser if we already have a fastParseOn implementation? When would the regular parseOn be used over fastParseOn?


renggli · 2020-10-30T21:04:58Z

renggli
Oct 30, 2020
Maintainer

Yes, the PredicateParser is a good example. You will not need to bother with implementing fastParseOn; it is an optimization that is only used when no parse result is requested (i.e. within a flatten parser). In your case you want to return the location, so you can put all the logic in parseOn.
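
The distinction can be pictured with two stand-in functions. These are not PetitParser's classes or signatures: `Outcome`, `parseDigits`, and `skipDigits` are hypothetical names, and the sketch only shows the general idea of a result-producing variant next to a position-only variant.

```dart
// Stand-in for a parse result: the parsed value plus the position after it.
class Outcome {
  const Outcome(this.value, this.position);
  final String value;
  final int position;
}

// Position-only variant: returns the index just past the digits, or -1 if
// there is no digit at [position]. It never builds a value.
int skipDigits(String buffer, int position) {
  var i = position;
  while (i < buffer.length) {
    final unit = buffer.codeUnitAt(i);
    if (unit < 0x30 || unit > 0x39) break; // Not in '0'..'9'.
    i++;
  }
  return i > position ? i : -1;
}

// Result-producing variant: does the same scan, then allocates the matched
// digits as a new string and wraps them in an Outcome.
Outcome? parseDigits(String buffer, int position) {
  final end = skipDigits(buffer, position);
  return end < 0 ? null : Outcome(buffer.substring(position, end), end);
}
```

A caller that only needs to know where the match ends can use the position-only variant and skip the allocation; a caller that needs the value, such as a parser returning a location, needs the result-producing one.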


venkatd · 2020-10-30T21:27:43Z

venkatd
Oct 30, 2020
Author

Ok great!

Sounds like a plan of action (if I need to optimize) would be something like:

class WordListParser extends Parser<String> where I can pass in a list of words
store the words in an optimized data structure like a prefix tree

Then in parseOn:

Iterate character by character starting from context.position up to context.buffer.length
Use the prefix tree to determine if there is a match yet or if we hit a dead end. Run context.success or context.failure accordingly
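
A minimal sketch of the data-structure half of this plan, in plain Dart without the parser wrapper. `TrieNode`, `WordTrie`, and `longestMatchAt` are hypothetical names, and case folding is limited to ASCII letters as a simplifying assumption.

```dart
class TrieNode {
  // Child nodes keyed by case-folded UTF-16 code unit.
  final Map<int, TrieNode> children = {};

  // The stored word that ends at this node, if any.
  String? word;
}

class WordTrie {
  WordTrie(Iterable<String> words) {
    for (final word in words) {
      var node = _root;
      for (final unit in word.codeUnits) {
        node = node.children.putIfAbsent(_fold(unit), () => TrieNode());
      }
      node.word = word;
    }
  }

  final TrieNode _root = TrieNode();

  // Maps ASCII 'A'..'Z' to 'a'..'z'; other code units are left unchanged.
  static int _fold(int unit) =>
      (unit >= 0x41 && unit <= 0x5A) ? unit + 0x20 : unit;

  // Walks [buffer] from [start] one code unit at a time and returns the
  // longest stored word that begins there, or null if the walk reaches a
  // dead end before any stored word has ended.
  String? longestMatchAt(String buffer, int start) {
    var node = _root;
    String? best;
    for (var i = start; i < buffer.length; i++) {
      final next = node.children[_fold(buffer.codeUnitAt(i))];
      if (next == null) break; // Dead end: no stored word continues this way.
      node = next;
      if (node.word != null) best = node.word;
    }
    return best;
  }
}
```

Inside parseOn, a non-null match would presumably become context.success(match, start + match.length) and a null one context.failure(message). Because the walk keeps the longest word seen so far, "indiana" beats "india" without sorting the list by length.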


renggli · 2020-10-30T22:17:04Z

renggli
Oct 30, 2020
Maintainer

Yes, that sounds like a good plan.


venkatd · 2020-10-31T13:04:34Z

venkatd
Oct 31, 2020
Author

I implemented an any_string.dart as an exercise. Sharing here for future reference.

It's using a naive algorithm of iterating word by word. But if I want to optimize further, I could replace the _sortedByLengthDescThenName with a more efficient data structure like _buildPrefixTree as suggested.

import 'package:petitparser/petitparser.dart';
import 'package:collection/collection.dart' show equalsIgnoreAsciiCase;

Parser<String> anyStringIgnoringCase(Iterable<String> strings) =>
    AnyStringParser(
      strings,
      equalsIgnoreAsciiCase,
      'words expected',
    );

Parser<String> anyString(Iterable<String> strings) =>
    AnyStringParser(strings, (a, b) => a == b, 'words expected');

typedef StringEquals = bool Function(String a, String b);

class AnyStringParser extends Parser<String> {
  AnyStringParser(
    Iterable<String> words,
    this.equals,
    this.message,
  )   : assert(words != null, 'words must not be null'),
        assert(equals != null, 'equals must not be null'),
        assert(message != null, 'message must not be null'),
        words = _sortedByLengthDescThenName(words);

  final List<String> words;
  final String message;
  final StringEquals equals;

  @override
  Result<String> parseOn(Context context) {
    final start = context.position;
    final buffer = context.buffer;
    final bufferLength = context.buffer.length;

    for (final w in words) {
      if (start + w.length > bufferLength) continue;
      // Take the candidate slice once and reuse it for the comparison and the result.
      final candidate = buffer.substring(start, start + w.length);
      if (equals(candidate, w)) {
        return context.success(candidate, start + w.length);
      }
    }
    return context.failure(message);
  }

  @override
  String toString() => '${super.toString()}[$message]';

  @override
  AnyStringParser copy() => AnyStringParser(words, equals, message);

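  @override
  bool hasEqualProperties(AnyStringParser other) =>
      super.hasEqualProperties(other) &&
      // Lists compare by identity in Dart, so compare the contents instead.
      words.length == other.words.length &&
      words.asMap().entries.every((e) => e.value == other.words[e.key]) &&
      equals == other.equals &&
      message == other.message;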
}

List<String> _sortedByLengthDescThenName(Iterable<String> strings) {
  return strings.toList()
    ..sort((a, b) {
      return a.length == b.length
          ? a.compareTo(b)
          : b.length.compareTo(a.length);
    });
}

Feedback appreciated :)


renggli · 2020-10-31T15:07:44Z

renggli
Oct 31, 2020
Maintainer

Looks great! Probably using compareAsciiLowerCase function from package:collection would be more efficient than repeatedly converting the strings to lower-case?


venkatd · 2020-10-31T17:54:59Z

venkatd
Oct 31, 2020
Author

Good point, that way we don't keep allocating strings.

It looks like compareAsciiLowerCase doesn't return 0 when two strings are equal only in a case-insensitive sense; it breaks the tie by case to keep the sort order consistent.

Comment from the code:

/// If two strings differ only on the case of ASCII letters, the one with the
/// capital letter at the first difference will compare as less than the other
/// string. This tie-breaking ensures that the comparison is a total ordering
/// on strings and is compatible with equality.

But I found there is an equalsIgnoreAsciiCase in the same lib that does the trick. Updated my code example above.

Thanks!
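
The allocation point could be taken one step further by comparing against the buffer in place, so that no substring is created for each candidate word. The following is a sketch in plain Dart under that assumption; `foldAscii` and `matchesAtIgnoreAsciiCase` are hypothetical helpers, not functions from package:collection.

```dart
// Lower-cases a single ASCII letter code unit; anything else passes through.
int foldAscii(int unit) => (unit >= 0x41 && unit <= 0x5A) ? unit + 0x20 : unit;

// Returns true if [word] occurs in [buffer] at [start], ignoring ASCII case.
// Code units are compared in place, so no substring is allocated.
bool matchesAtIgnoreAsciiCase(String buffer, int start, String word) {
  if (start + word.length > buffer.length) return false;
  for (var i = 0; i < word.length; i++) {
    final actual = foldAscii(buffer.codeUnitAt(start + i));
    final expected = foldAscii(word.codeUnitAt(i));
    if (actual != expected) return false;
  }
  return true;
}
```

A word loop like the one in parseOn could call this helper for each candidate and take a substring only for the single word that matches.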
