Feature: named capture groups #206

ciaranmcnulty · 2023-01-25T19:18:54Z

I've been looking at how cucumber-expressions could be adopted in Behat.

One feature of our existing two pattern syntaxes (regex and turnip) is that they can name the arguments.

e.g.

@When /^I eat (?<count>[0-9]+) (?<fruit>.*)$/
@When I eat :count :fruit

This allows Behat to match arguments based on name as well as position, letting users transpose the argument order if necessary.

That can be useful if multiple patterns are attached to one step, something allowed in Behat:

/**
 * @When I eat :count :fruit
 * @When :count of the :fruit are eaten
 * @When I don't eat any :fruit
 */
public function myStepDef(string $fruit, int $count=0): void

The new feature would be for Cucumber Expressions to capture argument names:

Define a syntax for adding names to expressions (e.g. {fruit:string})
Retain that name in the generated regex as a named group
Attach the name to the matched Argument value objects that the parser outputs (with an accessor method)

Then it would be up to the Cucumber implementation to either use the names for matching to the step definition, or to ignore them and use the order directly as currently happens.

The text was updated successfully, but these errors were encountered:

mpkorstanje · 2023-01-25T22:30:26Z

If we decide to extend Cucumber expressions, which ever syntax we pick, we should take some care to ensure that we don't collide with existing usage. For this example, people may already use : in their cucumber expressions.

I can also image that Behat users would like to continue to use their turnip expressions. How would be facilitate this?

The one thing we can probably do without breaking things is make the TreeRegex capture group name aware.

ciaranmcnulty · 2023-01-26T16:38:27Z

I can also image that Behat users would like to continue to use their turnip expressions. How would be facilitate this?

This part's fine, they'd just import a different attribute (~= annotation in Java)

I think we can restrict it to attributes rather than the older comment-style annotations I showed above:

use Behat\Given as OldGiven;
use Cucumber\Given;

...

#[Given('I eat {string}')]
#[OldGiven('I eat :fruit')]

ciaranmcnulty · 2023-01-26T16:41:28Z

If we decide to extend Cucumber expressions, which ever syntax we pick, we should take some care to ensure that we don't collide with existing usage

Agreed, it doesn't look as if the grammar restricts it at all so it may be tricky to find a syntax

ciaranmcnulty · 2023-02-01T11:47:41Z

Thinking about it, the grammar doesn't restrict what chars are allowed but I bet if we fuzz the existing parser implementations a little we can find some chars that aren't currently allowed :)

My pref would be : because it's like turnip, and because PHP is a dynamic language we often don't need to cast stuff.

I'd quite like something along the lines of:

{type}:{label}   // does this collide with real-life usage?
{type}           // to support current usage, omitting the colon is allowed if there is no label
:{label}         // label-only so it stays as a string 
{}:{label}       // if the above is too BC-breaking we could allow empty type == string

Even if we break backwards compatibility we could do that in an Expressions major (depending on the impact that has across different package manager ecosystems)

mpkorstanje · 2023-02-01T15:37:03Z

I would prefer to keep all the syntax inside the {}. That way we would only limit the names of the parameters and not impact other parts of expressions. So:

{} // Anonymous, as is
{type}  // As is
{:label}  // Anonymous (.*) with a label
{type:label} // type and label

Now I really don't see a way to make this graceful, but if we release this with a feature toggle, we can also smooth out other migration problems i.e. I'd like to use this in Cucumber JVM but not enable this until the next major of Cucumber-JVM. Otherwise all future patches get stuck behind this breaking change.

ciaranmcnulty · 2023-02-02T10:31:57Z

I'd like to use this in Cucumber JVM

Nice :)

FYI The way we do it in behat is we match on name first and then on position, but there are some awkward edge cases

Given :X :Y :Z

function ($X, $Y, $Z) // as expected, matched on name
function ($a, $b, $c) // as expected, matched on position
function ($a, $X, $b) // would receive $a=Y $X=:X $b=:Z which can be counterintuitive with typos

(we also populate some arguments from a DI container)

We don't currently implement type-based matching though and I'm wondering if that'd be a nice feature to add, given Cucumber Expressions can capture value object types explicitly

ciaranmcnulty · 2023-02-02T10:39:59Z

@mpkorstanje a feature toggle is a GREAT idea to avoid the BC issue entirely

ciaranmcnulty · 2023-02-02T10:44:51Z

If it's feature-toggled perhaps I'll pilot this in the PHP implementation?

mpkorstanje · 2023-02-02T11:46:37Z

Given :X :Y :Z
function ($a, $X, $b) // would receive $a=Y $X=:X $b=:Z which can be counterintuitive with typos

Ah fair enough. I'll stick to positional until someone asks for it.

Anyway, what about Given {:A} {:A} {:A}? Is that an error in creating the cucumber expression?

ciaranmcnulty · 2023-02-02T12:27:52Z

I think that's {:label} // Anonymous (.*) with a label

mpkorstanje · 2023-02-05T22:59:28Z

Sure, but how would it or the equivalent turnip or regex with named groups map? How would a regex with named and unnamed groups match?

mattwynne · 2023-02-08T06:05:59Z

I just want to give this a big 👍

Seems like it could open up some interesting possibilities.

mladedav · 2024-03-01T15:06:57Z

I would really like to have this feature too. Having named arguments prevents many mistakes when just ordering is used, especially early on when the expressions change often.

As for having multiple definitions for the same name - I think that this should be an error. Regex from my experience also fails to compile when there are multiple capture groups with the same name.

And I also think that we cucumber could disallow mixing named and positional arguments. One expression would have to commit to one or the other. I personally don't see a scenario where that would cause issues.

Otherwise I am also in favor of the {type:label} grammar. Or {label:type} since that is closer to what I'm used to, but either is fine.

As an aside, what would have to happen now for this to move forward? Is there some formal RFC process or does this just need implementation?

jsa34 · 2024-10-08T05:59:04Z

Following a discussion about implementing cucumber expressions into the pytest-bdd implementation, we rely on having named parameters/args. For us, a format of {label:type} matches what existing parsers (cfparse/parse) in use by behave and pytest-bdd do, so would make the transition much smoother.

I am happy to implement this in Python as at the moment it would be a blocker without this feature after a discussion on Discord.

I don't want to just whack a solution in though if there is a due process and without advice from those who developed the expressions and have knowledge of the best solution.

I am also happy to draw up a draft PR to form the basis of a discussion if this helps.

The decisions that look like we need to decide that I can elicit so far:

Format of arg name and type in string
Whether or how to handle mixing named and anonymous args in the same string
How to specify named args without type (which can then be inferred or defaulted to str, this is most intuitive for duck typing languages)
Possible collision with existing formats (e.g. currently, if we specify an arg name without :, it handles this as a custom register, which means we couldn't use the current python format for named args without a type as this uses the same format of {label} )

jenisys · 2024-10-08T06:17:03Z

@jsa34
Note that @aslakhellesoy started by looking at cfparse/parse (and other similar approaches). During the evolution of what became cucumber-expressions the {name:type} approach was rejected AFAIK.

Instead of changing the design of cucumber-expressions (and its design choices), it may be better to support two step-matchers (or several ones): One is cucumber-expressions, another could be parse/cfparse and …

jsa34 · 2024-10-08T06:18:19Z

My response to these points from a python and personal position:

The common format for python formatted string and for existing parsers is {label:type} and would advocate for this format as users would probably expect this. It's also the order for type hints.
Perhaps we can start by not allowing mixing, and if a user has a mixture of named and anonymous args, we throw some sort of error and decide how these two strategies can be mixed as a next step (but for python, we wouldn't support both in the first instance, and existing impls would want to continue supporting existing anon args so stops the issue of breaking change. Implementations can also limit use of just anon or just named Args to maintain existing behaviour (i.e. if CucumberJVM user uses a mix, the cucumber expressions will throw an error to say this is not supported, and if they use just named Args, CucumberJVM can throw an error to say only anon/positional Args are supported... Whereas python would say the opposite: only named args are supported).
This could also be toggle-able instead as @mpkorstanje suggested: positional, named or both
Not sure about this!
Not sure about this!

jsa34 · 2024-10-08T06:22:22Z

@jsa34 Note that @aslakhellesoy started by looking at cfparse/parse (and other similar approaches). During the evolution of what became cucumber-expressions the {name:type} approach was rejected AFAIK.

Instead of changing the design of cucumber-expressions (and its design choices), it may be better to support two step-matchers (or several ones): One is cucumber-expressions, another could be parse/cfparse and …

This is helpful to know - thank you!

I don't think there is any harm in revisiting; things have moved on, I would argue, since the original discussion, as there is now an established cucumber expressions implementation and there is a need/desire for this feature judging by the pytest-bdd discussions and other members for others uses in this thread. We also shouldn't be limited because a historical decision was made - it's healthy to understand why it was rejected to re-evaluate, so any link or reference to why this was would be extremely helpful. Regardless, I think it's important to be able to reconsider all options.

I agree there could be multiple parsers, but the aim I always felt was for Cucumber Expressions to negate the need in most cases to dance between parsers that weren't necessarily specifically designed for gherkin language.

I also am conscious of consensus amongst languages as this is language-agnostic (i.e. the format would be the same for all languages), but also conscious that there is a strong existing python precedence for type against arg, especially in strings, and may make it frustrating and extremely counterintuitive if it broke that pattern.

We also have no current named args, so whatever choice it will change the design - that's a natural evolution, otherwise this feature will never get implemented and we might as well close anything that involves changing the design to add features

mattwynne · 2024-10-10T17:15:03Z

@jsa34 I think this tension between consensus and innovation is something we need to be aware of, and not let it hold us back.

One source of friction from aiming for consistent language implementations is capacity - if you drive this forward for python, it applies pressure to implement the same behaviour in other language implementations.

I would encourage you to experiment with this in python so that we can learn from a real implementation. If there are technical barriers, e.g. with releasing a new python version of cucumber-expressions with more features that the other implementations, let's talk about that. I'm sure that's not insurmountable.

mpkorstanje · 2024-10-10T17:56:01Z

For us, a format of {label:type} matches what existing parsers (cfparse/parse) in use by behave and pytest-bdd do, so would make the transition much smoother?

@jsa34 is your goal to label arguments to facilitate the use of Cucumber expressions in Python or to make Cucumber expressions compatible with the existing parsers and expressions?

I don't think the latter is feasible or realistic.

Whether or how to handle mixing named and anonymous args in the same string.

I think for Cucumber expressions the only thing that matters is the output format.

Formerly a list of arguments, now a list of possibly labeled arguments. How this output is used to invoke a step definition and with what checks should be left to the Cucumber implementation.

How to specify named args without type (which can then be inferred or defaulted to str, this is most intuitive for duck typing languages)

These would have the anonymous type and in strongly typed languages take their type from the methods parameter.

If I recall this correctly this too is part of the interface, so the Python Cucumber implementation can always pick a string.

jsa34 · 2024-10-10T19:27:51Z

@mpkorstanje the question is more - should there be a standardised format for expressing label and type on cucumber expressions across all languages?

For example, would we permit Python to use the format:
"I have {apple_count:int} apples"
but Java:
"I have {int:apple_count} apples"

These are just random examples, however I noted the original suggested format was akin to the Python example, and this feels like the natural format based on Python conventions and is the format currently in use for existing parsers python users have to use for Behave and Pytest-BDD, whereas it as suggested later to use the "Java" example.

(I am not proposing Java wants to use this format - it is merely an illustrative example)

mpkorstanje · 2024-10-11T00:10:37Z

Okay. I think the difference in what you are proposing is mostly stylistic then. And after looking at some languages in a very non-rigorous manner when a : is used in variable declaration (e.g. var x: int), it seems that the name comes first and type come last.

So I would be okay with:

{} // Anonymous type, as is
{type}  // As is
{label:}  // Anonymous type (.*) with a label
{label:type} // type and label

The character set to be used in labels and types would have to be restricted and may not include the :.

And the output of matching would be:

---
expression: "{}"
text: '0.22'
expected_args:
- value: '0.22'
  label: null
---
expression: "{}"
text: '0.22'
type_hint: 
 - int
expected_args:
- value: 0.22
  label: null
---
expression: "{int}"
text: '0.22'
expected_args:
- value: 0.22
  label: null
---
expression: "{x:}"
text: '0.22'
expected_args:
- value: '0.22'
  label: 'x'
---
expression: "{x:int}"
text: '0.22'
expected_args:
- value: 0.22
  label: 'x'

How the args are matched with a step definition would be up to the Cucumber implementation.

Questions:

How to deal with duplicate, non null, labels?
If labels must be unique, can we call them names?
What other error conditions do we have in mind?
- I can image that either all parameters must have a name or none may have a name
  - If so then we might instead have a Expression.positionalArguments: List and a Expression.namedArguments: Map method.
- Anything else?
What about whitespace around the label and the name? I think {x: int} is easier to read, but would require reserving white space too.
While logically sensible, I find {label: } rather hard to read. It seems incomplete.

jsa34 · 2024-10-11T05:01:43Z

@mpkorstanje that's a much clearer summary than I gave - thank you!

My opinions on the questions:

Labels should be unique, especially if some implementations mat rely on them as the key instead of positional.
I am more naturally inclined if they are unique to call them names instead of labels.
Agreed about whitespace - I think all white spaces should be trimmed from text captured inside the curly brackets. This would logically, and probably rightly, mean that whitespace cannot be within the group name nor the type (i.e. {some name:custom type} would be illegal)
I absolutely agree with the latter. Perhaps as we are saying it is a toggle and binary (i.e. only named or positional args allowed and no mixing), we don't need the colon to signify that this is a name instead of being a type (i.e. {name} and {name:type} would be permitted if systems use the named args implementation, but {type} and {} is used for those using positional only. This might get difficult though if we allow positional and named to be used in the same expression later down the line. I'm not 100% sure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: named capture groups #206

Feature: named capture groups #206

ciaranmcnulty commented Jan 25, 2023 •

edited

Loading

mpkorstanje commented Jan 25, 2023

ciaranmcnulty commented Jan 26, 2023 •

edited

Loading

ciaranmcnulty commented Jan 26, 2023

ciaranmcnulty commented Feb 1, 2023

mpkorstanje commented Feb 1, 2023 •

edited

Loading

ciaranmcnulty commented Feb 2, 2023 •

edited

Loading

ciaranmcnulty commented Feb 2, 2023

ciaranmcnulty commented Feb 2, 2023

mpkorstanje commented Feb 2, 2023

ciaranmcnulty commented Feb 2, 2023

mpkorstanje commented Feb 5, 2023

mattwynne commented Feb 8, 2023

mladedav commented Mar 1, 2024

jsa34 commented Oct 8, 2024 •

edited

Loading

jenisys commented Oct 8, 2024

jsa34 commented Oct 8, 2024 •

edited

Loading

jsa34 commented Oct 8, 2024 •

edited

Loading

mattwynne commented Oct 10, 2024

mpkorstanje commented Oct 10, 2024 •

edited

Loading

jsa34 commented Oct 10, 2024 •

edited

Loading

mpkorstanje commented Oct 11, 2024 •

edited

Loading

jsa34 commented Oct 11, 2024 •

edited

Loading

Feature: named capture groups #206

Feature: named capture groups #206

Comments

ciaranmcnulty commented Jan 25, 2023 • edited Loading

mpkorstanje commented Jan 25, 2023

ciaranmcnulty commented Jan 26, 2023 • edited Loading

ciaranmcnulty commented Jan 26, 2023

ciaranmcnulty commented Feb 1, 2023

mpkorstanje commented Feb 1, 2023 • edited Loading

ciaranmcnulty commented Feb 2, 2023 • edited Loading

ciaranmcnulty commented Feb 2, 2023

ciaranmcnulty commented Feb 2, 2023

mpkorstanje commented Feb 2, 2023

ciaranmcnulty commented Feb 2, 2023

mpkorstanje commented Feb 5, 2023

mattwynne commented Feb 8, 2023

mladedav commented Mar 1, 2024

jsa34 commented Oct 8, 2024 • edited Loading

jenisys commented Oct 8, 2024

jsa34 commented Oct 8, 2024 • edited Loading

jsa34 commented Oct 8, 2024 • edited Loading

mattwynne commented Oct 10, 2024

mpkorstanje commented Oct 10, 2024 • edited Loading

jsa34 commented Oct 10, 2024 • edited Loading

mpkorstanje commented Oct 11, 2024 • edited Loading

jsa34 commented Oct 11, 2024 • edited Loading

ciaranmcnulty commented Jan 25, 2023 •

edited

Loading

ciaranmcnulty commented Jan 26, 2023 •

edited

Loading

mpkorstanje commented Feb 1, 2023 •

edited

Loading

ciaranmcnulty commented Feb 2, 2023 •

edited

Loading

jsa34 commented Oct 8, 2024 •

edited

Loading

jsa34 commented Oct 8, 2024 •

edited

Loading

jsa34 commented Oct 8, 2024 •

edited

Loading

mpkorstanje commented Oct 10, 2024 •

edited

Loading

jsa34 commented Oct 10, 2024 •

edited

Loading

mpkorstanje commented Oct 11, 2024 •

edited

Loading

jsa34 commented Oct 11, 2024 •

edited

Loading