Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For enum values starting with a digit, prefix the generated symbol with an underscore. #3961

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

deven
Copy link
Contributor

@deven deven commented Jan 3, 2025

Motivation and Context

If an enum uses values which start with a digit, the generated symbols are invalid, causing both Smithy errors and Rust errors.

Description

Prefixes invalid symbols starting with a digit with underscores to make the symbols valid.

Testing

This commit fixes the Smithy error that I was getting for an enum using a value of "2DBarcode", and also fixes the generated Rust code to use "_2DBarcode" as the enum value instead of "2DBarcode".

Checklist

  • For changes to the smithy-rs codegen or runtime crates, I have created a changelog entry Markdown file in the .changelog directory, specifying "client," "server," or both in the applies_to key.
  • For changes to the AWS SDK, generated SDK code, or SDK runtime crates, I have created a changelog entry Markdown file in the .changelog directory, specifying "aws-sdk-rust" in the applies_to key.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@deven deven requested review from a team as code owners January 3, 2025 14:45
@landonxjames
Copy link
Contributor

Hi @deven, thanks for the contribution! I have kicked off the CI tests and expect them all to pass. Would it be possible to get a test added for this to ensure that it isn't reverted at some later point and maybe a comment on the regex replacements just to indicate their purpose?

@drganjoo
Copy link
Contributor

drganjoo commented Jan 3, 2025

It is unclear which error is this PR trying to resolve. Enum variant names follow the same rules as identifiers in Smithy, and Smithy does not allow identifiers to begin with a digit. For example, this is invalid in Smithy:

enum BarCodes {
    2DBarCode
    3DBarCode
}

However, this restriction does not apply to enum values. For example, the following is perfectly valid and will not raise any errors:

enum BarCodes {
    TwoDBarCode = "2d"
    ThreeDBarCode = "3d"
}

@deven
Copy link
Contributor Author

deven commented Jan 4, 2025

Hi @deven, thanks for the contribution! I have kicked off the CI tests and expect them all to pass. Would it be possible to get a test added for this to ensure that it isn't reverted at some later point and maybe a comment on the regex replacements just to indicate their purpose?

Will do, thanks!

@deven
Copy link
Contributor Author

deven commented Jan 4, 2025

It is unclear which error is this PR trying to resolve. Enum variant names follow the same rules as identifiers in Smithy, and Smithy does not allow identifiers to begin with a digit. For example, this is invalid in Smithy:

enum BarCodes {
    2DBarCode
    3DBarCode
}

However, this restriction does not apply to enum values. For example, the following is perfectly valid and will not raise any errors:

enum BarCodes {
    TwoDBarCode = "2d"
    ThreeDBarCode = "3d"
}

I tried to keep it simple, but I guess I need to walk through the gory details.

I've been working on creating AWS-style Rust SDKs for various sections of Amazon SP-API, using a custom script I wrote to translate the original Swagger IDL files into Smithy IDL files, then building the SDKs using smithy-rs. I've had to resolve many problems in this project, but that's a long story.

In this particular case, I was working on the Fulfillment Inbound API. Here is one of the enums in the original fulfillmentInboundV0.json Swagger IDL file:

    "BoxContentsSource": {
      "type": "string",
      "description": "Where the seller provided box contents information for a shipment.",
      "enum": [
        "NONE",
        "FEED",
        "2D_BARCODE",
        "INTERACTIVE"
      ],
      "x-docgen-enum-table-extension": [
        {
          "value": "NONE",
          "description": "There is no box contents information for this shipment. Amazon will manually process the box contents information. This may incur a fee."
        },
        {
          "value": "FEED",
          "description": "Box contents information is provided through the _POST_FBA_INBOUND_CARTON_CONTENTS_ feed."
        },
        {
          "value": "2D_BARCODE",
          "description": "Box contents information is provided by a barcode on the shipment. For more information, see Using 2D barcodes for box content information on Seller Central."
        },
        {
          "value": "INTERACTIVE",
          "description": "Box contents information is provided by an interactive source, such as a web tool."
        }
      ]
    },

My script translated this Swagger enum into the following Smithy enum in the generated Smithy IDL file:

/// Where the seller provided box contents information for a shipment.
enum BoxContentsSource {
    /// There is no box contents information for this shipment. Amazon will manually process the box contents information. This may incur a fee.
    NONE

    /// Box contents information is provided through the _POST_FBA_INBOUND_CARTON_CONTENTS_ feed.
    FEED

    /// Box contents information is provided by a barcode on the shipment. For more information, see Using 2D barcodes for box content information on Seller Central.
    _2D_BARCODE = "2D_BARCODE"

    /// Box contents information is provided by an interactive source, such as a web tool.
    INTERACTIVE
}

Of course, simply using 2D_BARCODE would be an invalid enum variant, since it doesn't conform to identifier rules because of the leading digit. I ran into that issue earlier, so I modified my script to use _2D_BARCODE = "2D_BARCODE" instead, to make the Smithy IDL valid without changing the wire value. This is analogous to TwoDBarCode = "2d" in your example. In theory, it should have worked, but it didn't. When I ran ./gradlew assemble in the smithy-rs directory to generate the Rust code, it failed with the following exception:

Projection fulfillmentinboundv0 failed: software.amazon.smithy.model.shapes.ShapeIdSyntaxException: Invalid shape ID member: 2DBarcode
software.amazon.smithy.model.shapes.ShapeIdSyntaxException: Invalid shape ID member: 2DBarcode
        at software.amazon.smithy.model.shapes.ShapeId.withMember(ShapeId.java:263)
        at software.amazon.smithy.rust.codegen.core.smithy.generators.EnumMemberModel$Companion.toEnumVariantName(EnumGenerator.kt:108)
        at software.amazon.smithy.rust.codegen.core.smithy.generators.EnumMemberModel.name(EnumGenerator.kt:119)
        at software.amazon.smithy.rust.codegen.client.smithy.generators.InfallibleEnumType.renderForwardCompatibilityNote(ClientEnumGenerator.kt:245)
        at software.amazon.smithy.rust.codegen.client.smithy.generators.InfallibleEnumType.access$renderForwardCompatibilityNote(ClientEnumGenerator.kt:29)
        at software.amazon.smithy.rust.codegen.client.smithy.generators.InfallibleEnumType$additionalDocs$1.invoke(ClientEnumGenerator.kt:165)
        at software.amazon.smithy.rust.codegen.client.smithy.generators.InfallibleEnumType$additionalDocs$1.invoke(ClientEnumGenerator.kt:164)
        at software.amazon.smithy.rust.codegen.core.smithy.generators.EnumGenerator.renderEnum(EnumGenerator.kt:253)
        at software.amazon.smithy.rust.codegen.core.smithy.generators.EnumGenerator.renderNamedEnum(EnumGenerator.kt:204)
        at software.amazon.smithy.rust.codegen.core.smithy.generators.EnumGenerator.render(EnumGenerator.kt:191)
        at software.amazon.smithy.rust.codegen.client.smithy.ClientCodegenVisitor$stringShape$1.invoke(ClientCodegenVisitor.kt:283)
        at software.amazon.smithy.rust.codegen.client.smithy.ClientCodegenVisitor$stringShape$1.invoke(ClientCodegenVisitor.kt:282)
        at software.amazon.smithy.rust.codegen.core.smithy.RustCrate.withModule$lambda$6(CodegenDelegator.kt:211)
        at software.amazon.smithy.codegen.core.WriterDelegator.useFileWriter(WriterDelegator.java:175)
        at software.amazon.smithy.rust.codegen.core.smithy.RustCrate.withModule(CodegenDelegator.kt:210)
        at software.amazon.smithy.rust.codegen.core.smithy.RustCrate.inPrivateModuleWithReexport(CodegenDelegator.kt:252)
        at software.amazon.smithy.rust.codegen.client.smithy.ClientCodegenVisitor.stringShape(ClientCodegenVisitor.kt:282)
        at software.amazon.smithy.rust.codegen.client.smithy.ClientCodegenVisitor.stringShape(ClientCodegenVisitor.kt:61)
        at software.amazon.smithy.model.shapes.ShapeVisitor.enumShape(ShapeVisitor.java:70)
        at software.amazon.smithy.model.shapes.EnumShape.accept(EnumShape.java:115)
        at software.amazon.smithy.rust.codegen.client.smithy.ClientCodegenVisitor.execute(ClientCodegenVisitor.kt:165)
        at software.amazon.smithy.rust.codegen.client.smithy.RustClientCodegenPlugin.executeWithDecorator(RustClientCodegenPlugin.kt:79)
        at software.amazon.smithy.rust.codegen.client.testutil.ClientDecoratableBuildPlugin.execute(ClientCodegenIntegrationTest.kt:61)
        at software.amazon.smithy.build.SmithyBuildImpl.applyPlugin(SmithyBuildImpl.java:432)
        at software.amazon.smithy.build.SmithyBuildImpl.applyProjection(SmithyBuildImpl.java:364)
        at software.amazon.smithy.build.SmithyBuildImpl.executeSerialProjection(SmithyBuildImpl.java:281)
        at software.amazon.smithy.build.SmithyBuildImpl.lambda$applyAllProjections$4(SmithyBuildImpl.java:201)
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
        at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)

This exception is generated because "2DBarcode" is not a valid identifier because of the leading digit. However, the exact string "2DBarcode" was NOT in the Smithy IDL file at all! Some algorithm inside smithy-rs apparently took the wire value of "2D_BARCODE" and applied transformations to turn it into "2DBarcode", which then triggered the exception. If it had started with the "_2D_BARCODE" Smithy enum variant name instead, presumably it would have transformed that into "_2DBarcode", which would have been valid.

My commit uses the following code fragment to fix the invalid generated symbols starting with a digit:

.replace(Regex("^(_*\\d)"), "_$1")

This will prefix the symbol name with an underscore if it begins with a digit, making the generated symbol valid. To guarantee that this modification can't conflict with another enum variant, it also matches any number of leading underscores before that leading digit. Otherwise, prefixing 2DBarcode with an underscore could potentially conflict with a different _2DBarcode variant. Since this regex matches any number of leading underscores along with the leading digit, that means that 2DBarcode becomes _2DBarcode and _2DBarcode becomes __2DBarcode (adding a second leading underscore), preventing a possible collision from this prefixing step.

Originally, I only added this code fragment to EnumGenerator.kt, but it was insufficient to solve the problem. The exception above did go away, and the smithy-rs build was "successful", but the build output did include a new warning:

[WARNING] Failed to run cargo fmt: [com.amazon.sellingpartnerapi.fulfillmentinbound.v0#FulfillmentInbound_v0]
Unexpected exception thrown when executing subprocess:
CommandError(output=Command Error
error[E0585]: found a documentation comment that doesn't document anything
  --> /home/deven/git/smithy-rs/aws/sdk/build/smithyprojections/sdk/fulfillmentinboundv0/rust-client-codegen/src/types/_box_contents_source.rs:45:5
   |
44 | pub enum BoxContentsSource {
   |          ----------------- while parsing this enum
45 |     /// Box contents information is provided by a barcode on the shipment. For more information, see Using 2D barcodes for box content information on Seller Central.
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = help: doc comments must come before what they document, if a comment was intended use `//`
   = help: enum variants can be `Variant`, `Variant = <integer>`, `Variant(Type, ..., TypeN)` or `Variant { fields: Types }`

Error writing files: failed to resolve mod `_box_contents_source`: cannot parse /home/deven/git/smithy-rs/aws/sdk/build/smithyprojections/sdk/fulfillmentinboundv0/rust-client-codegen/src/types/_box_contents_source.rs

)

When I investigated, I found that the generated Rust code had the following enum definition:

pub enum BoxContentsSource {
    /// Box contents information is provided by a barcode on the shipment. For more information, see Using 2D barcodes for box content information on Seller Central.
    2DBarcode,
    /// Box contents information is provided through the _POST_FBA_INBOUND_CARTON_CONTENTS_ feed.
    Feed,
    /// Box contents information is provided by an interactive source, such as a web tool.
    Interactive,
    /// There is no box contents information for this shipment. Amazon will manually process the box contents information. This may incur a fee.
    None,
    /// `Unknown` contains new variants that have been added since this code was generated.
    #[deprecated(note = "Don't directly match on `Unknown`. See the docs on this enum for the correct way to handle unknown variants.")]
    Unknown(crate::primitives::sealed_enum_unknown::UnknownVariantValue)
}

Since 2DBarcode as an enum variant name is just as invalid in Rust as it is in Smithy, this generated Rust code is invalid and cannot be compiled. Investigating further, I concluded that this invalid enum variant name in the Rust code was being returned from the toSymbol() method. Further debugging showed that the fakeMemberShape object WAS being created with the underscore prefix, with an id value like com.amazon.sellingpartnerapi.fulfillmentinbound.v0#BoxContentsSource$_2DBarcode, but the toSymbol() method was nevertheless returning 2DBarcode as the symbol.

I needed to investigate the toSymbol() implementation, but it wasn't clear which concrete class implementation was being used. Using reflection and debugging print statements, I determined that the RustReservedWordSymbolProvider class contained the toSymbol() implementation in question. When I eventually determined the right way to integrate my new code fragment into the toSymbol() method, it finally resolved the problem complete. With this pull request, the generated Rust code now looks like this instead:

pub enum BoxContentsSource {
    /// Box contents information is provided by a barcode on the shipment. For more information, see Using 2D barcodes for box content information on Seller Central.
    ///
    /// _Note: `::2DBarcode` has been renamed to `::_2DBarcode`._
    _2DBarcode,
    /// Box contents information is provided through the _POST_FBA_INBOUND_CARTON_CONTENTS_ feed.
    Feed,
    /// Box contents information is provided by an interactive source, such as a web tool.
    Interactive,
    /// There is no box contents information for this shipment. Amazon will manually process the box contents information. This may incur a fee.
    None,
    /// `Unknown` contains new variants that have been added since this code was generated.
    #[deprecated(note = "Don't directly match on `Unknown`. See the docs on this enum for the correct way to handle unknown variants.")]
    Unknown(crate::primitives::sealed_enum_unknown::UnknownVariantValue),
}

This now contains a valid enum variant name (_2DBarcode), and even includes a note in the documentation comment about the symbol being renamed.

@aajtodd
Copy link
Contributor

aajtodd commented Jan 6, 2025

Have you tried changing your original transformation to not use underscore as a prefix and something like X_ or x2DBarCode or similar? The algorithm for converting the enum name is likely dropping underscores as part of converting along word boundaries. You may have better results with a different prefix and possibly negate the need for this change.

@drganjoo
Copy link
Contributor

drganjoo commented Jan 6, 2025

@deven Thanks for the detailed explanation. @aajtodd I agree that not appending _ should solve this issue, but we need to fix this overall, as the following is a valid Smithy model, and we should generate valid Rust code for it:

enum Dimensions {
    _2D = "2D"
    _3D = "3D"
}

However, I would like to dig a bit deeper to understand why we need to use PascalCase in the EnumGenerator.kt in the first place.

@deven
Copy link
Contributor Author

deven commented Jan 7, 2025

Have you tried changing your original transformation to not use underscore as a prefix and something like X_ or x2DBarCode or similar? The algorithm for converting the enum name is likely dropping underscores as part of converting along word boundaries. You may have better results with a different prefix and possibly negate the need for this change.

For the record, using x2D_BARCODE instead of _2D_BARCODE does work with the original code, generating a Rust enum member X2DBarcode. Not that I want to use an alphabetic prefix like that! As @drganjoo said, using the underscore prefix makes a valid Smithy model that should generate valid Rust code. My patch resolves this issue.

@deven deven force-pushed the prefix_leading_digit branch from e16a2af to c14bf4d Compare January 7, 2025 06:49
@deven deven force-pushed the prefix_leading_digit branch from c14bf4d to fbd70dd Compare January 7, 2025 06:50
@deven
Copy link
Contributor Author

deven commented Jan 7, 2025

Hi @deven, thanks for the contribution! I have kicked off the CI tests and expect them all to pass. Would it be possible to get a test added for this to ensure that it isn't reverted at some later point and maybe a comment on the regex replacements just to indicate their purpose?

Hi @landonxjames, I've replaced the original commit with a newer one which includes comments and test cases.

@deven
Copy link
Contributor Author

deven commented Jan 17, 2025

So, could we get this merged please? 😃

@drganjoo
Copy link
Contributor

Thank you for your patience regarding this PR review. I've identified some issues with the current implementation in EnumGenerator that need to be addressed.

The fundamental problem lies in the processing order of the identifier transformation. Currently, we apply toPascalCase before handling the underscore-prefixed numeric identifiers, which strips away the significant prefix information. The current implementation:

val name = definition.name.orNull()?.toPascalCase()?.replace(Regex("^(_*\\d)"), "_$1") ?: return null

This approach is problematic for the following reasons:

  1. It fails to preserve the original number of leading underscores in identifiers. Instead of maintaining the user's intended prefix structure, it reduces multiple underscores to a single one. In fact, the regular expression will never see the leading underscores at all, because toPascalCase has already removed them from the identifier name, so we might as well have been looking for a digit instead of zero or more underscores.

  2. It introduces potential naming collisions. Consider this Smithy model:

@enum([
    { 
        value: "2D_BARCODE",
        name: "_2D_BARCODE",
        documentation: "Example with leading digit." 
    },
    { 
        value: "2D_SPECIAL_BARCODE",
        name: "__2D_BARCODE",
        documentation: "Example with more underscores"
    },
])

This generates invalid Rust code with duplicate variant names:

pub enum BarCodes {
    /// Example with leading digit.
    ///
    /// _Note: `::2DBarcode` has been renamed to `::_2DBarcode`._
    _2DBarcode,
    /// Example with leading digit.
    ///
    /// _Note: `::2DBarcode` has been renamed to `::_2DBarcode`._
    _2DBarcode,
}
  1. The documentation comments unnecessarily expose internal transformation details. We don't document other common transformations (e.g., THREE_D_BARCODEThreeDeBarcode), so this special case treatment is inconsistent with our general approach.

Proposed Solutions:

  1. Enhance the EnumGenerator:

a) Implement proper prefix preservation (something like the following):

Note: We specifically use _+ (one or more underscores) rather than _* (zero or more underscores) in our regex because we only want to handle identifiers that start with leading underscores. Identifiers starting directly with digits (without underscores) are invalid, so we don't need to handle that case.

val name = definition.name.orNull()?.let { input ->
    Regex("^(_+\\d+)(.*)$").find(input)?.let { match ->
        val (prefix, rest) = match.destructured
        prefix + rest.toPascalCase()
    } ?: input.toPascalCase()
} ?: return null

b) Consider the broader implications of underscore preservation:

There's a larger architectural question about whether we should preserve all leading underscores, regardless of whether they're followed by digits. While it might seem logical to handle all underscore prefixes uniformly, we're intentionally limiting this fix to underscores followed by digits for backward compatibility. We can be confident this won't break existing models because any model with a digit-prefixed identifier would have generated invalid Rust code that wouldn't compile, making it impossible for such models to be in use successfully.

However, this limited approach doesn't address all edge cases. Consider this valid Smithy model:

@enum([
    { 
        value: "3DBARCODE",
        name: "_Three",
        documentation: "Example 3D Bar code" 
    },
    { 
        value: "3D_BAR_CODE",
        name: "Three",
        documentation: "Example 3D Bar code with a leading underscore" 
    },
])
string BarCodes

This still generates invalid Rust code because two enum variants differ only by a leading underscore. Ideally, we would like to detect and stop code generation with an error rather than generate uncompilable Rust code. While this is a valid concern, we've decided to address this larger issue in a separate PR to keep the current changes focused and manageable.

c) Optimize regex performance by declaring it as a companion object member:

class EnumMemberModel(...) {
    private val UNDERSCORE_PREFIX_PATTERN = Regex("^(_+\\d+)(.*)$")
}
  1. Architectural Improvement:

The identifier transformation logic should be moved out of RustReservedWordSymbolProvider, as this component should solely handle Rust keyword conflicts. The underscore-prefixed numeric identifier handling belongs in a separate symbol provider.

I will follow up with a specific recommendation for the new location of this logic after further analysis of the codebase architecture.

@deven
Copy link
Contributor Author

deven commented Jan 28, 2025

I was already prefixing the digit with an underscore in the Smithy IDL file to make the Smithy identifier valid:

_2D_BARCODE = "2D_BARCODE"

Because the toPascalCase() function is stripping ALL underscores, this didn't help when the Rust enum member name was generated as 2DBarcode.

I don't know what else this might impact, but an alternative approach would be to have toPascalCase() preserve any prefix consisting of leading underscores, instead of using the logic I added in this pull request:

fun String.toPascalCase(): String {
    // TODO(https://github.com/smithy-lang/smithy-rs/issues/3047): consider using our updated toSnakeCase (but need to audit diff)
    val match = Regex("^(_*)(.*)$").find(this)!!
    val prefix = match.groupValues[1]
    val remaining = match.groupValues[2]

    return prefix + CaseUtils.toPascalCase(CaseUtils.toSnakeCase(remaining))
}

I tested this solution and confirmed that it converts _2D_BARCODE into _2DBarcode instead of 2DBarcode. This achieves the same result as the .replace(Regex("^(_*\\d)"), "_$1") code fragment I was using in this pull request, without calling it out as a rename in the comments.

For consistency, it would make sense to apply the same logic to the toSnakeCase() function as well:

fun String.toSnakeCase(): String {
    val match = Regex("^(_*)(.*)$").find(this)!!
    val prefix = match.groupValues[1]
    val remaining = match.groupValues[2]

    return prefix + remaining.splitOnWordBoundaries().joinToString("_") { it.lowercase() }
}

Both functions are in this file: codegen-core/src/main/kotlin/software/amazon/smithy/rust/codegen/core/util/Strings.kt

This would likely be a reasonable approach. There is some chance of backward compatibility issues since existing generated Rust code with leading underscores in the Smithy IDL file would generate different identifiers with this code change than they currently do. On the other hand, Smithy explicitly discourages leading underscores like this, so it might not be a problem:

Enum member names SHOULD NOT contain any lowercase ASCII Latin letters (a-z) and SHOULD NOT start with an ASCII underscore (_). That is, enum names SHOULD match the following regular expression: ^[A-Z]+[A-Z_0-9]*$.

If the backward compatibility is a significant concern, the approach above could be slightly modified to only preserve leading underscores when followed by a digit by using Regex("^(_*\\d)(.*)$"), but it would be more consistent behavior to always preserve them.

@deven
Copy link
Contributor Author

deven commented Jan 30, 2025

Should I replace my current patch in this pull request with the alternative that I suggested above, to fix toPascalCase() and toSnakeCase() to always preserve leading underscores for a broader, more consistent fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants