Skip to content

Conversation

@edsko
Copy link
Collaborator

@edsko edsko commented Nov 15, 2025

Closes #1220.

@edsko
Copy link
Collaborator Author

edsko commented Nov 15, 2025

@TravisCardwell @dschrempf This PR works, but only for llvm-19.1.0 and up. The problem goes back to the detection of named vs anonymous declarations (#795, #856), something we've struggled quite a bit with. I'm not entirely sure how to proceed here; I can see two ways forward.

Option 1

First, we can skip the new test I add in this PR for llvm older than llvm-19.1. This also means that we'd then need to change qualPrelimDeclId so that for ordinary names, it ignores the UserProvided/ClangGenerated generated distinction, simply assuming that the name is user provided instead.

However, the way that we deal with macros in names is not actually great. The solution that we adopted previously is

-- If the expansion location and the spelling location /of the name/
-- are different, this means that the name is constructed using a
-- macro. This must therefore have been done by the user.

However, this is not quite accurate. If we have a macro that would construct the entire struct, for example, it may very well be that the struct anonymous.

Option 2

I'm wondering if there isn't an alternative solution. The heart of the problem is that later clang (not sure in which version this changed exactly) we cannot easily distinguish between these two declarations:

typedef struct     { .. } foo;
typedef struct foo { .. } foo;

because clang fills in that name foo for the struct. This is why we go through such lengths to figure out what the actual token in the file is, but that is awkward in the presence of macros, and even more awkward because we don't even get accurate location information about macro expansions in libclang < 19.1.

So here's my proposal: what if we don't distinguish either? In other words, we treat these two examples as exactly the same, just like clang does. For older clang, this would mean substituting foo for the empty string "" for the name of the struct, but that is easily done, I think. This does not mean we no longer have anonymous structs; for example, in

struct outer {
  struct {
    int inner_field;
  } outer_field;
}

that inner struct is still anonymous; clang does fill in a name here too, but we can easily recognize it (it's something like "struct (unnamed at ....)").

Proposal

So concretely, I'm proposing: an unnamed struct defined inside a typedef is not considered anonymous; instead, we consider its tag to be derived from the typedef name.

I think this can work, and it means we no longer have to jump through so many hoops (and then still fail) to make things work with libclang. It's a bit inconsistent, perhaps, but I feel like it's the better option given the constraints.

Thoughts?

@TravisCardwell
Copy link
Collaborator

TravisCardwell commented Nov 16, 2025

Consider the following example, where struct foo and the anonymous struct are different types.

struct foo { char a; char b; };
typedef struct { double x; double y; } foo;

Such terrible (yet valid) C code is an example of code that we cannot handle well by default: we generate conflicting data Foo types. Users may resolve the issue with a prescriptive binding specification that specifies different names. This is currently possible because we distinguish the following names.

  • struct foo
  • struct @foo
  • foo

With this proposal, we would no longer distinguish the first two; they would both be struct foo. I suspect that this may cause major issues internally. (Even the PrelimDeclIds would conflict, making it impossible to construct a consistent UseDeclGraph, DeclUseGraph, and DeclIndex like we currently do.) As for binding specifications, cname: struct foo would be ambiguous. How would we specify different names for each of these types?

@TravisCardwell
Copy link
Collaborator

I have been thinking about possibilities... While I have not yet thought of any particularly good ones, my best candidate so far is something that we previously decided to not do: global renaming. After parsing declarations, where anonymous declarations declared inside of a typedef are given the typedef name, we could rename declarations to resolve conflicts, based on the set of all declarations.

For example, we could rename by appending _# where # is a natural number. With the above C, there are two declarations named struct foo. After sorting, the first of these could keep the name struct foo, while the second must be renamed. We could select the first name in a list of candidates (["struct foo_" ++ show i | i <- [1..]]) that is not already used.

Note that using source locations would be a bad idea, because that would be extremely fragile.

This strategy is not very elegant, but users would only be exposed to it when dealing with particularly terrible C code. It would allow users to distinguish all declarations in binding specifications as well as selection predicates.

We could provide an hs-bindgen-cli info cnames command to (concisely) show which C names map to which source locations, to assist users in cases where the naming is confusing. (We might want to implement this anyway!)

Thoughts?

@TravisCardwell
Copy link
Collaborator

Note that test golden/typenames.h generates conflicting types with the same name. It is a minimal example, without using anonymous declarations, but I am mentioning it here because it relates to the global renaming proposal.

(I ran across this while reviewing changes for the hstypes work. The binding specification for this test has issues that are expected given the name collision.)

@dschrempf
Copy link
Collaborator

Is this only affecting structs and unions? Could we separate the situations

typedef struct foo {..} foo;
typedef struct     {..} foo;

using a hand-rolled parser on clang_tokenize? It seems like an easy task (but there may be many complications).

If not, globally checking for conflicting declarations may be the only way to go. I am a bit skeptical though, because, even though the bindings will be deterministic, there may be unexpected renamings. It also means that the names of the translated declarations of a module will depend on other modules in the closure which is a bit of a red flag.

@TravisCardwell
Copy link
Collaborator

Parsing tokens is indeed straightforward if there is no CPP. Macros significantly complicate things, and use of macros to generate type declarations is not uncommon practice. (Sometimes macros are used to implement naming conventions, with the body outside the macro. Sometimes macros are used to implement templating of similar types, with the whole declaration generated by the macro.)

This is what we are doing now. My understanding is that we are discovering that differences across different versions of LLVM/Clang (due to improvements as well as bugs) make this difficult/impossible to get correct across all supported versions.

Indeed, we avoided global renaming for good reasons. I just wonder if it is less complicated than the alternatives, especially given that users will only see it when translating particularly poorly written C.

It also means that the names of the translated declarations of a module will depend on other modules in the closure which is a bit of a red flag.

I am not sure I understand what you mean. Imports in a module are all qualified (aside from a small set of standard Prelude imports), so we only need to rename based on the set of names for the current main files being processed ([C.Decl]).

@dschrempf
Copy link
Collaborator

I am not sure I understand what you mean. Imports in a module are all qualified (aside from a small set of standard Prelude imports), so we only need to rename based on the set of names for the current main files being processed ([C.Decl]).

I think I miscomunicated, using module (which is related to Haskell mdoules) instead of "C file".

Is it true that global renaming means that other "C files" in the translation unit can influence the name of a declaration? That is, can the name of a declaration at hand change depending on other files we include, or do not include in the translation unit?

@TravisCardwell
Copy link
Collaborator

Yes: we would need to consider names for all of the declarations that we "parse," which (may) include declarations across transitive includes (depending on the parse predicate).

As a minimal example of when this matters, consider the following foo.h header.

struct foo { char a; char b; };

That declaration is translated to Haskell type Foo. Perhaps you later add #include "bar.h" to the top of that header, where bar.h is as follows.

typedef struct { double x; double y; } foo;

These types are valid C. There are no conflicts because struct foo is in the tagged namespace, while typedef ... foo is in the ordinary namespace. There is a name collision in Haskell, however. With my example strategy, we would name the typedef as Foo in Haskell and struct foo would get renamed to Foo_1. Therefore, the Haskell name for struct foo depends on the existence of that #include.

Note, however, that this case only arises when there is a name collision, caused by different C types where the tag of one type is the same as the ordinary name of another type. Currently, we generate Haskell types with name collisions, and it is up to users to provide a prescriptive binding specification in order to even generate valid Haskell. With this change, we would generate valid Haskell by default, and users may optionally use a prescriptive binding specification to configure better names.

@dschrempf
Copy link
Collaborator

Thank you, this clarifies things. This means we should fall back to global renaming only if we don't find another solution. On the other hand, it seems like we have tried hard, and didn't find one so far.

@edsko
Copy link
Collaborator Author

edsko commented Nov 18, 2025

Interestingly, for @TravisCardwell ' example of

struct foo { char a; char b; };
typedef struct { double x; double y; } foo;

clang reports

clang_getCursorSpelling: ("./test.h:1:8","foo")
clang_getCursorSpelling: ("./test.h:2:9","foo")
clang_getCursorSpelling: ("./test.h:2:40","foo")

So we indeed have two structs both called foo, according to clang. If that's good enough for clang, it should be good enough for us...?

For me numbering declarations is a no; we've worked really hard not do to that. For now I'll go with "Option 1" above as a stop-gap measure and I'll open a ticket for us to reconsider this.

@TravisCardwell
Copy link
Collaborator

So we indeed have two structs both called foo, according to clang. If that's good enough for clang, it should be good enough for us...?

IMHO no, but it should not matter much if such C code is rare. 😄

Perhaps we should document that hs-bindgen is unable to handle such code, but users can work around issues by changing the C source. A wrapper header could be used as a workaround (only?) if typedefs around the problematic types are made opaque.

@edsko
Copy link
Collaborator Author

edsko commented Nov 19, 2025

Discussing this with @TravisCardwell , we realized there is a way out of this mess. The fundamental problem is that (newer) clang does not let us distinguish between these two declarations:

typedef struct foo { int x; } foo;
typedef struct     { int x; } foo;

In both cases, the struct tag will be reported as "foo". We were thinking about ways in which we could treat all structs with a single use site inside a typedef as anonymous, so that we would not need to distinguish between them anymore, when we realized something very useful: such structs we squash anyway! Since we squash them anyway, the name of the struct is irrelevant, and we don't care if it's anonymous or not.

For Travis' counter-example from above

struct foo { char a; char b; };
typedef struct { double x; double y; } foo;

we will still generate two Haskell types with the same name, but the user can easily distinguish in a prescriptive binding spec: the first is struct foo, the latter is just foo. The anonymous struct inside the typedef (which would have the same name as the first struct) is unproblematic, because it will anyway be squashed.

We will need to be able to reliably detect anonymous structs in other contexts, for example in

struct rect {
  struct { int x; int y } bottom_left;
  struct { int x; int y } top_right;
}

but here clang generates a name such as unnamed at .., which are easily detected.

The only consequence of depending on this squashing behaviour is that it is no longer optional (previously we were planning to make squashing optional in this case, through a prescriptive binding spec), but we think that this is a very small price to pay.

@edsko
Copy link
Collaborator Author

edsko commented Nov 19, 2025

Opened #1307 for the remaining problem.

edsko added a commit to well-typed/libclang that referenced this pull request Nov 19, 2025
See well-typed/hs-bindgen#1292 for detailed discussion,
and especially
well-typed/hs-bindgen#1292 (comment) for
the reasoning behind this change.
@edsko
Copy link
Collaborator Author

edsko commented Nov 20, 2025

This is not yet correct; there is at least one problem ,with binding spec, though conceivably other orthogonal problems also. The issue is that when we have a struct inside a typedef, and we squash it, we nonetheless refer to the struct in the C code. Specifically,

typedef struct struct_in_typedef { int x; } typedef_with_tag;

results in

ctypes:
- headers: test.h
  cname: struct struct_in_typedef
  hsname: Typedef_with_tag

This is problematic, because that struct name might not exist (we cannot reliably detect if it does or not, that's what this PR is about). So instead we should do

- headers: test.h
  cname: typedef_with_tag 
  hsname: Typedef_with_tag

that also makes the treatment C-side and Haskell-side more consistent.

In order to do this, we need to change the name of the struct in the DeclInfo (or alternatively add another field, but I don't think that's necessary); however, that means we might need to assign a name with name kind Ordinary to a struct: #1305 will make that possible, so will do that first.

@bolt12
Copy link
Collaborator

bolt12 commented Nov 20, 2025

See: #1220 (comment) . I opened: #1311.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate CI issue with LLVM 15 and anonymous declarations of ordinary type

5 participants