[WIP] Add support for DWARF compile units in wasm binaries #210

Open. Wants to merge 2 commits into base: main

dschuff
Contributor

@dschuff dschuff commented Jul 31, 2020

  • Add support for reading DWARF data from wasm custom sections
  • Use the common DWARF compile unit parser to support Bloaty's standard compileunit sink

@dschuff
Contributor Author

dschuff commented Jul 31, 2020

Hello!
I work on WebAssembly toolchains, and I'm interested in adding support to Bloaty for the relevant DWARF info in WebAssembly files.
Here's a first minimal PR.
Do you have suggestions for anything else that should go in it?
How would you usually test this kind of change?

@haberman
Member

haberman commented Jul 31, 2020

Hi @dschuff, thanks for the PR!

Does the PR seem to give reasonable results when you run it on WebAssembly files?

You could add some unit tests if you wanted, and check in some .wasm binaries to test against. In general, Bloaty is hard to unit test because compiler output is rather unpredictable. Also, improvements in Bloaty can invalidate existing unit tests, for example if we increase our coverage.

@dschuff
Contributor Author

dschuff commented Aug 7, 2020

(sorry, was out of town this week)
Yes, I have tested this PR locally with wasm files and gotten reasonable results.

Actually I do have one question about the output. With this PR it looks something like this for wasm:

    FILE SIZE        VM SIZE
 --------------  --------------
  77.5%  18.2Ki   NAN%       0    [section Code]
  10.8%  2.53Ki   NAN%       0    [section Data]
   3.6%     875   NAN%       0    [section name]
   1.9%     455   NAN%       0    link_a.c
   1.2%     293   NAN%       0    link_b.c
   1.0%     242   NAN%       0    [section Export]
   0.7%     180   NAN%       0    link_main.c
   0.7%     174   NAN%       0    [section Import]
   0.7%     160   NAN%       0    [section Type]
   0.6%     138   NAN%       0
   0.4%     103   NAN%       0    [section .debug_abbrev]
   0.3%      64   NAN%       0    [section Function]
   0.1%      30   NAN%       0    [section .debug_str]
   0.1%      22   NAN%       0    [section Global]
   0.1%      16   NAN%       0    [section .debug_ranges]
   0.1%      15   NAN%       0    [section .debug_info]
   0.1%      15   NAN%       0    [section .debug_line]
   0.1%      13   NAN%       0    [section Element]
   0.0%       8   NAN%       0    [WASM Header]
 100.0%  23.5Ki 100.0%       0    TOTAL

(ignore the NANs in the "VM Size" column, VM size isn't meaningful for wasm)
The compile units are link_{a,b,main}.c but it's also reporting "[section Code]" which is the entire code section, and if I'm not mistaken the reported size is the size of the whole code section (which includes the compile units for which we have debug info).
So I guess the first question is, what is the expected output supposed to cover? I would think for each compile unit, we want to report the size (i.e. how much of the code section is taken up by that compile unit)? And then have an entry for the remainder of the code section that doesn't have dwarf info?
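
The breakdown asked about here (a per-CU share of the code section, plus a remainder entry for bytes that have no DWARF coverage) could be sketched roughly as follows. This is only an illustration, not Bloaty's implementation: `CuRange`, `BreakDownCodeSection`, and the use of `[section Code]` as the remainder label are hypothetical, and the sketch assumes the CU ranges do not overlap and lie inside the section.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: a byte range inside the code section that DWARF
// attributes to one compile unit. Offsets are relative to the section start.
struct CuRange {
  std::string compile_unit;
  uint64_t offset;
  uint64_t size;
};

// Sums the covered bytes per compile unit and reports whatever is left over
// under a fallback label. Assumes the ranges do not overlap each other.
std::map<std::string, uint64_t> BreakDownCodeSection(
    uint64_t section_size, const std::vector<CuRange>& ranges) {
  std::map<std::string, uint64_t> sizes;
  uint64_t covered = 0;
  for (const CuRange& r : ranges) {
    // Precondition: every range lies inside the section.
    assert(r.offset <= section_size && r.size <= section_size - r.offset);
    sizes[r.compile_unit] += r.size;
    covered += r.size;
  }
  assert(covered <= section_size);
  if (covered < section_size) {
    // Bytes with no debug info stay attributed to the section itself.
    sizes["[section Code]"] = section_size - covered;
  }
  return sizes;
}
```

With a breakdown like this, the per-CU rows and the remainder row add up to the size of the code section, so nothing is counted twice.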

@dschuff
Contributor Author

dschuff commented Aug 7, 2020

... oh, it must be the case that the size for e.g. link_a.c is the combination of both the code and the debug info for that CU. I think that explains the numbers I'm seeing.
When I do -d sections,compileunits on an ELF binary I see the CUs in the breakdown of the .text section, but for this wasm binary I see:

    FILE SIZE        VM SIZE
 --------------  --------------
  77.5%  18.2Ki   NAN%       0    Code
  10.8%  2.53Ki   NAN%       0    Data
   3.6%     875   NAN%       0    name
   1.9%     468   NAN%       0    .debug_str
    31.8%     149   NAN%       0    link_a.c
    30.8%     144   NAN%       0    link_b.c
    29.5%     138   NAN%       0
     6.4%      30   NAN%       0    [section .debug_str]
     1.5%       7   NAN%       0    link_main.c
   1.2%     289   NAN%       0    .debug_line
    34.6%     100   NAN%       0    link_main.c
    30.1%      87   NAN%       0    link_a.c
    30.1%      87   NAN%       0    link_b.c
     5.2%      15   NAN%       0    [section .debug_line]
   1.1%     254   NAN%       0    .debug_info
    40.9%     104   NAN%       0    link_a.c
    28.7%      73   NAN%       0    link_main.c
    24.4%      62   NAN%       0    link_b.c
     5.9%      15   NAN%       0    [section .debug_info]
   1.0%     242   NAN%       0    Export
   0.8%     194   NAN%       0    .debug_abbrev
    53.1%     103   NAN%       0    [section .debug_abbrev]
    46.9%      91   NAN%       0    link_a.c
   0.7%     174   NAN%       0    Import
   0.7%     160   NAN%       0    Type
   0.3%      64   NAN%       0    Function
   0.2%      40   NAN%       0    .debug_ranges
    60.0%      24   NAN%       0    link_a.c
    40.0%      16   NAN%       0    [section .debug_ranges]
   0.1%      22   NAN%       0    Global
   0.1%      13   NAN%       0    Element
   0.0%       8   NAN%       0    [WASM Header]
 100.0%  23.5Ki 100.0%       0    TOTAL

which would seem to indicate that the Code section is being shown as monolithic and no parts of it are contributing to the CU totals (which of course is kind of the whole point). Do you have any idea how I might figure out how to fix that?

@haberman
Member

haberman commented May 1, 2021

Sorry for the slow reply on this!

Ultimately the memory map, which you can see with bloaty -v, should help clarify. If the Code section is monolithic, it is probably because that section of the file is not getting any more detailed information.

Usually with DWARF we are relying on VM addresses to break down compile units. But with wasm there are no VM addresses available (at least that was my experience when implementing initial wasm support).

Does DWARF for WASM contain enough information to break down the file ranges?

@dschuff
Contributor Author

dschuff commented May 5, 2021

OK, I think I understand what you're saying, but first a question about what I should expect to see and what the numbers should mean.

  1. When I run bloaty against an ELF file with -d compileunits I see something like this:
  57.2%   162Mi  55.2%  6.77Mi    [191 Others]
   4.5%  12.6Mi   6.2%   782Ki    /s/emr/emscripten-releases/binaryen/src/binaryen-c.cpp
...

Since that 12.6Mi size is greater than the whole code section, I'm sure that the number includes the debug info (aggregated across all the different debuginfo sections) for that CU.
Does it also include the portion of the code section that comes from that file?

When I run bloaty on that same ELF binary with -d sections,compileunits I get something like this:

...
   2.7%  7.60Mi  61.9%  7.60Mi    .text
    49.1%  3.73Mi  49.1%  3.73Mi    [156 Others]
     6.6%   517Ki   6.6%   517Ki    /s/emr/emscripten-releases/binaryen/src/binaryen-c.cpp
...

In other words, bloaty can break down the text section based on CU information from the debuginfo. I guess you're saying that bloaty uses the VM addresses (which, now that I think about it, makes sense because address fields in the debug info are VM addresses).
When I run (with my patch) against a wasm binary using -d compileunits or sections,compileunits, the Code section shows up as monolithic. For Wasm, code is not mapped into the same memory space as data (so there's not a unified "VM" space). All the code addresses in debug info are interpreted as offsets into the code section of the binary rather than VM addresses. So I guess the answer to your question is "yes", and we just need to figure out how to make bloaty break down the code section based on section offset rather than VM addresses.
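
As a rough sketch of what "break down the code section based on section offset" could look like, the function below turns a DWARF address range into a file range by adding the code section's position in the file. The names `CodeSection`, `FileRange`, and `DwarfRangeToFileRange` are hypothetical and not part of Bloaty. The sketch also assumes the addresses are relative to the first byte of the section contents; the exact base would need to be checked against the toolchain's output.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of where the Code section's contents sit in the file.
struct CodeSection {
  uint64_t file_offset;  // file offset of the first byte of the section contents
  uint64_t size;         // size of the section contents in bytes
};

struct FileRange {
  uint64_t offset;
  uint64_t size;
};

// Translates a DWARF [address, address + size) range, interpreted as an
// offset into the code section, into a file range. Returns false (and leaves
// *out untouched) if the range does not fit inside the section.
bool DwarfRangeToFileRange(const CodeSection& code, uint64_t address,
                           uint64_t size, FileRange* out) {
  assert(out != nullptr);
  if (address > code.size || size > code.size - address) {
    return false;
  }
  // The sum cannot exceed file_offset + size of a valid section, but guard
  // against a malformed section description anyway.
  assert(code.file_offset <= UINT64_MAX - address);
  out->offset = code.file_offset + address;
  out->size = size;
  return true;
}
```

Ranges that fall outside the section are rejected instead of clamped, so that bad debug info shows up as missing coverage and not as misattributed bytes.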

@dschuff
Contributor Author

dschuff commented Jan 7, 2022

I took another crack at this. Here I made the "VM" space represent just space in the code section. It's a break from what "VM" is supposed to mean (particularly for wasm, which doesn't map the code section at runtime at all). But setting up a VM mapping this way does make DWARF "just work" because the VM addresses in DWARF represent code section offsets. It also has another nice property when just using the wasm "name" section (which is used by bloaty's symbol sink) that it allows you to ignore the space taken up by the name section itself (which will often be stripped out like debug info before shipping).
It is kind of ugly though.

Separately from DWARF, I also thought it might be nice to do sort of the opposite, and make "VM" refer to just the data section, which is actually initialized into memory at runtime (since it's useful to be able to profile bloat in the data section, separately from code bloat). This would repurpose the "VM" idea in a different, incompatible way (which doesn't match the way it's used by DWARF).
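
To make the two incompatible meanings of "VM" concrete, here is a small sketch that computes the VM column under each model. It is a simplification with hypothetical names (`VmModel`, `Section`, `VmSize`, `TotalVmSize`): it treats a whole section as either mapped or not, whereas a real Data section also contains segment headers that are not copied into memory.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// The two candidate meanings of "VM" for a wasm binary discussed above.
enum class VmModel {
  kCodeSection,  // "VM" mirrors the Code section so DWARF addresses resolve
  kDataSection,  // "VM" covers only Data, which is initialized into memory
};

struct Section {
  std::string name;
  uint64_t file_size;
};

// Returns the VM size the report would show for `section` under `model`.
uint64_t VmSize(const Section& section, VmModel model) {
  const char* mapped = (model == VmModel::kCodeSection) ? "Code" : "Data";
  return section.name == mapped ? section.file_size : 0;
}

// Sums the VM sizes of all sections under the chosen model.
uint64_t TotalVmSize(const std::vector<Section>& sections, VmModel model) {
  uint64_t total = 0;
  for (const Section& s : sections) {
    uint64_t vm = VmSize(s, model);
    assert(vm <= UINT64_MAX - total);  // guard against overflow
    total += vm;
  }
  return total;
}
```

Under either model every other section, including the name and debug sections, contributes zero to the VM total, which is why the two choices cannot both be shown in a single VM column.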

@haberman
Member

haberman commented Jan 7, 2022

> Separately from DWARF, I also thought it might be nice to do sort of the opposite, and make "VM" refer to just the data section, which is actually initialized into memory at runtime (since it's useful to be able to profile bloat in the data section, separately from code bloat).

I think this is the right answer.

In Bloaty, "VM" should describe parts of the binary that will be mapped directly into memory at runtime.

It sounds like for WASM, this only applies to data, not code. So VM in WASM will probably want to describe data only.

If we can use DWARF data to refine the "File size" report for the code section, that seems ideal.

By the way, another contributor has recently started adding yaml2obj tests for WASM, which make for much more robust tests: https://github.com/google/bloaty/tree/master/tests/wasm

@dschuff
Contributor Author

dschuff commented Jan 7, 2022

> > Separately from DWARF, I also thought it might be nice to do sort of the opposite, and make "VM" refer to just the data section, which is actually initialized into memory at runtime (since it's useful to be able to profile bloat in the data section, separately from code bloat).

> I think this is the right answer.

> In Bloaty, "VM" should describe parts of the binary that will be mapped directly into memory at runtime.

> It sounds like for WASM, this only applies to data, not code. So VM in WASM will probably want to describe data only.

Yes, this does make sense to me, but...

> If we can use DWARF data to refine the "File size" report for the code section, that seems ideal.

I don't actually know how to do this without going and changing all the DWARF parsing code, since it wants to add VM ranges. Maybe we could somehow actually have 2 separate "VM" address spaces? One for data, and one for the code section (which could then be displayed separately, or used to refine the file size report, or something)?

> By the way, another contributor has recently started adding yaml2obj tests for WASM, which make for much more robust tests: https://github.com/google/bloaty/tree/master/tests/wasm

Yes, I used it on my previous PR, and (coming from the LLVM world) it's nice!
Unfortunately IIRC yaml2obj doesn't really have support for DWARF the way it does for the object-file-level constructs, so it won't be significantly better for this use case than just checking in binaries (the debug info sections will just show up in the yaml file as hex-encoded blobs)

@dschuff
Contributor Author

dschuff commented Jan 8, 2022

> > If we can use DWARF data to refine the "File size" report for the code section, that seems ideal.

> I don't actually know how to do this without going and changing all the DWARF parsing code, since it wants to add VM ranges.

Or, to put it another way: without adding the fake VM range backing the code section, as this PR currently does, no file ranges are added for the code section when reading the DWARF sections (the only file ranges that show up are the DWARF sections themselves). But with this PR, the file ranges in the code section seem to be properly accounted for (in addition to the VM ranges that mirror them). So I don't know how to get the file ranges in the code section reported without also having the VM ranges.

@haberman
Member

Yes, I see the conundrum. Bloaty's DWARF support is currently hard-coded to add VM ranges, not file ranges. This is "correct" given the definition of DWARF.

But WASM uses DWARF in a nonstandard way, due to WASM's design which is significantly different than ELF or Mach-O.

What if we added some new functions to RangeSink that take an enum as a parameter, e.g.:

#include <cstdint>

enum class RangeType {
  kVM,    // start/size are VM addresses
  kFile,  // start/size are file offsets
};

class RangeSink {
 public:
  void AddRange(const char* analyzer, RangeType type, uint64_t start, uint64_t size);
};

Then the DWARF routines could take RangeType as a parameter. ELF/Mach-O would pass kVM, while WASM would pass kFile.

Would that work?
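
A minimal sketch of how that proposal might fit together is below. Everything in it is hypothetical and simplified, not Bloaty's real `RangeSink`: the `label` argument, the `CompileUnit` struct, the `AddCompileUnitRanges` helper, and the `base` parameter (0 for ELF/Mach-O VM addresses, the code section's file offset for WASM) are all assumptions added for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class RangeType { kVM, kFile };

struct Range {
  std::string label;
  uint64_t start;
  uint64_t size;
};

// Hypothetical stand-in for a range sink: it just records each range in the
// address space selected by `type`.
class RangeSink {
 public:
  void AddRange(const char* analyzer, RangeType type, const std::string& label,
                uint64_t start, uint64_t size) {
    assert(analyzer != nullptr);
    std::vector<Range>& dest =
        (type == RangeType::kVM) ? vm_ranges_ : file_ranges_;
    dest.push_back(Range{label, start, size});
  }
  const std::vector<Range>& vm_ranges() const { return vm_ranges_; }
  const std::vector<Range>& file_ranges() const { return file_ranges_; }

 private:
  std::vector<Range> vm_ranges_;
  std::vector<Range> file_ranges_;
};

// Hypothetical summary of what the DWARF reader learned about one CU.
struct CompileUnit {
  std::string name;
  uint64_t low_pc;
  uint64_t size;
};

// The DWARF routine no longer assumes VM space: the file-format code that
// calls it chooses the range type and the base to add to each address.
void AddCompileUnitRanges(const std::vector<CompileUnit>& units,
                          RangeType type, uint64_t base, RangeSink* sink) {
  for (const CompileUnit& cu : units) {
    sink->AddRange("dwarf_compileunit", type, cu.name, base + cu.low_pc,
                   cu.size);
  }
}
```

In this shape the DWARF parsing code stays format-agnostic, and only the caller knows whether an address means a VM address or an offset into the code section.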

> Unfortunately IIRC yaml2obj doesn't really have support for DWARF the way it does for the object-file-level constructs

I think it does; see the tests in tests/dwarf, which use DWARF: sections in their YAML files.

I don't know if this works in WASM though...
