fuzzer: excersizes query code on xml doc #551

nathaniel-brough · 2023-03-05T23:23:26Z

This change increases code coverage significantly during fuzzing as it excersizes the query code against an arbitrary file. I've also updated the fuzzer to use LLVM's FuzzedDataProvider which makes the fuzzer a little easier to understand/reason about.

zeux · 2023-03-06T04:34:52Z

It would be better if this change didn’t change pugixml.hpp; if the enumeration gets extended it will require changes in fuzzing code anyway so it seems better to explicitly switch/case or hardcode boolean as last enum member in the fuzzing harness.

Also the change should be converted to Allman brace style to fit the convention in the rest of the code.

nathaniel-brough

if the enumeration gets extended it will require changes in fuzzing code

I might be missing something, but it was my intention that this would not be required. As long as kMaxValue is aliased to the last element in the enum it shouldn't require updates to the fuzzer (see comments on ConsumeEnum).

I'm more than happy to refactor all of this slightly to not require changes to the main library header files. Although the alternatives are slightly less readable, and would definitely require updates to the fuzzer if the enum is extended.

nathaniel-brough · 2023-03-07T02:29:09Z

tests/fuzz_xpath.cpp

+    std::vector<std::string> var_name_storage = {};
+    for (size_t i = 0; i < var_count; i++) {
+      var_name_storage.push_back(fdp.ConsumeRandomLengthString(128));
+      vars.add(var_name_storage.back().c_str(), fdp.ConsumeEnum<pugi::xpath_value_type>());


The kMaxValue (modification to pugixml.hpp) is required ConsumeEnum to work correctly.

For posterity here is where kMaxValue is actually used by FuzzedDataProvider::ConsumeEnum https://github.com/llvm/llvm-project/blob/main/compiler-rt/include/fuzzer/FuzzedDataProvider.h#L298

nathaniel-brough · 2023-03-07T02:48:34Z

Also the change should be converted to Allman brace style to fit the convention in the rest of the code.

Not a problem, do you mind if I add a .clang-format file as a separate commit with the Allman style braces? This way most editors will autoformat to Allman style rather than some other default.

zeux · 2023-03-07T20:11:49Z

I might be missing something, but it was my intention that this would not be required. As long as kMaxValue is aliased to the last element in the enum it shouldn't require updates to the fuzzer (see comments on ConsumeEnum).

Sorry what I meant is that the fuzzing code right now has explicit calls to evaluate_X for each X, which is of course correct. If new types are added, this will expand the surface of the API which would require changing the fuzzing code anyway. So it seems preferable to avoid changing the header in this case and just hardcode the enum selection via an array lookup or something similar.

As an aside, just out of curiosity, is fuzzing efficiency better/worse/the same when using a specific fixed XML document? (eg some existing .xml file from the repository, or we could add a new one for this specific purpose)

I'm wondering this because it feels like a lot of generated documents will not really be valid or interesting XML, and as such a lot of queries would terminate early or not run at all, whereas if we knew the document was valid and had a variety of nodes with different types and structure, we would be able to directly fuzz the evaluation engine on known-interesting document and unknown query.

nathaniel-brough · 2023-03-10T01:33:17Z

Oh yeah I see what you are saying. I'll make some changes so that the header file isn't modified.

As an aside, just out of curiosity, is fuzzing efficiency better/worse/the same when using a specific fixed XML document? (eg some existing .xml file from the repository, or we could add a new one for this specific purpose)

The overall performance increase in code coverage >20%. But it's worth noting that this is using some other changes that I've made in the google/oss-fuzz configuration for this repo (I'll open a PR with oss-fuzz soon).

Concatenate fuzz_parse.dict and fuzz_xpath.dict so that the XML and query components are in the dictionary.
Merge the fuzz_parse and fuzz_xpath seed corpus

I haven't tested this change as standalone (i.e. without corpus/dictionary updates), but with the fairly minimal modifications (mentioned above), it's a fairly decent improvement.

But you raise a valid point that to really get the best performance a good set of handwritten seed corpus xml files would certainly help cover the most code for a query.

nathaniel-brough · 2023-03-10T01:58:33Z

Ok so this PR no longer changes the pugixml.hpp file and instead fuzzes this by casting a Consumed Integral back to the enum type.

You can find the PR with OSS-fuzz I mentioned earlier here google/oss-fuzz#9900.

I'll have a go at writing a few specific entries for the seed corpus later this week. But with the combination of these two PRs and fuzzing for a couple of hours on my laptop I've been able to increase code coverage by a bit over 20%.

Fix code style, no exceptions, other tweaks.

Remove unused variables

zeux · 2023-10-21T19:29:28Z

Thanks!

nathaniel-brough commented Mar 7, 2023

View reviewed changes

nathaniel-brough force-pushed the master branch 2 times, most recently from 8511086 to 0d8c457 Compare March 10, 2023 01:41

tests: excersizes query code on xml doc

9d8d193

nathaniel-brough force-pushed the master branch from 0d8c457 to 9d8d193 Compare March 10, 2023 01:46

zeux mentioned this pull request Mar 15, 2023

pugixml: Update seed-corpus/dict files to match new fuzzer google/oss-fuzz#9900

Merged

zeux added 2 commits October 21, 2023 11:11

Update fuzz_xpath.cpp

d75a081

Fix code style, no exceptions, other tweaks.

Update fuzz_xpath.cpp

4d42ba7

Remove unused variables

zeux merged commit 33f7093 into zeux:master Oct 21, 2023
22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fuzzer: excersizes query code on xml doc #551

fuzzer: excersizes query code on xml doc #551

nathaniel-brough commented Mar 5, 2023 •

edited

Loading

zeux commented Mar 6, 2023

nathaniel-brough left a comment

nathaniel-brough Mar 7, 2023

nathaniel-brough Mar 7, 2023

nathaniel-brough commented Mar 7, 2023

zeux commented Mar 7, 2023

nathaniel-brough commented Mar 10, 2023

nathaniel-brough commented Mar 10, 2023

zeux commented Oct 21, 2023

fuzzer: excersizes query code on xml doc #551

fuzzer: excersizes query code on xml doc #551

Conversation

nathaniel-brough commented Mar 5, 2023 • edited Loading

zeux commented Mar 6, 2023

nathaniel-brough left a comment

Choose a reason for hiding this comment

nathaniel-brough Mar 7, 2023

Choose a reason for hiding this comment

nathaniel-brough Mar 7, 2023

Choose a reason for hiding this comment

nathaniel-brough commented Mar 7, 2023

zeux commented Mar 7, 2023

nathaniel-brough commented Mar 10, 2023

nathaniel-brough commented Mar 10, 2023

zeux commented Oct 21, 2023

nathaniel-brough commented Mar 5, 2023 •

edited

Loading