-
Notifications
You must be signed in to change notification settings - Fork 726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fuzzer: excersizes query code on xml doc #551
Conversation
It would be better if this change didn’t change pugixml.hpp; if the enumeration gets extended it will require changes in fuzzing code anyway so it seems better to explicitly switch/case or hardcode boolean as last enum member in the fuzzing harness. Also the change should be converted to Allman brace style to fit the convention in the rest of the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the enumeration gets extended it will require changes in fuzzing code
I might be missing something, but it was my intention that this would not be required. As long as kMaxValue
is aliased to the last element in the enum it shouldn't require updates to the fuzzer (see comments on ConsumeEnum
).
I'm more than happy to refactor all of this slightly to not require changes to the main library header files. Although the alternatives are slightly less readable, and would definitely require updates to the fuzzer if the enum is extended.
tests/fuzz_xpath.cpp
Outdated
std::vector<std::string> var_name_storage = {}; | ||
for (size_t i = 0; i < var_count; i++) { | ||
var_name_storage.push_back(fdp.ConsumeRandomLengthString(128)); | ||
vars.add(var_name_storage.back().c_str(), fdp.ConsumeEnum<pugi::xpath_value_type>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kMaxValue
(modification to pugixml.hpp
) is required ConsumeEnum
to work correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For posterity here is where kMaxValue
is actually used by FuzzedDataProvider::ConsumeEnum
https://github.com/llvm/llvm-project/blob/main/compiler-rt/include/fuzzer/FuzzedDataProvider.h#L298
Not a problem, do you mind if I add a |
Sorry what I meant is that the fuzzing code right now has explicit calls to evaluate_X for each X, which is of course correct. If new types are added, this will expand the surface of the API which would require changing the fuzzing code anyway. So it seems preferable to avoid changing the header in this case and just hardcode the enum selection via an array lookup or something similar. As an aside, just out of curiosity, is fuzzing efficiency better/worse/the same when using a specific fixed XML document? (eg some existing .xml file from the repository, or we could add a new one for this specific purpose) I'm wondering this because it feels like a lot of generated documents will not really be valid or interesting XML, and as such a lot of queries would terminate early or not run at all, whereas if we knew the document was valid and had a variety of nodes with different types and structure, we would be able to directly fuzz the evaluation engine on known-interesting document and unknown query. |
Oh yeah I see what you are saying. I'll make some changes so that the header file isn't modified.
The overall performance increase in code coverage >20%. But it's worth noting that this is using some other changes that I've made in the google/oss-fuzz configuration for this repo (I'll open a PR with oss-fuzz soon).
I haven't tested this change as standalone (i.e. without corpus/dictionary updates), but with the fairly minimal modifications (mentioned above), it's a fairly decent improvement. But you raise a valid point that to really get the best performance a good set of handwritten seed corpus xml files would certainly help cover the most code for a query. |
8511086
to
0d8c457
Compare
0d8c457
to
9d8d193
Compare
Ok so this PR no longer changes the You can find the PR with OSS-fuzz I mentioned earlier here google/oss-fuzz#9900. I'll have a go at writing a few specific entries for the seed corpus later this week. But with the combination of these two PRs and fuzzing for a couple of hours on my laptop I've been able to increase code coverage by a bit over 20%. |
Fix code style, no exceptions, other tweaks.
Remove unused variables
Thanks! |
This change increases code coverage significantly during fuzzing as it excersizes the query code against an arbitrary file. I've also updated the fuzzer to use LLVM's FuzzedDataProvider which makes the fuzzer a little easier to understand/reason about.