Fuzz target generation using LLMs

Read our announcement blog.

Background

OSS-Fuzz performs continuous fuzzing of 1000+ open source projects across most major languages. To integrate a new project, a human typically analyzes the attack surface of a library and writes fuzz targets (also called fuzzing harnesses) to exercise the relevant code. Linked with a fuzzing engine (e.g. libFuzzer, AFL, Centipede), this enables coverage-guided fuzzing for all OSS-Fuzz projects. Depending on the complexity of the project, writing fuzz targets typically requires several hours of manual work and sufficient background knowledge of the project.

Additionally, the main challenge for most integrated OSS-Fuzz projects is ensuring high code coverage. Most OSS-Fuzz projects have fairly low runtime coverage (~30%) despite millions of hours of CPU time. This means we are not finding vulnerabilities in approximately 70% of the code in each project that we're fuzzing. Our preliminary research found that many fuzz blockers (as determined by FuzzIntrospector) are caused by deficiencies in existing targets, rather than deficiencies in fuzzing engines.

Generating fuzz targets via LLMs can reduce the manual effort required both to fuzz existing OSS-Fuzz projects more thoroughly and to integrate new projects into OSS-Fuzz.

Goals

Our ideal end state of this research is to use LLMs for two use cases:

  1. Completely automatic fuzz target generation (or modification of existing targets) for existing OSS-Fuzz projects to unblock fuzz blockers and increase project code coverage (and bugs found) for free.
  2. Completely automatic fuzz target generation for completely new OSS-Fuzz projects. This is much more challenging than 1, and is an extension of it.

Our current experiments focus on the first use case for C/C++ projects. This report serves as a preliminary investigation into how effective LLMs are for this use case. More detailed results and the experimentation framework for our research will be published at a later date.

Experiment framework

To discover whether an LLM could successfully write new fuzz targets, we built an evaluation framework that connects OSS-Fuzz to Google's LLMs, conducts the experiment, and evaluates the results. The steps look like this:

![experiment framework]({{ site.baseurl }}/images/llm_framework.png "image_tooltip")

  1. OSS-Fuzz's Fuzz Introspector tool identifies an under-fuzzed, high-potential portion of the target project's code and passes the code to the evaluation framework.
  2. The evaluation framework creates a prompt that the LLM will use to write the new fuzz target. The prompt includes project specific information.
  3. The evaluation framework takes the fuzz target generated by the LLM and runs the new target.
  4. The evaluation framework observes the run for any change in code coverage or crashes.
  5. In the event that the fuzz target fails to compile, the evaluation framework prompts the LLM to write a revised fuzz target that addresses the compilation errors.

1. Identifying high-potential portions of the project's code

We leverage Fuzz Introspector (example JSON endpoint) to provide us with a list of functions with low runtime coverage (but high potential to reach more code coverage). These are turned into benchmark YAML files, which consist of an OSS-Fuzz project, and a list of function signatures to generate new targets for.

We have started with a small set of benchmarks, and will gradually scale this to larger, automated sets of benchmarks taken from all of OSS-Fuzz as we improve the function selection and prompt generation process.

Example benchmark (YAML):

functions:
- XML_Parser XMLCALL XML_ExternalEntityParserCreate(XML_Parser oldParser, const XML_Char
  *context, const XML_Char *encodingName)
- XML_Parser XMLCALL XML_ParserCreateNS(const XML_Char *encodingName, XML_Char nsSep)
- XML_Bool XMLCALL XML_ParserReset(XML_Parser parser, const XML_Char *encodingName)
- static enum XML_Error PTRCALL externalParEntInitProcessor(XML_Parser parser, const
  char *s, const char *end, const char **nextPtr)
project: expat
target_path: /src/expat/expat/fuzz/xml_parse_fuzzer.c
target_name: xml_parse_fuzzer_UTF-8
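
As a rough illustration of the selection step, the sketch below filters per-function records for low runtime coverage and high potential, then assembles a dictionary with the same keys as the benchmark above. The record fields, the thresholds, the coverage numbers, and the helper name `build_benchmark` are all hypothetical; this is not Fuzz Introspector's actual schema or our selection logic.

```python
from dataclasses import dataclass


@dataclass
class FunctionRecord:
    """Hypothetical per-function record; not Fuzz Introspector's real schema."""
    signature: str
    runtime_coverage_percent: float  # share of the function's lines hit at runtime
    reachable_lines: int             # lines reachable from the function (its "potential")


def build_benchmark(project, target_path, target_name, records,
                    max_coverage=20.0, min_reachable=100, limit=4):
    """Pick under-covered, high-potential functions and build a benchmark dict."""
    candidates = [
        r for r in records
        if r.runtime_coverage_percent <= max_coverage
        and r.reachable_lines >= min_reachable
    ]
    # Prefer the functions that could unlock the most code.
    candidates.sort(key=lambda r: r.reachable_lines, reverse=True)
    return {
        "functions": [r.signature for r in candidates[:limit]],
        "project": project,
        "target_path": target_path,
        "target_name": target_name,
    }


# Made-up coverage numbers, for illustration only.
records = [
    FunctionRecord("XML_Bool XMLCALL XML_ParserReset(XML_Parser parser, "
                   "const XML_Char *encodingName)", 0.0, 450),
    FunctionRecord("XML_Parser XMLCALL XML_ParserCreateNS(const XML_Char "
                   "*encodingName, XML_Char nsSep)", 5.0, 900),
    FunctionRecord("int already_well_covered(void)", 95.0, 1200),
]
benchmark = build_benchmark(
    "expat", "/src/expat/expat/fuzz/xml_parse_fuzzer.c",
    "xml_parse_fuzzer_UTF-8", records)
```

Sorting by potential rather than by coverage alone is one possible design choice: a function with zero coverage but little reachable code would add little even with a perfect target.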

2. Prompt generation

We dynamically generate a prompt based on a template (example).

As part of our experimentation, we tried various prompt approaches. So far, the best results have come from including:

  • One example of an existing function signature and fuzz target from the project under test, formatted into a problem and solution structure. Too many examples yield worse results.
  • Two examples from other projects in OSS-Fuzz, formatted in the same way.
  • Examples of how to leverage FuzzedDataProvider to generate inputs for function arguments.
  • A priming that gives the task context.
  • Examples of code anti-patterns to avoid.
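
A minimal sketch of how these ingredients might be assembled into one prompt is shown below. The priming text, the `Problem:`/`Solution:` labels, the example signatures, and the function names are illustrative assumptions, not the template we actually use.

```python
def format_example(signature, target_source):
    """Render one example in a problem/solution structure."""
    return f"Problem:\n{signature}\nSolution:\n{target_source}\n"


def build_prompt(priming, project_example, other_examples, anti_patterns, signature):
    """Assemble a prompt from the ingredients listed above."""
    parts = [priming.strip(), ""]
    # Examples from other projects first, then the one from the project under test.
    for sig, src in other_examples + [project_example]:
        parts.append(format_example(sig, src))
    parts.append("Avoid the following:")
    parts.extend(f"- {item}" for item in anti_patterns)
    parts.append("")
    # The final problem is left unsolved for the model to complete.
    parts.append(f"Problem:\n{signature}\nSolution:")
    return "\n".join(parts)


# Hypothetical example target that shows FuzzedDataProvider usage.
EXAMPLE_TARGET = (
    "#include <fuzzer/FuzzedDataProvider.h>\n"
    "extern \"C\" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {\n"
    "  FuzzedDataProvider fdp(data, size);\n"
    "  int depth = fdp.ConsumeIntegral<int>();\n"
    "  std::string text = fdp.ConsumeRemainingBytesAsString();\n"
    "  parse_with_depth(text.c_str(), depth);\n"
    "  return 0;\n"
    "}"
)

prompt = build_prompt(
    priming="You are writing libFuzzer fuzz targets in C++.",  # illustrative wording
    project_example=("int parse_with_depth(const char *s, int depth)", EXAMPLE_TARGET),
    other_examples=[
        ("int other_a(const char *s)", "/* target from another project */"),
        ("int other_b(const char *s)", "/* target from another project */"),
    ],
    anti_patterns=["Do not leave function arguments uninitialized."],
    signature="XML_Bool XMLCALL XML_ParserReset(XML_Parser parser, "
              "const XML_Char *encodingName)",
)
```

Ending the prompt on an open `Solution:` mirrors the problem and solution structure of the examples, so the model's completion is the new target.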

The dynamically generated sections today include examples of existing fuzz targets from both other projects on OSS-Fuzz as well as one example from the project under test. We have other unexplored ideas including more structured information about the function under test, such as:

  • Relevant data structure definitions
  • Function implementations of the function under test and related functions
  • Usages of the function under test and related functions

3. Build and run

We leverage the OSS-Fuzz build infrastructure to build new targets by replacing an existing target's source code with the newly generated target source code.

OSS-Fuzz projects often have strict compiler flags on by default. To make compilation easier, we also implemented a compiler wrapper that:

  • Turns off compiler warnings to prevent trivial issues such as missing pointer casts from blocking compilation.
  • Re-compiles targets as C++ (to leverage FuzzedDataProvider).
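
The argument rewriting such a wrapper performs could look roughly like the sketch below. The specific flags are our assumption for illustration (`-w` to suppress warnings, dropping `-Werror`, and `-x c++` to force the language); the function name is hypothetical and this is not the wrapper's actual code.

```python
def rewrite_compiler_args(args, target_source):
    """Return a relaxed argument list for one compiler invocation."""
    rewritten = []
    for arg in args:
        # Drop flags that turn warnings into hard errors.
        if arg == "-Werror" or arg.startswith("-Werror="):
            continue
        if arg == target_source:
            # Treat the generated target as C++ so it can use FuzzedDataProvider,
            # then reset the language for any input files that follow.
            rewritten += ["-x", "c++", arg, "-x", "none"]
        else:
            rewritten.append(arg)
    # -w suppresses all warning diagnostics in clang and gcc.
    return ["-w"] + rewritten


relaxed = rewrite_compiler_args(
    ["-Wall", "-Werror", "-c", "fuzz.c", "-o", "fuzz.o"], "fuzz.c")
```

A real wrapper would likely need to handle more cases, such as C-only flags (for example `-std=c11`) that a C++ compilation rejects.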

4. Measuring quality of generated targets

An important part of our research is to define metrics to measure the quality of generated targets.

These metrics are:

  • Syntax correctness and project consistency. This is measured by the compilation result: whether the target compiles successfully and whether it calls the project's functions correctly, without hallucinated APIs.
  • Whether it crashes instantly or within the fuzz target itself. Such crashes often mean that an API was called incorrectly, so they are likely to be false positives.
  • New code coverage. This is measured by the new lines it covered compared to all existing targets in OSS-Fuzz for the same project.

All of these metrics can be automatically computed for a given generated target.
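
As an illustration, the three metrics could be computed from a run summary along the following lines. The `RunResult` fields, the one-second threshold for an "instant" crash, and the choice of denominator for the coverage percentage are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical summary of one build-and-run attempt."""
    compiled: bool
    crashed: bool
    seconds_to_crash: float   # only meaningful when crashed is True
    covered_lines: frozenset  # (file, line) pairs hit by the new target


def evaluate(result, existing_covered, total_lines, instant_crash_seconds=1.0):
    """Compute the three metrics for one generated target."""
    if not result.compiled:
        return {"compiles": False, "instant_crash": False,
                "new_line_coverage_percent": 0.0}
    instant = result.crashed and result.seconds_to_crash <= instant_crash_seconds
    # Only lines that no existing target already reaches count as new coverage.
    new_lines = result.covered_lines - existing_covered
    return {
        "compiles": True,
        "instant_crash": instant,
        "new_line_coverage_percent": 100.0 * len(new_lines) / total_lines,
    }


existing = frozenset({("parser.c", 10), ("parser.c", 11)})
run = RunResult(
    compiled=True, crashed=False, seconds_to_crash=0.0,
    covered_lines=frozenset({("parser.c", 11), ("parser.c", 40),
                             ("parser.c", 41), ("parser.c", 42)}))
metrics = evaluate(run, existing, total_lines=200)
```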

5. LLM Code Fixer

The fuzz targets generated by the LLM often contain various trivial defects, which can be fixed by a separate LLM query.

The prompt of the code fixing query is structured as follows, where the raw code and error are respectively replaced with the fuzz target source code generated by the LLM and the build error messages extracted from pages of build logs:

Given the following code and its build error message, fix the code without affecting its functionality.
First explain the reason, then output the whole fixed code.
If a function is missing, fix it by including the related libraries.
Code:
```
{raw_code}
```
Build error message:
```
{error}
```
Fixed code:

Some cases require several rounds of code-fixing queries. For example, when multiple defects produce several error messages, the LLM sometimes fixes only one of them at a time. Similarly, new defects may be introduced during code fixing. In these cases, we found that iteratively querying the LLM with the same prompt structure gradually fixes all errors.

The LLM often proposes several responses for each query; we prefer the one with the longest code. This is an implementation decision to avoid a quadratically increasing number of targets to build (e.g. the LLM could propose 4 new targets across N iterations) and to avoid the LLM deleting the function code to fix build failures.

Additionally, we also check that the generated target includes a call to the requested function to test. If it does not, this is surfaced as an error to the LLM.
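
The fixing procedure described in this section can be sketched as a small loop. Here `query_llm` and `build` are hypothetical callables supplied by the caller: `query_llm(prompt)` is assumed to return a list of candidate code strings with explanations already stripped, and `build(code)` is assumed to return `None` on success or the extracted error text on failure. The substring check for the function call and the round limit are simplifications.

```python
FENCE = "`" * 3  # built indirectly so this snippet can itself sit in a fenced block

FIX_TEMPLATE = "\n".join([
    "Given the following code and its build error message, "
    "fix the code without affecting its functionality.",
    "First explain the reason, then output the whole fixed code.",
    "If a function is missing, fix it by including the related libraries.",
    "Code:", FENCE, "{raw_code}", FENCE,
    "Build error message:", FENCE, "{error}", FENCE,
    "Fixed code:",
])


def fix_until_it_builds(code, function_name, query_llm, build, max_rounds=5):
    """Iteratively ask the LLM to fix build errors; return fixed code or None."""
    for _ in range(max_rounds):
        error = build(code)
        # A target that never calls the requested function is also an error
        # (naive substring check, for illustration only).
        if error is None and function_name + "(" not in code:
            error = f"The fuzz target does not call {function_name}."
        if error is None:
            return code
        prompt = FIX_TEMPLATE.format(raw_code=code, error=error)
        # Keep only the longest candidate so one target is built per round.
        code = max(query_llm(prompt), key=len)
    return None


# Stub builder and stub LLM standing in for the real infrastructure.
def fake_build(code):
    if "#include <stdint.h>" in code:
        return None
    return "error: unknown type name 'uint8_t'"


BROKEN = ("int LLVMFuzzerTestOneInput(const uint8_t *d, size_t n) "
          "{ parse(d, n); return 0; }")


def fake_llm(prompt):
    return ["", "#include <stddef.h>\n#include <stdint.h>\n" + BROKEN]


fixed = fix_until_it_builds(BROKEN, "parse", fake_llm, fake_build)
```

Returning `None` after `max_rounds` keeps a target that never builds from consuming unbounded queries.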

Example

Prompt: Incorrect target with missing arguments passed to target function.

After fix: Correct function argument added.

Results

Initially, getting any compilable output was a challenge. Prompt engineering and our compiler wrapper improved this to the point where 14 of the 31 tested OSS-Fuzz projects successfully compiled new targets and increased coverage. The successful examples and prompts are published here.

Coverage improvements vary widely, ranging from 0% to 31% increases in code coverage.

The top coverage increases, aggregated across all benchmarks per OSS-Fuzz project are:

| Project | Coverage increase |
| --- | --- |
| tinyxml2 | 31% |
| cjson | 6% |
| expat | 4% |
| libplist | 4% |
| libxml2 | 1% |
| elfutils | 1% |

The best result we've had is with the TinyXML2 project, where we managed to increase fuzz coverage from 38% line coverage to 69% line coverage without any interventions.

![alt_text]({{ site.baseurl }}/images/tinyxml2_examples.png "image_tooltip")

Additionally, we targeted OpenSSL from the perspective of discovering past vulnerabilities that were not found due to lack of fuzzing coverage. We were able to replicate a similar fuzz target that rediscovered CVE-2022-3602.

![alt_text]({{ site.baseurl }}/images/punycode.png "Stacktrace from LLM-generated target finding CVE-2022-3602")

Future work

We've seen very promising early results in this space and will continue our research.

1. Continue research

There are a number of areas we'd like to research further:

  • Expand benchmarks. We'd like to expand the set of benchmarks to cover all of OSS-Fuzz.
  • Continued prompt engineering and experimentation with project-specific context, such as more structured context (e.g. structure definitions, implementations) around the relevant function to test.
  • Model fine-tuning
  • Expand to other languages beyond C/C++
  • Expand research to completely new projects with no existing OSS-Fuzz integration.

2. Open source evaluation framework

We plan to open source the evaluation framework we've built to help test arbitrary automatic fuzz target generation capabilities. We hope that OSS-Fuzz can serve as a valuable benchmarking platform for researchers in this space.

3. OSS-Fuzz integration

Ultimately, the goal is to integrate the results of this research into OSS-Fuzz, to provide:

  • Free coverage increases for existing projects
  • Automated onboarding of new projects, and tools to help maintainers write manual fuzz targets.

Appendix

Successful benchmark results

| Project | Function | Output | Build rate | Max Coverage | Max Line coverage diff | Reports |
| --- | --- | --- | --- | --- | --- | --- |
| tinyxml2 | tinyxml2-xmldocument-print | Prompt; Fixes; Targets. | 50 | 29.74 | 11.16 | Reports |
| tinyxml2 | tinyxml2-xmldocument-deepcopy | Prompt; Fixes; Targets. | 25 | 26.8 | 4.45 | Reports |
| tinyxml2 | tinyxml2-xmlelement-setattribute | Prompt; Fixes; Targets. | 75 | 26.08 | 3.77 | Reports |
| libplist | plist_print | Prompt; Fixes; Targets. | 25 | 12.88 | 3.42 | Reports |
| tinyxml2 | tinyxml2-xmlelement-doubletext | Prompt; Fixes; Targets. | 62.5 | 25.61 | 3.28 | Reports |
| tinyxml2 | tinyxml2-xmlelement-booltext | Prompt; Fixes; Targets. | 87.5 | 26.18 | 2.9 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewunknown | Prompt; Fixes; Targets. | 25 | 25.67 | 2.64 | Reports |
| tinyxml2 | tinyxml2-xmlelement-int64text | Prompt; Fixes; Targets. | 87.5 | 25.91 | 2.64 | Reports |
| cjson | cjson_compare | Prompt; Fixes; Targets. | 75 | 29.68 | 2.47 | Reports |
| tinyxml2 | tinyxml2-xmlelement-floattext | Prompt; Fixes; Targets. | 62.5 | 25.09 | 2.45 | Reports |
| tinyxml2 | tinyxml2-xmlelement-inttext | Prompt; Fixes; Targets. | 75 | 26.2 | 2.41 | Reports |
| tinyxml2 | tinyxml2-xmlelement-unsigned64text | Prompt; Fixes; Targets. | 37.5 | 25.74 | 2.22 | Reports |
| tinyobjloader | tinyobj-objreader-parsefromfile | Prompt; Fixes; Targets. | 37.5 | 5.7 | 2.16 | Reports |
| tinyxml2 | tinyxml2-xmlelement-unsignedtext | Prompt; Fixes; Targets. | 50 | 25.53 | 2.15 | Reports |
| tinyxml2 | tinyxml2-xmlelement-shallowclone | Prompt; Fixes; Targets. | 50 | 25.03 | 2.07 | Reports |
| cjson | cjson_replaceiteminobject | Prompt; Fixes; Targets. | 37.5 | 27.56 | 1.98 | Reports |
| tinyxml2 | tinyxml2-xmlelement-gettext | Prompt; Fixes; Targets. | 37.5 | 25.42 | 1.96 | Reports |
| tinyxml2 | tinyxml2-xmlelement-shallowequal | Prompt; Fixes; Targets. | 62.5 | 25.58 | 1.96 | Reports |
| cjson | cjson_duplicate | Prompt; Fixes; Targets. | 62.5 | 27.4 | 1.89 | Reports |
| expat | xml_externalentityparsercreate | Prompt; Fixes; Targets. | 12.5 | 1.25 | 1.88 | Reports |
| cjson | cjson_replaceiteminobjectcasesensitive | Prompt; Fixes; Targets. | 87.5 | 25.54 | 1.85 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewcomment | Prompt; Fixes; Targets. | 62.5 | 25.61 | 1.85 | Reports |
| expat | xml_parsercreatens | Prompt; Fixes; Targets. | 12.5 | 45.6 | 1.84 | Reports |
| tinyxml2 | tinyxml2-xmlelement-deleteattribute | Prompt; Fixes; Targets. | 50 | 25.73 | 1.7 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewdeclaration | Prompt; Fixes; Targets. | 50 | 25 | 1.7 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewchildelement | Prompt; Fixes; Targets. | 50 | 24.88 | 1.51 | Reports |
| tinyxml2 | tinyxml2-xmlnode-previoussiblingelement | Prompt; Fixes; Targets. | 87.5 | 26.08 | 1.43 | Reports |
| tinyobjloader | tinyobj-loadobj | Prompt; Fixes; Targets. | 25 | 4.33 | 1.35 | Reports |
| tinyobjloader | tinyobj-material_t-material_t | Prompt; Fixes; Targets. | 25 | 23.91 | 1.35 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewtext | Prompt; Fixes; Targets. | 62.5 | 24.42 | 1.32 | Reports |
| elfutils | dwfl_module_relocate_address | Prompt; Fixes; Targets. | 87.5 | 7.81 | 1.1 | Reports |
| libxml2 | xmlschemavalidatefile | Prompt; Fixes; Targets. | 50 | 4.18 | 0.93 | Reports |
| tinyxml2 | tinyxml2-xmldocument-loadfile | Prompt; Fixes; Targets. | 25 | 2.17 | 0.9 | Reports |
| speex | ogg_stream_packetin | Prompt; Fixes; Targets. | 25 | 8.06 | 0.55 | Reports |
| libxml2 | xmltextreadersetschema | Prompt; Fixes; Targets. | 37.5 | 1.44 | 0.48 | Reports |
| libxml2 | xmltextreaderschemavalidate | Prompt; Fixes; Targets. | 37.5 | 4.03 | 0.36 | Reports |
| libplist | plist_dict_merge | Prompt; Fixes; Targets. | 12.5 | 0.8 | 0.35 | Reports |
| libxml2 | xmltextreaderschemavalidatectxt | Prompt; Fixes; Targets. | 25 | 1.19 | 0.33 | Reports |
| cjson | cjson_printpreallocated | Prompt; Fixes; Targets. | 75 | 26.44 | 0.22 | Reports |
| libucl | ucl_parser_insert_chunk | Prompt; Fixes; Targets. | 25 | 10.5 | 0.22 | Reports |
| libucl | ucl_object_compare | Prompt; Fixes; Targets. | 12.5 | 1.35 | 0.21 | Reports |
| elfutils | dwarf_getlocations | Prompt; Fixes; Targets. | 62.5 | 7.22 | 0.2 | Reports |
| libucl | ucl_parser_add_fd_priority | Prompt; Fixes; Targets. | 71.43 | 19.61 | 0.18 | Reports |
| jsoncpp | json-value-resize | Prompt; Fixes; Targets. | 37.5 | 2.23 | 0.17 | Reports |
| mosquitto | mosquitto_topic_matches_sub2 | Prompt; Fixes; Targets. | 87.5 | 28.28 | 0.16 | Reports |
| xvid | xvid_encore | Prompt; Fixes; Targets. | 31.25 | 0.22 | 0.16 | Reports |
| cjson | cjson_parse | Prompt; Fixes; Targets. | 75 | 25.72 | 0.13 | Reports |
| cjson | cjson_parsewithlength | Prompt; Fixes; Targets. | 62.5 | 25.72 | 0.13 | Reports |
| jsoncpp | json-value-removeindex | Prompt; Fixes; Targets. | 25 | 5.11 | 0.13 | Reports |
| libucl | ucl_object_merge | Prompt; Fixes; Targets. | 25 | 0.65 | 0.12 | Reports |
| speex | ogg_stream_pageout_fill | Prompt; Fixes; Targets. | 6.25 | 0 | 0.07 | Reports |
| libsndfile | sf_command | Prompt; Fixes; Targets. | 25 | 3.41 | 0.06 | Reports |
| mosquitto | mosquitto_topic_matches_sub | Prompt; Fixes; Targets. | 62.5 | 4.03 | 0.05 | Reports |
| libsndfile | sf_format_check | Prompt; Fixes; Targets. | 12.5 | 0.1 | 0.04 | Reports |
| libucl | ucl_object_replace_key | Prompt; Fixes; Targets. | 50 | 7.14 | 0.04 | Reports |
| libdwarf | dwarf_find_die_given_sig8 | Prompt; Fixes; Targets. | 37.5 | 11.62 | 0.01 | Reports |