---
layout: default
nav_exclude: true
permalink: /research/llms/target_generation/
---

# Fuzz target generation using LLMs

[Read our announcement blog.](https://security.googleblog.com/2023/08/ai-powered-fuzzing-breaking-bug-hunting.html)

# Background

[OSS-Fuzz](http://github.com/google/oss-fuzz) performs continuous fuzzing of [1000+](https://github.com/google/oss-fuzz/tree/master/projects) open source projects across most major languages. To integrate a new project, a human typically analyzes the attack surface of a library and writes [fuzz targets](https://github.com/google/fuzzing/blob/master/docs/glossary.md#fuzz-target) (also called fuzzing harnesses) to exercise the relevant code. Linked with a [fuzzing engine](https://github.com/google/fuzzing/blob/master/docs/glossary.md#fuzzing-engine) (e.g. libFuzzer, AFL, Centipede), this enables coverage-guided fuzzing for all OSS-Fuzz projects. Depending on the project's complexity, writing fuzz targets typically requires several hours of manual work and sufficient background knowledge of the project.

Beyond initial integration, the main challenge for most OSS-Fuzz projects is achieving high code coverage. Most OSS-Fuzz projects have fairly low runtime coverage ([~30%](http://introspector.oss-fuzz.com/)) despite millions of hours of CPU time. This means we are not finding vulnerabilities in approximately 70% of each project that we're fuzzing. Our [preliminary research](https://dl.acm.org/doi/abs/10.1145/3605157.3605177) found that many [fuzz blockers](https://github.com/ossf/fuzz-introspector/blob/main/doc/Glossary.md#fuzz-blockers) (as determined by [Fuzz Introspector](https://introspector.oss-fuzz.com/)) are caused by deficiencies in existing fuzz targets, rather than deficiencies in fuzzing engines.

Generating fuzz targets via LLMs can reduce the manual effort required to fuzz existing OSS-Fuzz projects more thoroughly, as well as to integrate new projects into OSS-Fuzz.

# Goals

Our ideal end state for this research is to use LLMs for two use cases:

1. Completely automatic fuzz target generation (or modification of existing targets) for existing OSS-Fuzz projects, to unblock fuzz blockers and increase project code coverage (and bugs found) for free.
2. Completely automatic fuzz target generation for completely new OSS-Fuzz projects. This is an extension of the first use case and is significantly more challenging.

Our current experiments focus on the first use case for C/C++ projects. This report serves as a preliminary investigation into how effective LLMs are for this use case. More detailed results and the experimentation framework for our research will be published at a later date.

# Experiment framework

To discover whether an LLM could successfully write new fuzz targets, we built an evaluation framework that connects OSS-Fuzz to Google's LLMs, conducts the experiment, and evaluates the results. The steps look like this:

![experiment framework]({{ site.baseurl }}/images/llm_framework.png "image_tooltip")

1. OSS-Fuzz's [Fuzz Introspector tool](http://introspector.oss-fuzz.com/) identifies an under-fuzzed, high-potential portion of the target project's code and passes the code to the evaluation framework.
2. The evaluation framework creates a prompt that the LLM will use to write the new fuzz target. The prompt includes project-specific information.
3. The evaluation framework takes the fuzz target generated by the LLM and runs the new target.
4. The evaluation framework observes the run for any change in code coverage or crashes.
5. If the fuzz target fails to compile, the evaluation framework prompts the LLM to write a revised fuzz target that addresses the compilation errors.
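For context, the fuzz targets the LLM is asked to produce in step 2 are ordinary libFuzzer-style harnesses. The sketch below is a hypothetical, minimal example (not taken from any OSS-Fuzz project); `ProcessBuffer` is an illustrative stand-in for a library function:

```cpp
#include <cstddef>
#include <cstdint>

// Toy stand-in for a library function under test (hypothetical, for illustration only).
static int ProcessBuffer(const uint8_t *data, size_t size) {
  // Pretend to parse a small header.
  if (size >= 4 && data[0] == 'F' && data[1] == 'U' && data[2] == 'Z' && data[3] == 'Z') {
    return 1;
  }
  return 0;
}

// A fuzz target is a single entry point that the fuzzing engine
// (e.g. libFuzzer) calls repeatedly with mutated inputs.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  ProcessBuffer(data, size);
  return 0;
}
```

Compiled with something like `clang++ -fsanitize=fuzzer,address target.cpp`, this yields a coverage-guided fuzzer binary; in OSS-Fuzz, the build and engine linkage are handled by each project's existing build scripts.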
## 1. Identifying high potential portions of the project's code

We leverage [Fuzz Introspector](https://introspector.oss-fuzz.com/) ([example JSON endpoint](https://introspector.oss-fuzz.com/api/far-reach-but-low-coverage?project=tinyxml2)) to provide us with a list of functions with low runtime coverage (but high potential to reach more code coverage). These are turned into benchmark YAML files, which consist of an OSS-Fuzz project and a list of function signatures to generate new targets for. We have started with a small set of benchmarks, and will gradually scale this to larger, automated sets of benchmarks taken from all of OSS-Fuzz as we improve the function selection and prompt generation process.

Example benchmark (YAML):

```yaml
functions:
- XML_Parser XMLCALL XML_ExternalEntityParserCreate(XML_Parser oldParser, const XML_Char *context, const XML_Char *encodingName)
- XML_Parser XMLCALL XML_ParserCreateNS(const XML_Char *encodingName, XML_Char nsSep)
- XML_Bool XMLCALL XML_ParserReset(XML_Parser parser, const XML_Char *encodingName)
- static enum XML_Error PTRCALL externalParEntInitProcessor(XML_Parser parser, const char *s, const char *end, const char **nextPtr)
project: expat
target_path: /src/expat/expat/fuzz/xml_parse_fuzzer.c
target_name: xml_parse_fuzzer_UTF-8
```

## 2. Prompt generation

We dynamically generate a prompt based on a template ([example](https://storage.googleapis.com/oss-fuzz-llm-targets-public/jsoncpp-json-value-removeindex/prompts.txt)). As part of our experimentation, we tried a variety of prompt approaches. So far, the best results have come from including:

* One example of an existing function signature and fuzz target from the project under test, formatted into a problem-and-solution structure. Including too many examples yields worse results.
* Two examples from other projects in OSS-Fuzz, formatted in the same way.
* Examples of how to leverage [FuzzedDataProvider](https://github.com/google/fuzzing/blob/master/docs/split-inputs.md#fuzzed-data-provider) to generate inputs for function arguments (a minimal sketch of this pattern appears at the end of this section).
* A priming statement that gives the task context.
* Examples of code anti-patterns to avoid.

The dynamically generated sections today include examples of existing fuzz targets, both from other projects on OSS-Fuzz and from the project under test. We have other, as yet unexplored, ideas, including providing more structured information about the function under test, such as:

* Relevant data structure definitions
* Implementations of the function under test and related functions
* Usages of the function under test and related functions
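To illustrate the FuzzedDataProvider pattern referenced above, here is a hypothetical, minimal sketch (not taken from any benchmark); `ParseConfig` stands in for a function under test that takes several arguments:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

#include <fuzzer/FuzzedDataProvider.h>

// Toy stand-in for a library function that takes several arguments
// (hypothetical; in real prompts this would be the function under test).
static int ParseConfig(const std::string &config, int flags, bool strict) {
  if (strict && config.empty()) return -1;
  return static_cast<int>(config.size()) ^ flags;
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  FuzzedDataProvider provider(data, size);

  // Split the single fuzzer input into structured arguments.
  int flags = provider.ConsumeIntegralInRange<int>(0, 7);
  bool strict = provider.ConsumeBool();
  std::string config = provider.ConsumeRemainingBytesAsString();

  ParseConfig(config, flags, strict);
  return 0;
}
```

Splitting the single byte stream this way is what lets one fuzzer input drive functions with multiple arguments; because FuzzedDataProvider is a C++ header-only helper, the compiler wrapper described in the next section re-compiles C targets as C++.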
## 3. Build and run

We leverage the OSS-Fuzz build infrastructure to build new targets by replacing an existing target's source code with the newly generated target source code. OSS-Fuzz projects often have strict compiler flags on by default. To make compilation easier, we also implemented a [compiler wrapper](https://github.com/google/oss-fuzz/blob/d78275b4e2e17d3d8f12b99db2b51de4b114edb3/infra/base-images/base-builder/jcc.go) that:

* Turns off compiler warnings to prevent trivial issues such as missing pointer casts from blocking compilation.
* Re-compiles targets as C++ (to leverage [FuzzedDataProvider](https://github.com/google/fuzzing/blob/master/docs/split-inputs.md#fuzzed-data-provider)).

## 4. Measuring quality of generated targets

An important part of our research is defining metrics to measure the quality of generated targets. These metrics are:

* **Syntax correctness and project consistency.** This is measured by the compilation result: for example, whether the target compiles successfully and whether it calls functions in the project correctly, without hallucination.
* **Whether it crashes instantly, or crashes within the fuzz target itself.** This often means an API was called incorrectly, and such crashes are likely to be false positives.
* **New code coverage.** This is measured by the new lines covered compared to all existing targets in OSS-Fuzz for the same project.

All of these metrics can be automatically computed for a given generated target.

## 5. LLM Code Fixer

The fuzz targets generated by the LLM often contain trivial defects, which can be fixed by a separate LLM query. The code-fixing prompt is structured as follows, where `{raw_code}` and `{error}` are replaced with the fuzz target source code generated by the LLM and the build error messages extracted from pages of build logs, respectively:

````
Given the following code and its build error message, fix the code without affecting its functionality. First explain the reason, then output the whole fixed code. If a function is missing, fix it by including the related libraries.

Code:
```
{raw_code}
```

Build error message:
```
{error}
```

Fixed code:
````

Some cases require several rounds of code-fixing queries. For example, when multiple defects produce several error messages, the LLM sometimes fixes only one of them at a time. Similarly, new defects may be introduced during code fixing. In these cases, we found that iteratively querying the LLM with the same prompt structure gradually fixes all errors.

The LLM often proposes several responses for each query; we prefer the one with the longest code. This is an implementation decision to avoid a quadratically increasing number of targets to build (e.g. the LLM could propose 4 new targets across N iterations) and to avoid the LLM deleting function code to fix build failures. Additionally, we check that the generated target includes a call to the requested function under test. If it does not, this is surfaced as an error to the LLM.

### Example

[Prompt](https://storage.googleapis.com/oss-fuzz-llm-targets-public/tinyxml2-tinyxml2-xmldocument-print/fixes/04-F4/prompt.txt): Incorrect target with missing arguments passed to the target function.

[After fix](https://storage.googleapis.com/oss-fuzz-llm-targets-public/tinyxml2-tinyxml2-xmldocument-print/fixes/04-F4/01.cpp): Correct function argument added.

# Results

Initially, getting any compilable output was a challenge. Through prompt engineering and our compiler wrapper, we improved this to the point where [14 of the 31 tested OSS-Fuzz projects](https://storage.googleapis.com/oss-fuzz-llm-targets-public/index.html?prefix=benchmarks/) successfully compiled new targets and increased coverage. The successful examples and prompts are published [here](https://storage.googleapis.com/oss-fuzz-llm-targets-public/index.html).

We see a wide range of coverage improvements, with code coverage increases from 0% to 31%. The top coverage increases, aggregated across all benchmarks per OSS-Fuzz project, are:
| Project | Coverage increase |
| --- | --- |
| tinyxml2 | 31% |
| cjson | 6% |
| expat | 4% |
| libplist | 4% |
| libxml2 | 1% |
| elfutils | 1% |
Per-benchmark results:

| Project | Function | Output | Build rate (%) | Max Coverage (%) | Max Line coverage diff (%) | Reports |
| --- | --- | --- | --- | --- | --- | --- |
| tinyxml2 | tinyxml2-xmldocument-print | Prompt; Fixes; Targets. | 50 | 29.74 | 11.16 | Reports |
| tinyxml2 | tinyxml2-xmldocument-deepcopy | Prompt; Fixes; Targets. | 25 | 26.8 | 4.45 | Reports |
| tinyxml2 | tinyxml2-xmlelement-setattribute | Prompt; Fixes; Targets. | 75 | 26.08 | 3.77 | Reports |
| libplist | plist_print | Prompt; Fixes; Targets. | 25 | 12.88 | 3.42 | Reports |
| tinyxml2 | tinyxml2-xmlelement-doubletext | Prompt; Fixes; Targets. | 62.5 | 25.61 | 3.28 | Reports |
| tinyxml2 | tinyxml2-xmlelement-booltext | Prompt; Fixes; Targets. | 87.5 | 26.18 | 2.9 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewunknown | Prompt; Fixes; Targets. | 25 | 25.67 | 2.64 | Reports |
| tinyxml2 | tinyxml2-xmlelement-int64text | Prompt; Fixes; Targets. | 87.5 | 25.91 | 2.64 | Reports |
| cjson | cjson_compare | Prompt; Fixes; Targets. | 75 | 29.68 | 2.47 | Reports |
| tinyxml2 | tinyxml2-xmlelement-floattext | Prompt; Fixes; Targets. | 62.5 | 25.09 | 2.45 | Reports |
| tinyxml2 | tinyxml2-xmlelement-inttext | Prompt; Fixes; Targets. | 75 | 26.2 | 2.41 | Reports |
| tinyxml2 | tinyxml2-xmlelement-unsigned64text | Prompt; Fixes; Targets. | 37.5 | 25.74 | 2.22 | Reports |
| tinyobjloader | tinyobj-objreader-parsefromfile | Prompt; Fixes; Targets. | 37.5 | 5.7 | 2.16 | Reports |
| tinyxml2 | tinyxml2-xmlelement-unsignedtext | Prompt; Fixes; Targets. | 50 | 25.53 | 2.15 | Reports |
| tinyxml2 | tinyxml2-xmlelement-shallowclone | Prompt; Fixes; Targets. | 50 | 25.03 | 2.07 | Reports |
| cjson | cjson_replaceiteminobject | Prompt; Fixes; Targets. | 37.5 | 27.56 | 1.98 | Reports |
| tinyxml2 | tinyxml2-xmlelement-gettext | Prompt; Fixes; Targets. | 37.5 | 25.42 | 1.96 | Reports |
| tinyxml2 | tinyxml2-xmlelement-shallowequal | Prompt; Fixes; Targets. | 62.5 | 25.58 | 1.96 | Reports |
| cjson | cjson_duplicate | Prompt; Fixes; Targets. | 62.5 | 27.4 | 1.89 | Reports |
| expat | xml_externalentityparsercreate | Prompt; Fixes; Targets. | 12.5 | 1.25 | 1.88 | Reports |
| cjson | cjson_replaceiteminobjectcasesensitive | Prompt; Fixes; Targets. | 87.5 | 25.54 | 1.85 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewcomment | Prompt; Fixes; Targets. | 62.5 | 25.61 | 1.85 | Reports |
| expat | xml_parsercreatens | Prompt; Fixes; Targets. | 12.5 | 45.6 | 1.84 | Reports |
| tinyxml2 | tinyxml2-xmlelement-deleteattribute | Prompt; Fixes; Targets. | 50 | 25.73 | 1.7 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewdeclaration | Prompt; Fixes; Targets. | 50 | 25 | 1.7 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewchildelement | Prompt; Fixes; Targets. | 50 | 24.88 | 1.51 | Reports |
| tinyxml2 | tinyxml2-xmlnode-previoussiblingelement | Prompt; Fixes; Targets. | 87.5 | 26.08 | 1.43 | Reports |
| tinyobjloader | tinyobj-loadobj | Prompt; Fixes; Targets. | 25 | 4.33 | 1.35 | Reports |
| tinyobjloader | tinyobj-material_t-material_t | Prompt; Fixes; Targets. | 25 | 23.91 | 1.35 | Reports |
| tinyxml2 | tinyxml2-xmlelement-insertnewtext | Prompt; Fixes; Targets. | 62.5 | 24.42 | 1.32 | Reports |
| elfutils | dwfl_module_relocate_address | Prompt; Fixes; Targets. | 87.5 | 7.81 | 1.1 | Reports |
| libxml2 | xmlschemavalidatefile | Prompt; Fixes; Targets. | 50 | 4.18 | 0.93 | Reports |
| tinyxml2 | tinyxml2-xmldocument-loadfile | Prompt; Fixes; Targets. | 25 | 2.17 | 0.9 | Reports |
| speex | ogg_stream_packetin | Prompt; Fixes; Targets. | 25 | 8.06 | 0.55 | Reports |
| libxml2 | xmltextreadersetschema | Prompt; Fixes; Targets. | 37.5 | 1.44 | 0.48 | Reports |
| libxml2 | xmltextreaderschemavalidate | Prompt; Fixes; Targets. | 37.5 | 4.03 | 0.36 | Reports |
| libplist | plist_dict_merge | Prompt; Fixes; Targets. | 12.5 | 0.8 | 0.35 | Reports |
| libxml2 | xmltextreaderschemavalidatectxt | Prompt; Fixes; Targets. | 25 | 1.19 | 0.33 | Reports |
| cjson | cjson_printpreallocated | Prompt; Fixes; Targets. | 75 | 26.44 | 0.22 | Reports |
| libucl | ucl_parser_insert_chunk | Prompt; Fixes; Targets. | 25 | 10.5 | 0.22 | Reports |
| libucl | ucl_object_compare | Prompt; Fixes; Targets. | 12.5 | 1.35 | 0.21 | Reports |
| elfutils | dwarf_getlocations | Prompt; Fixes; Targets. | 62.5 | 7.22 | 0.2 | Reports |
| libucl | ucl_parser_add_fd_priority | Prompt; Fixes; Targets. | 71.43 | 19.61 | 0.18 | Reports |
| jsoncpp | json-value-resize | Prompt; Fixes; Targets. | 37.5 | 2.23 | 0.17 | Reports |
| mosquitto | mosquitto_topic_matches_sub2 | Prompt; Fixes; Targets. | 87.5 | 28.28 | 0.16 | Reports |
| xvid | xvid_encore | Prompt; Fixes; Targets. | 31.25 | 0.22 | 0.16 | Reports |
| cjson | cjson_parse | Prompt; Fixes; Targets. | 75 | 25.72 | 0.13 | Reports |
| cjson | cjson_parsewithlength | Prompt; Fixes; Targets. | 62.5 | 25.72 | 0.13 | Reports |
| jsoncpp | json-value-removeindex | Prompt; Fixes; Targets. | 25 | 5.11 | 0.13 | Reports |
| libucl | ucl_object_merge | Prompt; Fixes; Targets. | 25 | 0.65 | 0.12 | Reports |
| speex | ogg_stream_pageout_fill | Prompt; Fixes; Targets. | 6.25 | 0 | 0.07 | Reports |
| libsndfile | sf_command | Prompt; Fixes; Targets. | 25 | 3.41 | 0.06 | Reports |
| mosquitto | mosquitto_topic_matches_sub | Prompt; Fixes; Targets. | 62.5 | 4.03 | 0.05 | Reports |
| libsndfile | sf_format_check | Prompt; Fixes; Targets. | 12.5 | 0.1 | 0.04 | Reports |
| libucl | ucl_object_replace_key | Prompt; Fixes; Targets. | 50 | 7.14 | 0.04 | Reports |
| libdwarf | dwarf_find_die_given_sig8 | Prompt; Fixes; Targets. | 37.5 | 11.62 | 0.01 | Reports |