Shared: Generate more value-preserving flow summaries #19443

MathiasVP · 2025-05-01T08:59:55Z

Ideally, we want generated flow summaries to be content-sensitive. That is, a summary for a function such as:

int read_f(S* s) { return s->f; }

should specify that we read the content f from Argument[*0].

However, if a function is super complex it may read/write many access paths, and this could cause an explosion in the number of summaries we generate.

To mitigate this, the flow summary generation library puts various restrictions on which callables receive content-sensitive summaries.

When a content-sensitive summary isn't generated, we currently fall back to a taint-configuration-based summary which means we only generate a taint summary and not a value-preserving summary.

This PR adds a "midpoint" in between the content-sensitive value-preserving summary and the taint-based summary so that we now:

First check if we can generate a content-sensitive summary,
If not, we check if we can generate a value-preserving summary,
and finally, if that fails we check if we can generate a taint-based summary.

This seems to generate much better models on OpenSSL in particular.

Copilot

Pull Request Overview

This PR introduces a new value-preserving summary fallback between content-sensitive and taint-based flows. It updates test fixtures and QL modules to:

Change heuristic-summary annotations from taint to value in Java, C#, and C++ test data.
Update captureFlow calls to the new captureHeuristicFlow and expanded captureFlow(c,_,_) signatures in inline test configurations.
Switch debug imports and modules from PropagateFlow to PropagateTaintFlow, and remove exists() guards in path queries.

Reviewed Changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
java/ql/test/utils/modelgenerator/dataflow/p/MultiPaths.java	Switched `heuristic-summary` tag from `taint` to `value`.
java/ql/test/utils/modelgenerator/dataflow/p/InnerClasses.java	Updated nested-class `heuristic-summary` tags to `value`.
java/ql/test/utils/modelgenerator/dataflow/p/Inheritance.java	Updated inheritance-test `heuristic-summary` tags to `value`.
java/ql/test/utils/modelgenerator/dataflow/p/ImmutablePojo.java	Updated POJO `heuristic-summary` tag to `value`.
java/ql/test/utils/modelgenerator/dataflow/p/FinalClass.java	Updated final-class `heuristic-summary` tag to `value`.
java/ql/test/utils/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql	Swapped `captureFlow` to `captureHeuristicFlow` call.
java/ql/test/utils/modelgenerator/dataflow/CaptureContentSummaryModels.ql	Expanded `captureFlow` to new overload with two parameters.
java/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPath.ql	Switched to `PropagateTaintFlow::PathGraph` import.
java/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPartialPath.ql	Updated `FlowExplorationFwd` to use `PropagateTaintFlow`.
java/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql	Expanded `captureFlow` to new overload in core module.
csharp/ql/test/utils/modelgenerator/dataflow/Summaries.cs	Switched C# fixtures’ `heuristic-summary` tags to `value`.
csharp/ql/test/utils/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql	Swapped `captureFlow` to `captureHeuristicFlow` call in C#.
csharp/ql/test/utils/modelgenerator/dataflow/CaptureContentSummaryModels.ql	Expanded `captureFlow` to new overload in C#.
csharp/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPath.ql	Switched to `PropagateTaintFlow::PathGraph` import in C#.
csharp/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPartialPath.ql	Updated `FlowExplorationFwd` to use `PropagateTaintFlow` in C#.
csharp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql	Expanded `captureFlow` to new overload in core C# module.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp	Switched C++ fixtures’ `heuristic-summary` tags to `value`.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql	Swapped `captureFlow` to `captureHeuristicFlow` call in C++.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql	Expanded `captureFlow` to new overload in C++.

Comments suppressed due to low confidence (3)

java/ql/test/utils/modelgenerator/dataflow/p/ImmutablePojo.java:28

The heuristic-summary entry for Argument[this] still uses taint; it should be updated to value to match the new value-preserving fallback.

// heuristic-summary=p;ImmutablePojo;false;or;(String);;Argument[this];ReturnValue;taint;df-generated

cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp:212

This heuristic-summary annotation still uses taint; it should use value to align with the updated fallback behavior.

// heuristic-summary=;;true;copy_struct;(HasInt *,const HasInt *);;Argument[1];Argument[*0];taint;df-generated

cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp:213

This heuristic-summary annotation still uses taint; it should use value to align with the updated fallback behavior.

// heuristic-summary=;;true;copy_struct;(HasInt *,const HasInt *);;Argument[1];Argument[*1];taint;df-generated

MathiasVP · 2025-05-02T09:58:18Z

Merge conflicts from #19382 resolved in d5bc95d 🤞

michaelnebel

This is a really good idea @MathiasVP! 😄
I quickly checked locally, that this doesn't affect idempotency (tried on .NET Runtime) 🎉 .
Besides the inline questions/comments: Before the PR is merged, I would also like to at least re-generate the models for .NET Runtime and try and run our queries with the new models included.

michaelnebel · 2025-05-02T10:47:30Z

shared/mad/codeql/mad/modelgenerator/internal/ModelPrinting.qll

-    string asLiftedTaintModel(Printing::SummaryApi api, string input, string output) {
-      result = asModel(api, input, output, false, true)
+    bindingset[input, output, preservesValue]
+    string asLiftedTaintModel(


Rename to asLiftedModel.

Fixed in 54f0eed

michaelnebel · 2025-05-02T10:59:55Z

shared/mad/codeql/mad/modelgenerator/internal/ModelGeneratorImpl.qll

-      string captureFlow(DataFlowSummaryTargetApi api) {
-        result = captureQualifierFlow(api) or
-        result = captureThroughFlow(api)
+      string captureHeuristicFlow(DataFlowSummaryTargetApi api, boolean preservesValue) {


Maybe keep the name captureFlow (the predicate is in the Heuristic module - rename was probably done before the re-factor you just rebased on top of :-) )

Fixed in 4d2f2b8. I had to change a few tests to work around some ambiguities, but hopefully that's okay 🤞

michaelnebel · 2025-05-02T11:33:13Z

shared/mad/codeql/mad/modelgenerator/internal/ModelGeneratorImpl.qll

-      result = Heuristic::captureFlow(api) and
-      lift = true
+        not exists(DataFlowSummaryTargetApi api0 |
+          // If the heuristic summary is value-preserving then we keep both


Did you find any cases where it is relevant to keep both?
If we go with this approach, we need to eliminate duplicates - as it is now the heuristic flow and content based flow will generate some identical summaries (from input to output) where the only difference is provenance.

Did you find any cases where it is relevant to keep both?

I didn't, no. It's certainly not required for what I need it for (in OpenSSL). I'm happy to make this more restrictive 👍

If it is ok with you - then maybe we should only use the heuristic models in case there is no content based. If we find a use-case where both is required, then we can consider changing it again.

That sounds fine to me 👍

If it is ok with you - then maybe we should only use the heuristic models in case there is no content based. If we find a use-case where both is required, then we can consider changing it again.

That sounds fine to me 👍

Hm, wait a second. I'm looking at these three models we generate for the EVP_CIPHER_CTX_copy function in OpenSSL:

- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[*1]", "Argument[*0]", "value", "df-generated"] - ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[1]", "Argument[*0]", "taint", "dfc-generated"] - ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[1]", "Argument[*1]", "taint", "dfc-generated"]

(the value-preserving model here is one of the models that motivated this work.)

As you can see, we get two taint models using the content-based approach and one value-preserving model using the heuristic-based approach. The heuristic one is the one I really care about. However, if we only keep the heuristic model if there is no content-based then we end up throwing the value-based one away 😭

So I think what I would prefer is:

Taint-based heuristic-model and a taint-based content-model: Only use the content-based model

Taint-based heuristic-model and a value-preserving content-model: Only use the content-based model

Value-preserving heuristic-model and a taint-based content-model: Use both (we will need to deduplicate this)

Value-preserving heuristic-model and a value-preserving content-model: Only use the content-based model

How does this sound, @michaelnebel ?

Yes, let's give that a try 😄

I've done the deduplication in 37bc2bf now. Let me know what you think of it! I'll rerun the model generation experiments 🤞

MathiasVP · 2025-05-02T11:41:03Z

This is a really good idea @MathiasVP! 😄 I quickly checked locally, that this doesn't affect idempotency (tried on .NET Runtime) 🎉 . Besides the inline questions/comments: Before the PR is merged, I would also like to at least re-generate the models for .NET Runtime and try and run our queries with the new models included.

Thanks! Do you want me to re-generate the models for .NET runtime as part of this PR? Or open another PR on top of this one that re-generates the models? I'm happy to follow whatever PR structure you think is best.

... I'm also happy to leave the .NET model generation to you, of course 😄

michaelnebel · 2025-05-02T12:35:39Z

This is a really good idea @MathiasVP! 😄 I quickly checked locally, that this doesn't affect idempotency (tried on .NET Runtime) 🎉 . Besides the inline questions/comments: Before the PR is merged, I would also like to at least re-generate the models for .NET Runtime and try and run our queries with the new models included.

Thanks! Do you want me to re-generate the models for .NET runtime as part of this PR? Or open another PR on top of this one that re-generates the models? I'm happy to follow whatever PR structure you think is best.

... I'm also happy to leave the .NET model generation to you, of course 😄

Usually I just make a separate PR with the update to the .NET Runtime models - as we also need to supply a change note.
You are more than welcome to re-generate the .NET Runtime models and start a DCA run - but let me know if you prefer that I do it (this should serve as a sanity check of the changes for the model generator) 😃

MathiasVP · 2025-05-02T12:51:08Z

Usually I just make a separate PR with the update to the .NET Runtime models - as we also need to supply a change note. You are more than welcome to re-generate the .NET Runtime models and start a DCA run - but let me know if you prefer that I do it (this should serve as a sanity check of the changes for the model generator) 😃

It would be fantastic if you could do it 👍 Thinking about it I'm not 100% sure which parameters you pass to GenerateFlowModel.py when generating the .NET Runtime models so it's probably best if I leave that to you 🙂

shared/mad/codeql/mad/modelgenerator/internal/ModelGeneratorImpl.qll

michaelnebel · 2025-05-09T14:30:23Z

shared/mad/codeql/mad/modelgenerator/internal/ModelGeneratorImpl.qll

+      result = ContentSensitive::captureFlow(api, _, _, lift, _)
+      or
+      exists(boolean preservesValue, string input, string output |
+        not exists(
+          DataFlowSummaryTargetApi api0, string input0, string output0, boolean preservesValue0
+        |
+          // If the heuristic summary is taint-based, and we can generate a content-sensitive
+          // summary that is value-preserving then we omit generating any heuristic summary.
+          preservesValue = false and
+          preservesValue0 = true
+          or
+          // However, if they're both value-preserving (or both taint-based) then we only
+          // generate a heuristic summary if we didn't generate a content-sensitive summary.
+          preservesValue = preservesValue0 and
+          input0 = input and
+          output0 = output
        |
          (api0 = api or api.lift() = api0) and
-          exists(ContentSensitive::captureFlow(api0, false, _))
+          exists(ContentSensitive::captureFlow(api0, input0, output0, false, preservesValue0))
          or
          api0.lift() = api.lift() and
-          exists(ContentSensitive::captureFlow(api0, true, _))
+          exists(ContentSensitive::captureFlow(api0, input0, output0, true, preservesValue0))
        ) and
-        result = Heuristic::captureFlow(api, preservesValue) and
+        result = Heuristic::captureFlow(api, input, output, preservesValue) and


This change does a bit more than de-duplicating the generated summaries.
Prior to the change, a heuristic summary (value or taint) wouldn't be generated (at all) in a case there existed any content sensitive summary for the API. With this change a heuristic summary is generated on an input/output pair in case there doesn't exist a content sensitive summary for the same input/output pair for the given API.
My original comment merely stated that we should try and remove duplicate entries (where the only difference was the provenance df-generated/dfc-generated).

Oh, that's a good point. I think the behavior you want is what we get if we get rid of this disjunct:

// If the heuristic summary is taint-based, and we can generate a content-sensitive // summary that is value-preserving then we omit generating any heuristic summary. preservesValue = false and preservesValue0 = true

right? Then we'd only generate a heuristic summary if we fail to generate a content-sensitive with exactly the same (input, output, kind) tuple.

I'm happy to do that if you'd rather want this behavior?

I realised that my previous comment is not entirely correct.
The issue only pertains to taint like summaries and the cause is the other disjunkt:

preservesValue = preservesValue0 and input0 = input and output0 = output

That is, we only exclude a summary df-generated taint summary in case there doesn't exist a dfc-generated generated taint summary on the same input/output (string) pair. However, this causes a problem as the df-generated taint summaries truncates field/property accesses in the input/output strings (but the dataflow configuration allows a "limited" set of field/property reads/stores - which means we start producing field conflation issues with taint summaries).

Oh! Hm, yeah I can see how that's a problem. Should we delay deduplication until we see whether it's actually a problem in practice? I'd really like to somehow get this working as the taint summaries we get for OpenSSL isn't sufficient for what we need them for 😭

I tried re-generating the .NET runtime models - and we do get some models that we don't want.
Also, I noticed that the printing of heuristic models might need to change for the exact value preserving models.
Is it ok if I experiment a bit and make some changes to the PR?
Considering to make the filtering like this:

// If the heuristic summary is taint-based, and we can generate a content-sensitive // summary then we omit generating the heuristic summary. preservesValue = false or // If they're both value-preserving then we only generate a heuristic summary if // we didn't generate a content-sensitive summary on the same input/output pair. preservesValue = true and preservesValue0 = true and input0 = input and output0 = output

Thanks a lot for regenerating the .NET runtime models 🙇

Yeah, feel free to experiment on the PR all you want. Let me know if I can do something to help out!

I should also add: I think the filtering you propose there makes a lot of sense, and I also just checked that it still gives us the models we care about 🥳

Great! 😄
I couldn't push the changes to this branch - could you cherry pick the last six commits from (when CI is green): #19481
The commits does the following:

Filter summaries based on the idea proposed above, which means that

For value based summaries we keep both heuristic and content-based (but de-duplicating based on provenance and keep the dfc-generated).

For taint based heuristic summaries we only keep these in case there doesn't exist any (value or taint) content-based summary for the same api.

Adjust the printing of value heuristic models. Taint steps allow (at least for C#) reading from a collection (as a taint step), which means that we used the type of the input to "guess" whether it was likely that we just read an element of a collection. However, this doesn't apply for value models.

Fix an unrelated issue of the printing of source models (as these don't support .Element etc).

That's fantastic, Michael! I've pushed your commits now 🙇

Excellent - and thank you very much for contributing to the model generator @MathiasVP - I really appreciate it!
I will start some DCA runs to check whether this impacted performance.

…nt sensitive summary can be generated.

…inor issue with output printing in captureSink).

michaelnebel · 2025-05-14T08:03:21Z

Ran DCA on .NET runtime for C# and Open JDK 17 and commons-io for Java. Performance doesn't appear to to be impacted. There are some changes in the produced tuples (but this is expected).
For C# a another PR has been opened which updates the models and the changes looks good to me.

michaelnebel

Thank you for the this great contribution @MathiasVP !!

michaelnebel · 2025-05-14T08:05:20Z

The failing integration tests are unrelated to the changes in this PR.

MathiasVP added 5 commits April 30, 2025 20:34

Shared: Generate value-preserving summaries when possible.

607a1e4

C#: Fixup queries and accept test changes.

cd47379

Java: Fixup queries and accept test changes.

07641e4

Rust: Fixup queries.

7751973

C++: Fixup queries and accept test changes.

d8eafbb

github-actions bot added C# C++ Java Rust Pull requests that update Rust code labels May 1, 2025

MathiasVP marked this pull request as ready for review May 1, 2025 09:22

Copilot AI review requested due to automatic review settings May 1, 2025 09:22

MathiasVP requested review from a team as code owners May 1, 2025 09:22

MathiasVP requested a review from michaelnebel May 1, 2025 09:22

MathiasVP added the no-change-note-required This PR does not need a change note label May 1, 2025

Copilot AI reviewed May 1, 2025

View reviewed changes

MathiasVP mentioned this pull request May 1, 2025

C++: Add generated summaries for OpenSSL. microsoft/codeql#221

Closed

Merge branch 'main' into generate-more-value-preserving-summaries-2

d5bc95d

michaelnebel reviewed May 2, 2025

View reviewed changes

MathiasVP added 4 commits May 2, 2025 13:54

Shared: Rename 'asLiftedTaintModel' to 'asLiftedModel'.

54f0eed

Shared/Java/C#/Rust/C++: Rename 'captureHeuristicFlow' to 'captureFlow'.

4d2f2b8

Shared: Deduplicate flow summaries.

37bc2bf

C++/C#/Java/Rust: Fixup tests.

bce5f25

github-advanced-security bot found potential problems May 2, 2025

View reviewed changes

shared/mad/codeql/mad/modelgenerator/internal/ModelGeneratorImpl.qll Fixed Show fixed Hide fixed

Shared: Fix QLDoc to make QL4QL happy.

6437168

michaelnebel reviewed May 9, 2025

View reviewed changes

michaelnebel added 6 commits May 13, 2025 13:44

Shared: Avoid generating taint based heuristic summaries when a conte…

6c9f248

…nt sensitive summary can be generated.

Shared: Adjust the printing of heuristic value summaries (and fix a m…

a94cffa

…inor issue with output printing in captureSink).

C#: Update model generator implementation and test expected output.

09dc3c8

Java: Update model generator implementation and test expected output.

ee83ca9

Rust: Update model generator implementation.

6712cce

Cpp: Update model generator implementation.

fcecc5a

michaelnebel approved these changes May 14, 2025

View reviewed changes

MathiasVP merged commit e903d76 into github:main May 14, 2025
53 of 76 checks passed

michaelnebel mentioned this pull request May 14, 2025

C#: Re-generate .NET 9 Runtime models. #19480

Merged

MathiasVP mentioned this pull request May 14, 2025

C++: Add summary models for openssl and sqlite #19492

Merged

Shared: Generate more value-preserving flow summaries #19443

Shared: Generate more value-preserving flow summaries #19443

Uh oh!

Conversation

MathiasVP commented May 1, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

MathiasVP commented May 2, 2025

Uh oh!

michaelnebel left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelnebel May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MathiasVP May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MathiasVP commented May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

michaelnebel commented May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MathiasVP commented May 2, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelnebel May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelnebel May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelnebel commented May 14, 2025

Uh oh!

michaelnebel left a comment

Choose a reason for hiding this comment

Uh oh!

michaelnebel May 2, 2025 •

edited

Loading

MathiasVP May 2, 2025 •

edited

Loading

MathiasVP commented May 2, 2025 •

edited

Loading

michaelnebel commented May 2, 2025 •

edited

Loading

michaelnebel May 13, 2025 •

edited

Loading

michaelnebel May 13, 2025 •

edited

Loading