Skip to content

Shared: Generate more value-preserving flow summaries #19443

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

MathiasVP
Copy link
Contributor

Ideally, we want generated flow summaries to be content-sensitive. That is, a summary for a function such as:

int read_f(S* s) { return s->f; }

should specify that we read the content f from Argument[*0].

However, if a function is super complex it may read/write many access paths, and this could cause an explosion in the number of summaries we generate.

To mitigate this, the flow summary generation library puts various restrictions on which callables receive content-sensitive summaries.

When a content-sensitive summary isn't generated, we currently fall back to a taint-configuration-based summary which means we only generate a taint summary and not a value-preserving summary.

This PR adds a "midpoint" in between the content-sensitive value-preserving summary and the taint-based summary so that we now:

  • First check if we can generate a content-sensitive summary,
  • If not, we check if we can generate a value-preserving summary,
  • and finally, if that fails we check if we can generate a taint-based summary.

This seems to generate much better models on OpenSSL in particular.

@github-actions github-actions bot added C# C++ Java Rust Pull requests that update Rust code labels May 1, 2025
@MathiasVP MathiasVP marked this pull request as ready for review May 1, 2025 09:22
@Copilot Copilot AI review requested due to automatic review settings May 1, 2025 09:22
@MathiasVP MathiasVP requested review from a team as code owners May 1, 2025 09:22
@MathiasVP MathiasVP requested a review from michaelnebel May 1, 2025 09:22
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label May 1, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new value-preserving summary fallback between content-sensitive and taint-based flows. It updates test fixtures and QL modules to:

  • Change heuristic-summary annotations from taint to value in Java, C#, and C++ test data.
  • Update captureFlow calls to the new captureHeuristicFlow and expanded captureFlow(c,_,_) signatures in inline test configurations.
  • Switch debug imports and modules from PropagateFlow to PropagateTaintFlow, and remove exists() guards in path queries.

Reviewed Changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated no comments.

Show a summary per file
File Description
java/ql/test/utils/modelgenerator/dataflow/p/MultiPaths.java Switched heuristic-summary tag from taint to value.
java/ql/test/utils/modelgenerator/dataflow/p/InnerClasses.java Updated nested-class heuristic-summary tags to value.
java/ql/test/utils/modelgenerator/dataflow/p/Inheritance.java Updated inheritance-test heuristic-summary tags to value.
java/ql/test/utils/modelgenerator/dataflow/p/ImmutablePojo.java Updated POJO heuristic-summary tag to value.
java/ql/test/utils/modelgenerator/dataflow/p/FinalClass.java Updated final-class heuristic-summary tag to value.
java/ql/test/utils/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql Swapped captureFlow to captureHeuristicFlow call.
java/ql/test/utils/modelgenerator/dataflow/CaptureContentSummaryModels.ql Expanded captureFlow to new overload with two parameters.
java/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPath.ql Switched to PropagateTaintFlow::PathGraph import.
java/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPartialPath.ql Updated FlowExplorationFwd to use PropagateTaintFlow.
java/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql Expanded captureFlow to new overload in core module.
csharp/ql/test/utils/modelgenerator/dataflow/Summaries.cs Switched C# fixtures’ heuristic-summary tags to value.
csharp/ql/test/utils/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql Swapped captureFlow to captureHeuristicFlow call in C#.
csharp/ql/test/utils/modelgenerator/dataflow/CaptureContentSummaryModels.ql Expanded captureFlow to new overload in C#.
csharp/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPath.ql Switched to PropagateTaintFlow::PathGraph import in C#.
csharp/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPartialPath.ql Updated FlowExplorationFwd to use PropagateTaintFlow in C#.
csharp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql Expanded captureFlow to new overload in core C# module.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp Switched C++ fixtures’ heuristic-summary tags to value.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql Swapped captureFlow to captureHeuristicFlow call in C++.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql Expanded captureFlow to new overload in C++.
Comments suppressed due to low confidence (3)

java/ql/test/utils/modelgenerator/dataflow/p/ImmutablePojo.java:28

  • The heuristic-summary entry for Argument[this] still uses taint; it should be updated to value to match the new value-preserving fallback.
// heuristic-summary=p;ImmutablePojo;false;or;(String);;Argument[this];ReturnValue;taint;df-generated

cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp:212

  • This heuristic-summary annotation still uses taint; it should use value to align with the updated fallback behavior.
// heuristic-summary=;;true;copy_struct;(HasInt *,const HasInt *);;Argument[1];Argument[*0];taint;df-generated

cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp:213

  • This heuristic-summary annotation still uses taint; it should use value to align with the updated fallback behavior.
// heuristic-summary=;;true;copy_struct;(HasInt *,const HasInt *);;Argument[1];Argument[*1];taint;df-generated

@MathiasVP
Copy link
Contributor Author

Merge conflicts from #19382 resolved in d5bc95d 🤞

Copy link
Contributor

@michaelnebel michaelnebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good idea @MathiasVP! 😄
I quickly checked locally, that this doesn't affect idempotency (tried on .NET Runtime) 🎉 .
Besides the inline questions/comments: Before the PR is merged, I would also like to at least re-generate the models for .NET Runtime and try and run our queries with the new models included.

string asLiftedTaintModel(Printing::SummaryApi api, string input, string output) {
result = asModel(api, input, output, false, true)
bindingset[input, output, preservesValue]
string asLiftedTaintModel(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to asLiftedModel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 54f0eed

string captureFlow(DataFlowSummaryTargetApi api) {
result = captureQualifierFlow(api) or
result = captureThroughFlow(api)
string captureHeuristicFlow(DataFlowSummaryTargetApi api, boolean preservesValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe keep the name captureFlow (the predicate is in the Heuristic module - rename was probably done before the re-factor you just rebased on top of :-) )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4d2f2b8. I had to change a few tests to work around some ambiguities, but hopefully that's okay 🤞

result = Heuristic::captureFlow(api) and
lift = true
not exists(DataFlowSummaryTargetApi api0 |
// If the heuristic summary is value-preserving then we keep both
Copy link
Contributor

@michaelnebel michaelnebel May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you find any cases where it is relevant to keep both?
If we go with this approach, we need to eliminate duplicates - as it is now the heuristic flow and content based flow will generate some identical summaries (from input to output) where the only difference is provenance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you find any cases where it is relevant to keep both?

I didn't, no. It's certainly not required for what I need it for (in OpenSSL). I'm happy to make this more restrictive 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is ok with you - then maybe we should only use the heuristic models in case there is no content based. If we find a use-case where both is required, then we can consider changing it again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds fine to me 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is ok with you - then maybe we should only use the heuristic models in case there is no content based. If we find a use-case where both is required, then we can consider changing it again.

That sounds fine to me 👍

Hm, wait a second. I'm looking at these three models we generate for the EVP_CIPHER_CTX_copy function in OpenSSL:

- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[*1]", "Argument[*0]", "value", "df-generated"]
- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[1]", "Argument[*0]", "taint", "dfc-generated"]
- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[1]", "Argument[*1]", "taint", "dfc-generated"]

(the value-preserving model here is one of the models that motivated this work.)

As you can see, we get two taint models using the content-based approach and one value-preserving model using the heuristic-based approach. The heuristic one is the one I really care about. However, if we only keep the heuristic model if there is no content-based then we end up throwing the value-based one away 😭

So I think what I would prefer is:

  • Taint-based heuristic-model and a taint-based content-model: Only use the content-based model
  • Taint-based heuristic-model and a value-preserving content-model: Only use the content-based model
  • Value-preserving heuristic-model and a taint-based content-model: Use both (we will need to deduplicate this)
  • Value-preserving heuristic-model and a value-preserving content-model: Only use the content-based model

How does this sound, @michaelnebel ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's give that a try 😄

Copy link
Contributor Author

@MathiasVP MathiasVP May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done the deduplication in 37bc2bf now. Let me know what you think of it! I'll rerun the model generation experiments 🤞

@MathiasVP
Copy link
Contributor Author

MathiasVP commented May 2, 2025

This is a really good idea @MathiasVP! 😄 I quickly checked locally, that this doesn't affect idempotency (tried on .NET Runtime) 🎉 . Besides the inline questions/comments: Before the PR is merged, I would also like to at least re-generate the models for .NET Runtime and try and run our queries with the new models included.

Thanks! Do you want me to re-generate the models for .NET runtime as part of this PR? Or open another PR on top of this one that re-generates the models? I'm happy to follow whatever PR structure you think is best.

... I'm also happy to leave the .NET model generation to you, of course 😄

@michaelnebel
Copy link
Contributor

michaelnebel commented May 2, 2025

This is a really good idea @MathiasVP! 😄 I quickly checked locally, that this doesn't affect idempotency (tried on .NET Runtime) 🎉 . Besides the inline questions/comments: Before the PR is merged, I would also like to at least re-generate the models for .NET Runtime and try and run our queries with the new models included.

Thanks! Do you want me to re-generate the models for .NET runtime as part of this PR? Or open another PR on top of this one that re-generates the models? I'm happy to follow whatever PR structure you think is best.

... I'm also happy to leave the .NET model generation to you, of course 😄

Usually I just make a separate PR with the update to the .NET Runtime models - as we also need to supply a change note.
You are more than welcome to re-generate the .NET Runtime models and start a DCA run - but let me know if you prefer that I do it (this should serve as a sanity check of the changes for the model generator) 😃

@MathiasVP
Copy link
Contributor Author

Usually I just make a separate PR with the update to the .NET Runtime models - as we also need to supply a change note. You are more than welcome to re-generate the .NET Runtime models and start a DCA run - but let me know if you prefer that I do it (this should serve as a sanity check of the changes for the model generator) 😃

It would be fantastic if you could do it 👍 Thinking about it I'm not 100% sure which parameters you pass to GenerateFlowModel.py when generating the .NET Runtime models so it's probably best if I leave that to you 🙂

@@ -344,12 +345,14 @@
/**
* Gets the summary model of `api`, if it follows the `fluent` programming pattern (returns `this`).
*/
private string captureQualifierFlow(DataFlowSummaryTargetApi api) {
private string captureQualifierFlow(DataFlowSummaryTargetApi api, string input, string output) {

Check warning

Code scanning / CodeQL

Missing QLDoc for parameter Warning

The QLDoc has no documentation for input, or output, but the QLDoc mentions fluent
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C# C++ Java no-change-note-required This PR does not need a change note Rust Pull requests that update Rust code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants