-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Shared: Generate more value-preserving flow summaries #19443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Shared: Generate more value-preserving flow summaries #19443
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new value-preserving summary fallback between content-sensitive and taint-based flows. It updates test fixtures and QL modules to:
- Change
heuristic-summary
annotations fromtaint
tovalue
in Java, C#, and C++ test data. - Update
captureFlow
calls to the newcaptureHeuristicFlow
and expandedcaptureFlow(c,_,_)
signatures in inline test configurations. - Switch debug imports and modules from
PropagateFlow
toPropagateTaintFlow
, and removeexists()
guards in path queries.
Reviewed Changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
java/ql/test/utils/modelgenerator/dataflow/p/MultiPaths.java | Switched heuristic-summary tag from taint to value . |
java/ql/test/utils/modelgenerator/dataflow/p/InnerClasses.java | Updated nested-class heuristic-summary tags to value . |
java/ql/test/utils/modelgenerator/dataflow/p/Inheritance.java | Updated inheritance-test heuristic-summary tags to value . |
java/ql/test/utils/modelgenerator/dataflow/p/ImmutablePojo.java | Updated POJO heuristic-summary tag to value . |
java/ql/test/utils/modelgenerator/dataflow/p/FinalClass.java | Updated final-class heuristic-summary tag to value . |
java/ql/test/utils/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql | Swapped captureFlow to captureHeuristicFlow call. |
java/ql/test/utils/modelgenerator/dataflow/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload with two parameters. |
java/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPath.ql | Switched to PropagateTaintFlow::PathGraph import. |
java/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPartialPath.ql | Updated FlowExplorationFwd to use PropagateTaintFlow . |
java/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload in core module. |
csharp/ql/test/utils/modelgenerator/dataflow/Summaries.cs | Switched C# fixtures’ heuristic-summary tags to value . |
csharp/ql/test/utils/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql | Swapped captureFlow to captureHeuristicFlow call in C#. |
csharp/ql/test/utils/modelgenerator/dataflow/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload in C#. |
csharp/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPath.ql | Switched to PropagateTaintFlow::PathGraph import in C#. |
csharp/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPartialPath.ql | Updated FlowExplorationFwd to use PropagateTaintFlow in C#. |
csharp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload in core C# module. |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp | Switched C++ fixtures’ heuristic-summary tags to value . |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql | Swapped captureFlow to captureHeuristicFlow call in C++. |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload in C++. |
Comments suppressed due to low confidence (3)
java/ql/test/utils/modelgenerator/dataflow/p/ImmutablePojo.java:28
- The heuristic-summary entry for
Argument[this]
still usestaint
; it should be updated tovalue
to match the new value-preserving fallback.
// heuristic-summary=p;ImmutablePojo;false;or;(String);;Argument[this];ReturnValue;taint;df-generated
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp:212
- This
heuristic-summary
annotation still usestaint
; it should usevalue
to align with the updated fallback behavior.
// heuristic-summary=;;true;copy_struct;(HasInt *,const HasInt *);;Argument[1];Argument[*0];taint;df-generated
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp:213
- This
heuristic-summary
annotation still usestaint
; it should usevalue
to align with the updated fallback behavior.
// heuristic-summary=;;true;copy_struct;(HasInt *,const HasInt *);;Argument[1];Argument[*1];taint;df-generated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good idea @MathiasVP! 😄
I quickly checked locally, that this doesn't affect idempotency (tried on .NET Runtime) 🎉 .
Besides the inline questions/comments: Before the PR is merged, I would also like to at least re-generate the models for .NET Runtime and try and run our queries with the new models included.
string asLiftedTaintModel(Printing::SummaryApi api, string input, string output) { | ||
result = asModel(api, input, output, false, true) | ||
bindingset[input, output, preservesValue] | ||
string asLiftedTaintModel( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to asLiftedModel
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 54f0eed
string captureFlow(DataFlowSummaryTargetApi api) { | ||
result = captureQualifierFlow(api) or | ||
result = captureThroughFlow(api) | ||
string captureHeuristicFlow(DataFlowSummaryTargetApi api, boolean preservesValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe keep the name captureFlow
(the predicate is in the Heuristic
module - rename was probably done before the re-factor you just rebased on top of :-) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 4d2f2b8. I had to change a few tests to work around some ambiguities, but hopefully that's okay 🤞
result = Heuristic::captureFlow(api) and | ||
lift = true | ||
not exists(DataFlowSummaryTargetApi api0 | | ||
// If the heuristic summary is value-preserving then we keep both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you find any cases where it is relevant to keep both?
If we go with this approach, we need to eliminate duplicates - as it is now the heuristic flow and content based flow will generate some identical summaries (from input to output) where the only difference is provenance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you find any cases where it is relevant to keep both?
I didn't, no. It's certainly not required for what I need it for (in OpenSSL). I'm happy to make this more restrictive 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is ok with you - then maybe we should only use the heuristic models in case there is no content based. If we find a use-case where both is required, then we can consider changing it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds fine to me 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is ok with you - then maybe we should only use the heuristic models in case there is no content based. If we find a use-case where both is required, then we can consider changing it again.
That sounds fine to me 👍
Hm, wait a second. I'm looking at these three models we generate for the EVP_CIPHER_CTX_copy
function in OpenSSL:
- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[*1]", "Argument[*0]", "value", "df-generated"]
- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[1]", "Argument[*0]", "taint", "dfc-generated"]
- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[1]", "Argument[*1]", "taint", "dfc-generated"]
(the value-preserving model here is one of the models that motivated this work.)
As you can see, we get two taint models using the content-based approach and one value-preserving model using the heuristic-based approach. The heuristic one is the one I really care about. However, if we only keep the heuristic model if there is no content-based then we end up throwing the value-based one away 😭
So I think what I would prefer is:
- Taint-based heuristic-model and a taint-based content-model: Only use the content-based model
- Taint-based heuristic-model and a value-preserving content-model: Only use the content-based model
- Value-preserving heuristic-model and a taint-based content-model: Use both (we will need to deduplicate this)
- Value-preserving heuristic-model and a value-preserving content-model: Only use the content-based model
How does this sound, @michaelnebel ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's give that a try 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done the deduplication in 37bc2bf now. Let me know what you think of it! I'll rerun the model generation experiments 🤞
Thanks! Do you want me to re-generate the models for .NET runtime as part of this PR? Or open another PR on top of this one that re-generates the models? I'm happy to follow whatever PR structure you think is best. ... I'm also happy to leave the .NET model generation to you, of course 😄 |
Usually I just make a separate PR with the update to the .NET Runtime models - as we also need to supply a change note. |
It would be fantastic if you could do it 👍 Thinking about it I'm not 100% sure which parameters you pass to |
@@ -344,12 +345,14 @@ | |||
/** | |||
* Gets the summary model of `api`, if it follows the `fluent` programming pattern (returns `this`). | |||
*/ | |||
private string captureQualifierFlow(DataFlowSummaryTargetApi api) { | |||
private string captureQualifierFlow(DataFlowSummaryTargetApi api, string input, string output) { |
Check warning
Code scanning / CodeQL
Missing QLDoc for parameter Warning
Ideally, we want generated flow summaries to be content-sensitive. That is, a summary for a function such as:
should specify that we read the content
f
fromArgument[*0]
.However, if a function is super complex it may read/write many access paths, and this could cause an explosion in the number of summaries we generate.
To mitigate this, the flow summary generation library puts various restrictions on which callables receive content-sensitive summaries.
When a content-sensitive summary isn't generated, we currently fall back to a taint-configuration-based summary which means we only generate a taint summary and not a value-preserving summary.
This PR adds a "midpoint" in between the content-sensitive value-preserving summary and the taint-based summary so that we now:
This seems to generate much better models on OpenSSL in particular.