-
Notifications
You must be signed in to change notification settings - Fork 213
String canonicalization #985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The phrasing of "string canonicalization" has always been vague. In fact, the specification does not require canonicalization, it only requires that const a = "s";
const b = "s";
print(identical(a, b)); // must be true because `identical(a, b)` is a constant expression
var c = a;
print(identical(c, b)); // false. Not bound by the specification. We wouldn't want that, but the specification allows it because it doesn't talk about canonicalization of strings at all. If we read the specification as implicitly defining canonicalization of strings through their behavior wrt I believe the current VM behavior is that they canonicalize all constant string literals, and they canonicalize the value of constant string expressions in constant contexts. That is This does ensure that The JS compilation is limited by the underlying platform. A JS program cannot distinguish two strings with the same content in any way, so they effectively canonicalize all string values. We do not want the VM to do the same, so we will definitely need to allow more canonicalization than what the specification requires. We should probably specify the exact behavior that we require.
Then we can say that:
|
The API reference seems to show the behavior of https://api.dart.dev/stable/2.8.3/dart-core/String/operator_equals.html |
We have indeed specified both It's still better to specify canonicalization explicitly instead of implicitly. (There have been a lot of different ways to read that part of the spec over time). |
Neither? Why is this a priority right now? |
It's definitely not a priority. I think it probably came up because we just migrated the language/identity tests which have some tests around strings and |
Feb 2024: Note that the original example prints 'true' in many (perhaps all?) configurations including void main() {
var s1 = "ab";
var s2 = "a" "b"; // or `"a" + "b"`.
print(identical(s1, s2)); // 'true'.
} |
That used to work, but the VM still does not canonicalize It's not that |
The specification uses the term iff, which implies that This implies that One might reasonably expect an optimizing compiler to replace We might square the definition of |
That's an interesting example, @rakudrama! The currently specified behavior with respect to canonicalization is pretty clear: (1) 'constant context' is defined, (2) a For strings, we have the rule in the specification of
For the phrase 'constant strings', we'll have to assume that it means 'an object whose run-time type is a subtype of This actually means that 'constant strings' do not have to be canonicalized, but if they are not canonicalized then I don't know if we can always test at run time whether or not any given object is "a constant string" (unless it is canonicalized, in which case they might be allocated in a special memory region—but then we don't need it!). On top of this, we have this:
I certainly think it would be reasonable to change the documentation of Considering all these things, I think it would be a reasonable way ahead to keep the specification unchanged, and achieve the specified treatment of strings by actually canonicalizing all strings that are the value of a constant expression. I don't think there are any other kinds of constant expressions with a similar issue: They are either inherently canonicalized because they are simple values with a special representation, or they have an explicit Finally: @lrhn wrote:
We could consider that, too. It's hard to tell how breaking this change would be (it could certainly cause some identical strings like the result of evaluating |
This specification, as currently written, is a complete definition of the platform It says that the The specification says
About "constant string", I wouldn't necessarily restrict it to only strings created by a constant expression. Let's say that it definitely includes any string object which is the value of a constant expression, and it's (somehow) a property of the string object. That doesn't preclude that other strings can also be designated as constant strings, even if they are allocated as a "new string" according to docs. (Also, don't trust docs of a factory constructor if it says "new object".) Then it's not a problem if That makes web compilation not inherently in violation of this specification, it just makes all strings constant strings. ... I still think it's better to specify canonicalization directly, not indirectly through Any function which returns a Any function which returns a string may choose to return an existing string with the same content, rather than create a new string. If two expressions are evaluated a constant, and they should evaluate to equal strings, they evaluate to the same string. |
This seems to imply that every expression of type However, I suspect that the specified behavior of @kustermann, @mraleph, @rakudrama, is it true that there is no run-time representation of the status of an instance of I would assume that If this is true then it implies that there is no way to implement One way out would be to say that the specification of
I believe this would be essentially the same thing as making all strings constant strings, except that it avoids introducing something to the notion of 'constant strings' which is not compatible with the existing notion of 'constant expressions'. This could be costly because it forces An alternative (that I'd prefer) would be to ensure more consistently that every object that we wish to consider a 'constant string' is canonical. In that case In order to specify that the latter approach is used, the specification would be changed such that |
VM has special bits on constant objects, but I would suggest that we remove anything from the spec that implies special treatment for strings which originated from constant expressions. In JS you can't distinguish two string values if they have the same content even if they are represented as two different objects. So I would strongly encourage to relax the spec for |
I actually do want constant strings to be canonicalized, so you know you can depend on The question is what a "constant string" is. Anything computed in a context which must be constant (and computed at compile time) is a given, but is a simple string literal like As specified, these are constant expressions, except the last one, |
Sounds good! This whole description matches exactly what I suggested as the preferred approach here. |
The degree to which canonicalization occurs in Dart, in particular for strings, has always been somewhat unclear.
We have explicit rules requiring that the following forms of constant expressions must be evaluated to the same (canonical) object when evaluating any of the occurrences of such an expression:
#foo
or#+
) are canonicalized.<constObjectExpression>
(such asconst C(2)
) is canonicalized.The definition of
identical()
ensures that instances ofbool
,int
, anddouble
are canonicalized, in the sense that they are considered identical when they represent the same value (and it is not observable whether this happens because the representation is a tagged bit string, or becauseidentical
treats distinct boxed representations in a special way).Strings have traditionally been treated differently:
The specification of
identical
actually requires thatidentical(c1, c2)
must evaluate to true whenwhich implies that the behavior of
dart
is wrong, anddart2js
is correct:"ab"
is a constant expression of typeString
,"a" "b"
is a constant expression of typeString
, and the two strings are equal according tooperator ==
. (We haven't specified the behavior ofoperator ==
on strings, but we hardly want to make those two strings unequal).I believe that all non-string objects are canonicalized or not in a way which is well-defined: It is specified for the basic forms mentioned above that every constant expression of these forms has a canonicalized value, and constant expressions do not allow for creating composite objects from existing objects, canonicalized or not, other than constant collection literals (e.g., we can't have
const C(x)
). In particular, we do not have to specify any additional rules saying that the value of a constant variable is canonicalized.It is unfortunate that different tools behave differently, but it is also likely to be a serious breaking change to change the behavior of any of these tools. So do we wish to proceed and change the specification to make
identical()
implementation dependent on strings, or do we wish to choose a specific behavior and get that implemented?@munificent, @lrhn, @leafpetersen, WDYT?
The text was updated successfully, but these errors were encountered: