Skip to content

String canonicalization #985

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
eernstg opened this issue May 26, 2020 · 14 comments
Open

String canonicalization #985

eernstg opened this issue May 26, 2020 · 14 comments
Labels
enhanced-const Requests or proposals about enhanced constant expressions question Further information is requested specification technical-debt Dealing with a part of the language which needs clarification or adjustments

Comments

@eernstg
Copy link
Member

eernstg commented May 26, 2020

The degree to which canonicalization occurs in Dart, in particular for strings, has always been somewhat unclear.

We have explicit rules requiring that the following forms of constant expressions must be evaluated to the same (canonical) object when evaluating any of the occurrences of such an expression:

  • Symbol literals (such as #foo or #+) are canonicalized.
  • List, set, and map literals with the same elements/keys/values are canonicalized.
  • The value of a <constObjectExpression> (such as const C(2)) is canonicalized.

The definition of identical() ensures that instances of bool, int, and double are canonicalized, in the sense that they are considered identical when they represent the same value (and it is not observable whether this happens because the representation is a tagged bit string, or because identical treats distinct boxed representations in a special way).

Strings have traditionally been treated differently:

void main() {
  var s1 = "ab";
  var s2 = "a" "b"; // or `"a" + "b"`.
  print(identical(s1, s2)); // 'false' with dart, 'true' with dart2js.
}

The specification of identical actually requires that identical(c1, c2) must evaluate to true when

c1 and c2 are constant strings and c1 == c2

which implies that the behavior of dart is wrong, and dart2js is correct: "ab" is a constant expression of type String, "a" "b" is a constant expression of type String, and the two strings are equal according to operator ==. (We haven't specified the behavior of operator == on strings, but we hardly want to make those two strings unequal).

I believe that all non-string objects are canonicalized or not in a way which is well-defined: It is specified for the basic forms mentioned above that every constant expression of these forms has a canonicalized value, and constant expressions do not allow for creating composite objects from existing objects, canonicalized or not, other than constant collection literals (e.g., we can't have const C(x)). In particular, we do not have to specify any additional rules saying that the value of a constant variable is canonicalized.

It is unfortunate that different tools behave differently, but it is also likely to be a serious breaking change to change the behavior of any of these tools. So do we wish to proceed and change the specification to make identical() implementation dependent on strings, or do we wish to choose a specific behavior and get that implemented?

@munificent, @lrhn, @leafpetersen, WDYT?

@eernstg eernstg added question Further information is requested specification labels May 26, 2020
@lrhn
Copy link
Member

lrhn commented May 26, 2020

The phrasing of "string canonicalization" has always been vague. In fact, the specification does not require canonicalization, it only requires that identical(c1, c2) behaves in a specific way if c1 and c2 are constant expressions (so the identical invocation is a constant invocation). There is nothing technically prohibiting:

  const a = "s";
  const b = "s";
  print(identical(a, b));  // must be true because `identical(a, b)` is a constant expression
  var c = a;
  print(identical(c, b));  // false.  Not bound by the specification.

We wouldn't want that, but the specification allows it because it doesn't talk about canonicalization of strings at all.

If we read the specification as implicitly defining canonicalization of strings through their behavior wrt identical, then it only states that compile-time constant string expressions must canonicalized their values. (At least if that value reaches a constant identical invocation).

I believe the current VM behavior is that they canonicalize all constant string literals, and they canonicalize the value of constant string expressions in constant contexts. That is const s = "a" "b"; will be canonicalized and var s = "a" "b"; will not. A single literal like "ab" is always canonicalized.
The canonicalization is part of constant evaluation, and the VM does not handle "constant expressions" like "a" + "b" that are not required to be constant in the constant evaluation.

This does ensure that const ["a" "b"] and const ["ab"] become the same object. Any string which is part of a composite constant object will be canonicalized.
It just doesn't guarantee that the element of that list is identical to var x = "a" "b";.

The JS compilation is limited by the underlying platform. A JS program cannot distinguish two strings with the same content in any way, so they effectively canonicalize all string values. We do not want the VM to do the same, so we will definitely need to allow more canonicalization than what the specification requires.

We should probably specify the exact behavior that we require.
If we go for the current VM behavior, then we will have to change the definition of compile-time constant expressions to only include those that need to be constant (it's based on context, not syntax), rather than promising that 1 + 2 is a compile-time constant expression even when it's not needed to be.
I think that's actually an improvement.

A compile-time constant expression is:

  • The initializer expression of a constant variable declaration.
  • A list, map, or set literal preceded by const.
  • A constructor invocation preceded by const or @.
  • The expression of a case clause.
  • Any sub-expression of a compile-time constant expression.

It is a compile-time error if a compile-time constant expression is not one of ....
(listing all the currently approved forms).

Then we can say that:

If two compile-time constant expressions evaluate to instances of String with the same content, they evaluate to the same instance. That is, constant strings are canonicalized.
Strings created by non-constant expressions may be canonicalized, but they are not required to be so. (Compilation to JavaScript effectively canonicalize all strings that are equal.)

@Cat-sushi
Copy link

@eernstg

We haven't specified the behavior of operator == on strings

The API reference seems to show the behavior of operator == on String

https://api.dart.dev/stable/2.8.3/dart-core/String/operator_equals.html

@lrhn
Copy link
Member

lrhn commented May 27, 2020

We have indeed specified both String.== and identical in the library documentation.
The behavior of identical interacts with the language specification which specifies certain things must be identical. If we assume both specifications are correct, then that effectively specifies canonicalization rules (library documentation for identical says it's true if it's the same object, or if it's a num with the same representation, language spec says that certainidentical invocations must be true).

It's still better to specify canonicalization explicitly instead of implicitly. (There have been a lot of different ways to read that part of the spec over time).

@leafpetersen
Copy link
Member

So do we wish to proceed and change the specification to make identical() implementation dependent on strings, or do we wish to choose a specific behavior and get that implemented?

@munificent, @lrhn, @leafpetersen, WDYT?

Neither? Why is this a priority right now?

@munificent
Copy link
Member

It's definitely not a priority. I think it probably came up because we just migrated the language/identity tests which have some tests around strings and identical() that are more precise than the language requires and that I think fail on some platforms.

@eernstg eernstg added the technical-debt Dealing with a part of the language which needs clarification or adjustments label Aug 6, 2021
@eernstg eernstg added the enhanced-const Requests or proposals about enhanced constant expressions label Feb 22, 2024
@eernstg
Copy link
Member Author

eernstg commented Feb 22, 2024

Feb 2024: Note that the original example prints 'true' in many (perhaps all?) configurations including dart, dart compile js, and dart compile exe. This could imply that the discrepancy is disappearing over time, and more constant expressions are actually evaluated using constant evaluation:

void main() {
  var s1 = "ab";
  var s2 = "a" "b"; // or `"a" + "b"`.
  print(identical(s1, s2)); // 'true'.
}

@lrhn
Copy link
Member

lrhn commented Feb 26, 2024

That used to work, but the VM still does not canonicalize var s3 = "a" + "b";, where it does canonicalize const s3 = "a" + "b";.

It's not that "a" + "b" is not a constant expression, it's allowed in void foo([String s = "a" + "b"]){}, it's just a constant expression which is not canonicalized. (Possibly because it's not recognized and evaluated as constant unless necessary.)

@rakudrama
Copy link
Member

The specification uses the term iff, which implies that identical(c1, c2) must be false if none of the listed conditions are met. If c1 and c2 are Strings but not constants, they need to be the same object for the result to be true. Operations that return String values often use words like create or new. The doc for String.fromCharCode says Allocates a new string containing the specified charCode.

This implies that identical(String.fromCharCode(42), String.fromCharCode(42)) returns false.
This is impossible to implement in JavaScript.

One might reasonably expect an optimizing compiler to replace String.fromCharCode(42) with "*". This is not permitted without an unrealistic proof obligation.

We might square the definition of identical() with JavaScript by scrubbing terms like allocate and new from the documentation of operations. They return a String without specifying whether it is created or an existing String.
Still, it would be better for the reader if specification for identical(c1, c2) was explicit about the unreliability (aka platform and optimization dependence) of identity of Strings.

@eernstg
Copy link
Member Author

eernstg commented Apr 30, 2025

That's an interesting example, @rakudrama!

The currently specified behavior with respect to canonicalization is pretty clear: (1) 'constant context' is defined, (2) a <constObjectExpression> yields a canonicalized object (including the case where const is implicit because it occurs in a constant context), (3) a constant collection literal yields a canonicalized object (const can again be implicit), (4) symbol literals and values obtained from Symbol.new are canonicalized, (5) function closurizations, when constant, are canonicalized (including the ones obtained by generic function instantiation), (6) objects of type Type which are obtained as reified types from constant type expressions are canonicalized.

For strings, we have the rule in the specification of identical that you already alluded to, saying that identical(c1, c2) is true if

c1 and c2 are constant strings and c1 == c2

For the phrase 'constant strings', we'll have to assume that it means 'an object whose run-time type is a subtype of String which was obtained as the result of an evaluation of a constant expression'.

This actually means that 'constant strings' do not have to be canonicalized, but if they are not canonicalized then identical must "see through" the representation (of constant strings only!), and just consider the contents.

I don't know if we can always test at run time whether or not any given object is "a constant string" (unless it is canonicalized, in which case they might be allocated in a special memory region—but then we don't need it!).

On top of this, we have this:

This implies that identical(String.fromCharCode(42), String.fromCharCode(42)) returns false.
This is impossible to implement in JavaScript.

I certainly think it would be reasonable to change the documentation of String.fromCharCode to say that it may return an existing object rather than a newly created one. It is already declared as an external factory so there wouldn't be a need to change anything about the language level properties of String.fromCharCode.

Considering all these things, I think it would be a reasonable way ahead to keep the specification unchanged, and achieve the specified treatment of strings by actually canonicalizing all strings that are the value of a constant expression.

I don't think there are any other kinds of constant expressions with a similar issue: They are either inherently canonicalized because they are simple values with a special representation, or they have an explicit const (on the construct itself, or on an enclosing construct). So it's just about strings.

Finally:

@lrhn wrote:

.. [we could] change the definition of compile-time constant expressions to only include those that need to be constant .. based on context

We could consider that, too. It's hard to tell how breaking this change would be (it could certainly cause some identical strings like the result of evaluating "a" "b" and "ab" to become separate objects where they used to be the same object). It might also make certain things run slightly slower (that one is hotly debated these days, though ;-). In any case, I think that's a separate discussion.

@lrhn
Copy link
Member

lrhn commented Apr 30, 2025

This specification, as currently written, is a complete definition of the platform identical function.

It says that the identical function must return true when comparing an object to itself. That's good.
And the "OR"s are inclusive, so the the remaining cases only matter if we have two argunents which are not the same value.

The specification says

... iff
...

  • c1 and c2 are constant strings and c1 == c2, OR

About "constant string", I wouldn't necessarily restrict it to only strings created by a constant expression. Let's say that it definitely includes any string object which is the value of a constant expression, and it's (somehow) a property of the string object. That doesn't preclude that other strings can also be designated as constant strings, even if they are allocated as a "new string" according to docs. (Also, don't trust docs of a factory constructor if it says "new object".)

Then it's not a problem if String.fromCharCode(42) returns a "constant string". Nothing says it cannot, only that if it does, that string must be considered identical to any other constant string that it is equal to. ✓

That makes web compilation not inherently in violation of this specification, it just makes all strings constant strings.

...

I still think it's better to specify canonicalization directly, not indirectly through identical.

Any function which returns a double or int will return the same object for every value every time. There is only at most one object in the runtime for every integer value or double bit-representation.

Any function which returns a string may choose to return an existing string with the same content, rather than create a new string.

If two expressions are evaluated a constant, and they should evaluate to equal strings, they evaluate to the same string.
(Something, something, could be constant and is a literal ⇒ evaluated as constant.)

@eernstg
Copy link
Member Author

eernstg commented May 1, 2025

That makes web compilation not inherently in violation of this specification, it just makes all strings constant strings.

This seems to imply that every expression of type String is a constant expression. I'm not convinced that we want to go there.

However, I suspect that the specified behavior of identical isn't precisely what is implemented.

@kustermann, @mraleph, @rakudrama, is it true that there is no run-time representation of the status of an instance of String-or-a-subtype-thereof as 'a constant string'?

I would assume that identical(o1, o2) could be called at run time with two objects, and they could turn out to have type String, and at that time there would not be a way to tell whether or not they are 'constant objects'.

If this is true then it implies that there is no way to implement identical as specified, unless the run-time representation of strings is modified such that we are able to tell whether they are constant or not. However, in that case I suspect that it would be a better approach to change the spec such that we don't need to answer that question.

One way out would be to say that the specification of identical with strings is modified as follows: identical(c1, c2) is true if

c1 and c2 are strings and c1 == c2

I believe this would be essentially the same thing as making all strings constant strings, except that it avoids introducing something to the notion of 'constant strings' which is not compatible with the existing notion of 'constant expressions'.

This could be costly because it forces identical to have a special branch where it detects that it is working on two strings that are not the same object, and then testing their contents. Note that this branch will be visited on basically every invocation of identical, not just the ones that have actual arguments which are strings.

An alternative (that I'd prefer) would be to ensure more consistently that every object that we wish to consider a 'constant string' is canonical. In that case identical doesn't have to know about strings at all, it just relies on object identity to yield the correct answer for strings (just like most other objects). This makes identical faster. It still allows developers to get the contents based treatment when that's what they need: They just have to write s1 == s2 rather than identical(s1, s2).

In order to specify that the latter approach is used, the specification would be changed such that identical does not have a special rule (they are just covered by the default case, which is that we're comparing an object to itself), and then we'd specify that constant expressions whose value is a string are canonicalized. (And that's constant expressions, not just expressions that occur in a constant context.)

@mraleph
Copy link
Member

mraleph commented May 1, 2025

VM has special bits on constant objects, but I would suggest that we remove anything from the spec that implies special treatment for strings which originated from constant expressions. In JS you can't distinguish two string values if they have the same content even if they are represented as two different objects. So I would strongly encourage to relax the spec for identical(...).

@lrhn
Copy link
Member

lrhn commented May 2, 2025

I actually do want constant strings to be canonicalized, so you know you can depend on identical for them.
(At least we changed the rules for constant sets to use "must not be ==" rather than "must not be same object.)

The question is what a "constant string" is. Anything computed in a context which must be constant (and computed at compile time) is a given, but is a simple string literal like "ab" in a non-constant context included. I'd want it to be, because that allows depending on identity of literals, otherwise you'll need hacks to introduce a constant context just to get the effect. So the question isn't whether something not in a constant context can be const, but how much is. Something like "a" "b", "a$b" where b is a constant variable containing "b", "a" + "b", "a" * 42?

As specified, these are constant expressions, except the last one, String * int is not constant.
We can discuss whether that's the correct level of making constant, but I think it's not unreasonable. (Every one of those strings is one you could just expand manually and write as a literal, at least as long as it doesn't refer to String.fromEnvironment in some way.)

@eernstg
Copy link
Member Author

eernstg commented May 2, 2025

I actually do want constant strings to be canonicalized

Sounds good! This whole description matches exactly what I suggested as the preferred approach here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhanced-const Requests or proposals about enhanced constant expressions question Further information is requested specification technical-debt Dealing with a part of the language which needs clarification or adjustments
Projects
None yet
Development

No branches or pull requests

7 participants