-
Notifications
You must be signed in to change notification settings - Fork 1.9k
added in standard conversions from types to ReadOnlyMemory<char> #5106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
After syncing with @harishsk, we thought that since all C# types natively support a |
@michaelgsharp To clarify: Once this PR goes in, we can remove the ToStringTransformer from Microsoft.ML.Featurizers and use the existing TypeConvertingTransformer. Is that correct? |
@michaelgsharp Please look at the test failures. You may need to update the baselines as a result of this change. |
You are correct Harish, once this goes in we can remove the ToStringTransformer. |
@@ -121,7 +121,7 @@ private sealed class TestStringClass | |||
public string A; | |||
} | |||
|
|||
[Fact] | |||
[Fact, TestCategory("RunSpecificTest")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be reverted? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should. Reverting now. #Resolved
Codecov Report
@@ Coverage Diff @@
## master #5106 +/- ##
==========================================
- Coverage 75.61% 75.56% -0.05%
==========================================
Files 993 995 +2
Lines 178514 179337 +823
Branches 19191 19294 +103
==========================================
+ Hits 134979 135523 +544
- Misses 38302 38554 +252
- Partials 5233 5260 +27
|
Can you also please add a note in docs\code\IDataViewTypeSystem.md that we conversions to string are now supported? |
public void Convert(in BL src, ref TX dst) => dst = src.ToString().AsMemory(); | ||
public void Convert(in TS src, ref TX dst) => dst = string.Format("{0:c}", src).AsMemory(); | ||
public void Convert(in DT src, ref TX dst) => string.Format("{0:o}", src).AsMemory(); | ||
public void Convert(in DZ src, ref TX dst) => string.Format("{0:o}", src).AsMemory(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation behind this change?
My concern here is this could potentially increase garbage collector pressure, however, in the case of a transformer you could control that pressure by re-using the buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Microsoft.ML.Featurizers has a few transformers that have overlapping functionality with transformers that are already present in ML.NET. The motivation for this change is to bring in the those featurizers and merge their functionality with existing transformers. The ToStringTransformer only implements conversion from basic data types to string. So we are looking to merge it with the more general TypeConvertingTransformer by adding string support for it.
In a transformer, can we truly reuse the buffer? Wouldn't we hand over the reference to the string back to the caller after the conversion?
Do you see this conversion function being in the hot path to cause garbage collector pressure?
Or do you think this garbage collection concern is serious enough that we should have string conversion being done in a separate transformer with a separate API?
In reply to: 423448756 [](ancestors = 423448756)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller hands over a reference to the transfomer which is why the getters of the transformers take output as ref
type, for example look at the getter for ImageLoader Transform. We reuse buffer and keep the GC pressure low, this is a fundamental principle in ML .NET that gives all the performance benefits that have been outline in KDD'2019 publication. With strings you will try to reuse the buffer when you can and only expand when needed using ReadOnlyMemory.
I have a reason to believe this can be a hot path, may be if you can share benchmarks to show me this isn't by actually running performance tests by using several string conversions using this PR vs using transformer where you can reuse the buffer then it would be great. The last time we overhauled the type system we did extensive benchmarks and testing that you can see here. I expect the same anytime type system is changed or updated.
CC: @eerhardt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Zeeshan,
In this case the caller hands over a ReadOnlyMemory<char>
reference. From my understanding, since its readonly, we cannot modify it or the underlying buffer, and must allocate a new one. I also looked at the documentation for ReadOnlyMemory and couldn't see any way to modify it at all. Do you know of a way to modify it? Or do you know if we have any examples in code I can use as a reference? I was unable to find anything.
Assuming we can't modify it and need to return a new instance, I did some testing converting an int to a string (not in ML.Net) and then returning a ReadOnlyMemory<char>
pointing to it.
The first way took a ref ReadOnlyMemory<char>
and assigned it to value.ToString().AsMemory()
.
for (int x = 0; x < NumberOfIterations; x++)
{
// readOnlyMemory is passed in by ref
readOnlyMemory = x.ToString().AsMemory();
}
The second way was to also take a ref ReadOnlyMemory<char>
but also a ref StringBuilder
and use the StringBuilder to create the string and memory.
for (int x = 0; x < NumberOfIterations; x++)
{
// stringBuilder and readOnlyMemory are passed in by ref.
// the stringBuilder is using Clear, which should keep the underlying buffer
stringBuilder.Clear();
stringBuilder.Append(x);
readOnlyMemory = stringBuilder.ToString().AsMemory();
}
In Release mode with 214 million allocations, the first approach of ToString().AsMemory()
took about 6 seconds on average where the second approach took about 8 seconds on average. Unless there is another faster way, which I couldn't find, the fastest way to convert an int to ReadOnlyMemory<char>
is the first approach. Each approach also used about 13 MB of memory total.
I have also created a benchmark in ML.Net that uses the first approach to create the ReadOnlyMemory<char>
. It ran 10 million rows in 1.25 seconds, and the entire test only used 30 MB of memory (I am assuming the majority of this memory is from ML.Net itself as the first tests I did had more rows and used less memory). I have not done any other benchmarks tests as of yet.
@codemzs does this address your concern? Or is there something else I can do to address it? Or do you know a better way to deal with the ReadOnlyMemory<char>
?
@eerhardt do you know of a better way to handle the `ReadOnlyMemory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With ML.NET's current infrastructure, this is probably as good as you can make it. But in later versions of .NET Core (if we wanted to start targeting them), you could reshare the buffers in some cases by using the following methods:
https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.memorymarshal.trygetarray
and
https://docs.microsoft.com/en-us/dotnet/api/system.int32.tryformat
The idea being something like this:
public void Convert(in I4 src, ref TX dst)
{
if (MemoryMarshal.TryGetArray(dst, out ArraySegment<char> segment))
{
if (src.TryFormat(segment.Array, out int charsWritten))
{
dst = segment.Array.AsMemory(0, charsWritten);
return;
}
}
char[] newBuffer = new char[11]; // 11 is the most characters needed to represent all 32-bit ints
src.TryFormat(newBuffer, out int charsWritten);
dst = newBuffer.AsMemory(0, charsWritten);
}
What's dangerous about this approach is that dst
might be someone else's array that we are clobbering over in order to reuse it. To prevent that, we could write a little sentinel value at the beginning of the array to mark it as ours. Say something like 2 char.MaxValue
elements. And then we could check if the array we got from MemoryMarshal.TryGetArray
was "ours" by checking the sentinel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we want to use the StringBuilder overloads directly above this block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case the conversion being sought is from some data type like int
to string
and not to StringBuilder
. If we wanted to use the above conversions, we would have to create a local StringBuilder
convert the given data type to StringBuilder
and then convert it back to string
. @michaelgsharp has discovered this method is slower.
In reply to: 428265503 [](ancestors = 428265503)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that those APIs are supported on .NET Core 3.1 and 2.1 which ML.NET supports but not supported on netfx461 which ML.NET needs.
In that case, is it okay to go ahead with this PR until those APIs are fully supported on all ML.NET configs?
In reply to: 428262800 [](ancestors = 428262800)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently in the standard transformations we don't support converting types to a string representation. Since all C# types support
ToString()
functionality, this PR adds in standard transformations to typeReadOnlyMemory<char>
into theTypeConvertingTransformer
.