UCRTbase.dll toupper() is 133x slower wall time than perl/msvcrt.dll #23037
Comments
https://bugs.python.org/issue35195 Python identified this problem in 2018. The Python ticket remains open as of Feb 2025. I don't know enough about the architecture/API/design to understand all the comments in the CPython tickets, or to tell whether those 2 tickets contain a proposed fix, a rejected fix, or an unfairly rejected fix. |
UCRT works; many bugs went away when we converted to use it. |
I'm not so worried about the performance of toupper() here, but there are a few other problems with this code:
Fixing all this would eliminate the toupper/isupper() calls, I don't know off-hand what the appropriate Win32 API would be. |
Forgot to add in the OP. Since 5.37.10 and commit 8a548d1 The P5P repo's .t files , and less so CPAN, will call Line 857 in 16196ae
make test . Copy pasted from a GH runner, blead perl has 1.2 million tests.
100K*4ms = 6.6 minutes faster core. 55 milliseconds is 1.7 frames at 30 frames per second. Blead perl currently has a 33,000 OP*s executed timer, before the first time it polls the Win32 GUI loop. It's crazy "link av.obj hv.obj perl.obj /delayload:user32.dll -o perl541.dll" really helps with blead perl core self

So the question now is, does WinPerl selectively replace cherry-picked, problematic, slow libc calls in

Because perl.exe has the choice of which one to call at runtime, they are both available at all times inside a perl process. The call stacks, profiler reports, and my benchmarks show a multiple-orders-of-magnitude performance difference between 2 different implementations of the exact same C standard lib function.

Next question, why is WinPerl even C linking against MS's

Would slurping/looping U8 values 0x00-0xFF, 1x on process start, through MS UCRT's

Nobody can justify enumerating all 250 country codes on earth in a SQL DB/for loop+

You can't upper case an ASCII string if, for each 8 bit character, you post a new job ad on LinkedIn, interview and hire a new developer, agree on a consulting contract and fee schedule, he reads the ASCII char and writes 01000001 with a pen, hands you the paper with 01000001 written on it, you hand him a check for $500, and his employment at your company terminates. He was paid $500 for 15-25 seconds of work. Great company to work for. 5 stars employer. That's what UCRT is doing internally.

3rd possible fix, the most difficult fix, which is beyond my expertise: figure out why

The API docs for

So did perl.exe/perl5xx.dll/perl5porters do something wrong and explicitly disable the cache logic inside ucrtbase.dll? Or is this a bug inside ucrtbase.dll, which only Microsoft can fix, so a member of the public must file a public bug ticket with MS, and MS devs must recompile and publish a new higher build number of ucrtbase.dll? Diagnosing this is beyond scope for me. IDK enough. |
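The idea floated in the comment above, looping the U8 values 0x00-0xFF once at process start through the C library and reusing the results, could look roughly like this. It is only a sketch: the names `upper_table`, `init_upper_table` and `upcase_buf` are hypothetical and are not Perl or UCRT symbols, and the table would need rebuilding after any locale change that is supposed to be honored.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical cache: one toupper() result per possible byte value. */
static unsigned char upper_table[256];

/* Pay the libc/locale cost 256 times, once, instead of once per byte
 * of every string. Call at startup and after a locale change. */
static void init_upper_table(void)
{
    for (int c = 0; c < 256; c++)
        upper_table[c] = (unsigned char)toupper(c);
}

/* Upper-case a buffer in place using only the cached table;
 * no libc or locale calls happen inside this loop. */
static void upcase_buf(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = upper_table[buf[i]];
}
```

Usage would be `init_upper_table();` once, then `upcase_buf(s, len)` on each string. Whether a single-byte table is acceptable for multi-byte code pages is exactly the open question raised in this thread, so treat it as an illustration rather than a fix.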
Maybe my PS: I've spent 3 days searching ReactOS for the limit on U8's per "char" for an "MBCS" code page at the technical MS NLS C API level. I believe
BTW I believe
IDK enough. Maybe this toupper()/isupper() bug has something to do with Perl's newish many-reader/single-writer locking code, which serializes access to the process-global locale across OS threads to prevent races.
What are Perl in C's mandatory requirement for vendor C std lib toupper()/isupper() ? https://en.cppreference.com/w/cpp/string/byte/toupper says no
As you and me both agreed on IRC, there is some really poor quality Win32 only code, inside https://github.com/Perl/perl5/blob/blead/win32/perlhost.h that turns the But I'm less concerned about performance of creating ithread # 2 in a WinOS proc, vs perl interp executing this broken slow
|
Another idea, on WinPerl, is a codebase wide grep 9 stack That branch in If libperl.dll always passes a locale_t as arg 2, that Perl process-wide thread-wide locale settling race bug with WinPerl serializing multi-OS thread access, using a very poor DIY-ed by Perl re-implementation of MS's Slim reader/writer (SRW) API https://learn.microsoft.com/en-us/windows/win32/sync/slim-reader-writer--srw--locks that whole API thing, basically will disappear through macros/etc from WinPerl/libperl.dll, maybe the exported lock variables stay for less than perfect CPAN XS code, but nothing in libperl.dll will ever obtain that serialize lock ever again, And MS UCRT Devs probably can't even see the It doesn't matter in 2025, but IIRC |
We (probably @khwilliamson ) could change perl to use _create_locale() and the Unfortunately _create_locale() doesn't match the behaviour of POSIX newlocale, since you can't modify an existing locale object to mix locales the way you can with newlocale(). To behave the same we'd need to keep separate locale objects for each category, and that has problems for functions that work with more than one locale category (strftime at least), though I believe such mixing is usually a bad idea. But, even if isupper() is 133x slower, how much of an effect does that have on real code? It might be worth benchmarking and profiling related code (regexps?) to see. |
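For illustration, using an explicit locale object instead of the global locale might look like the sketch below. The sentence above is cut off after "_create_locale() and the", so pairing it with `_toupper_l()` is my assumption. The `ctype_*` wrapper names and `upcase_with_locale` are hypothetical; the non-Windows branch uses POSIX `newlocale()`/`toupper_l()` only so the sketch builds elsewhere.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L  /* exposes newlocale()/toupper_l() on POSIX libcs */
#endif
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

#ifdef _WIN32
typedef _locale_t ctype_locale;                 /* hypothetical alias */
#define ctype_locale_new(name) _create_locale(LC_CTYPE, (name))
#define ctype_locale_free(l)   _free_locale(l)
#define ctype_toupper(c, l)    _toupper_l((c), (l))
#else
typedef locale_t ctype_locale;                  /* hypothetical alias */
#define ctype_locale_new(name) newlocale(LC_CTYPE_MASK, (name), (locale_t)0)
#define ctype_locale_free(l)   freelocale(l)
#define ctype_toupper(c, l)    toupper_l((c), (l))
#endif

/* Upper-case buf in place using an explicit locale object rather than
 * the process/thread global locale. Returns 0 on success, -1 if the
 * locale object could not be created. A real caller would create the
 * object once and reuse it instead of rebuilding it per call. */
static int upcase_with_locale(unsigned char *buf, size_t len,
                              const char *locale_name)
{
    assert(buf != NULL && locale_name != NULL);
    ctype_locale loc = ctype_locale_new(locale_name);
    if (loc == (ctype_locale)0)
        return -1;
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)ctype_toupper(buf[i], loc);
    ctype_locale_free(loc);
    return 0;
}
```

This only covers a single category (LC_CTYPE); the mixing problem described above for functions such as strftime is not addressed here.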
I didn't really look into this at all, but if there's a performance regression, perhaps it would make sense to report it to Microsoft? Technically, UCRT is part of the OS and reporting a bug in Windows requires a support contract (well, there's also the Feedback Hub, but no one reads it...). However, there's a workaround: UCRT bugs reported on Visual Studio Developer Community are often forwarded to the Windows team. That said, I imagine performance regressions are low priority for them. |
Some links https://learn.microsoft.com/en-us/cpp/c-runtime-library/recommendations-for-choosing-between-functions-and-macros?view=msvc-170
some limited quotes from UCRT headers,
There is more questionable MS code in the UCRT .cpp files that is important to look at. MS Devs occasionally put comments saying why they did things the way they did, or what end user hazard they were coding around were. But the UCRT/MSVC compiler has a source available license, not a FOSS license, so I'd rather not copy paste large methods/function bodies into this archival GH ticket. Steve Hay, Tonyc, etc I know all of you have the UCRT src code on your systems just like I do. Notice Also WinPerl use |
Do you have www links to the posix or msdn or p5p GH git repo APIs ? Follow Sarathy's principles for WinPerl, "POSIX" can't and never will exist inside WinPerl. Either P5P fakes it, or MS CRT fakes it. All of "libc" features/vm state are fictional and only exist inside 1 winperl process address space. There is no interop/ipc/IO communication with other procs using POSIX [tokens] APIs. Either P5P or the CRT, always converted the posix-ese into Kernel32.dll-exe. I'm wondering if WinPerl do
So that looks to me like WinOS and the concept of locales run very deep into Ring 0 in Windows. Basically MS designed it so the text debugging log of a kernel sound card driver will change the thousands separator character in a sys admin debug log file within 1000 microseconds/1 millisecond/1 video frame after OS wide
Maybe UCRT needs compliance with this, but Perl and PP state don't. Due to lack of knowledge, IDK exactly what Perl is trying to fix in WinPerl, and who the consumer/end user of the fix is. And is the "fix" for actual observed defects in Perl end user production code, or is the 5.36/5.38/5.40 locale safety/serialization code trying to fix a
My generic instinct says, since APIs like
That's the reason I keep thinking
I caught this in a profiler attached to
I'd have to recompile perl.dll with a bunch of
There might be a P5P bug where PP, CPAN XS, or p5p C code is somehow passing the locale "en-US; English; United States; NY; New York; 6 Empire Plaza; Suite 715;", which is "legal" but a denormal locale setting that has to be parsed/normalized in an O(n) loop against the NLS/Registry master disk data each time, to get it down to "en-US; English; United States;". IDK enough. |
other things I forgot, UCRT's 2nd, this is a comment from the CRT src code. I can't copy paste the whole thing here, but this comment is really important.
leaf function in locale usage terms My reading of that means its P5P's bug and P5P's defect for failing to
or in PerlXS API p5p is doing on a
|
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale?view=msvc-170 I'd have to benchmark it to prove it, but I have a suspicion all of UCRT's internal cache code regarding parsing locale wide or ascii string names and its If UCRT's internal TLS structs don't "match", the ultimate source of truth for POSIX locales/is_/to_ after all the bloat and very badly written C++ classes, UCRT will be getting the truth for is_/to_ from kernel32's Since the concepts of Latin-1/7 bit ASCII/mbcs/8 bit characters, don't exist on Windows OS, except through the opaque casting function called |
Similar src code problem in mingw 8.3.0/Strawberry 5.32's bundled gcc. Reading Mingw's "ASCII to Wide" code, I'm seeing the same O(n^2)-ness that UCRT is doing: each byte of the for loop is implemented as a dozen calls into kernel32.dll, which is a dozen TLS calls in the best case, and in the worst case 100s of TLS calls and 100s of files that need to be decompressed and parsed. IsDBCSLeadByteEx() calls NlsValidateLocale() https://github.com/wine-mirror/wine/blob/8d40da7ffda5e8dde9200f733e7d2cebf0196bc3/dlls/kernelbase/locale.c#L723 and NlsValidateLocale() sets off a chain of decompressing and parsing a couple 100/1000 locales on disk/in an mmap, with 100s/1000s of permutations in there. Update: the byte-by-byte IsDBCSLeadByteEx() algorithm seems to be copy-pasted from SO https://stackoverflow.com/a/27196334 and is probably repeated over and over by college students or young people on certain unreliable tech social media sites, known for anonymous users, trying to ChatGPT their way to the next account ribbon/flair. |
maybe related to this ticket https://bugs.python.org/issue7442 https://bugs.python.org/issue31900 https://vstinner.github.io/python3-locales-encodings.html general tech prose, probably not related to whatever im benchmarking |
More complaints about this particular WinPerl bug, and complaints about MS's CRT .dll files of any age: https://stackoverflow.com/questions/36686381/windows-c-runtime-toupper-slow-when-locale-set

I am starting to believe this WinPerl bug is "unfixable" on Windows OS. That is after another attempt to read UCRT's src code and understand it, and after seeing multiple other C/C++ FOSS projects in the 2010s that "end user" visibly deprecated and ripped out 99% to 100% of their runtime executions of MS's setlocale()/wsetlocale() functions, as a stupid quick and easy fix/closure of their

I think I found a src code comment in the UCRT where an MS dev intentionally "temporarily" [forever] commented out the C/C++ method call that pushes/validates/resyncs the new ISO C ASCII locale string into the UCRT's locale cache logic C++ object/class instance. I don't think MS will ever uncomment that source code line, and that C++ method/linker C++ symbol is optimized out in an actual ucrtbase.dll file. That symbol's ASCII string name isn't inside the ucrtbase.pdb file either, which suggests the method call is optimized out and doesn't exist in ucrtbase.dll.

More specifically, the UCRT has a non exported global var called

I'm going to exclude thinking about any 100% certified-mentally-insane C code implementing "MS C89" object/class destructor methods here that execute as part of a C89 SEH catch frame swallowing and throwing a SEGV/SIGILL event object. Even if there is, somewhere in ucrtbase.dll's machine code, an "InterX(FALSE);" statement in a SEGV SEH catch frame aka C++ "smart_ptr", that CPU op code in ucrtbase.dll is unreachable from WinPerl's viewpoint.

More specifically, b/c

That src code comment, and that commented out method that was supposed to do

If MS devs "temporarily" commented out that method in what I think is the UCRT from the VC 2015/2013 era, and src code traces of that method vanished by the VC 2022 era, MS staff devs will never reverse their decision. It wasn't an accident by a random MS staff member. It was a slow, very careful, very intentional decision, done by at least 2 different people, separated by at least 12 months. IDK exactly why MS devs disabled and deleted that code that was supposed to execute a
The last line's remark explains the problem. Very slowly read https://en.cppreference.com/w/c/locale/setlocale and https://en.cppreference.com/w/c/locale/LC_categories or the actual C23 .pdf spec. The ISO TC wrote down and officially specified the C grammar token
The ISO TC never documented arg number # 2 in LibC's Public API
TLDR Parody: "Did you know all 3 compilers, Clang, GCC, and MSVC will const-fold away ISO C's definition of
Long version:
or
@bulk88 wrote ^^^^ because setlocale() can't be documented. Even Microsoft Regarding WinPerl's choices, if MS UCRT's setlocale()/wsetlocale() fn call, is
Like really? you want to change the STDOUT character encoding or your C locale state, 17 times while your/our FOSS software rasterizes 8 byte value I (@bulk88) has seen on SO or Reddit or GH someone with a legit design reason to do ^^^^ and they used mentioned Yes, Historical DATE STAMPED LOCALES make perfect sense to me. But that requirement sounds 2^48 API layers higher than what So unless Khw/Tony/etc have a different idea, I can't think of another answer for WinPerl, beyond Choice 1 reimplement Choice 2 remove 100% of calls to MS's UCRT's setlocale() and do something else. The word "interop" is N/A, we just don't care what "C" does, This is the Purl 5.0 .git repo on GitHub, Choice 3 do a compromise deal with MS's UCRT's setlocale() and implement with a polling timer between set between 1-30 millisecond. Zero chance of breaking 3rd party legit production GUI apps. No multithreaded legit TUI/GUI process can draw > 30 frames per second anyway. How can a human even see Unicode/UTF-8 text written at >= 30 FPS to a shell console terminal window? Windows shell or Linux shell, same question! WinPerl and this WinPerl Bug, I don't think Choice 3 is correct and don't think it applies here, for WinPerl's problem, If I understand what that P_something_SQL_something library was fixing in C code. Choice 3 was done by a project maintainer from the P_something_SQL_something project on GH, and he closed the GH bug claiming 5 Someone's shell/console running @ 30 FPS refers to this https://en.wikipedia.org/wiki/ANSI_escape_code I couldn't figure out what that Win32-only polling timer fix in the P_something_SQL_something library was exactly fixing. I think that C library wanted to READ AND PARSE the ASCII string return by MS's UCRT's setlocale(), because they have their own DIY-ed vsnprintf() implementation, and they need correctly format their ASCII date stamps in their tracing log files. 
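As a rough illustration of what "Choice 3" (a 1-30 millisecond polling timer) could mean in C: cache the string returned by a query-only `setlocale(LC_ALL, NULL)` and re-query only once the cache is older than the polling interval. This is not the P_something_SQL_something library's code, which I have not seen. Every name here (`LOCALE_POLL_MS`, `current_locale_at`, `current_locale_throttled`, `real_queries`) is hypothetical, and the sketch is not thread-safe.

```c
#include <locale.h>
#include <string.h>
#include <time.h>

#define LOCALE_POLL_MS 30.0   /* hypothetical: upper end of the 1-30 ms range */

static char          cached_locale[256];
static double        cached_at_ms = -1.0;  /* negative: nothing cached yet */
static unsigned long real_queries;         /* how often libc was really asked */

/* Core logic with the clock passed in, so it can be exercised
 * deterministically. Not thread-safe. */
static const char *current_locale_at(double now_ms)
{
    if (cached_at_ms < 0.0 || now_ms - cached_at_ms >= LOCALE_POLL_MS) {
        /* NULL as arg 2 only queries; it does not change the locale. */
        const char *cur = setlocale(LC_ALL, NULL);
        strncpy(cached_locale, cur ? cur : "C", sizeof cached_locale - 1);
        cached_locale[sizeof cached_locale - 1] = '\0';
        cached_at_ms = now_ms;
        real_queries++;
    }
    return cached_locale;
}

/* Convenience wrapper using the C11 wall clock. */
static const char *current_locale_throttled(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return current_locale_at(ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6);
}
```

A hot loop would call `current_locale_throttled()` and hit libc at most once per interval. The trade-off is that a locale change made by another thread can go unnoticed for up to `LOCALE_POLL_MS`, and `TIME_UTC` is wall-clock time rather than a monotonic clock.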
Correct if wrong, WinPerl/libperl.so attempting to READ AND PARSE the ASCII string returned from Linux's/AnyOS's/MS's UCRT's The Perl Interp is proud to be born in California, United States of America. Is "locale" a word from French or Italian? Perl only speaks English. The Perl lang/grammer/interp core/PP is documented in https://perldoc.perl.org/functions/sprintf as locale unaware, Reading Not a jk, blead/stable WinPerl I am going to pretend JHI never introduce x86 assembly language into Perl in 5.23.6 in year 2015, example at 572cd85 But IDK I keep thinking dmerphq authored all the code that looks like
in core, and he did that somewhere late 5.1X era, not in 5.2X era, and I'm not going git blame anymore, why did LibPerl very recently learn to speak Assembler when speaking Assembler is taboo. ISO C has never defined the bit patterns that are created after you call Real life/Production Perl problems, the project maintainers/authors of GCC/Clang/MSVC, can't decide The ISO C TC very carefully engineering the C spec, that C grammer tokens float/double are plug-n-play abstract base classes, not implemented by C. Please contact your Operating System Vendor to mail you a copy of https://man7.org/linux/man-pages/man3/dlsym.3.html if you still have any technical questions about C grammer tokens GCC devs and LLVM devs have multiple times, created their own bugs in both C compilers, because ARM LLC (United Kingdom) has NEVER DOCUMENTED the CPU opcodes on how to do floating pointing math on the ARM CPU. ARM LLC if they feel generous will email 3-5 .pdf files if you ask that question, but that isn't exactly true. Linux Kernel devs have reverse engineered 57 different ABI ARM ISA .pdf files so far, see https://github.com/torvalds/linux/tree/master/arch/arm GCC/LLVM bug trackers have a class of head banging macro soup bugs, because ARM is the 1 and only "CPU arch" inside Clang/LLVM, with more than 1 hardware FP math instruction under 1 So ARM LLC only makes $$$ selling website content editing services since 1994. when ARM LLC accidentally open sourced its own CPU, and later on, Samsung/Google/Apple pirated Arm's CPU, and then Sam/Goog/Apple decided to donate charity hush $$$ to ARM LLC after stealing the ARM src code, and accepting that hush $$$ is a very good deal , compared to ARM LLC litigating vs Samsung/Apple. It didn't work for https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc. or https://en.wikipedia.org/wiki/SCO_Group,_Inc._v._International_Business_Machines_Corp. 
Why is attempt # 3 doing the same thing, but now Arm LLC Vs Google/Apple/Samsung/Amazon LLC going to work? Even Microsoft decided to play Pirates-n-Warez and won't read ARM's DMCA notices anymore, example at https://github.com/chsasank/ARM7 Ubuntu and Debian's software engineers payed an electrical engineer to reverse engineer the Arm CPU chip on a Samsung Galaxy or Google Nexus/Pixel phone, or just used a large rock to smash a Pixel/Galaxy phone until they learn which ARM CPU ISA .pdf to read before editing Linux Makefile that builds the Ubuntu for ARM32 .iso file. See https://docs.kernel.org/arch/arm/index.html#soc-specific-documents and all 57 ARM ISAs at https://github.com/torvalds/linux/tree/master/arch/arm . Canonical LLC will give exactly 1 C compile attempt, to select the correct ARM CPU identifier, as a interview question to a potential C developer. Just 1 try. There are 57 macros to pick. There is no 2nd attempty since that person is unsuitable for the position :-) If a ARM C developer can't do their job, perhaps they should rewrite their resume to say "x86 C dev", Intel hasn't edited their 1 and only ISA The last edit done by Intel on their x86 platform was in 1996, with Pentium Pro adding CMOV instruction. Nothing has changed between 1996 and 2025 for writing executable machine code for x86-32 on A5 paper with a ink pen. SMID/SSE/Vector/AVX, AVX512/bit product opcodes, aren't part of Intel's x86 ISA.. The feature SSE/Vector/AVX, AVX512, always are prefixed with opcode 0xF0 which was documented a "undefined behavior CPU opcode, reserved for end user to purchase a non-intel 3rd party CPUs and use opcode 0xf0 prefix to send string of random hex bytes from their Official Intel CPU, to a random non-intel 3rd mfged CPU chip plugged into the 2nd CPU socket. " This feature in real life is where all SSE/Vector/AVX, AVX512, live, behind a single x86 0xf0 opcode. The rest is UB to discuss where. 
Until KHW decides which of the choices he wants to suggest, or I see some FOSS project's fix for MS setlocale() vs OS threads vs 1 address space vs MS's "we wrote MSVC 2022's UCRTbase.dll to strictly conform with the ISO C23 document, and unit tests to prove MSVC's implementation of setlocale() passes all unit tests published by the ISO C23 organization. If you don't like
Either WinPerl does the 30 millisecond polling timer trick, or converts to 100% clean usage of Arg 2/3 of type
|
#21611 is probably the same UCRT bug I am describing here. There is analysis/workup in there that comes to the same conclusion as I did above. |
Module:
Description
A certain profiling call stack caught my eye, and the final report from my profiler said 8% of all CPU time of perl is spent inside isupper()/toupper() from ucrtbase.dll. These float between place 4 and place 8 as the highest CPU hogs on random core .t'es. upper() reaching # 1 was jaw dropping, hence I investigated. Some research: this is 1 call about 1 U8. BTW, ::LocaleUpdate has 6 FlsGetValue calls (wrapped with glerr preserving), toupper() fires ::LocaleUpdate() every time, errno in ucrt adds another 4-5 FlsGetValue calls, and __acrt_LCMapStringA() fires ::LocaleUpdate again,
soon after
a few cpu ins addrs later (remember lines of code have loops)
kernelbase.dll tries building a tree of nodes or iterating all country codes on earth, data being searched by KernelBase.dll!GetNamedLocaleHashNode looks like
but this is raw memory with unprintables regexped out, i think its country codes but im not going rev eng it
Benchmarks: it's horrible.
With pseudo-threads on 3 cores; IDK enough to tell whether this is scaling, or lock contention on the Perl side or the MS side.
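The benchmark script and its numbers did not survive in this copy of the ticket. Purely as an assumption of what a comparable C-level measurement could look like, here is a minimal timing harness around libc's `toupper()`. The names `upcase_libc` and `time_toupper` are hypothetical, and it does not reproduce the 133x figure by itself.

```c
#include <ctype.h>
#include <stddef.h>
#include <time.h>

/* Upper-case every byte through the C library. Returns how many bytes
 * changed, so the compiler cannot discard the loop. */
static size_t upcase_libc(unsigned char *buf, size_t len)
{
    size_t changed = 0;
    for (size_t i = 0; i < len; i++) {
        int up = toupper(buf[i]);
        if (up != buf[i]) {
            buf[i] = (unsigned char)up;
            changed++;
        }
    }
    return changed;
}

/* Refill a buffer with lowercase ASCII and upper-case it `rounds`
 * times. Returns elapsed clock() time in seconds; the total number of
 * changed bytes is stored in *changed_out. */
static double time_toupper(size_t len, int rounds, size_t *changed_out)
{
    static unsigned char buf[4096];
    size_t changed = 0;
    if (len > sizeof buf)
        len = sizeof buf;

    clock_t start = clock();
    for (int r = 0; r < rounds; r++) {
        for (size_t i = 0; i < len; i++)
            buf[i] = (unsigned char)('a' + i % 26);
        changed += upcase_libc(buf, len);
    }
    clock_t end = clock();

    *changed_out = changed;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To compare runtimes, the same source would be built once against ucrtbase.dll and once against msvcrt.dll, with and without a prior `setlocale(LC_ALL, "")`, since the Stack Overflow report linked in this thread says the slowdown appears when a locale is set. Note that `clock()` reports wall time on the MS CRTs but CPU time on POSIX systems.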
Steps to Reproduce
Expected behavior
Half joke half serious, but remove UCRT from default build config win perl and link against msvcrt.dll.
Perl configuration