Skip to content

UCRTbase.dll toupper() is 133x slower wall time than perl/msvcrt.dll #23037

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
bulk88 opened this issue Feb 27, 2025 · 16 comments
Open

UCRTbase.dll toupper() is 133x slower wall time than perl/msvcrt.dll #23037

bulk88 opened this issue Feb 27, 2025 · 16 comments
Assignees

Comments

@bulk88
Copy link
Contributor

bulk88 commented Feb 27, 2025

Module:

Description
A certain profiling call stack caught my eye and the final report from my profiler said 8% of all cpu time of perl is spent inside. isupper()/toupper() from ucrtbase.dll, these are floating between place 4- place 8 as highest CPU hogs on random core .t'es. upper() Reaching # 1 was jaw dropping. Hence I investigated.

Image

Image

some research this is 1 call about 1 U8 BTW, ::LocaleUpdate has 6 FlsGetValue calls (wraped with glerr preserving), toupper() fires::LocaleUpdate() every time, errorno in ucrt added another 4-5 FLSGV calls __acrt_LCMapStringA�() fires ::LocaleUpdate again ,

	KernelBase.dll!LCMapStringEx�()	Unknown
 kernel32.dll!LCMapStringExStub�()	Unknow
ucrtbase.dll!__acrt_LCMapStringA_stat()	Unknown
 ucrtbase.dll!__acrt_LCMapStringA�()	Unknown
 ucrtbase.dll!toupper�()	Unknown

soon after

KernelBase.dll!DefaultSortVersion()	Unknown
 KernelBase.dll!VersionValue�()	Unknown
 	KernelBase.dll!SortChangeCase�()	Unknown 
  KernelBase.dll!LCMapStringEx�()	Unknown

a few cpu ins addrs later (remember lines of code have loops)

MinSortChangeCase�()
KernelBase.dll!GetNamedLocaleHashNode()	Unknown
KernelBase.dll!SortChangeCase�()	Unknown
KernelBase.dll!LCMapStringEx�()	Unknown
kernel32.dll!LCMapStringExStub�()	Unknown
ucrtbase.dll!__acrt_LCMapStringA_stat()	Unknown
ucrtbase.dll!__acrt_LCMapStringA�()	Unknown
ucrtbase.dll!toupper�()	Unknown

kernelbase.dll tries building a tree of nodes or iterating all country codes on earth, data being searched by KernelBase.dll!GetNamedLocaleHashNode looks like

but this is raw memory with unprintables regexped out, i think its country codes but im not going rev eng it

benchmarks its horrible

C:\sources\crtslow\CRTSlow>perl -Mblib -MCRTSlow -e"CRT::Be();";
cache wake                       78063695 us Ln 337
tolower                        2440267443 us Ln 341
_tolower                         18608920 us Ln 345
toLOWER_A                        17415609 us Ln 349
toLOWER_L1                       17737359 us Ln 353
isgraph                        2562596668 us Ln 357
isGRAPH_A                        18712061 us Ln 361
isGRAPH_L1                       17790779 us Ln 365
isalnum                        2520004815 us Ln 369
isALPHANUMERIC_A                 18165949 us Ln 373
isALPHANUMERIC_L1                18033632 us Ln 377
isalnum msvcrt                  111663798 us Ln 392
isalnum msvcr100                 98945415 us Ln 402
isalnum msvcr120                 97687178 us Ln 412
C:\sources\crtslow\CRTSlow>

with psudo threads 3 cores, idk enough if this is scaling or lock contention perl side or ms side is happening

C:\sources\crtslow\CRTSlow>perl -Mblib -MCRTSlow -e"$r = fork(); $t = fork(); ex
it if !$t && $r;  CRT::Be();";
cache wake                       32082575 us Ln 337
cache wake                       33263148 us Ln 337
cache wake                       40918586 us Ln 337
tolower                        3157700393 us Ln 341
tolower                        3165074351 us Ln 341
_tolower                         22321171 us Ln 345
_tolower                         20705845 us Ln 345
toLOWER_A                        22430065 us Ln 349
toLOWER_A                        20589965 us Ln 349
toLOWER_L1                       22076263 us Ln 353
toLOWER_L1                       23926635 us Ln 353
tolower                        3348870216 us Ln 341
_tolower                         23873216 us Ln 345
toLOWER_A                        24134972 us Ln 349
toLOWER_L1                       23457776 us Ln 353
isgraph                        3397327541 us Ln 357
isGRAPH_A                        20434638 us Ln 361
isGRAPH_L1                       24083196 us Ln 365
isgraph                        3679713786 us Ln 357
isgraph                        3694650315 us Ln 357
isGRAPH_A                        24393851 us Ln 361
isGRAPH_A                        25253907 us Ln 361
isGRAPH_L1                       18637274 us Ln 365
isGRAPH_L1                       23057951 us Ln 365
isalnum                        3123988931 us Ln 369
isALPHANUMERIC_A                 23420793 us Ln 373
isALPHANUMERIC_L1                20040976 us Ln 377
isalnum                        3220601143 us Ln 369
isalnum                        3333259367 us Ln 369
isALPHANUMERIC_A                 26456259 us Ln 373
isALPHANUMERIC_A                 25220212 us Ln 373
isALPHANUMERIC_L1                19804287 us Ln 377
isALPHANUMERIC_L1                26882794 us Ln 377
isalnum msvcrt                  144513972 us Ln 392
isalnum msvcrt                  151367705 us Ln 392
isalnum msvcrt                  151673430 us Ln 392
isalnum msvcr100                133158199 us Ln 402
isalnum msvcr100                124220690 us Ln 402
isalnum msvcr100                123605543 us Ln 402
isalnum msvcr120                129325137 us Ln 412
isalnum msvcr120                118722581 us Ln 412
isalnum msvcr120                121761334 us Ln 412
C:\sources\crtslow\CRTSlow>

Steps to Reproduce

#define PERL_NO_GET_CONTEXT
#define  WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"


/* Global Data */
LARGE_INTEGER Frequency = { 0 };
#define g_Frequency Frequency

START_MY_CXT


/* BTIME = BENCH TIME*/
#define BTIMESTART do { \
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds; NV nv1; NV nv2; \
    QueryPerformanceCounter(&StartingTime)

#define BTIMEEND(label) \
    QueryPerformanceCounter(&EndingTime); \
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart; \
    ElapsedMicroseconds.QuadPart *= 1000000000; \
    ElapsedMicroseconds.QuadPart = ElapsedMicroseconds.QuadPart           \
    /((LARGE_INTEGER*)(&Frequency))->QuadPart; \
    printf("%-30s %10I64u us Ln %u\n", label, ElapsedMicroseconds.QuadPart, __LINE__); \
} while(0)

#define VP(_vp) ((size_t)(_vp))
#define CP(_cp) ((char *)(_cp))
#define VPP(_p) ((void**)(_p))
#define LST for(n=0; n < 10; n++) { p = CP(low); while(p < CP(hi)){ c = *p;
#define LEND  c1 += r; p++;}   }
#  ifndef MIN
#    define MIN(a,b) ((a) < (b) ? (a) : (b))
#  endif
#  ifndef MAX
#    define MAX(a,b) ((a) > (b) ? (a) : (b))
#  endif

MODULE = CRT		PACKAGE = CRT

void
Be()
PPCODE:
  PUTBACK;
  HMODULE h_orig;
  GetModuleHandleExW(
  GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
  &PL_No[0],
    &h_orig
  );
      const IMAGE_DOS_HEADER* h = (const IMAGE_DOS_HEADER*)h_orig;
      PIMAGE_NT_HEADERS nt_header = (PIMAGE_NT_HEADERS const)(VP(h) + VP(h->e_lfanew));
	WORD NumberOfSections = nt_header->FileHeader.NumberOfSections;
  /* IMAGE_FIRST_SECTION() macro is universal, and corrects for OPTIONAL32 vs OPTIONAL64 */
	const IMAGE_SECTION_HEADER * sec = IMAGE_FIRST_SECTION(nt_header);
	const IMAGE_SECTION_HEADER * secEnd = sec + NumberOfSections;
    char * low =~0;
  char * hi = 0;
    char * lowi;
  char * hii;
  U32 r;
  U32 c1 = 0;
  U8 c;
  char * p;
  int n =0;
	for(; sec < secEnd ; sec++) {
    lowi = VP(h) + VP(sec->VirtualAddress);
    hii = lowi + VP(sec->SizeOfRawData);
   low = MIN(low,lowi);
   hi = MAX(hi,hii);
  }
 BTIMESTART;
LST r = c; LEND
BTIMEEND("cache wake");
BTIMESTART; LST
    tolower(c);
LEND
BTIMEEND("tolower");
BTIMESTART; LST
    _tolower(c);
    LEND
      BTIMEEND("_tolower");
    BTIMESTART; LST
    toLOWER_A(c);
    LEND
      BTIMEEND("toLOWER_A");
    BTIMESTART; LST
    toLOWER_L1(c);
    LEND
      BTIMEEND("toLOWER_L1");
    BTIMESTART; LST
    isgraph(c);
    LEND
      BTIMEEND("isgraph");
    BTIMESTART; LST
    isGRAPH_A(c);
    LEND
      BTIMEEND("isGRAPH_A");
    BTIMESTART; LST
    isGRAPH_L1(c);
    LEND
      BTIMEEND("isGRAPH_L1");
    BTIMESTART; LST
    isalnum(c);
    LEND
      BTIMEEND("isalnum");
    BTIMESTART; LST
    isALPHANUMERIC_A(c);
    LEND
      BTIMEEND("isALPHANUMERIC_A");
    BTIMESTART; LST
    isALPHANUMERIC_L1(c);
    LEND
      BTIMEEND("isALPHANUMERIC_L1");
  const char * dllname = 0;
  unsigned char flag;
  unsigned char f_type;
  unsigned char f_len;
  HANDLE h2;
  dllname = (char *)&ms_crt_dllnames;
  typedef int (__cdecl * isctypefn_t)(int);
    h2 = LoadLibrary("msvcrt");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcrt");
      }
    }
    h2 = LoadLibrary("msvcr100.dll");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcr100");
      }
    }
    h2 = LoadLibrary("msvcr120.dll");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcr120");
      }
    }
    iscntrl(c);
    isCNTRL_A(c);
    isCNTRL_L1(c);
    ispunct(c);
    isPUNCT_A(c);
    isPUNCT_L1(c);
    isspace(c);
    isSPACE_A(c) ;
    isSPACE_L1(c) ;
    isxdigit(c);
    isXDIGIT_A(c);
    isXDIGIT_L1(c);
    isdigit(c);
    isDIGIT_A(c);
    isDIGIT_L1(c);
    isalpha(c);
    isALPHA_A(c);
    isALPHA_L1(c);
    XSRETURN_IV(c1);
    return;
 

MODULE = CRTSlow		PACKAGE = CRTSlow		


BOOT:
{
    MY_CXT_INIT;
    /* If any of the fields in the my_cxt_t struct need
       to be initialised, do it here.
     */
    QueryPerformanceFrequency(&g_Frequency);
}

Expected behavior

Half joke half serious, but remove UCRT from default build config win perl and link against msvcrt.dll.

Perl configuration

Summary of my perl5 (revision 5 version 41 subversion 7) configuration:
  Derived from: 73172a67eaae5671dffc06b427f005810d151472
  Platform:
    osname=MSWin32
    osvers=6.1.7601
    archname=MSWin32-x64-multi-thread
    uname=''
    config_args='undef'
    hint=recommended
    useposix=true
    d_sigaction=undef
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cl'
    ccflags ='-nologo -GF -W3 -MD -TC -DWIN32 -D_CONSOLE -DNO_STRICT -DWIN64 -D
CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE -D_WINSOCK_DEPRECATED_NO_WA
NINGS -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DWIN32_NO_REG
STRY -DUSE_PERLIO'
    optimize='-O1 -Zi -GL -fp:precise'
    cppflags='-DWIN32'
    ccversion='19.36.32535'
    gccversion=''
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=undef
    longlongsize=8
    d_longdbl=define
    longdblsize=8
    longdblkind=0
    ivtype='__int64'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='__int64'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='link'
    ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:\pb64
lib\CORE" -machine:AMD64 -subsystem:console,"5.02"'
    libpth="C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MS
C\14.36.32532\lib\x64"
    libs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.l
b advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.
ib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.li
 vcruntime.lib ucrt.lib
    perllibs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg
2.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2
32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcr
.lib vcruntime.lib ucrt.lib
    libc=ucrt.lib
    so=dll
    useshrplib=true
    libperl=perl541.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs
    dlext=dll
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c
\pb64\lib\CORE" -machine:AMD64 -subsystem:console,"5.02"'


Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_TIMES
    HAVE_INTERP_INTERN
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_IMPLICIT_SYS
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_NO_REGISTRY
    USE_PERLIO
    USE_PERL_ATOF
    USE_THREAD_SAFE_LOCALE
  Locally applied patches:
    uncommitted-changes
  Built under MSWin32
  Compiled at Dec 20 2024 10:03:46
  %ENV:
    PERL_DOBK="1"
    PERL_DOBP="1"
    PERL_DODB="1"
  @INC:
    C:/pb64/site/lib/MSWin32-x64-multi-thread
    C:/pb64/site/lib
    C:/pb64/lib

@bulk88
Copy link
Contributor Author

bulk88 commented Feb 27, 2025

python/cpython#79376

https://bugs.python.org/issue35195

In 2018 Python identified this problem. Py ticket remains open ATM Feb 2025. IDK enough arch/API/design/tech info to understand all the comments in the cPy tickets if there is a proposed fix or reject fix or unfairly rejected fix in those 2 tickets.

@khwilliamson
Copy link
Contributor

UCRT works; many bugs went away when we converted to use it.

@tonycoz
Copy link
Contributor

tonycoz commented Feb 27, 2025

I'm not so worried about the performance of toupper() here, but there are a few other problems with this code:

  • it's trying to emulate the case-insensitivity of the Win32 UTF-16 environment with localized 8-bit (possibly MBCS) strings
  • it's using locale sensitive functions to do it, so if you change locale the order may become invalid
  • the ::Add function pushes new strings on the end and the does a sort, making it $O(N \log N)$ when it could be $O(N)$ (the inserting would be $O(N)$, finding the right place to put it is $O(\log N)$ )
  • the copy constructor calls ::Add for each entry in the table, making it $O(N^2 \log N)$ when simply allocating a new array and copying over the entries would be $O(N)$.

Fixing all this would eliminate the toupper/isupper() calls, I don't know off-hand what the appropriate Win32 API would be.

@tonycoz tonycoz self-assigned this Feb 28, 2025
@bulk88
Copy link
Contributor Author

bulk88 commented Feb 28, 2025

Forgot to add in the OP.

Since 5.37.10 and commit 8a548d1 msvcrt.dll yes that file, C:\windows\System32\msvcrt.dll is static DLL linked into the perl process virtual addr space 100% of the time, through ws2_32.dll aka winsock static DLL linking against it. I am very dissatisfied at commit 8a548d1 since delay loading winsock was the best milliseconds saving optimization ever done to WinPerl. Especially how EU::MM makefiles have dozens and dozens of very short lived (15-100ms lifetime) perl processes to build a module.

The P5P repo's .t files , and less so CPAN, will call sub runperl() at

perl5/t/test.pl

Line 857 in 16196ae

sub runperl {
10Ks IDK 100Ks of times during a core blead make test. Copy pasted from a GH runner, blead perl has 1.2 million tests.

Files=2910, Tests=1193052, 418 wallclock secs (94.06 usr 10.78 sys + 493.74 cusr 65.30 csys = 663.88 CPU)

100K*4ms= 6.6 minutes faster core make test. 4ms is a wild guess, it will vary machine to machine user to user. Its probably 10 or 20 ms for me and my older equip. I have plans to return ws2_32.dll back to delay loading. My current returning delay loading winsock patch is failing a couple tests and needs some refactor to remove some irrelevant code edits. Its also stalled by another patch I have making user32.dll a delay load, which is not simple, b/c WinPerl's runloop always has monitored the Windows GUI event loop since day 1, and CPAN GUI libs can't be broken by WinPerl's runloop suddenly not dispatching SIGNALS Windows OS GUI messages from user32.dll. But I already protoyped and ran 3 implementations of delay-loaded user32.dll and have 4 more implementations brainstormed in my comments. Did you know 50% of perl.exe'es, in blead perl gmake test exit()/process terminate, before reaching 35/50 milliseconds point of their process run life?

55 millisecond is 1.7 frames at 30 frames per second. Blead perl currently has a 33,000 OP*s executed timer, before the first time it polls the Win32 GUI loop. Its crazy "link av.obj hv.obj perl.obj /delayload:user32.dll -o perl541.dll" really helps with blead perl core self gmake test. No other changes. A PR with just /delayload:user32.dll in it, is inappropriate IMO, b/c technical details I won't go into.

So UCRTbase.dll vs isupper() vs toupper() vs msvcrt.dll. Because msvcrt.dll is static linked into perl since 5.37.10. And my benchmarks in OP, were done with UCRTbase.dll and msvcrt.dll inside the same process, and the perl process already had msvcrt.dll in VA space, I didn't explicitly load it.

The question now is, does WinPerl selectivly replace cherry picked, problematic, slow, libc calls in ucrtbase.dll, with their same name libc calls in msvcrt.dll?

Because perl.exe has the choice of which one to call at runtime, they both are available at all times inside a perl process. The call stacks, profiler reports, and my benchmarks show an ex^^^^ponential multiple orders of magnitude performance difference, between 2 difference implementations, of the same exact C standard lib function.

Next question, why is WinPerl even C linking against MS's isupper()/toupper(), when Perl/P5P implements their own portable isupper()/toupper()? Is that itself a bug? My benchmarks included the Perl portable replacements for the platform vendor/OS/CC provided versions. They exist, they are fast/perfect.

Would slurping/looping U8 values 0x00-0xFF, 1x on process start, through MS UCRT's isupper()/toupper(), and caching the output 0/1 bits or U8s, work?

Nobody can justify enumerating all 250 country codes on earth in a SQL DB/for loop+strcmp(), loopmalloc()ing a red black tree, loop balancing it, loop walking the nodes until match, then loop free(), for each execution of if(isupper(char var_char)) {0;} else {0;} on a 8-bit integer.

You can't upper case an ASCII string, for each 8 bit character, you posting a new job ad on LinkedIn, interview and hiring a new developer and agree on a consulting contract and fee schedule, he reads the ASCII char and writes with a pen, 01000001, and hands you the paper with 01000001 written on it, and you hand him a check for $500, and his employment at you company terminates. He was paid $500 for 15-25 seconds of work. Great company to work for. 5 stars employer. Thats what UCRT is doing internally.

3 rd possible fix, the most difficult fix, which is beyond my expertise, figure out why ucrtbase.dll's cache logic are disabled at runtime inside a perl.exe process. I physically can see my ucrtbase.dll has caching mechanisms/conditional branches and shortcut mechanisms/conditional branches, but for unknown reasons, at runtime, every single time, where an if(){}else{} conditional jmp opportunity exists, the "heavy" no-cache branch of machine code was picked to execute, and not the "fast cache" array/C struct bitfield execution path. I'm not going to single step it again right now, but I counted atleast 3 if/else branch "fast cache, quick return" opportunities inside ucrtbase.dll, that didn't execute, before ucrtbase.dll handed over control to kernel32.dll, and kernel32.dll/kernelbase.dll LCMapStringEx() started searching a SQL DB of country codes, or digging through a memory mapped file of locales and country codes on my disk and inflating that "binary disk file" with variable length intergers, or gzip/huffman encoded vals, 0-indexed offsets, to CPU friendly C structs.

The API docs for LCMapStringEx()@kernel32.dll https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringex#remarks actually spell out, the user should NOT call LCMapStringEx() 1x per char, but pass LCMAP_SORTKEY, and obtain a reference counted (???) sort object from LCMapStringEx() and feed that "sort cache object" into some other API call. ucrtbase.dll is calling LCMapStringEx() 1x per char, which the API docs of LCMapStringEx() specifically say is a bad idea.

So did perl.exe/perl5xx.dll/perl5porters do something wrong and explicitly disable the cache logic inside ucrtbase.dll?

Or this is a bug inside ucrtbase.dll, which only Microsoft can fix, and a member of the public must file a public bug ticket with MS, and MS devs must recompiling and publishing a new higher build number of ucrtbase.dll? Beyond scope for me to diag this. IDK enough.

@bulk88
Copy link
Contributor Author

bulk88 commented Feb 28, 2025

I'm not so worried about the performance of toupper() here, but there are a few other problems with this code:

  • it's trying to emulate the case-insensitivity of the Win32 UTF-16 environment with localized 8-bit (possibly MBCS) strings

Maybe my ucrtbase.dll is going "crazy" because it thinks I'm passing asian MBCS upto 2^16 fake-ascii and it needs to be compliant with MBCS rules, even tho I live in USA, and if I have a 2nd locale or keyboard installed for personal reasons, it will be a locale/keyboard layout somewhere in Continental Europe, but quite far from London or Paris (not Latin-1 but still a normal 0x80-0xFF code page). I have no idea why a MBCS code path would ever trigger on my system, but cPython has the exact same bug I have but its reported and reproduced by other people. Its not unique to me.

PS I've spend 3 days searching ReactOS for what is the limit for U8's per "char" for a "MBCS" code page on a technical MS NLS C API level. I believe MultiByteToWideChar allows 1 MBCS character to be of length I32_MAX or input buffer length in U8, or U8 '\0'. You can do it by using a MBCS continuation mask U8 byte and repeating it forever to the end of the your input byte buffer. A 1 MB long character will become 2 bytes of output, aka UTF16 replacement char L'?'. The other direction, UTF16 to legacy MBCS can't generate such nonsense since UTF16's maximum output to legacy is, 4 bytes of UTF16 generates 4 bytes of Legacy CP or 4 bytes of UTF8 CP. Perhaps with surrogate abuse 4 bytes Wide becomes 6 bytes legacy output.

use Encode;
use Devel::Peek;
my $source = Encode::decode("UTF-16LE","\xF0\xDB\xF0\xDB");
Dump($source);
SV = PV(0x48e1f8) at 0x50a0a0
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2208aa8 "\xEF\xBF\xBD\xEF\xBF\xBD"\0 [UTF8 "\x{fffd}\x{fffd}"]
  CUR = 6
  LEN = 16

BTW I believe RtlAnsiStringToUnicodeString/RtlMultiByteToUnicodeN and NTFS will not put up with MBCS continuation character abuse. And they are NOT identical to MultiByteToWideChar for all inputs. Benchmarking RtlAnsiStringToUnicodeString vs MultiByteToWideChar is on my todo list, ive never seen someone do it before. I strongly RtlAnsiStringToUnicodeString and siblings will be faster since they take alot of shortcuts, such as being incapable of stopping conversion to wide, at formally unassigned by spec characters in legacy 1 byte per 1 char CPs. "incapable of stopping conversion to wide" means of L"?" wchar_ts in output.

char_low = *char_ptr;
if(char_low & MBCS_CONTINUE) {
    char_high = *(char_ptr+1);
    wide = (char_hi <<16) | char_low;
    char_ptr += 2; /* game over */
}
  • it's using locale sensitive functions to do it, so if you change locale the order may become invalid

IDK enough. Maybe this toupper()/isupper() bug has something to do with that newish in Perl many reader single writer locking process global locale inter-OS thread serializing/anti-race code.

Fixing all this would eliminate the toupper/isupper() calls, I don't know off-hand what the appropriate Win32 API would be.

What are Perl in C's mandatory requirement for vendor C std lib toupper()/isupper() ?

https://en.cppreference.com/w/cpp/string/byte/toupper says no 2^32 or 2^64 abuse

If the value of ch is not representable as unsigned char and does not equal [EOF] , the behavior is undefined.

As you and me both agreed on IRC, there is some really poor quality Win32 only code, inside https://github.com/Perl/perl5/blob/blead/win32/perlhost.h that turns the toupper() O(n^2) (?) bug into a O(n^n) (?) bug in 1 rare place in WinPerl not UnixPerl and only if you make a 2nd ithread/use threads.pm; fork(); in WinPerl.

But I'm less concerned about performance of creating ithread # 2 in a WinOS proc, vs perl interp executing this broken slow toupper() from the run loop in very generic, very common, very production, PP code executing through very vanilla cross-platform Perl C code in Perl_pp_some_op();. Image # 2 in the first post, shows a callstack, where toupper() was executed by the interpreter, from very vanilla cross platform posix-ish Perl C code. Not #ifdef WIN32\n #endif code. Vanilla PP code executing toupper() at a high frequency from generic typical PP code scares me alot more. This is report generated from a C symbol file of all callers of these problem UCRT functions in perl5.41.10.dll. 95% of callers are cross platform perl code, half of them are extreme frequency runloop Perl_pp_*() ops.

call    cs:__imp_isalnum S_find_byclass
call    cs:__imp_isalnum S_isFOO_lc
call    cs:__imp_isalnum S_new_ctype
call    cs:__imp_isalnum S_regmatch
call    cs:__imp_isalpha S_isFOO_lc
call    cs:__imp_isalpha S_new_ctype
call    cs:__imp_iscntrl Perl_mem_collxfrm_
call    cs:__imp_iscntrl Perl_pp_fttext
call    cs:__imp_iscntrl Perl_yyerror_pvn
call    cs:__imp_iscntrl S_find_byclass
call    cs:__imp_iscntrl S_isFOO_lc
call    cs:__imp_iscntrl S_new_ctype
call    cs:__imp_iscntrl S_regmatch
call    cs:__imp_iscntrl S_sv_display
call    cs:__imp_isdigit S_isFOO_lc
call    cs:__imp_isdigit S_new_ctype
call    cs:__imp_isgraph S_isFOO_lc
call    cs:__imp_isgraph S_new_ctype
call    cs:__imp_islower S_isFOO_lc
call    cs:__imp_islower S_new_ctype
call    cs:__imp_isprint Perl_pp_fttext
call    cs:__imp_isprint Perl_yyerror_pvn
call    cs:__imp_isprint S_isFOO_lc
call    cs:__imp_isprint S_new_ctype
call    cs:__imp_isprint S_sv_display
call    cs:__imp_ispunct S_find_byclass
call    cs:__imp_ispunct S_isFOO_lc
call    cs:__imp_ispunct S_new_ctype
call    cs:__imp_ispunct S_regmatch
call    cs:__imp_isspace Perl_pp_fttext
call    cs:__imp_isspace Perl_pp_split
call    cs:__imp_isspace S_isFOO_lc
call    cs:__imp_isspace S_new_ctype
call    cs:__imp_isupper S_isFOO_lc
call    cs:__imp_isupper S_new_ctype
call    cs:__imp_isxdigit S_isFOO_lc
call    cs:__imp_isxdigit S_new_ctype
call    cs:__imp_tolower Perl__to_utf8_fold_flags
call    cs:__imp_tolower Perl__to_utf8_lower_flags
call    cs:__imp_tolower Perl_pp_fc
call    cs:__imp_tolower Perl_pp_lc
call    cs:__imp_tolower Perl_pp_ucfirst
call    cs:__imp_tolower S_new_ctype
call    cs:__imp_toupper Perl__to_utf8_title_flags
call    cs:__imp_toupper Perl__to_utf8_upper_flags
call    cs:__imp_toupper Perl_pp_uc
call    cs:__imp_toupper Perl_pp_ucfirst
call    cs:__imp_toupper S_new_ctype
call    cs:__imp_toupper compare
call    cs:__imp_toupper lookup

@bulk88
Copy link
Contributor Author

bulk88 commented Feb 28, 2025

Another idea, on WinPerl, is a codebase wide grep isupper() -> isupper_l() needed and 100% removal of isupper() forever? If you use isupper_l() and pass a locale ptr in arg 2, The horribly inefficient LocaleUpdate::_LocaleUpdate which calls FlsGetValue() atleast 5x or more. Will take a 2 CPU instructions big shortcut and instant return.

9 stack push non-vol reg+test+cond_jmp+sse_mov_ptr_to_reg+sse_mov_reg_to_ptr+abs_jmp+9 stack pop non-vol reg+x86 ret

That branch in LocaleUpdate::_LocaleUpdate is a NOOP compared to what LocaleUpdate::_LocaleUpdate workload/cpu op countwise/runtime overhead wise, does right now inside the perl.exe proc.

If libperl.dll always passes a locale_t as arg 2, that Perl process-wide thread-wide locale settling race bug with WinPerl serializing multi-OS thread access, using a very poor DIY-ed by Perl re-implementation of MS's Slim reader/writer (SRW) API https://learn.microsoft.com/en-us/windows/win32/sync/slim-reader-writer--srw--locks that whole API thing, basically will disappear through macros/etc from WinPerl/libperl.dll, maybe the exported lock variables stay for less than perfect CPAN XS code, but nothing in libperl.dll will ever obtain that serialize lock ever again,

And MS UCRT Devs probably can't even see the LocaleUpdate::_LocaleUpdate method call since its a "POD" void * class member field in their source code and they forgot its a getter setter op overload method and not an machine type, and LocaleUpdate::_LocaleUpdate can't constant fold anything b/c rules of C/C++ (FlsGetValue() is a C symbol, you multi evaled it, inline and constant folding it is illegal in C).

It doesn't matter in 2025, but IIRC ucrtbase.dll on WinXP with VC2013, calls GetProcAddress("FlsGetValue"), GPA returns NULL, then ucrtbase.dll executes TlsGetValue() every single time (5x-6x).

@tonycoz
Copy link
Contributor

tonycoz commented Mar 11, 2025

We (probably @khwilliamson ) could change perl to use _create_locale() and the _is*_l() functions instead of the C standard functions, though this would be yet more Win32 specific code.

Unfortunately _create_locale() doesn't match the behaviour of POSIX newlocale, since you can't modify an existing locale object to mix locales the way you can with newlocale().

To behave the same we'd need to keep separate locale objects for each category, and that has problems for functions that work with more than one locale category (strftime at least), though I believe such mixing is usually a bad idea.

But, even if isupper() is 133x slower, how much of an effect does that have on real code?

It might be worth benchmarking and profiling related code (regexps?) to see.

@xenu
Copy link
Member

xenu commented Mar 11, 2025

I didn't really look into this at all, but if there's a performance regression, perhaps it would make sense to report it to Microsoft?

Technically, UCRT is part of the OS and reporting a bug in Windows requires a support contract (well, there's also the Feedback Hub, but no one reads it...).

However, there's a workaround: UCRT bugs reported on Visual Studio Developer Community are often forwarded to the Windows team.

That said, I imagine performance regressions are low priority for them.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 13, 2025

I didn't really look into this at all, but if there's a performance regression, perhaps it would make sense to report it to Microsoft?

Technically, UCRT is part of the OS and reporting a bug in Windows requires a support contract (well, there's also the Feedback Hub, but no one reads it...).

However, there's a workaround: UCRT bugs reported on Visual Studio Developer Community are often forwarded to the Windows team.

That said, I imagine performance regressions are low priority for them.

Some links

https://learn.microsoft.com/en-us/cpp/c-runtime-library/recommendations-for-choosing-between-functions-and-macros?view=msvc-170
https://learn.microsoft.com/en-us/cpp/c-runtime-library/global-state?view=msvc-170

setlocale is so "unsafe"/"unstable", notice MS sandboxed user mode code from MS user mode code in same addr space with the __o_() feature.

some limited quotes from UCRT headers,

_Check_return_ _CRT_JIT_INTRINSIC _ACRTIMP int __cdecl isupper(_In_ int _C);
_Check_return_ _ACRTIMP int __cdecl _isupper_l(_In_ int _C, _In_opt_ _locale_t _Locale);
#if defined _CRT_DISABLE_PERFCRIT_LOCKS && !defined _DLL && !defined __cplusplus
        #define isupper(c)  (MB_CUR_MAX > 1 ? _isctype(c, _UPPER) : __chvalidchk(c, _UPPER))
#endif
_When_(_Param_(1) == 0, _Post_equal_to_(0))
_Check_return_ _CRT_JIT_INTRINSIC _ACRTIMP int __cdecl toupper(_In_ int _C);
extern "C" extern __inline int (__cdecl _isupper_l)(int const c, _locale_t const locale){
    _LocaleUpdate locale_update(locale);
    return _isupper_l(c, locale_update.GetLocaleT());
}
extern "C" extern __inline int (__cdecl isupper) (int const c) {
    return __acrt_locale_changed() ? (_isupper_l)(c, nullptr) : fast_check(c, _UPPER);
}

There is more questionable MS code in the UCRT .cpp files that is important to look at. MS Devs occasionally put comments saying why they did things the way they did, or what end user hazard they were coding around were. But the UCRT/MSVC compiler has a source available license, not a FOSS license, so I'd rather not copy paste large methods/function bodies into this archival GH ticket. Steve Hay, Tonyc, etc I know all of you have the UCRT src code on your systems just like I do.

Notice extern __inline and _CRT_JIT_INTRINSIC tags, MS devs maybe thing, or in C++ mode on their projects, the perf bug in this ticket is inline/optimized away on C++ code or MS in-house or compliant UWP code, but WinPerl is doing something wrong/different, and WinPerl at CC time hit a extern "C" linker symbol which stop "constant folding".

Also WinPerl use -O1, not -O2 (TODO this really needs to be fixed since 1996, fix is NOT adding -O2 , but a carefully decided '-O1 -Gw -Gy -Oi -Os -Ob******. Carefully picking the -Ob***** is why I haven't written such a PR yet. The wrong '-Ob', or plain '-O2', will inline ALL staticdecl-ed functions in Perl 5 in every.o. Including all the Perl_debug_log` and ice cold panic/debugging/assert C code we don't care about perf wise.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 13, 2025

We (probably @khwilliamson ) could change perl to use _create_locale() and the _is*_l() functions instead of the C standard functions, though this would be yet more Win32 specific code.

Unfortunately _create_locale() doesn't match the behaviour of POSIX newlocale, since you can't modify an existing locale object to mix locales the way you can with newlocale().

Do you have www links to the posix or msdn or p5p GH git repo APIs ?

Follow Sarathy's principles for WinPerl, "POSIX" can't and never will exist inside WinPerl. Either P5P fakes it, or MS CRT fakes it. All of "libc" features/vm state are fictional and only exist inside 1 winperl process address space. There is no interop/ipc/IO communication with other procs using POSIX [tokens] APIs. Either P5P or the CRT, always converted the posix-ese into Kernel32.dll-exe.

I'm wondering if WinPerl do setlocale by calling K32/NLS APIs directly, since UCRT refuses to use its UCRT locale cache (P5P bug!), and keeps calling out to K32 all the time. So if WinPerl does a setlocale with raw MS NLS API/K32 API, it will automatically appear inside ucrtbase.dll state, and perhaps ucrtbase.dll's locale cache WILL WORK if it data from K32/NTDLL state and not end user libc UCRT public API state. K32/NLS/NT kernel/ring 0 support for locales is unusually high in amateur opinion. ntdll.dll has

NtQueryDefaultLocale
NtSetDefaultLocale
RtlLocaleNameToLcid
RtlGetLocaleFileMappingAddress
RtlpQueryDefaultUILanguage
NtQueryDefaultUILanguage
NtQueryInstallUILanguage
RtlGetProcessPreferredUILanguages
RtlGetSystemPreferredUILanguages
RtlGetThreadPreferredUILanguages
RtlGetUserPreferredUILanguages
RtlpVerifyAndCommitUILanguageSettings

So that looks to me like WinOS and the concept of Locales run very deep into Ring 0 in Windows. Basically MS designed it so the text debugging log of a kernel sound card driver, will chance the thousands separator character in a sys admin debug log file, within 1000 microseconds/1 millisecond/1 video frame, after OS wide RtlTheRealSetLocale() executes from user mode.

Maybe UCRT needs compliance with this, but Perl and PP state doesn't. Due to lack of knowledge IDK what Perl is trying to exactly fix in WinPerl, and who the consumer/end user is of the fix. And is the "fix" for actual observed defects in Perl end user production code, or is the 5.36/5.38/5.40 locale safety/serialization code, trying to fix a fuzz testing/spectre race condition khw discovered in Windows?

My generic instinct say, since APIs like toupper() no _l() and setlocale are inter-thread proc wide nasty static buffer APIs like localetime or asctime, MS devs don't perf test or pay src code attention to C functions they aren't allowed to use/commit code with in-house, and automation blocks those unsafe C functions (think SDL policy). I'll guess the only calls to toupper() no _l() and setlocale in MS in-house code are unit test files, that a manager sign off on to whitelist that mention to toupper() b/c its inside a unit test.

Thats the reason I keep thinking touppper_l(obj) is the way out, b/c thats how all MS devs have to write their in house code, so toupper_l(obj) gets benchmarked/regresssion-ed at MS, toupper() and toupper_l(NULL) dont.

To behave the same we'd need to keep separate locale objects for each category, and that has problems for functions that work with more than one locale category (strftime at least), though I believe such mixing is usually a bad idea.
But, even if isupper() is 133x slower, how much of an effect does that have on real code?

I caught this in a profiler to attached to perl.exe when doing various .t'es from perl core make test. And sometimes on personal/cpan PP code making the same slow backtraces and then identified by the profiler. This isn't trying to optimize a die() or croak() branch.

It might be worth benchmarking and profiling related code (regexps?) to see.

I'd have to recomp perl.dll with a bunch of __debugbreak()s and spend alot of time to figure out what PP or CPAN XS code caused crawling slow UCRT toupper() to be used instead of P5P fast native ToUPPER(). But locale APIs I barely know anything about so im least qualified to say what is right/wrong/bug/less than perfect/perfect in that topic.

There might be a P5P bug, that somehow PP or CPAN XS or p5p C, is passing locale "en-US; English; United States; NY; New York; 6 Empire Plaza; Suite 715;" which is "legal"' but a denormal local setting, that needs to be parsed/normalized in a loop O(n) against the NLS/Registry master disk data each time to normalize it to "en-US; English; United States;". IDK enough.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 13, 2025

other things I forgot,

UCRT's _strlwr() calls __acrt_locale_changed()/LocaleUpdate::_LocaleUpdate only ONCE. not O(n) or O(n*15). If perl used that and passed a proper longish char* instead of tolower, problem gone.

2nd, this is a comment from the CRT src code. I can't copy paste the whole thing here, but this comment is really important.

Note that we don't need _LocaleUpdate in this function. The main reason being, that this is a leaf function in locale usage terms.

leaf function in locale usage terms

My reading of that means its P5P's bug and P5P's defect for failing to

LOC_T* lobj = get_globale_OS_proc_locale_obj();
for(i=0;i<end;i++} {
    arr[i]=toupper_l(arr[i], lobj);
}
free_loc_obj(lobj);

or in PerlXS API p5p is doing on a tied() TIESCALAR

SV* sv = ST(0);
for(i=0;i<SvCUR(sv);i++} {
    SvGETMAGIC(sv);
    arr = SvPV_force_nolen(sv);
    arr[i]=  arr[i]  >= 'a' && arr[i]  <= 'z' ? arr[i] - ('a' - 'A') : arr[i] ;
    SvSETMAGIC(sv);
}

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 15, 2025

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale?view=msvc-170

I'd have to benchmark it to prove it, but I have a suspicion all of UCRT's internal cache code regarding parsing locale wide or ascii string names and its ctype table cache, needs configthreadlocale() to be called to manipulate UCRT's private TLS struct and a UCRT C global var called __globallocalestatus, that ultimately controls calling sync_legacy_variables_lk, but IDK what sync_legacy_variables_lk becomes at runtime, since it is inlined away (but not constant folded), and doesn't appear in MS's .pdb file as a string, so it can't be seen in a MSVC call stack. I think sync_legacy_variables_lk optimizes to 4 useless calls in a row to a symbol called __crt_state_management::get_current_state_index() which is a TlsGetValue/FlsGetValue getter. The retval of __crt_state_management::get_current_state_index() which is a pointer (reg RAX) is never checked or used by the 2 callers, which are setlocale and wsetlocale.

If UCRT's internal TLS structs don't "match", the ultimate source of truth for POSIX locales/is_/to_ after all the bloat and very badly written C++ classes, UCRT will be getting the truth for is_/to_ from kernel32's GetStringTypeW https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-getstringtypew#remarks and CT_CTYPE1/C1_UPPER/C1_LOWER, indirectly through GetStringTypeA https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getstringtypea and obviously GetStringTypeA isn't a real function since this is WinNT v6-v10, not Win95, and all the *A functions are autogenerated C++ templates inside kernel32/kernelbase, and rarely autogenerated inside the UCRT (UWP app store prohibits linking to *A symbols or something like that).

Since the concepts of Latin-1/7 bit ASCII/mbcs/8 bit characters, don't exist on Windows OS, except through the opaque casting function called MultiByteToWideChar, the whitespace/upper/lower/digit status of a IEEE 754 double 7 bit ASCII character can't be determined except by casting the IEEE 754 double 7 bit ASCII character to a Unicode character with MultiByteToWideChar.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 27, 2025

similar src code problem in mingw 8.3.0/Strawberry 5.32's bundled gcc

https://github.com/mingw-w64/mingw-w64/blob/90da6c6535c503159e952dc8b44f8778bed6f622/mingw-w64-crt/misc/mbrtowc.c#L4

Reading Mingw's "ASCII to Wide" code, Im seeing the same O(n^2)-ness that UCRT is doing.

Implementing each byte of the for loop as a dozen calls into kernel32.dll which is a dozen TLS calls best cast, worst case 100s of TLS calls and 100s of files that need to be decompressed and parsed.

IsDBCSLeadByteEx() calls NlsValidateLocale() https://github.com/wine-mirror/wine/blob/8d40da7ffda5e8dde9200f733e7d2cebf0196bc3/dlls/kernelbase/locale.c#L723 and NlsValidateLocale() sets off a chain of decompressing and parsing a couple 100/1000 locales on disk/on a mmap

https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f

100s/1000s of permutations in there

update:

the IsDBCSLeadByteEx() byte by byte, algorithm seems to be copy paste from SO https://stackoverflow.com/a/27196334 and is probably repeated over and over by college students or young on certain unreliable tech social media sites, known for anonymous users, trying to ChatGPT their way to the next account ribbon/flair.

@bulk88
Copy link
Contributor Author

bulk88 commented Mar 27, 2025

python/cpython#79376

https://bugs.python.org/issue35195

In 2018 Python identified this problem. Py ticket remains open ATM Feb 2025. IDK enough arch/API/design/tech info to understand all the comments in the cPy tickets if there is a proposed fix or reject fix or unfairly rejected fix in those 2 tickets.

maybe related to this ticket

https://bugs.python.org/issue7442

https://bugs.python.org/issue31900

https://vstinner.github.io/python3-locales-encodings.html general tech prose, probably not related to whatever im benchmarking

@bulk88
Copy link
Contributor Author

bulk88 commented Apr 28, 2025

More complaints about this particular WinPerl bug and MS's any-age CRT .dll file complaints

https://stackoverflow.com/questions/36686381/windows-c-runtime-toupper-slow-when-locale-set

I am starting to believe this WinPerl bug is "unfixable" on Windows OS.

After another attempt to read UCRT's src code and understand it, and seeing multiple other C/C++ FOSS project in the 2010s,"end user" visible deprecated and ripped out, 99% to 100% of runtime executes of MS's setlocale()/wsetlocale() functions, as a stupid quick and easy fix/closure of their this FOSS app+MS CRT DLL+WinOS+setlocale() == horrible bug tickets.

I think I found a src code comment in the UCRT, where a MS dev intentionally "temporarily" [forever] commented out the C/C++ method call, to push/validate/resync the new ISO C ASCII locale string, into the UCRT's locale cache logic C++ object/class instance. I don't think MS will ever uncomment that source code line, and that C++ method/linker C++ symbol is optimized out in an actual ucrtbase.dll file. That symbol's ascii string name isn't inside the ucrtbase.pdb file either, which proves the method call is optimized out and doesn't exist in ucrtbase.dll.

More specifically, the UCRT has a non exported global var called long __acrt_locale_changed_data. It is a .data var CC time initialized to value FALSE. wsetlocale() does an InterlockedExchange(TRUE) on that global on entry to wsetlocale(). I think, but I can't prove this, that branch with InterX(TRUE) is unconditional on all branches/control flows inside wsetlocale(). That InterX(TRUE) CPU op, can't be avoided or prevented from executing.

I'm going to exclude thinking about any 100% certified-mentally-insane C code implementing "MS C89" object/class destructor methods here that execute as part of a C89 SEH catch frame swallowing and throwing a SEGV/SIGILL event object. Even if there is somewhere in ucrtbase.dll's machine code, a "InterX(FALSE);" statement in a SEGV SEH catch frame aka C++ "smart_ptr", that CPU op code in ucrtbase,dll is unreachable from WinPerl's viewpoint.

More specifically,b/c wsetlocale() will always execute wsetlocale_nolock(), unless wsetlocale_nolock() throws a SEGV, wsetlocale() unconditionally executes InterX(TRUE) after wsetlocale_nolock(). There is no way to reverse or write to that private C glovar variable living inside the UCRT, without restarting the WinOS Perl process, or libperl.dll, doing an equally 100% crazy and insane 100% public API FreeLibrary()/LoadLibraryEx() cycle on disk file ucrtbase,dll's virtual address space. Unless the user mode process does a "InterX((long*)& __acrt_locale_changed_data, FALSE);` statement, UCRT's high speed locale cache logic layer will never execute again, for the rest of that proc's lifetime.

That src code comment, and that commented out method that was supposed to do InterX((long*)& __acrt_locale_changed_data, FALSE);, can only be found in the UCRT src code from VC 2013 or VC 2015 era. All traces of it have disappeared inside the UCRT src code included with my "VC 2022 build number=right-now-Apr-2025".

If MS devs "temporarily" commented out that method, in what I think is UCRT from VC 2015/2013 era, and src code traces of that method vanished by VC 2022 era. MS staff devs will never reverse their decision. It wasn't an accident by a random MS staff member. It was a slow, very careful, very intentional decision, done by atleast 2 different people, separated by atleast 12 months, or >= 1 years time.

IDK exactly why MS devs disabled, and deleted that code that was supposed execute a InterX((long*)& __acrt_locale_changed_data, FALSE); to reenable the UCRT locale lagic. The very src code comment that disabled that method, says something like

// Performance doesn't matter here,
// Windows OS GUI apps can't call wsetlocale().
// Reenable this line when the ISO C TC stops smoking hard drugs and writes down WTF setlocale() is actually is supposed to do.

The last line remark, explains the problem. Very slowly read https://en.cppreference.com/w/c/locale/setlocale and https://en.cppreference.com/w/c/locale/LC_categories or the actual C23 .pdf spec. The ISO TC wrote down, and officially spec-ed the C grammer token setlocale() to have the the exactly same runtime behavior that happens when a SIGSEGV event is executed :-)

The ISO TC never documented arg number # 2 in LibC's Public API setlocale() func call does

TLDR Parody: "Did you know all 3 compilers, Clang, GCC, and MSVC will const-fold away ISO C's definition of setlocale() if you turn on LTO?"

Long version:

Please write a letter to the UN Recognized Ambassador of your country requesting that he or she,
at the next UN General Assembly meet, take off their shoe and throw it another Diplomat's head
on camera to assert your nations sovereignty.

or

The position of the ISO C TC is that we are not the UN. We will never write down a
list of countrys for arg 2 in the prototype of setlocale(). Never happening,
Remember that if you are proposing to document argument # 2. You have to
write the unit tests. This TC must accommodate all C developers, including
including C developers from North Korea, Iran, China, Islamic State of Levant,
Cuba, Libya, Balochistan, Saudi Arabia, South Sudan and Somalia.
I hope this committee has not discouraged you from future collaboration
regarding improving the C language specification :-)

@bulk88 wrote ^^^^ because setlocale() can't be documented. Even Microsoft
devs realized it is impossible to write down what it does, they actually tried.
See their attempt at https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f

Regarding WinPerl's choices, if MS UCRT's setlocale()/wsetlocale() fn call, is if(("as-designed" || "not-a-bug") && "100%_broken") {}. Then WinPerl need to do what other FOSS and GPL licensed SW projects did with MS/WinOSs/VCs CRT or UCRT lib

  • choice # 1 "end user" visible, do a very short deprecation timeframe, and ripped out linking against C symbol setlocale from ucrtbase,dll, MS Windows users have to re-read our manpage/README/CHANGES file to find out what our new .tar.gz will now do on their Windows PC.

  • choice # 2 only execute C func call setlocale()/wsetlocale() on a 1 millisecond, 30 milliseconds, maybe 60 millisecond timer, to protect against that unfixable multi-threaded race condition design bug. This FOSS WILL NOT execute the extern C func setlocale() every time U8 base10 integer value 37 (also known as ( (U7) '%' )) is loaded into a CPU register. Changing the ISO C locale 17 TIMES!!!! from a OS thread, # 2 while OS thread # 1 is executing somewhere deep inside snprintf(), while snprintf() is rendering string "8.1878.18783387417606e-10" on the end user's OS thread # 1. This is not a defect. It is invalid code or invalid behavior. Status: Rejected - User Error.

Like really? you want to change the STDOUT character encoding or your C locale state, 17 times while your/our FOSS software rasterizes 8 byte value "8.1878.18783387417606e-10" into a RGB pixel grid? Sorry, you only a fresh locale variable every 1 millisecond or 60 milliseconds between each polling call to Microsoft's setlocale() function on Windows. If you don't like it, write your own C code implementation of snprintf() that understand 11-bits long unsigned chars, with dynamic AI intelligence between your design requirement, to support 6 different Traditional Chinese locales plus the Babylonian locale and the Klingon locale, while a C lang IEEE 754 float of type long double on your screen.

I (@bulk88) has seen on SO or Reddit or GH someone with a legit design reason to do ^^^^ and they used mentioned historic Chinese and the word GUI. This is a legit request to a legit problem. It is a university/academic/historic dictionary SW project.

Yes, Historical DATE STAMPED LOCALES make perfect sense to me. But that requirement sounds 2^48 API layers higher than what snprintf() and C will give you for free. That user needs to write their own GUI/Unicode/Graphical Font Rendering C code. That fantasy C library they want to download, doesn't exist pre-written by someone else for that SO/Reddit user's hyper-narrow hyper-specialized C code they need to write for their job at Building 5/42 of the Department of Anthropology at the National University of Camelia.

So unless Khw/Tony/etc have a different idea, I can't think of another answer for WinPerl, beyond

Choice 1 reimplement setlocale() P5P-style (click here to learn more -> https://github.com/Perl/perl5/blob/blead/win32/win32.c )

Choice 2 remove 100% of calls to MS's UCRT's setlocale() and do something else. The word "interop" is N/A, we just don't care what "C" does, This is the Purl 5.0 .git repo on GitHub, C is a .pdf file, we don't care what is inside that .pdf file, Read Purl's manpage or go away. How did you find this FOSS project's (/bin/purl's) bug tracker page on github.com if you couldn't find our README file on github.com?

Choice 3 do a compromise deal with MS's UCRT's setlocale() and implement with a polling timer between set between 1-30 millisecond. Zero chance of breaking 3rd party legit production GUI apps. No multithreaded legit TUI/GUI process can draw > 30 frames per second anyway. How can a human even see Unicode/UTF-8 text written at >= 30 FPS to a shell console terminal window? Windows shell or Linux shell, same question!

WinPerl and this WinPerl Bug, I don't think Choice 3 is correct and don't think it applies here, for WinPerl's problem, If I understand what that P_something_SQL_something library was fixing in C code.

Choice 3 was done by a project maintainer from the P_something_SQL_something project on GH, and he closed the GH bug claiming 5 vsnprintf()s per millisecond to 300 vsnprintf()s per millisecond is fast enough, and benchmarks within +/- 100 vsnprintf()s per ms, of his Linux vsnprintf() benchmarks. That was good enough to fix the complaint. "If our Linux binary hangs on a Linux rack server, and then rebooting to Windows OS on the SAME HARDWARE and hanging the rack server again with our Windows binary, that is a totally diff bug ticket and will be fixed from the Linux side, Im closing this Win32-only bug ticket."

Someone's shell/console running @ 30 FPS refers to this https://en.wikipedia.org/wiki/ANSI_escape_code

I couldn't figure out what that Win32-only polling timer fix in the P_something_SQL_something library was exactly fixing. I think that C library wanted to READ AND PARSE the ASCII string return by MS's UCRT's setlocale(), because they have their own DIY-ed vsnprintf() implementation, and they need correctly format their ASCII date stamps in their tracing log files.

Correct if wrong, WinPerl/libperl.so attempting to READ AND PARSE the ASCII string returned from Linux's/AnyOS's/MS's UCRT's setlocale() is undefined behavior, unportable, illogical and out of scope for Perl/P5P to do. Perl is unaware of C's "locale feature" according to https://perldoc.perl.org/functions/sprintf . C's "locale feature" simply doesn't exist.

The Perl Interp is proud to be born in California, United States of America. Is "locale" a word from French or Italian? Perl only speaks English. The Strasse street signs are right back up in Paris because dealing with uneducated Parsians that cant read or write is Too Strassful.

Perl lang/grammer/interp core/PP is documented in https://perldoc.perl.org/functions/sprintf as locale unaware, Reading sv.c and reading sv_vcatpvfn_flags() S_F0convert() sv_2num(). Those 3 have zero src code that is aware of locales (*white lies, sv.c contracted the Locale disease around >= 5.38-ish ). Blead perl's Perl_isinfnan() still assumes C lang's "NaN feature of type double" is implemented by OS Vendor's LibC library using its internal secret TLS HTTP AJAX API key and a Wifi connection to talk back to IEEE's Float Point rack servers. The IEEE's 754 App from Google Play costs $1.99 a month to use. Get a 30 free trial of 754 if you subscribe to C's Patron channel this week only. Limited Time Offer!!!! Act now before its too late!!!!

Not a jk, blead/stable WinPerl Perl_isinfnan() executes UCRT's dclass() symbol, b/c the end user is still deciding if to buy a Motorola or buy a Intel branded upgrade card for their i386 PC, because Duke Nukem is too slow to play without buying a "floating upgrade package" the salesman at the PC Store tried to hustle them with 6 months ago.

I am going to pretend JHI never introduce x86 assembly language into Perl in 5.23.6 in year 2015, example at 572cd85

But IDK I keep thinking dmerphq authored all the code that looks like

#if defined(NAN_COMPARE_BROKEN)
            LIKELY(!Perl_isinfnan(was))
#else
            LIKELY(!Perl_isinf(was))
#endif

in core, and he did that somewhere late 5.1X era, not in 5.2X era, and I'm not going git blame anymore, why did LibPerl very recently learn to speak Assembler when speaking Assembler is taboo. ISO C has never defined the bit patterns that are created after you call float class object pointer's or double classed object pointer's memcpy() method to serialize a double class or float class object pointer between 2 different processes or 2 virtual address spaces or pass a double object to the stdio class.

Real life/Production Perl problems, the project maintainers/authors of GCC/Clang/MSVC, can't decide if( !! (Inf-Inf)) {} will SEGV or constant fold. This problem is a political problem, not a software problem. 2 people, both have GCC commit bit, both are GCC project manager, can't agree what that C code string will do.

The ISO C TC very carefully engineering the C spec, that C grammer tokens float/double are plug-n-play abstract base classes, not implemented by C. Please contact your Operating System Vendor to mail you a copy of https://man7.org/linux/man-pages/man3/dlsym.3.html if you still have any technical questions about C grammer tokens float/double. <<<--- probably hasn't mattered and is irrelevant for the Perl VM since Perl 5 VM dropped the Cray CPU arch and Cray's ones complement electronics. I think Cray was the last C compiler/OS kernel/CPU chip, that Perl executed on, that REQUIRED float/double to be runtime swapped as if the end user is writing dlsym() dlopen() in their C code. This "C code" problem will never affect Perl again. This "C code problem" in 2010s/2020s only affects the authors of Clang/GCC/LLVM/etc.

GCC devs and LLVM devs have multiple times, created their own bugs in both C compilers, because ARM LLC (United Kingdom) has NEVER DOCUMENTED the CPU opcodes on how to do floating pointing math on the ARM CPU. ARM LLC if they feel generous will email 3-5 .pdf files if you ask that question, but that isn't exactly true.

Linux Kernel devs have reverse engineered 57 different ABI ARM ISA .pdf files so far, see https://github.com/torvalds/linux/tree/master/arch/arm

GCC/LLVM bug trackers have a class of head banging macro soup bugs, because ARM is the 1 and only "CPU arch" inside Clang/LLVM, with more than 1 hardware FP math instruction under 1 #ifdef HAS_ARM_WITH_UNEMULATED_FLOATS. There are 3-5 .pdfs. NEON.pdf is 1 of 3 .pdf files that GCC/LLVM ppl must support for "hardware-not-software C lang types float/double". ARM went bankrupt in 1992. DEC got the ARM intellectual property rights for free in 1994 because of an idiotic bankruptcy trustee/law firm where a paralegal thought the BSD License from their other Client was the ARM LLC Merger Contract.

So ARM LLC only makes $$$ selling website content editing services since 1994. when ARM LLC accidentally open sourced its own CPU, and later on, Samsung/Google/Apple pirated Arm's CPU, and then Sam/Goog/Apple decided to donate charity hush $$$ to ARM LLC after stealing the ARM src code, and accepting that hush $$$ is a very good deal , compared to ARM LLC litigating vs Samsung/Apple.

It didn't work for https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc. or https://en.wikipedia.org/wiki/SCO_Group,_Inc._v._International_Business_Machines_Corp. Why is attempt # 3 doing the same thing, but now Arm LLC Vs Google/Apple/Samsung/Amazon LLC going to work? Even Microsoft decided to play Pirates-n-Warez and won't read ARM's DMCA notices anymore, example at https://github.com/chsasank/ARM7

Ubuntu and Debian's software engineers payed an electrical engineer to reverse engineer the Arm CPU chip on a Samsung Galaxy or Google Nexus/Pixel phone, or just used a large rock to smash a Pixel/Galaxy phone until they learn which ARM CPU ISA .pdf to read before editing Linux Makefile that builds the Ubuntu for ARM32 .iso file. See https://docs.kernel.org/arch/arm/index.html#soc-specific-documents and all 57 ARM ISAs at https://github.com/torvalds/linux/tree/master/arch/arm . Canonical LLC will give exactly 1 C compile attempt, to select the correct ARM CPU identifier, as a interview question to a potential C developer. Just 1 try. There are 57 macros to pick. There is no 2nd attempty since that person is unsuitable for the position :-)

If a ARM C developer can't do their job, perhaps they should rewrite their resume to say "x86 C dev", Intel hasn't edited their 1 and only ISA .pdf file for more than 29-31 years, see https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html vs https://pdos.csail.mit.edu/archive/6.097/readings/intelv2.pdf

The last edit done by Intel on their x86 platform was in 1996, with Pentium Pro adding CMOV instruction. Nothing has changed between 1996 and 2025 for writing executable machine code for x86-32 on A5 paper with a ink pen. SMID/SSE/Vector/AVX, AVX512/bit product opcodes, aren't part of Intel's x86 ISA.. The feature SSE/Vector/AVX, AVX512, always are prefixed with opcode 0xF0 which was documented a "undefined behavior CPU opcode, reserved for end user to purchase a non-intel 3rd party CPUs and use opcode 0xf0 prefix to send string of random hex bytes from their Official Intel CPU, to a random non-intel 3rd mfged CPU chip plugged into the 2nd CPU socket. " This feature in real life is where all SSE/Vector/AVX, AVX512, live, behind a single x86 0xf0 opcode. The rest is UB to discuss where.

Until KHW decided what he wants to choices suggest, or I see some POSS project fix for MS setlocate() vs OS threads vs 1 address space vs MS's "we wrote MSVC 2022's UCRTbase.dll to strictly confirm with ISO C23 document. and unit tests to prove MSVC;s implementation of setlocale() passes all unitests published by ISO C23 organization If you dont like setlocale() is so badly speced out so useless it sometimes can get constant folded away. petition the ISO C TC to amend the spec it is liike a production code, and not a laywer written document announce lMS has obtain judicial legal immunity from all tort claims filed by citizens wanting compentation because UCRT's func call setlocale() cause the plantiff permanent pain and suffering, health costs, lost time from their family, and large financial damages at their job/empoyer.

Either WinPerl does the 30 millisecond polling timer trick, or converts to 100% clean usage of Arg 2/3 of type locale_t * every, inside the Perl C interp backeds. Or Perl turns on Ms UCRT's official non-standard brand new "thread localized global vars, which used to dangerous memory unsafe process global race condition vars." Or KHW wrote the code totally wrong from day 1 because

setlocale(LC_ALL,"C");, setlocale(LC_ALL,""); and setlocale(LC_ALL, setlocale(LC_ALL, NULL));, all 3 forever disabled the UCRT's locale cache logic. My suggestion, why isn't WinPerl doing

char * old = setlocale(LC_ALL, NULL);
sleep(1000); 
for(i=0 i< 30000;i++) {
   Perl_runops_standard(PL_op);
}
char * new = setlocale(LC_ALL, NULL);
if(strcmp(old, new) != 0) {
    setlocale(LC_ALL, old); // very sad and very tragic branch if this branch actually executes in production,
    //the WinPerl Process will now have the 133x slow down for rest of price life after this statement
}

@bulk88
Copy link
Contributor Author

bulk88 commented May 6, 2025

#21611 is probably the same UCRT bug I am describing here. There is analysis/workup in there that comes to the same conclusion as I did above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants