Use ICU4X for str_normalize
Categories
(Core :: JavaScript: Standard Library, enhancement, P3)
Tracking
()
People
(Reporter: hsivonen, Assigned: hsivonen)
References
(Blocks 2 open bugs)
Details
Attachments
(1 file)
Migrate the str_normalize
built-in to ICU4X. (In order to be able to drop the ICU4C normalizer from the build eventually.)
Assignee | ||
Comment 1•11 months ago
|
||
Actually landing this has to wait until we have ICU4X 2.0 in m-c.
Assignee | ||
Comment 2•11 months ago
|
||
Assignee | ||
Comment 3•11 months ago
|
||
Updated•11 months ago
|
Assignee | ||
Updated•10 months ago
|
Assignee | ||
Comment 4•3 months ago
|
||
Assignee | ||
Comment 5•3 months ago
|
||
The perf results here are disappointing and not aligned with out-of-Gecko ICU4C vs. ICU4X benchmarking. This needs performance investigation and optimization.
Assignee | ||
Comment 6•2 months ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #5)
This needs performance investigation and optimization.
In the case of already-NFC Polish to NFC, both ICU4C and ICU4X should end up iterating over UTF-16 code units by advancing a pointer and comparing every code unit to be less than U+0300, which is read from data. Since this case shows ICU4X as slower and not by a constant indicating setup overhead but by a factor that can get worse with longer input, it's a sign that something is going very, very wrong. Outside of Gecko, the case that takes the less-than-U+0300 fast path has been as fast between ICU4C and ICU4X when both are writing to a buffer. Here, the case is when both loop without writing when skipping the already-normalized prefix, which ends up being the whole input.
Assignee | ||
Comment 7•2 months ago
|
||
This does not appear to be an opt_level=2 or PGO artifact.
Assignee | ||
Comment 8•2 months ago
|
||
It seems that rustc makes iteration by pointer one instruction longer and puts the magic value in stack memory instead of a register.
ICU4C fast path:
0x00000000041aa340 <+208>: add $0x2,%r13
0x00000000041aa344 <+212>: mov %r13,%r14
0x00000000041aa347 <+215>: mov %r14,%r13
0x00000000041aa34a <+218>: cmp %r12,%r14
0x00000000041aa34d <+221>: je 0x41aa72f <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+1215>
0x00000000041aa353 <+227>: movzwl 0x0(%r13),%eax
0x00000000041aa358 <+232>: cmp %r9w,%ax
0x00000000041aa35c <+236>: jb 0x41aa340 <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+208>
ICU4X fast path:
0x000000000b72b1c0 <+768>: mov %r15,%rcx
0x000000000b72b1c3 <+771>: xor %eax,%eax
0x000000000b72b1c5 <+773>: cmp %r13,%r15
0x000000000b72b1c8 <+776>: setne %al
0x000000000b72b1cb <+779>: lea (%r15,%rax,2),%r15
0x000000000b72b1cf <+783>: je 0xb72bd8a <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+3786>
0x000000000b72b1d5 <+789>: movzwl (%rcx),%r14d
0x000000000b72b1d9 <+793>: cmp %r14d,-0x44(%rbp)
0x000000000b72b1dd <+797>: ja 0xb72b1c0 <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+768>
Assignee | ||
Comment 9•2 months ago
|
||
(Switching from opt_level=2
to opt_level=3
does not fix the perf.)
Assignee | ||
Comment 10•2 months ago
|
||
lldb output on aarch64 Mac is harder to follow than gdb output on x86_64 Linux, but it seems the corresponding parts on Apple Silicon are:
ICU4C:
XUL[0x718d3c] <+224>: add x0, x23, #0x2
XUL[0x718d40] <+228>: mov x23, x0
XUL[0x718d44] <+232>: cmp x0, x22
XUL[0x718d48] <+236>: b.eq 0x7191f8 ; <+1436> at normalizer2impl.cpp:1844:1
XUL[0x718d4c] <+240>: ldrh w10, [x23]
XUL[0x718d50] <+244>: cmp w10, w25
XUL[0x718d54] <+248>: b.lo 0x718d3c ; <+224> at normalizer2impl.cpp:1753:17
ICU4X:
XUL[0x69616cc] <+548>: mov x8, x28
XUL[0x69616d0] <+552>: cmp x28, x22
XUL[0x69616d4] <+556>: cset w9, ne
XUL[0x69616d8] <+560>: add x28, x28, w9, uxtw #1
XUL[0x69616dc] <+564>: b.eq 0x6961fc4 ; <+2844> at lib.rs
XUL[0x69616e0] <+568>: ldrh w23, [x8]
XUL[0x69616e4] <+572>: cmp w12, w23
XUL[0x69616e8] <+576>: b.hi 0x69616cc ; <+548> [inlined] core::slice::iter::Iter$LT$T$GT$::new::h9583c40a83f784cf at iter.rs:104:13
AFAICT, the pass-through bound ends up in a register in both cases. However, the pointer advancement seems more complicated in the ICU4X case, including advancing by register addition instead of adding an immediate, which seems strange to me.
Assignee | ||
Comment 11•2 months ago
•
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #10)
However, the pointer advancement seems more complicated in the ICU4X case, including advancing by register addition instead of adding an immediate, which seems strange to me.
I see the compiler tries to be clever and unconditionally execute the addition but it materializes 1 or 0 in a register and the register is shifted left by one to get 2 or 0, so that the pointer doesn't actually advance when about to exit the loop.
Either this trickery is bad for perf, or there are some other effects like speculating too far into the code below this bit that makes the perf bad.
Assignee | ||
Comment 12•2 months ago
|
||
Assignee | ||
Comment 13•2 months ago
|
||
Assignee | ||
Updated•2 months ago
|
Description
•