Open Bug 1932875 (icu_normalizer) Opened 11 months ago Updated 2 months ago

Use ICU4X for str_normalize

Categories

(Core :: JavaScript: Standard Library, enhancement, P3)

enhancement

Tracking

()

ASSIGNED

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file)

Migrate the str_normalize built-in to ICU4X. (In order to be able to drop the ICU4C normalizer from the build eventually.)

Actually landing this has to wait until we have ICU4X 2.0 in m-c.

Blocks: js-lang
Severity: -- → N/A
Priority: -- → P3
Alias: icu_normalizer

The perf results here are disappointing and not aligned with out-of-Gecko ICU4C vs. ICU4X benchmarking. This needs performance investigation and optimization.

(In reply to Henri Sivonen (:hsivonen) from comment #5)

This needs performance investigation and optimization.

In the case of already-NFC Polish to NFC, both ICU4C and ICU4X should end up iterating over UTF-16 code units by advancing a pointer and comparing every code unit to be less than U+0300, which is read from data. Since this case shows ICU4X as slower and not by a constant indicating setup overhead but by a factor that can get worse with longer input, it's a sign that something is going very, very wrong. Outside of Gecko, the case that takes the less-than-U+0300 fast path has been as fast between ICU4C and ICU4X when both are writing to a buffer. Here, the case is when both loop without writing when skipping the already-normalized prefix, which ends up being the whole input.

This does not appear to be an opt_level=2 or PGO artifact.

It seems that rustc makes iteration by pointer one instruction longer and puts the magic value in stack memory instead of a register.

ICU4C fast path:

   0x00000000041aa340 <+208>:   add    $0x2,%r13
   0x00000000041aa344 <+212>:   mov    %r13,%r14
   0x00000000041aa347 <+215>:   mov    %r14,%r13
   0x00000000041aa34a <+218>:   cmp    %r12,%r14
   0x00000000041aa34d <+221>:   je     0x41aa72f <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+1215>
   0x00000000041aa353 <+227>:   movzwl 0x0(%r13),%eax
   0x00000000041aa358 <+232>:   cmp    %r9w,%ax
   0x00000000041aa35c <+236>:   jb     0x41aa340 <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+208>

ICU4X fast path:

   0x000000000b72b1c0 <+768>:   mov    %r15,%rcx
   0x000000000b72b1c3 <+771>:   xor    %eax,%eax
   0x000000000b72b1c5 <+773>:   cmp    %r13,%r15
   0x000000000b72b1c8 <+776>:   setne  %al
   0x000000000b72b1cb <+779>:   lea    (%r15,%rax,2),%r15
   0x000000000b72b1cf <+783>:   je     0xb72bd8a <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+3786>
   0x000000000b72b1d5 <+789>:   movzwl (%rcx),%r14d
   0x000000000b72b1d9 <+793>:   cmp    %r14d,-0x44(%rbp)
   0x000000000b72b1dd <+797>:   ja     0xb72b1c0 <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+768>

(Switching from opt_level=2 to opt_level=3 does not fix the perf.)

lldb output on aarch64 Mac is harder to follow than gdb output on x86_64 Linux, but it seems the corresponding parts on Apple Silicon are:

ICU4C:

XUL[0x718d3c] <+224>:  add    x0, x23, #0x2
XUL[0x718d40] <+228>:  mov    x23, x0
XUL[0x718d44] <+232>:  cmp    x0, x22
XUL[0x718d48] <+236>:  b.eq   0x7191f8                  ; <+1436> at normalizer2impl.cpp:1844:1
XUL[0x718d4c] <+240>:  ldrh   w10, [x23]
XUL[0x718d50] <+244>:  cmp    w10, w25
XUL[0x718d54] <+248>:  b.lo   0x718d3c                  ; <+224> at normalizer2impl.cpp:1753:17

ICU4X:

XUL[0x69616cc] <+548>:  mov    x8, x28
XUL[0x69616d0] <+552>:  cmp    x28, x22
XUL[0x69616d4] <+556>:  cset   w9, ne
XUL[0x69616d8] <+560>:  add    x28, x28, w9, uxtw #1
XUL[0x69616dc] <+564>:  b.eq   0x6961fc4                 ; <+2844> at lib.rs
XUL[0x69616e0] <+568>:  ldrh   w23, [x8]
XUL[0x69616e4] <+572>:  cmp    w12, w23
XUL[0x69616e8] <+576>:  b.hi   0x69616cc                 ; <+548> [inlined] core::slice::iter::Iter$LT$T$GT$::new::h9583c40a83f784cf at iter.rs:104:13

AFAICT, the pass-through bound ends up in a register in both cases. However, the pointer advancement seems more complicated in the ICU4X case, including advancing by register addition instead of adding an immediate, which seems strange to me.

(In reply to Henri Sivonen (:hsivonen) from comment #10)

However, the pointer advancement seems more complicated in the ICU4X case, including advancing by register addition instead of adding an immediate, which seems strange to me.

I see the compiler tries to be clever and unconditionally execute the addition but it materializes 1 or 0 in a register and the register is shifted left by one to get 2 or 0, so that the pointer doesn't actually advance when about to exit the loop.

Either this trickery is bad for perf, or there are some other effects like speculating too far into the code below this bit that makes the perf bad.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: