The perf results here are disappointing and not aligned with out-of-Gecko ICU4C vs. ICU4X benchmarking. This needs performance investigation and optimization.

Henri Sivonen (:hsivonen)

Assignee

Comment 6

•

2 months ago

(In reply to Henri Sivonen (:hsivonen) from comment #5)

This needs performance investigation and optimization.

In the case of already-NFC Polish to NFC, both ICU4C and ICU4X should end up iterating over UTF-16 code units by advancing a pointer and comparing every code unit to be less than U+0300, which is read from data. Since this case shows ICU4X as slower and not by a constant indicating setup overhead but by a factor that can get worse with longer input, it's a sign that something is going very, very wrong. Outside of Gecko, the case that takes the less-than-U+0300 fast path has been as fast between ICU4C and ICU4X when both are writing to a buffer. Here, the case is when both loop without writing when skipping the already-normalized prefix, which ends up being the whole input.

Henri Sivonen (:hsivonen)

Assignee

Comment 7

•

2 months ago

This does not appear to be an opt_level=2 or PGO artifact.

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

2 months ago

It seems that rustc makes iteration by pointer one instruction longer and puts the magic value in stack memory instead of a register.

ICU4C fast path:

   0x00000000041aa340 <+208>:   add    $0x2,%r13
   0x00000000041aa344 <+212>:   mov    %r13,%r14
   0x00000000041aa347 <+215>:   mov    %r14,%r13
   0x00000000041aa34a <+218>:   cmp    %r12,%r14
   0x00000000041aa34d <+221>:   je     0x41aa72f <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+1215>
   0x00000000041aa353 <+227>:   movzwl 0x0(%r13),%eax
   0x00000000041aa358 <+232>:   cmp    %r9w,%ax
   0x00000000041aa35c <+236>:   jb     0x41aa340 <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+208>

ICU4X fast path:

   0x000000000b72b1c0 <+768>:   mov    %r15,%rcx
   0x000000000b72b1c3 <+771>:   xor    %eax,%eax
   0x000000000b72b1c5 <+773>:   cmp    %r13,%r15
   0x000000000b72b1c8 <+776>:   setne  %al
   0x000000000b72b1cb <+779>:   lea    (%r15,%rax,2),%r15
   0x000000000b72b1cf <+783>:   je     0xb72bd8a <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+3786>
   0x000000000b72b1d5 <+789>:   movzwl (%rcx),%r14d
   0x000000000b72b1d9 <+793>:   cmp    %r14d,-0x44(%rbp)
   0x000000000b72b1dd <+797>:   ja     0xb72b1c0 <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+768>

Henri Sivonen (:hsivonen)

Assignee

Comment 9

•

2 months ago

(Switching from opt_level=2 to opt_level=3 does not fix the perf.)

Henri Sivonen (:hsivonen)

Assignee

Comment 10

•

2 months ago

lldb output on aarch64 Mac is harder to follow than gdb output on x86_64 Linux, but it seems the corresponding parts on Apple Silicon are:

ICU4C:

XUL[0x718d3c] <+224>:  add    x0, x23, #0x2
XUL[0x718d40] <+228>:  mov    x23, x0
XUL[0x718d44] <+232>:  cmp    x0, x22
XUL[0x718d48] <+236>:  b.eq   0x7191f8                  ; <+1436> at normalizer2impl.cpp:1844:1
XUL[0x718d4c] <+240>:  ldrh   w10, [x23]
XUL[0x718d50] <+244>:  cmp    w10, w25
XUL[0x718d54] <+248>:  b.lo   0x718d3c                  ; <+224> at normalizer2impl.cpp:1753:17

ICU4X:

XUL[0x69616cc] <+548>:  mov    x8, x28
XUL[0x69616d0] <+552>:  cmp    x28, x22
XUL[0x69616d4] <+556>:  cset   w9, ne
XUL[0x69616d8] <+560>:  add    x28, x28, w9, uxtw #1
XUL[0x69616dc] <+564>:  b.eq   0x6961fc4                 ; <+2844> at lib.rs
XUL[0x69616e0] <+568>:  ldrh   w23, [x8]
XUL[0x69616e4] <+572>:  cmp    w12, w23
XUL[0x69616e8] <+576>:  b.hi   0x69616cc                 ; <+548> [inlined] core::slice::iter::Iter$LT$T$GT$::new::h9583c40a83f784cf at iter.rs:104:13

AFAICT, the pass-through bound ends up in a register in both cases. However, the pointer advancement seems more complicated in the ICU4X case, including advancing by register addition instead of adding an immediate, which seems strange to me.

Henri Sivonen (:hsivonen)

Assignee

Comment 11

•

2 months ago

•

Edited

(In reply to Henri Sivonen (:hsivonen) from comment #10)

However, the pointer advancement seems more complicated in the ICU4X case, including advancing by register addition instead of adding an immediate, which seems strange to me.

I see the compiler tries to be clever and unconditionally execute the addition but it materializes 1 or 0 in a register and the register is shifted left by one to get 2 or 0, so that the pointer doesn't actually advance when about to exit the loop.

Either this trickery is bad for perf, or there are some other effects like speculating too far into the code below this bit that makes the perf bad.

Henri Sivonen (:hsivonen)

Assignee

Comment 12

•

2 months ago

https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=146906

Henri Sivonen (:hsivonen)

Assignee

Comment 13

•

2 months ago

https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=146909

Henri Sivonen (:hsivonen)

Assignee

Updated

•

2 months ago

See Also: → https://github.com/unicode-org/icu4x/issues/6788

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Use ICU4X for str_normalize

Categories

(Core :: JavaScript: Standard Library, enhancement, P3)

Tracking

()

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 2 open bugs)

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Updated

Updated

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Updated

Attachment

General

Description

File Name

Content Type