Closed Bug 1719535 Opened 4 years ago Closed 2 years ago

Integrate ICU4x segmenter into Gecko and use it to back the existing intl::mozilla::LineBreaker API.

Categories

(Core :: Internationalization, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
118 Branch
Tracking Status
firefox118 --- fixed

People

(Reporter: dminor, Assigned: m_kato)

References

(Blocks 4 open bugs, Regressed 4 open bugs)

Details

Attachments

(15 files, 3 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

The purpose of this work is to integrate the uax14_rs based segmenter in ICU4x into Gecko and use it replace the current implementation of the intl::mozilla::LineBreaker API. This will allow us to do some initial experimentation and testing inside of Gecko to understand the risks of switching to the ICU4x implementation.

It is possible that the results will be better than our current implementation for Linux. If that is the case, we may want to land this code, perhaps only enabled for Nightly, to provide segmentation for Linux.

It is also possible that it is too risky to land this code now, in which case, we'll at least have the integration ready for when the ICU4x implementation is more complete.

Bug 1713136 has some similar work that was investigating using ICU4x for NumberFormat, PluralRules, etc. Some of those commits may be helpful here:

We'll also need to expand the FFI to support the segmenter. Here's an example of a pullrequest adding more functions to the FFI: https://github.com/unicode-org/icu4x/pull/803. Supporting segmenter will be slightly more difficult since we'll need to add a completely new interface instead of extending an existing one.

Please let me know if there's anything I can do to help with the FFI or the integration into Gecko.

Severity: -- → S3
Priority: -- → P3

Dan,

I wanted to point out that the LineBreaker is currently a blocker for being able to roll out win32k lockdown: https://bugzilla.mozilla.org/show_bug.cgi?id=1713973 (I blocked on bug 1684927 but the line breaking is certainly critical because that's what we crash on).

I'm emphasizing this because I see you state:

...It is possible that the results will be better than our current implementation for Linux...

But we absolutely need this on Windows too, even if it's not "better" than the current implementation (which calls MS code).

Flags: needinfo?(dminor)

Another example is bug 1719537 where there's talk about comparing performance, but due to win32k lockdown we're looking at the performance vs a fully remoted implementation (which we fear is prohibitively slow, and hence we did not start to implement this) vs ICU4x.

If ICU4x Linebreaking would be so slow it noticeably affects page load times, I can see we have an issue (but we'd see it on our current tests, I'd hope?). But simply being a bit slower than the Microsoft implementation shouldn't even be a factor.

I wanted to point out that the LineBreaker is currently a blocker for being able to roll out win32k lockdown: https://bugzilla.mozilla.org/show_bug.cgi?id=1713973 (I blocked on bug 1684927 but the line breaking is certainly critical because that's what we crash on).

Newer line breaker (and word breaker) don't use Win32 API. But newer one has performance issue for East Asia languages. If we need to turn on win32 lockdown soon, we have to discuss this that we replace platform API with LSTM/dictionary segmenter.

If ICU4x Linebreaking would be so slow it noticeably affects page load times,

This is perf data on Linux. https://docs.google.com/spreadsheets/d/1rcvPSbUYxNmM9B0iJTJinWegsKStnZ72VGgmzXx3BsY/. But I guess that this is same issue on Windows although I don't check uniscriber API.

Hi Gian-Carlo,

Thank you for the additional information.

The plan is to switch to ICU4X for segmentation, but ICU4X is pre-1.0 and it may not be ready for production use yet, which is why this bug is scoped in terms of experimentation and testing, and maybe enabling it for Linux if things go well.

ICU4X does not yet support word breaking, Makoto is implementing that currently, with the plan for that to be done in October. We need both line and word breaking to be able to replace the existing windows code, so even if everything goes perfectly, we're looking at November for integration work. And that is assuming we don't hit any serious bugs, schedule slips for other parts of ICU4X that we depend on, etc.

I agree with Makoto, if we need to turn on win32 lockdown soon, we should meet to discuss timelines and options.

Flags: needinfo?(dminor)
Depends on: 1746770

Depends on D167670

Since icu_capi crate has a lot of dependencies we don't want, this adds a
subset of the icu_capi crate.

Depends on D167671

Fixing tests for new segmenter rules.

Depends on D167677

Due to https://github.com/rust-lang/rust/issues/106674, searchfox cannot create
indexing for ICU4X. So we should disable it until rust or ICU4X side is fixed.

Depends on D167678

Attachment #9313785 - Attachment description: WIP: Bug 1719535 - Part 6. Add ICU4X based segmenter modules. → WIP: Bug 1719535 - Part 5. Add ICU4X based segmenter modules.
Attachment #9313784 - Attachment description: WIP: Bug 1719535 - Part 5. mach vendor rust for ICU4X crates. → WIP: Bug 1719535 - Part 6. mach vendor rust for ICU4X crates.
Attachment #9313780 - Attachment description: WIP: Bug 1719535 - Part 1. Update ICU4X data generator script. → Bug 1719535 - Part 1. Update ICU4X data generator script.
Attachment #9313781 - Attachment description: WIP: Bug 1719535 - Part 2. Add ICU4X data postcard. → Bug 1719535 - Part 2. Add ICU4X data postcard.
Attachment #9313782 - Attachment description: WIP: Bug 1719535 - Part 3. Add the subset of icu_capi in tree. → Bug 1719535 - Part 3. Add the subset of icu_capi in tree.
Attachment #9313783 - Attachment description: WIP: Bug 1719535 - Part 4. Add ICU4X Data Provider. → Bug 1719535 - Part 4. Add ICU4X Data Provider.
Blocks: icu4x

ugh, https://github.com/rust-diplomat/diplomat/issues/326. No way to build on Windows.....

Attachment #9313781 - Attachment is obsolete: true
Attachment #9313780 - Attachment description: Bug 1719535 - Part 1. Update ICU4X data generator script. → WIP: Bug 1719535 - Part 1. Update ICU4X data generator script. r=#i18n-reviewres,TYLin

When using baked data provider in icu_capi, it uses icu_testdata crate.
But we add customized baked data as patched icu_testdata.

Depends on D167670

No longer use postdata since we can use backed data provider.

Depends on D178156

Attachment #9313782 - Attachment is obsolete: true
Attachment #9313783 - Attachment description: Bug 1719535 - Part 4. Add ICU4X Data Provider. → WIP: Bug 1719535 - Part 4. Add ICU4X Data Provider. r=#i18n-reviewers,TYLin
Attachment #9313784 - Attachment description: WIP: Bug 1719535 - Part 6. mach vendor rust for ICU4X crates. → WIP: Bug 1719535 - Part 6. mach vendor rust for ICU4X crates. r=#i18n-reviewers!,#supply-reviewers!
Attachment #9313913 - Attachment description: WIP: Bug 1719535 - Part 11. Enable icu4x segmenter on Linux/Android Nightly. → WIP: Bug 1719535 - Part 10. Enable icu4x segmenter on Nightly.
Attachment #9313789 - Attachment is obsolete: true
Attachment #9313780 - Attachment description: WIP: Bug 1719535 - Part 1. Update ICU4X data generator script. r=#i18n-reviewres,TYLin → Bug 1719535 - Part 1. Update ICU4X data generator script. r=#i18n-reviewres,TYLin
Attachment #9333858 - Attachment description: WIP: Bug 1719535 - Part 2. Add customized icu_testdata crate in tree. r=#i18n-reviewers,TYLin → Bug 1719535 - Part 2. Add customized icu_testdata crate in tree. r=#i18n-reviewers,TYLin
Attachment #9333859 - Attachment description: WIP: Bug 1719535 - Part 3. Remove config/external/icu4x. r=#i18n-reviewers → Bug 1719535 - Part 3. Remove config/external/icu4x. r=#i18n-reviewers
Attachment #9313785 - Attachment description: WIP: Bug 1719535 - Part 5. Add ICU4X based segmenter modules. → Bug 1719535 - Part 5. Add ICU4X based segmenter modules.
Attachment #9313780 - Attachment description: Bug 1719535 - Part 1. Update ICU4X data generator script. r=#i18n-reviewres,TYLin → Bug 1719535 - Part 1. Update ICU4X data generator script. r=#platform-i18n-reviewers,TYLin
Attachment #9333858 - Attachment description: Bug 1719535 - Part 2. Add customized icu_testdata crate in tree. r=#i18n-reviewers,TYLin → Bug 1719535 - Part 2. Add customized icu_testdata crate in tree. r=#platform-i18n-reviewers,TYLin
Attachment #9333859 - Attachment description: Bug 1719535 - Part 3. Remove config/external/icu4x. r=#i18n-reviewers → Bug 1719535 - Part 3. Remove config/external/icu4x. r=#platform-i18n-reviewers
Attachment #9313783 - Attachment description: WIP: Bug 1719535 - Part 4. Add ICU4X Data Provider. r=#i18n-reviewers,TYLin → Bug 1719535 - Part 4. Add ICU4X Data Provider. r=#platform-i18n-reviewers,TYLin
Attachment #9313785 - Attachment description: Bug 1719535 - Part 5. Add ICU4X based segmenter modules. → Bug 1719535 - Part 5. Add ICU4X based segmenter modules. r=TYLin
Attachment #9313784 - Attachment description: WIP: Bug 1719535 - Part 6. mach vendor rust for ICU4X crates. r=#i18n-reviewers!,#supply-reviewers! → Bug 1719535 - Part 6. mach vendor rust for ICU4X crates. r=#platform-i18n-reviewers!,#supply-chain-reviewers!,#firefox-build-system-reviewers!
Attachment #9313786 - Attachment description: WIP: Bug 1719535 - Part 7. Build ICU4X as default. → Bug 1719535 - Part 7. Build ICU4X as default. r=#firefox-build-system-reviewers!,#platform-i18n-reviewers!
Attachment #9313787 - Attachment description: WIP: Bug 1719535 - Part 8. Update license. → Bug 1719535 - Part 8. Update license. r=#platform-i18n-reviewers!

ML/Plain text Serializer uses line break segmenter to wrap text. New
segmenter that is compatible with UAX#14 has different rules for old segmenter.

Ole segmenter have break opportunity before ASCII space, but UAX#14 doesn't
have it (https://www.unicode.org/reports/tr14/#LB7).

So we have to trim ASCII space at the tail for text wrap.

Depends on D167677

Fixing tests for new segmenter rules.

Depends on D180773

Attachment #9313788 - Attachment description: WIP: Bug 1719535 - Part 9. Update test results with new segmenter. → Bug 1719535 - Part 10.3. Update test results of layout with new segmenter. r=TYLin
Attachment #9313913 - Attachment description: WIP: Bug 1719535 - Part 10. Enable icu4x segmenter on Nightly. → Bug 1719535 - Part 11. Enable icu4x segmenter on Nightly. r=#platform-i18n-reviewers
Blocks: 1838941
Attachment #9338785 - Attachment description: Bug 1719535 - Part 9. Trim ASCII space at the tail for content serializer. r=kmag → Bug 1719535 - Part 9. Trim ASCII space at the tail for content serializer. r=#dom-core
Attachment #9338786 - Attachment description: Bug 1719535 - Part 10.1 - Update test results of a11y with new segmenter. r=Jamie → Bug 1719535 - Part 10.1. Update test results of a11y with new segmenter. r=Jamie
Attachment #9338787 - Attachment description: Bug 1719535 - Part 10.2 - Update test results of editor with new segmenter. r=masayuki → Bug 1719535 - Part 10.2. Update test results of editor with new segmenter. r=masayuki
Pushed by m_kato@ga2.so-net.ne.jp: https://hg.mozilla.org/integration/autoland/rev/caa2b8a731dc Part 1. Update ICU4X data generator script. r=TYLin,platform-i18n-reviewers,gregtatum https://hg.mozilla.org/integration/autoland/rev/45f9ad406428 Part 2. Add customized icu_testdata crate in tree. r=TYLin https://hg.mozilla.org/integration/autoland/rev/e1abdee07587 Part 3. Remove config/external/icu4x. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/f16f9e583378 Part 4. Add ICU4X Data Provider. r=platform-i18n-reviewers,dminor,sylvestre https://hg.mozilla.org/integration/autoland/rev/d615264f9f5f Part 5. Add ICU4X based segmenter modules. r=TYLin,jfkthame https://hg.mozilla.org/integration/autoland/rev/94536f0d5a1f Part 6. mach vendor rust for ICU4X crates. r=platform-i18n-reviewers,supply-chain-reviewers,firefox-build-system-reviewers,glandium,dminor https://hg.mozilla.org/integration/autoland/rev/52ca9bdc5bd4 Part 7. Build ICU4X as default. r=firefox-build-system-reviewers,platform-i18n-reviewers,glandium,dminor https://hg.mozilla.org/integration/autoland/rev/90f1f621031f Part 8. Update license. r=platform-i18n-reviewers,dminor,sylvestre https://hg.mozilla.org/integration/autoland/rev/d3f8850a343b Part 9. Trim ASCII space at the tail for content serializer. r=TYLin https://hg.mozilla.org/integration/autoland/rev/29977dc00fb3 Part 10.1. Update test results of a11y with new segmenter. r=Jamie https://hg.mozilla.org/integration/autoland/rev/09c216cdbbc8 Part 10.2. Update test results of editor with new segmenter. r=masayuki https://hg.mozilla.org/integration/autoland/rev/63b9d2064827 Part 10.3. Update test results of layout with new segmenter. r=TYLin https://hg.mozilla.org/integration/autoland/rev/3fdb2e999aba Part 11. Enable icu4x segmenter on Nightly. r=platform-i18n-reviewers,dminor,jfkthame https://hg.mozilla.org/integration/autoland/rev/338c2c8d91af Part 12. Update icu_provider_macros not to depend on syn 1.0. r=supply-chain-reviewers,firefox-build-system-reviewers,glandium

New word segmenter that is compatible with UAX#29 returns different results
against the old word segmenter. Adding both results.

Pushed by m_kato@ga2.so-net.ne.jp: https://hg.mozilla.org/integration/autoland/rev/8592592c5dc0 Part 1. Update ICU4X data generator script. r=TYLin,platform-i18n-reviewers,gregtatum https://hg.mozilla.org/integration/autoland/rev/f7c8d3fcc5f1 Part 2. Add customized icu_testdata crate in tree. r=TYLin https://hg.mozilla.org/integration/autoland/rev/923c978e0903 Part 3. Remove config/external/icu4x. r=platform-i18n-reviewers,dminor https://hg.mozilla.org/integration/autoland/rev/d6d90bcaff76 Part 4. Add ICU4X Data Provider. r=platform-i18n-reviewers,dminor,sylvestre https://hg.mozilla.org/integration/autoland/rev/32ffa85d211f Part 5. Add ICU4X based segmenter modules. r=TYLin,jfkthame https://hg.mozilla.org/integration/autoland/rev/f985278bdae0 Part 6. mach vendor rust for ICU4X crates. r=platform-i18n-reviewers,supply-chain-reviewers,firefox-build-system-reviewers,glandium,dminor https://hg.mozilla.org/integration/autoland/rev/f94f952254b7 Part 7. Build ICU4X as default. r=firefox-build-system-reviewers,platform-i18n-reviewers,glandium,dminor https://hg.mozilla.org/integration/autoland/rev/5eda0d228eaf Part 8. Update license. r=platform-i18n-reviewers,dminor,sylvestre https://hg.mozilla.org/integration/autoland/rev/40b0bd92366e Part 9. Trim ASCII space at the tail for content serializer. r=TYLin https://hg.mozilla.org/integration/autoland/rev/b892200683b4 Part 10.1. Update test results of a11y with new segmenter. r=Jamie https://hg.mozilla.org/integration/autoland/rev/03cd90cbd76d Part 10.2. Update test results of editor with new segmenter. r=masayuki https://hg.mozilla.org/integration/autoland/rev/01b343f383a4 Part 10.3. Update test results of layout with new segmenter. r=TYLin https://hg.mozilla.org/integration/autoland/rev/b65cd77013a5 Part 11. Enable icu4x segmenter on Nightly. r=platform-i18n-reviewers,dminor,jfkthame https://hg.mozilla.org/integration/autoland/rev/87e0b7b87a9e Part 12. Update icu_provider_macros not to depend on syn 1.0. r=supply-chain-reviewers,firefox-build-system-reviewers,glandium https://hg.mozilla.org/integration/autoland/rev/d6760db62dc1 Part 13. Fix browser_text_basics.js for new word segmenter. r=Jamie
Flags: needinfo?(m_kato)
Duplicate of this bug: 774965
Duplicate of this bug: 1569566
Duplicate of this bug: 56652
Duplicate of this bug: 345823
Duplicate of this bug: 820261
Duplicate of this bug: 933631
Duplicate of this bug: 1781989
Duplicate of this bug: 1267120
Duplicate of this bug: 359179
Duplicate of this bug: 1336266
Duplicate of this bug: 1838941
Duplicate of this bug: 1275486
Duplicate of this bug: 1645314
Blocks: 1568219
Duplicate of this bug: 771753
Duplicate of this bug: 1808001
Regressions: 1848025
Regressions: 1848049
Regressions: 1848091
Regressions: 1848282
Regressions: 1848354
Duplicate of this bug: 1682721
Duplicate of this bug: 1228810
Regressions: 1850620
Regressions: 1851323
Duplicate of this bug: 1732149
Duplicate of this bug: 1825458
No longer blocks: 1838941
Duplicate of this bug: 1611578
Duplicate of this bug: 628825
No longer duplicate of this bug: 628825
Duplicate of this bug: 1325807
Regressions: 1876874
Regressions: 1880362
No longer regressed by: 1879221
Regressions: 1879221
Regressions: 1961118
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: