Integrate ICU4x segmenter into Gecko and use it to back the existing intl::mozilla::LineBreaker API.
Categories
(Core :: Internationalization, enhancement, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox118 | --- | fixed |
People
(Reporter: dminor, Assigned: m_kato)
References
(Blocks 4 open bugs, Regressed 4 open bugs)
Details
Attachments
(15 files, 3 obsolete files)
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
The purpose of this work is to integrate the uax14_rs based segmenter in ICU4x into Gecko and use it replace the current implementation of the intl::mozilla::LineBreaker API. This will allow us to do some initial experimentation and testing inside of Gecko to understand the risks of switching to the ICU4x implementation.
It is possible that the results will be better than our current implementation for Linux. If that is the case, we may want to land this code, perhaps only enabled for Nightly, to provide segmentation for Linux.
It is also possible that it is too risky to land this code now, in which case, we'll at least have the integration ready for when the ICU4x implementation is more complete.
Bug 1713136 has some similar work that was investigating using ICU4x for NumberFormat, PluralRules, etc. Some of those commits may be helpful here:
- Skip the license check, which currently fails due to a bug in
cargo vendor
: https://phabricator.services.mozilla.com/D116133 - Add ICU4x crates: https://phabricator.services.mozilla.com/D116134
We'll also need to expand the FFI to support the segmenter. Here's an example of a pullrequest adding more functions to the FFI: https://github.com/unicode-org/icu4x/pull/803. Supporting segmenter will be slightly more difficult since we'll need to add a completely new interface instead of extending an existing one.
Please let me know if there's anything I can do to help with the FFI or the integration into Gecko.
Reporter | ||
Updated•4 years ago
|
Comment 1•4 years ago
•
|
||
Dan,
I wanted to point out that the LineBreaker is currently a blocker for being able to roll out win32k lockdown: https://bugzilla.mozilla.org/show_bug.cgi?id=1713973 (I blocked on bug 1684927 but the line breaking is certainly critical because that's what we crash on).
I'm emphasizing this because I see you state:
...It is possible that the results will be better than our current implementation for Linux...
But we absolutely need this on Windows too, even if it's not "better" than the current implementation (which calls MS code).
Comment 2•4 years ago
•
|
||
Another example is bug 1719537 where there's talk about comparing performance, but due to win32k lockdown we're looking at the performance vs a fully remoted implementation (which we fear is prohibitively slow, and hence we did not start to implement this) vs ICU4x.
If ICU4x Linebreaking would be so slow it noticeably affects page load times, I can see we have an issue (but we'd see it on our current tests, I'd hope?). But simply being a bit slower than the Microsoft implementation shouldn't even be a factor.
Assignee | ||
Comment 3•4 years ago
•
|
||
I wanted to point out that the LineBreaker is currently a blocker for being able to roll out win32k lockdown: https://bugzilla.mozilla.org/show_bug.cgi?id=1713973 (I blocked on bug 1684927 but the line breaking is certainly critical because that's what we crash on).
Newer line breaker (and word breaker) don't use Win32 API. But newer one has performance issue for East Asia languages. If we need to turn on win32 lockdown soon, we have to discuss this that we replace platform API with LSTM/dictionary segmenter.
If ICU4x Linebreaking would be so slow it noticeably affects page load times,
This is perf data on Linux. https://docs.google.com/spreadsheets/d/1rcvPSbUYxNmM9B0iJTJinWegsKStnZ72VGgmzXx3BsY/. But I guess that this is same issue on Windows although I don't check uniscriber API.
Reporter | ||
Comment 4•4 years ago
|
||
Hi Gian-Carlo,
Thank you for the additional information.
The plan is to switch to ICU4X for segmentation, but ICU4X is pre-1.0 and it may not be ready for production use yet, which is why this bug is scoped in terms of experimentation and testing, and maybe enabling it for Linux if things go well.
ICU4X does not yet support word breaking, Makoto is implementing that currently, with the plan for that to be done in October. We need both line and word breaking to be able to replace the existing windows code, so even if everything goes perfectly, we're looking at November for integration work. And that is assuming we don't hit any serious bugs, schedule slips for other parts of ICU4X that we depend on, etc.
I agree with Makoto, if we need to turn on win32 lockdown soon, we should meet to discuss timelines and options.
Assignee | ||
Comment 5•3 years ago
|
||
Assignee | ||
Comment 6•3 years ago
|
||
Depends on D167670
Assignee | ||
Comment 7•3 years ago
|
||
Since icu_capi crate has a lot of dependencies we don't want, this adds a
subset of the icu_capi crate.
Depends on D167671
Assignee | ||
Comment 8•3 years ago
|
||
Depends on D167672
Assignee | ||
Comment 9•3 years ago
|
||
Depends on D167673
Assignee | ||
Comment 10•3 years ago
|
||
Depends on D167674
Assignee | ||
Comment 11•3 years ago
|
||
Depends on D167675
Assignee | ||
Comment 12•3 years ago
|
||
Depends on D167676
Assignee | ||
Comment 13•3 years ago
|
||
Fixing tests for new segmenter rules.
Depends on D167677
Assignee | ||
Comment 14•3 years ago
|
||
Due to https://github.com/rust-lang/rust/issues/106674, searchfox cannot create
indexing for ICU4X. So we should disable it until rust or ICU4X side is fixed.
Depends on D167678
Assignee | ||
Comment 15•3 years ago
|
||
Depends on D167679
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 16•2 years ago
|
||
ugh, https://github.com/rust-diplomat/diplomat/issues/326. No way to build on Windows.....
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 17•2 years ago
|
||
When using baked data provider in icu_capi, it uses icu_testdata crate.
But we add customized baked data as patched icu_testdata.
Depends on D167670
Assignee | ||
Comment 18•2 years ago
|
||
No longer use postdata since we can use backed data provider.
Depends on D178156
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 19•2 years ago
|
||
ML/Plain text Serializer uses line break segmenter to wrap text. New
segmenter that is compatible with UAX#14 has different rules for old segmenter.
Ole segmenter have break opportunity before ASCII space, but UAX#14 doesn't
have it (https://www.unicode.org/reports/tr14/#LB7).
So we have to trim ASCII space at the tail for text wrap.
Depends on D167677
Assignee | ||
Comment 20•2 years ago
|
||
Fixing tests for new segmenter rules.
Depends on D180773
Assignee | ||
Comment 21•2 years ago
|
||
Depends on D180774
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 22•2 years ago
|
||
Depends on D167745
Updated•2 years ago
|
Updated•2 years ago
|
Comment 23•2 years ago
|
||
Comment 24•2 years ago
|
||
Backed out for causing mochitest failures in accessible/tests/browser/mac/browser_text_basics.js
Backout link: https://hg.mozilla.org/integration/autoland/rev/97a63d3e439d20f36a3d5c2dac82e63cfb18e846
Assignee | ||
Comment 25•2 years ago
|
||
New word segmenter that is compatible with UAX#29 returns different results
against the old word segmenter. Adding both results.
Comment 26•2 years ago
|
||
Assignee | ||
Updated•2 years ago
|
Comment 27•2 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8592592c5dc0
https://hg.mozilla.org/mozilla-central/rev/f7c8d3fcc5f1
https://hg.mozilla.org/mozilla-central/rev/923c978e0903
https://hg.mozilla.org/mozilla-central/rev/d6d90bcaff76
https://hg.mozilla.org/mozilla-central/rev/32ffa85d211f
https://hg.mozilla.org/mozilla-central/rev/f985278bdae0
https://hg.mozilla.org/mozilla-central/rev/f94f952254b7
https://hg.mozilla.org/mozilla-central/rev/5eda0d228eaf
https://hg.mozilla.org/mozilla-central/rev/40b0bd92366e
https://hg.mozilla.org/mozilla-central/rev/b892200683b4
https://hg.mozilla.org/mozilla-central/rev/03cd90cbd76d
https://hg.mozilla.org/mozilla-central/rev/01b343f383a4
https://hg.mozilla.org/mozilla-central/rev/b65cd77013a5
https://hg.mozilla.org/mozilla-central/rev/87e0b7b87a9e
https://hg.mozilla.org/mozilla-central/rev/d6760db62dc1
Comment 30•2 years ago
|
||
Description
•