Closed Bug 1848025 Opened 2 years ago Closed 2 years ago

1.78% (2 MB) apk-size-fenixDebug (Android) regression on Mon August 7 2023

Categories

(GeckoView :: General, defect)

Unspecified
Android
defect

Tracking

(firefox116 unaffected, firefox117 unaffected, firefox118+ wontfix)

RESOLVED WONTFIX
Tracking Status
firefox116 --- unaffected
firefox117 --- unaffected
firefox118 + wontfix

People

(Reporter: cpeterson, Assigned: m_kato)

References

(Regression)

Details

(Keywords: perf-alert, regression)

Perfherder has detected a build_metrics performance regression from push aa7f3681b1656ab5ba5b76adcc919fe16f412379. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Test Platform Options Absolute values (old vs new)
2% apk-size-fenixDebug fenix-android-all 119,559,411.17 -> 121,683,661.75

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) may be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

This APK size regression was caused by new ICU4x data added in bug 1719535.

Severity: -- → S2
Component: Performance → General
Product: Testing → GeckoView
Regressed by: 1719535

:m_kato, since you are the author of the regressor, bug 1719535, could you take a look?

For more information, please visit BugBot documentation.

Flags: needinfo?(m_kato)

The C-FFI layer may be introducing unneeded data. If I am reading the ICU4x segmenter crate, specifically line.rs, correctly:

  • ICU4XLineSegmenter_create_auto calls down to ffi::ICU4XLineSegmenter_::create_auto
  • that calls down to LineSegmenter::new_auto(), try_new_auto_with_buffer_provider, and try_new_auto_with_buffer_provider
  • new_auto() should only call down to the LSTM stuff, ~70 KB segment_line_v1.data.rs
  • The two macro provider things don't seem to ultimately go down to any data.

I guess turning off the compiled_data feature on create segmenter can help -- if it's not off yet. That still could bring in unneeded word-level segmentation data though.

By bug 1847958, APK size will be reduced on Nigtly before landing ICU4X. Most incremented data is CJ (Chinese and Japanese). ICU4X team is considering/investigating to move dictionary based implementation to other approaches such as ML.

Assignee: nobody → m_kato
Flags: needinfo?(m_kato)
Duplicate of this bug: 1848091

Hi, I'm from the ICU4X team. I know from previous benchmarks that the whole LSTM-based line segmentation component is about 378 KB, code and data together. Here is an example: https://github.com/unicode-org/icu4x/pull/3866

If you need word segmentation on top of line segmentation, the Chinese/Japanese word segmentation dictionary in ICU4X is identical to the one in ICU4C. It is about 2 MB uncompressed and 1.5 MB gzipped. That is big, but it should not be bigger than the ICU4C equivalent. We're happy to mentor anyone who wants to implement an ML-based solution for CJ word segmentation, like we did for Southeast Asian in the LSTM. There are already some early design proposals written.

See Also: → 1848842

Thanks for the reference points Shane. Since no references to word segmentation has been added as far as I can tell, the larger-than-expected (378 K -> 2 MB) growth must be because of something else having been added. Could be dependencies, could be inadvertent introduction of unused code & data (is LTO having a good enough peek through the FFI?).

ICU4X datagen won't generate more custom filter. If locale is th and ja, and data is LSTM and dictionary, th data will be generated for LSTM and dictionary. So I think that I will patch to datagen in update_icu4x.sh script when generating data.

Ah, I have already remove dictionaries for EA.

  • It is necessary change to support Chinese and Japanese word segmenter since this is implemented by dictionary. Before landing this, we cannot handle these languages.
  • ICU4X have a future plan to move CJ dictionary to ML-based data to reduce data size etc.
  • We already move EA (th, lo, my and km) languages from the dictionary based implementation to LSTM's to reduce data size.
  • I have fixed other APK size issue by bug 1847958 for Nightly.
  • compiled_data feature (comment #3) is new feature from ICU4X 1.3. But current stable is 1.2. Even if upgrading to 1.3, it seems to case other problem that is cargo vendor bug. (I will investigate and file this issue)

So this is wontfix.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.