Bug 1974469 (Open) - Opened 4 months ago, updated 3 months ago

completeUnitOfWork is 2.1x faster in Chrome than Firefox

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

People

(Reporter: jrmuizel, Unassigned)

References

(Blocks 1 open bug)

Attachments

(1 file)

Making Firefox as fast as Chrome on this function would reduce Firefox's total time on TodoMVC-React-Redux by 1.3%.

Profiles:
Firefox
Chrome

Sample counts for the inner for loop:
SM: 1296
V8: 629

Some commentary from Iain:

A few observations, in the order that I bumped into them:

  1. We're testing if child is null and then unboxing it as an object, which is slightly redundant. V8 does a tiny bit better here.
  2. We're emitting spectre mitigations on our shape guards.
  3. childExpirationTime/expirationTime are in slots 21/22, which is larger than max fixed slots (16), so they're stored out-of-line. V8 stores them inline.
  4. V8 is doing enough field type tracking to know that child.expirationTime and child.childExpirationTime are Smis without needing to guard. We have fallible unboxes.
  5. V8 is keeping newChildExpirationTime in a register as an unboxed value. We store it boxed, and unbox/rebox it each time we update it.

Here's a microbenchmark that I think captures what's going on here: https://gist.github.com/iainireland/3a09e5ddc47d89cf23aa241b9ba8a079
I get around 165ms in the SM shell with spectre mitigations enabled, ~100ms with spectre disabled, and ~60-65ms in V8.
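
For orientation, here is a minimal reconstruction of the loop shape those observations describe, pieced together from the field names mentioned in this bug. The function name and the sibling field are guesses at the surrounding React code, not the exact contents of the gist:

    // Walk the completed node's children, propagating the maximum
    // expiration time upward. expirationTime/childExpirationTime come
    // from the observations above; "sibling" is an assumed link field.
    function resetChildExpirationTime(completedWork) {
      var newChildExpirationTime = 0;
      for (var child = completedWork.child; child !== null; child = child.sibling) {
        if (child.expirationTime > newChildExpirationTime) {
          newChildExpirationTime = child.expirationTime;
        }
        if (child.childExpirationTime > newChildExpirationTime) {
          newChildExpirationTime = child.childExpirationTime;
        }
      }
      completedWork.childExpirationTime = newChildExpirationTime;
    }

Each iteration then does a null test plus unbox of child (#1), shape guards with spectre mitigations (#2), two out-of-line slot loads (#3), two fallible unboxes (#4), and a box/unbox of newChildExpirationTime (#5).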

Attached file annotations.txt

I've attached my annotated disassembly. Pulling together a bit of additional analysis from the #perf-sp3 channel:

  1. Redundant null-check / unbox: this is probably pretty marginal performance-wise. It would be nice to fix if we had an easy way to do so, but I don't think it's a priority.
  2. The spectre mitigations are there because Jeff ran this in the shell. We should consider disabling spectre mitigations in the shell to prevent this kind of confusion in the future.
  3. It looks like, with equivalently sized objects, if the number of nodes is increased so that the working set no longer fits in L1, we get a 10% speedup when time/childTime are in fixed slots instead of dynamic slots. Last time I talked to Jon about dynamic slots, he said the slots/elements allocator work he did was a step towards being able to allocate objects with more fixed slots, but that we're still pretty far away.
  4. This is also probably fairly marginal. Our unboxing is already a little clunkier than unboxing a Smi, but making it infallible would only remove a couple of instructions with no dependencies. It's not obvious that the performance improvement from removing those instructions would outweigh the book-keeping necessary to justify the removal.
  5. Basically, the problem here is that we have a system of phis with one int32 input and a couple of LoadSlot inputs, which causes us to box the int32 input. It would be nice if we could observe that all uses of those phis unbox as int32, and insert speculative unboxes instead of conservative boxes (marking the unboxes with a new bailout kind so that we can disable the optimization if necessary to prevent bailout loops). I simulated this change by adding + 0 to each of the updates (e.g. if (child.time > newTime) newTime = child.time + 0), which forces us to unbox the LoadSlot as an Int32, giving the phis the types we want (see the sketch below). It gives us the codegen we want, but surprisingly doesn't seem to affect performance at all.
  6. V8 unrolls this loop. Turning it off doesn't meaningfully affect V8's performance, so I don't think it's important here.
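
To make the #5 workaround concrete, here is the shape of the + 0 tweak, using the shorthand names from item 5 (illustrative only, not the actual patch):

    // Before: newTime's phi mixes an int32 input with boxed LoadSlot
    // inputs, so the int32 input gets boxed and each update reboxes it.
    if (child.time > newTime) newTime = child.time;

    // After: the + 0 forces the LoadSlot result to be unboxed as Int32
    // before it reaches the phi, giving the phi the int32 type we want.
    if (child.time > newTime) newTime = child.time + 0;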

The most promising opportunity here looks like #3, but that's not a short-term project.
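
For anyone who wants to reproduce the #3 experiment, here is one hypothetical way to set it up in the shell. The constructor names are made up; the sketch relies only on the 16-fixed-slot limit from observation 3:

    // Both node kinds have 23 properties, so the objects are the same
    // size; only the position of the hot fields differs.

    // Hot fields first: they land in the first few slots, inside the
    // fixed-slot limit, so reads avoid the out-of-line slots pointer.
    function FixedSlotNode(next) {
      this.expirationTime = 0;
      this.childExpirationTime = 0;
      this.sibling = next;
      for (var i = 0; i < 20; i++) this["pad" + i] = 0;
    }

    // Twenty filler fields first: the hot fields end up past the 16
    // fixed slots and are stored out of line, matching the layout seen
    // in the profile.
    function DynamicSlotNode(next) {
      for (var i = 0; i < 20; i++) this["pad" + i] = 0;
      this.expirationTime = 0;
      this.childExpirationTime = 0;
      this.sibling = next;
    }

    // Build a child list of each kind, sized so the working set falls
    // out of L1, and time the traversal loop sketched earlier over each.
    var work = { child: null };
    for (var i = 0; i < 100000; i++) work.child = new DynamicSlotNode(work.child);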

I ran the full shell version of react-redux that Jeff had been using for the original tests, with my tweaks to evaluate #3 and #5, and then compared the number of samples in completeUnitOfWork. Both looked like they might be 5-10% improvements, but the numbers were noisy enough that it was hard to tell; neither of them doubled our speed single-handedly, anyway. To be fair, the inner for-loop seems to account for less than half of the total time spent in this function, so the speed-up on the affected code might be a bit bigger.

My main takeaway from this exercise is that sample counts are very noisy.

Blocks: sm-js-perf
Severity: -- → S4
Priority: -- → P3