Closed
Bug 1320812
Opened 8 years ago
Closed 8 years ago
stylo: Reduce/Remove work unit chunking for parallel traversal on non-ARMv7 platforms
Categories
(Core :: CSS Parsing and Computation, defect, P3)
Core
CSS Parsing and Computation
Tracking
()
RESOLVED
DUPLICATE
of bug 1366347
People
(Reporter: bholley, Unassigned)
References
(Blocks 1 open bug)
Details
We currently use units of 64 elements for parallel traversal. This causes us to lose out on a lot of potential parallelism, because we can't utilize additional cores until we reach very wide sections of the tree.
Patrick says that he added this chunking to reduce work stealing overhead ARMv7, which has very slow atomics. We should measure the impact, and either scale it down or eliminate it on x86.
| Reporter | ||
Comment 1•8 years ago
|
||
Here's some webconsole code I wrote to compute the number of elements at each depth in the DOM:
(function() {
function mergeDepthArrays(arrays) {
var result = [];
for (var array of arrays) {
for (var i = 0; i < array.length; ++i) {
result[i] = (result[i] || 0) + array[i];
}
}
return result;
}
function elementsByDepth(el) {
return Array.concat([el.childElementCount], mergeDepthArrays(Array.map(el.children, child => elementsByDepth(child))));
}
console.log(elementsByDepth(document));
})()
| Reporter | ||
Comment 2•8 years ago
|
||
Here's the discussion from IRC, for posterity:
<bholley> pcwalton: yt?
<@pcwalton> bholley: yup :)
<bholley> pcwalton: hey, how carefully-measured was the CHUNK_SIZE=64 on parallel traversal?
<bholley> pcwalton: that seems high
<@pcwalton> bholley: it was measured on an ARM board
<@pcwalton> ODROID XU3 I believe
<@pcwalton> it was an octa-core big.LITTLE device, ARMv7
<@pcwalton> ARMv7 has awful synchronization costs
<@pcwalton> it’s really kind of… broken for parallelism
<@pcwalton> I would be happy to reconsider this with an ARMv8
<@pcwalton> or even have different chunk sizes for systems with working parallelism and for those with broken parallelism :)
<bholley> pcwalton: yes, that would seems sensible
<@pcwalton> bholley: to be specific, in ARMv7 the only atomics involve full memory flushes
<@pcwalton> across all cores
<@pcwalton> this is, as you might expect, *very* slow :)
<bholley> pcwalton: well, modulo relaxed, no?
<@pcwalton> yeah, modulo relaxed
<@pcwalton> anything other than relaxed in ARMv7 is a full memory flush though
<bholley> pcwalton: so the goal of setting it that way on ARM is just to make work stealing happen very infrequently?
<@pcwalton> right
<@pcwalton> it’s to reduce load on the deques
<bholley> pcwalton: any idea how that interacts with rayon?
<@pcwalton> bholley: it may be totally different with Rayon because Rayon is more efficient than our old stuff was
<@pcwalton> I would say just drop fast ARMv7 support but there are a bunch of Galaxy phones on the market with Exynos SoCs and those are quad core ARMv7 :(
<@pcwalton> but maybe those will be replaced with ARMv8 at some point
<@pcwalton> we’ll have to watch the market
<bholley> pcwalton: ok, I'll file a bug to measure and tune later - do you remember what the optimal chunk size was before considering ARM?
<@pcwalton> bholley: 1 :)
<@pcwalton> the chunking mechanism was introduced *for* ARM :)
<notriddle> How long do we expect it to take before a Servo-based product is released for Android?
<bholley> notriddle: it's not targeted for the initial stylo release
<@pcwalton> well, I think Quantum is the relevant driver here
<bholley> notriddle: but shortly thererafter
<notriddle> Benching ARMv7 seems, intuitively, like aiming for where the ball is instead of where it's going to be.
<@SimonSapin> notriddle: depends if you count firefox with stylo as servo-based
<@pcwalton> notriddle: I would agree, except that I’m worried that people never replace their Android phones
<@pcwalton> still, that’s only a minor worry
<@pcwalton> and at worst we just detect ARMv7 and use a big chunk size there
<@pcwalton> and everywhere else use a small one
<bholley> pcwalton: did you notice a perf impact of the chunking mechanism on non-arm? Or did you not measure?
<bholley> pcwalton: seems to me like the impact could be pretty large
<@pcwalton> bholley: it did not have any measurable perf impact
<@pcwalton> on x86
<@pcwalton> however, this was with the old mechnaism
<@pcwalton> mechanism*
<@pcwalton> not rayon
<@pcwalton> it’s worth rebenchmarking all of this :)
<bholley> pcwalton: even so - the chunking basically prevents each additional core from kicking in until we have another 64 nodes at the same tree depth
<@pcwalton> yeah
<bholley> pcwalton: by my measurements, lots of sites don't get that wide until pretty far down the tree, if at all
<bholley> pcwalton: so rayon or deque, it seems like we'd just lose out on core utilization
<@pcwalton> interesting
<bholley> pcwalton: so I'm surprised it had no impact
<@pcwalton> it was only on a small number of pages at the time
<@pcwalton> since Servo couldn’t render more than a small number :)
<@pcwalton> it’s worth measuring again!
<bholley> pcwalton: https://gist.github.com/bholley/b0d8d869c7c64e5ca1db42d05e96cc50
<@pcwalton> yeah, like I said, feel free to tweak the number
<@pcwalton> I have 0 attachment to it :)
<bholley> pcwalton: thanks
<bholley> pcwalton: also, when you did your measurements on thread locality for the style sharing cache, was that before or after chunking?
| Reporter | ||
Comment 3•8 years ago
|
||
<@pcwalton> bholley: before chunking I think
<bholley> pcwalton: great
Comment 5•8 years ago
|
||
I believe chunking is nice for style sharing, so is this relevant anymore?
Flags: needinfo?(bobbyholley)
| Reporter | ||
Updated•8 years ago
|
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
You need to log in
before you can comment on or make changes to this bug.
Description
•