Closed
Bug 1385982
Opened 7 years ago
Closed 7 years ago
stylo: parallel traversal and style sharing still don't play nice together
Categories
(Core :: CSS Parsing and Computation, enhancement, P3)
Tracking
()
RESOLVED
FIXED
mozilla57
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox55 | --- | wontfix |
firefox56 | --- | wontfix |
firefox57 | --- | fixed |
People
(Reporter: bzbarsky, Assigned: bzbarsky)
Details
Attachments
(1 file)
Testcase is: <!DOCTYPE html> <div id=first><span id="infirst"></span></div> <div id=second><span id="insecond"></span></div> <div id=third><span id="inthird"></span></div> I'm seeing us use one style sharing cache (by pointer-identity) when styling the three divs and the <span id="inthird">, a second one for <span id="insecond">, and a third on for <span id="infirst">. I would have expected all the work here to happen as a single work unit, on a single thread... I get no difference in behavior if I force may_dispatch_tail to false in traverse_nodes in parallel.rs. Looking at top_down_dom it seems to be a bit confused. On the one hand, it has: // Collect all the children of the elements in our work unit. This will // contain the combined children of up to WORK_UNIT_MAX nodes ... let mut discovered_child_nodes = SmallVec::<[SendNode<E::ConcreteNode>; 128]>::new(); On the other hand it does: for n in nodes { // If the last node we processed produced children, spawn them off // into a work item. ... if !discovered_child_nodes.is_empty() { /* Do travers_nodes on discovered_child_nodes */ discovered_child_nodes.clear(); } So in practice we never have more than one element's kids in discovered_child_nodes. And in this case we end up with three separate work units, in exactly the way described above; the "inthird" span doesn't get spawned via that nested if and lands in the traverse_nodes() call in the end, which may or may not be a tail call; in tail mode it gets handled on the same thread, in non-tail it may not be, whatever. I'm going to test making that inner check a bit less strict and waiting until we have a full work item before kicking off a separate traverse.
Assignee | ||
Updated•7 years ago
|
Summary: stylo: parallel traversal and style sharing still don't play nice together somehow → stylo: parallel traversal and style sharing still don't play nice together
Assignee | ||
Comment 1•7 years ago
|
||
OK, patch coming up. Without that patch, loading our usual test page at https://en.wikipedia.org/wiki/Barack_Obama as of today I see 7463 non-pseudo style contexts once things settle. With the patch, I see 5164. In STYLO_THREADS=1 mode, I see 2236.
Comment hidden (mozreview-request) |
Comment 3•7 years ago
|
||
Hm. So the current behavior is somewhat intentional, in the sense that we previously would get no parallelism until the tree got wide enough, and that really hurt us. I fixed it in bug 1366347, and the difference was pretty pronounced. But it's possible that we over-rotated. In particular, the work unit size was 64 before, and now it's 16. So I'm certainly willing to revisit this, but would like to do some measurements. Ideally we'd measure the parallel traversal on our (locally-cached) testcases: * obama * myspace * amazon product page * youtube homepage * facebook And compare both total traversal times as well as amount of style sharing. We should compare with/without this patch, and also with half/double workunit sizes. Is that something you're willing to do?
Flags: needinfo?(bzbarsky)
Comment hidden (mozreview-request) |
Assignee | ||
Comment 5•7 years ago
|
||
I can probably do that if you link me to the relevant cached testcases. As far as measuring traversal times, do we have existing logging to do this, or do you just do it via the profiler or something else?
Flags: needinfo?(bzbarsky) → needinfo?(bobbyholley)
Assignee | ||
Comment 6•7 years ago
|
||
Also, are you just measuring pageload, or some sort of toplevel restyle for your traversal measurements?
Comment 7•7 years ago
|
||
(In reply to Boris Zbarsky [:bz] from comment #5) > I can probably do that if you link me to the relevant cached testcases. As > far as measuring traversal times, do we have existing logging to do this, or > do you just do it via the profiler or something else? I bundled up all of the above besides facebook (which doesn't have style overhead logged out, and which I don't want to publish logged in). I haven't been measuring facebook much anyway. https://www.dropbox.com/s/a8mtk66jplcs1rj/stylo-sitetests.zip?dl=0 (In reply to Boris Zbarsky [:bz] from comment #6) > Also, are you just measuring pageload, or some sort of toplevel restyle for > your traversal measurements? Just pageload, usually with the gecko profiler profiler at 0.5ms resolution. Not profiling StyleThreads, just measuring time in Servo_TraverseSubtree.
Flags: needinfo?(bobbyholley)
Comment 8•7 years ago
|
||
mozreview-review |
Comment on attachment 8892144 [details] Bug 1385982. Don't start kicking off work units during parallel stylo traversal until they're actually full. https://reviewboard.mozilla.org/r/163122/#review168494 Canceling review until the measurements come in.
Attachment #8892144 -
Flags: review?(bobbyholley)
Comment 9•7 years ago
|
||
Oh, and if it's not too late, please also test http://tests.themasta.com/twitter-compact/twitter-main.html
Assignee | ||
Comment 10•7 years ago
|
||
OK, so data. Without the proposed change: Amazon: 2793 styles, 44.5ms, 45ms, 45.5ms, 46.5ms, 47.5ms Myspace: 1349 styles, 15ms, 8.5ms, 22.5ms, 7.5ms, 8.5ms Obama: 5415 styles, 18ms, 14.5ms, 19.5ms, 15ms, 16.5ms YouTube: 3683 styles, 52ms, 74ms, 64.5ms, 42.5ms, 44.5ms Twitter: 1631 styles, 73.5ms, 54.5ms, 63ms, 65ms, 83.5ms With the proposed change: Amazon: 2735 styles, 43.5ms, 46ms, 38.5ms, 41ms, 42ms Myspace: 1105 styles, 22ms, 11.5ms, 10ms, 13ms, 15ms Obama: 4227 styles, 16ms, 14ms, 14ms, 16.5ms, 17ms YouTube: 3002 styles, 61ms, 42ms, 66.5ms, 55.5ms, 64ms Twitter: 1461 styles, 61ms, 87.5ms, 63ms, 64ms, 72ms where "styles" means "steady-state number of non-pseudo ComputedValues live" (insofar as this makes sense for Youtube and Twitter; both are restyling continuously afaict) and the times are time under Servo_TraverseSubtree at 0.5ms resolution during pageload, with the same caveat for YouTube and Twitter; I'm sure it caught some of the restyling after the pageload finished.... The "styles" numbers are a bit noisy because they can depend on exactly when I stopped and took the measurement, expecially for the pages running script all the time. But the noise is plus minus 10-20 or so, afaict. With the timings, some of them were super-noisy, so I measured each one 5 times. My general conclusion is that there are sharing wins on all the pages. The myspace times do seem to get a bit slower, though it's hard to tell given how noisy it is. All the other timings stay as they are, pretty much. Obama may be a tiny bit faster; hard to say. Thoughts?
Flags: needinfo?(bobbyholley)
Assignee | ||
Comment 11•7 years ago
|
||
OK, per IRC discussion, here are some measurements, all _with_ the proposed change, of the first restyle time for these various pages (except Amazon, which has no obvious first restyle, and Youtube for which I list the first two restyles) at different work unit sizes: 16 -- for these you can see the numbers of styles in comment 10: Amazon: N/A | 2735 styles Myspace: 10.2ms, 8.5ms, 5.9ms, 5.9ms, 5.5ms | 1105 styles. Obama: 14.4ms, 16.14ms, 12.7ms, 12.7ms, 12.6ms | 4227 styles YouTube: 2.7/8.8, 2.1/8.2, 2.3/9.2, 2.4/10.4, 3.0/7.7 | 3002 styles Twitter: 49.7, 55.6, 74, 76, 77 | 1461 styles -- note: thermal throttling might have been kicking in here; I got 105ms on the next load. 12: Amazon: N/A | 2627 styles Myspace: 5.8ms, 5.2ms, 5.1ms, 5.1ms, 5.6ms | 1240 styles Obama: 14ms, 14.2ms, 13.0ms, 11.5ms, 12.8ms | 15593 styles YouTube: 2.2/7.2, 2.5/7.7, 2.3/10.5, 2.6/8.9, 2.2/6.9 | 1991 styles Twitter: 45.3, 42.5, 40.3, 56, 48.4 | 1423 styles The youtube thing is weird because it had different behavior this time around compared to comment 10. For example, the videos did not play and I didn't see the continuous restyling I saw before. So take that with a grain of salt. I don't know what to make of the other small drops in number of extant styles. Note the huge rise on Obama, though.
Assignee | ||
Comment 12•7 years ago
|
||
So I just remeasures the style contexts on obama a few times, and that number was some sort of fluke. I got 4659, 4576, 4639.
Assignee | ||
Comment 13•7 years ago
|
||
Continuing to test work unit sizes: 20: Amazon: N/A | 2652 styles Myspace: 6.7ms, 6.3ms, 8.5ms, 8.4ms, 6.3ms | 1158 styles Obama: 12.0ms, 11.7ms, 12.8ms, 10.9ms, 11.6ms | 4348 styles YouTube: 2.3/10.8, 2.5/7.2, 2.5/8.1, 3.6/11.6, 2.2/9.1 | 1770 styles Twitter: 52.5, 45.2, 44.2, 47.5, 50.1 | 1503 styles Why the numbers for number of styles are not across the board smaller than 16 here is hard to say. Presumably something about how we end up chunking things across threads... Or maybe it's all noise. :(
Assignee | ||
Comment 14•7 years ago
|
||
Comment on attachment 8892144 [details] Bug 1385982. Don't start kicking off work units during parallel stylo traversal until they're actually full. So given all that, I think 16 looks like a reasonable size to keep sticking with, fwiw...
Attachment #8892144 -
Flags: review?(bobbyholley)
Comment 15•7 years ago
|
||
mozreview-review |
Comment on attachment 8892144 [details] Bug 1385982. Don't start kicking off work units during parallel stylo traversal until they're actually full. https://reviewboard.mozilla.org/r/163122/#review168916 ::: servo/components/style/parallel.rs:37 (Diff revision 2) > /// Larger values will increase style sharing cache hits and general DOM locality > /// at the expense of decreased opportunities for parallelism. This value has not > /// been measured and could potentially be tuned. > pub const WORK_UNIT_MAX: usize = 16; Link to the measurements in the bug? ::: servo/components/style/parallel.rs:137 (Diff revision 2) > /// * Never process a child before its parent (since child style depends on > /// parent style). If this were to happen, the styling algorithm would panic. > /// * Prioritize discovering nodes as quickly as possible to maximize > -/// opportunities for parallelism. > +/// opportunities for parallelism. But this needs to be weighed against > +/// styling cousins on a single thread to improve sharing. > /// * Style all the children of a given node (i.e. all sibling nodes) on > /// a single thread (with an upper bound to handle nodes with an > /// abnormally large number of children). This is important because we use > /// a thread-local cache to share styles between siblings. This comment in general could be tweaked a bit more. ::: servo/components/style/parallel.rs:180 (Diff revision 2) > - // sibling directly on this thread without a spawn call. > + // children of the last sibling directly on this thread without a > + // spawn call. It's not really "last sibling" anymore. ::: servo/components/style/parallel.rs:193 (Diff revision 2) > + // full work item to do so. The former optimizes for speed of > + // discovery (we'll start discovering the kids of the things in > + // "nodes" ASAP). The latter gives us better sharing (e.g. we can > + // share between cousins much better, because we don't hand them off > + // as separate work items, which are likely to end up on separate Please describe/link to the measurements? ::: servo/components/style/parallel.rs:202 (Diff revision 2) > + // item is a deep tree which has (WORK_UNIT_MAX - 1) "linear" > + // branches, hence WORK_UNIT_MAX-1 elements at each level. Such a Technically I don't think the -1 is correct. If we have WORK_UNIT, it's even worse, because we lose the tail call, potentially context switch, and still process sequentially. ::: servo/components/style/parallel.rs:237 (Diff revision 2) > // Handle the children of the last element in this work unit. If any exist, > // we can process them (or at least one work unit's worth of them) directly This comment could use updating.
Attachment #8892144 -
Flags: review?(bobbyholley) → review+
Updated•7 years ago
|
Flags: needinfo?(bobbyholley)
Assignee | ||
Comment 16•7 years ago
|
||
Some more perf data bholley asked for: bloom-basic. Without my patch: bloom-basic: 85.59, 48.46, 47.93, 50.19, 66.51 bloom-basic-2: 53.55, 56.32, 53.00, 49.02, 45.44 With my patch: bloom-basic: 59.63, 51.94, 46.63, 47.62, 48.20 bloom-basic-2: 49.08, 48.39, 48.08, 47.12, 48.27
Assignee | ||
Comment 17•7 years ago
|
||
mozreview-review-reply |
Comment on attachment 8892144 [details] Bug 1385982. Don't start kicking off work units during parallel stylo traversal until they're actually full. https://reviewboard.mozilla.org/r/163122/#review168916 > Technically I don't think the -1 is correct. If we have WORK_UNIT, it's even worse, because we lose the tail call, potentially context switch, and still process sequentially. Good catch, fixed.
Assignee | ||
Comment 19•7 years ago
|
||
https://hg.mozilla.org/integration/autoland/rev/ee3040e6db09
Comment 20•7 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/ee3040e6db09
Assignee: nobody → bzbarsky
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox57:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Updated•7 years ago
|
status-firefox55:
--- → wontfix
status-firefox56:
--- → wontfix
status-firefox-esr52:
--- → unaffected
You need to log in
before you can comment on or make changes to this bug.
Description
•