Closed Bug 475343 Opened 16 years ago Closed 16 years ago

Region maintenance can get very expensive

Categories

(Core :: Web Painting, defect, P2)

x86
macOS
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: bzbarsky, Assigned: roc)

Details

(Keywords: fixed1.9.1, perf)

Attachments

(5 files, 1 obsolete file)

Quoting from email: ------------------------------------------------------------------------ It seems Firefox is unusually slow compared to Chrome on the following JavaScript Mandelbrot Set: http://warp.povusers.org/MandScripts/javascript2.html (click on Proceed in the bottom) The problem, however, is not in JavaScript component (less than 1% of the cycles is in js3250.dll). Following is the VTune data about retired instructions. A huge 80% of the total cycles is spent in gkgfx, and 85% of gkgfx is in nsRegion.cpp. A lot of rectangle insertions, deletions, and traversals are happening. ------------------------------------------------------------------------ ------------------------------------------------------------------------ It looks like we spend (on my hardware) about 1300ms getting to close(). Opera takes 1500, Safari 400. After that, though, we have a multi-second unresponsive hang, which is presumably what Moh had profiled. ccing roc and vlad, since this stuff is their kettle of fish. It looks like the images load, one by one, and invalidate their rects. There are 10,000 such rects, of course. So we basically end up under [ChildView drawRect:] called from NSView code, and from there spend our time in a 2-1 ratio under nsViewManager::Refresh and nsThebesRegion::Union. The latter is certainly the building up of the region under drawRect. The former is largely the nsRegion::Or call from ConvertNativeRegionToAppRegion,with some smattering of actually painting. The time spent painting is almost entirely under nsDisplaClip::OptimizeVisibility doing region operations. In other words, we spend more time sorting out the right region to paint that it would take to just paint everything. That reminds me strongly of bug 253001. Would it make sense to apply a similar approach here? It looks like the opaque region thing got backed out as part of the display list landing, because all that code went away. But we should probably do it for both damage and opaque regions. ------------------------------------------------------------------------
Flags: wanted1.9.2?
Attached file Testcase
Keywords: perf
Is there any chance of a 1.9.1 fix? /be
Flags: wanted1.9.1?
Yeah.
Assignee: nobody → roc
Flags: wanted1.9.1? → wanted1.9.1+
Priority: -- → P2
For some reason my profiles look nothing like what bz described, although I do get a region with 30,000 rects in it.
We use SimpleSubtract in nsDisplayItem::OptimizeVisibility, so we might burn some cycles there unnecessarily but it shouldn't be too bad.
Attached patch fix?Splinter Review
This patch should take the heat off the spots that Boris was talking about. It doesn't seem to make any difference to me, which isn't surprising since I don't see those as hotspots.
Boris, was the second section of comment #0 your own testing? If so, maybe you could try this patch.
Yes, the second section was my own testing. And yes, this patch makes the beachballing I observed better (though we still seem to have a tad of lag there; not sure where; it's noticeably less snappy than Safari or Opera). Just out of curiosity, what _did_ your profiles look like?
Profiling the part after the DOM is constructed with the patch, the time seems to be spent under nsIFrame::Invalidate. In particular, we call nsChildView::Invalidate, which calls [NSView setNeedsDisplayInRect;] which calls [NSRegion addRect], which calls CGSUnionRegionWithRect, which calls shape_union_with_bounds, which calls shape_union. All the time is spent in shape_union, as far as I can see. setNeedsDisplayInRect and everything below it looks like it's part of AppKit or CoreGraphics, so if we're to do something about this we'd need to do it inside the nsChildView::Invalidate method, right?
My profiles showed a lot of time in frame construction and most of the rest of the time in inputstream callbacks, processing the image loads I presume. These profiles are in debug builds, which is somewhat bogus but should be OK here. I think I lied in comment #5. I got a display list with 30,000 items in it but I didn't prove that the region has thousands of rectangles. Are you on Tiger or Leopard? If you're on Tiger, I wonder if Leopard is doing some kind of throttling of the size of the update region. Fixing nsChildView::Invalidate to avoid a problem in Appkit would be tricky. We'd have to track our own idea of what the dirty rect is and bash Appkit's rect with a bounding-rect if it's getting too complicated. Ugh.
What were you profiling? Your profile sounds like the profile of the script execution, whereas I was profiling the beachball that we get when the image loads come in... I'm on Leopard.
What I see after I press the button is a delay, then the document is replaced by the Mandelbrot image overlaid with a grid which I presume is unloaded image alt-rendering, then there's another delay and the grid disappears, I presume because the images have loaded. I profiled both delay periods separately. Maybe I forgot to check deep enough inside the image load stuff, I'll have another go.
Yeah, that second delay is what I was profiling. Over here it was 4 seconds or so (compared to 1 sec for the first delay).
Also, note that "View -> Reload" is very slow. On Windows, the reload progress is observable almost line by line.
I just reran the test. For me it's about 4 seconds for the initial delay and (feels like) less for the second delay. I profiled the second delay with manual stop/start. It might be better if we instrumented the test to automatically start and stop Shark. My profile of the second delay spends about 22% of time painting along two different paths, about 54% of the time in PresShell::WillPaint flushing stuff (30% ProcessPendingRestyles and 23.5% ProcessReflowCommands; I presume this is due to the alt-image stuff being replaced by the real image frames), and 23.5% of the time is in nsInputStreamPump::OnInputStreamReady. In the input stream pump, 15% is OnStopRequest and 8% is OnDataAvailable. The OnStopRequest cost seems to all be secure browser and progress listener work. In OnDataAvailable we end up spending about 6% in nsIFrame::Invalidate ... eventually we spend 1.5% in nsChildView::Invalidate. Focusing on painting we spend 37% of the time under nsCSSRendering::PaintBackground, 11% of the time in nsImageFrame::PaintImage, and the rest (i.e. about 50%) is mostly frittered away on display list stuff, which isn't too bad since this testcase is pessimal for display lists. Remember this is a debug build. I don't know if that explains why my profile is quite different from yours. I guess I'll try with an opt build at some point but it takes a while to make on, I'll need to do it over lunch or at night or something. Do you think it's worth going ahead with the patch I attached?
Huh. That's totally different from what I saw... My profiling was in fact with an opt build. The attached patch does help me, so it might be worth doing. That said, it's a patch in Mac-specific code. The original performance issue was observed on Windows, I assume (c.f. comparison to Chrome and use of VTune to profile). Do we need similar fixes for our other widget impls?
It's rare to find DEBUG code faster than opt, but could that be happening here? roc, you love dogfooding debug builds but they are terrible profiler targets. Can you at least reproduce bz's results for opt? /be
We can do something similar for Windows, sure. I'll do a patch after I've had several hours to build in my Windows VM. Brendan: yeah, I'll do that later.
OK, in my Mac opt build, changeset 784122b49e33 (no patches), the delay between displaying the Mandelbrot image covered by the grid of missing-image borders, and then displaying the final image, is less than a second.
Maybe it's a network issue.
In my Windows opt build with the same changeset, I also see less than a second's delay.
Attached patch Windows fix?Splinter Review
This does for Windows what the other patch does for Mac. I'd be interested in hearing if it helps. Of course, since my builds are already pretty quick for some reason, it doesn't make any difference for me.
roc, This is quite an improvement. Previously, on Win32, the distribution of sampled instructions was: Module %INST #INST ------------ ------ -------------- gkgfx.dll 80.52% 31,236,600,000 win32k.sys 6.37% 2,471,600,000 gklayout.dll 4.98% 1,931,200,000 xpcom_core.dll 1.75% 679,400,000 ntoskrnl.exe 1.07% 414,200,000 thebes.dll 0.77% 298,200,000 imglib2.dll 0.76% 293,600,000 ntdll.dll 0.71% 274,800,000 gkparser.dll 0.67% 258,200,000 hal.dll 0.53% 205,400,000 js3250.dll 0.51% 196,600,000 With your patch in Comment #23, the profile is: Module %INST #INST --------------- ----- ------------- gklayout.dll 39.15% 1,909,600,000 xpcom_core.dll 14.63% 713,800,000 ntdll.dll 7.06% 344,600,000 imglib2.dll 6.52% 318,200,000 thebes.dll 6.29% 306,800,000 gkparser.dll 5.29% 258,200,000 js3250.dll 4.39% 214,200,000 The contribution of gkgfx.dll to the exec time has dropped from 81% to 2%. Altogether, now it takes 2 secs to paint. The profile of hot functions in gklayout.dll currently is: Name %INST #INST ---------------------- ----- ----------- SelectorMatches 7.46% 142,400,000 EnumerateAllRules 1.69% 32,200,000 HasStateDependentStyle 1.59% 30,400,000 RuleProcessorData 1.34% 25,600,000 Equals 1.27% 24,200,000 HasFlag 1.25% 23,800,000 GetOffsetTo 1.18% 22,600,000 QueryInterface 1.01% 19,200,000
Comment on attachment 361466 [details] [diff] [review] fix? OK, sounds like we should take this...
Attachment #361466 - Flags: review?(joshmoz)
Comment on attachment 361927 [details] [diff] [review] Windows fix? And this...
Attachment #361927 - Flags: review?(vladimir)
Attached patch fix (obsolete) — Splinter Review
I haven't even compiled this, but Karl, if you could build it and test it, that'd be great :-)
Attachment #361962 - Flags: review?(mozbugz)
Whiteboard: [needs review]
Comment on attachment 361927 [details] [diff] [review] Windows fix? This looks fine, but why don't we put MAX_RECTS_IN_REGION in nsBaseWidget.h under a suitable name to avoid duplicating it in 3 different places? Or do you foresee us wanting it to be different on different platforms?
Attachment #361927 - Flags: review?(vladimir) → review+
I don't really care, but I don't see a reason why it has to be the same on each platform.
Comment on attachment 361962 [details] [diff] [review] fix Without this patch, the maximum number of rectangles that I'm seeing in the region from the expose event on x86_64/Linux is 14, so this patch is not going to make any difference here. After clicking proceed, I see a delay of 3 seconds before the grid becomes visible, then ~1.5 seconds before the grid disappears. Reloading the page also takes ~4.5 seconds. Are these times slower than expected? If so, I can do some profiling. Are faster times expected on x86/32 systems? (Possibly resulting in more rectangles accumulating.)
Attachment #361962 - Flags: review?(mozbugz)
Those times are consistent with what I'm seeing, but not consistent with what Boris and Moh see. If GDK doesn't guarantee to cap the complexity of the update region, I'd actually still like to take the patch; it's simple enough.
Attached patch GTK precautionsSplinter Review
(In reply to comment #31) > If GDK doesn't guarantee to cap the complexity of the update region, I'd > actually still like to take the patch; The GDK code doesn't look like it applies any limit on the complexity. > it's simple enough. This is a little simpler. We don't need to use gdk_region_get_clipbox because the bounding box is already available in the GdkEventExpose. I assume that, if we expect to receive a region with a large number of rects, then we don't want to require a complex clip from cairo? ctx->NewPath(); for (r = rects; r < r_end; ++r) { ctx->Rectangle(gfxRect(r->x, r->y, r->width, r->height)); } ctx->Clip(); This patch clips only to the bounding box when the number of rects in the region is large.
Attachment #361962 - Attachment is obsolete: true
Attachment #362173 - Flags: review?(roc)
Yeah, we don't want a complex clip rect in cairo.
Attachment #362173 - Flags: review?(roc) → review+
Attachment #361466 - Flags: review?(joshmoz) → review+
Whiteboard: [needs review] → [needs landing]
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Whiteboard: [needs landing] → [needs 191 approval]
Comment on attachment 361466 [details] [diff] [review] fix? I'm not sure I should be approving my own patches, but this is simple enough...
Comment on attachment 361927 [details] [diff] [review] Windows fix? I'm not sure I should be approving my own patches, but this is simple enough...
Attachment #361927 - Flags: approval1.9.1+
Whiteboard: [needs 191 approval] → [needs 191 landing]
Is the test case supposed to be "faster" after the patch? It's not for me if that is supposed to be the case, actually looks to be a second or two longer. BUILD: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090216 Minefield/3.2a1pre (.NET CLR 3.5.30729) ID:20090216011536 ~B
I would be very surprised if it made the test any slower. But I'm not surprised if it doesn't make the test faster, for you. It didn't have any effect for me. But for me the test was already pretty fast. See comments #20, #22 and #30.
could this have caused bug 479904?
No (tested via backout).
Component: Layout: View Rendering → Layout: Web Painting
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: