Bug 1407651 (Closed) - Crash in various points called from JS XDR
Opened 8 years ago, Closed 1 year ago
Status: RESOLVED INCOMPLETE
Categories: Core :: JavaScript Engine, defect, P1
People: Reporter: jesup, Unassigned
References: Blocks 1 open bug
Details: 4 keywords, Whiteboard: [#jsapi:crashes-retriage]
[Tracking Requested - why for this release]:
Top wildptr crashes in 57/58
+++ This bug was initially created as a clone of Bug #1374702 +++
We have several XDR crashes with wildptrs in 57/58:
 	memcpy | js::XDRState<T>::codeBytes
 	memcpy | js::XDRState<T>::codeChars
 	js::XDRState<T>::codeUint32 
 	js::XDRScript<T> (Bug 790047)
 	js::AtomizeChars<T>   (bug 1373934)
(and that's just in the top 50 wildptr crashes for 57/58)
I expect these are all or mostly related to the same general problem.
Together, these are at least the #2 wildptr crash source, and maybe even #1.
At least one of these goes back a long way; the others might be the same or new bugs.
Bug 1374702 didn't get rid of these; perhaps it got rid of a subset.
Updated • 8 years ago (Reporter)
Flags: needinfo?(nihsanullah)
Flags: needinfo?(kmaglione+bmo)
Updated • 8 years ago
Group: core-security → javascript-core-security
Comment 1 • 8 years ago (Reporter)
The overall frequency of this has been rising a lot, and the absolute number of crashes is worrying for when we go to release.  We're hitting ~750/day
Comment 2 • 8 years ago
(In reply to Randell Jesup [:jesup] from comment #1)
> The overall frequency of this has been rising a lot, and the absolute number
> of crashes is worrying for when we go to release.  We're hitting ~750/day
The reason it is so high is that this is already used in release (Fx 56).
Comment 3 • 8 years ago (Reporter)
Ted: many, but definitely not all, of these crashes are EXCEPTION_IN_PAGE_ERROR. The addresses in the READ and WRITE errors seem *similar* to the EXCEPTION_IN_PAGE_ERRORs... what do you make of this? Also, why so many EXCEPTION_IN_PAGE_ERRORs with these signatures? It seems odd.
Flags: needinfo?(ted)
Comment 4 • 8 years ago
(In reply to Randell Jesup [:jesup] from comment #3)
> Ted: many, but definitely not all, of these crashes are
> EXCEPTION_IN_PAGE_ERROR.  the addresses in the READ and WRITE errors seem
> *similar* to the EXCEPTION_IN_PAGE_ERRORs... what do you make of this? 
> ALso, why so many EXCEPTION_IN_PAGE_ERRORs with these signatures?  It seems
> odd
I'm not sure what you mean about the addresses--these seem to be mostly 32-bit crashes, so the crash addresses just all look like "32-bit pointers" to me. Note also that on a 32-bit Windows OS, unless you alter your boot settings, processes are limited to the lower 2GB of address space. There's a "Total Virtual Memory" field on crashes that can tell you that. It's usually 2GB for 32-bit Windows, and 4GB for 32-bit Firefox running on 64-bit Windows.
As to why we get a lot of EXCEPTION_IN_PAGE_ERROR, I'm not really sure. My best guess is that that memory was paged out and there was an error paging it back in? I looked at a few individual crash reports and none of them seemed to have low Available Physical Memory or a high System Memory Use Percentage, so it seems odd that they'd be hitting swap.
Flags: needinfo?(ted)
Comment 5 • 8 years ago (Reporter)
js::XDRScript<T> seems to be normal addresses (and few page errors)
memcpy | js::XDRState<T>::codeBytes has a lot of page errors, and the addresses are mostly (almost all) XXXX0000 addresses, even the READ errors (which makes me think they're really PAGE errors).  Not quite 100% though.
Do some of these touch external DLLs that might be loaded over a network connection?  More to the point, why are the XDR signatures tied to PAGE errors?
Updated • 8 years ago
tracking-firefox58: --- → +
Comment 6 • 8 years ago
[CC'ing a couple of folks, and will poke Naveed]
Comment 7 • 8 years ago
(In reply to Randell Jesup [:jesup] from comment #5)
> js::XDRScript<T> seems to be normal addresses (and few page errors)
> 
> memcpy | js::XDRState<T>::codeBytes has a lot of page errors, and the
> addresses are mostly (almost all) XXXX0000 addresses, even the READ errors
> (which makes me think they're really PAGE errors).  Not quite 100% though.
Usually these issues come from a length corruption, where we attempt to copy (read/write) too much and reach an unmapped page.
This is probably due to a corruption of the content of the XDR buffer.
I will ni? tcampbell as he has been adding assertions to catch a similar buffer corruption in another bug.
Flags: needinfo?(nihsanullah) → needinfo?(tcampbell)
Comment 8 • 8 years ago
Note, this issue might be a duplicate of Bug 1373934.
Updated • 8 years ago
Priority: -- → P1
Comment 9 • 8 years ago
(In reply to Randell Jesup [:jesup] from comment #5)
> Do some of these touch external DLLs that might be loaded over a network
> connection?  More to the point, why are the XDR signatures tied to PAGE
> errors?
I don't have a great theory here beyond what I said in comment 4. If we access memory that's been paged out, it's possible to hit an I/O error when reading it back in from the pagefile, I suppose.
Comment 10 • 8 years ago (Reporter)
https://blogs.msdn.microsoft.com/oldnewthing/20120608-00/?p=7423
EXCEPTION_IN_PAGE_ERROR 	
The first element of the array contains a read-write flag that indicates the type of operation that caused the access violation. If this value is zero, the thread attempted to read the inaccessible data. If this value is 1, the thread attempted to write to an inaccessible address. If this value is 8, the thread causes a user-mode data execution prevention (DEP) violation.
The second array element specifies the virtual address of the inaccessible data.
The third array element specifies the underlying NTSTATUS code that resulted in the exception. 
Value 8 (Data Execution Prevention) looks especially interesting/concerning.
Flags: needinfo?(ted)
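[Editor's note: the following is an illustrative sketch, not Firefox or Breakpad code, of how the three ExceptionInformation fields quoted in comment 10 can be read from a Windows vectored exception handler. The handler and its logging are hypothetical; only the EXCEPTION_RECORD layout and the AddVectoredExceptionHandler API come from the platform documentation.]

#include <windows.h>
#include <cstdio>

// Log the read/write/DEP flag, faulting address, and underlying NTSTATUS
// for EXCEPTION_IN_PAGE_ERROR, then let the normal crash reporter run.
static LONG CALLBACK InPageErrorLogger(PEXCEPTION_POINTERS info) {
  const EXCEPTION_RECORD* rec = info->ExceptionRecord;
  if (rec->ExceptionCode == EXCEPTION_IN_PAGE_ERROR &&
      rec->NumberParameters >= 3) {
    ULONG_PTR rwFlag = rec->ExceptionInformation[0];    // 0 = read, 1 = write, 8 = DEP
    ULONG_PTR address = rec->ExceptionInformation[1];   // inaccessible address
    ULONG_PTR ntstatus = rec->ExceptionInformation[2];  // underlying NTSTATUS code
    fprintf(stderr, "IN_PAGE_ERROR %s at %p, NTSTATUS=0x%08lx\n",
            rwFlag == 8 ? "DEP" : (rwFlag ? "write" : "read"),
            (void*)address, (unsigned long)ntstatus);
  }
  return EXCEPTION_CONTINUE_SEARCH;
}

int main() {
  AddVectoredExceptionHandler(/* First = */ 1, InPageErrorLogger);
  // ... run application ...
  return 0;
}

This is the same NTSTATUS that Breakpad surfaces as the second half of the crash reason, as comment 11 below notes.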
Comment 11 • 8 years ago
Yes, I think I linked you to that page a while ago. Breakpad uses the NTSTATUS code; it's the second half of the crash reason, like "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_DEVICE_ERROR".
Flags: needinfo?(ted)
Comment 12 • 8 years ago (Reporter)
https://crash-stats.mozilla.com/report/index/1aff0cfb-9c6d-4871-a66a-2ab320171025 is an example of a 57b11 WRITE error with a wildptr or UAF
Comment 13 • 8 years ago
NBP tells us this is a dupe of bug 1404860 (via Naveed). NI Randell for further investigation/triage as he has been looking at this from the triage team.
Flags: needinfo?(rjesup)
Comment 14 • 8 years ago
After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places while decoding.
For the moment this issue is not actionable, and based on the various number of signatures, it seems that this is a UAF issue.  I will mark it as a Duplicate as Bug 1404860 for now, until we can get a better grasp of UAF issues in general.
I will note that Bug 1410132 might increase the crash rate of this signature by moving other JS crashes to this particular signature, without increasing the overall number of crashes; it would only move them around.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Updated • 8 years ago
status-firefox56: affected → ---
status-firefox57: fix-optional → ---
status-firefox58: affected → ---
status-firefox-esr52: affected → ---
tracking-firefox57: + → ---
tracking-firefox58: + → ---
Comment 15 • 8 years ago (Reporter)
> After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places 
> while decoding.
"our XDR were corrupted" -> what exactly does this mean?  Do you believe something wrote (UAF) over memory you'd allocated?  If so, shipping ASAN browsers to a subset of users might be useful in catching it.
> For the moment this issue is not actionable, and based on the various number of signatures, it seems that this is a UAF 
> issue.  I will mark it as a Duplicate as Bug 1404860 for now, until we can get a better grasp of UAF issues in general.
1404860 is not meant as a holding spot for otherwise unexplained UAFs; it's meant (I think) to generally track the volume and to consider ways to improve our overall attack on UAFs.
In this case, the types of errors (mostly PAGE errors; very unusual) and the number of XDR signatures are interesting and very different from the mass of UAFs we see, which is why I'd like to reopen this. It also means (depending on the answer to my first question above) that there might be some mitigations or diagnostic tripwires we could land without major perf impact (validation of the data in various spots, changing the allocation strategy for the data, sanity assertions or deep verification in Nightly, etc.).
Flags: needinfo?(rjesup) → needinfo?(nicolas.b.pierron)
Updated • 8 years ago (Reporter)
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Comment 16 • 8 years ago
Hi Naveed, please reassign these bugs to developers who can investigate and fix them. Thanks!
Assignee: nobody → nihsanullah
Comment 17 • 8 years ago
Bug 1403348 is also interesting (particularly bug 1403348 comment 19), and may be related to this. We're clearly getting data corruption in the loaded scripts there. The one useful diagnostic error I've gotten so far says getBoolPzef is undefined on a source line that references getBoolPref. My guess is that that error is caused by corrupted XDR data, but there are other possibilities. The user's browser crashed with the same error several sessions in a row, so this is clearly reproducible, to some extent.
Flags: needinfo?(kmaglione+bmo)
Updated • 8 years ago (Reporter)
Flags: needinfo?(nihsanullah)
Comment 18 • 8 years ago
Ted/Jan/Nicolas, are there speculative fixes or tripwires we can land based on what we know now to at least limit or clarify these issues? This volume is high enough that we need some due diligence here.
Flags: needinfo?(nihsanullah) → needinfo?(jdemooij)
Comment 19 • 8 years ago
The action we discussed in Lisbon was to do Bug 1410132. :nbp and I (and :kmag too, I believe) have been adding tripwires over the last month, but the interesting crashes seem to come back to allocated buffers becoming corrupted.
I'll sync up with :nbp again now that I'm back and see if we can come up with a more concrete theory on why XDR is hit by so much of this UAF behaviour.
Comment 21 • 7 years ago
It probably makes sense to add a checksum to our XDR data - that's not too hard to do and then we can see if this has an effect on the other crashes that might be XDR related.
Comment 22 • 7 years ago
(In reply to Randell Jesup [:jesup] from comment #15)
> > After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places 
> > while decoding.
> 
> "our XDR were corrupted" -> what exactly does this mean?
That means the data contained in the TranscodeBuffer no longer looks like any data which could have been stored in our TranscodeBuffer.
> Do you believe
> something wrote (UAF) over memory you'd allocated?  If so, shipping ASAN
> browsers to a subset of users might be useful in catching it.
Definitely, as it would be a way to get better reports from Bug 1404860 issues.
> In this case, the types of errors (mostly PAGE errors; very unusual) and the
> number of XDR signatures are interesting and very different then the mass of
> UAFs we see, which is why I'd like to reopen this.
The page errors are caused by reading a length value out of a corrupted buffer, which causes the copy to read/write the buffer until it goes out of bounds and fails when it reads/writes from/to an unmapped page.
If XDR is the source of these buffer corruptions, then I have no way to investigate this issue, even with the tons of crashes that we have today. We could add tons of canaries, but they would all fail at reporting the original UAF issue.
> It also means (depending
> on the answer to my first question above) that there might be some
> mitigations or diagnostic tripwires we could land without major perf impact.
> (validation of the data in various spots, changing the allocation strategy
> for the data, sanity assertions or deep verification in Nightly, etc).
Any of these would help us fail the execution of XDR, they would not help catch any buffer corruption.
(In reply to Jan de Mooij [:jandem] from comment #21)
> It probably makes sense to add a checksum to our XDR data - that's not too
> hard to do and then we can see if this has an effect on the other crashes
> that might be XDR related.
The issue reported in comment 17 highlights that some corruption might happen when we are saving the bytecode.
So this is definitely where a checksum might be helpful, to catch encoding issues (due to corruption or not).
However, the incremental encoder does not simplify the problem and whatever checksum we choose should at least be position independent or composable.
Also, I will note that we are noticing similar UAF issues in IonMonkey's LifoAlloc buffer when it is being transferred from one thread to another. So, having a checksum will give us a way to fail early, but this might not prevent crashes from happening due to concurrent corruption of our TranscodeBuffer while we are decoding it.
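[Editor's note: a minimal sketch of the kind of "composable" checksum comment 22 describes for incrementally encoded XDR chunks. This is not the actual SpiderMonkey patch; ChunkChecksum, CombineChecksums, and VerifyTranscodeBuffer are hypothetical names, and a simple byte sum is used only because it is trivially composable and position independent, not because it is a strong detector.]

#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical per-chunk checksum: a plain byte sum.
uint64_t ChunkChecksum(const uint8_t* data, size_t length) {
  uint64_t sum = 0;
  for (size_t i = 0; i < length; i++) {
    sum += data[i];
  }
  return sum;
}

// Composability: the incremental encoder can keep a running total as it
// appends chunks, without rescanning anything already encoded.
uint64_t CombineChecksums(uint64_t a, uint64_t b) { return a + b; }

// Decoder side: recompute over the whole buffer and fail early on mismatch,
// instead of decoding a corrupted length and walking off the end.
bool VerifyTranscodeBuffer(const std::vector<uint8_t>& buffer,
                           uint64_t expected) {
  return ChunkChecksum(buffer.data(), buffer.size()) == expected;
}

A real implementation would likely prefer a stronger hash; the point here is only the composability property, which lets the incremental encoder produce a whole-buffer checksum from independently encoded pieces.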
Comment 23 • 7 years ago
I will note that Bug 900784 (Bug 1405738) is now enabled by default and causes 30M calls/day through the XDR decoder based on the telemetry results, and this does not seem to be reflected in any increase in Nightly crashes yet.
This issue might be specific to the MultiDecoder task, the uses it makes of XDR, or the sanity of the data which flows into XDR.
Comment 24 • 7 years ago
Assigning this bug to myself as part of general XDR-related crash investigation. I'm hoping Bug 1418894 will reduce the rate here until we are able to track down root causes.
Comment 25 • 7 years ago
Bug 1418894 greatly reduced the number of crashes with XDR on the stack. A lot of the memory corruption is caused by bad XDR data decoding creating dangerous script objects that lead to crashes. Another set of signatures that remains, until we can figure out why XDR is corrupt, is that XDRAtom and family will try to decode a string from the buffer but use a garbage length value. This results in memcpy or hash functions walking off the end of the buffer and crashing at a 4k-page-aligned address.
I'm highly suspicious of the Gecko StartupCache right now. It reads/writes/seeks from multiple threads, and its coordination uses plain old variables with no atomics or barriers. This could explain why the newly enabled JS Bytecode Cache did not cause a rise in XDR crashes: it uses the Necko cache for persistence.
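[Editor's note: a hypothetical sketch, not actual StartupCache code, of the hazard comment 25 describes: a shared cursor updated from multiple threads through a plain variable is a data race; an atomic (or a mutex) removes that particular race. UnsafeCursor and SaferCursor are illustrative names only.]

#include <atomic>
#include <cstddef>

struct UnsafeCursor {
  size_t offset = 0;  // plain variable: concurrent Advance() calls race, and
                      // readers can observe a torn or stale offset
  size_t Advance(size_t n) { return offset += n; }
};

struct SaferCursor {
  std::atomic<size_t> offset{0};
  // fetch_add makes each advance atomic, so each caller claims a distinct
  // region of the buffer.
  size_t Advance(size_t n) {
    return offset.fetch_add(n, std::memory_order_relaxed) + n;
  }
};

Making the cursor atomic alone would not make concurrent reads/writes of the underlying buffer safe, which is why the thread-safety review is spun off into its own bug (see comment 31 below).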
Comment 26 • 7 years ago
Just for clarification, are checksums still an interesting approach, if the suspicions against the StartupCache are correct?
Comment 27 • 7 years ago (Reporter)
The StartupCache sounds scary, sec-wise.  Can you file some sec bugs on it?
Flags: needinfo?(tcampbell)
Comment 30 • 7 years ago
(In reply to Randell Jesup [:jesup] from comment #15)
> > After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places 
> > while decoding.
> 
> "our XDR were corrupted" -> what exactly does this mean?  Do you believe
> something wrote (UAF) over memory you'd allocated?  If so, shipping ASAN
> browsers to a subset of users might be useful in catching it.
Definitely.
> > For the moment this issue is not actionable, and based on the various number of signatures, it seems that this is a UAF 
> > issue.  I will mark it as a Duplicate as Bug 1404860 for now, until we can get a better grasp of UAF issues in general.
> 
> […]
> 
> In this case, the types of errors (mostly PAGE errors; very unusual) and the
> number of XDR signatures are interesting and very different then the mass of
> UAFs we see, which is why I'd like to reopen this.
Page errors can happen if the corrupted value is a length counter, in which case the code reading this length counter will follow the content to read/write it and trip across unmapped pages.
The XDR buffers contain a bunch of inlined strings which are encoded as {length, bytes[length]}. On the other hand, this might also be explained by a mis-encoded buffer where the decoded length is not at the same offset as the encoded length.
Flags: needinfo?(nicolas.b.pierron)
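[Editor's note: a simplified sketch of the failure mode described in comment 30. The names (DecodeCursor, DecodeInlineString) are hypothetical and this is not the real XDRState API; it only illustrates how a {length, bytes[length]} record with a corrupted length sends the copy past the end of the transcode buffer and onto an unmapped page, and how a bounds check fails the decode early instead.]

#include <cstdint>
#include <cstring>
#include <string>

struct DecodeCursor {
  const uint8_t* cursor;
  const uint8_t* end;
};

bool DecodeInlineString(DecodeCursor& buf, std::string& out) {
  uint32_t length;
  if (size_t(buf.end - buf.cursor) < sizeof(length)) {
    return false;
  }
  memcpy(&length, buf.cursor, sizeof(length));
  buf.cursor += sizeof(length);

  // Without this check, a corrupted `length` makes the copy run past `end`,
  // typically faulting at a 4k-page-aligned address (comment 25).
  if (size_t(buf.end - buf.cursor) < length) {
    return false;  // fail the decode instead of reading unmapped memory
  }
  out.assign(reinterpret_cast<const char*>(buf.cursor), length);
  buf.cursor += length;
  return true;
}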
Comment 31 • 7 years ago
I opened Bug 1451371 to investigate thread-safety/coding-styles for StartupCache.
Flags: needinfo?(tcampbell)
Comment 32 • 7 years ago (Reporter)
(In reply to Nicolas B. Pierron [:nbp] {backlog: ~36} from comment #30)
> (In reply to Randell Jesup [:jesup] from comment #15)
> > In this case, the types of errors (mostly PAGE errors; very unusual) and the
> > number of XDR signatures are interesting and very different then the mass of
> > UAFs we see, which is why I'd like to reopen this.
> 
> Page errors can happen if the corrupted value is a length counter. In which
> case the code which is reading this length counter will follow the content
> to read/write it, and trip across unmapped pages.
Note: PAGE errors in a Reason field typically mean something like "loaded Firefox off a network drive", or trying to read a file over the network and the network fails (or so I've been told).
Comment 33 • 7 years ago
Let's re-triage this crash to look at current crashes and determine if more needs to be done.
Assignee: tcampbell → nobody
Whiteboard: [#jsapi:crashes-retriage]
Comment 34 • 7 years ago
Stalled. Long-standing series of crashes and we've done a lot to improve these.
Also, downgrading to sec-moderate since these seem to be startup and chrome-code related crashes.
Updated • 4 years ago
Blocks: sm-defects-crashes
Updated • 3 years ago
Severity: critical → S2
Updated • 2 years ago
Severity: S2 → S3
Comment 36 • 1 year ago
Only 2 of these signatures are still alive on recent versions. This is a startup crash that hasn't been investigated in 5 years. If these are still a concern, we can re-investigate.
Status: REOPENED → RESOLVED
Crash Signature: [@ js::XDRState<T>::codeUint32] [@ memcpy | js::XDRState<T>::codeBytes ] [@ memcpy | js::XDRState<T>::codeChars ] [@ js::XDRState<T>::codeUint8] [@ js::XDRScript<T> ] [@ js::AtomizeChars<T> ] → [@ memcpy | js::XDRState<T>::codeBytes ] [@ js::AtomizeChars<T> ]
Closed: 8 years ago → 1 year ago
Resolution: --- → INCOMPLETE
Comment 37 • 1 year ago
Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.
Keywords: stalled
Updated • 1 year ago
Group: javascript-core-security