Closed Bug 1407651 Opened 8 years ago Closed 1 year ago

Crash in various points called from JS XDR

Categories

(Core :: JavaScript Engine, defect, P1)

55 Branch
x86
Windows
defect

Tracking


RESOLVED INCOMPLETE

People

(Reporter: jesup, Unassigned)

References

(Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [#jsapi:crashes-retriage])

Crash Data

[Tracking Requested - why for this release]: Top wildptr crashes in 57/58

+++ This bug was initially created as a clone of Bug #1374702 +++

We have several XDR crashes with wildptrs in 57/58:
memcpy | js::XDRState<T>::codeBytes
memcpy | js::XDRState<T>::codeChars
js::XDRState<T>::codeUint32
js::XDRScript<T> (Bug 790047)
js::AtomizeChars<T> (bug 1373934)
(and that's just in the top 50 wildptr crashes for 57/58)

I expect these are all or mostly related to the same general problem. Together, they are at least the #2 wildptr crash source, and maybe even #1. At least one of these goes back a long way; the others might be the same bug or new bugs. Bug 1374702 didn't get rid of these; perhaps it got rid of a subset.
Flags: needinfo?(nihsanullah)
Flags: needinfo?(kmaglione+bmo)
Group: core-security → javascript-core-security
The overall frequency of this has been rising a lot, and the absolute number of crashes is worrying for when we go to release. We're hitting ~750/day.
(In reply to Randell Jesup [:jesup] from comment #1)
> The overall frequency of this has been rising a lot, and the absolute number
> of crashes is worrying for when we go to release. We're hitting ~750/day.

The reason it is so high is that this is already used in release (Fx 56).
Ted: many, but definitely not all, of these crashes are EXCEPTION_IN_PAGE_ERROR. The addresses in the READ and WRITE errors seem *similar* to the EXCEPTION_IN_PAGE_ERRORs... what do you make of this? Also, why so many EXCEPTION_IN_PAGE_ERRORs with these signatures? It seems odd.
Flags: needinfo?(ted)
(In reply to Randell Jesup [:jesup] from comment #3)
> Ted: many, but definitely not all, of these crashes are
> EXCEPTION_IN_PAGE_ERROR. The addresses in the READ and WRITE errors seem
> *similar* to the EXCEPTION_IN_PAGE_ERRORs... what do you make of this?
> Also, why so many EXCEPTION_IN_PAGE_ERRORs with these signatures? It seems
> odd.

I'm not sure what you mean about the addresses--these seem to be mostly 32-bit crashes, so the crash addresses all just look like "32-bit pointers" to me. Note also that on a 32-bit Windows OS, unless you alter your boot settings, processes are limited to the lower 2GB of address space. There's a "Total Virtual Memory" field on crashes that can tell you that; it's usually 2GB for 32-bit Windows, and 4GB for 32-bit Firefox running on 64-bit Windows.

As to why we get a lot of EXCEPTION_IN_PAGE_ERROR, I'm not really sure. My best guess is that the memory was paged out and there was an error paging it back in. I looked at a few individual crash reports and none of them seemed to have low Available Physical Memory or a high System Memory Use Percentage, so it seems odd that they'd be hitting swap.
Flags: needinfo?(ted)
js::XDRScript<T> seems to have normal addresses (and few page errors).

memcpy | js::XDRState<T>::codeBytes has a lot of page errors, and the addresses are mostly (almost all) XXXX0000 addresses, even the READ errors (which makes me think they're really PAGE errors). Not quite 100%, though.

Do some of these touch external DLLs that might be loaded over a network connection? More to the point, why are the XDR signatures tied to PAGE errors?
[CC'ing a couple of folks, and will poke Naveed]
(In reply to Randell Jesup [:jesup] from comment #5)
> js::XDRScript<T> seems to have normal addresses (and few page errors).
>
> memcpy | js::XDRState<T>::codeBytes has a lot of page errors, and the
> addresses are mostly (almost all) XXXX0000 addresses, even the READ errors
> (which makes me think they're really PAGE errors). Not quite 100%, though.

Usually these issues come from a length corruption, where we attempt to copy (read/write) too much and reach a non-mapped page. This is probably due to a corruption of the content of the XDR buffer.

I will ni? tcampbell, as he has been adding assertions to catch a similar buffer corruption in another bug.
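To make that mechanism concrete, here is a minimal standalone sketch (hypothetical buffer layout, not the actual XDR wire format) of how a corrupted length field turns into a fault at a page-aligned address, which would explain both the XXXX0000 crash addresses and the PAGE errors noted above:

// Sketch only: a decoder reads a 32-bit length and then memcpy's that many
// bytes. If the length has been corrupted, the copy source runs far past the
// end of the allocation, and the first faulting address is the first unmapped
// page -- hence crash addresses that look like XXXX0000.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
  constexpr std::uintptr_t kPageSize = 0x1000;

  std::vector<std::uint8_t> buffer(64, 0);
  // Pretend a UAF overwrote the serialized length field with garbage.
  const std::uint32_t corruptedLength = 0x00200000;
  std::memcpy(buffer.data(), &corruptedLength, sizeof(corruptedLength));

  // Decoder side: read the length, then (normally) memcpy `length` bytes out.
  std::uint32_t length;
  std::memcpy(&length, buffer.data(), sizeof(length));
  const std::uint8_t* src = buffer.data() + sizeof(length);

  // Rather than performing the out-of-bounds copy, just compute where it
  // would first touch unmapped memory: the next page boundary at or beyond
  // the end of the allocation (assuming nothing else happens to be mapped
  // immediately after it).
  std::uintptr_t allocEnd =
      reinterpret_cast<std::uintptr_t>(buffer.data()) + buffer.size();
  std::uintptr_t copyEnd = reinterpret_cast<std::uintptr_t>(src) + length;
  std::uintptr_t firstUnmapped = (allocEnd + kPageSize - 1) & ~(kPageSize - 1);

  std::printf("decoded length   = 0x%x (corrupted)\n",
              static_cast<unsigned>(length));
  std::printf("copy would span  = [%p, 0x%zx)\n",
              static_cast<const void*>(src), static_cast<std::size_t>(copyEnd));
  std::printf("first fault addr ~ 0x%zx (page aligned)\n",
              static_cast<std::size_t>(firstUnmapped));
  return 0;
}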
Flags: needinfo?(nihsanullah) → needinfo?(tcampbell)
Note, this issue might be a duplicate of Bug 1373934.
Priority: -- → P1
(In reply to Randell Jesup [:jesup] from comment #5)
> Do some of these touch external DLLs that might be loaded over a network
> connection? More to the point, why are the XDR signatures tied to PAGE
> errors?

I don't have a great theory here beyond what I said in comment 4. If we access memory that's been paged out, it's possible to hit an I/O error when reading it back in from the pagefile, I suppose.
https://blogs.msdn.microsoft.com/oldnewthing/20120608-00/?p=7423

EXCEPTION_IN_PAGE_ERROR
The first element of the array contains a read-write flag that indicates the type of operation that caused the access violation. If this value is zero, the thread attempted to read the inaccessible data. If this value is 1, the thread attempted to write to an inaccessible address. If this value is 8, the thread causes a user-mode data execution prevention (DEP) violation.
The second array element specifies the virtual address of the inaccessible data.
The third array element specifies the underlying NTSTATUS code that resulted in the exception.

Value 8 (Data Execution Prevention) looks especially interesting/concerning.
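For reference, those three array elements are exposed through the Windows EXCEPTION_RECORD::ExceptionInformation fields. A minimal Windows/MSVC sketch of a structured exception filter that decodes them (illustrative only; this is not how Breakpad processes them):

// Sketch: decoding the EXCEPTION_IN_PAGE_ERROR parameters described above via
// a structured exception filter. Windows/MSVC only; it just shows which fields
// carry the read/write flag, the faulting address, and the NTSTATUS.
#include <windows.h>
#include <cstdio>

static int DescribeInPageError(EXCEPTION_POINTERS* info) {
  const EXCEPTION_RECORD* rec = info->ExceptionRecord;
  if (rec->ExceptionCode == EXCEPTION_IN_PAGE_ERROR &&
      rec->NumberParameters >= 3) {
    ULONG_PTR rwFlag = rec->ExceptionInformation[0];    // 0=read, 1=write, 8=DEP
    ULONG_PTR address = rec->ExceptionInformation[1];   // inaccessible address
    ULONG_PTR ntstatus = rec->ExceptionInformation[2];  // underlying NTSTATUS
    std::printf("IN_PAGE_ERROR %s at %p, NTSTATUS=0x%08lx\n",
                rwFlag == 0 ? "READ" : rwFlag == 1 ? "WRITE" : "DEP",
                reinterpret_cast<void*>(address),
                static_cast<unsigned long>(ntstatus));
  }
  return EXCEPTION_EXECUTE_HANDLER;
}

int main() {
  __try {
    // Code that might fault while its backing page/file is unavailable.
  } __except (DescribeInPageError(GetExceptionInformation())) {
    std::printf("handled\n");
  }
  return 0;
}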
Flags: needinfo?(ted)
Yes, I think I linked you to that page a while ago. Breakpad uses the NTSTATUS code; it's the second half of the crash reason, like "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_DEVICE_ERROR".
Flags: needinfo?(ted)
https://crash-stats.mozilla.com/report/index/1aff0cfb-9c6d-4871-a66a-2ab320171025 is an example of a 57b11 WRITE error with a wildptr or UAF.
NBP tells us this is a dupe of bug 1404860 (via Naveed). NI Randell for further investigation/triage, as he has been looking at this from the triage team.
Flags: needinfo?(rjesup)
After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places while decoding.

For the moment this issue is not actionable, and based on the number of different signatures, it seems that this is a UAF issue. I will mark it as a duplicate of Bug 1404860 for now, until we can get a better grasp of UAF issues in general.

I will note that Bug 1410132 might increase the crash rate of this signature by moving other JS crashes to this particular signature, but without increasing the overall number of crashes, only moving them around.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
> After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places
> while decoding.

"our XDR were corrupted" -> what exactly does this mean? Do you believe something wrote (UAF) over memory you'd allocated? If so, shipping ASAN browsers to a subset of users might be useful in catching it.

> For the moment this issue is not actionable, and based on the number of different signatures, it seems that this is a UAF
> issue. I will mark it as a duplicate of Bug 1404860 for now, until we can get a better grasp of UAF issues in general.

Bug 1404860 is not meant as a holding spot for otherwise unexplained UAFs; it's meant (I think) to track the overall volume and to consider ways to improve our overall attack on UAFs.

In this case, the types of errors (mostly PAGE errors; very unusual) and the number of XDR signatures are interesting and very different from the mass of UAFs we see, which is why I'd like to reopen this.

It also means (depending on the answer to my first question above) that there might be some mitigations or diagnostic tripwires we could land without major perf impact (validation of the data in various spots, changing the allocation strategy for the data, sanity assertions or deep verification in Nightly, etc.).
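As one concrete example of such a tripwire, guard words around the transcode buffer could turn silent corruption into a distinctive abort before decoding starts. A minimal sketch, assuming the corruption happens while the buffer is at rest; the class and names here are hypothetical, not SpiderMonkey APIs:

// Sketch of a canary-style tripwire: guard words before and after the payload
// are verified before the buffer is handed to the decoder, so a stomped canary
// becomes an immediately recognizable abort instead of a wild memcpy later.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <utility>
#include <vector>

class GuardedBuffer {
  static constexpr std::uint32_t kCanary = 0xAB4D1E55;
  std::vector<std::uint8_t> storage_;  // [canary][payload][canary]

 public:
  explicit GuardedBuffer(const std::vector<std::uint8_t>& payload) {
    storage_.resize(payload.size() + 2 * sizeof(kCanary));
    std::memcpy(storage_.data(), &kCanary, sizeof(kCanary));
    std::memcpy(storage_.data() + sizeof(kCanary), payload.data(),
                payload.size());
    std::memcpy(storage_.data() + storage_.size() - sizeof(kCanary), &kCanary,
                sizeof(kCanary));
  }

  // Check both guard words, then hand out a view of the payload.
  std::pair<const std::uint8_t*, std::size_t> checkedPayload() const {
    std::uint32_t head, tail;
    std::memcpy(&head, storage_.data(), sizeof(head));
    std::memcpy(&tail, storage_.data() + storage_.size() - sizeof(tail),
                sizeof(tail));
    if (head != kCanary || tail != kCanary) {
      std::fprintf(stderr, "transcode buffer canary smashed\n");
      std::abort();
    }
    return {storage_.data() + sizeof(kCanary),
            storage_.size() - 2 * sizeof(kCanary)};
  }
};

int main() {
  GuardedBuffer buf(std::vector<std::uint8_t>(16, 0x2a));
  auto payload = buf.checkedPayload();
  std::printf("payload of %zu bytes passed the canary check\n", payload.second);
  return 0;
}

This only catches writes that happen to hit the guard words, of course; it is a cheap tripwire, not a substitute for finding the UAF.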
Flags: needinfo?(rjesup) → needinfo?(nicolas.b.pierron)
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Hi Naveed, please reassign these bugs to developers who can investigate and fix them. Thanks!
Assignee: nobody → nihsanullah
Bug 1403348 is also interesting (particularly bug 1403348 comment 19), and may be related to this. We're clearly getting data corruption in the loaded scripts there. The one useful diagnostic error I've gotten so far says getBoolPzef is undefined on a source line that references getBoolPref. My guess is that that error is caused by corrupted XDR data, but there are other possibilities. The user's browser crashed with the same error several sessions in a row, so this is clearly reproducible, to some extent.
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(nihsanullah)
Ted/Jan/Nicolas: are there speculative fixes or tripwires we can land, based on what we know now, to at least limit or clarify these issues? This volume is high enough that we need some due diligence here.
Flags: needinfo?(nihsanullah) → needinfo?(jdemooij)
The action we discussed in Lisbon was to do Bug 1410132. :nbp and I (and :kmag too, I believe) have been adding tripwires over the last month, but the interesting crashes seem to come back to allocated buffers becoming corrupted. I'll sync up with :nbp again now that I'm back and see if we can come up with a more concrete theory on why XDR is hit by so much of this UAF behaviour.
Clearing NI per comment 19.
Flags: needinfo?(jdemooij)
It probably makes sense to add a checksum to our XDR data - that's not too hard to do and then we can see if this has an effect on the other crashes that might be XDR related.
(In reply to Randell Jesup [:jesup] from comment #15)
> > After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places
> > while decoding.
>
> "our XDR were corrupted" -> what exactly does this mean?

That means the data contained in the TranscodeBuffer no longer looks like any data which could be stored in our TranscodeBuffer.

> Do you believe
> something wrote (UAF) over memory you'd allocated? If so, shipping ASAN
> browsers to a subset of users might be useful in catching it.

Definitely, as it would be a way to get better reports from Bug 1404860 issues.

> In this case, the types of errors (mostly PAGE errors; very unusual) and the
> number of XDR signatures are interesting and very different from the mass of
> UAFs we see, which is why I'd like to reopen this.

The page errors are caused by reading a length value out of a corrupted buffer, which causes the copy to read/write the buffer until it goes out of bounds and fails when it reads from, or writes to, an unmapped page.

If XDR is the source of these buffer corruptions, then I have no way to investigate this issue, even with the tons of crashes that we have today. We could add tons of canaries, but they would all fail at reporting the original UAF issue.

> It also means (depending
> on the answer to my first question above) that there might be some
> mitigations or diagnostic tripwires we could land without major perf impact
> (validation of the data in various spots, changing the allocation strategy
> for the data, sanity assertions or deep verification in Nightly, etc.).

Any of these would help us fail the execution of XDR; they would not help catch any buffer corruption.

(In reply to Jan de Mooij [:jandem] from comment #21)
> It probably makes sense to add a checksum to our XDR data - that's not too
> hard to do and then we can see if this has an effect on the other crashes
> that might be XDR related.

The issue reported in comment 17 highlights that some corruption might happen when we are saving the bytecode. So this is definitely where a checksum might be helpful, to catch encoding issues (due to corruption or not). However, the incremental encoder does not simplify the problem, and whatever checksum we choose should at least be position independent or composable.

Also, I will note that we are noticing similar UAF issues in IonMonkey's LifoAlloc buffer when it is being transferred from one thread to another. So, having a checksum will give us a way to fail early, but this might not prevent crashes from happening due to concurrent corruption of our TranscodeBuffer while we are decoding it.
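To illustrate the "position independent or composable" constraint: a plain byte-sum checksum (deliberately simplistic and weak) composes across independently encoded chunks, so an incremental encoder could keep per-chunk values and fold them together at the end. A minimal sketch, not the scheme actually proposed for XDR:

// Sketch of a composable checksum for incrementally encoded chunks: the
// checksum of the concatenation equals the combination of the per-chunk
// checksums, so chunks can be checksummed (and verified) independently.
// A plain byte sum is used purely for illustration; it is weak and does not
// detect reordering.
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

static std::uint32_t chunkChecksum(const std::vector<std::uint8_t>& chunk) {
  return std::accumulate(chunk.begin(), chunk.end(), std::uint32_t{0});
}

// Combining is just addition, so it does not matter how the stream was split.
static std::uint32_t combine(std::uint32_t a, std::uint32_t b) { return a + b; }

int main() {
  std::vector<std::uint8_t> chunk1 = {'f', 'u', 'n', 'c'};
  std::vector<std::uint8_t> chunk2 = {'t', 'i', 'o', 'n'};

  // Incremental encoder: checksum each chunk as it is produced...
  std::uint32_t incremental = combine(chunkChecksum(chunk1), chunkChecksum(chunk2));

  // ...decoder: checksum the reassembled buffer in one pass.
  std::vector<std::uint8_t> whole = chunk1;
  whole.insert(whole.end(), chunk2.begin(), chunk2.end());
  std::uint32_t oneShot = chunkChecksum(whole);

  std::printf("incremental=0x%08x one-shot=0x%08x match=%d\n",
              incremental, oneShot, incremental == oneShot);
  return 0;
}

A real scheme would want something stronger per chunk; the composition property is the part the incremental encoder actually needs.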
I will note that Bug 900784 (Bug 1405738) is now enabled by default and causes 30M calls/day through the XDR decoder based on the telemetry results, and this does not seem to be reflected in any increase in nightly crashes yet. This issue might be specific to the MultiDecoder task, or the way it uses XDR, or the sanity of the data which flows into XDR.
Assigning this bug to myself as part of general XDR-related crash investigation. I'm hoping Bug 1418894 will reduce the rate here until we are able to track down root causes.
Assignee: nihsanullah → tcampbell
Depends on: 1418894
Flags: needinfo?(tcampbell)
Bug 1418894 greatly reduced the number of crashes with XDR on the stack. A lot of the memory corruption is caused by bad XDR data decoding creating dangerous script objects that lead to crashes.

Another set of signatures that remains, until we can figure out why XDR is corrupt, is that XDRAtom and family will try to decode a string from the buffer but use a garbage length value. This results in memcpy or hash functions walking off the end of the buffer and crashing at a 4k-page-aligned address.

I'm highly suspicious of the Gecko StartupCache right now. It reads/writes/seeks from multiple threads, and its coordination uses plain old variables with no atomics or barriers. This could explain why the newly enabled JS Bytecode Cache did not cause a rise in XDR crashes: it uses the Necko cache for persistence.
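To make the StartupCache concern concrete: a shared cursor updated as a plain variable from several threads is a data race, while an atomic (or a lock around the whole seek+read sequence) at least makes each individual update well-defined. A minimal illustrative sketch of the pattern, not the actual StartupCache code:

// Sketch contrasting the suspected racy pattern with a minimally safer one.
// Illustrative only; names and structure are hypothetical.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Shown only for contrast; never instantiated below.
struct RacyCache {
  std::vector<char> data = std::vector<char>(1024, 'x');
  std::size_t cursor = 0;  // plain variable: concurrent seek/read is a race
  char readOne() { return data[cursor++]; }
};

struct SaferCache {
  std::vector<char> data = std::vector<char>(1024, 'x');
  std::atomic<std::size_t> cursor{0};  // each increment is now well-defined
  char readOne() {
    std::size_t i = cursor.fetch_add(1, std::memory_order_relaxed);
    return data[i % data.size()];  // a seek+read sequence still needs a lock
  }
};

int main() {
  SaferCache cache;
  std::vector<std::thread> threads;
  for (int t = 0; t < 4; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < 100; ++i) cache.readOne();
    });
  }
  for (auto& th : threads) th.join();
  std::printf("final cursor = %zu\n", cache.cursor.load());
  return 0;
}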
Just for clarification, are checksums still an interesting approach, if the suspicions against the StartupCache are correct?
The StartupCache sounds scary, sec-wise. Can you file some sec bugs on it?
Flags: needinfo?(tcampbell)
(In reply to Randell Jesup [:jesup] from comment #15)
> > After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places
> > while decoding.
>
> "our XDR were corrupted" -> what exactly does this mean? Do you believe
> something wrote (UAF) over memory you'd allocated? If so, shipping ASAN
> browsers to a subset of users might be useful in catching it.

Definitely.

> > For the moment this issue is not actionable, and based on the number of different signatures, it seems that this is a UAF
> > issue. I will mark it as a duplicate of Bug 1404860 for now, until we can get a better grasp of UAF issues in general.
>
> […]
>
> In this case, the types of errors (mostly PAGE errors; very unusual) and the
> number of XDR signatures are interesting and very different from the mass of
> UAFs we see, which is why I'd like to reopen this.

Page errors can happen if the corrupted value is a length counter, in which case the code reading this length counter will follow the content to read/write it and trip across unmapped pages. The XDR buffers contain a bunch of inlined strings which are encoded as {length, bytes[length]}.

On the other hand, this might also be explained by a mis-encoded buffer where the decoded length is not at the same offset as the encoded length.
Flags: needinfo?(nicolas.b.pierron)
I opened Bug 1451371 to investigate thread-safety/coding-styles for StartupCache.
Flags: needinfo?(tcampbell)
(In reply to Nicolas B. Pierron [:nbp] {backlog: ~36} from comment #30)
> (In reply to Randell Jesup [:jesup] from comment #15)
> > In this case, the types of errors (mostly PAGE errors; very unusual) and the
> > number of XDR signatures are interesting and very different from the mass of
> > UAFs we see, which is why I'd like to reopen this.
>
> Page errors can happen if the corrupted value is a length counter, in which
> case the code reading this length counter will follow the content to
> read/write it and trip across unmapped pages.

Note: PAGE errors in a Reason field typically mean something like "loaded Firefox off a network drive", or trying to read a file over the network and having the network fail (or so I've been told).
Let's re-triage this crash to look at current crashes and determine if more needs to be done.
Assignee: tcampbell → nobody
Whiteboard: [#jsapi:crashes-retriage]
Depends on: 1435779
Stalled. This is a long-standing series of crashes, and we've done a lot to improve them. Also, downgrading to sec-moderate since these seem to be startup and chrome-code related crashes.
Severity: critical → S2
Severity: S2 → S3

Only 2 of these signatures are still alive on recent versions. This is a startup crash that hasn't been investigated in 5 years. If these are still a concern, we can re-investigate.

Status: REOPENED → RESOLVED
Crash Signature: [@ js::XDRState<T>::codeUint32] [@ memcpy | js::XDRState<T>::codeBytes ] [@ memcpy | js::XDRState<T>::codeChars ] [@ js::XDRState<T>::codeUint8] [@ js::XDRScript<T> ] [@ js::AtomizeChars<T> ] → [@ memcpy | js::XDRState<T>::codeBytes ] [@ js::AtomizeChars<T> ]
Closed: 8 years ago → 1 year ago
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: javascript-core-security