Closed Bug 1407651 Opened 8 years ago Closed 1 year ago

Crash in various points called from JS XDR

Categories

(Core :: JavaScript Engine, defect, P1)

55 Branch
x86
Windows
defect

Tracking


RESOLVED INCOMPLETE

People

(Reporter: jesup, Unassigned)

References

(Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [#jsapi:crashes-retriage])

Crash Data

[Tracking Requested - why for this release]: Top wildptr crashes in 57/58

+++ This bug was initially created as a clone of Bug #1374702 +++

We have several XDR crashes with wildptrs in 57/58:
memcpy | js::XDRState<T>::codeBytes
memcpy | js::XDRState<T>::codeChars
js::XDRState<T>::codeUint32
js::XDRScript<T> (Bug 790047)
js::AtomizeChars<T> (bug 1373934)
(and that's just in the top 50 wildptr crashes for 57/58)

I expect these are all or mostly related to the same general problem. Together, they are at least the #2 wildptr crash source, and maybe even #1. At least one of these goes back a long way; the others might be the same bug or new bugs. Bug 1374702 didn't get rid of these; perhaps it got rid of a subset.
Flags: needinfo?(nihsanullah)
Flags: needinfo?(kmaglione+bmo)
Group: core-security → javascript-core-security
The overall frequency of this has been rising a lot, and the absolute number of crashes is worrying for when we go to release. We're hitting ~750/day.
(In reply to Randell Jesup [:jesup] from comment #1)
> The overall frequency of this has been rising a lot, and the absolute number
> of crashes is worrying for when we go to release. We're hitting ~750/day.

The reason it is so high is that this is already used in release (Fx 56).
Ted: many, but definitely not all, of these crashes are EXCEPTION_IN_PAGE_ERROR. The addresses in the READ and WRITE errors seem *similar* to the EXCEPTION_IN_PAGE_ERRORs... what do you make of this? Also, why so many EXCEPTION_IN_PAGE_ERRORs with these signatures? It seems odd.
Flags: needinfo?(ted)
(In reply to Randell Jesup [:jesup] from comment #3)
> Ted: many, but definitely not all, of these crashes are
> EXCEPTION_IN_PAGE_ERROR. The addresses in the READ and WRITE errors seem
> *similar* to the EXCEPTION_IN_PAGE_ERRORs... what do you make of this?
> Also, why so many EXCEPTION_IN_PAGE_ERRORs with these signatures? It seems
> odd.

I'm not sure what you mean about the addresses--these seem to be mostly 32-bit crashes, so the crash addresses all just look like "32-bit pointers" to me. Note also that on a 32-bit Windows OS, unless you alter your boot settings, processes are limited to the lower 2GB of address space. There's a "Total Virtual Memory" field on crashes that can tell you that; it's usually 2GB for 32-bit Windows, and 4GB for 32-bit Firefox running on 64-bit Windows.

As to why we get a lot of EXCEPTION_IN_PAGE_ERROR, I'm not really sure. My best guess is that the memory was paged out and there was an error paging it back in. I looked at a few individual crash reports and none of them seemed to have low Available Physical Memory or a high System Memory Use Percentage, so it seems odd that they'd be hitting swap.
Flags: needinfo?(ted)
js::XDRScript<T> seems to have normal addresses (and few page errors).

memcpy | js::XDRState<T>::codeBytes has a lot of page errors, and the addresses are mostly (almost all) XXXX0000 addresses, even the READ errors (which makes me think they're really PAGE errors). Not quite 100%, though.

Do some of these touch external DLLs that might be loaded over a network connection? More to the point, why are the XDR signatures tied to PAGE errors?
[CC'ing a couple of folks, and will poke Naveed]
(In reply to Randell Jesup [:jesup] from comment #5)
> js::XDRScript<T> seems to have normal addresses (and few page errors).
>
> memcpy | js::XDRState<T>::codeBytes has a lot of page errors, and the
> addresses are mostly (almost all) XXXX0000 addresses, even the READ errors
> (which makes me think they're really PAGE errors). Not quite 100%, though.

Usually these issues come from a length corruption, where we attempt to copy (read/write) too much and reach a non-mapped page. This is probably due to a corruption of the content of the XDR buffer.

I will ni? tcampbell, as he has been adding assertions to catch a similar buffer corruption in another bug.
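To make that mechanism concrete, here is a minimal standalone sketch (hypothetical buffer layout, not the actual XDR wire format) of how a corrupted length field turns into a fault at a page-aligned address, which would explain both the XXXX0000 crash addresses and the PAGE errors noted above:

// Sketch only: a decoder reads a 32-bit length and then memcpy's that many
// bytes. If the length has been corrupted, the copy source runs far past the
// end of the allocation, and the first faulting address is the first unmapped
// page -- hence crash addresses that look like XXXX0000.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
  constexpr std::uintptr_t kPageSize = 0x1000;

  std::vector<std::uint8_t> buffer(64, 0);
  // Pretend a UAF overwrote the serialized length field with garbage.
  const std::uint32_t corruptedLength = 0x00200000;
  std::memcpy(buffer.data(), &corruptedLength, sizeof(corruptedLength));

  // Decoder side: read the length, then (normally) memcpy `length` bytes out.
  std::uint32_t length;
  std::memcpy(&length, buffer.data(), sizeof(length));
  const std::uint8_t* src = buffer.data() + sizeof(length);

  // Rather than performing the out-of-bounds copy, just compute where it
  // would first touch unmapped memory: the next page boundary at or beyond
  // the end of the allocation (assuming nothing else happens to be mapped
  // immediately after it).
  std::uintptr_t allocEnd =
      reinterpret_cast<std::uintptr_t>(buffer.data()) + buffer.size();
  std::uintptr_t copyEnd = reinterpret_cast<std::uintptr_t>(src) + length;
  std::uintptr_t firstUnmapped = (allocEnd + kPageSize - 1) & ~(kPageSize - 1);

  std::printf("decoded length   = 0x%x (corrupted)\n",
              static_cast<unsigned>(length));
  std::printf("copy would span  = [%p, 0x%zx)\n",
              static_cast<const void*>(src), static_cast<std::size_t>(copyEnd));
  std::printf("first fault addr ~ 0x%zx (page aligned)\n",
              static_cast<std::size_t>(firstUnmapped));
  return 0;
}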
Flags: needinfo?(nihsanullah) → needinfo?(tcampbell)
Note, this issue might be a duplicate of Bug 1373934.
Priority: -- → P1
(In reply to Randell Jesup [:jesup] from comment #5)
> Do some of these touch external DLLs that might be loaded over a network
> connection? More to the point, why are the XDR signatures tied to PAGE
> errors?

I don't have a great theory here beyond what I said in comment 4. If we access memory that's been paged out, it's possible to hit an I/O error when reading it back in from the pagefile, I suppose.
https://blogs.msdn.microsoft.com/oldnewthing/20120608-00/?p=7423

EXCEPTION_IN_PAGE_ERROR
The first element of the array contains a read-write flag that indicates the type of operation that caused the access violation. If this value is zero, the thread attempted to read the inaccessible data. If this value is 1, the thread attempted to write to an inaccessible address. If this value is 8, the thread causes a user-mode data execution prevention (DEP) violation.
The second array element specifies the virtual address of the inaccessible data.
The third array element specifies the underlying NTSTATUS code that resulted in the exception.

Value 8 (Data Execution Prevention) looks especially interesting/concerning.
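For reference, those three array elements are exposed through the Windows EXCEPTION_RECORD::ExceptionInformation fields. A minimal Windows/MSVC sketch of a structured exception filter that decodes them (illustrative only; this is not how Breakpad processes them):

// Sketch: decoding the EXCEPTION_IN_PAGE_ERROR parameters described above via
// a structured exception filter. Windows/MSVC only; it just shows which fields
// carry the read/write flag, the faulting address, and the NTSTATUS.
#include <windows.h>
#include <cstdio>

static int DescribeInPageError(EXCEPTION_POINTERS* info) {
  const EXCEPTION_RECORD* rec = info->ExceptionRecord;
  if (rec->ExceptionCode == EXCEPTION_IN_PAGE_ERROR &&
      rec->NumberParameters >= 3) {
    ULONG_PTR rwFlag = rec->ExceptionInformation[0];    // 0=read, 1=write, 8=DEP
    ULONG_PTR address = rec->ExceptionInformation[1];   // inaccessible address
    ULONG_PTR ntstatus = rec->ExceptionInformation[2];  // underlying NTSTATUS
    std::printf("IN_PAGE_ERROR %s at %p, NTSTATUS=0x%08lx\n",
                rwFlag == 0 ? "READ" : rwFlag == 1 ? "WRITE" : "DEP",
                reinterpret_cast<void*>(address),
                static_cast<unsigned long>(ntstatus));
  }
  return EXCEPTION_EXECUTE_HANDLER;
}

int main() {
  __try {
    // Code that might fault while its backing page/file is unavailable.
  } __except (DescribeInPageError(GetExceptionInformation())) {
    std::printf("handled\n");
  }
  return 0;
}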
Flags: needinfo?(ted)
Yes, I think I linked you to that page a while ago. Breakpad uses the NTSTATUS code; it's the second half of the crash reason, like "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_DEVICE_ERROR".
Flags: needinfo?(ted)
https://crash-stats.mozilla.com/report/index/1aff0cfb-9c6d-4871-a66a-2ab320171025 is an example of a 57b11 WRITE error with a wildptr or UAF.
NBP tells us this is a dupe of bug 1404860 (via Naveed). NI Randell for further investigation/triage, as he has been looking at this from the triage team.
Flags: needinfo?(rjesup)
After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places while decoding.

For the moment this issue is not actionable, and based on the number of different signatures, it seems that this is a UAF issue. I will mark it as a duplicate of Bug 1404860 for now, until we can get a better grasp of UAF issues in general.

I will note that Bug 1410132 might increase the crash rate of this signature by moving other JS crashes to this particular signature, but without increasing the overall number of crashes, only moving them around.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
> After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places
> while decoding.

"our XDR were corrupted" -> what exactly does this mean? Do you believe something wrote (UAF) over memory you'd allocated? If so, shipping ASAN browsers to a subset of users might be useful in catching it.

> For the moment this issue is not actionable, and based on the number of different signatures, it seems that this is a UAF
> issue. I will mark it as a duplicate of Bug 1404860 for now, until we can get a better grasp of UAF issues in general.

Bug 1404860 is not meant as a holding spot for otherwise unexplained UAFs; it's meant (I think) to track the overall volume and to consider ways to improve our overall attack on UAFs.

In this case, the types of errors (mostly PAGE errors; very unusual) and the number of XDR signatures are interesting and very different from the mass of UAFs we see, which is why I'd like to reopen this.

It also means (depending on the answer to my first question above) that there might be some mitigations or diagnostic tripwires we could land without major perf impact (validation of the data in various spots, changing the allocation strategy for the data, sanity assertions or deep verification in Nightly, etc.).
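As one concrete example of such a tripwire, guard words around the transcode buffer could turn silent corruption into a distinctive abort before decoding starts. A minimal sketch, assuming the corruption happens while the buffer is at rest; the class and names here are hypothetical, not SpiderMonkey APIs:

// Sketch of a canary-style tripwire: guard words before and after the payload
// are verified before the buffer is handed to the decoder, so a stomped canary
// becomes an immediately recognizable abort instead of a wild memcpy later.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <utility>
#include <vector>

class GuardedBuffer {
  static constexpr std::uint32_t kCanary = 0xAB4D1E55;
  std::vector<std::uint8_t> storage_;  // [canary][payload][canary]

 public:
  explicit GuardedBuffer(const std::vector<std::uint8_t>& payload) {
    storage_.resize(payload.size() + 2 * sizeof(kCanary));
    std::memcpy(storage_.data(), &kCanary, sizeof(kCanary));
    std::memcpy(storage_.data() + sizeof(kCanary), payload.data(),
                payload.size());
    std::memcpy(storage_.data() + storage_.size() - sizeof(kCanary), &kCanary,
                sizeof(kCanary));
  }

  // Check both guard words, then hand out a view of the payload.
  std::pair<const std::uint8_t*, std::size_t> checkedPayload() const {
    std::uint32_t head, tail;
    std::memcpy(&head, storage_.data(), sizeof(head));
    std::memcpy(&tail, storage_.data() + storage_.size() - sizeof(tail),
                sizeof(tail));
    if (head != kCanary || tail != kCanary) {
      std::fprintf(stderr, "transcode buffer canary smashed\n");
      std::abort();
    }
    return {storage_.data() + sizeof(kCanary),
            storage_.size() - 2 * sizeof(kCanary)};
  }
};

int main() {
  GuardedBuffer buf(std::vector<std::uint8_t>(16, 0x2a));
  auto payload = buf.checkedPayload();
  std::printf("payload of %zu bytes passed the canary check\n", payload.second);
  return 0;
}

This only catches writes that happen to hit the guard words, of course; it is a cheap tripwire, not a substitute for finding the UAF.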
Flags: needinfo?(rjesup) → needinfo?(nicolas.b.pierron)
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Hi Naveed, please reassign these bugs to developers who can investigate and fix them. Thanks!
Assignee: nobody → nihsanullah
Bug 1403348 is also interesting (particularly bug 1403348 comment 19), and may be related to this. We're clearly getting data corruption in the loaded scripts there. The one useful diagnostic error I've gotten so far says getBoolPzef is undefined on a source line that references getBoolPref. My guess is that that error is caused by corrupted XDR data, but there are other possibilities. The user's browser crashed with the same error several sessions in a row, so this is clearly reproducible, to some extent.
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(nihsanullah)
Ted/Jan/Nicolas: are there speculative fixes or tripwires we can land, based on what we know now, to at least limit or clarify these issues? This volume is high enough that we need some due diligence here.
Flags: needinfo?(nihsanullah) → needinfo?(jdemooij)
The action we discussed in Lisbon was to do Bug 1410132. :nbp and I (and :kmag too, I believe) have been adding tripwires over the last month, but the interesting crashes seem to come back to allocated buffers becoming corrupted. I'll sync up with :nbp again now that I'm back and see if we can come up with a more concrete theory on why XDR is hit by so much of this UAF behaviour.
Clearing NI per comment 19.
Flags: needinfo?(jdemooij)
It probably makes sense to add a checksum to our XDR data - that's not too hard to do and then we can see if this has an effect on the other crashes that might be XDR related.
(In reply to Randell Jesup [:jesup] from comment #15)
> > After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places
> > while decoding.
>
> "our XDR were corrupted" -> what exactly does this mean?

That means the data contained in the TranscodeBuffer no longer looks like any data which could be stored in our TranscodeBuffer.

> Do you believe
> something wrote (UAF) over memory you'd allocated? If so, shipping ASAN
> browsers to a subset of users might be useful in catching it.

Definitely, as it would be a way to get better reports from Bug 1404860 issues.

> In this case, the types of errors (mostly PAGE errors; very unusual) and the
> number of XDR signatures are interesting and very different from the mass of
> UAFs we see, which is why I'd like to reopen this.

The page errors are caused by reading a length value out of a corrupted buffer, which causes the copy to read/write the buffer until it goes out of bounds and fails when it reads from, or writes to, an unmapped page.

If XDR is the source of these buffer corruptions, then I have no way to investigate this issue, even with the tons of crashes that we have today. We could add tons of canaries, but they would all fail at reporting the original UAF issue.

> It also means (depending
> on the answer to my first question above) that there might be some
> mitigations or diagnostic tripwires we could land without major perf impact
> (validation of the data in various spots, changing the allocation strategy
> for the data, sanity assertions or deep verification in Nightly, etc.).

Any of these would help us fail the execution of XDR; they would not help catch any buffer corruption.

(In reply to Jan de Mooij [:jandem] from comment #21)
> It probably makes sense to add a checksum to our XDR data - that's not too
> hard to do and then we can see if this has an effect on the other crashes
> that might be XDR related.

The issue reported in comment 17 highlights that some corruption might happen when we are saving the bytecode. So this is definitely where a checksum might be helpful, to catch encoding issues (due to corruption or not). However, the incremental encoder does not simplify the problem, and whatever checksum we choose should at least be position independent or composable.

Also, I will note that we are noticing similar UAF issues in IonMonkey's LifoAlloc buffer when it is being transferred from one thread to another. So, having a checksum will give us a way to fail early, but this might not prevent crashes from happening due to concurrent corruption of our TranscodeBuffer while we are decoding it.
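To illustrate the "position independent or composable" constraint: a plain byte-sum checksum (deliberately simplistic and weak) composes across independently encoded chunks, so an incremental encoder could keep per-chunk values and fold them together at the end. A minimal sketch, not the scheme actually proposed for XDR:

// Sketch of a composable checksum for incrementally encoded chunks: the
// checksum of the concatenation equals the combination of the per-chunk
// checksums, so chunks can be checksummed (and verified) independently.
// A plain byte sum is used purely for illustration; it is weak and does not
// detect reordering.
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

static std::uint32_t chunkChecksum(const std::vector<std::uint8_t>& chunk) {
  return std::accumulate(chunk.begin(), chunk.end(), std::uint32_t{0});
}

// Combining is just addition, so it does not matter how the stream was split.
static std::uint32_t combine(std::uint32_t a, std::uint32_t b) { return a + b; }

int main() {
  std::vector<std::uint8_t> chunk1 = {'f', 'u', 'n', 'c'};
  std::vector<std::uint8_t> chunk2 = {'t', 'i', 'o', 'n'};

  // Incremental encoder: checksum each chunk as it is produced...
  std::uint32_t incremental = combine(chunkChecksum(chunk1), chunkChecksum(chunk2));

  // ...decoder: checksum the reassembled buffer in one pass.
  std::vector<std::uint8_t> whole = chunk1;
  whole.insert(whole.end(), chunk2.begin(), chunk2.end());
  std::uint32_t oneShot = chunkChecksum(whole);

  std::printf("incremental=0x%08x one-shot=0x%08x match=%d\n",
              incremental, oneShot, incremental == oneShot);
  return 0;
}

A real scheme would want something stronger per chunk; the composition property is the part the incremental encoder actually needs.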
I will note that Bug 900784 (Bug 1405738) is now enabled by default and causes 30M calls/day through the XDR decoder based on the telemetry results, and this does not seem to be reflected in any increase in nightly crashes yet. This issue might be specific to the MultiDecoder task, or the way it uses XDR, or the sanity of the data which flows into XDR.
Assigning this bug to myself as part of general XDR-related crash investigation. I'm hoping Bug 1418894 will reduce the rate here until we are able to track down root causes.
Assignee: nihsanullah → tcampbell
Depends on: 1418894
Flags: needinfo?(tcampbell)
Bug 1418894 greatly reduced the number of crashes with XDR on the stack. A lot of the memory corruption is caused by bad XDR data decoding creating dangerous script objects that lead to crashes.

Another set of signatures that remains, until we can figure out why XDR is corrupt, is that XDRAtom and family will try to decode a string from the buffer but use a garbage length value. This results in memcpy or hash functions walking off the end of the buffer and crashing at a 4k-page-aligned address.

I'm highly suspicious of the Gecko StartupCache right now. It reads/writes/seeks from multiple threads, and its coordination uses plain old variables with no atomics or barriers. This could explain why the newly enabled JS Bytecode Cache did not cause a rise in XDR crashes: it uses the Necko cache for persistence.
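To make the StartupCache concern concrete: a shared cursor updated as a plain variable from several threads is a data race, while an atomic (or a lock around the whole seek+read sequence) at least makes each individual update well-defined. A minimal illustrative sketch of the pattern, not the actual StartupCache code:

// Sketch contrasting the suspected racy pattern with a minimally safer one.
// Illustrative only; names and structure are hypothetical.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Shown only for contrast; never instantiated below.
struct RacyCache {
  std::vector<char> data = std::vector<char>(1024, 'x');
  std::size_t cursor = 0;  // plain variable: concurrent seek/read is a race
  char readOne() { return data[cursor++]; }
};

struct SaferCache {
  std::vector<char> data = std::vector<char>(1024, 'x');
  std::atomic<std::size_t> cursor{0};  // each increment is now well-defined
  char readOne() {
    std::size_t i = cursor.fetch_add(1, std::memory_order_relaxed);
    return data[i % data.size()];  // a seek+read sequence still needs a lock
  }
};

int main() {
  SaferCache cache;
  std::vector<std::thread> threads;
  for (int t = 0; t < 4; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < 100; ++i) cache.readOne();
    });
  }
  for (auto& th : threads) th.join();
  std::printf("final cursor = %zu\n", cache.cursor.load());
  return 0;
}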
Just for clarification, are checksums still an interesting approach, if the suspicions against the StartupCache are correct?
The StartupCache sounds scary, sec-wise. Can you file some sec bugs on it?
Flags: needinfo?(tcampbell)
(In reply to Randell Jesup [:jesup] from comment #15)
> > After talking with Ted Campbell last week, the conclusion is that our XDR were corrupted, causing crashes at various places
> > while decoding.
>
> "our XDR were corrupted" -> what exactly does this mean? Do you believe
> something wrote (UAF) over memory you'd allocated? If so, shipping ASAN
> browsers to a subset of users might be useful in catching it.

Definitely.

> > For the moment this issue is not actionable, and based on the number of different signatures, it seems that this is a UAF
> > issue. I will mark it as a duplicate of Bug 1404860 for now, until we can get a better grasp of UAF issues in general.
>
> […]
>
> In this case, the types of errors (mostly PAGE errors; very unusual) and the
> number of XDR signatures are interesting and very different from the mass of
> UAFs we see, which is why I'd like to reopen this.

Page errors can happen if the corrupted value is a length counter, in which case the code reading this length counter will follow the content to read/write it and trip across unmapped pages. The XDR buffers contain a bunch of inlined strings which are encoded as {length, bytes[length]}.

On the other hand, this might also be explained by a mis-encoded buffer where the decoded length is not at the same offset as the encoded length.
Flags: needinfo?(nicolas.b.pierron)
I opened Bug 1451371 to investigate thread-safety/coding-styles for StartupCache.
Flags: needinfo?(tcampbell)
(In reply to Nicolas B. Pierron [:nbp] {backlog: ~36} from comment #30)
> (In reply to Randell Jesup [:jesup] from comment #15)
> > In this case, the types of errors (mostly PAGE errors; very unusual) and the
> > number of XDR signatures are interesting and very different from the mass of
> > UAFs we see, which is why I'd like to reopen this.
>
> Page errors can happen if the corrupted value is a length counter, in which
> case the code reading this length counter will follow the content to
> read/write it and trip across unmapped pages.

Note: PAGE errors in a Reason field typically mean something like "loaded Firefox off a network drive", or trying to read a file over the network and having the network fail (or so I've been told).
Let's re-triage this crash to look at current crashes and determine if more needs to be done.
Assignee: tcampbell → nobody
Whiteboard: [#jsapi:crashes-retriage]
Depends on: 1435779
Stalled. This is a long-standing series of crashes, and we've done a lot to improve them. Also, downgrading to sec-moderate since these seem to be startup and chrome-code related crashes.
Severity: critical → S2
Severity: S2 → S3

Only 2 of these signatures are still alive on recent versions. This is a startup crash that hasn't been investigated in 5 years. If these are still a concern, we can re-investigate.

Status: REOPENED → RESOLVED
Crash Signature: [@ js::XDRState<T>::codeUint32] [@ memcpy | js::XDRState<T>::codeBytes ] [@ memcpy | js::XDRState<T>::codeChars ] [@ js::XDRState<T>::codeUint8] [@ js::XDRScript<T> ] [@ js::AtomizeChars<T> ] → [@ memcpy | js::XDRState<T>::codeBytes ] [@ js::AtomizeChars<T> ]
Closed: 8 years ago → 1 year ago
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: javascript-core-security