Closed
Bug 1181345
Opened 10 years ago
Closed 10 years ago
When accent is found, subsequent code isn't shown in debugger (Firefox Nightly)
Categories
(DevTools :: Debugger, defect)
Tracking
(firefox40 unaffected, firefox41+ fixed, firefox42+ fixed)
RESOLVED
FIXED
Firefox 42
| | Tracking | Status |
| --- | --- | --- |
| firefox40 | --- | unaffected |
| firefox41 | + | fixed |
| firefox42 | + | fixed |
People
(Reporter: contato, Assigned: sjakthol)
References
Details
(Keywords: regression)
Attachments
(3 files)
12.82 KB, image/png | Details
70 bytes, text/html | Details
6.96 KB, patch | past: review+, ritu: approval-mozilla-aurora+ | Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:42.0) Gecko/20100101 Firefox/42.0
Build ID: 20150707030206
Steps to reproduce:
Write code containing an accented character and open it in the debugger from the web developer tools.
Actual results:
The accented character and the subsequent code aren't shown.
Expected results:
The code should be displayed as it is in the standard Firefox release.
I'm able to reproduce the issue when I load the testcase locally (as file://).
Reg range:
good=2015-06-19
bad=2015-06-20
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2694ff2ace6a&tochange=c319f262ce3e
My guess is bug 1169343.
Blocks: 1169343
Status: UNCONFIRMED → NEW
status-firefox40: --- → unaffected
tracking-firefox41: --- → ?
tracking-firefox42: --- → ?
Component: Untriaged → Developer Tools: Debugger
Ever confirmed: true
Flags: needinfo?(ejpbruel)
Keywords: regression
Version: 42 Branch → 41 Branch
Comment 3 • 10 years ago
Pushlog:
https://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=543e22ba9046&tochange=07b5f3ee0315
Regressed by: a6d0ac7916ff Sami Jaktholm — Bug 1170864 - Refactor DevToolsUtils.fetch to use NetUtil more extensively and make it fall back to OS.File if a local file cannot be read. r=past
Assignee | Comment 4 • 10 years ago
This is my fault; it's an encoding issue. If no charset information is available, the code assumes UTF-8. However, if the file is not UTF-8 (like the test case here, which is ISO-8859-1), the bytes → JS string conversion is cut short.
I have a patch but since try is closed I can't test it at the moment.
Flags: needinfo?(sjakthol)
Adding a tracking flag for FF41 and FF42 as this is a regression.
status-firefox41: --- → affected
status-firefox42: --- → affected
Assignee | Comment 6 • 10 years ago
So this is an encoding issue and it happens to be a fun one.
When we try to read a file from the disk, we can only guess its encoding[1]. If no charset hints are provided the code defaults to UTF-8. However, the attached test case file is in Windows-1252 (which is a superset of ISO-8859-1 [2]). And this is where the problem lies.
Why did this work before?
The earlier version first read the raw bytes from the file and converted them to a JavaScript string using a locale-specific default encoding[3], which is Windows-1252 for most locales. Only after that initial conversion did the code try to decode the data as UTF-8; if that failed, it just returned the initial, Windows-1252-decoded string.
So in essence the result of the conversion was the one that worked. If the file was Windows-1252, it returned the Windows-1252 decoding of the data. If it was UTF-8, it returned the UTF-8 decoding of the data. If it was Windows-1252 but did not contain any sequences that are invalid in UTF-8, it likely returned something that looked broken (we cannot really avoid this: if we don't know the encoding, we can't decode it correctly).
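As a rough sketch of that old fallback logic, using the standard TextDecoder API (the actual pre-regression code used Mozilla-internal converters, so this is only an approximation of the behavior described above):

// Illustrative sketch only, not the actual DevToolsUtils.fetch code.
// bytes is a Uint8Array read from the file.
function decodeWithFallback(bytes) {
  // First pass: legacy locale-style decoding (Windows-1252 for most locales).
  const legacy = new TextDecoder("windows-1252").decode(bytes);
  try {
    // Prefer UTF-8 if the bytes form a valid UTF-8 sequence.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    // Invalid UTF-8: fall back to the legacy decoding.
    return legacy;
  }
}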
Why doesn't it work anymore?
When I fixed bug 1170864 I removed the second UTF-8 conversion step and instead defaulted directly to UTF-8 in the first conversion if no encoding hints were given. The code uses NetUtil.readInputStreamToString() to do the conversion. And that seems to silently truncate the string if invalid UTF-8 byte sequences are found (filed bug 1182755).
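The truncation can be pictured roughly like this (an illustration of the observable behavior only, not NetUtil's actual implementation):

// Keep only what decodes cleanly before the first invalid UTF-8 sequence.
function decodeUntilInvalid(bytes) {
  const strict = new TextDecoder("utf-8", { fatal: true });
  let out = "";
  for (let end = 1; end <= bytes.length; end++) {
    try {
      out = strict.decode(bytes.subarray(0, end));
    } catch (e) {
      return out; // everything from the invalid byte onwards is dropped
    }
  }
  return out;
}
decodeUntilInvalid(new Uint8Array([0x64, 0xe9, 0x66])); // "d" -- the rest is lost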
The fun part:
When you encode the string "déf" with Windows-1252, you get the byte sequence \x64\xe9\x66. When you try to decode that sequence as UTF-8 you're in trouble. According to the UTF-8 rules, if a lead byte of the form 0xE? is found (the middle one, 0xE9, in the above sequence), the next two bytes must have the binary form 10xxxxxx 10xxxxxx.[4] Since the next byte is 0x66, or 01100110 in binary, the decoding fails because it is not a valid UTF-8 continuation byte.
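The same situation in code, using the standard TextDecoder API (purely illustrative; this is not the code path involved in the bug):

// "déf" encoded as Windows-1252 (and ISO-8859-1):
const bytes = new Uint8Array([0x64, 0xe9, 0x66]);

// Decoding as Windows-1252 gives the expected text:
new TextDecoder("windows-1252").decode(bytes);            // "déf"

// Strict UTF-8 decoding fails: 0xE9 announces a three-byte sequence,
// but 0x66 (01100110) is not a 10xxxxxx continuation byte.
new TextDecoder("utf-8", { fatal: true }).decode(bytes);  // throws TypeError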
To fix this problem I will add the previous conversion behavior back even though it is discouraged and deprecated (bug 1003087).
[1] https://developer.mozilla.org/en-US/docs/Reading_textual_data#Determining_the_character_encoding_of_data
[2] https://en.wikipedia.org/wiki/Windows-1252#Details
[3] https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings#Details_and_browser_internals
[4] http://stackoverflow.com/a/5552623
Assignee | Comment 7 • 10 years ago
Here's a patch that restores the pre-bug 1170864 conversion behavior. More information in the commit message and the previous comment.
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=96f0fe793042
Comment 8 • 10 years ago
Comment on attachment 8632448 [details] [diff] [review]
bug-1181345-non-utf8-files.patch
Review of attachment 8632448 [details] [diff] [review]:
-----------------------------------------------------------------
Splendid job, as always, Sami!
Attachment #8632448 - Flags: review?(past) → review+
Assignee | Updated • 10 years ago
Keywords: checkin-needed
Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 42
Assignee | Comment 11 • 10 years ago
Comment on attachment 8632448 [details] [diff] [review]
bug-1181345-non-utf8-files.patch
Approval Request Comment
[Feature/regressing bug #]: Bug 1170864.
[User impact if declined]: Debugger shows truncated sources if the files are not UTF-8 encoded and contain invalid UTF-8 byte sequences such as accents in Latin-1 (ISO 8859-1) encoded files.
[Describe test coverage new/current, TreeHerder]: Unit tests for the basic cases still pass, and this patch adds tests for this bug. The tests for tools that rely on the modified utility method (mostly the Debugger and the Style Editor) also still pass.
[Risks and why]: Minimal, as the patch only restores the pre-bug 1170864 behavior for charset conversion. In the worst-case scenario the debugger won't load sources or will show garbled text due to a charset conversion failure.
[String/UUID change made/needed]: N/A.
Attachment #8632448 - Flags: approval-mozilla-aurora?
Comment 12 • 10 years ago
Comment on attachment 8632448 [details] [diff] [review]
bug-1181345-non-utf8-files.patch
The patch adds unit tests covering this scenario, which is great. Let's uplift to Aurora.
Attachment #8632448 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 13 • 10 years ago
Flags: in-testsuite+
Comment 14 • 8 years ago
(In reply to Sami Jaktholm from comment #6)
> Why this worked before?
> The earlier version first read the raw bytes from a file and the bytes were
> converted to a JavaScript string according to a locale specific default
> encoding[3] which is Windows-1252 for most locales. Only after the initial
> conversion the code tried to decode the data as UTF-8 but if that failed it
> just returned the initial, Windows-1252 decoded, string.
It seems to me that there's nothing in the code tying the behavior to the locale. Instead, the code seems to rely on bytes getting zero-extended to UTF-16 code units, which happens to match ISO-8859-1 to UTF-16 conversion.
Am I missing something?
Flags: needinfo?(sjakthol)
Comment 15 • 8 years ago
More importantly:
Why do our dev tools try to support creating new non-UTF-8 content? (Presumably, local files are stuff you are developing yourself and not someone else's code you are debugging to figure out a Web compat problem.)
Assignee | Comment 16 • 8 years ago
Sorry, I can't remember any extra details that are not already written down in this bug.
But it could very well be that my analysis at the time was flawed. I was not, and still am not, an expert in the area of string encodings.
Flags: needinfo?(sjakthol)
Comment 17 • 8 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #14)
> It seems to me that there's nothing in the code tying the behavior to the
> locale. Instead, it seems to me that the code is relying on bytes getting
> zero-extended to UTF-16 code units, which happens to match ISO-8859-1 to
> UTF-16 conversion.
I managed to track down what I changed to regress the test case for this bug, and, indeed, this has nothing to do with the locale. If the input isn't valid UTF-8, the bytes are zero-extended to UTF-16 code units (which happens to resemble the locale-specific legacy conversion for some locales, including en-US, which our tests run as).
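To make the zero-extension concrete (illustration only, not code from the patch): treating each byte directly as a UTF-16 code unit is exactly the ISO-8859-1 → UTF-16 mapping, which is why Latin-1 files appeared to decode correctly.

// Zero-extending bytes to UTF-16 code units, e.g. via String.fromCharCode,
// is equivalent to decoding the bytes as ISO-8859-1 (Latin-1):
const bytes = [0x64, 0xe9, 0x66];   // "déf" in Windows-1252 / Latin-1
String.fromCharCode(...bytes);      // "déf" -- 0xE9 maps to U+00E9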
Updated • 7 years ago
Product: Firefox → DevTools