Closed Bug 1552008 Opened 6 years ago Closed 3 years ago

HTML parser should supply column numbers for <script> elements

Categories

(Core :: DOM: HTML Parser, defect, P2)

defect

Tracking

()

RESOLVED FIXED
113 Branch
Tracking Status
firefox68 --- wontfix
firefox113 --- fixed

People

(Reporter: loganfsmyth, Assigned: hsivonen)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [debugger-mvp])

Attachments

(3 files, 1 obsolete file)

This was spawned from https://bugzilla.mozilla.org/show_bug.cgi?id=1538056 but I'm going to file it separately since there is at least some work that can be done there on its own without this.

In 67, the JS debugger will begin treating breakpoints as line/column positions, where traditionally they have been treated as line-based. One of the outstanding issues is that while that is fine for

<script>
  console.log("foo");
</script>

where a breakpoint might be specifically at column 2 (the c in console.log)
we run into issues with

<script>console.log("foo");</script>

because the position of c, as far as the engine knows is 0, even though visually it is column 8, which causes our breakpoint markers to render in the wrong place.

While hand-written markup usually has that newline and doesn't tend to encounter this, we see it pop up in contexts where the inline code has been minified, which can be quite frustrating for users since debugging minified code is already tough and this will make it even tougher. Things can be especially frustrating if there are multiple <script> elements on the same line, because then it isn't just that the breakpoints are rendered off by 8 characters, but also that there could in fact be multiple sets of breakpoints overlapping.

The nsIScriptElement type already has a SetScriptColumnNumber, which is set by the XML parser, but not the HTML parser, so that would be the objective of this bug.

The XML case also doesn't work, so I'm going to co-opt https://bugzilla.mozilla.org/show_bug.cgi?id=1538056 to address that (we just never used the column from nsIScriptElement at all right now). Then once this lands we should be all set.

Attached patch Untested tokenizer patch (obsolete) — Splinter Review

This patch is untested beyond checking that it compiles. This is the part that is affected by Java to C++ translation. The rest should be buildable on top of this in non-generated C++.

Please check the impact on the innerHTML setter early (with text input that has no tags).

One issue here is that this is a code-unit column count, but SpiderMonkey's column values expect a code-point count, so it's not just an char16_t indexed counter that we need, but a codepoint count.

The alternate approach I've thought about is storing the start offset of the current line start, and in the rare case where getColumnNumber is actually called, it could run through the buffer between that location and the current location, calculating the codepoint offset. That has the downside of the tokenizer needing to be aware of the current buffer and position, which are currently scoped to only the stateLoop, but it would potentially have less of a performance effect since checkChar would have to do less per character.

Thoughts?

Flags: needinfo?(hsivonen)
Whiteboard: [debugger-mvp]

(In reply to Logan Smyth [:loganfsmyth] from comment #2)

One issue here is that this is a code-unit column count, but SpiderMonkey's column values expect a code-point count, so it's not just an char16_t indexed counter that we need, but a codepoint count.

:-( Code point counts instead of code unit counts are generally bad when identifying a position in a text buffer, but I guess these are supposed to "make sense" independently of buffer encoding in messages.

Since line numbers in document.written scripts are hopeless anyway and innerHTML-inserted scripts don't run, you can rely on data from the network being guaranteed to be valid UTF-16, so you can simply make checkChar not increment the column upon encountering a low surrogate.

The alternate approach I've thought about is storing the start offset of the current line start, and in the rare case where getColumnNumber is actually called, it could run through the buffer between that location and the current location, calculating the codepoint offset. That has the downside of the tokenizer needing to be aware of the current buffer and position, which are currently scoped to only the stateLoop, but it would potentially have less of a performance effect since checkChar would have to do less per character.

Since the tokenizer tokenizes a sequence of buffers such that a boundary can be anywhere, dealing with syncing things on buffer boundary is potentially complicated and/or error-prone, so I suggest seeing what the perf impact of the simple solution is first.

Which reminds me that my untested patch doesn't handle restoring the column number when rewinding the input stream after a failed speculation. I'll post a new patch.

Flags: needinfo?(hsivonen)
Priority: -- → P2
No longer blocks: dbg-69
Whiteboard: [debugger-mvp] → [debugger-reserve]
Blocks: dbg-71
No longer blocks: dbg-70
Whiteboard: [debugger-reserve] → [debugger-mvp]
Severity: normal → S3

So I applied patch from Comment 4 and rebased + fixed compilation errors I was seeing, and this seems to give us the information we need.
I added a patch on top of this one to actually consume the tokenizer information and set the column on inline script load context (see https://phabricator.services.mozilla.com/D170580 , for Bug 1815937)

I pushed to TRY with mach try auto: https://treeherder.mozilla.org/jobs?repo=try&revision=cda6d06715d8418ee608012a86427889f9c990f1
Aside from the devtools failures, which we'll take care of, it looks like only one other test fails: parser/htmlparser/tests/mochitest/test_bug672453.html , but I'm not sure which message is missing and why

Henri, do you think you could take ownership of this patch, check that I didn't do anything stupid when rebasing and ask review to a peer?

Flags: needinfo?(hsivonen)

I read the rebased patch, and it looks OK.

smaug, the main concern here is if this affects innerHTML performance. Can you suggest a benchmark that should be used to check whether this is harmless enough for innerHTML perf?

Flags: needinfo?(hsivonen) → needinfo?(smaug)

Answered on Matrix.

Flags: needinfo?(smaug)

nchevobbe, does changing

inline char16_t checkChar(char16_t* buf, int32_t pos) {
  char16_t c = buf[pos];
  if (nextCharOnNewLine) {
    line++;
    col = 1;
    nextCharOnNewLine = false;
  } else if (!(c >= 0xDC00 && c <= 0xDFFF)) {
    // SpiderMonkey wants to count scalar values
    // instead of UTF-16 code units.
    col++;
  }
  return c;
}

to

inline char16_t checkChar(char16_t* buf, int32_t pos) {
  char16_t c = buf[pos];
  if (MOZ_UNLIKELY(nextCharOnNewLine)) {
    line++;
    col = 1;
    nextCharOnNewLine = false;
  } else if (MOZ_LIKELY(!NS_IS_LOW_SURROGATE(c))) {
    // SpiderMonkey wants to count scalar values
    // instead of UTF-16 code units.
    col++;
  }
  return c;
}

improve things?

Flags: needinfo?(nchevobbe)

(In reply to Henri Sivonen (:hsivonen) from comment #10)

nchevobbe, does changing
[…]
improve things?

maybe ? https://treeherder.mozilla.org/perfherder/compare?originalProject=try&newProject=try&newRevision=0e0670eaee59a7cbca0fa2564044f9c786551870&framework=13&originalRevision=2828aaa0c115a72516d7269aa59ca016c66f9f38&page=1

I feel like I don't see the same regression as before, but also, it looks like the results don't have a super high confidence.
Anyhow, it doesn't seem like there would be a very obvious regression, so it sounds okay?

Flags: needinfo?(nchevobbe)

smaug, are you OK with landing this with the refinement from comment 10?

Flags: needinfo?(smaug)

I think I'm ok, but if we see some regression after all, it may need to get backed out.
With perf tests one need to run them quite a few times to get reliable numbers.

Flags: needinfo?(smaug)

nchevobbe, I suggest updating the patch with the change from comment 10 and requesting review from smaug. (It seems inappropriate for me to review, since most of the code in the patch comes from me.)

Once the patch is on m-c, I can then reconstruct the appropriate changeset for the htmlparser repo on GitHub.

If you profile microbenchmark like http://mozilla.pettay.fi/moztests/innerhtml_nonsimple_nostyle_test.html, does the new code show up in the profile? Might be useful to change that test to use a bit longer string for innerHTML.

Here's the test page modified
The only thing I changed is that

// in loop
document.getElementById("speed_dom").innerHTML =
  "<b><span> Testing...</span> (" + i + "/1000)</b>"

into

const suffix = "<i></i>".repeat(100);
// ...
// in loop
document.getElementById("speed_dom").innerHTML =
  "<b><span> Testing...</span> (" + i + "/1000)</b>" + suffix

Here's a base profile from latest central: https://share.firefox.dev/3JfgPE4
And a profile with henri's latest patch: https://share.firefox.dev/3EWsN2D

I don't see too much differences in set Element.innerHTML timings, but I'm not sure I'm able to analyze those graphs accurately

Assignee: nobody → nchevobbe
Attachment #9319147 - Attachment description: WIP: Bug 1552008 part 1 - Track column number in the HTML → Bug 1552008 - Track column number in the HTML. r=smaug.
Status: NEW → ASSIGNED

(In reply to Henri Sivonen (:hsivonen) from comment #15)

nchevobbe, I suggest updating the patch with the change from comment 10 and requesting review from smaug. (It seems inappropriate for me to review, since most of the code in the patch comes from me.)

Once the patch is on m-c, I can then reconstruct the appropriate changeset for the htmlparser repo on GitHub.

Patch queue rebased and pushed on phab https://phabricator.services.mozilla.com/D170579

Attachment #9319147 - Attachment description: Bug 1552008 - Track column number in the HTML. r=smaug. → Bug 1552008 - Track column number in the HTML.
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1b5481ac73dc Track column number in the HTML. r=smaug,nchevobbe

Backed out changeset 1b5481ac73dc (Bug 1552008) for mochitest failures on test_bug672453.html.
Backout link
Push with failures <--> 1
Failure Log

Flags: needinfo?(nchevobbe)

looks like the failure I was talking about in https://bugzilla.mozilla.org/show_bug.cgi?id=1552008#c6

Assignee: nchevobbe → hsivonen
Flags: needinfo?(nchevobbe) → needinfo?(hsivonen)

This does fix the issue

diff --git a/parser/htmlparser/tests/mochitest/test_bug672453.html b/parser/htmlparser/tests/mochitest/test_bug672453.html
--- a/parser/htmlparser/tests/mochitest/test_bug672453.html
+++ b/parser/htmlparser/tests/mochitest/test_bug672453.html
@@ -104,7 +104,7 @@ var expectedErrors = [
     isWarning: true },
   { errorMessage: "The start of the document was reparsed, because there were non-ASCII characters in the part of the document that was unsuccessfully searched for a meta tag before falling back to the XML declaration syntax. A meta tag at the start of the head part should be used instead of the XML declaration syntax.",
     sourceName: "http://mochi.test:8888/tests/parser/htmlparser/tests/mochitest/file_bug672453_xml_speculation_fail.html",
-    lineNumber: 11,
+    lineNumber: 10,
     isWarning: true },
 ];

line 10 points to the closing html tag https://searchfox.org/mozilla-central/rev/cd2121e7d83af1b421c95e8c923db70e692dab5f/parser/htmlparser/tests/mochitest/file_bug672453_xml_speculation_fail.html#10

(In reply to Nicolas Chevobbe [:nchevobbe] from comment #22)

This does fix the issue

Thanks. Relanding queued with this tweak. Sorry about the delay.

Flags: needinfo?(hsivonen)
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/eccdf9dad71c Track column number in the HTML. r=smaug,nchevobbe
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: