core: correctly truncate unicode strings #14911

connorjclark · 2023-03-21T18:00:15Z

In various places, we truncate web strings using .slice, which can cut a string in the middle of a grapheme (a single visual character). This is typically fine for JSON serialization (it just won't look great), but PSI has a serialization step that breaks on this input.

This PR updates all the places (though I may have missed some) that truncate strings ending up in the LHR so that they elide using Intl.Segmenter instead. It also changes the result a bit: the suffix added to a truncated string now counts toward the max character length, and some places that previously used "..." now use an ellipsis ("…").

reprise of #11697 #11698

fixes #14897
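
For readers unfamiliar with the technique, here is a minimal sketch of grapheme-aware truncation where the suffix counts toward the limit. It is not the code from this PR: the name truncateGraphemes, its defaults, and the sample strings are assumptions made for illustration.

```javascript
// Illustrative sketch only; `truncateGraphemes` is a hypothetical name.
function truncateGraphemes(string, characterLimit, suffix = '…') {
  const segmenter = new Intl.Segmenter(undefined, {granularity: 'grapheme'});
  // Assumes the suffix has one grapheme per UTF-16 code unit (true for '…' and '...').
  const keep = Math.max(characterLimit - suffix.length, 0);

  let count = 0;
  let cutIndex = string.length;
  for (const {index} of segmenter.segment(string)) {
    // `index` is the code-unit offset at which this grapheme starts.
    if (count === keep) cutIndex = index;
    count++;
    // Stop as soon as we know the string is over the limit.
    if (count > characterLimit) return string.slice(0, cutIndex) + suffix;
  }
  return string; // Fits within the limit, so nothing is elided.
}

// Man + ZWJ + man + ZWJ + boy + ZWJ + boy: a single grapheme cluster.
const family = '\u{1F468}\u200D\u{1F468}\u200D\u{1F466}\u200D\u{1F466}';
const input = 'ab' + family + 'cd'; // 5 graphemes, 15 code units

console.log(truncateGraphemes(input, 4)); // keeps 3 graphemes, then the suffix
console.log(truncateGraphemes('abc', 3)); // short enough, returned unchanged
```

Iterating lazily and returning early avoids segmenting the rest of a long string once the limit has been exceeded.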

core/lib/page-functions.js

core/audits/byte-efficiency/unminified-javascript.js

adamraine · 2023-03-21T18:24:17Z

core/computed/unused-css.js

      } else if (firstRuleEnd < PREVIEW_LENGTH) {
        // The entire first rule-set fits within the preview
-        preview = preview.slice(0, firstRuleEnd + 1) + ' ...';
+        preview = preview.slice(0, firstRuleEnd + 1) + ' …';


Why not use Util.truncate(0, firstRuleEnd, ' …') here?

Because firstRuleEnd is a specific location in the string, we should not truncate by characters here.

adamraine · 2023-03-21T18:26:30Z

core/lib/lh-error.js

 * @typedef {{sentinel: '__ErrorSentinel', message: string, code?: string, stack?: string}} SerializedBaseError
 */

 class LighthouseError extends Error {
  /**
   * @param {LighthouseErrorDefinition} errorDefinition
-   * @param {Record<string, string|undefined>=} properties
+   * @param {Record<string, unknown>=} properties


Why is this change necessary?

Initially I needed this change when the base tsconfig target was updated to 2022 (reverted in de6a394). The base tsconfig no longer needed the change after I moved the truncate function to shared/util.js (it was in page-functions.js). Might as well keep this more correct type, though.

core/lib/page-functions.js

brendankenny · 2023-03-21T19:39:42Z

My only concern is that Intl.Segmenter is quite slow compared to iterating on string code points. It'll also have the very slow ICU startup cost (like what necessitated caching Intl.NumberFormat instances), but hopefully that's being caught by the V8 caching of ICU init with basic options?

So for that cost, do we really need this for the handful of cases where grapheme clusters are significant (thanks Mathias for the term)? Most Unicode will be fine. This is mostly down in HTML attributes and other non-prominent strings, and it'll just be strings like 👨‍👨‍👦‍👦 that might end up split in the middle if they happen to land at the length limit (though still split on Unicode code points, so there won't be any issue with bad string encoding).
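
To make the distinction in this comment concrete, the sketch below compares cutting by UTF-16 code units (what .slice does) with cutting by code points. The helper names are hypothetical and are not part of Lighthouse.

```javascript
// Man + ZWJ + man + ZWJ + boy + ZWJ + boy:
// 1 grapheme cluster, 7 code points, 11 UTF-16 code units.
const family = '\u{1F468}\u200D\u{1F468}\u200D\u{1F466}\u200D\u{1F466}';

// Hypothetical helper: keep the first `limit` code points.
// Array.from iterates a string by code point, never inside a surrogate pair.
function truncateByCodePoints(string, limit) {
  return Array.from(string).slice(0, limit).join('');
}

// Hypothetical helper: true if a surrogate code unit appears without its partner.
function hasLoneSurrogate(string) {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(string);
}

const byCodeUnits = family.slice(0, 4); // man + ZWJ + first half of the second man
const byCodePoints = truncateByCodePoints(family, 3); // man + ZWJ + man

console.log(hasLoneSurrogate(byCodeUnits)); // slicing by code units can break encoding
console.log(hasLoneSurrogate(byCodePoints)); // code points stay well-formed
console.log(byCodePoints === family); // but the grapheme cluster is still split
```

Code point iteration keeps the string well-formed, which is the property serialization cares about. Only grapheme segmentation keeps the visible character whole.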

shared/test/util-test.js

paulirish · 2023-03-21T22:17:19Z

My only concern is that Intl.Segmenter is quite slow compared to iterating on string code points. It'll also have the very slow ICU startup cost (like what necessitated caching Intl.NumberFormat instances), but hopefully that's being caught by the V8 caching of ICU init with basic options?

i was skeptical too. numberformat has been a pain for us.

I was able to capture a profile of just axe + getnodedetails (in this branch): https://trace.cafe/t/Rd8274lf8y
fwiw: getnodedetails ran on 39 elems in this case (https://www.kissmyparcel.com/retail/)
I saw truncate take 14ms on the first time, but then it was fast on all subsequent invocations. Also, on my second/third traces i couldn't repro this large instantiate cost, but I did on my fourth. shrug. some weird caching presumably.

i made this snippet to just test from within devtools: https://gist.github.com/paulirish/e46ec350be36cd23047c350602c47435

given what i'm seeing here.. so far I feel okay about the perf side. 14ms isn't nothing, but.. it seems like it's a one-time cost?

brendankenny · 2023-03-22T15:22:49Z

given what i'm seeing here.. so far I feel okay about the perf side. 14ms isn't nothing, but.. it seems like it's a one-time cost?

That could be the ICU startup time, and if that is the case, seems like V8 is caching it for us, which is great. We should be wary of ever doing this for the first time during tracing.

If it looks good on perf, great to be doing the best segmenting here!
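
One way to keep that first-use cost out of a sensitive phase is to construct and exercise a segmenter ahead of time. The sketch below is an assumption about what such a warm-up could look like. The function name echoes the "warmUpIntlSegmenter" commit title, but the body is illustrative rather than taken from the PR, and timings will vary by machine and run.

```javascript
// Hypothetical warm-up helper: pays any one-time initialization cost up front
// and returns how long that took, in milliseconds.
function warmUpSegmenter() {
  const start = performance.now();
  const segmenter = new Intl.Segmenter(undefined, {granularity: 'grapheme'});
  // Pull one segment so that lazy initialization, if any, actually runs.
  segmenter.segment('warm up')[Symbol.iterator]().next();
  return performance.now() - start;
}

const firstMs = warmUpSegmenter();
const secondMs = warmUpSegmenter();
console.log({firstMs, secondMs}); // compare first use against a later use
```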

brendankenny · 2023-03-22T15:24:48Z

core/audits/byte-efficiency/unminified-javascript.js

@@ -8,6 +8,7 @@ import {ByteEfficiencyAudit} from './byte-efficiency-audit.js';
 import * as i18n from '../../lib/i18n/i18n.js';
 import {computeJSTokenLength as computeTokenLength} from '../../lib/minification-estimator.js';
 import {getRequestForScript, isInline} from '../../lib/script-helpers.js';
+import {Util} from '../../../shared/util.js';


feels weird that this isn't a direct export

core/computed/unused-css.js

brendankenny · 2023-03-22T15:36:35Z

core/computed/unused-css.js

@@ -101,16 +101,16 @@ class UnusedCSS {
          firstRuleStart > firstRuleEnd ||
          firstRuleStart > PREVIEW_LENGTH) {


if PREVIEW_LENGTH is in graphemes, it won't be comparable to preview.length, firstRuleStart, firstRuleEnd, etc. Not sure how much we care, but definitely some wonkiness in here as a result (e.g. firstRuleStart could be greater than PREVIEW_LENGTH in code points but less than PREVIEW_LENGTH in graphemes), which makes the code harder to understand/alter in the future.
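
A small illustration of the mismatch described here, using made-up stylesheet content. LIMIT stands in for PREVIEW_LENGTH, and countGraphemes is a hypothetical helper.

```javascript
// Hypothetical helper: count grapheme clusters in a string.
function countGraphemes(string) {
  const segmenter = new Intl.Segmenter(undefined, {granularity: 'grapheme'});
  let count = 0;
  for (const _ of segmenter.segment(string)) count++;
  return count;
}

const LIMIT = 10; // stand-in for PREVIEW_LENGTH
const family = '\u{1F468}\u200D\u{1F468}\u200D\u{1F466}\u200D\u{1F466}';
const content = `/* ${family} */ a{}`;

// indexOf returns a UTF-16 code-unit offset, like firstRuleStart does.
const ruleStart = content.indexOf('a{');

console.log(ruleStart > LIMIT); // measured in code units
console.log(countGraphemes(content.slice(0, ruleStart)) > LIMIT); // measured in graphemes
```

The same position is past the limit in one unit and within it in the other, so comparisons like these only make sense if every operand uses the same unit.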

brendankenny · 2023-03-22T15:53:23Z

core/lib/lh-error.js

 * @typedef {{sentinel: '__ErrorSentinel', message: string, code?: string, stack?: string}} SerializedBaseError
 */

 class LighthouseError extends Error {
  /**
   * @param {LighthouseErrorDefinition} errorDefinition
-   * @param {Record<string, string|undefined>=} properties
+   * @param {Record<string, unknown>=} properties


It looks like a tsc bug that spreading unknown down on L125 doesn't result in an error, though. properties are ICU placeholder values and the constructor shouldn't allow anything but strings through if at all possible, so ideally we can hold off on using unknown here unless it's absolutely required.

core/lib/page-functions.js

brendankenny · 2023-03-22T15:58:22Z

core/lib/url-utils.js

@@ -138,7 +138,7 @@ class UrlUtils {
  static elideDataURI(url) {
    try {
      const parsed = new URL(url);
-      return parsed.protocol === 'data:' ? url.slice(0, 100) : url;
+      return parsed.protocol === 'data:' ? Util.truncate(url, 100) : url;


FWIW data URLs can only contain ASCII characters

Good point, but at least inclusion of the ellipsis suffix is an improvement here.

shared/util.js

brendankenny · 2023-03-22T18:18:17Z

shared/util.js

+    const iterator = segmenter.segment(string)[Symbol.iterator]();
+
+    let lastSegment;
+    for (let i = 0; i <= characterLimit - ellipseSuffix.length; i++) {


technically ellipseSuffix.length might not be grapheme length, but I guess we can assume we won't pass in suffixes without a 1:1 code point:grapheme ratio? :)
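
A quick check of when that assumption holds. The third suffix is a made-up example chosen to break the 1:1 ratio, and graphemeLength is a hypothetical helper.

```javascript
const segmenter = new Intl.Segmenter(undefined, {granularity: 'grapheme'});

// Hypothetical helper: number of grapheme clusters in a string.
function graphemeLength(string) {
  return [...segmenter.segment(string)].length;
}

// '…' and ' ...' have one grapheme per code unit; U+1F447 is two code units but one grapheme.
const suffixes = ['…', ' ...', '\u{1F447}'];
for (const suffix of suffixes) {
  console.log(JSON.stringify(suffix), suffix.length, graphemeLength(suffix));
}
```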

connorjclark added 6 commits March 16, 2023 19:25

core(page-functions): correctly truncate unicode strings

54c9e3f

move to shared util

201d506

edge case

7f8a37d

update other places

0be3408

unminifiedjs

d997981

undo tsconfig target change

de6a394

connorjclark requested a review from a team as a code owner March 21, 2023 18:00

connorjclark requested review from brendankenny and removed request for a team March 21, 2023 18:00

devtools-bot assigned brendankenny Mar 21, 2023

devtools-bot added the waiting4reviewer label Mar 21, 2023

adamraine reviewed Mar 21, 2023

fix

2407a0c

vercel bot deployed to Preview March 21, 2023 19:02

fix

da726e9

vercel bot deployed to Preview March 21, 2023 21:24

paulirish mentioned this pull request Mar 21, 2023

tab OOM crash during lighthouse run #14914

Open

paulirish reviewed Mar 21, 2023

shared/test/util-test.js

Update shared/test/util-test.js

4203cc9

Co-authored-by: Paul Irish <paulirish@users.noreply.github.com>

vercel bot deployed to Preview March 21, 2023 23:14

adamraine approved these changes Mar 22, 2023

brendankenny reviewed Mar 22, 2023

connorjclark added 3 commits March 22, 2023 11:28

warmUpIntlSegmenter

b709724

pr

14fcf2a

Merge branch 'truncate-better' of github.com:GoogleChrome/lighthouse into truncate-better

5893c35

vercel bot deployed to Preview March 22, 2023 18:43

connorjclark merged commit 90797a1 into main Mar 22, 2023

connorjclark deleted the truncate-better branch March 22, 2023 19:45

connorjclark mentioned this pull request Apr 2, 2024

misc(proto): ensure all strings are well-formed #15909

Merged
