Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: correctly truncate unicode strings #14911

Merged
merged 12 commits into from
Mar 22, 2023
Merged

core: correctly truncate unicode strings #14911

merged 12 commits into from
Mar 22, 2023

Conversation

connorjclark
Copy link
Collaborator

@connorjclark connorjclark commented Mar 21, 2023

In various places, we truncate web strings using .slice, which can result in eliding in the middle of a grapheme (visual character). This is typically fine for JSON serialization (it just won't look great), but in PSI there is a serialization step that breaks under this input.

This PR improves all the places (but maybe I missed some?) that truncate strings that end up in the LHR to instead elide using Intl.Segmenter. It also changes the result a bit, by considering the suffix added to the truncated strings as part of the max character length - and also, using an ellipse in some places that instead used ....

reprise of #11697 #11698

fixes #14897

@connorjclark connorjclark requested a review from a team as a code owner March 21, 2023 18:00
@connorjclark connorjclark requested review from brendankenny and removed request for a team March 21, 2023 18:00
core/lib/page-functions.js Outdated Show resolved Hide resolved
core/audits/byte-efficiency/unminified-javascript.js Outdated Show resolved Hide resolved
} else if (firstRuleEnd < PREVIEW_LENGTH) {
// The entire first rule-set fits within the preview
preview = preview.slice(0, firstRuleEnd + 1) + ' ...';
preview = preview.slice(0, firstRuleEnd + 1) + ' ';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use Util.truncate(0, firstRuleEnd, ' …') here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because firstRuleEnd is a specific location in the string, we should not truncate by characters here.

* @typedef {{sentinel: '__ErrorSentinel', message: string, code?: string, stack?: string}} SerializedBaseError
*/

class LighthouseError extends Error {
/**
* @param {LighthouseErrorDefinition} errorDefinition
* @param {Record<string, string|undefined>=} properties
* @param {Record<string, unknown>=} properties
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change necessary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initially I needed this change when the base tsconfig target was updated to 2022, reverted in de6a394 . the base tsconfig didn't need the change anymore after I moved the truncate function to shared/util.js (it was in page-functions.js). Might as well keep this more correct type, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a tsc bug that spreading unknown down on L125 doesn't result in an error, though. properties are ICU placeholder values and the constructor shouldn't allow anything but strings through if at all possible, so ideally we can hold off on using unknown here unless it's absolutely required.

core/lib/page-functions.js Outdated Show resolved Hide resolved
@brendankenny
Copy link
Member

My only concern is that Intl.Segmenter is quite slow compared to iterating on string code points. It'll also have the very slow ICU startup cost (like what necessitated caching Intl.NumberFormaters), but hopefully that's being caught by the V8 caching of ICU init with basic options?

So for that cost, do we really need this for the handful of cases where grapheme clusters are significant (thanks Mathias for the term)? Most unicode will be fine, this is mostly down in html attributes and other non-prominent strings, and it'll just be strings like 👨‍👨‍👦‍👦 that might end split in the middle if they happen to end up at the length limit (though still on unicode code points, so it won't have any issue with bad string encoding).

@paulirish
Copy link
Member

My only concern is that Intl.Segmenter is quite slow compared to iterating on string code points. It'll also have the very slow ICU startup cost (like what necessitated caching Intl.NumberFormaters), but hopefully that's being caught by the V8 caching of ICU init with basic options?

i was skeptical too. numberformat has been a pain for us.

I was able to capture a profile of just axe + getnodedetails (in this branch): https://trace.cafe/t/Rd8274lf8y
fwiw: getnodedetails ran on 39 elems in this case (https://www.kissmyparcel.com/retail/)
I saw truncate take 14ms on the first time, but then it was fast on all subsequent invocations. Also, on my second/third traces i couldn't repro this large instantiate cost, but I did on my fourth. shrug. some weird caching presumably.

i made this snippet to just test from within devtools: https://gist.github.com/paulirish/e46ec350be36cd23047c350602c47435

given what i'm seeing here.. so far I feel okay about the perf side. 14ms isn't nothing, but.. it seems like it's a one-time cost?

Co-authored-by: Paul Irish <paulirish@users.noreply.github.com>
@brendankenny
Copy link
Member

given what i'm seeing here.. so far I feel okay about the perf side. 14ms isn't nothing, but.. it seems like it's a one-time cost?

That could be the ICU startup time, and if that is the case, seems like V8 is caching it for us, which is great. We should be wary of ever doing this for the first time during tracing.

If it looks good on perf, great to be doing the best segmenting here!

@@ -8,6 +8,7 @@ import {ByteEfficiencyAudit} from './byte-efficiency-audit.js';
import * as i18n from '../../lib/i18n/i18n.js';
import {computeJSTokenLength as computeTokenLength} from '../../lib/minification-estimator.js';
import {getRequestForScript, isInline} from '../../lib/script-helpers.js';
import {Util} from '../../../shared/util.js';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels weird that this isn't a direct export

core/computed/unused-css.js Outdated Show resolved Hide resolved
@@ -101,16 +101,16 @@ class UnusedCSS {
firstRuleStart > firstRuleEnd ||
firstRuleStart > PREVIEW_LENGTH) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if PREVIEW_LENGTH is in graphemes, it won't be comparable to preview.length, firstRuleStart, firstRuleEnd, etc. Not sure how much we care, but definitely some wonkiness in here as a result (e.g. firstRuleStart could be greater than PREVIEW_LENGTH in code points but less than PREVIEW_LENGTH in graphemes), which makes the code harder to understand/alter in the future.

* @typedef {{sentinel: '__ErrorSentinel', message: string, code?: string, stack?: string}} SerializedBaseError
*/

class LighthouseError extends Error {
/**
* @param {LighthouseErrorDefinition} errorDefinition
* @param {Record<string, string|undefined>=} properties
* @param {Record<string, unknown>=} properties
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a tsc bug that spreading unknown down on L125 doesn't result in an error, though. properties are ICU placeholder values and the constructor shouldn't allow anything but strings through if at all possible, so ideally we can hold off on using unknown here unless it's absolutely required.

core/lib/page-functions.js Outdated Show resolved Hide resolved
@@ -138,7 +138,7 @@ class UrlUtils {
static elideDataURI(url) {
try {
const parsed = new URL(url);
return parsed.protocol === 'data:' ? url.slice(0, 100) : url;
return parsed.protocol === 'data:' ? Util.truncate(url, 100) : url;
Copy link
Member

@brendankenny brendankenny Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW data URLs can only contain ascii characters

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but at least inclusion of the ellipse suffix is an improvement here.

shared/util.js Show resolved Hide resolved
shared/util.js Outdated Show resolved Hide resolved
const iterator = segmenter.segment(string)[Symbol.iterator]();

let lastSegment;
for (let i = 0; i <= characterLimit - ellipseSuffix.length; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically ellipseSuffix.length might not be grapheme length, but I guess we can assume we won't pass in suffixes without a 1:1 code point:grapheme ratio? :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PSI: Oops! Something went wrong. unicode truncation json parse problem
5 participants