Add documentation on v0 symbol mangling. #97571

ehuss · 2022-05-31T00:12:11Z

This adds official documentation for the v0 symbol mangling format, migrating the documentation from RFC 2603.
The format was originally stabilized as the -C symbol-mangling-version option, but the specifics were not stabilized (per #90128 (comment)).
Per the discussion at #93661 (comment) this adds those specifics as an official description of the format.

cc #89917

rust-highfive · 2022-05-31T00:12:14Z

r? @GuillaumeGomez

(rust-highfive has picked a reviewer for you, use r? to override)

ehuss · 2022-05-31T00:13:03Z

Before this is merged, there are a variety of questions to resolve:

Is the compiler team ready for this?
Are there any open questions about the format?
How is it intended to extend this format in the future? For example, if additional const types are stabilized, can v0 just be extended to add them? What constitutes a need for a v1 format? Which parts of v0 can be extended at will?
This currently only documents stable parts of the language. Should unstable things like additional const generic types be included now, or just when they are stabilized?
The base62 encoding scheme subtly differs from common base62 (or any baseNN) encoding scheme by swapping the uppercase/lowercase letters. Was that intended?
I can't figure out a way to create a symbol with a placeholder (p) const. How can one be created?
I left a FIXME in the text for wasm-specific behavior. I'm not sure if that should be included?
Should there be a more detailed explanation of the mapping of how Rust entities are encoded?

Deviations from the RFC:

I defined <bytes> for identifier as UTF-8. Should it be defined as arbitrary bytes instead? rustc_demangle for example does not support non-utf-8.
I specified a 64-bit upper-bound for <base-62-number> so that demanglers can have a well-defined limit (and that is how rustc-demangle works). Should I keep that, or should demanglers be prepared for larger values? Removed the limit.
I added more detail on the encoding, where the RFC was under-specified. I did my best to make it correct but not over- or under-specified, though it wouldn't hurt for a careful review. For example:
- fn ABI string converts dashes to underscores.
- base62 encoding
- The format for <decimal-number>.
I added more detailed recommended demangling (based on rustc-demangle).
Adjusted the grammar for decimal-number to make it clear it cannot have leading zeroes.

cc @eddyb @Mark-Simulacrum

eddyb · 2022-05-31T08:58:40Z

Shouldn't this go in the reference? And then have the CLI option link to that?

I don't think we describe language concerns in the rustc CLI docs, and the "Rust v0" mangling is supposed to be the first official Rust language mangling, unlike the rustc-specific "legacy" one.

r? @michaelwoerister cc @lqd

eddyb · 2022-05-31T10:09:24Z

I'm going to try to do a point-by-point reply to #97571 (comment) (but some of this feels like it's going over things discussed before the RFC was accepted? I suppose those things tend to get lost to time)

Is the compiler team ready for this?

Not speaking for anyone else but I've been operating under the assumption that the RFC itself is the documentation (though it seems reasonable for it to be accessible through the reference, not sure why we didn't just do that), and the main thing we've been waiting on is upstreaming into non-Rust tooling.
Are there any open questions about the format?

Can't think of anything other than the discussion in rust-lang/rfcs#3161 (which hasn't been merged so I assume it's not part of this PR)
How is it intended to extend this format in the future? For example, if additional const types are stabilized, can v0 just be extended to add them? What constitutes a need for a v1 format? Which parts of v0 can be extended at will?

Generally, as long as leave gaps in the grammar (i.e. unused "letters" in each of 3-4 grammar categories) we can always extend it - by a quick count we have:
- 7 for path/type (uppercase): GHJLUVW
- 5 for type (lowercase): gkqrw
- 23 for "well-known" namespace (uppercase): everything except C (closure), S (shim) and I (impl, but only in a specific polymorphization situation).
- 21 for implementation-defined namespace (lowercase): this one doesn't really matter, since we just assign them in the compiler and the demangler doesn't have to know about them - worst case we can start putting information into the low bits of the disambiguator if we ever run out
For constants, rust-lang/rfcs#3161 both supports the vast majority of types that we could ever have in const generics, and offers (in last paragraph of the PR description) a generalized fallback we could always do in other cases (without updating demanglers at all).

As for v1, that would require backwards-incompatible changes, i.e. repurposing characters in ways that could accidentally be incorrectly decoded as unrelated v0 manglings. If I had to guess, this would likely a binary format, usable on platforms that do not have the charset limitations, and may even include some kind of compression using a standardized dictionary (zstd was mentioned during the original v0 RFC).
This currently only documents stable parts of the language. Should unstable things like additional const generic types be included now, or just when they are stabilized?

rust-lang/rfcs#3161 isn't even accepted yet so I don't think there's much of a point in including it now. OTOH, rustc-demangle does implement it, so it might be more of a matter of what we do by default?
The base62 encoding scheme subtly differs from common base62 (or any baseNN) encoding scheme by swapping the uppercase/lowercase letters. Was that intended?

Not sure, I think that was all chosen by @michaelwoerister before I got involved, and I kept it. Though I would've naively assumed lowercase goes before uppercase in any base, so I wouldn't trust myself to have gotten it right either. _{(Actually, doesn't e.g. base64 look nothing like hex and instead makes A the lowest value digit, so that a bunch of zero bytes turn into AAAAA... in base64? So it's not just the order of lowercase and uppercase)}
I can't figure out a way to create a symbol with a placeholder (p) const. How can one be created?

It should be the same as with the type placeholder: impl generics that do not actually apply to the entity being described, like a static, which is never monomorphized by parameters in scope (e.g. mangled / demangled - I...pKpE is ...<_, _>, a type and a const placeholder).
I left a FIXME in the text for wasm-specific behavior. I'm not sure if that should be included?

Maybe, in the general sense of linker symbol documentation, since wasm has a somewhat unique two-level system, but it wouldn't be relevant to a v0 explanation (which IMO should have a clearly marked document of its own, not mixed with other linker symbol concerns like #[no_mangle] and whatnot).
Should there be a more detailed explanation of the mapping of how Rust entities are encoded?

I don't think so, the format isn't designed to be reproducible between two producers, it's more of a "heterogeneous syntax serialization" (honestly we could've probably deduplicated more, or chosen better letters if more time was spent on that front alone, but it's not too bad).

Deviations from the RFC:
I defined <bytes> for identifier as UTF-8. Should it be defined as arbitrary bytes instead? rustc_demangle for example does not support non-utf-8.

The whole charset is _0-9a-zA-Z (ident-wise, the subset of XID_Start+XID_Continue that overlaps ASCII) so I think <bytes> was just used as a shorthand for that - in theory, it could be UTF-8, but punycode was added specifically so that we could have Unicode w/o UTF-8.
I specified a 64-bit upper-bound for <base-62-number> so that demanglers can have a well-defined limit (and that is how rustc-demangle works). Should I keep that, or should demanglers be prepared for larger values?

Demanglers can have their own implementation limits, I don't think we should put that in the format.
E.g. rustc-demangle also has like a [char; 128] limit for punycode (because it's #![no_std] so it can't allocate a Vec<char>), and demangling still succeeds, but that overlong identifier is printed out in the standard punycode syntax (used by DNS, primarily).

For encoding constants I went out of my way to allow "just dump the mangled form" by having it be hex so it just needs a 0x prefix printed (at the cost of compactness compared to base62).

Honestly, the only reason I can think for disambiguators using base62 is the crate disambiguator hash, and even then, unless it's actually printed, fully decoding it into a number shouldn't be necessary (so rustc-demangle could be updated to make the limit "softer").
I added more detail on the encoding, where the RFC was under-specified. I did my best to make it correct but not over- or under-specified, though it wouldn't hurt for a careful review. For example:
- fn ABI string converts dashes to underscores.
- base62 encoding
- The format for <decimal-number>.
I added more detailed recommended demangling (based on rustc-demangle).

For both of these, I'm tempted to say the original RFC should be amended, since more features aren't added, just existing information corrected - there is however some discussion on an existing example of such an amendment so maybe I'm wrong.

ehuss · 2022-06-01T21:22:15Z

Thank you for the detailed response!

Shouldn't this go in the reference? And then have the CLI option link to that?

Hm, my impression was the opposite, that this is not a part of the language and just an implementation-specific concern. The RFC was accepted by the compiler team, not the lang team. And it isn't really clear what consequence it has for the language if it isn't to be used as a stable ABI or have well-defined translation from Rust entities.

I can imagine in the future that this could potentially provide the basis for a stable ABI, but that seems very far out, no?

v0 explanation (which IMO should have a clearly marked document of its own, not mixed with other linker symbol concerns like #[no_mangle] and whatnot)

Yea, I waffled a bit on how to organize it. I decided to move it to a separate chapter.

Demanglers can have their own implementation limits, I don't think we should put that in the format.

OK, I removed the limit.

Generally, as long as leave gaps in the grammar (i.e. unused "letters" in each of 3-4 grammar categories) we can always extend it

I can understand how something like namespace or basic-type could be extended, but if any of the other open tags were extended, a demangler would have no way to know how to interpret anything that follows. How would that be different from any other backwards-incompatible change? The end result is that an older demangler would be unable to demangle the symbol.

fbstj · 2022-06-01T21:27:47Z

src/doc/rustc/src/symbol-mangling/index.md

+```
+
+[`rustc-demangle`]: https://crates.io/crates/rustc-demangle
+[`rustfilt`]: https://crates.io/crates/rustfilt


is recommending tools via crates.io the norm?

There's about a hundred references to crates via crates.io in the current documentation. Is there some other way you would recommend? We could link directly to the GitHub project, but I figure the crates.io page provides links to GitHub which should be good to find it, and the crates.io links should be somewhat more permanent.

ahh well yeah looking at that page in-particular it seems like they "recommend" using cargo install rather than having binaries/installers. I was specifically meaning tools rather than crates / libraries but yeah this crates pages is probably the most sensible page for this tool

michaelwoerister · 2022-06-03T08:23:36Z

Wow, that is some extensive documentation 😀
I'll take a closer look next week.

The base62 encoding scheme subtly differs from common base62 (or any baseNN) encoding scheme by swapping the uppercase/lowercase letters. Was that intended?

Nope, there is no deeper meaning to that. Would have been nice to catch that during RFC review, but I guess now that there are demanglers in external tools, it's too late to change it.

eddyb · 2022-06-04T00:28:08Z

Shouldn't this go in the reference? And then have the CLI option link to that?

Hm, my impression was the opposite, that this is not a part of the language and just an implementation-specific concern. The RFC was accepted by the compiler team, not the lang team. And it isn't really clear what consequence it has for the language if it isn't to be used as a stable ABI or have well-defined translation from Rust entities.

The exact mapping of source code to symbols is implementation-specific, but any Rust implementation would immediately benefit from generating v0 symbols, as tools exist to demangle them back into Rust syntax, while the format also is able to hide enough redundancy to cover various additional sources of "identity" not explicitly covered otherwise.

I agree that in retrospect the RFC being strictly compiler team is a bit weird, might be a good idea to poll @rust-lang/lang on how they feel about treating the v0 symbol mangling the way I described ("officially a Rust mangling scheme").
I can imagine in the future that this could potentially provide the basis for a stable ABI, but that seems very far out, no?

Probably only in the vague sense that it's the first sketch we have of being able to serialize what we might consider "definition identity". Definitely agree on far out.

Earliest similar thing I can think of would be #95845, if we stabilized a method on TypeId to actually get the mangled string (or more likely something that can be decompressed into one), but that PR is not looking favorable right now so that might never happen.
Generally, as long as leave gaps in the grammar (i.e. unused "letters" in each of 3-4 grammar categories) we can always extend it

I can understand how something like namespace or basic-type could be extended, but if any of the other open tags were extended, a demangler would have no way to know how to interpret anything that follows. How would that be different from any other backwards-incompatible change? The end result is that an older demangler would be unable to demangle the symbol.

Ah I see now. I think there's two levels of incompatibility (not sure how to describe each):
- symbols using some new feature require new (e.g. type) encoding in mangling
  - only affects code using the new feature, not existing code
  - getting mangling support out there can be part of stabilization
  - we do have the fallback option of choosing a less-appealing-after-demangling encoding, because any syntactic category that can embed paths can represent arbitrary trees
    - e.g. A<B<C, D>, X<Y>> is no different from the sexpr (A (B C D) (X Y)), information-wise
- new symbols repurpose existing mangling syntax to mean something else
  - this is worse because demanglers can then get confused and "succeed" with the wrong output
  - IOW, this introduces ambiguity, unless some kind of "explicit version bump" is done

michaelwoerister · 2022-06-09T12:54:24Z

src/doc/rustc/src/symbol-mangling/v0.md

+> inherent-impl → `M` *[impl-path]* *[type]*
+
+An *inherent-impl* indicates a path to an [inherent implementation][reference-inherent-impl].
+It consists of the character `M` followed by an *[impl-path]* to the impl's parent followed by the *[type]* representing the `Self` type of the impl.


I think this should be something like

It consists of the character M followed by an [impl-path], which uniquely identifies the impl block the item is defined in.

impl-path is the parent of the impl plus the index of the impl within the parent. We just don't emit the index (disambiguator) if it's equal to zero.

michaelwoerister · 2022-06-09T12:56:14Z

src/doc/rustc/src/symbol-mangling/v0.md

+>     │└─────────┬──────────┘└────┬──────┘
+>     │          │                │
+>     │          │                └── Self type "Example"
+>     │          └─────────────────── path to the impl's parent "mycrate"


This example should probably contain a disambiguator for the impl-path to make it more representative. Most impl-paths will have one.

michaelwoerister · 2022-06-09T13:04:50Z

src/doc/rustc/src/symbol-mangling/v0.md

+followed by a *[path]* which is a path representing the parent of the entity,
+followed by an *[identifier]* of the entity.
+
+The identifier of the entity may be empty when the entity is not named.


That's not quite accurate, I think. The string part of the identifier is empty, but there still would be the disambiguator (unless it's zero).

I have tweaked the wording to make try to make it clearer. I wasn't trying to say that there is no identifier, just that its length is zero.

michaelwoerister · 2022-06-09T13:26:16Z

src/doc/rustc/src/symbol-mangling/v0.md

+This helps differentiate symbols that would otherwise be identical,
+for example the monomorphization of a function from an external crate may result in a duplicate if another crate is also instantiating the same generic function with the same types.
+
+In practice, the instantiating crate is also the crate where the symbol is defined,


Suggested change

In practice, the instantiating crate is also the crate where the symbol is defined,

In practice, the instantiating crate is also often the crate where the symbol is defined,

michaelwoerister · 2022-06-09T13:36:21Z

Great docs, @ehuss! I left some comments in a few places but that's mostly nit-picking.

Demanglers should be prepared for any arbitrary length number.

eddyb · 2022-06-10T02:51:45Z

@michaelwoerister Do any of these comments apply to anything taken from the RFC text? It would be nice to fix them there (since IMO a diff of the RFC and reference should mostly have large-scale reorganization and additions, not e.g. spelling fixes in copied text).

src/doc/rustc/src/symbol-mangling/v0.md

michaelwoerister · 2022-06-14T08:50:32Z

@michaelwoerister Do any of these comments apply to anything taken from the RFC text? It would be nice to fix them there (since IMO a diff of the RFC and reference should mostly have large-scale reorganization and additions, not e.g. spelling fixes in copied text).

No, I don't think so.

rust-highfive assigned GuillaumeGomez May 31, 2022

rust-highfive added the S-waiting-on-review label May 31, 2022

This comment has been hidden.

Sign in to view

ehuss force-pushed the symbol-mangling branch from 8143f5e to 2fc1d10 Compare May 31, 2022

rust-highfive assigned michaelwoerister and unassigned GuillaumeGomez May 31, 2022

fbstj reviewed Jun 1, 2022

View changes

michaelwoerister reviewed Jun 9, 2022

View changes

ehuss added 4 commits Jun 10, 2022

Add documentation on v0 symbol mangling.

453b980

Rearrange symbol-mangling chapter out of codegen-options.

befda9c

Remove 64-bit limit for base-62-numbers.

9d78b64

Demanglers should be prepared for any arbitrary length number.

Update from review from michaelwoerister.

875d18f

ehuss force-pushed the symbol-mangling branch from 8bd9d32 to 875d18f Compare Jun 10, 2022

tmiasko reviewed Jun 10, 2022

View changes

src/doc/rustc/src/symbol-mangling/v0.md Outdated Show resolved Hide resolved

ehuss added 3 commits Jun 10, 2022

Clarify missing tick.

644cbeb

Clarify grammar for decimal-number cannot have leading zeroes.

ae1cea5

Rewrite recommended demangling for lifetimes using "De Bruijn level".

63f305f

JohnCSimon added S-waiting-on-review and removed S-waiting-on-review labels Jul 3, 2022

Add documentation on v0 symbol mangling. #97571

Add documentation on v0 symbol mangling. #97571

ehuss commented May 31, 2022

rust-highfive commented May 31, 2022

ehuss commented May 31, 2022 •

edited

This comment has been hidden.

eddyb commented May 31, 2022

eddyb commented May 31, 2022

ehuss commented Jun 1, 2022

fbstj Jun 1, 2022

ehuss Jun 1, 2022

fbstj Jun 1, 2022

michaelwoerister commented Jun 3, 2022

eddyb commented Jun 4, 2022

michaelwoerister Jun 9, 2022

michaelwoerister Jun 9, 2022

michaelwoerister Jun 9, 2022

ehuss Jun 10, 2022

michaelwoerister Jun 9, 2022

michaelwoerister commented Jun 9, 2022

eddyb commented Jun 10, 2022

michaelwoerister commented Jun 14, 2022

	In practice, the instantiating crate is also the crate where the symbol is defined,
	In practice, the instantiating crate is also often the crate where the symbol is defined,

Add documentation on v0 symbol mangling. #97571

Are you sure you want to change the base?

Add documentation on v0 symbol mangling. #97571

Conversation

ehuss commented May 31, 2022

rust-highfive commented May 31, 2022

ehuss commented May 31, 2022 • edited

This comment has been hidden.

eddyb commented May 31, 2022

eddyb commented May 31, 2022

ehuss commented Jun 1, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelwoerister commented Jun 3, 2022

eddyb commented Jun 4, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelwoerister commented Jun 9, 2022

eddyb commented Jun 10, 2022

michaelwoerister commented Jun 14, 2022

ehuss commented May 31, 2022 •

edited