Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation on v0 symbol mangling. #97571

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

ehuss
Copy link
Contributor

@ehuss ehuss commented May 31, 2022

This adds official documentation for the v0 symbol mangling format, migrating the documentation from RFC 2603.
The format was originally stabilized as the -C symbol-mangling-version option, but the specifics were not stabilized (per #90128 (comment)).
Per the discussion at #93661 (comment) this adds those specifics as an official description of the format.

cc #89917

@rust-highfive
Copy link
Collaborator

@rust-highfive rust-highfive commented May 31, 2022

r? @GuillaumeGomez

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review label May 31, 2022
@ehuss
Copy link
Contributor Author

@ehuss ehuss commented May 31, 2022

Before this is merged, there are a variety of questions to resolve:

  • Is the compiler team ready for this?
  • Are there any open questions about the format?
  • How is it intended to extend this format in the future? For example, if additional const types are stabilized, can v0 just be extended to add them? What constitutes a need for a v1 format? Which parts of v0 can be extended at will?
  • This currently only documents stable parts of the language. Should unstable things like additional const generic types be included now, or just when they are stabilized?
  • The base62 encoding scheme subtly differs from common base62 (or any baseNN) encoding scheme by swapping the uppercase/lowercase letters. Was that intended?
  • I can't figure out a way to create a symbol with a placeholder (p) const. How can one be created?
  • I left a FIXME in the text for wasm-specific behavior. I'm not sure if that should be included?
  • Should there be a more detailed explanation of the mapping of how Rust entities are encoded?

Deviations from the RFC:

  • I defined <bytes> for identifier as UTF-8. Should it be defined as arbitrary bytes instead? rustc_demangle for example does not support non-utf-8.
  • I specified a 64-bit upper-bound for <base-62-number> so that demanglers can have a well-defined limit (and that is how rustc-demangle works). Should I keep that, or should demanglers be prepared for larger values? Removed the limit.
  • I added more detail on the encoding, where the RFC was under-specified. I did my best to make it correct but not over- or under-specified, though it wouldn't hurt for a careful review. For example:
    • fn ABI string converts dashes to underscores.
    • base62 encoding
    • The format for <decimal-number>.
  • I added more detailed recommended demangling (based on rustc-demangle).
  • Adjusted the grammar for decimal-number to make it clear it cannot have leading zeroes.

cc @eddyb @Mark-Simulacrum

@rust-log-analyzer

This comment has been hidden.

@eddyb
Copy link
Member

@eddyb eddyb commented May 31, 2022

Shouldn't this go in the reference? And then have the CLI option link to that?

I don't think we describe language concerns in the rustc CLI docs, and the "Rust v0" mangling is supposed to be the first official Rust language mangling, unlike the rustc-specific "legacy" one.

r? @michaelwoerister cc @lqd

@eddyb
Copy link
Member

@eddyb eddyb commented May 31, 2022

I'm going to try to do a point-by-point reply to #97571 (comment) (but some of this feels like it's going over things discussed before the RFC was accepted? I suppose those things tend to get lost to time)

  • Is the compiler team ready for this?

    Not speaking for anyone else but I've been operating under the assumption that the RFC itself is the documentation (though it seems reasonable for it to be accessible through the reference, not sure why we didn't just do that), and the main thing we've been waiting on is upstreaming into non-Rust tooling.

  • Are there any open questions about the format?

    Can't think of anything other than the discussion in rust-lang/rfcs#3161 (which hasn't been merged so I assume it's not part of this PR)

  • How is it intended to extend this format in the future? For example, if additional const types are stabilized, can v0 just be extended to add them? What constitutes a need for a v1 format? Which parts of v0 can be extended at will?

    Generally, as long as leave gaps in the grammar (i.e. unused "letters" in each of 3-4 grammar categories) we can always extend it - by a quick count we have:

    • 7 for path/type (uppercase): GHJLUVW
    • 5 for type (lowercase): gkqrw
    • 23 for "well-known" namespace (uppercase): everything except C (closure), S (shim) and I (impl, but only in a specific polymorphization situation).
    • 21 for implementation-defined namespace (lowercase): this one doesn't really matter, since we just assign them in the compiler and the demangler doesn't have to know about them - worst case we can start putting information into the low bits of the disambiguator if we ever run out

    For constants, rust-lang/rfcs#3161 both supports the vast majority of types that we could ever have in const generics, and offers (in last paragraph of the PR description) a generalized fallback we could always do in other cases (without updating demanglers at all).

    As for v1, that would require backwards-incompatible changes, i.e. repurposing characters in ways that could accidentally be incorrectly decoded as unrelated v0 manglings. If I had to guess, this would likely a binary format, usable on platforms that do not have the charset limitations, and may even include some kind of compression using a standardized dictionary (zstd was mentioned during the original v0 RFC).

  • This currently only documents stable parts of the language. Should unstable things like additional const generic types be included now, or just when they are stabilized?

    rust-lang/rfcs#3161 isn't even accepted yet so I don't think there's much of a point in including it now. OTOH, rustc-demangle does implement it, so it might be more of a matter of what we do by default?

  • The base62 encoding scheme subtly differs from common base62 (or any baseNN) encoding scheme by swapping the uppercase/lowercase letters. Was that intended?

    Not sure, I think that was all chosen by @michaelwoerister before I got involved, and I kept it. Though I would've naively assumed lowercase goes before uppercase in any base, so I wouldn't trust myself to have gotten it right either. (Actually, doesn't e.g. base64 look nothing like hex and instead makes A the lowest value digit, so that a bunch of zero bytes turn into AAAAA... in base64? So it's not just the order of lowercase and uppercase)

  • I can't figure out a way to create a symbol with a placeholder (p) const. How can one be created?

    It should be the same as with the type placeholder: impl generics that do not actually apply to the entity being described, like a static, which is never monomorphized by parameters in scope (e.g. mangled / demangled - I...pKpE is ...<_, _>, a type and a const placeholder).

  • I left a FIXME in the text for wasm-specific behavior. I'm not sure if that should be included?

    Maybe, in the general sense of linker symbol documentation, since wasm has a somewhat unique two-level system, but it wouldn't be relevant to a v0 explanation (which IMO should have a clearly marked document of its own, not mixed with other linker symbol concerns like #[no_mangle] and whatnot).

  • Should there be a more detailed explanation of the mapping of how Rust entities are encoded?

    I don't think so, the format isn't designed to be reproducible between two producers, it's more of a "heterogeneous syntax serialization" (honestly we could've probably deduplicated more, or chosen better letters if more time was spent on that front alone, but it's not too bad).

    Deviations from the RFC:

  • I defined <bytes> for identifier as UTF-8. Should it be defined as arbitrary bytes instead? rustc_demangle for example does not support non-utf-8.

    The whole charset is _0-9a-zA-Z (ident-wise, the subset of XID_Start+XID_Continue that overlaps ASCII) so I think <bytes> was just used as a shorthand for that - in theory, it could be UTF-8, but punycode was added specifically so that we could have Unicode w/o UTF-8.

  • I specified a 64-bit upper-bound for <base-62-number> so that demanglers can have a well-defined limit (and that is how rustc-demangle works). Should I keep that, or should demanglers be prepared for larger values?

    Demanglers can have their own implementation limits, I don't think we should put that in the format.
    E.g. rustc-demangle also has like a [char; 128] limit for punycode (because it's #![no_std] so it can't allocate a Vec<char>), and demangling still succeeds, but that overlong identifier is printed out in the standard punycode syntax (used by DNS, primarily).

    For encoding constants I went out of my way to allow "just dump the mangled form" by having it be hex so it just needs a 0x prefix printed (at the cost of compactness compared to base62).

    Honestly, the only reason I can think for disambiguators using base62 is the crate disambiguator hash, and even then, unless it's actually printed, fully decoding it into a number shouldn't be necessary (so rustc-demangle could be updated to make the limit "softer").

  • I added more detail on the encoding, where the RFC was under-specified. I did my best to make it correct but not over- or under-specified, though it wouldn't hurt for a careful review. For example:

    • fn ABI string converts dashes to underscores.
    • base62 encoding
    • The format for <decimal-number>.
  • I added more detailed recommended demangling (based on rustc-demangle).

    For both of these, I'm tempted to say the original RFC should be amended, since more features aren't added, just existing information corrected - there is however some discussion on an existing example of such an amendment so maybe I'm wrong.

@ehuss
Copy link
Contributor Author

@ehuss ehuss commented Jun 1, 2022

Thank you for the detailed response!

Shouldn't this go in the reference? And then have the CLI option link to that?

Hm, my impression was the opposite, that this is not a part of the language and just an implementation-specific concern. The RFC was accepted by the compiler team, not the lang team. And it isn't really clear what consequence it has for the language if it isn't to be used as a stable ABI or have well-defined translation from Rust entities.

I can imagine in the future that this could potentially provide the basis for a stable ABI, but that seems very far out, no?

v0 explanation (which IMO should have a clearly marked document of its own, not mixed with other linker symbol concerns like #[no_mangle] and whatnot)

Yea, I waffled a bit on how to organize it. I decided to move it to a separate chapter.

Demanglers can have their own implementation limits, I don't think we should put that in the format.

OK, I removed the limit.

Generally, as long as leave gaps in the grammar (i.e. unused "letters" in each of 3-4 grammar categories) we can always extend it

I can understand how something like namespace or basic-type could be extended, but if any of the other open tags were extended, a demangler would have no way to know how to interpret anything that follows. How would that be different from any other backwards-incompatible change? The end result is that an older demangler would be unable to demangle the symbol.

```

[`rustc-demangle`]: https://crates.io/crates/rustc-demangle
[`rustfilt`]: https://crates.io/crates/rustfilt
Copy link
Contributor

@fbstj fbstj Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is recommending tools via crates.io the norm?

Copy link
Contributor Author

@ehuss ehuss Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's about a hundred references to crates via crates.io in the current documentation. Is there some other way you would recommend? We could link directly to the GitHub project, but I figure the crates.io page provides links to GitHub which should be good to find it, and the crates.io links should be somewhat more permanent.

Copy link
Contributor

@fbstj fbstj Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh well yeah looking at that page in-particular it seems like they "recommend" using cargo install rather than having binaries/installers. I was specifically meaning tools rather than crates / libraries but yeah this crates pages is probably the most sensible page for this tool

@michaelwoerister
Copy link
Member

@michaelwoerister michaelwoerister commented Jun 3, 2022

Wow, that is some extensive documentation 😀
I'll take a closer look next week.

The base62 encoding scheme subtly differs from common base62 (or any baseNN) encoding scheme by swapping the uppercase/lowercase letters. Was that intended?

Nope, there is no deeper meaning to that. Would have been nice to catch that during RFC review, but I guess now that there are demanglers in external tools, it's too late to change it.

@eddyb
Copy link
Member

@eddyb eddyb commented Jun 4, 2022

  • Shouldn't this go in the reference? And then have the CLI option link to that?

    Hm, my impression was the opposite, that this is not a part of the language and just an implementation-specific concern. The RFC was accepted by the compiler team, not the lang team. And it isn't really clear what consequence it has for the language if it isn't to be used as a stable ABI or have well-defined translation from Rust entities.

    The exact mapping of source code to symbols is implementation-specific, but any Rust implementation would immediately benefit from generating v0 symbols, as tools exist to demangle them back into Rust syntax, while the format also is able to hide enough redundancy to cover various additional sources of "identity" not explicitly covered otherwise.

    I agree that in retrospect the RFC being strictly compiler team is a bit weird, might be a good idea to poll @rust-lang/lang on how they feel about treating the v0 symbol mangling the way I described ("officially a Rust mangling scheme").

  • I can imagine in the future that this could potentially provide the basis for a stable ABI, but that seems very far out, no?

    Probably only in the vague sense that it's the first sketch we have of being able to serialize what we might consider "definition identity". Definitely agree on far out.

    Earliest similar thing I can think of would be #95845, if we stabilized a method on TypeId to actually get the mangled string (or more likely something that can be decompressed into one), but that PR is not looking favorable right now so that might never happen.

  • Generally, as long as leave gaps in the grammar (i.e. unused "letters" in each of 3-4 grammar categories) we can always extend it

    I can understand how something like namespace or basic-type could be extended, but if any of the other open tags were extended, a demangler would have no way to know how to interpret anything that follows. How would that be different from any other backwards-incompatible change? The end result is that an older demangler would be unable to demangle the symbol.

    Ah I see now. I think there's two levels of incompatibility (not sure how to describe each):

    • symbols using some new feature require new (e.g. type) encoding in mangling
      • only affects code using the new feature, not existing code
      • getting mangling support out there can be part of stabilization
      • we do have the fallback option of choosing a less-appealing-after-demangling encoding, because any syntactic category that can embed paths can represent arbitrary trees
        • e.g. A<B<C, D>, X<Y>> is no different from the sexpr (A (B C D) (X Y)), information-wise
    • new symbols repurpose existing mangling syntax to mean something else
      • this is worse because demanglers can then get confused and "succeed" with the wrong output
      • IOW, this introduces ambiguity, unless some kind of "explicit version bump" is done

> inherent-impl → `M` *[impl-path]* *[type]*

An *inherent-impl* indicates a path to an [inherent implementation][reference-inherent-impl].
It consists of the character `M` followed by an *[impl-path]* to the impl's parent followed by the *[type]* representing the `Self` type of the impl.
Copy link
Member

@michaelwoerister michaelwoerister Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be something like

It consists of the character M followed by an [impl-path], which uniquely identifies the impl block the item is defined in.

impl-path is the parent of the impl plus the index of the impl within the parent. We just don't emit the index (disambiguator) if it's equal to zero.

> │└─────────┬──────────┘└────┬──────┘
> │ │ │
> │ │ └── Self type "Example"
> │ └─────────────────── path to the impl's parent "mycrate"
Copy link
Member

@michaelwoerister michaelwoerister Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example should probably contain a disambiguator for the impl-path to make it more representative. Most impl-paths will have one.

followed by a *[path]* which is a path representing the parent of the entity,
followed by an *[identifier]* of the entity.

The identifier of the entity may be empty when the entity is not named.
Copy link
Member

@michaelwoerister michaelwoerister Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not quite accurate, I think. The string part of the identifier is empty, but there still would be the disambiguator (unless it's zero).

Copy link
Contributor Author

@ehuss ehuss Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tweaked the wording to make try to make it clearer. I wasn't trying to say that there is no identifier, just that its length is zero.

This helps differentiate symbols that would otherwise be identical,
for example the monomorphization of a function from an external crate may result in a duplicate if another crate is also instantiating the same generic function with the same types.

In practice, the instantiating crate is also the crate where the symbol is defined,
Copy link
Member

@michaelwoerister michaelwoerister Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In practice, the instantiating crate is also the crate where the symbol is defined,
In practice, the instantiating crate is also often the crate where the symbol is defined,

@michaelwoerister
Copy link
Member

@michaelwoerister michaelwoerister commented Jun 9, 2022

Great docs, @ehuss! I left some comments in a few places but that's mostly nit-picking.

@eddyb
Copy link
Member

@eddyb eddyb commented Jun 10, 2022

@michaelwoerister Do any of these comments apply to anything taken from the RFC text? It would be nice to fix them there (since IMO a diff of the RFC and reference should mostly have large-scale reorganization and additions, not e.g. spelling fixes in copied text).

@michaelwoerister
Copy link
Member

@michaelwoerister michaelwoerister commented Jun 14, 2022

@michaelwoerister Do any of these comments apply to anything taken from the RFC text? It would be nice to fix them there (since IMO a diff of the RFC and reference should mostly have large-scale reorganization and additions, not e.g. spelling fixes in copied text).

No, I don't think so.

@JohnCSimon JohnCSimon added S-waiting-on-review and removed S-waiting-on-review labels Jul 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants