
Implement a way to be able to debug cache misses that can be always on without too much overhead #18643

Open
lberki opened this issue Jun 12, 2023 · 20 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@lberki
Contributor

lberki commented Jun 12, 2023

Description of the feature request:

Sometimes, it's useful to be able to dig into why a particular action was not cached by remote execution.

The usual culprits are a changed input file, different input files, or a different command line. Bazel provides ample tools to query what it would run (aquery) and what it did (--execution_log_*_file or the BEP), but none of these is amenable to post-facto debugging:

  • aquery requires extra conscious effort
  • The execution logs are too large to be always uploaded
  • The BEP does not contain enough information

So it would be nice to be able to do some sort of logging to answer questions like "why did the build done two days ago not reuse the cached actions from that other build done 30 minutes before it?"

This doesn't necessarily need to be done within Bazel (although that's the particular hammer I have); I could also imagine logging on the remote execution side, where it would be recorded which actions are executed by which Bazel invocations.

What underlying problem are you trying to solve with this feature?

Being able to tell why a particular action was not cached so that future cache hit rates are higher.

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@lberki lberki added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jun 12, 2023
@coeuvre coeuvre added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed untriaged labels Jun 13, 2023
@tbaing
Contributor

tbaing commented Jun 13, 2023

Our exec logs turn out to be ~120GB (and ~150GB once run through the parser), so we'd appreciate anything that can significantly reduce the amount of output while still letting us answer questions about which actions needed a rebuild (rather than being reusable from cache) and which dependencies drove that need.

@lberki
Contributor Author

lberki commented Jun 14, 2023

cc @ramymedhat

@meisterT
Member

meisterT commented Jun 20, 2023

OK, I looked into this a bit and found some issues and some avenues.

Reducing the size

There are a couple of things we could do to reduce the size of the output:

  • use compression; zstd has proven most useful a couple of times, both because of its low CPU cost and its good compression ratio
  • use an indexed format for artifacts instead of listing each artifact in full wherever it appears as an input or output
  • instead of listing inputs as a flat list, include the nested set structure, again indexed; this is non-trivial and would incur additional CPU cost, so I would like to avoid it
  • a leaner proto with fewer fields populated (removing command lines, listed outputs, textual status, progress message)

I tried zstd compression outside of Bazel (but this can easily be done within Bazel as well if we decide this is the way to go) and also did some dirty hacks to understand the benefits of artifact indexing and a leaner proto.

large build:

what                                 without compression   zstd compressed
before                               99GB                  20GB
artifact indexing                    5.1GB                 610MB
artifact indexing plus lean fields   2.3GB                 450MB

medium build:

what                                 without compression   zstd compressed
before                               1.3GB                 65MB
artifact indexing                    217MB                 14MB
artifact indexing plus lean fields   47MB                  9MB

The path indexing contains the exact same information as before, but needs to be reconstructed before comparison. The lean format does leave out information, but most of it is probably not relevant; one exception is command lines, where I would store one high-level hash. You can still see that the command line has changed, and you can reconstruct it, e.g. using aquery. A tool could do this, see more below.
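
To make the artifact indexing idea concrete, here is a minimal sketch (in Python, purely illustrative and not Bazel's actual implementation; the zstd step assumes the python-zstandard package): each artifact is written once into a shared table and spawns refer to it by a small integer index, so a path that appears as an input to thousands of actions is stored only once.

    import json
    import zstandard  # pip package "zstandard"; an assumption, not part of Bazel

    artifact_index = {}  # (path, digest) -> integer id
    artifact_table = []  # id -> {"path": ..., "digest": ...}

    def intern_artifact(path, digest):
        key = (path, digest)
        if key not in artifact_index:
            artifact_index[key] = len(artifact_table)
            artifact_table.append({"path": path, "digest": digest})
        return artifact_index[key]

    def encode_spawn(mnemonic, inputs, outputs):
        # inputs/outputs are lists of (path, digest) pairs; only indices are stored per spawn
        return {
            "mnemonic": mnemonic,
            "inputs": [intern_artifact(p, d) for p, d in inputs],
            "outputs": [intern_artifact(p, d) for p, d in outputs],
        }

    spawns = [encode_spawn("Javac", [("a/Foo.java", "abc123")], [("a/Foo.class", "def456")])]
    log = {"artifacts": artifact_table, "spawns": spawns}
    compressed = zstandard.ZstdCompressor().compress(json.dumps(log).encode())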

Performance overhead

I noticed that action execution was quite a bit slower with the execution log turned on (regardless of which variant). I added more profiling to the JSON trace profile to better understand it (and will send a PR to add this permanently). For the large build from above, it took ~75 minutes with the exec log and ~7 minutes without.

I think we should move to a producer/consumer pattern where we put spawns to be processed into a queue and write the file in a separate thread. We will have to see how much memory overhead this pattern adds.

I am not sure whether that is enough to reduce the overhead to acceptable levels.
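
For illustration, a minimal sketch of the producer/consumer pattern described above (an assumed structure in Python, not Bazel's actual code): execution threads only enqueue an already-serialized record, and a single background thread does the writing, so file I/O stays off the critical path; the bounded queue caps the memory overhead.

    import queue
    import threading

    log_queue = queue.Queue(maxsize=10_000)  # bounded, to cap memory overhead
    _SENTINEL = object()

    def _writer(path):
        with open(path, "ab") as f:
            while True:
                record = log_queue.get()
                if record is _SENTINEL:
                    return
                f.write(record)  # record is already-serialized bytes

    writer = threading.Thread(target=_writer, args=("exec.log",), daemon=True)
    writer.start()

    def on_spawn_executed(serialized_record: bytes):
        # Called from action-execution threads; only blocks if the queue is full.
        log_queue.put(serialized_record)

    def finish_build():
        log_queue.put(_SENTINEL)
        writer.join()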

Different formats

We should use the opportunity to clean up the different logging options anyway. Currently we have:

  • --experimental_execution_log_file: “unordered” spawn protos
  • --execution_log_binary_file: first writes the file from above to /tmp, then parses it again and sorts it into a stable order, which is CPU- and memory-intensive (e.g. see improve usability of --experimental_execution_log_file #6807). In my opinion, this is not something that should live in Bazel and should be removed.
  • --execution_log_json_file: same as --execution_log_binary_file, and should also go away, at least in its current form with sorting. @aherrmann mentions in Documentation issue: Debugging remote cache hits for remote execution #12770 that the JSON format has the advantage of being usable with jq and being human-readable. So we might want to keep the option to output both binary and JSON around in some form.
  • --experimental_execution_log_spawn_metrics: adds spawn metrics to the profile. If we add an option to produce a leaner exec log as mentioned above, we could consider removing this option and always writing the metrics unless the lean log format is requested.

Perhaps it would also make sense to add an option to only write the exec log for specific actions (basically filtering on mnemonics and targets, see also #12075) and to only include high-level information for the other actions? This would help address not only size issues but also performance concerns.

How to compare protos

For large exec logs, a tool to compare them is essential. @coeuvre created https://github.com/coeuvre/bazel-find-non-deterministic-actions - it helps find non-deterministic actions, but is built on top of the JSON output. The tool would need to be changed to support the lean proto format.

Then we could also let the tool spit out other helpful commands like bazel aquery for further debugging.
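
As a rough sketch of what such a tool could do (Python; the field names listedOutputs, inputs, path and digest.hash are assumptions about the JSON-converted log and may not match the schema exactly): key each spawn by its first listed output and report which inputs changed between two builds.

    def index_spawns(spawns):
        # spawns: list of dicts parsed from a JSON-converted execution log
        return {s["listedOutputs"][0]: s for s in spawns if s.get("listedOutputs")}

    def diff_logs(old_spawns, new_spawns):
        old, new = index_spawns(old_spawns), index_spawns(new_spawns)
        for key in sorted(old.keys() & new.keys()):
            old_in = {i["path"]: i.get("digest", {}).get("hash") for i in old[key].get("inputs", [])}
            new_in = {i["path"]: i.get("digest", {}).get("hash") for i in new[key].get("inputs", [])}
            changed = sorted(p for p in old_in.keys() & new_in.keys() if old_in[p] != new_in[p])
            added = sorted(new_in.keys() - old_in.keys())
            removed = sorted(old_in.keys() - new_in.keys())
            if changed or added or removed:
                print(f"{key}: changed={changed} added={added} removed={removed}")

A tool along these lines could then print the corresponding bazel aquery invocation for each flagged spawn, as suggested above.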

Related issues

Next steps

I am aware that I am abusing this feature request a little bit to take a step back and understand the bigger picture. I want to give people some time to voice their opinions here before proceeding, so please let me know what you think.

copybara-service bot pushed a commit that referenced this issue Jun 21, 2023
This might help to understand when it is costly, see also #18643

PiperOrigin-RevId: 542194475
Change-Id: I85932d8a5d7a832ee4675722642deb675f7db81f
@sheldonneuberger-sc

@meisterT thanks for that detailed outline; it aligns with what I see too: massive exec logs (50GB+), and the build is slower when the exec log is on. I'd prefer to have exec logging always on, but this makes it unworkable.

@chancilasnap

We would prefer this data to be visible through BES somehow, be it as a CAS artifact (like the profile) or via special messages. The BES also prints actions (generally on failure only), but it can emit all actions in a slightly different format. It's annoying that the BES actions and exec log spawns don't match, since the exec log seems richer.

I noticed that action execution was quite a bit slower with the execution log turned on (regardless of which variant). I added more profiling to the JSON trace profile to better understand it (and will send a PR to add this permanently). For the large build from above, it took ~75 minutes with the exec log and ~7 minutes without.

I think we should move to a producer/consumer pattern where we put spawns to be processed into a queue and write the file in a separate thread. We will have to see how much memory overhead this pattern adds.

I am not sure whether that is enough to reduce the overhead to acceptable levels.

Are exec logs written in the threads running the actions? It seems like that could have huge contention issues.

@tbaing
Contributor

tbaing commented Sep 1, 2023

@meisterT +1 on the thanks for the detailed analysis.

Reducing the size

For our use case, most of the content is inputs and actual_outputs, so artifact indexing seems most promising for our needs.

Leaner fields could also be useful, but we'd see the most benefit from a focus on the per-input/output fields rather than the per-action ones. Hashing the command line and env vars could save some space, but reconstructing those things may be somewhat annoying (the command line requires you to check out exactly that source code to reconstruct it via aquery) or hard (env vars on a CI bot that's been deleted are going to take some non-trivial work to recreate, probably more than they're worth). For my $0.02 I'd just leave them alone. And I find that many of those one-per-action lines are useful for understanding the context ("did we rebuild this action despite no changed inputs because it's cacheable: true?"), so for always-on usage I'd hesitate to cut those things.

Whereas I'd happily drop the hash_function_name and size_bytes from inputs/outputs. Set the default hash function at the top of the file and only list it per artifact if it doesn't match; I don't care about size, because all I really care about is "did its hash change between invocations?" Maybe have a flag to control whether to cut the per-action content, so people who find it useful can leave it on while cutting the per-artifact lines?

Compression SGTM.

Performance overhead

Anything that reduces the overhead of creating this content is welcome. Having less content to create, and compressing that content, might help if the bottleneck is the writes to disk.

Different formats

Your observation that we could have a single (efficient to create and store) output format that then gets post-processed into other formats (including stable ones) - whether for comparison against another log or just for reading to understand what happened in the current execution - seems right to me.

How to compare protos

I agree that a tool is essential. The one you referenced is a good start, but it would benefit from enhancements such as analyzing which changed inputs are outputs from another action in the current run. An action for which all changed inputs are outputs from other actions is far less interesting when you're trying to understand why you rebuilt something you didn't expect to, so the tool would benefit from the ability to 1) focus on the non-transitive changes, and 2) walk easily down through the transitive changes to find interesting non-transitive changes. The ability to filter in/out branches of that graph would also be helpful; I could imagine saying "I want to see all non-transitive changes, but not the ones that chain up into dependency B, or only the ones that chain up into A and C." I could also imagine that it might be valuable to understand cases where a deeply nested dependency was rebuilt due to changed inputs but produced unmodified output, and so didn't contribute to rebuilds of artifacts higher in the dependency graph, and that the tool might help show that there are multiple distinct subsets of the graph.

I'm not sure I agree that the tool shouldn't work from the JSON content. Having the analysis tool highlight aspects of the content in human-readable formats, and then being able to go find the raw content in human-readable format, seems like a better workflow than having the tool be the only way to interact with the content. The tool should handle all the variations in the data model (no matter how lean/rich the user chooses to capture), but JSON seems like an OK format for the reasons you touched on in your "Different formats" section.

In analyzing execution logs this week, I've found some value in having a file per action, for easy comparison or examination of that particular action, versus having to diff the entire file or open it and search down to the specific part I'm interested in. It probably wouldn't matter with a 5MB file, and it was annoying but manageable with a 1GB file, but it's high-friction with our full 100+GB files. Doing that could also help with the memory usage required for sorting (my experience was that I needed the JVM to have an Xmx around 100GB for a 100GB sorted exec log, due, I believe, to the sort at the end), since sorting would then just be a matter of reading an index to determine how to concatenate the individual files rather than something that has to happen in memory.
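
The "file per action" idea could look roughly like this (Python; it assumes the log has already been converted to a list of JSON spawn objects, and listedOutputs is an assumed field name): one sorted, pretty-printed file per spawn, so two builds can be compared action-by-action with plain diff and no in-memory sort of the whole log is needed.

    import hashlib
    import json
    import os

    def split_per_action(spawns, out_dir):
        os.makedirs(out_dir, exist_ok=True)
        for spawn in spawns:
            # Name each file after a stable key: the first listed output, if any.
            key = (spawn.get("listedOutputs") or ["no-listed-output"])[0]
            name = hashlib.sha256(key.encode()).hexdigest()[:16] + ".json"
            with open(os.path.join(out_dir, name), "w") as f:
                json.dump(spawn, f, indent=2, sort_keys=True)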

@minor-fixes
Contributor

+1 on data available through BES, or streaming through some remote protocol.

There's already a good deal of challenge around getting the bazel client to perform well (in the face of memory issues, startup times, dependency fetching, etc.) - only offering this info via client-generated local files could very well mean that users need to supply:

  • mechanisms to unconditionally dump these files ("always on" in the FR) without filling up local disk, etc.
  • mechanisms to exfiltrate these files from automated build machines, as well as user machines
  • tuning around automated bazel runs (if running this feature unconditionally incurs a CPU/build time penalty), which are already challenging (see above)

The various streaming protocols that bazel supports (BES, RBE) have been a godsend for doing this sort of analysis/reporting across builds - I'm sure that if this info were available in the build event stream (maybe with verbosity flags/options as needed) then the community would write the backends to store/aggregate the data appropriately.

@tbaing
Contributor

tbaing commented Sep 15, 2023

While trying to debug what changed between two builds, I discovered something else that makes the current implementation hard to use. When an action is not a cache hit, we get information about the actual_outputs produced by the action, including and especially the hashes of each file generated. However, when an action is a cache hit, that content is not provided; instead, a digest block is given. So when diffing two successive runs where the first built the artifact and the second reused that artifact from the cache, we see irrelevant differences because the second run doesn't have this information, which makes it harder to zero in on the actual differences between the files.

I'm fine with having differences in the runner line (local vs. remote cache hit) and the remote_cache_hit line, because those are easy to grep out or ignore, but the differences in the actual_outputs make it much harder to grab two of these logs and compare them against each other, because the differences are in content that we actually want to inspect for changes.

@acecilia

acecilia commented Oct 23, 2023

We were discussing an idea today at the BazelCon Community Day to be able to test cache determinism in every pull request. Leaving it here in case it is useful.

Assuming the remote cache is enabled and contains artifacts for the latest commit in the main git branch (the usual):

  1. Get the list of targets that missed the cache
    • Perform the usual test build of your targets
    • Out of that build, generate a list of the targets that missed the cache
    • => So CI usage/capacity stays the same, no need to perform extra builds
  2. Get the list of targets that changed, based on the git diff between the current branch and the target branch of the PR
    • Get the files that changed between the current branch and the target branch of the PR
    • Calculate the targets that are associated with those changed files (maybe with bazel-diff?)
  3. Compare the above two target lists (see the sketch below). If the list of cache misses is not a subset of the list generated from the git diff, the pull request is introducing non-determinism
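
A minimal sketch of step 3 (Python; the file names cache_misses.txt and git_affected.txt are hypothetical outputs of steps 1 and 2, one target label per line):

    def check_determinism(cache_miss_targets, git_affected_targets):
        # A PR is suspicious if some target missed the cache without a corresponding source change.
        unexplained = set(cache_miss_targets) - set(git_affected_targets)
        for target in sorted(unexplained):
            print(f"cache miss not explained by the git diff: {target}")
        return not unexplained

    with open("cache_misses.txt") as f, open("git_affected.txt") as g:
        build_is_deterministic = check_determinism(f.read().split(), g.read().split())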

copybara-service bot pushed a commit that referenced this issue Nov 7, 2023
…optimize creation of unsorted logs.

Since we're about to introduce a new log format (see #18643), let's take this opportunity to clean up the command line API.

Avoid a pointless copy of the execution log, if no sorting is requested, by writing directly into the requested output binary path. If a temporary file is necessary, write it to a well-known path in the output base, and delete it after conversions are successful; this ensures that large temporary files don't accumulate in /tmp.

RELNOTES[INC]: Delete the --experimental_execution_log_file flag. Use --execution_log_binary_file in conjunction with --noexecution_log_sort instead.

PiperOrigin-RevId: 580132026
Change-Id: Icca827260176b3ea804384d2bed4a56ce85cfc14
tjgq added a commit to tjgq/bazel that referenced this issue Nov 9, 2023
…ag, and optimize creation of unsorted logs.

Since we're about to introduce a new log format (see bazelbuild#18643), let's take this opportunity to clean up the command line API.

Avoid a pointless copy of the execution log, if no sorting is requested, by writing directly into the requested output binary path. If a temporary file is necessary, write it to a well-known path in the output base, and delete it after conversions are successful; this ensures that large temporary files don't accumulate in /tmp.

RELNOTES[INC]: Delete the --experimental_execution_log_file flag. Use --execution_log_binary_file in conjunction with --noexecution_log_sort instead.

PiperOrigin-RevId: 580132026
Change-Id: Icca827260176b3ea804384d2bed4a56ce85cfc14
keertk pushed a commit that referenced this issue Nov 10, 2023
…ag, and optimize creation of unsorted logs. (#20132)

Since we're about to introduce a new log format (see
#18643), let's take this
opportunity to clean up the command line API.

Avoid a pointless copy of the execution log, if no sorting is requested,
by writing directly into the requested output binary path. If a
temporary file is necessary, write it to a well-known path in the output
base, and delete it after conversions are successful; this ensures that
large temporary files don't accumulate in /tmp.

RELNOTES[INC]: Delete the --experimental_execution_log_file flag. Use
--execution_log_binary_file in conjunction with --noexecution_log_sort
instead.

PiperOrigin-RevId: 580132026
Change-Id: Icca827260176b3ea804384d2bed4a56ce85cfc14
copybara-service bot pushed a commit to bazelbuild/intellij that referenced this issue Nov 14, 2023
They're fairly verbose and represent a significant portion of the log, much more so once we switch to a compact format. It should be possible to uniquely identify spawns through a combination of mnemonic, target label and output path.

Progress on bazelbuild/bazel#18643.

PiperOrigin-RevId: 582273738
copybara-service bot pushed a commit that referenced this issue Nov 14, 2023
They're fairly verbose and represent a significant portion of the log, much more so once we switch to a compact format. It should be possible to uniquely identify spawns through a combination of mnemonic, target label and output path.

Progress on #18643.

PiperOrigin-RevId: 582273738
Change-Id: Ife9edc6aaf51c96e4ac42a84f69371ab4aaa50eb
copybara-service bot pushed a commit to bazelbuild/intellij that referenced this issue Nov 14, 2023
We now unconditionally emit the full spawn metrics, which already include the wall time; the extra field is wasteful and potentially confusing.

Progress on bazelbuild/bazel#18643.

PiperOrigin-RevId: 582282615
copybara-service bot pushed a commit that referenced this issue Nov 14, 2023
We now unconditionally emit the full spawn metrics, which already include the wall time; the extra field is wasteful and potentially confusing.

Progress on #18643.

PiperOrigin-RevId: 582282615
Change-Id: I73f87f3e1c54679506acde860991ef57947262d3
copybara-service bot pushed a commit to bazelbuild/intellij that referenced this issue Nov 14, 2023
… reflects its purpose.

Also amend misleading or plainly incorrect comments. In particular, the runner field is always populated, even for cache hits, and spawns are only missing when the persistent action cache is hit ("local" is ambiguous, as we also use it in certain places to refer to the disk cache).

Note that, while the runner and cache_hit fields currently overlap in purpose, their separate existence is better aligned with potential future plans to promote cache lookups to a strategy (see bazelbuild/bazel#19904).

Progress on bazelbuild/bazel#18643.

PiperOrigin-RevId: 582294420
copybara-service bot pushed a commit that referenced this issue Nov 14, 2023
… reflects its purpose.

Also amend misleading or plainly incorrect comments. In particular, the runner field is always populated, even for cache hits, and spawns are only missing when the persistent action cache is hit ("local" is ambiguous, as we also use it in certain places to refer to the disk cache).

Note that, while the runner and cache_hit fields currently overlap in purpose, their separate existence is better aligned with potential future plans to promote cache lookups to a strategy (see #19904).

Progress on #18643.

PiperOrigin-RevId: 582294420
Change-Id: I75fc9a42b8a38c2233cc033e97d7c87264fb11d7
keertk pushed a commit that referenced this issue Nov 14, 2023
We now unconditionally emit the full spawn metrics, which already
include the wall time; the extra field is wasteful and potentially
confusing.

Progress on #18643.

PiperOrigin-RevId: 582282615
Change-Id: I73f87f3e1c54679506acde860991ef57947262d3
tjgq added a commit to tjgq/bazel that referenced this issue Nov 14, 2023
…curately reflects its purpose.

Also amend misleading or plainly incorrect comments. In particular, the runner field is always populated, even for cache hits, and spawns are only missing when the persistent action cache is hit ("local" is ambiguous, as we also use it in certain places to refer to the disk cache).

Note that, while the runner and cache_hit fields currently overlap in purpose, their separate existence is better aligned with potential future plans to promote cache lookups to a strategy (see bazelbuild#19904).

Progress on bazelbuild#18643.

PiperOrigin-RevId: 582294420
Change-Id: I75fc9a42b8a38c2233cc033e97d7c87264fb11d7
keertk pushed a commit that referenced this issue Nov 14, 2023
…curately reflects its purpose. (#20201)

Also amend misleading or plainly incorrect comments. In particular, the
runner field is always populated, even for cache hits, and spawns are
only missing when the persistent action cache is hit ("local" is
ambiguous, as we also use it in certain places to refer to the disk
cache).

Note that, while the runner and cache_hit fields currently overlap in
purpose, their separate existence is better aligned with potential
future plans to promote cache lookups to a strategy (see
#19904).

Progress on #18643.

PiperOrigin-RevId: 582294420
Change-Id: I75fc9a42b8a38c2233cc033e97d7c87264fb11d7
copybara-service bot pushed a commit to bazelbuild/intellij that referenced this issue Nov 16, 2023
Although this is a breaking change, I find it unlikely that anyone requests both, given how unwieldy these files are.

This avoids future complexity around format conversions as we add more of them (specifically: in order to add a new format while keeping the old formats equally performant, the choice of "raw" format would likely have to vary, so there would be additional pairs to convert between).

The current conversion code can also be simplified, as there's now at most one conversion to make, but I prefer doing that separately.

Related to bazelbuild/bazel#18643.

PiperOrigin-RevId: 582994996
copybara-service bot pushed a commit that referenced this issue Nov 16, 2023
Although this is a breaking change, I find it unlikely that anyone requests both, given how unwieldy these files are.

This avoids future complexity around format conversions as we add more of them (specifically: in order to add a new format while keeping the old formats equally performant, the choice of "raw" format would likely have to vary, so there would be additional pairs to convert between).

The current conversion code can also be simplified, as there's now at most one conversion to make, but I prefer doing that separately.

Related to #18643.

PiperOrigin-RevId: 582994996
Change-Id: I8e0bc228c73a0da7f3e7b830ade2cab6c14319ff
tjgq added a commit to tjgq/bazel that referenced this issue Nov 16, 2023
…tion log.

Although this is a breaking change, I find it unlikely that anyone requests both, given how unwieldy these files are.

This avoids future complexity around format conversions as we add more of them (specifically: in order to add a new format while keeping the old formats equally performant, the choice of "raw" format would likely have to vary, so there would be additional pairs to convert between).

The current conversion code can also be simplified, as there's now at most one conversion to make, but I prefer doing that separately.

Related to bazelbuild#18643.

PiperOrigin-RevId: 582994996
Change-Id: I8e0bc228c73a0da7f3e7b830ade2cab6c14319ff
keertk pushed a commit that referenced this issue Nov 16, 2023
…tion log. (#20222)

Although this is a breaking change, I find it unlikely that anyone
requests both, given how unwieldy these files are.

This avoids future complexity around format conversions as we add more
of them (specifically: in order to add a new format while keeping the
old formats equally performant, the choice of "raw" format would likely
have to vary, so there would be additional pairs to convert between).

The current conversion code can also be simplified, as there's now at
most one conversion to make, but I prefer doing that separately.

Related to #18643.

PiperOrigin-RevId: 582994996
Change-Id: I8e0bc228c73a0da7f3e7b830ade2cab6c14319ff
mai93 pushed a commit to bazelbuild/intellij that referenced this issue Nov 21, 2023
They're fairly verbose and represent a significant portion of the log, much more so once we switch to a compact format. It should be possible to uniquely identify spawns through a combination of mnemonic, target label and output path.

Progress on bazelbuild/bazel#18643.

(cherry picked from commit b7e97fd)
mai93 pushed a commit to bazelbuild/intellij that referenced this issue Nov 21, 2023
We now unconditionally emit the full spawn metrics, which already include the wall time; the extra field is wasteful and potentially confusing.

Progress on bazelbuild/bazel#18643.

(cherry picked from commit ab4d2a1)
mai93 pushed a commit to bazelbuild/intellij that referenced this issue Nov 21, 2023
… reflects its purpose.

Also amend misleading or plainly incorrect comments. In particular, the runner field is always populated, even for cache hits, and spawns are only missing when the persistent action cache is hit ("local" is ambiguous, as we also use it in certain places to refer to the disk cache).

Note that, while the runner and cache_hit fields currently overlap in purpose, their separate existence is better aligned with potential future plans to promote cache lookups to a strategy (see bazelbuild/bazel#19904).

Progress on bazelbuild/bazel#18643.

(cherry picked from commit 0bc8e5f)
mai93 pushed a commit to bazelbuild/intellij that referenced this issue Nov 21, 2023
Although this is a breaking change, I find it unlikely that anyone requests both, given how unwieldy these files are.

This avoids future complexity around format conversions as we add more of them (specifically: in order to add a new format while keeping the old formats equally performant, the choice of "raw" format would likely have to vary, so there would be additional pairs to convert between).

The current conversion code can also be simplified, as there's now at most one conversion to make, but I prefer doing that separately.

Related to bazelbuild/bazel#18643.

(cherry picked from commit 557dab5)
@meisterT meisterT added P2 We'll consider working on this in future. (Assignee optional) and removed P3 We're not considering working on this, but happy to review a PR. (No assignee) labels Jan 12, 2024
@tjgq
Contributor

tjgq commented Feb 14, 2024

Status update:

Bazel 7.1.0 will introduce an --experimental_execution_log_compact_file flag, which is similar to the existing --execution_log_{binary,json}_file, but produces the log in a more efficient binary/compressed format (see flag description for details).

We designed this new format so that the log can be produced with minimal runtime overhead (aiming for 2-3% at most, often less) and with a much smaller file size (a reduction of 100x relative to the old format can be expected, sometimes more).

As a drawback, the new format is not designed with human readability in mind, and is expected to be analyzed with the aid of specialized tools. As a starting point, 7.1.0 will also provide an offline conversion tool (//src/tools/execlog:converter) that can convert the new format into the old ones, so that existing tools built around the old formats continue to work. Over time, we expect tools to acquire the ability to consume the new format natively.

We look forward to your feedback on this new format during the 7.x cycle, and we might still make some minor adjustments in response to it (hence the experimental nature), although we'll endeavor to minimize disruption to early adopters.

We expect Bazel 8.0.0 to declare the new format to be stable and to deprecate the older formats.

@sluongng
Contributor

As a drawback, the new format is not designed with human readability in mind, and is expected to be analyzed with the aid of specialized tools. As a starting point, 7.1.0 will also provide an offline conversion tool (//src/tools/execlog:converter) that can convert the new format into the old ones, so that existing tools built around the old formats continue to work. Over time, we expect tools to acquire the ability to consume the new format natively.

We have seen this pattern with recent log files. They are written as protobuf, so they are not friendly for human consumption.

What if we included a bazel log subcommand to support these logs collectively, instead of shipping separate tools? This would also let us store these logs in zstd-compressed form, which is appendable and saves more space.
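
To illustrate the appendability point (a sketch assuming the python-zstandard package; not something Bazel ships): each record can be written as an independent zstd frame appended to the same file, and a frame-aware reader decodes the whole file as one stream.

    import zstandard as zstd

    def append_record(path, record: bytes):
        with open(path, "ab") as f:
            f.write(zstd.ZstdCompressor().compress(record))  # one complete frame per record

    def read_all(path) -> bytes:
        out = bytearray()
        with open(path, "rb") as f:
            # read_across_frames lets the reader continue past each frame boundary
            with zstd.ZstdDecompressor().stream_reader(f, read_across_frames=True) as reader:
                while chunk := reader.read(65536):
                    out.extend(chunk)
        return bytes(out)

    append_record("exec.log.zst", b"spawn 1\n")
    append_record("exec.log.zst", b"spawn 2\n")
    assert read_all("exec.log.zst") == b"spawn 1\nspawn 2\n"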

For reference, buck2 comes with

> buck2 log
buck2-log
Commands for interacting with buck2 logs

USAGE:
    buck2 log [OPTIONS] <SUBCOMMAND>

OPTIONS:
    -h, --help                Print help information
    -v, --verbose <NUMBER>    How verbose buck should be while logging. Values: 0 = Quiet, errors
                              only; 1 = default; 2 = more info about errors; 3 = more info about
                              everything [default: 1]

SUBCOMMANDS:
    cmd                  Show buck command line arguments from selected invocation
    critical-path        Show the critical path for a selected build
    help                 Print this message or the help of the given subcommand(s)
    path                 Output the path to the selected log
    replay               Replay an event log
    show                 Outputs the log in JSON format from selected invocation
    show-user            Converts the event log from a selected invocation into a user event
                             log, in JSONL format
    what-failed          Outputs every command that failed in the selected invocation
    what-materialized    Outputs materializations from selected invocation
    what-ran             Output everything Buck2 ran from selected invocation
    what-up              Show the spans that were open when the log ended
    what-uploaded        Outputs stats about uploads to RE from the selected invocation
exit 2

which is a lot more user-friendly.


Separate question: given that this is supposed to be "compact", could we include this log in BES?

@lberki
Contributor Author

lberki commented Feb 15, 2024

I think that's a reasonable idea (cc @tjgq @coeuvre); getting the separate tools is a pain, and since Bazel needs to link all the code handling these logs anyway, I hope it would not be too much bloat.

@brentleyjones
Contributor

I also want to ask what Son asked at the end: can we get this referenced in the BES and uploaded to the remote cache? Similar to the execution deps graph when that is enabled?

@tjgq
Contributor

tjgq commented Feb 15, 2024

Yes, we can include it in the BES. I'll prepare a PR.

@sluongng
Contributor

sluongng commented Apr 8, 2024

@tjgq I started to play with the new compact exec log today and it's very nice.

One small piece of feedback is that the --experimental_execution_log_compact_file flag should accept an empty string. In that case, it should simply write the file to the output base with a default name and upload the file to BES. That's how it's done for the trace profile today: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/runtime/BlazeRuntime.java;l=328-347;drc=3f9d80c35d88a280c2d53682f3d201b4733a3fff

@BalestraPatrick
Member

+1 to also adding a way to upload this by default without providing a value to the flag.

@brentleyjones
Contributor

@tjgq and I discussed this, and he's not sure it would ever be fine for this to always be collected by default. But there could be a flag, like --enable_execution_log_compact_file_by_default, which people could flip to make it behave like the timing profile does today.

@sluongng
Contributor

Since this is an experimental flag, I think we can just change it to a tri-state flag:

  • Unset: disable logging.
  • Set, no value: log to bazel_out
  • Set, with value: log to specified path

The default value could still be unset.
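
As an analogy for the tri-state behavior (a Python argparse sketch, not Bazel's actual option parser; the flag name is just illustrative):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--execution_log_compact_file",
        nargs="?",          # value is optional
        default=None,       # flag absent: logging disabled
        const="<default>",  # flag present without a value: use a default path under the output base
    )

    for argv in ([], ["--execution_log_compact_file"], ["--execution_log_compact_file=/tmp/exec.log"]):
        print(argv, "->", parser.parse_args(argv).execution_log_compact_file)
    # []                                              -> None
    # ['--execution_log_compact_file']                -> <default>
    # ['--execution_log_compact_file=/tmp/exec.log']  -> /tmp/exec.log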


From my observation, this flag mostly benefits folks who rely on Bazel's incremental state tracking. For folks who rely on a separate query system (i.e. bazel-diff or target-determinator) to decide what to build and test, the number of actions is usually higher and thus the log size is also a bit larger (1-2 GB for some runs). To be clear, I don't think the increase in action count is caused by using a separate query system, but there usually is some correlation there.

Therefore, I do agree that we should not enable the compact log by default, at least not yet.

@brentleyjones
Contributor

brentleyjones commented Jun 26, 2024

There would be no way to unset it then in a different --config, since you can't distinguish null from an empty string on the command line.

@sluongng
Contributor

🤔 Perhaps you could explicitly unset it with --no<flag>. Or we could accept a boolean value such as true/false/1/0 to distinguish unset from set-but-use-default-path?
