
Request for Machine Specification Recommendations for Image Build. #41

Closed
hanishi opened this issue Jan 24, 2024 · 28 comments
@hanishi

hanishi commented Jan 24, 2024

Hi,
I am currently working on building this project following the instructions for the aws environment
and have encountered significant build times and resource usage issues. I am seeking recommendations on the EC2 instance type required to handle this task efficiently.

I am currently using a c5.4xlarge, and it looks to me as if compilation has slowed down significantly or stopped.

I appreciate any advice or suggestions.

==== Sourcing builder.sh =====
==== Running build_and_test_all_in_docker =====
fix end of files...........................................................Passed
fix utf-8 byte order marker................................................Passed
mixed line ending..........................................................Passed
trim trailing whitespace...................................................Passed
check for case conflicts...................................................Passed
check for merge conflicts..................................................Passed
check yaml.................................................................Passed
check json.................................................................Passed
check for broken symlinks..................................................Passed
check for added large files................................................Passed
check vcs permalinks.......................................................Passed
check that executables have shebangs.......................................Passed
detect private key.........................................................Passed
Executable shell script omits the filename extension.......................Passed
Non-executable shell script filename ends in .sh...........................Passed
Check file encoding........................................................Passed
Test shell scripts with shellcheck.........................................Passed
buf format.................................................................Passed
clang-format...............................................................Passed
addlicense.................................................................Passed
addlicense check...........................................................Passed
terraform fmt..............................................................Passed
prettier...................................................................Passed
lint markdown..............................................................Passed
buildifier.................................................................Passed
cpplint....................................................................Passed
black python formatter.....................................................Passed
==== build and test specified targets using bazel-debian ====
=== cbuild debian action envs ===
action_env=TOOLCHAINS_HASH=b77b7d50a6527337035ce8d9d8a4a32218c32a5c0c431f28d4ca1a4bf767e384
INFO: Reading 'startup' options from /etc/bazel.bazelrc: --output_user_root=/bazel_root/build_ubuntu_b77b7d5
INFO: Reading 'startup' options from /root/.bazelrc: --output_base=/bazel_root/build_ubuntu_b77b7d5/7cd096bbd5692e698f6d658cc3f0db40
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=184
INFO: Reading rc options for 'build' from /etc/bazel.bazelrc:
  'build' options: --action_env=TOOLCHAINS_HASH=b77b7d50a6527337035ce8d9d8a4a32218c32a5c0c431f28d4ca1a4bf767e384
INFO: Reading rc options for 'build' from /src/workspace/.bazelrc:
  'build' options: --announce_rc --verbose_failures --compilation_mode=opt --output_filter=^//((?!(third_party):).)*$` --color=yes --@io_bazel_rules_docker//transitions:enable=false --workspace_status_command=bash tools/get_workspace_status --copt=-Werror=thread-safety-analysis --config=clang --config=noexcept --per_file_copt=.*sandboxed_api.*@-Wno-return-type --@com_google_googleurl//build_config:system_icu=0 --@io_opentelemetry_cpp//api:with_abseil=true --copt=-DENABLE_LOGS_PREVIEW
INFO: Found applicable config definition build:clang in file /src/workspace/.bazelrc: --cxxopt=-fbracket-depth=512 --client_env=CC=clang --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --client_env=BAZEL_CXXOPTS=-std=c++17
INFO: Found applicable config definition build:noexcept in file /src/workspace/.bazelrc: --copt=-fno-exceptions --per_file_copt=.*boost.*@-fexceptions --per_file_copt=.*cc/aws/proxy.*@-fexceptions --per_file_copt=.*cc/roma.*@-fexceptions --per_file_copt=.*oneTBB.*@-fexceptions --per_file_copt=.*com_github_nghttp2_nghttp2.*@-fexceptions --per_file_copt=.*cc/core.*@-fexceptions --per_file_copt=.*cc/cpio.*@-fexceptions
INFO: Found applicable config definition build:aws_instance in file /src/workspace/.bazelrc: --//:instance=aws --@google_privacysandbox_servers_common//:instance=aws
INFO: Found applicable config definition build:aws_platform in file /src/workspace/.bazelrc: --//:platform=aws --@google_privacysandbox_servers_common//:platform=aws --@google_privacysandbox_servers_common//scp/cc/public/cpio/interface:platform=aws
INFO: Analyzed 277 targets (594 packages loaded, 43384 targets configured).
INFO: Found 277 targets...
[20,585 / 33,774] 16 actions running
    Compiling src/heap/setup-heap-internal.cc; 667s processwrapper-sandbox
    Compiling src/compiler/js-create-lowering.cc; 514s processwrapper-sandbox
    Compiling src/google/protobuf/compiler/java/context.cc [for tool]; 165s processwrapper-sandbox
    Compiling scp/cc/core/async_executor/src/single_thread_priority_async_executor.cc; 138s processwrapper-sandbox
    Compiling aws-cpp-sdk-ec2/source/model/AssociatedRole.cpp; 138s processwrapper-sandbox
    Compiling src/google/protobuf/io/zero_copy_sink.cc; 138s processwrapper-sandbox
    Compiling src/google/protobuf/compiler/rust/context.cc; 123s processwrapper-sandbox
    Compiling src/inspector/v8-regex.cc; 123s processwrapper-sandbox ...
@lx3-g
Collaborator

lx3-g commented Jan 24, 2024

Hi hanishi,
A c5d.9xlarge or higher should probably work well for you -- it'll take about 30 minutes. Bazel tends to scale fairly linearly with the number of CPUs, so you can use an even bigger machine if you want to further reduce the build time.

@hanishi
Author

hanishi commented Jan 25, 2024

Thank you for your prompt and detailed response. I really appreciate your input, which has significantly helped shape our approach to this issue. We will try a larger instance as suggested. 😄

I believe it would be beneficial to update the relevant part of the aws environment documentation to note that building the image is resource intensive and that a c5d.9xlarge or larger is recommended.

It would serve as a valuable resource for others in the future, providing guidance and clarity on similar issues.

@hanishi
Author

hanishi commented Jan 25, 2024

I am using a c5d.9xlarge now.
It is definitely taking longer than 30 minutes (it has been 2 hours already).
It looks to me as if the compilation process is being throttled for some reason. Is this normal?

[35,381 / 44,415] 36 actions running
    Compiling src/baseline/baseline-batch-compiler.cc [for tool]; 1657s processwrapper-sandbox
    Compiling src/asio_client_response.cc; 1560s processwrapper-sandbox
    Compiling src/objects/js-collator.cc [for tool]; 1303s processwrapper-sandbox
    Compiling src/builtins/builtins-shared-array.cc; 1221s processwrapper-sandbox
    Compiling src/builtins/builtins-string.cc; 1215s processwrapper-sandbox
    Compiling src/heap/cppgc/member-storage.cc [for tool]; 1215s processwrapper-sandbox
    Compiling src/baseline/bytecode-offset-iterator.cc [for tool]; 1203s processwrapper-sandbox
    Compiling src/asio_client_session.cc; 1202s processwrapper-sandbox ...

@hanishi
Author

hanishi commented Jan 25, 2024

Here is the output from top.

The high load average and high I/O wait suggest that the build is spending a lot of time waiting on I/O, yet I see no disk usage at all. If the build is fetching a lot of data from a remote source and the connection to that source is slow (or throttled), that could make the build take much longer. I may be wrong, but I cannot think of any other reason. That said, if my assumption is correct, building for every release is proving far too slow to be realistic.

top - 13:30:04 up  2:53,  2 users,  load average: 40.37, 40.40, 40.45
Tasks: 408 total,   1 running, 407 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,  2.8 id, 97.2 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  70208.9 total,  58116.8 free,   5604.6 used,   6487.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  63831.3 avail Mem
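
One way to confirm whether the wait is on the local disks or on the network is to watch both while the build runs -- a minimal sketch, assuming a Debian/Ubuntu-based instance where the sysstat package can be installed:

sudo apt-get install -y sysstat   # provides iostat and sar
iostat -xz 5                      # per-device utilization; near-zero %util means the local disks are idle
sar -n DEV 5                      # per-interface rx/tx throughput while the build appears to be waiting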

@lx3-g
Collaborator

lx3-g commented Jan 25, 2024

Thanks for your feedback. We will update the aws environment documentation to reflect these requirements.

One thing you can try is setting a couple of Bazel flags to encourage it to schedule more concurrent actions, for example:

export BAZEL_EXTRA_ARGS="--local_cpu_resources=72 --jobs=72"

https://bazel.build/versions/6.3.0/docs/user-manual?hl=en#local-resources

However, we don't often build with a 0% cache hit rate -- so given your feedback, it might be useful to further bump up your machine to c5d.18xlarge.

Then instead of seeing

 36 actions running

you'll see 72 actions running, since you'll have more vCPUs, which will increase the build speed.
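
For concreteness, a minimal sketch of combining the flags above with the build invocation from the project's quick start guide (the values are assumptions and should match the instance's vCPU count; it also assumes the builders wrapper forwards BAZEL_EXTRA_ARGS to Bazel as suggested above):

export BAZEL_EXTRA_ARGS="--local_cpu_resources=72 --jobs=72"
./builders/tools/bazel-debian build //components/data_server/server:server \
  --//:platform=aws --//:instance=aws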

@hanishi
Author

hanishi commented Jan 26, 2024

After a thorough investigation, I've pinpointed the problem to bandwidth throttling during the build process. The issue persists regardless of the instance type we have tried so far, indicating that the instance's capacity isn't the limiting factor.

My hypothesis is that the bandwidth for fetching sources from remote repositories is being throttled, creating a bottleneck in overall build performance. Nothing limits the network bandwidth on our side, so the constraint must be somewhere else...

This issue will cause significant delays in our build times, impacting our development and deployment cycle.
(screenshot attached, 2024-01-26 12:04)

With 8 actions (I downgraded the instance type to c5d.2xlarge), the build proceeds fairly quickly until it hits the bandwidth issue.

INFO: Analyzed 277 targets (594 packages loaded, 43384 targets configured).
INFO: Found 277 targets...
[41,017 / 43,720] 8 actions running
    Compiling src/ic/accessor-assembler.cc [for tool]; 276s processwrapper-sandbox
[41,120 / 43,720] 8 actions running
    @v8//:v8_libshared_icu; 133s processwrapper-sandbox
    Compiling src/builtins/builtins-constructor-gen.cc [for tool]; 57s processwrapper-sandbox
[41,120 / 43,720] 8 actions running
    @v8//:v8_libshared_icu; 188s processwrapper-sandbox
[41,215 / 43,720] 8 actions running
    Compiling src/interpreter/bytecode-flags.cc; 244s processwrapper-sandbox
[41,218 / 43,720] 8 actions running
    Compiling src/objects/ordered-hash-table.cc [for tool]; 340s processwrapper-sandbox
    Compiling src/runtime/runtime-collections.cc; 229s processwrapper-sandbox

@hanishi
Author

hanishi commented Jan 27, 2024

I am reposting this communication as a follow-up to my earlier messages regarding a critical issue we have identified in our build process, which we suspect might be related to external bandwidth throttling.

After a thorough investigation and multiple tests across different instance types, we have consistently encountered significant delays in our build times. This issue adversely impacts our development and deployment and, if unresolved, poses a substantial risk to our project timelines.

The evidence indicates that the issue is unrelated to the instance type or our internal network configurations. We have observed consistent bandwidth limitations during the build process. These limitations occur irrespective of the instance's capacity, suggesting an external constraint on the network bandwidth.

We'd like to ask for your help in investigating this issue and understanding whether any bandwidth limitations imposed on your end could be causing this bottleneck. Any information or insights you can share would be extremely helpful, and we are open to exploring any solutions or workarounds you might suggest.

We appreciate your quick attention to this matter and look forward to your support in resolving this issue.

Thank you for your cooperation and support.

@yw63

yw63 commented Jan 29, 2024

Hi, I have a similar issue when building the AMI; it has already taken me about 6 hours to build the image. I am wondering how long it took @hanishi to successfully build the image. The information you provided really helped me understand the issue; I would appreciate any follow-ups. Thank you!

@hanishi
Author

hanishi commented Jan 29, 2024

@yw63
The build process ran for over six hours with a slow progress indicator, almost appearing to be stopped, which led us to abandon the build. So we have never succeeded.

@yw63

yw63 commented Jan 30, 2024

Hi @hanishi,
Thank you for your response. My build process also seemed to stop after a certain point (we let it hang for almost a day in total). I will post an update on the build status if I see anything new.

Thanks again for sharing your issue.

@hanishi
Author

hanishi commented Feb 3, 2024

@lx3-g
cc: @yw63
I'm letting you know about our ongoing AMI creation issue.

In a recent development, the AWS support team shared their findings with us after independently running the download and AMI creation process. They tested on their end using a c5d.18xlarge instance, as suggested, and similarly experienced the prolonged 8-hour process with intermittent failures.

Given the critical nature of this bottleneck and its apparent impact on our project, we would appreciate your expertise and assistance more urgently than ever. We have ruled out many potential internal causes and believe the issue may involve external bandwidth throttling or other constraints beyond our direct control.

We look forward to any support or insights you can offer.

Thank you.

@azaidisovrn

azaidisovrn commented Feb 15, 2024

Hi @lx3-g @hanishi,

We are running c5d.18xlarge and building the 0.15 release, and it has been over 14 hours since the build was triggered. We also believe that the long build times might be due to throttling on the remote side.

(screenshots attached)

@lx3-g
Collaborator

lx3-g commented Feb 16, 2024

I created a fresh machine -- c5d.24xlarge
I followed the steps here:
https://github.com/privacysandbox/protected-auction-key-value-service/blob/release-0.16/getting_started/quick_start.md
up to

 ./builders/tools/bazel-debian build //components/data_server/server:server \
  --//:platform=aws  --//:instance=aws

Note that I built for AWS, not local as in the example, since the AWS build takes longer.
I got:

INFO: Elapsed time: 1631.935s, Critical Path: 115.17s
INFO: 18237 processes: 5792 internal, 1 local, 12444 processwrapper-sandbox.
INFO: Build completed successfully, 18237 total actions

which is 27 minutes.

I didn't do anything extra beyond spinning up a standard AWS EC2 instance.
I also ran production/packaging/aws/build_and_test --with-ami us-east-1 on a separate new machine, and it took about 45 minutes to finish and produce an AMI.
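
Pulling the steps above together, a rough end-to-end sketch for a fresh EC2 instance (assumes git and Docker are already installed per the quick start guide; the branch name is taken from the link above):

git clone https://github.com/privacysandbox/protected-auction-key-value-service.git
cd protected-auction-key-value-service
git checkout release-0.16
./builders/tools/bazel-debian build //components/data_server/server:server \
  --//:platform=aws --//:instance=aws
production/packaging/aws/build_and_test --with-ami us-east-1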

@xinkuifeng

xinkuifeng commented Apr 9, 2024

Hello @lx3-g,

I'm using a MacBook M1 2020 (16 GB memory) to build the image locally, following the instructions on this page. I'm on the release-0.15 branch (commit 05b6890) and using Docker Desktop 4.28.0.

The build speed was OK up to about 6k actions, then it slowed down greatly. It has taken around 3 hours so far and the build is still at 12k actions.

[11,968 / 16,243] 8 actions running
    Compiling src/interpreter/bytecode-array-random-iterator.cc; 56s processwrapper-sandbox
    Compiling external/v8/icu/torque-generated/src/objects/foreign-tq.cc [for tool]; 53s processwrapper-sandbox
    Compiling generated/src/aws-cpp-sdk-ec2/source/model/AssignedPrivateIpAddress.cpp; 27s processwrapper-sandbox
    Compiling src/core/ext/transport/chttp2/transport/parsing.cc; 17s processwrapper-sandbox
    Compiling generated/src/aws-cpp-sdk-ec2/source/model/VerifiedAccessLogCloudWatchLogsDestinationOptions.cpp; 14s processwrapper-sandbox
    Compiling generated/src/aws-cpp-sdk-ec2/source/model/AssociateAddressRequest.cpp; 10s processwrapper-sandbox
    Compiling external/v8/icu/torque-generated/src/builtins/typed-array-values-tq.cc; 10s processwrapper-sandbox
    Compiling src/interpreter/bytecode-array-writer.cc; 6s processwrapper-sandbox

(screenshot attached)

Do you know what could be the bottleneck? Is it realistic to expect to build this project on a MacBook M1? If so, how many jobs should I force Bazel to run in parallel?

@peiwenhu
Collaborator

peiwenhu commented Apr 9, 2024

hi @xinkuifeng ,

We can't really comment on whether a MacBook M1 is realistic because we don't have one with the same setup to evaluate. However, I just ran the build on a GCP VM with 8 vCPUs, 32 GB RAM, and a 300 GB SSD, and it finished in about an hour. (The VM is brand new, so there is not much other CPU-heavy activity; that may or may not have helped.)

@xinkuifeng

xinkuifeng commented Apr 9, 2024

Hey @peiwenhu ,

However, I just ran the build on a GCP VM with 8 vCPUs, 32 GB RAM, and a 300 GB SSD, and it finished in about an hour.

Thanks for this info!

Is it possible to consider supporting the MacBook + Docker Desktop setup for local development? Without a working local dev environment, it can be hard to contribute.

In my case, I noticed that the container CPU usage on Docker Desktop is often under one core (<100%), while in theory I could use 8 cores. I tried:

export BAZEL_EXTRA_ARGS="--local_cpu_resources=16 --jobs=16"

before launching the build. However, it does not seem to change the container CPU usage.

@peiwenhu
Collaborator

Hi @xinkuifeng, unfortunately we don't have the capacity to support this specific setup. Perhaps you can try to locate the problem with a simpler repro case, such as replacing Docker Desktop with plain Docker and running the build, or running some other, simpler Bazel build (such as the official example) inside Docker Desktop, and seeing whether either gives any hint.

@xinkuifeng

Hi @peiwenhu, thanks for replying! Got it and will find alternatives.

@thegreatfatzby

Hey @peiwenhu, I'm trying to get this running on my Mac as well and am wondering if I'm barking up the wrong tree, since I'm not even reaching the slow build times that @xinkuifeng describes (although I have seen those on the AWS instances I've tried this on).

I've tried this a few different ways:

  1. Running on my Mac directly
  2. Running in a Docker container, starting from a Debian base image and installing Docker as described in the provided links.
  3. Doing the same with Ubuntu.

(1) ran into small issues immediately, so I moved on to trying Linux-based containers, but even there I'm getting a lot of small errors that might be telling me this isn't the correct path. For instance, I got past a few issues by mounting a directory for the workspace, making sure my Docker was up to date, etc., but now I'm getting odd issues with the install script where it can't find files that seem to be there: I see it look for files in the install directory that are actually being written to the workspace directory.

I'll come back to this later, but I'm curious whether this is simply not supported; if so, I'm trying to understand what form of local development is intended to be supported.

@thegreatfatzby

Also, @lx3-g, I don't think the link you provided is accessible to non-Google employees; is it different from what is on GitHub?

@xinkuifeng

Hey @thegreatfatzby

I would expect you to encounter at least two small errors when building this project directly on macOS:

  1. declare: -n: invalid option
  2. mktemp: unrecognized option `--suffix=.log'

because bash/zsh on macOS differ slightly from GNU bash on Ubuntu.

I didn't try to build this project inside a Docker container.

Running an Ubuntu VM on macOS to build this project is definitely a viable solution.

@thegreatfatzby

@xinkuifeng very encouraging to hear that; I take it that path succeeded for you. I don't mind spelunking, but I was beginning to worry I was down the wrong rabbit hole.

Did you have to make small adjustments to the scripts? For instance, I just got past one issue by adding a "mkdir ${WS_TMP_IMAGE_DIR}" in the get_builder_image_tagged phase, which has led to my next, similar issue. Did you have to do anything interesting with the scripts, docker commands, etc.?

@xinkuifeng

For Path 1

Running on my Mac directly

I didn't change the scripts. I installed utilities from GNU:

brew install bash
brew install coreutils

Starting from there, the project can be built. But I would not suggest taking this path, as the build speed is far too slow (>> 6 hours). The root cause could be Bazel's sandboxing strategy (creating the symlinks on macOS is far more costly than on Ubuntu).
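
Two untested ideas for this direct-on-macOS path (assumptions, not something verified against this project):

# Homebrew installs coreutils under a "g" prefix; the unprefixed GNU names are only
# visible to the build scripts if the gnubin directory is put ahead of the system tools.
export PATH="$(brew --prefix coreutils)/libexec/gnubin:$PATH"

# Bazel's sandbox creates many symlinks per action, which is slow on macOS; disabling it
# trades hermeticity for speed. This assumes the BAZEL_EXTRA_ARGS mechanism mentioned
# earlier in the thread is honored for this path as well.
export BAZEL_EXTRA_ARGS="--spawn_strategy=local"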

For Path 3

Doing the same with Ubuntu.

Things work out of the box. You don't need to change any scripts.

@thegreatfatzby

@xinkuifeng understood on number one; no issue going through an Ubuntu container... That is really interesting; I must be doing something silly. I had to step out so I don't have the commands in front of me, but I basically just pulled the base Ubuntu image, installed Docker and some other basics, tried it both with cloning directly into the image and with mounting from the host machine, and also mounted the Docker socket as well as a directory for the workspace... Maybe I'm using the wrong image?

@xinkuifeng

Maybe I'm using the wrong image?

I don't know. As I said, I never tried to use a Docker container with Ubuntu as the base image (Path 2).
Instead, I verified that an Ubuntu virtual machine on macOS works fine.

@thegreatfatzby

Ah, apologies, I heard what I wanted to hear. Thanks.

@lx3-g
Collaborator

lx3-g commented May 28, 2024

@thegreatfatzby

Also, @lx3-g, I don't think the link you provided is accessible to non-Google employees; is it different from what is on GitHub?

Thanks for pointing that out; I've updated the link. And yes, it was exactly the same content.

@formgit

formgit commented Jul 17, 2024

Hi all, we have now set up Cloud Build (GCP) and CodeBuild (AWS). For both platforms, the build time should be within 2 hours. See the cloud build docs for more details.

@formgit formgit closed this as completed Jul 17, 2024