Known issue: AWS instance can go into an unresponsive state after running for a few days #31

Closed
peiwenhu opened this issue Nov 1, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@peiwenhu
Collaborator

peiwenhu commented Nov 1, 2023

We discovered that if the server runs on AWS for a few days, it may get killed because it fails the autoscaling health checks. When we disabled the health checks, we noticed that the instance becomes unresponsive: it fails EC2 status checks, and SSH connections to it time out.

Work is in progress to fix this.

@peiwenhu peiwenhu added the bug Something isn't working label Nov 1, 2023
@MarcoLugo

Any updates on the progress or timeline for the fix? The context offered here seems rather concerning to us, so any additional information would be helpful. Thank you.

@peiwenhu
Collaborator Author

We think it is an out-of-memory issue caused by the test data that is continuously ingested in that environment. Measuring memory consumption in a TEE is a bit tricky, so we're still in the process of getting concrete evidence.

@peiwenhu
Collaborator Author

peiwenhu commented Feb 16, 2024

It turns out that we had set the Envoy logging verbosity level to DEBUG, which generated too many logs and filled up the disk on our test machine over time. The problem is unrelated to the server code itself. The verbosity level has been fixed.
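
For anyone who wants to watch for the same failure mode (logs slowly filling the disk), here is a minimal sketch of a disk-usage check. It is not part of this project: the function names, the default path, and the 90% threshold are all illustrative assumptions, and it only uses Python's standard `shutil.disk_usage`.

```python
import shutil


def disk_usage_fraction(used_bytes: int, total_bytes: int) -> float:
    """Return the used fraction of a filesystem (0.0 = empty, 1.0 = full)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def is_disk_nearly_full(path: str = "/", threshold: float = 0.9) -> bool:
    """Report whether the filesystem containing `path` is at or above `threshold`.

    Both the default path and the 0.9 threshold are assumptions chosen for
    illustration; pick values that match your own machine and log volume.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_usage_fraction(usage.used, usage.total) >= threshold


if __name__ == "__main__":
    # Print a warning when the filesystem holding the current directory is nearly full.
    if is_disk_nearly_full("."):
        print("warning: disk is nearly full; check log volume")
    else:
        print("disk usage is below the threshold")
```

Running something like this periodically (for example from cron) on the test machine would likely have flagged the growing log volume before the instance stopped responding.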
