This migration began 4 years ago. 😲 Not our typical Ray use case, but it's so impressive, and it illustrates Ray's versatility. Also, it was worth it: they're saving over *$100 million annually*.
Some fascinating excerpts.
2016: Amazon aims to remove all dependencies on Oracle.
2018: Shut down the last Oracle Data Warehouse cluster; 50 PB of table data migrated from Oracle to S3. The tables store "deltas," that is, records to insert, update, or delete, which need to be merged at read time. These reads grow too expensive, so Apache Spark is used to merge the deltas offline and produce read-optimized versions of the tables.
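For intuition, here is a minimal sketch of the kind of delta merge that compaction performs offline. It assumes a simple primary-key, newest-wins model; the `Delta` type and `compact` function are illustrative, not Amazon's actual implementation:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

@dataclass
class Delta:
    op: str                       # "insert", "update", or "delete"
    key: str                      # the record's primary key
    value: Optional[dict] = None  # payload for inserts/updates

def compact(base: Dict[str, dict], deltas: Iterable[Delta]) -> Dict[str, dict]:
    """Merge ordered deltas into a base table, newest-wins per key.

    Doing this once offline yields a read-optimized copy, so readers
    no longer pay the merge cost on every query.
    """
    merged = dict(base)
    for d in deltas:  # deltas assumed ordered oldest -> newest
        if d.op in ("insert", "update"):
            merged[d.key] = d.value
        elif d.op == "delete":
            merged.pop(d.key, None)
    return merged
```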
2019: The data has grown from petabyte scale to exabyte scale. The current system needs constant tuning and optimization to handle the scale.
2020: The team completes a PoC using Ray for this workload, demonstrating the ability to handle "12X larger datasets than Apache Spark, improve cost efficiency by 91%, and process 13X more data per hour."
2021: The team settled on an overall architecture and shared early results at the Ray Summit.
2022: More testing of Ray to expose any issues when handling exabyte-scale production data. The main problems centered on managing Amazon EC2 instances at scale (poor resource utilization and slow cluster start times) and on out-of-memory errors.
Late 2022: The migration begins in earnest, starting with the largest ~1% of tables (which accounted for ~40% of the cost and the vast majority of job failures).
2023: Most issues fixed. Began moving to fully automated shadow compaction on Ray: whenever new inserts / updates / deletes arrived in a table to be compacted, both Spark and Ray would kick off the same compaction job, verifying Ray's correctness and benefits side by side (temporarily increasing the overall cost of compaction before lowering it).
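A rough sketch of that shadow-run pattern. The post doesn't describe the actual orchestration, so the engines are passed in as plain callables here; `spark_compact`, `ray_compact`, and `on_divergence` are hypothetical stand-ins:

```python
from typing import Callable, Dict, List

Table = Dict[str, dict]

def shadow_compact(
    table_id: str,
    deltas: List[dict],
    spark_compact: Callable[[str, List[dict]], Table],  # trusted legacy engine
    ray_compact: Callable[[str, List[dict]], Table],    # engine under test
    on_divergence: Callable[[str], None],               # alerting hook
) -> Table:
    """Run both engines on the same deltas; serve Ray's output only if
    it matches Spark's. Running both temporarily doubles compaction
    cost, which is the price of a safe, verifiable cutover."""
    baseline = spark_compact(table_id, deltas)
    candidate = ray_compact(table_id, deltas)
    if candidate != baseline:
        on_divergence(table_id)
        return baseline  # fall back to the trusted result
    return candidate
```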
2024 Q1: Ray compacted 1.5 exabytes of Apache Parquet data from S3 using 10,000 years of CPU compute time.
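Those figures allow a quick back-of-envelope throughput estimate (assuming decimal units, 1 EB = 10^18 bytes, and that the CPU-time figure covers the full 1.5 EB):

```python
# Back-of-envelope throughput implied by the Q1 2024 figures.
# All numbers approximate.
bytes_compacted = 1.5e18                    # 1.5 exabytes
cpu_hours = 10_000 * 365.25 * 24            # 10,000 CPU-years ~= 87.7M CPU-hours
gb_per_cpu_hour = bytes_compacted / cpu_hours / 1e9  # ~= 17 GB per CPU-hour
mb_per_cpu_second = gb_per_cpu_hour * 1e3 / 3600     # ~= 4.8 MB/s per CPU
print(f"~{gb_per_cpu_hour:.0f} GB per CPU-hour, "
      f"~{mb_per_cpu_second:.1f} MB/s per CPU")
```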
Today: Reading over 20 petabytes of data per day across 1,600 daily Ray jobs. Ray has maintained a 100% on-time delivery rate of newly compacted data to table subscribers, with 82% better cost efficiency! That translates to annual savings of over $120 million.
https://lnkd.in/g-pJhFei