Cloning #DeltaLake tables is a powerful feature for migrating data, replicating machine learning workflows, and creating data copies for experiments. Changes to clones do not impact the source table, ensuring integrity and isolation. ✅ Understanding the different methods of cloning can greatly impact your data management strategy. Learn which cloning technique is right for your needs and how to implement it effectively. Get started ➡ https://lnkd.in/eR4fdFcc #opensource #oss #linuxfoundation #lfaidata
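As a quick sketch of one of these techniques: with a Spark SQL engine that supports Delta Lake, a shallow clone can be created as below (table names are hypothetical). A shallow clone copies only the transaction log, so it is cheap to create and its data files are referenced rather than duplicated; deep clones, supported on some platforms, also copy the data files.

```sql
-- Shallow clone: copies table metadata only; data files are referenced, not duplicated.
-- Writes to the clone never modify the source table.
CREATE TABLE events_experiment SHALLOW CLONE events;
```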
Delta Lake
Software Development
Delta Lake is an open-source storage framework that enables building a Lakehouse architecture
About us
Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs for Scala, Java, Rust, Ruby, and Python. Delta Lake is an independent open-source project and not controlled by any single company. To emphasize this, Delta Lake joined the Linux Foundation in 2019 and operates as a sub-project of the Linux Foundation Projects.
- Website
-
https://delta.io
External link for Delta Lake
- Industry
- Software Development
- Company size
- 11-50 employees
- Headquarters
- San Francisco
- Type
- Partnership
- Founded
- 2019
- Specialties
- Delta Lake, Apache Spark, PrestoDB, Trino, Hive, Apache Flink, Apache Beam, Apache Pulsar, Rust, Scala, Java, Python, and Ruby
Locations
-
Primary
San Francisco, US
Updates
-
Delta Lake deletion vectors make delete operations much faster. 🏃♂️💨 Enabling deletion vectors can make deletes run faster by a factor of 3x, 10x, 100x, or even more. It sounds like an exaggeration, but it's true! Delete operations without deletion vectors can be numbingly slow. Parquet data lake delete operations are slow because the underlying files are immutable. You can't simply remove a row of data from a Parquet file. You need to read the Parquet file, filter out the rows you want removed, and then write a new file. "Deleting" 5 rows of data from a Parquet file with a million rows is comparatively expensive, as you might imagine. Deletion vectors record the positions of the rows that should be removed from the table in a roaring bitmap file. Engines can consult the roaring bitmap to see which rows should be filtered out when data is read. Writing a roaring bitmap file is much, much faster than rewriting tons of #Parquet files - that's why deletion vectors are so much faster. The image shows how 400,000 rows are deleted from a 4.2 billion row dataset (0.01% of the data) in just 16 seconds when deletion vectors are enabled. This is a great result for users because their queries run a lot faster and they save on compute. 🔗 Learn more about deletion vectors: https://lnkd.in/eGeYdPJw Credit: Matthew Powers, CFA #opensource #oss #linuxfoundation #lfaidata #data
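For illustration, deletion vectors are controlled by a table property; here is a sketch in Spark SQL with Delta Lake (the table and column names are made up):

```sql
-- Enable deletion vectors on an existing Delta table
ALTER TABLE events SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

-- Subsequent deletes record removed row positions in a deletion vector
-- instead of rewriting the affected Parquet files
DELETE FROM events WHERE user_id = 42;
```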
-
In a recent conversation, Tathagata Das, Staff Software Engineer at Databricks, delves into the evolution, strategic advancements, and future of Delta Lake with Quentin Ambard and Youssef Mrini! Key takeaways include: 🌟 Origins of Delta Lake: Insights into its development for enhanced data integrity. 🌟 Real-World Applications: Learn from its pivotal deployment handling Apple's data scale. 🌟 Upcoming Features: Preview the new capabilities in Delta Lake 4.0. 🌟 Interoperability Focus: Strategies for seamless integration with Apache Iceberg and Hudi. Watch the full session for more ➡ https://lnkd.in/eSWEJv3Y #opensource #linuxfoundation #lfaidata #oss
The future of Delta Lake and Apache Iceberg
https://www.youtube.com/
-
You can query a Delta table with both Polars and DuckDB. The following example reads a #deltalake table into a Polars DataFrame and applies a filter. The resulting Polars DataFrame is then queried with DuckDB. You could also run the #sql query through the #polars SQL interface, with no need for #duckdb at all. It's awesome that the data community is building tools that are so interoperable! 🦀 Credit: Matthew Powers, CFA #opensource #linuxfoundation #oss #lfaidata #polars #duckdb
-
Did you know that you can use Delta Lake UniForm to append Iceberg metadata to a Delta table❓ Subsequent operations to the Delta table will make metadata entries that are compatible with both #deltalake and Apache Iceberg. The overhead for making an additional metadata entry is minimal (most processing time is for writing data files). This makes it easy to read Delta tables with engines that only have Iceberg connectors. Engines that already have Delta connectors don't need UniForm enabled - you can just use the Delta connector when it's available. 🔗 View the documentation for more: https://lnkd.in/ePbs63xn Credit: Matthew Powers, CFA #opensource #oss #linuxfoundation #lfaidata
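As a rough sketch, UniForm is enabled through table properties; in Spark SQL with Delta Lake 3.x it looks something like the following (the table name is hypothetical, and exact property requirements may vary by version):

```sql
ALTER TABLE events SET TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```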
-
Delta Lake lets you store similar data close together via Liquid Clustering, Z-ordering, and Hive-style partitioning—making ETL processes faster and more reliable. Consider clustering for your ETL workloads if you: ✅ Frequently filter data by high cardinality columns ✅ Encounter significant skew in data distribution ✅ Manage rapidly growing tables that require ongoing maintenance ✅ Experience changes in data access patterns over time 👀 Take a look at the significant speed improvements of a Z-ordered Delta table compared to traditional #Parquet and #JSON formats! Learn how to use #DeltaLake for ETL workloads. 👉 https://lnkd.in/epRiXJze #opensource #oss #linuxfoundation #lfaidata #ETL
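For example, in Spark SQL with Delta Lake, a table can be Z-ordered on a high-cardinality filter column, or created with liquid clustering (table and column names here are illustrative):

```sql
-- Z-order an existing table on a commonly filtered column
OPTIMIZE events ZORDER BY (user_id);

-- Or create a new table with liquid clustering (Delta Lake 3.1+)
CREATE TABLE events_clustered (user_id BIGINT, ts TIMESTAMP, payload STRING)
USING DELTA
CLUSTER BY (user_id);
```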
-
Parquet files store column statistics for row groups in the metadata footer. This is nice because you can skip row groups you don't need. Delta Lake takes this further. Delta Lake stores metadata at the **file-level** in the transaction log. This way query engines can figure out which data can be skipped using a single read operation, instead of reading each #Parquet footer. This is great because it means you can skip entire files. This can give you order-of-magnitude performance gains when reading large tables with a selective filter. Credit: Avril Aysha #opensource #oss #deltalake #linuxfoundation
-
🚀 Do you have a passion for data engineering and a love for open source? No matter your experience level or background, there's a place for you in #DeltaLake. From code contributions and bug reports to updating documentation and giving community talks—there are numerous ways to get involved. 🙌 👀 Interested? Explore our roadmap to find out what we're focusing on next: https://lnkd.in/e7sWtX9T #linuxfoundation #opensource #oss #lfaidata
Linux Foundation Delta Lake Roadmap • delta-io
github.com
-
How do you query Delta Lake tables directly within BigQuery? Vishal Waghmode walks us through the process! 🙌 Google BigQuery, a fully managed data warehouse, now supports reading Delta Lake tables directly! This integration allows users to query #DeltaLake data from #BigQuery without the need for complex #ETL processes or manually refreshing table metadata. Why is this integration important❓ 1️⃣ Streamlined Data Access: Simplifies querying Delta Lake data within BigQuery, enhancing data accessibility and reducing data movement. 2️⃣ Efficient Data Sharing: Share data seamlessly across different processing engines like BigQuery, Databricks, Dataproc, and Dataflow, enabling efficient data utilization and collaboration. 3️⃣ Reduced Complexity: Eliminates the need for maintaining separate data pipelines, thereby simplifying data management workflows. Results: ✅ Improved Query Performance: Faster query execution times due to direct access to Delta Lake data. ✅ Simplified Data Management: Automatically detects data and schema changes, so you can read the latest snapshot without manually refreshing table metadata. ✅ Unified Data Access: Maintain a single authoritative copy of your data that can be queried by both Databricks and BigQuery without the need to export, copy, or use manifest files. 🔗 Learn more: https://lnkd.in/eacAaCr4 #opensource #oss #linuxfoundation #dataengineering
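A hedged sketch of what this looks like in BigQuery SQL (the project, connection, dataset, and bucket names are all hypothetical):

```sql
-- Create a BigLake external table over an existing Delta Lake table
CREATE EXTERNAL TABLE my_dataset.events_delta
WITH CONNECTION `my-project.us.my-connection`
OPTIONS (
  format = 'DELTA_LAKE',
  uris = ['gs://my-bucket/delta/events']
);

-- Query it like any other BigQuery table
SELECT * FROM my_dataset.events_delta LIMIT 10;
```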
-
We are excited to announce the release of python-v0.19.0: complete CDF support, an add column operation, and faster MERGE! 🎉 This release includes several exciting new performance improvements and features. Highlights include: 🌟 CDF support in the write_deltalake, delete, and merge operations 🌟 Expired log cleanup during post-commit, which can be disabled with delta.enableExpiredLogCleanup = false 🌟 Improved MERGE performance by using min/max values of non-partition columns in the predicate for prefiltering 🌟 ADD column operation 🌟 Faster log parsing 🔗 View the full release notes: https://lnkd.in/esWdhXDn #deltalake #opensource #oss #linuxfoundation