Delta Lake


Software Development

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture

About us

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs for Scala, Java, Rust, Ruby, and Python. Delta Lake is an independent open-source project and is not controlled by any single company. To emphasize this, the Delta Lake Project joined the Linux Foundation in 2019 and operates as a sub-project of the Linux Foundation Projects.

Website
https://delta.io
Industry
Software Development
Company size
11-50 employees
Headquarters
San Francisco
Type
Partnership
Founded
2019
Specialties
Delta Lake, Apache Spark, PrestoDB, Trino, Hive, Apache Flink, Apache Beam, Apache Pulsar, Rust, Scala, Java, Python, and Ruby


Updates


    Cloning #DeltaLake tables is a powerful feature for migrating data, replicating machine learning workflows, and creating data copies for experiments. Changes to clones do not impact the source table, ensuring integrity and isolation. ✅ Understanding the different methods of cloning can greatly impact your data management strategy. Learn which cloning technique is right for your needs and how to implement it effectively. Get started ➡ https://lnkd.in/eR4fdFcc #opensource #oss #linuxfoundation #lfaidata

    Delta Lake Clone (delta.io)
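
    As a rough illustration, a shallow clone can be created with a single SQL statement. The following is only a sketch: it assumes a Spark session configured for Delta Lake (a release that supports SHALLOW CLONE), and the table names sales and sales_experiment are hypothetical.

        # Minimal sketch: shallow clone of a Delta table via Spark SQL.
        # Assumes the delta-spark package is available; table names are hypothetical.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder.appName("delta-clone-sketch")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
            .getOrCreate()
        )

        # A shallow clone copies only table metadata; data files are still referenced
        # from the source table, so creating the clone is cheap.
        spark.sql("CREATE TABLE IF NOT EXISTS sales_experiment SHALLOW CLONE sales")

        # Changes to the clone do not affect the source table.
        spark.sql("DELETE FROM sales_experiment WHERE region = 'test'")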


    Delta Lake deletion vectors make delete operations much faster. 🏃‍♂️💨 Enabling deletion vectors can make deletes run faster by a factor of 3x, 10x, 100x, or even more. It sounds like an exaggeration, but it's true! Delete operations without deletion vectors can be numbingly slow. Parquet data lake delete operations are slow because the underlying files are immutable. You can't simply remove a row of data from a Parquet file. You need to read the Parquet file, filter out the row you want removed, and then create a new file. "Deleting" 5 rows of data from a Parquet file with a million rows is comparatively expensive, as you might imagine. Deletion vectors write out the rows that should be removed from the table in a roaring bitmap file. Engines can consult the roaring bitmap to see which rows should be filtered when data is read. Writing a roaring bitmap file is much, much faster than rewriting tons of #Parquet files - that's why deletion vectors are so much faster. The image shows how 400,000 rows are deleted from a 4.2 billion row dataset (0.01% of the data) in just 16 seconds when deletion vectors are enabled. This is a great result for users because their queries run a lot faster and they save on compute. 🔗 Learn more about deletion vectors: https://lnkd.in/eGeYdPJw Credit: Matthew Powers, CFA #opensource #oss #linuxfoundation #lfaidata #data
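
    A minimal sketch of enabling deletion vectors, assuming a SparkSession named spark configured for Delta Lake (2.4 or newer); the table name and filter are hypothetical.

        # Turn on deletion vectors for an existing Delta table.
        spark.sql("""
            ALTER TABLE events
            SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
        """)

        # Subsequent deletes record removed rows in a small bitmap file instead of
        # rewriting the affected Parquet files.
        spark.sql("DELETE FROM events WHERE user_id = 42")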


    In a recent conversation, Tathagata Das, Staff Software Engineer at Databricks, delves into the evolution, strategic advancements, and future of Delta Lake with Quentin Ambard and Youssef Mrini! Key takeaways include: 🌟 Origins of Delta Lake: Insights into its development for enhanced data integrity. 🌟 Real-World Applications: Learn from its pivotal deployment handling Apple's data scale. 🌟 Upcoming Features: Preview the new capabilities in Delta Lake 4.0. 🌟 Interoperability Focus: Strategies for seamless integration with Apache Iceberg and Hudi. Watch the full session for more ➡ https://lnkd.in/eSWEJv3Y #opensource #linuxfoundation #lfaidata #oss


    You can query a Delta table with both Polars and DuckDB. The following example reads a #deltalake table into a Polars DataFrame and applies a filter. The resulting Polars DataFrame is then queried by DuckDB. You can also run the #sql through the #polars SQL interface, so you don't need to use #duckdb at all. It's awesome that the data community is building tools that are so interoperable! 🦀 Credit: Matthew Powers, CFA #opensource #linuxfoundation #oss #lfaidata #polars #duckdb
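
    A minimal sketch of that pattern, assuming the deltalake, polars, and duckdb packages are installed; the table path and column names are hypothetical.

        import duckdb
        import polars as pl

        # Read the Delta table into a Polars DataFrame and apply a filter.
        df = pl.read_delta("./data/events").filter(pl.col("amount") > 100)

        # DuckDB can query a Polars DataFrame in the local scope directly.
        print(duckdb.sql("SELECT country, sum(amount) AS total FROM df GROUP BY country"))

        # The same query can also run through Polars' own SQL interface.
        ctx = pl.SQLContext(frames={"df": df})
        print(ctx.execute("SELECT country, sum(amount) AS total FROM df GROUP BY country").collect())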


    Did you know that you can use Delta Lake UniForm to append Iceberg metadata to a Delta table❓ Subsequent operations to the Delta table will make metadata entries that are compatible with both #deltalake and Apache Iceberg. The overhead for making an additional metadata entry is minimal (most processing time is for writing data files). This makes it easy to read Delta tables with engines that only have Iceberg connectors. Engines that already have Delta connectors don't need UniForm enabled - you can just use the Delta connector when it's available. 🔗 View the documentation for more: https://lnkd.in/ePbs63xn Credit: Matthew Powers, CFA #opensource #oss #linuxfoundation #lfaidata
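
    A rough sketch of creating a UniForm-enabled table with Spark SQL, assuming Delta Lake 3.x with its Iceberg support module available and a SparkSession named spark; the table name and schema are hypothetical.

        # Create a Delta table that also writes Iceberg-compatible metadata.
        spark.sql("""
            CREATE TABLE orders (id BIGINT, amount DOUBLE)
            USING DELTA
            TBLPROPERTIES (
              'delta.enableIcebergCompatV2' = 'true',
              'delta.universalFormat.enabledFormats' = 'iceberg'
            )
        """)

        # Subsequent writes add both Delta and Iceberg metadata entries, so
        # Iceberg-only engines can read the same table.
        spark.sql("INSERT INTO orders VALUES (1, 9.99)")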


    Delta Lake lets you store similar data close together via Liquid Clustering, Z-ordering, and Hive-style partitioning—making ETL processes faster and more reliable. Consider clustering for your ETL workloads if you: ✅ Frequently filter data by high cardinality columns ✅ Encounter significant skew in data distribution. ✅ Manage rapidly growing tables that require ongoing maintenance ✅ Experience changes in data access patterns over time 👀 Take a look at the significant speed improvements of a Z-ordered Delta table compared to traditional #Parquet and #JSON formats! Learn how to use #DeltaLake for ETL workloads. 👉 https://lnkd.in/epRiXJze #opensource #oss #linuxfoundation #lfaidata #ETL
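
    A minimal sketch of Z-ordering an existing table on a frequently filtered column, assuming a SparkSession named spark configured for Delta Lake and the delta-spark Python package; the table and column names are hypothetical.

        # Z-order the table so rows with similar customer_id values sit close together.
        spark.sql("OPTIMIZE events ZORDER BY (customer_id)")

        # The same operation through the Python API.
        from delta.tables import DeltaTable

        DeltaTable.forName(spark, "events").optimize().executeZOrderBy("customer_id")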


    Parquet files store column statistics for row groups in the metadata footer. This is nice because you can skip row groups you don't need. Delta Lake takes this further. Delta Lake stores metadata at the file level in the transaction log. This way query engines can figure out which data can be skipped using a single read operation, instead of reading each #Parquet footer. This is great because it means you can skip entire files. This can give you order-of-magnitude performance gains when reading large tables with a selective filter. Credit: Avril Aysha #opensource #oss #deltalake #linuxfoundation
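
    A small sketch of inspecting those file-level statistics with the deltalake (delta-rs) Python package; the table path is hypothetical.

        from deltalake import DeltaTable

        dt = DeltaTable("./data/events")

        # One row per data file in the current snapshot; the flattened columns
        # include per-column min/max statistics that engines use to decide which
        # files can be skipped entirely for a selective filter.
        actions = dt.get_add_actions(flatten=True).to_pandas()
        print(actions.columns.tolist())
        print(actions.head())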


    🚀 Do you have a passion for data engineering and a love for open source? No matter your experience level or background, there's a place for you in #DeltaLake. From code contributions and bug reports to updating documentation and giving community talks—there are numerous ways to get involved. 🙌 👀 Interested? Explore our roadmap to find out what we're focusing on next: https://lnkd.in/e7sWtX9T #linuxfoundation #opensource #oss #lfaidata

    Linux Foundation Delta Lake Roadmap • delta-io (github.com)


    How do you query Delta Lake tables within BigQuery directly? Vishal Waghmode walks us through the process! 🙌 Google BigQuery, a fully managed data warehouse, now supports reading Delta Lake tables directly! This integration allows users to query #DeltaLake data from #BigQuery without the need for complex #ETL processes or manually refreshing table metadata. Why is this integration important❓ 1️⃣ Streamlined Data Access: Simplifies querying Delta Lake data within BigQuery, enhancing data accessibility and reducing data movement. 2️⃣ Efficient Data Sharing: Share data seamlessly across different processing engines like BigQuery, Databricks, Dataproc, and Dataflow, enabling efficient data utilization and collaboration. 3️⃣ Reduced Complexity: Eliminates the need for maintaining separate data pipelines, thereby simplifying data management workflows. Results: ✅ Improved Query Performance: Faster query execution times due to direct access to Delta Lake data. ✅ Simplified Data Management: Automatically detects data and schema changes, so you can read the latest snapshot without manually refreshing table metadata. ✅ Unified Data Access: Maintain a single authoritative copy of your data that can be queried by both Databricks and BigQuery without the need to export, copy, or use manifest files. 🔗 Learn more: https://lnkd.in/eacAaCr4 #opensource #oss #linuxfoundation #dataengineering
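
    A hedged sketch of what this can look like with the google-cloud-bigquery client; the project, dataset, connection, and bucket names are placeholders, and the exact external-table options should be checked against the BigQuery documentation.

        from google.cloud import bigquery

        client = bigquery.Client()

        # Register the Delta table as a BigLake external table.
        client.query("""
            CREATE EXTERNAL TABLE `my_project.analytics.delta_events`
            WITH CONNECTION `my_project.us.lake_connection`
            OPTIONS (
              format = 'DELTA_LAKE',
              uris = ['gs://my-bucket/delta/events']
            )
        """).result()

        # Once registered, the table can be queried like any other BigQuery table.
        for row in client.query("SELECT count(*) AS n FROM `my_project.analytics.delta_events`").result():
            print(row.n)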


    We are excited to announce the release of python-v0.19.0: complete CDF support, an ADD COLUMN operation, and faster MERGE! 🎉 This release includes several exciting new performance improvements and features. Highlights include: 🌟 CDF support in the write_deltalake, delete, and merge operations 🌟 Expired log cleanup during post-commit, which can be disabled with delta.enableExpiredLogCleanup = false 🌟 Improved MERGE performance by using min/max values of non-partition columns in the predicate for prefiltering 🌟 ADD COLUMN operation 🌟 Faster log parsing 🔗 View the full release notes: https://lnkd.in/esWdhXDn #deltalake #opensource #oss #linuxfoundation
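
    A minimal sketch of a MERGE against a table created with Change Data Feed enabled, using the deltalake Python package (assuming 0.19.0 or newer and pandas); the path, schema, and merge predicate are hypothetical.

        import pandas as pd
        from deltalake import DeltaTable, write_deltalake

        # Create a small table with Change Data Feed enabled.
        write_deltalake(
            "./data/users",
            pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]}),
            configuration={"delta.enableChangeDataFeed": "true"},
        )

        updates = pd.DataFrame({"id": [2, 3], "name": ["Grace H.", "Edsger"]})

        # Upsert: update matching rows, insert the rest.
        (
            DeltaTable("./data/users")
            .merge(source=updates, predicate="t.id = s.id", source_alias="s", target_alias="t")
            .when_matched_update_all()
            .when_not_matched_insert_all()
            .execute()
        )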

