Google Open Source Blog

Common Expressions For Portable Policy and Beyond

Tuesday, June 18, 2024

I am thrilled to introduce Common Expression Language, a simple expression language that's great for writing small snippets of logic which can be run anywhere with the same semantics in nanoseconds to microseconds evaluation speed. The launch of cel.dev marks a major milestone in the growth and stability of the language.

It powers several well known products such as Kubernetes where it's used to protect against costly production misconfigurations:

object.spec.replicas <= 5

Cloud IAM uses CEL to enable fine-grained authorization:

request.path.startsWith("/finance")

Versatility

CEL is both open source and openly governed, making it well adopted both inside and outside Google. CEL is used by a number of large tech companies, either internally or as part of their public product offering. As a reflection of this open governance, cel.dev has been launched to share more about the language and the community around it.

So, what makes CEL a good choice for these applications? Why is CEL unique or different from Amazon's Cedar Policy or Open Policy Agent's Rego? These are great questions, and common ones (no pun intended):

Highly optimized evaluation O(ns) - O(μs)
Portable with stacks supported in C++, Java, and Go
Thousands of conformance tests ensure consistent behavior across stacks
Supports extension and subsetting

Subsetting is crucial for preserving predictable compute / memory impacts, and it only exists in CEL. As any latency-critical service maintainer will tell you, it's vital to have a clear understanding of compute and memory implications of any new feature. Imagine you've chosen an expression language, validated its functionality meets your security, serving, and scaling goals, but after launching an update to the library introduces new functionality which can't be disabled and leaves your product vulnerable to attack. Your alternatives are to fork the library and accept the maintenance costs, introduce custom validation logic which is likely to be insufficient to prevent abuse, or to redesign your service. CEL supported subsetting allows you to ensure that what was true at the initial product launch will remain true until you decide when to expose more of its functionality to customers.

Cedar Policy language was developed by Amazon. It is open source, implemented in Rust, and offers formal verification. Formal verification allows Cedar to validate policy correctness at authoring time. CEL is not just a policy language, but a broader expression language. It is used within larger policy projects in a way that allows users to augment their existing systems rather than adopt an entirely new one.

Formal verification often has challenges scaling to large amounts of data and is based on the belief that a formal proof of behavior is sufficient evidence of compliance. However, regulations and policies in natural language are often ambiguous. This means that logical translations of these policies are inherently ambiguous as well. CEL's approach to this challenge of compliance and reasoning about potentially ambiguous behaviors is to support evaluation over partial data at a very high speed and with minimal resources.

CEL is fast enough to be used in networking policies and expressive enough to be used in detailed application policies. Having a greater coverage of use cases improves your ability to reason about behavior holistically. But, what if you don't have all the data yet to know exactly how the policy will behave? CEL supports partial evaluation using four-valued logic which makes it possible to determine cases which are definitely allowed, denied, or where policy behavior is conditional on additional inputs. This allows for what-if analysis against historical data as well as against speculative data from new use cases or proposed policy changes.

Open Policy Agent's Rego is also open source, implemented in Golang and based on Datalog, which makes it possible to offer similar proofs as Cedar. However, the language is much more powerful than Cedar, and more powerful than CEL. This expressiveness means that OPA Rego is fantastic for non-latency critical, single-tenant solutions, but difficult to safely embed in existing offerings.

Four-valued Logic

CEL uses commutative logical operators that can render a true, false, error, or unknown status. This is a scalable alternative to formal verification and the expressiveness of Datalog. Four-valued logic allows CEL to evaluate over a partial set of inputs to deliver either a definitive result or communicate that more information is necessary to complete the evaluation.

What is four-valued logic?

True, false, and error outcomes are considered definitive: no additional information will change the outcome of the expression. Consider the following case:

1/0 != 0 && false

In traditional programming languages, this expression would be an error; however, in CEL the outcome is false.

Now consider the following case where an input variable, called unknown_var is marked as unknown:

unknown_var && true

The outcome of this expression is UNKNOWN{unknown_var} indicating that once the variable is known, the evaluation can be safely completed. An unknown indicates what was missing, and alerts the user to fix the outcome with more information. This technique both borrows from and extends SQL three-valued predicate logic which uses TRUE, FALSE, and NULL with commutative logical operators. From a CEL perspective, the error state is akin to SQL NULL that arises when there is an absence of information.

CEL compatibility with SQL

CEL leverages SQL semantics to ensure that it can be seamlessly translated to SQL. SQL optimizers perform significantly better over large data sets, making it possible to evaluate over data at rest. Imagine trying to scale a single expression evaluation over tens of millions of records. Even if the expression evaluates within a single microsecond, the evaluation would still take tens of seconds. The more complex the expression, the greater the latency. SQL excels at this use case, so translation from CEL to SQL was an important factor in the design in order to unlock the possibility of performant policy checks both online with CEL and offline with SQL.

Thank you CEL Community!

We’re proud to announce cel.dev as a major milestone in the maturity and adoption of the language, and we look forward to working with you to make CEL the best building block for writing latency-critical, portable logic. Feel free to contact us at cel-lang-discuss@googlegroups.com

By Tristan Swadell – Senior Staff Software Engineer

Driving etcd Stability and Kubernetes Success

Thursday, June 13, 2024

Introduction: The Critical Role of etcd in Cloud-Native Infrastructure

Imagine a cloud-native world without Kubernetes. It's hard, right? But have you ever considered the unsung hero that makes Kubernetes tick? Enter etcd, the distributed key-value store that serves as the central nervous system for Kubernetes. Etcd's ability to consistently store and replicate critical cluster state data is essential for maintaining the health and harmony of distributed systems.

etcd: The Backbone of Kubernetes

Think of Kubernetes as a magnificent vertebrate animal, capable of complex movements and adaptations. In this analogy, etcd is the animal's backbone – a strong, flexible structure that supports the entire system. Just as a backbone protects the spinal cord (which carries vital information), etcd safeguards the critical data that defines the Kubernetes environment. And just as a backbone connects to every other part of the body, etcd facilitates communication and coordination between all the components of Kubernetes, allowing it to move, adapt, and thrive in the dynamic world of distributed systems.

Credit: Original image xkcd.com/2347, alterations by Josh Berkus.

Google's Deep-Rooted Commitment to Open Source

Google has a long history of contributing to open source projects, and our commitment to etcd is no exception. As the initiator of Kubernetes, Google understands the critical role that etcd plays in its success. Google engineers consistently invest in etcd to enhance its functionality and reliability, driven by their extensive use of etcd for their own internal systems.

Google's Collaborative Impact on etcd Reliability

Google engineers have actively contributed to the stability and resilience of etcd, working alongside the wider community to address challenges and improve the project. Here are some key areas where their impact has been felt:

Post-Release Support: Following the release of etcd v3.5.0, Google engineers quickly identified and addressed several critical issues, demonstrating their commitment to maintaining a stable and production-ready etcd for Kubernetes and other systems.

Data Consistency: Early Detection and Swift Action: Google engineers led efforts to identify and resolve data inconsistency issues in etcd, advocating for public awareness and mitigation strategies. Drawing from their Site Reliability Engineering (SRE) expertise, they fostered a culture of "blameless postmortems" within the etcd community—a practice where the focus is on learning from incidents rather than assigning blame. Their detailed postmortem of the v3.5 data inconsistency issue and a co-presented KubeCon talk served to share these valuable lessons with the broader cloud-native community.

Refocusing on Stability and Testing: The v3.5 incident highlighted the need for more comprehensive testing and documentation. Google engineers took action on multiple fronts:

Improving Documentation: They contributed to creating "The Implicit Kubernetes-ETCD Contract," which formalizes the interactions between the two systems, guiding development and troubleshooting.

Automating Detection: They spearheaded the development of a more efficient data corruption detection mechanism for etcd v3.5, significantly reducing detection latency.

Prioritizing Stability and Testing: They developed the "etcd Robustness Tests," a rigorous framework simulating extreme scenarios to proactively identify inconsistency and correctness issues.

These contributions have fostered a collaborative environment where the entire community can learn from incidents and work together to improve etcd's stability and resilience. The etcd Robustness Tests have been particularly impactful, not only reproducing all the data inconsistencies found in v3.5 but also uncovering other bugs introduced in that version. Furthermore, they've found previously unnoticed bugs that existed in earlier etcd versions, some dating back to the original v3 implementation. These results demonstrate the effectiveness of the robustness tests and highlight how they've made etcd the most reliable it has ever been in the history of the project.

etcd Robustness Tests: Making etcd the Most Reliable It's Ever Been

The "etcd Robustness Tests," inspired by the Jepsen methodology, subject etcd to rigorous simulations of network partitions, node failures, and other real-world disruptions. This ensures etcd's data consistency and correctness even under extreme conditions. These tests have proven remarkably effective, identifying and addressing a variety of issues:

For deeper insights into ensuring etcd's data consistency, Marek Siarkowicz's talk, "On the Hunt for Etcd Data Inconsistencies," offers valuable information about distributed systems testing and the innovative approaches used to build these tests. To foster transparency and collaboration, the etcd community holds bi-weekly meetings to discuss test results, open to engineers, researchers, and other interested parties.

The Kubernetes-etcd Contract: A Partnership Forged in Rigorous Testing

To solidify the Kubernetes-etcd partnership, Google engineers formally defined the implicit contract between the two systems. This shared understanding guided development and troubleshooting, leading to improved testing strategies and ensuring etcd meets Kubernetes' demanding requirements.

When subtle issues were discovered in how Kubernetes utilized etcd watch, the value of this formal contract became clear. These issues could lead to missed events under specific conditions, potentially impacting Kubernetes' operation. In response, Google engineers are actively working to integrate the contract directly into the etcd Robustness Tests to proactively identify and prevent such compatibility issues.

Conclusion: Google's Continued Commitment to etcd and the Cloud-Native Community

Google's ongoing investment in etcd underscores their commitment to the stability and success of the cloud-native ecosystem. Their contributions, along with the wider community's efforts, have made etcd a more reliable and performant foundation for Kubernetes and other critical systems. As the ecosystem evolves, etcd remains a critical linchpin, empowering organizations to build and deploy distributed applications with confidence. We encourage all etcd and Kubernetes contributors to continue their active participation and contribute to the project's ongoing success.

By Marek Siarkowicz – GKE etcd

The Kubernetes ecosystem is a candy store

Monday, June 3, 2024

For the 10th anniversary of Kubernetes, I wanted to look at the ecosystem we created together.

I recently wrote about the pervasiveness and magnitude of the Kubernetes and CNCF ecosystem. This was the result of a deliberate flywheel. This is a diagram I used several years ago:

Flywheel diagram of Kubernetes and CNCF ecosystem

Because Kubernetes runs on public clouds, private clouds, on the edge, etc., it is attractive to developers and vendors to build solutions targeting its users. Most tools built for Kubernetes or integrated with Kubernetes can work across all those environments, whereas integrating directly with cloud providers directly entails individual work for each one. Thus, Kubernetes created a large addressable market with a comparatively lower cost to build.

We also deliberately encouraged open source contribution, to Kubernetes and to other projects. Many tools in the ecosystem, not just those in CNCF, are open source. This includes many tools built by Kubernetes users and tools built by vendors but were too small to be products, as well as those intended to be the cores of products. Developers built and/or wrote about solutions to problems they experienced or saw, and shared them with the community. This made Kubernetes more usable and more visible, which likely attracted more users.

Today, the result is that if you need a tool, extension, or off-the-shelf component for pretty much anything, you can probably find one compatible with Kubernetes rather than having to build it yourself, and it’s more likely that you can find one that works out of the box with Kubernetes than for your cloud provider. And often there are several options to choose from. I’ll just mention a few. Also, I want to give a shout out to Kubetools, which has a great list of Kubernetes tools that helped me discover a few new ones.

For example, if you’re an application developer whose application runs on Kubernetes, you can build and deploy with Skaffold, test it on Kubernetes locally with Minikube, or connect to Kubernetes remotely with Telepresence, or sync to a preview environment with Gitpod or Okteto. When you need to debug multiple instances, you can use kubetail to view the logs in real time.

To deploy to production, you can use GitOps tools like FluxCD, ArgoCD, or Google Cloud’s Config Sync. You can perform database migrations with Schemahero. To aggregate logs from your production deployments, you can use fluentbit. To monitor them, you have your pick of observability tools, including Prometheus, which was inspired by Google’s Borgmon tool similar to how Kubernetes was inspired by Borg, and which was the 2nd project accepted into the CNCF.

If your application needs to receive traffic from the Internet, you can use one of the many Ingress controllers or Gateway implementations to configure HTTPS routing, and cert-manager to obtain and renew the certificates. For mutual TLS and advanced routing, you can use a service mesh like Istio, and take advantage of it for progressive delivery using tools like Flagger.

If you have a more specialized type of workload to run, you can run event-driven workloads using Knative, batch workloads using Kueue, ML workflows using Kubeflow, and Kafka using Strimzi.

If you’re responsible for operating Kubernetes workloads, to monitor costs, there’s kubecost. To enforce policy constraints, there’s OPA Gatekeeper and Kyverno. For disaster recovery, you can use Velero. To debug permissions issues, there are RBAC tools. And, of course, there are AI-powered assistants.

You can manage infrastructure using Kubernetes, such as using Config Connector or Crossplane, so you don’t need to learn a different syntax and toolchain to do that.

There are tools with a retro experience like K9s and Ktop, fun tools like xlskubectl, and tools that are both retro and fun like Kubeinvaders.

If this makes you interested in migrating to Kubernetes, you can use a tool like move2kube or kompose.

This just scratched the surface of the great tools available for Kubernetes. I view the ecosystem as more of a candy store than as a hellscape. It can take time to discover, learn, and test these tools, but overall I believe they make the Kubernetes ecosystem more productive. To develop any one of these tools yourself would require a significant time investment.

I expect new tools to continue to emerge as the use cases for Kubernetes evolve and expand. I can’t wait to see what people come up with.

By Brian Grant, Distinguished Engineer, Google Cloud Developer Experience