Treebeard Update

On cloud-agnostic AI

Friends of Treebeard,

As it’s KubeCon in Paris this week, it’s a good time to share some of our recent contributions in the cloud-native space and reflect on the state of AI in Kubernetes.

How AI meets Kubernetes/Cloud Native

AI systems often run in the cloud, but they are hard to build directly on top of raw cloud infrastructure.

This is why machine learning platforms like SageMaker, Databricks, and Azure ML exist. They let engineers glide through their DevOps workflows using mostly Python, without worrying too much about infrastructure.

Now, with generative AI, these platforms also offer services around proprietary foundation models from OpenAI, Anthropic, and Cohere: text/image generation, fine-tuning, and evaluation.

As self-hosting technologies and open-source models improve, companies and even consumers are weighing the improved security, lower cost, and greater customisability of installing platforms like these in their own Kubernetes environments.

Going deep on Kubeflow

The most complete solution for operating an AI app on your own Kubernetes cluster is Kubeflow. It has cloud-native alternatives to many of the technologies available in the platforms listed above:

  • KServe for serverless model endpoints (sketched just after this list)

  • Kubeflow Pipelines for batch execution (second sketch below)

  • Katib for AutoML

  • TensorBoard and Model Registry for training workflows

  • Workbench environments for DevOps tasks
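
To make the first of these concrete, here is a minimal sketch of creating a serverless endpoint with KServe’s Python SDK. The service name, namespace, and model URI are illustrative placeholders rather than details of any real deployment.

    # Minimal sketch: serve a scikit-learn model as a serverless endpoint
    # using the KServe Python SDK. The name, namespace, and storage URI
    # below are illustrative placeholders.
    from kubernetes.client import V1ObjectMeta
    from kserve import (
        KServeClient,
        V1beta1InferenceService,
        V1beta1InferenceServiceSpec,
        V1beta1PredictorSpec,
        V1beta1SKLearnSpec,
    )

    isvc = V1beta1InferenceService(
        api_version="serving.kserve.io/v1beta1",
        kind="InferenceService",
        metadata=V1ObjectMeta(name="sklearn-iris", namespace="kubeflow-user"),
        spec=V1beta1InferenceServiceSpec(
            predictor=V1beta1PredictorSpec(
                sklearn=V1beta1SKLearnSpec(
                    storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
                )
            )
        ),
    )

    # KServe (built on Knative) provisions a scale-to-zero HTTP endpoint.
    KServeClient().create(isvc)

In a full Kubeflow install, the endpoint then shows up in the dashboard’s models view and scales down to zero when idle.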

Kubeflow has uniquely succeeded at combining these components into a single dashboard that can be deployed into a Kubernetes cluster.
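
For batch execution, the Kubeflow Pipelines (KFP) v2 SDK lets you define a pipeline in plain Python. The sketch below is illustrative: the component and pipeline names are made up, and the compiled YAML can be uploaded through that same dashboard.

    # Minimal sketch of a batch pipeline using the KFP v2 SDK.
    # Component and pipeline names are illustrative.
    from kfp import compiler, dsl

    @dsl.component(base_image="python:3.11")
    def train_model(learning_rate: float) -> str:
        # Placeholder: a real step would fit a model and persist it.
        print(f"training with learning_rate={learning_rate}")
        return "model-v1"

    @dsl.component(base_image="python:3.11")
    def evaluate_model(model_name: str):
        # Placeholder: a real step would compute metrics on held-out data.
        print(f"evaluating {model_name}")

    @dsl.pipeline(name="demo-batch-pipeline")
    def demo_pipeline(learning_rate: float = 0.01):
        trained = train_model(learning_rate=learning_rate)
        evaluate_model(model_name=trained.output)

    # Compile to a YAML spec that the Kubeflow dashboard can run on a schedule.
    compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")

Each step runs as its own pod, so the same definition works whether a run is a quick experiment or a scheduled production job.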

Kubeflow uses lower-level technologies like Istio, Dex, and Knative to expose a developer-friendly interface on top of the standard microservices primitives that Kubernetes implements.

So, what makes Kubeflow better than cloud-native technologies that provide compelling point solutions, such as MLflow, Weights & Biases, Ray, and JupyterHub? Its unified organisation allows for a single point of accountability for the productivity and security of the entire machine-learning workflow.

Whilst users may decide to augment Kubeflow for their specific AI products, having something that works out of the box makes it a viable alternative to the non-K8s platforms.

Introducing kubeflow-bootstrap

The downside of having such an integrated stack of tools is that it is harder to understand and manage than, say, a standalone MLflow server.

Kubeflow pushes the boundaries of what can be handled by a single cluster and DevOps team, to the extent that adopting it requires rethinking the tools we use to manage clusters.

This is where we have made our first contribution to the Kubeflow ecosystem: our kubeflow-bootstrap project provides a 1-click experience for starting a Kubeflow deployment and sets infrastructure teams up with the right practices to scale their usage of Kubeflow.

Read more about it in our recent blog post, or see the code.

How could Kubeflow be better?

Last year, Google donated Kubeflow to the Cloud Native Computing Foundation (CNCF). It is an incubating-stage project: stable and with a strong user base, but still maturing its governance and security processes to reach the level of a graduated project.

From a technical perspective, Kubeflow has some powerful components and is accumulating more — Google donated the Apache Spark K8s Operator to Kubeflow last week. But as the capabilities grow, so does the cost of harnessing them.

As the project moves towards CNCF graduation, we would ideally like to see three improvements:

  1. Better documentation and features for managing multiple Kubeflow instances in a large organisation

  2. Security tools for managing user access to UIs and APIs in a Kubeflow instance

  3. More support for working with Spark and Ray clusters, which are indispensable for many ML products

Interested in trying out or collaborating on Kubeflow? Let’s connect!

If you know someone who may enjoy this read, please share it with them.