Llama 3 and scalable machine learning

Treebeard Update

Friends of Treebeard,

AI model leaderboards continue to be upended by new releases. Google DeepMind’s Gemma, Mistral’s Mixtral 8×22B, and Meta’s Llama 3 are some of the most recent.

Whilst it’s newsworthy to post benchmark-topping open source models, it is just as important that the newest LLMs are being designed to fit more readily into products.

Beyond Benchmarks

Benchmarks give a great initial indication of how powerful a model is, but a model’s value in an application also depends on more practical metrics such as:

Context Window Size

The context window size is the number of tokens that a model can accept in one invocation. Larger context windows let you prompt the model with more information. This is especially useful for programming and document Q&A assistants.
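
As a rough, concrete sketch (assuming the tiktoken tokeniser and a hypothetical 8,192-token limit), checking whether a prompt fits a context window can look like this:

    # Minimal sketch: does this prompt fit the model's context window?
    # tiktoken is used for illustration; the 8,192-token limit is hypothetical.
    import tiktoken

    CONTEXT_WINDOW = 8_192  # substitute your model's actual limit
    enc = tiktoken.get_encoding("cl100k_base")

    prompt = "Summarise the attached design document."  # plus any retrieved context
    n_tokens = len(enc.encode(prompt))

    if n_tokens > CONTEXT_WINDOW:
        print(f"{n_tokens} tokens: trim or chunk the prompt to fit {CONTEXT_WINDOW}")
    else:
        print(f"{n_tokens} tokens: fits with {CONTEXT_WINDOW - n_tokens} to spare")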

Google’s Gemini 1.5 Pro has a 1 million token context window. It is unclear to me which applications will benefit most from this in the short term, especially given that the cost of a model inference scales with the number of tokens being processed.
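
To make the cost point concrete, here is a back-of-the-envelope sketch; the per-token price is a made-up placeholder, not any provider’s actual rate:

    # Back-of-the-envelope cost of one long-context call.
    # The price below is a hypothetical placeholder, not a quoted rate.
    PRICE_PER_MILLION_INPUT_TOKENS = 5.00  # USD, hypothetical
    prompt_tokens = 1_000_000              # filling a 1M-token context window

    cost_per_call = prompt_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
    print(f"~${cost_per_call:.2f} per call, before any output tokens")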

Inference Speed (tokens per second)

Tokens per second depends on several factors, ranging from the model’s architecture and size to the GPU’s memory bandwidth; a quick way to measure it is sketched after this list. It is important because:

  • More tokens per second means faster apps and therefore a better user experience

  • Faster inference means chain-of-thought prompting can be used to greater effect, resulting in higher-quality responses

  • Better economics - you can serve more concurrent users per unit of hardware
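
If you want to measure this for an open model yourself, here is a minimal sketch using the Hugging Face transformers library, with GPT-2 standing in for whichever model you actually serve:

    # Minimal sketch: measure generation throughput in tokens per second.
    # GPT-2 is a small stand-in; swap in the model you actually serve.
    import time
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("Explain what a context window is.", return_tensors="pt")

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"{new_tokens / elapsed:.1f} tokens/sec")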

Better trust and safety features

Built-in trust and safety features allow you to roll out a model with less risk. (ICYMI, Air Canada was found liable for its hallucinating customer support chatbot back in February.)

As these new releases expand what can be built with AI, they reset the clock on how early it feels in this wave of technology.

From an infrastructure perspective, many norms and processes are still being written in this new era. As LLMs progress through the top of the hype cycle and into a more pragmatic phase, here are some predictions and pointers for managing operational costs whilst delivering AI systems.

Scaling Efficiently

As I alluded to in a recent blog post about high-performance computing systems, the following practices let you run a larger AI operation without prohibitive costs:

  1. Use engineers’ time efficiently - many organisations run at 10-20% effectiveness because of engineering toil, ramp-up times, and employee burnout and churn. Some of the worst cases I have seen involve teams being blocked by other teams because of architectural security issues.

  2. Use hardware resources efficiently - architect your stack frugally: write code that works on spot instances and can be moved between clouds, and avoid over-provisioning developer infrastructure (a checkpointing sketch for spot instances follows this list)

  3. Mitigate tail risks - both security (see recent blog) and availability risks. This is largely a matter of operational hygiene. Most teams manage these issues successfully, but doing so often involves process changes that hurt everyone’s productivity.
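
On the spot-instance point, here is a minimal sketch of what “works on spot instances” can mean in practice: checkpoint periodically so training resumes after an interruption. The model, data, and checkpoint path below are placeholders.

    # Minimal sketch: periodic checkpointing so training survives spot interruptions.
    # The model, data, and checkpoint path are placeholders.
    import os
    import torch
    import torch.nn as nn

    CKPT_PATH = "checkpoint.pt"  # keep this on durable storage (mounted volume, object store)

    model = nn.Linear(10, 1)  # stand-in for your real model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    start_step = 0

    # Resume if a previous run was interrupted
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH)
        model.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 1_000):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in batch
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

        if step % 100 == 0:  # checkpoint often enough to bound lost work
            torch.save({"model": model.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, CKPT_PATH)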

As I found during my time at Amazon, operational excellence is often the remit of a dedicated department or platform team, which can find inventive ways to solve scaling challenges tailored to its specific organisation.

Pre-fabricated MLOps platforms

If you want machine learning engineers to use their time more effectively (especially in model training scenarios), give them platforms that make cloud infrastructure useful to them.

One such project that I recently discovered is MLInfra, which helpfully categorises the services needed by an ML engineer into a set of problem areas (including data versioning, orchestrators and model serving).

It is built by a platform engineer at Synthesia, and in an interesting conversation comparing it to our Kubeflow Bootstrap, we discussed how MLInfra allows teams to:

  • Simplify machine learning engineering with minimal platform engineering

  • Standardise a confusing marketplace of tools that can be misused into a clean set of categories with standard deployment options - this helps avoid the security issues that can creep in when you glue tools together

  • Scale to a larger number of customers as the product grows by switching the config from “cloud_vm” to “Kubernetes”, ensuring you use the cheapest hardware for the task at hand

ML Platform Engineering Tips

Just as every product is constantly being improved, so are the platforms that enable it. If you need to sharpen your platform expertise, check out our recent blogs on building and securing machine learning pipelines.