
Securing Machine Learning Platforms: Tradeoffs and Takeaways

ML Platform Engineering Tips

I was recently reading about a new vulnerability affecting AI infrastructure environments: companies affected by ShadowRay are at risk of remote code execution attacks that can lead to disclosure of their data, ransomware attacks, or financial loss from crypto-mining on their infrastructure.

The issue affects companies running Ray clusters where the dashboard server is exposed to the Internet. Unfortunately for these companies, the dashboard does not simply display information about your Ray cluster; it also lets you submit programs to run on the cluster.
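To see why exposure matters, a first step any attacker takes is simply probing whether the dashboard port answers. A minimal sketch of that check, using only the standard library (the function name and the default port 8265, Ray's usual dashboard port, are my own illustrative choices, and this is a plain TCP reachability test, not the actual exploit):

```python
import socket

def dashboard_reachable(host: str, port: int = 8265, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A True result only means the port answers; pointing this at a public IP
    and getting True is the red flag that a Ray dashboard may be exposed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this against your own cluster's public address is a quick sanity check; anything beyond that (enumerating the jobs API, submitting work) is exactly what the ShadowRay attackers automated.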

Thousands of companies have been affected by this over the last year.

My first takeaway from this news is that Ray is extremely popular. My second is that, despite initial reflexes, I don't think this story is as simple as an oversight on the part of Ray developers or users. What makes it more confusing is that it is called ShadowRay because it is a "shadow" vulnerability: the Ray developers (Anyscale) have confirmed that the software is working as designed but is being deployed incorrectly.

Let’s unpack this by thinking about ML platform security as a whole.

On good ML platforms

Good ML platforms allow teams to iterate quickly, simply, and securely through their MLOps workflows. They let engineers build without going out of their depth on cloud engineering, giving them abstractions such as notebook servers, compute clusters, and artefact stores.

Good ML platforms are secure: they let you onboard new developers, contractors, and customers without burdensome new processes that destroy your operational efficiency.

Good ML platforms mitigate the following attack vectors:

  • Supply-chain attacks - every machine learning product imports hundreds or thousands of independently managed components. Each of these components can be compromised at source, impacting the whole supply chain.

  • Product API attacks - customer-facing systems, which may be public-facing, can be exploited if well-known vulnerabilities (such as Log4Shell) are not patched.

  • Runtime attacks - is data in the platform accessible to cloud infrastructure admins and vendors? Certain use cases warrant approaches that limit this.

  • Developer API attacks - as with ShadowRay, it is very common for developer tools to be used to perform a cyber attack. Unsecured S3 buckets, long-lived secrets, and a lack of role-based access controls are common challenges here.
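For the supply-chain vector specifically, one widely used mitigation is pinning every dependency to an exact version and artifact hash, so a compromised upstream release fails to install rather than silently entering your build. A sketch using pip's hash-checking mode (the package version and digest below are placeholders, not real values):

```shell
# requirements.txt pins each package to a version AND a sha256 digest:
#
#   ray==2.9.3 \
#       --hash=sha256:<digest-of-the-wheel-you-audited>
#
# With --require-hashes, pip refuses any artifact whose digest does not
# match, including transitive dependencies that lack a pinned hash:
pip install --require-hashes -r requirements.txt
```

This does not stop a malicious version you pinned yourself, but it does stop an upstream artifact being swapped out after you audited it.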

Bear in mind that security is, to a great extent, about mitigation and hygiene. Every company has security incidents, such as running insecure software, but the damaging breaches tend to come from companies that have had a series of vulnerabilities exploited.

Tradeoffs and Takeaways

Security is an ever-present concern in the process of assembling an appropriate platform. One of the most common tradeoffs you can make in this department is outsourcing the operation of some parts of your stack.

As with almost all cloud products, the benefit of ML platforms as a service is the shared responsibility model they entail. Some examples of this:

  1. Fully managed platform (e.g. AWS Sagemaker) — you secure your containers and users, and AWS secures the rest

  2. Partly managed platform (e.g. Databricks) — you secure your VPC, containers, and users, and Databricks secures the rest

  3. Self-managed platform (e.g. Kubeflow on Google Cloud) — you secure your ML platform services (e.g. Artifact store API), and your cloud or infra provider secures the runtime.

A key takeaway from the ShadowRay vulnerability is that if you want the cost and customisation benefits of self-managing, you need to understand the security obligations that come with it. Open-source projects such as Ray and MLflow can be self-hosted, but they require locking down, especially on the developer API side.
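What "locking down" looks like in practice is often just refusing to bind developer-facing services to public interfaces, then fronting them with something that authenticates. A deployment-script sketch (the reverse-proxy layer is assumed and not shown; flags are from the Ray and MLflow CLIs):

```shell
# Keep the Ray dashboard on the loopback interface. Recent Ray versions
# default to this, but being explicit in deployment scripts guards
# against someone "fixing" connectivity by binding to 0.0.0.0:
ray start --head --dashboard-host=127.0.0.1

# Likewise, serve MLflow only on an internal address, and put an
# authenticating reverse proxy (or your cloud's IAP/VPN) in front:
mlflow server --host 127.0.0.1 --port 5000
```

The pattern generalises: the developer API should never be the layer that faces the Internet.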

Security features are often sold at a premium by the companies that maintain these open-source platforms. This has been raised as an ethical concern, but I believe it's a net good that we can use these tools for free.

Conclusion

Cloud security is a multifaceted problem that you will encounter in ML platform building regardless of the approach you take. By understanding the goals of potential attackers, the attack vectors they use, and the tradeoffs available for mitigating the risk, you can ensure that your ML operations continue to scale.