All your TFLOPS are belong to me

Mahindra École Centrale procured Nvidia’s DGX-1 in early 2019 to promote on-campus research in AI/ML. A few weeks later, on May 27th, a team from Nvidia’s Deep Learning Institute came over to introduce the behemoth and teach students and faculty how to use those sweet 1000 TeraFlops. Until then, most students, including me, had relied on the free tiers of Google, Azure, and Amazon (in no particular order) for all our deep learning and HPC-related needs. Naturally, I joined the training sessions to see what the fuss was all about.

On the first day, an email went out to everyone with the ssh credentials to log in to DGX. As someone who’s always been fascinated by how things work, I started scouring the system’s configuration to see how it differed from the peasant laptops and desktops that we all own, and sure enough, it was as I presumed. I had worked on Maverick-2 during my summer internship at UT Austin, so I knew what to expect. NUMA nodes, GPU interconnects, RAID: pretty standard stuff, though still a far cry from actual supercomputers such as Maverick-2.

But something stood out: the kernel version. A bit of background: Nvidia services the OS with their own versioning, using Ubuntu as the base. This version, 3.1.6, released in May 2018, was based on Ubuntu 16.04 LTS with kernel 4.4.0-116-generic. This seemed pretty odd, given that the then-latest OS release was 4.1. I was bewildered when I found out that this kernel was susceptible to a local privilege escalation exploit. It had to do with the Berkeley Packet Filter, a VM in the kernel. I assumed the patch must’ve been applied (but the patch version said otherwise), or a workaround must’ve been put in place. I mean, no one would just ship a system with an active exploit and call it a day, would they? I got in touch with the professor in charge of the supercomputer, shared my findings, and asked if I could dig further. With his approval, I started experimenting during a boring part of the lecture the next day.
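
As a rough illustration of that first check, here is a minimal Python sketch that parses a kernel release string like the one above and reads the kernel.unprivileged_bpf_disabled sysctl, which controls whether unprivileged users may call bpf() at all. The helper names are hypothetical, and the sketch only gathers facts: whether a given release is actually patched still has to be confirmed against the distribution's security notices.

```python
import platform
import re
from pathlib import Path

# Sysctl that, when non-zero, blocks unprivileged use of the bpf() syscall.
BPF_SYSCTL = Path("/proc/sys/kernel/unprivileged_bpf_disabled")


def parse_kernel_release(release):
    """Turn '4.4.0-116-generic' into (4, 4, 0, 116) for easy comparison."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # Missing components (e.g. no ABI number) are treated as 0.
    return tuple(int(part) for part in match.groups("0"))


def unprivileged_bpf_disabled(path=BPF_SYSCTL):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    release = platform.release()
    print("kernel release:", release, "->", parse_kernel_release(release))
    print("unprivileged_bpf_disabled:", unprivileged_bpf_disabled())
```

Because the parsed release is a plain tuple, two releases can be compared with the usual < and > operators once you know which release carries the fix.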

It’s pretty easy to check whether your system has been patched and whether an exploit actually works against it. Just search for the kernel version followed by ‘exploits’ and keep an eye out for exploit-db.com links, as they host the proof-of-concept (PoC) code. Read the code, understand what it’s doing, and if you’re confident it’s not up to something underhanded (apart from running the exploit), run it on your local machine. You can probably guess where this is going.

I sent an email to the IT department warning them of the issue and asking them to take DGX offline, less than 24 hours after it was deployed. In the meantime, I took the liberty of migrating everyone’s login shell from sh to bash and installing nano, the simple text editor. The last thing you want is students stuck in vim, or worse, a bunch of vim sessions running in the background, abandoned.

That wasn’t all! The marketing material for DGX-1 prominently points out that Docker comes installed to simplify package management. Just pull the container you want and run your code in it, easy. No fumbling around with pip, CUDA, version conflicts, etc. I had been playing around with Docker for a while, hosting the previous version of my website in an Nginx container to take advantage of CI/CD and automated deploys. From this experience, I knew Docker support came with a big asterisk. It all boils down to how Docker works. The Docker daemon runs with root privileges. To talk to it, you must either be root or be part of the docker group. Being part of the latter effectively grants you root privileges. How, you ask? It’s pretty easy. First, run a container with the host filesystem / mounted at /host, then chroot into /host.

$ docker run --rm -it -v /:/host --privileged ubuntu
root@<container-hostname>:/# chroot /host

That’s it. You are effectively root on the host. Want to kill some other process you don’t like? Easy, use pkill. The --privileged flag lifts the container’s capability and device restrictions, and adding --pid=host to the docker run command shares the host’s PID namespace, so host processes are visible and killable from inside the container. If you’re wondering, “Well damn, that’s a major security hole, how come they didn’t fix it?”, this ‘issue’ is documented in Docker’s own security docs. Quoting directly from them,

[…] only trusted users should be allowed to control your Docker daemon. This is a direct consequence of some powerful Docker features.
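
Since docker group membership amounts to root access, it is worth knowing exactly who has it. The sketch below lists a group's members using Python's standard grp and pwd modules (Unix only). It assumes the group is named docker, which is the usual default but can be changed by an administrator, and the function name is hypothetical.

```python
import grp
import pwd


def group_members(group_name="docker"):
    """Return the sorted usernames in a group, or [] if the group does not exist."""
    try:
        group = grp.getgrnam(group_name)
    except KeyError:
        return []
    # Supplementary members are listed directly on the group entry.
    members = set(group.gr_mem)
    # Users whose *primary* group is this one are not in gr_mem, so scan passwd too.
    members.update(
        user.pw_name for user in pwd.getpwall() if user.pw_gid == group.gr_gid
    )
    return sorted(members)


if __name__ == "__main__":
    members = group_members("docker")
    print("users with root-equivalent Docker access:", members or "none found")
```

On systems that pull accounts from a directory service, getpwall() may not enumerate every user, so treat the output as a starting point for an audit.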

What did I do with all this power? Not much, actually, as there was nothing interesting I could accomplish over and above what a normal user already could (apart from being root). I just ran a few benchmarks when no one was using it, to really comprehend its compute capability. Since this system didn’t have any resource limits or usage quotas in place, anyone could run anything they wanted, for as long as they wanted.
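
For a quick look at what per-process limits apply to your own session, a minimal sketch using Python's standard resource module (Unix only) could look like this. The choice of limits and the helper names are illustrative assumptions, and disk quotas are managed separately, so they are not covered here.

```python
import resource

# A few limits that matter on a shared machine; constants come from the resource module.
LIMITS = {
    "max processes (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
    "max open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
    "max address space (RLIMIT_AS)": resource.RLIMIT_AS,
}


def describe_limit(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the current process."""
    return {
        label: tuple(describe_limit(v) for v in resource.getrlimit(res))
        for label, res in LIMITS.items()
    }


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={soft}, hard={hard}")
```

A wall of "unlimited" values on a multi-user box is a hint that nothing stops one user from starving everyone else.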

I reported my findings, which kicked off a long chain of emails to get the supercomputer fixed. By August 2019, DGX-1 sported the latest and greatest OS version and finally supported resource limits (disk quotas, process limits, etc.), ACLs, and a full-fledged job scheduler (Slurm). By the time it was ready, I had been tasked with creating a User Guide¹ to help new students familiarize themselves with DGX.

¹ You can find the final draft I submitted here. For the current version, please contact MEC.