Running LLMs on a Steam Deck in 5 minutes

TL;DR

[Video: a real-time demonstration, not sped up]

The Steam Deck is a surprisingly capable device for running local LLMs. With a bit of setup, you can have a portable, powerful inference machine.

Component   Specification
APU         7 nm AMD APU
CPU         Zen 2, 4-core/8-thread, 2.4-3.5 GHz
GPU         8 RDNA 2 compute units, 1.6 GHz
RAM         16 GB LPDDR5 @ 5500 MT/s

A Note on VRAM

The Steam Deck shares its 16GB of RAM between the system and the GPU. By default, the GPU is allocated 1GB of this shared memory. However, you can increase this to 4GB in the BIOS, which is highly recommended for running larger models.

To change the UMA frame buffer size:
  1. Shut down the Steam Deck completely.
  2. Hold Volume Up (+) and press the Power button. Release the power button but keep holding Volume Up until you hear a chime.
  3. Select Setup Utility.
  4. Navigate to Advanced.
  5. Set UMA Frame Buffer Size to 4G.
  6. Go to Save & Exit and select Save Changes and Exit.

1. Enable SSH

First, we need to get shell access to the Steam Deck.
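SteamOS ships with OpenSSH but leaves the daemon disabled. A minimal sketch of the steps, assuming you're in Desktop Mode with a terminal open (the default user is "deck"):

```shell
# Set a password for the "deck" user if you haven't already
# (SSH logins are refused for accounts without a password)
passwd

# Enable the SSH daemon and start it immediately
sudo systemctl enable --now sshd

# Find the Deck's IP address so you can connect from another machine
ip -4 addr show
```

From your main machine, connect with `ssh deck@<steam-deck-ip>`.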

2. Create a Distrobox Container

Next, we’ll create an Ubuntu container to house our development environment, keeping the base SteamOS installation clean. We’ll use Distrobox, a tool for creating and managing containerized development environments on top of the host system.

From now on, everything we do will be inside the container (your SteamOS will remain untouched). You can exit the container at any time by typing exit.
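The container setup looks like this (recent SteamOS releases ship Distrobox and Podman out of the box; the container name and Ubuntu image tag here are my choices, not requirements):

```shell
# Create an Ubuntu container; it shares your home directory with the host
distrobox create --name llm-box --image ubuntu:24.04

# Enter the container; everything from here on runs inside it
distrobox enter llm-box
```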

3. Install Dependencies

Inside the Distrobox container, we need to install the tools required to build llama.cpp with GPU acceleration.

sudo apt update && sudo apt install -y \
  build-essential cmake git ccache \
  mesa-vulkan-drivers libvulkan-dev vulkan-tools \
  glslang-tools glslc libshaderc-dev spirv-tools \
  libcurl4-openssl-dev ca-certificates kmod \
  rocminfo radeontop nano ncdu nload screen tmux pigz unzip iotop htop
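Before building, it's worth checking that the Vulkan driver inside the container can actually see the Deck's GPU:

```shell
# Should list an AMD RDNA 2 device (reported as something like "AMD Custom GPU 0405")
vulkaninfo --summary
```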

4. Build llama.cpp

Now we can clone the llama.cpp repository and build it with Vulkan support.
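A sketch of the clone-and-build step. llama.cpp's CMake option for the Vulkan backend is `GGML_VULKAN`:

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Configure a Release build with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=ON

# Build using all available threads (the Deck's CPU has 8)
cmake --build build --config Release -j"$(nproc)"
```

The binaries land in `build/bin/`.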

5. Run an LLM!

With llama.cpp built, you can now run a model. This command will download and run Gemma 3 1B, using the Steam Deck’s GPU for acceleration.

./build/bin/llama-cli \
  -hf unsloth/gemma-3-1b-it-GGUF \
  --gpu-layers -1 \
  -p "Say hello from Steam Deck GPU."

You should see the model generate a response, with the GPU taking on the bulk of the work.
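To confirm the GPU is actually doing the work, you can watch its utilization in a second terminal with radeontop (installed earlier) while the model generates:

```shell
# Live view of GPU load and graphics memory usage
radeontop
```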

6. Power Consumption

A quick note on power: while running the Gemma 3 1B model, the GPU consumes around 10 W. Factoring in the CPU, the total power draw is between 20 and 25 W. This makes the Steam Deck a surprisingly efficient device for running even quantized 7B models.

Conclusion

The ability to run quantized 7B models in such a small and power-efficient form factor is incredible. I wonder if there are any good role-play finetunes that fit in the Deck’s VRAM? Who else is thinking about building an AI companion that you can speak to (and that can speak back)?

Bilawal.net

© 2025
