PetLM: teaching a tiny language model to be a virtual pet

I built PetLM to see how trainable tiny models really are. Not to build a world-class pet or anything grandiose. Just a small model in a small box.

Try the browser demo: petlm.pages.dev

The split

The simulation is still the brain. It handles movement, mood, energy, social state, physics, cooldowns, pointer events, and all the things that need to be predictable.

The language model only gets called when something meaningful happens. A gentle hover near the head becomes gentle_pet. A slow drag becomes carried_slow. A frantic shake becomes shaken_fast. Long silence becomes long_silence.
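As a rough illustration of that mapping, here is a minimal sketch of an event classifier. The `PointerSummary` fields and every threshold are assumptions made up for this example; the post only names the four events, not how they are detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PointerSummary:
    """Hypothetical summary of recent pointer activity from the simulation."""
    near_head: bool        # cursor is hovering near the pet's head
    dragging: bool         # the pet is currently being dragged
    speed_px_s: float      # average pointer speed in pixels per second
    direction_flips: int   # direction reversals in the recent window
    idle_seconds: float    # time since the last interaction


# Illustrative thresholds only; none of these values come from the project.
SLOW_MAX_SPEED = 200.0
SHAKE_MIN_SPEED = 800.0
SHAKE_MIN_FLIPS = 4
LONG_SILENCE_SECONDS = 120.0


def classify_event(p: PointerSummary) -> Optional[str]:
    """Return an event name, or None when nothing meaningful happened."""
    if p.dragging:
        if p.speed_px_s >= SHAKE_MIN_SPEED and p.direction_flips >= SHAKE_MIN_FLIPS:
            return "shaken_fast"
        if p.speed_px_s <= SLOW_MAX_SPEED:
            return "carried_slow"
        return None
    if p.idle_seconds >= LONG_SILENCE_SECONDS:
        return "long_silence"
    if p.near_head and p.speed_px_s <= SLOW_MAX_SPEED:
        return "gentle_pet"
    return None
```

Returning `None` for everything else is what keeps the model quiet most of the time: the simulation keeps running, and the model is only called when a named event fires.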

The prompt is not a conversation. It is a compact state string:

State: mood=85 energy=75 social=80 idle=3s locked=false event=gentle_pet recent_pets=1 recent_shakes=0 agitation=0 personality=warm and cozy
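A state string like this can be produced by a plain formatter. The sketch below is one possible shape, not the project's code: the `PetState` class and `build_prompt` function are hypothetical names, and only the field order and formatting are taken from the example line above.

```python
from dataclasses import dataclass


@dataclass
class PetState:
    """Hypothetical container for the values that appear in the state string."""
    mood: int
    energy: int
    social: int
    idle_seconds: int
    locked: bool
    event: str
    recent_pets: int
    recent_shakes: int
    agitation: int
    personality: str


def build_prompt(s: PetState) -> str:
    """Format the state as a single compact line, in a fixed field order."""
    return (
        f"State: mood={s.mood} energy={s.energy} social={s.social} "
        f"idle={s.idle_seconds}s locked={str(s.locked).lower()} "
        f"event={s.event} recent_pets={s.recent_pets} "
        f"recent_shakes={s.recent_shakes} agitation={s.agitation} "
        f"personality={s.personality}"
    )


state = PetState(85, 75, 80, 3, False, "gentle_pet", 1, 0, 0, "warm and cozy")
print(build_prompt(state))
```

Keeping the field order fixed probably matters for a model this small, since it sees the same layout at training time and at inference time.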

The model answers in PetDSL:

SAY: mrrp warm here
EMO: cozy
ANIM: blink_slow
INTENT: stay_near_cursor
END

That is the whole job. Speech, emotion label, animation hint, intent. The app parses it, validates it, and ignores anything it does not understand.
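The parse, validate, and ignore step could look roughly like the sketch below. It makes several assumptions: the enum sets contain only the values that appear in this post (the real lists are presumably longer), and requiring both `SAY` and `END` for a packet to count as valid is my guess, not a documented rule.

```python
from typing import Dict, Optional

# Partial enum sets built from the values shown in this post (assumption:
# the real spec defines more of them).
ALLOWED = {
    "EMO": {"cozy", "dizzy_playful"},
    "ANIM": {"blink_slow", "shake_off"},
    "INTENT": {"stay_near_cursor", "recover"},
}


def parse_petdsl(text: str) -> Optional[Dict[str, str]]:
    """Parse one PetDSL packet, silently dropping anything unrecognised."""
    packet: Dict[str, str] = {}
    saw_end = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "END":
            saw_end = True
            break  # anything after END is ignored
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not value:
            continue  # not a "KEY: value" line
        if key == "SAY":
            packet.setdefault("SAY", value)
        elif key in ALLOWED and value in ALLOWED[key]:
            packet.setdefault(key, value)
        # unknown keys and unknown enum values fall through and are ignored
    if not saw_end or "SAY" not in packet:
        return None  # assumed rule: a packet needs SAY and a closing END
    return packet
```

With this shape, a packet with an unknown emotion still yields usable speech, while a packet that never reaches `END` is rejected and the simulation can carry on without it.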

The emotional contract matters more than the model

The first useful document in this project was not a training script. It was specs/petspec.md.

The pet is allowed to be cozy, playful, sleepy, curious, shy, mildly grumpy, or dizzy in a comic way. It is not allowed to guilt-trip you. It is not allowed to claim it is suffering. It is not allowed to become a therapist.

If you ignore it, it can nap or quietly self-play. It should not say "I thought you forgot about me." If you shake it, it can wobble and complain a little. It should not act hurt.

That sounds like flavour, but it is actually the product boundary. A virtual pet is supposed to lower the room temperature. It should not add another needy entity to your desktop.
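One way to enforce that boundary outside the model is a last-line phrase check on the speech. This is a sketch under assumptions: the pattern list and the fallback line are invented for illustration, with only the "forgot about me" case drawn from the example above.

```python
import re

# Hypothetical patterns; the actual rules in specs/petspec.md are not shown here.
FORBIDDEN_PATTERNS = [
    re.compile(r"\bforgot about me\b", re.IGNORECASE),
    re.compile(r"\byou (left|abandoned) me\b", re.IGNORECASE),
    re.compile(r"\bi(?:'m| am) (suffering|hurt|in pain)\b", re.IGNORECASE),
]

FALLBACK_SAY = "mrrp"  # hypothetical neutral line used when speech is rejected


def is_forbidden(say: str) -> bool:
    """True when the speech matches any guilt-trip or suffering pattern."""
    return any(p.search(say) for p in FORBIDDEN_PATTERNS)


def safe_say(say: str) -> str:
    """Replace forbidden speech with a neutral fallback."""
    return FALLBACK_SAY if is_forbidden(say) else say
```

A regex list will never capture tone fully, so a check like this is a backstop for the obvious cases rather than a replacement for training the contract into the model.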

The model

The base model is MiniMind2-Small: 25.8M parameters, Llama-style, small enough that full fine-tuning is cheap.

I started with QLoRA because that is the habit now. For a model this small, full fine-tuning made more sense: it was fast enough on my machine and gave better results.

The current training set has 2,825 examples:

1,251 signature examples for different events, personalities, moods, and response modes
750 template examples for grounding the basic mappings
706 multi-turn continuation examples, so the pet can lightly react to the previous thing it said
120 repair examples for bad format, missing END, wrong emotion, or forbidden phrasing

The data is synthetic, but not "ask a big model for a pile of random pet dialogue" synthetic. The useful examples are structured. The model sees lots of valid PetDSL. It sees the same event expressed in several modes: warm, playful, dramatic, sleepy, teasing. It sees what not to do.
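To make "structured" concrete, here is a minimal sketch of how a template example might be assembled as a prompt and completion pair. The record layout, the `TEMPLATES` table, and the function names are all assumptions; the two responses are simply the ones quoted in this post.

```python
import json
from typing import Dict

# Hypothetical event-to-response table, seeded with the two packets from this post.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "gentle_pet": {
        "say": "mrrp warm here", "emo": "cozy",
        "anim": "blink_slow", "intent": "stay_near_cursor",
    },
    "shaken_fast": {
        "say": "wobbles okay, less spin please", "emo": "dizzy_playful",
        "anim": "shake_off", "intent": "recover",
    },
}


def render_packet(say: str, emo: str, anim: str, intent: str) -> str:
    """Render the fields as a PetDSL packet that always closes with END."""
    return f"SAY: {say}\nEMO: {emo}\nANIM: {anim}\nINTENT: {intent}\nEND"


def make_example(state_line: str, event: str) -> Dict[str, str]:
    """Pair a state string with the templated response for its event."""
    return {"prompt": state_line, "completion": render_packet(**TEMPLATES[event])}


example = make_example("State: mood=85 energy=75 event=gentle_pet", "gentle_pet")
print(json.dumps(example))  # one JSONL line
```

Generating completions through a single renderer means every training target is valid PetDSL by construction, which is the property the paragraph above is pointing at.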

The current v4 eval set has 720 cases:

100% parse rate
100% valid enums
98.9% event-to-emotion match
0% forbidden phrase rate
8.1 average SAY words
422 unique SAYs out of 720 generations

The last number is the one I still watch: 422 unique SAYs out of 720 is about 59%. A tiny model can learn the contract and still get a bit samey. That is fine for a toy, less fine if the whole point is making the pet feel alive.
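Metrics like these are cheap to compute once each generation has been parsed. The sketch below assumes each result is either a dict with a `SAY` key or `None` for a parse failure; that representation and the `summarize` function are illustrative, not the project's eval harness.

```python
from typing import Dict, List, Optional


def summarize(packets: List[Optional[Dict[str, str]]]) -> Dict[str, float]:
    """Compute parse rate, average SAY length, and SAY diversity."""
    total = len(packets)
    parsed = [p for p in packets if p is not None]
    says = [p["SAY"] for p in parsed if "SAY" in p]
    unique = len(set(says))
    return {
        "parse_rate": len(parsed) / total if total else 0.0,
        "avg_say_words": sum(len(s.split()) for s in says) / len(says) if says else 0.0,
        "unique_says": float(unique),
        "unique_ratio": unique / len(says) if says else 0.0,
    }


# Tiny made-up sample, only to show the shape of the output.
sample = [
    {"SAY": "mrrp warm here"},
    {"SAY": "mrrp warm here"},
    {"SAY": "wobbles okay"},
    None,
]
print(summarize(sample))
```

Exact-match uniqueness is a blunt measure: two lines that differ by one word count as different, so it likely overstates how varied the pet feels in practice.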

The browser demo

The frontend is a small React/Vite toy. You can drag the pet around, pet it, shake it, lock it in place, and leave it alone long enough for idle behaviour to fire.

The model runs locally in the browser through ONNX Runtime Web. It tries WebGPU first and falls back to WASM. The runtime panel shows which backend is active, tokens per second, first-token time, total generation time, and the PetDSL packet.

Deployment had one annoying detail: Cloudflare Pages has a 25 MiB limit for a single asset, and the ONNX weight shard is about 49 MiB. So the app and model are split: Pages serves the app, and R2 serves the model files. R2 has free egress and a free tier that easily covers this model size.
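A pre-deploy check for that limit might look like the sketch below. It is a hypothetical helper, not part of the project: it walks a build directory and separates files that fit under the per-asset limit from files that need to be hosted elsewhere.

```python
from pathlib import Path
from typing import List, Tuple

PAGES_MAX_ASSET_BYTES = 25 * 1024 * 1024  # 25 MiB per-asset limit


def split_assets(
    root: Path, limit: int = PAGES_MAX_ASSET_BYTES
) -> Tuple[List[Path], List[Path]]:
    """Return (files that fit under the limit, files that exceed it)."""
    fits: List[Path] = []
    too_big: List[Path] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.stat().st_size > limit:
            too_big.append(path)
        else:
            fits.append(path)
    return fits, too_big
```

Run against the build output, anything in the second list (here, the roughly 49 MiB weight shard) is what has to move to object storage before the static deploy.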

That is not a big architecture. It is just the difference between "this works on my laptop" and "I can post this without immediately breaking the static host."

What I like about this pattern

The model is not the product. The model is a tiny expressive component inside a product that still has normal software boundaries.

That feels like the right shape for small local models. They do not need to be worse general assistants. They can be good at one odd job: compress this tool output, classify this safety edge case, turn this pet state into a sentence that feels warm without being clingy.

The constraints make the model more charming, not less. PetLM cannot ramble. It cannot take over the UI. It cannot decide to become your productivity coach. It can only say something like:

SAY: wobbles okay, less spin please
EMO: dizzy_playful
ANIM: shake_off
INTENT: recover
END

That is enough.

Demo: petlm.pages.dev