I wanted to know if a small model could read obfuscated Python and recover its semantics: what the variables mean, what the code does, whether it is dangerous.
After three months of training runs on an RTX 2070 with 8GB of VRAM, the model is 940MB, runs on a laptop, and scored 93.36% on a structured evaluation harness. It also went 11 for 11 on competition-winning obfuscated code from the International Obfuscated Python Code Competition.
The problem
Obfuscated Python is everywhere. Malware authors use it to hide payloads. CTF challenges use it to test reverse-engineering skill. Legacy codebases are full of it because someone thought minifying Python was a good idea in 2014.
Existing tools fall into two camps. Symbolic execution engines can trace what code does, but they are slow and choke on anything non-trivial. String-match heuristics are fast but tell you nothing about behaviour. The gap between "this file has suspicious strings" and "this function is a binary search, low risk" is where a human analyst lives.
I wanted to shrink that gap: give an analyst a first-pass semantic read that takes 5 seconds instead of 30 minutes.
The approach
The model is called SemRec (Semantic Recovery). The thesis is simple: a small fine-tuned model is more useful as an evidence reducer than as a standalone code generator. It does not write code. It reads code and produces structured analysis.
The pipeline has three stages:
- SFT (Supervised Fine-Tuning) on 20,566 curated pairs of obfuscated-to-clean Python. Each pair has a clean original, an AST-obfuscated version, and a structured JSON label with recovered identifiers, behaviour tags, a summary, and a risk classification.
- DPO (Direct Preference Optimization) on 1,768 verified preference pairs. Each pair has a winning and losing response, scored on six criteria: JSON validity, identifier recovery delta, test-suite pass, contrast score, quality flag, and diversity.
- Reasoning trace augmentation via a teacher model. Every training pair gets a chain-of-thought "thinking" trace prepended to the output, so the model learns to reason before it answers.
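To make the "AST-obfuscated version" in the SFT pairs concrete, here is a minimal sketch of identifier renaming with Python's `ast` module. It is an illustration, not the actual SemRec data pipeline: the `RenameIdentifiers` class and `obfuscate` helper are hypothetical names, and the sketch ignores imports, attributes, keyword arguments, and the control flow mangling and dead code injection that a fuller obfuscator would add.

```python
import ast
import builtins


class RenameIdentifiers(ast.NodeTransformer):
    """Rename user-defined functions, arguments, and variables to v_0, v_1, ..."""

    def __init__(self):
        self.mapping = {}  # original name -> obfuscated name

    def _alias(self, name):
        # Leave builtins such as len() or print() untouched so the code still runs.
        if hasattr(builtins, name):
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v_{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)  # visits the arguments first, then the body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def obfuscate(source):
    """Return (obfuscated_source, mapping) for a snippet of plain Python."""
    renamer = RenameIdentifiers()
    tree = renamer.visit(ast.parse(source))
    return ast.unparse(tree), renamer.mapping  # ast.unparse needs Python 3.9+
```

The returned mapping, inverted, is the kind of ground truth a `recovered_identifiers` label could be built from.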
The base model is Qwen2.5-Coder-1.5B from Alibaba. The training data splits into three categories: benign algorithms (50%), real-world Python (20%), and malware samples (30%). The malware portion is important. Without it, the model cannot calibrate risk labels.
What the model outputs
Every inference returns strict JSON. The output is designed for downstream pipelines: parse it, store it, act on it.
{
  "thinking": "The function takes a list and a value. It initializes two pointers, left and right, at the start and end of the list. It enters a while loop that continues while left <= right. It calculates a midpoint. If the midpoint element equals the target value, it returns the index. If the midpoint element is less than the target, it moves the left pointer up. Otherwise, it moves the right pointer down. If the loop exits without finding the target, it returns -1. This is binary search.",
  "summary": "Binary search over a sorted list; returns index or -1.",
  "recovered_identifiers": {
    "v_3": {"target": "binary_search", "aliases": ["search", "find_index"]},
    "v_0": {"target": "arr", "aliases": ["array", "nums"]},
    "v_1": {"target": "target", "aliases": ["value", "key"]},
    "v_4": {"target": "left", "aliases": ["lo", "low"]},
    "v_2": {"target": "right", "aliases": ["hi", "high"]},
    "v_5": {"target": "mid", "aliases": ["middle", "pivot"]}
  },
  "behavior_tags": ["searching", "loop", "condition"],
  "risk_label": "low"
}
The thinking field is where the model reasons before committing to structured output. I added this during training augmentation specifically to give the model space to work through the code before producing the JSON fields. The reasoning is still parsable downstream because it sits in its own key. Useful for audit trails, and useful for debugging when the model gets something wrong: you can see where the reasoning went off track.
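Since the output is meant to be parsed and acted on, a downstream consumer will probably want to check its shape before trusting it. The sketch below assumes the key names from the sample response and the three risk categories (low, medium, high) mentioned in this post; `parse_analysis` and the two constants are hypothetical names, and the `thinking` key is treated as optional.

```python
import json

REQUIRED_KEYS = {"summary", "recovered_identifiers", "behavior_tags", "risk_label"}
RISK_LABELS = {"low", "medium", "high"}


def parse_analysis(raw):
    """Parse one model response; raise ValueError if it breaks the expected shape."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["risk_label"] not in RISK_LABELS:
        raise ValueError(f"unexpected risk_label: {data['risk_label']!r}")
    identifiers = data["recovered_identifiers"]
    if not isinstance(identifiers, dict):
        raise ValueError("recovered_identifiers must be an object")
    for name, entry in identifiers.items():
        if not isinstance(entry, dict) or "target" not in entry:
            raise ValueError(f"identifier {name!r} has no target")
    return data
```

Raising on any deviation, rather than repairing the output, keeps a malformed response from silently entering a triage queue.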
Serving it
The model is a q4_K_M GGUF, about 940MB. It runs on llama-server, which speaks the OpenAI API format.
llama-server -m reports/aero_dpo_reasoning_1.5b/merged.q4_K_M.gguf \
  --port 8080 --ctx-size 16384 --n-gpu-layers 99

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "local",
  "messages": [
    {"role": "system", "content": "Recover Python semantics and return strict JSON."},
    {"role": "user", "content": "Analyze the obfuscated Python code and recover its semantics.\nObfuscated code:\ndef v_3(v_0, v_1):\n    v_4 = 0\n    v_2 = len(v_0) - 1\n    while v_4 <= v_2:\n        v_5 = (v_4 + v_2) // 2\n        if v_0[v_5] == v_1: return v_5\n        elif v_0[v_5] < v_1: v_4 = v_5 + 1\n        else: v_2 = v_5 - 1\n    return -1\nReturn JSON with summary, recovered_identifiers, behavior_tags, and risk_label."}
  ],
  "temperature": 0.0,
  "max_tokens": 4096,
  "repeat_penalty": 1.15
}'
Two inference parameters matter. max_tokens=4096 prevents truncation on complex inputs (the thinking trace can be long). repeat_penalty=1.15, a llama.cpp sampling option rather than part of the OpenAI schema, stopped the repetition loops I hit on pure-lambda code; I found this after the model got stuck repeating itself on a Y-combinator benchmark.
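The same request can be sent from Python using only the standard library. This is a sketch under the assumptions of the curl call above: the server listens on localhost:8080, and the response follows the OpenAI chat-completion layout with the generated text at `choices[0].message.content`. `build_payload` and `analyze` are hypothetical helper names.

```python
import json
import urllib.request

SYSTEM_PROMPT = "Recover Python semantics and return strict JSON."


def build_payload(obfuscated_code):
    """Build the chat-completion request body used in the curl example."""
    user_prompt = (
        "Analyze the obfuscated Python code and recover its semantics.\n"
        f"Obfuscated code:\n{obfuscated_code}\n"
        "Return JSON with summary, recovered_identifiers, behavior_tags, and risk_label."
    )
    return {
        "model": "local",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.0,
        "max_tokens": 4096,      # room for a long thinking trace
        "repeat_penalty": 1.15,  # llama.cpp sampling option
    }


def analyze(obfuscated_code, url="http://localhost:8080/v1/chat/completions"):
    """POST the code to a running llama-server and return the parsed analysis."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(obfuscated_code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumption: the model's JSON is the full message content, with no extra text.
    return json.loads(body["choices"][0]["message"]["content"])
```

Calling `analyze(open("sample.py").read())` would return a dict shaped like the example output, provided the server is running and the model returns valid JSON; `sample.py` is a placeholder file name.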
The obfuscated Python Olympics
The International Obfuscated Python Code Competition (IOPCC) is the hardest publicly available benchmark for this kind of work. Winners use every trick in the book: walrus operators chained into eval calls, Unicode aliases for builtins, VM interpreters built from lambdas and dicts, self-importing dataclasses, pure-lambda Y-combinators.
I tested the model against 11 real winners from 2023 to 2025. It got all 11 right.
A caveat: these tests are not a held-out evaluation in the rigorous sense. The IOPCC entries themselves are not in the training data, but the training pipeline uses AST-based obfuscation techniques (variable renaming, control flow mangling, dead code injection) that overlap with some IOPCC patterns. Eleven tests is also a small sample. Treat this as promising anecdotal evidence, not a benchmark claim. The structured evaluation harness (50 pairs, 93.36%) is the more reliable number.
| Test | Result | Notes |
|---|---|---|
| IOPCC 2025 mind-boggling (walrus/eval/dataclass) | Pass | Correctly identified self-import trick |
| IOPCC 2025 Unicode eval-alias, dead code | Pass | risk: medium, eval correctly flagged |
| IOPCC 2025 underscore-as-whitespace | Pass | Identified disguised for loop |
| IOPCC 2024 day-of-week magic string | Pass | Decoded ord() lookup table |
| IOPCC 2023 fibonacci via vars()/getattr | Pass | Golden-ratio formula identified |
| IOPCC 2023 VM interpreter in lambdas/dicts | Pass | Correct in 8.3 seconds |
| IOPCC 2024 pure-lambda pentomino Y-combinator | Pass | Correct with repeat_penalty=1.15 |
| XOR cipher via type() | Pass | Key stream identified |
| Y-combinator factorial + fibonacci | Pass | |
| Descriptor __set__ doubling abuse | Pass | Silent doubling caught |
| exec + zlib compressed code | Pass | Low-risk correctly (controlled data) |
The numbers
On a 50-pair structured evaluation harness:
| Metric | Score |
|---|---|
| Overall | 93.36% |
| JSON validity | 100% |
| Execution pass rate | 100% |
| Risk accuracy | 100% |
| Behaviour tag F1 | 89.10% |
| Semantic similarity | 77.67% |
| Identifier recovery | 58.61% |
A 100% execution pass rate means the recovered identifiers, when substituted back into the code, produce code that still runs correctly. The recovered names are valid, non-colliding identifiers, so the renaming preserves behaviour.
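One way to picture the execution check, as a sketch rather than the harness itself: substitute each primary `target` back into the obfuscated source, execute the result, and compare outputs on a few known cases. `ApplyNames`, `execution_passes`, and the `cases` format are hypothetical. Because this calls `exec`, it should only ever run inside a sandbox, and never on suspected malware.

```python
import ast


class ApplyNames(ast.NodeTransformer):
    """Replace obfuscated names with recovered ones wherever a mapping exists."""

    def __init__(self, names):
        self.names = names

    def visit_FunctionDef(self, node):
        node.name = self.names.get(node.name, node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self.names.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.names.get(node.id, node.id)
        return node


def execution_passes(obfuscated, recovered_identifiers, entry_point, cases):
    """Return True if the renamed code runs and matches every (args, expected) case."""
    names = {old: entry["target"] for old, entry in recovered_identifiers.items()}
    namespace = {}
    try:
        restored = ast.unparse(ApplyNames(names).visit(ast.parse(obfuscated)))
        exec(restored, namespace)  # sandbox only: this runs the analysed code
        func = namespace[names.get(entry_point, entry_point)]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Invalid names (keywords, for example) or runtime errors count as a fail.
        return False
```

A recovered name that is a Python keyword, or one that shadows something the code depends on, fails this check, which is what makes it a useful floor under the softer similarity metrics.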
Risk accuracy at 100% on the eval set is encouraging, but the risk classifier is still in development. The evaluation set is small (50 pairs) and the risk categories are coarse (low, medium, high). On more diverse and adversarial inputs, the classifier will need more work before I would trust it for production triage. The training data includes 30% malware samples, which gives the model a baseline for distinguishing dangerous patterns, but "baseline" is not "finished."
Identifier recovery at 58.61% sounds low, and it is the weakest metric. But consider what it measures: exact match or strong semantic similarity to the original variable name. When the original was calculate_fibonacci and the model says fib, that counts as a miss. The aliases field helps here. The model returns multiple candidates per identifier, and one of the aliases usually matches even when the primary target does not.
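The effect of the aliases field can be measured by scoring the primary target and the full candidate list separately. The sketch below is a simplification: it uses exact string match only, whereas the metric described above also credits strong semantic similarity. `identifier_hit_rates` and the `originals` mapping (obfuscated name to original name) are hypothetical.

```python
def identifier_hit_rates(recovered_identifiers, originals):
    """Return (primary_rate, any_candidate_rate) using exact string match."""
    primary_hits = 0
    candidate_hits = 0
    for obfuscated_name, original in originals.items():
        entry = recovered_identifiers.get(obfuscated_name, {})
        target = entry.get("target")
        candidates = [target, *entry.get("aliases", [])]
        primary_hits += target == original          # bool counts as 0 or 1
        candidate_hits += original in candidates
    total = len(originals)
    if total == 0:
        return 0.0, 0.0
    return primary_hits / total, candidate_hits / total
```

The gap between the two rates shows how much of the apparent miss rate is a naming-preference problem rather than a comprehension problem.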
What I learned
The 0.5B model hit a ceiling. DPO on top of the 0.5B SFT model improved the training objective but failed to produce reliable held-out gains. Two consecutive metric-aligned runs failed the decision rule. I moved to 1.5B and the ceiling disappeared.
Reasoning traces are the highest-leverage augmentation I tried. Adding a thinking field to every training pair, generated by a teacher model, gave the model a visible deduction process. It also improved structured output quality, probably because the model reasons about the code before committing to JSON fields.
The training data mix matters more than model size. The 30% malware portion is what makes the risk classifier work. Without malware samples, the model labels everything as low risk because it has never seen dangerous code. The 50% benign algorithms portion is what makes identifier recovery work, because algorithmic code has canonical naming conventions the model can learn.
What is next
The 3B model is deferred. The 1.5B model is good enough to ship, and the marginal improvement from 3B does not justify the training cost on an 8GB GPU right now.
The model file is available to download here (940MB q4_K_M GGUF, runs on llama-server with the config above). Give it your nastiest obfuscated Python.