I trained a 1.5B model to deobfuscate Python. It scored 93% and survived the obfuscated Python Olympics.

SemRec: a 940MB fine-tuned Qwen2.5-Coder model that recovers clean semantics from obfuscated Python. 93.36% overall, 11/11 on competition-grade obfuscation.

I wanted to know if a small model could read obfuscated Python and recover its semantics: what the variables mean, what the code does, whether it is dangerous.

After three months of training runs on an RTX 2070 with 8GB of VRAM, the model is 940MB, runs on a laptop, and scored 93.36% on a structured evaluation harness. It also went 11 for 11 on competition-winning obfuscated code from the International Obfuscated Python Code Competition.

The problem

Obfuscated Python is everywhere. Malware authors use it to hide payloads. CTF challenges use it to test reverse-engineering skill. Legacy codebases are full of it because someone thought minifying Python was a good idea in 2014.

Existing tools fall into two camps. Symbolic execution engines can trace what code does, but they are slow and choke on anything non-trivial. String-match heuristics are fast but tell you nothing about behaviour. The gap between "this file has suspicious strings" and "this function is a binary search, low risk" is where a human analyst lives.

I wanted to shrink that gap. Give an analyst a first-pass semantic read that takes 5 seconds instead of 30 minutes.

The approach

The model is called SemRec (Semantic Recovery). The thesis is simple: a small fine-tuned model is more useful as an evidence reducer than as a standalone code generator. It does not write code. It reads code and produces structured analysis.

The pipeline has three stages:

  1. SFT (Supervised Fine-Tuning) on 20,566 curated pairs of obfuscated-to-clean Python. Each pair has a clean original, an AST-obfuscated version, and a structured JSON label with recovered identifiers, behaviour tags, a summary, and a risk classification.
  2. DPO (Direct Preference Optimization) on 1,768 verified preference pairs. Each pair has a winning and losing response, scored on six criteria: JSON validity, identifier recovery delta, test-suite pass, contrast score, quality flag, and diversity.
  3. Reasoning trace augmentation via a teacher model. Every training pair gets a chain-of-thought "thinking" trace prepended to the output, so the model learns to reason before it answers.

The base model is Qwen2.5-Coder-1.5B from Alibaba. The training data splits into three categories: benign algorithms (50%), real-world Python (20%), and malware samples (30%). The malware portion is important. Without it, the model cannot calibrate risk labels.
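To make the "AST-obfuscated version" in stage 1 concrete, here is a minimal sketch of identifier renaming with Python's standard `ast` module. It is an illustration only, not the actual SemRec data pipeline: the class name `RenameIdentifiers`, the helper `make_pair`, and the `v_N` numbering order are assumptions, and it covers only renaming, not control flow mangling or dead code injection. It also assumes self-contained functions with no imports or globals, and needs Python 3.9+ for `ast.unparse`.

```python
import ast
import builtins


class RenameIdentifiers(ast.NodeTransformer):
    """Rename user-defined names to opaque v_N identifiers (hypothetical helper)."""

    def __init__(self):
        self.mapping = {}  # original name -> obfuscated name

    def _alias(self, name):
        # Leave builtins such as len() alone so the obfuscated code still runs.
        if hasattr(builtins, name):
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v_{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)  # descend into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def make_pair(clean_source):
    """Return (obfuscated_source, mapping) for one clean snippet."""
    renamer = RenameIdentifiers()
    tree = renamer.visit(ast.parse(clean_source))
    return ast.unparse(tree), renamer.mapping


CLEAN = """
def binary_search(arr, target):
    left = 0
    right = len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
"""

obfuscated, mapping = make_pair(CLEAN)
print(obfuscated)
print(mapping)
```

Inverting `mapping` gives the ground truth for the `recovered_identifiers` part of a label, which is why renaming-style obfuscation is convenient for building supervised pairs.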

What the model outputs

Every inference returns strict JSON. The output is designed for downstream pipelines: parse it, store it, act on it.

{
  "thinking": "The function takes a list and a value. It initializes two pointers, left and right, at the start and end of the list. It enters a while loop that continues while left <= right. It calculates a midpoint. If the midpoint element equals the target value, it returns the index. If the midpoint element is less than the target, it moves the left pointer up. Otherwise, it moves the right pointer down. If the loop exits without finding the target, it returns -1. This is binary search.",
  "summary": "Binary search over a sorted list; returns index or -1.",
  "recovered_identifiers": {
    "v_3": {"target": "binary_search", "aliases": ["search", "find_index"]},
    "v_0": {"target": "arr", "aliases": ["array", "nums"]},
    "v_1": {"target": "target", "aliases": ["value", "key"]},
    "v_4": {"target": "left", "aliases": ["lo", "low"]},
    "v_2": {"target": "right", "aliases": ["hi", "high"]},
    "v_5": {"target": "mid", "aliases": ["middle", "pivot"]}
  },
  "behavior_tags": ["searching", "loop", "condition"],
  "risk_label": "low"
}

The thinking field is where the model reasons before committing to structured output. I added this during training augmentation specifically to give the model space to work through the code before producing the JSON fields. The reasoning is still parsable downstream because it sits in its own key. Useful for audit trails, and useful for debugging when the model gets something wrong: you can see where the reasoning went off track.
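Because the output is meant to be parsed, a downstream consumer will probably want a small gatekeeper before storing anything. The sketch below checks the shape shown above. The function name `parse_analysis` and the exact rules are assumptions, as is the choice to treat `thinking` as optional. The three risk labels are the low, medium, and high categories used in the evaluation.

```python
import json

# Keys the prompt asks for; "thinking" is treated as optional here (an assumption).
REQUIRED_KEYS = {"summary", "recovered_identifiers", "behavior_tags", "risk_label"}
RISK_LABELS = {"low", "medium", "high"}


def parse_analysis(raw):
    """Parse one model response; raise ValueError if it breaks the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    if data["risk_label"] not in RISK_LABELS:
        raise ValueError(f"unknown risk_label: {data['risk_label']!r}")

    if not isinstance(data["behavior_tags"], list):
        raise ValueError("behavior_tags must be a list")

    identifiers = data["recovered_identifiers"]
    if not isinstance(identifiers, dict):
        raise ValueError("recovered_identifiers must be an object")
    for name, entry in identifiers.items():
        if not isinstance(entry, dict) or "target" not in entry:
            raise ValueError(f"identifier {name!r} has no target")

    return data
```

Rejecting a malformed response at this point keeps bad records out of whatever store or triage queue sits behind the model.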

Serving it

The model is a q4_K_M GGUF, about 940MB. It runs under llama.cpp's llama-server, which exposes an OpenAI-compatible chat completions endpoint.

llama-server -m reports/aero_dpo_reasoning_1.5b/merged.q4_K_M.gguf \
  --port 8080 --ctx-size 16384 --n-gpu-layers 99

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "local",
  "messages": [
    {"role": "system", "content": "Recover Python semantics and return strict JSON."},
    {"role": "user", "content": "Analyze the obfuscated Python code and recover its semantics.\nObfuscated code:\ndef v_3(v_0, v_1):\n    v_4 = 0\n    v_2 = len(v_0) - 1\n    while v_4 <= v_2:\n        v_5 = (v_4 + v_2) // 2\n        if v_0[v_5] == v_1: return v_5\n        elif v_0[v_5] < v_1: v_4 = v_5 + 1\n        else: v_2 = v_5 - 1\n    return -1\nReturn JSON with summary, recovered_identifiers, behavior_tags, and risk_label."}
  ],
  "temperature": 0.0,
  "max_tokens": 4096,
  "repeat_penalty": 1.15
}'

Two inference parameters matter. max_tokens=4096 prevents truncation on complex inputs (the thinking trace can be long). repeat_penalty=1.15 stopped the repetition loops I hit on pure-lambda code; I found it after the model got stuck repeating itself on a Y-combinator benchmark. Note that repeat_penalty is a llama.cpp sampling option, not part of the standard OpenAI request schema.
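The same request can be sent from Python with only the standard library. This is a sketch under the assumptions of the setup above: a llama-server listening on localhost:8080 and a response in the usual OpenAI chat completions shape (`choices[0].message.content`). The helper names `build_payload` and `analyze` are hypothetical.

```python
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # matches --port 8080 above

SYSTEM_PROMPT = "Recover Python semantics and return strict JSON."
USER_TEMPLATE = (
    "Analyze the obfuscated Python code and recover its semantics.\n"
    "Obfuscated code:\n{code}\n"
    "Return JSON with summary, recovered_identifiers, behavior_tags, and risk_label."
)


def build_payload(obfuscated_code):
    """Build the request body with the same sampling settings as the curl example."""
    return {
        "model": "local",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(code=obfuscated_code)},
        ],
        "temperature": 0.0,
        "max_tokens": 4096,       # room for a long thinking trace
        "repeat_penalty": 1.15,   # llama.cpp sampling option, not standard OpenAI
    }


def analyze(obfuscated_code, url=URL, timeout=120):
    """Send one snippet to the server and return the parsed JSON analysis."""
    body = json.dumps(build_payload(obfuscated_code)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # The analysis itself arrives as a JSON string inside the chat message.
    return json.loads(reply["choices"][0]["message"]["content"])
```

With the server running, `analyze(open("sample.py").read())` would return a dict with the fields shown earlier; `sample.py` is a placeholder file name.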

The obfuscated Python Olympics

The International Obfuscated Python Code Competition (IOPCC) is the hardest publicly available benchmark for this kind of work. Winners use every trick in the book: walrus operators chained into eval calls, Unicode aliases for builtins, VM interpreters built from lambdas and dicts, self-importing dataclasses, pure-lambda Y-combinators.

I tested the model against 11 real winners from 2023 to 2025. It got all 11 right.

A caveat: these tests are not a held-out evaluation in the rigorous sense. The IOPCC entries themselves are not in the training data, but the training pipeline uses AST-based obfuscation techniques (variable renaming, control flow mangling, dead code injection) that overlap with some IOPCC patterns. 11 tests is also a small sample. Treat this as promising anecdotal evidence, not a benchmark claim. The structured evaluation harness (50 pairs, 93.36%) is the more reliable number.

| Test | Result | Notes |
| --- | --- | --- |
| IOPCC 2025 mind-boggling (walrus/eval/dataclass) | Pass | Correctly identified self-import trick |
| IOPCC 2025 Unicode eval-alias, dead code | Pass | risk: medium, eval correctly flagged |
| IOPCC 2025 underscore-as-whitespace | Pass | Identified disguised for loop |
| IOPCC 2024 day-of-week magic string | Pass | Decoded ord() lookup table |
| IOPCC 2023 fibonacci via vars()/getattr | Pass | Golden-ratio formula identified |
| IOPCC 2023 VM interpreter in lambdas/dicts | Pass | Correct in 8.3 seconds |
| IOPCC 2024 pure-lambda pentomino Y-combinator | Pass | Correct with repeat_penalty=1.15 |
| XOR cipher via type() | Pass | Key stream identified |
| Y-combinator factorial + fibonacci | Pass | |
| Descriptor __set__ doubling abuse | Pass | Silent doubling caught |
| exec + zlib compressed code | Pass | Low-risk correctly (controlled data) |

The pentomino Y-combinator test took about 25 seconds. That is the longest any test ran. Most finish in 2 to 6 seconds.

The numbers

On a 50-pair structured evaluation harness:

| Metric | Score |
| --- | --- |
| Overall | 93.36% |
| JSON validity | 100% |
| Execution pass rate | 100% |
| Risk accuracy | 100% |
| Behaviour tag F1 | 89.10% |
| Semantic similarity | 77.67% |
| Identifier recovery | 58.61% |

100% JSON validity means every single response was valid JSON. That matters if you want to pipe the output into another tool.

100% execution pass rate means the recovered identifiers, when substituted back into the code, produce code that runs correctly. The model is recovering names that preserve behaviour.
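One way such a check could work is sketched below: substitute the recovered names into the obfuscated source with the `ast` module, execute the result, and run a few test cases against the renamed entry point. This is an assumption about the mechanics, not the harness's actual code, and `apply_names` and `execution_passes` are hypothetical names. Since it calls `exec`, it should only ever run inside a sandbox when the input may be malware.

```python
import ast


def apply_names(source, recovered):
    """Rewrite obfuscated identifiers to their recovered primary targets."""
    names = {old: entry["target"] for old, entry in recovered.items()}
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = names.get(node.id, node.id)
        elif isinstance(node, ast.arg):
            node.arg = names.get(node.arg, node.arg)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            node.name = names.get(node.name, node.name)
    return ast.unparse(tree)  # requires Python 3.9+


def execution_passes(source, recovered, entry_point, cases):
    """Run the renamed code and compare outputs; cases is a list of (args, expected)."""
    restored = apply_names(source, recovered)
    namespace = {}
    # WARNING: executes the code. Use a sandbox for anything untrusted.
    exec(compile(restored, "<restored>", "exec"), namespace)
    func = namespace[recovered[entry_point]["target"]]
    return all(func(*args) == expected for args, expected in cases)


OBFUSCATED = """
def v_3(v_0, v_1):
    v_4 = 0
    v_2 = len(v_0) - 1
    while v_4 <= v_2:
        v_5 = (v_4 + v_2) // 2
        if v_0[v_5] == v_1: return v_5
        elif v_0[v_5] < v_1: v_4 = v_5 + 1
        else: v_2 = v_5 - 1
    return -1
"""

RECOVERED = {
    "v_3": {"target": "binary_search"},
    "v_0": {"target": "arr"},
    "v_1": {"target": "target"},
    "v_4": {"target": "left"},
    "v_2": {"target": "right"},
    "v_5": {"target": "mid"},
}

CASES = [(([1, 3, 5, 7, 9], 7), 3), (([1, 3, 5], 4), -1)]
print(execution_passes(OBFUSCATED, RECOVERED, "v_3", CASES))
```

A check like this catches recoveries that break the code, for example two obfuscated variables mapped to the same name, or a target that shadows a builtin the code still calls.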

Risk accuracy at 100% on the eval set is encouraging, but the risk classifier is still in development. The evaluation set is small (50 pairs) and the risk categories are coarse (low, medium, high). On more diverse and adversarial inputs, the classifier will need more work before I would trust it for production triage. The training data includes 30% malware samples, which gives the model a baseline for distinguishing dangerous patterns, but "baseline" is not "finished."

Identifier recovery at 58.61% sounds low, and it is the weakest metric. But consider what it measures: exact match or strong semantic similarity to the original variable name. When the original was calculate_fibonacci and the model says fib, that counts as a miss. The aliases field helps here. The model returns multiple candidates per identifier, and one of the aliases usually matches even when the primary target does not.
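The effect of the aliases field can be illustrated with a small scoring sketch. It covers only the exact-match half of the metric and leaves out the semantic-similarity half, so it will not reproduce the 58.61% figure. The function name `identifier_hits` and the sample data are made up for illustration.

```python
def identifier_hits(recovered, truth):
    """Return (strict, lenient) hit rates.

    strict  = primary target equals the original name
    lenient = original name appears among the target or any alias
    """
    strict = lenient = 0
    for obfuscated, original in truth.items():
        entry = recovered.get(obfuscated, {})
        candidates = [entry.get("target")] + list(entry.get("aliases", []))
        strict += candidates[0] == original
        lenient += original in candidates
    total = len(truth) or 1  # avoid dividing by zero on an empty label
    return strict / total, lenient / total


# Hypothetical label and model output, for illustration only.
truth = {"v_0": "calculate_fibonacci", "v_1": "n"}
recovered = {
    "v_0": {"target": "fib", "aliases": ["fibonacci", "calculate_fibonacci"]},
    "v_1": {"target": "n", "aliases": ["count"]},
}

strict, lenient = identifier_hits(recovered, truth)
print(strict, lenient)  # 0.5 1.0
```

Here the primary target `fib` counts as a miss under strict matching, while the alias list still contains the original name.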

What I learned

The 0.5B model hit a ceiling. DPO on top of the 0.5B SFT model improved the training objective but failed to produce reliable held-out gains. Two consecutive metric-aligned runs failed the decision rule. I moved to 1.5B and the ceiling disappeared.

Reasoning traces are the highest-leverage augmentation I tried. Adding a thinking field to every training pair, generated by a teacher model, gave the model a visible deduction process. It also improved structured output quality, probably because the model reasons about the code before committing to JSON fields.

The training data mix matters more than model size. The 30% malware portion is what makes the risk classifier work. Without malware samples, the model labels everything as low risk because it has never seen dangerous code. The 50% benign algorithms portion is what makes identifier recovery work, because algorithmic code has canonical naming conventions the model can learn.

What is next

The 3B model is deferred. The 1.5B model is good enough to ship, and the marginal improvement from 3B does not justify the training cost on an 8GB GPU right now.

The model file is available to download here (940MB q4_K_M GGUF, runs on llama-server with the config above). Give it your nastiest obfuscated Python.