I wrote up the Kinsing infection a few months ago and ended that post with an uncomfortable admission. What our team spent a week doing, dismembering a 5.9 MB Go binary, a Claude agent running Ghidra via Computer Use did in about two hours. Same coverage, same MITRE mapping, plus a few details I had missed. I left it there at the time. I have not stopped thinking about it. This post is what happened when I stopped just thinking about it.
Over the last two weeks I built an autonomous malware triage stack on the home lab. The thesis is straightforward: a single LLM agent with the right toolbelt can do the boring eighty percent of reverse engineering at a quality bar I would accept from a human analyst, and it can do it on a cron, and it can do it overnight. What I want from that arrangement is not an analyst replacement. What I want is a tireless first pass that turns the queue of samples I will never get to into a queue of samples I have actual context on before I sit down. That is the project. Here is how it works.
The stack runs across three Proxmox hosts and is structured as three tiers on top of an OpenCTI ingestion layer. A fast static-analysis tier processes fifty samples every thirty minutes; it is pure unix tools, no LLM, no network calls, optimized to triage volume cheaply and decide what is worth a deeper look. A sandbox tier detonates selected samples in CAPEv2, capturing process trees, network traffic, and dropped files. A deep agentic tier runs one sample every three hours through Hermes, a local LLM agent wired up to eighty-eight MCP tools spanning Ghidra, radare2, OpenCTI, CAPE, and the local malware corpus. Above all three, a publish layer writes findings to a markdown wiki, posts summaries to Discord, and writes OpenCTI Notes back to the platform that sourced the sample. The whole thing is held together with cron and shell scripts.
OpenCTI does the ingestion. It is running on a dedicated VM with 32 GB of RAM and 200 GB of disk after I bumped both — more on that in a second. The connector roster is the usual suspects: MalwareBazaar for daily sample drops, URLhaus for payload URLs, AbuseIPDB for IP reputation, plus a handful of OSINT feeds. The platform normalizes everything into STIX, deduplicates against what it has already seen, and surfaces the new artifacts via API. My pipeline polls that API for the fifty newest unprocessed artifacts every half hour. Each artifact gets a SHA-256 lookup against the local corpus, a download if it is new, and an entry in the triage queue.
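The dedup step in that loop is simple enough to sketch. This is a minimal illustration, not the actual pipeline code: it assumes the corpus stores each sample as a file named after its SHA-256, and that the artifacts arrive as a list of dicts with a `sha256` key, newest first. Both the layout and the names `sha256_of` and `select_new_artifacts` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large samples never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def select_new_artifacts(artifacts, corpus_dir: Path, limit: int = 50):
    """Return up to `limit` artifacts whose hash is not already in the corpus.

    Assumption: each corpus file is named after its SHA-256 hex digest.
    """
    known = {p.name.lower() for p in corpus_dir.iterdir() if p.is_file()}
    fresh, seen = [], set()
    for art in artifacts:
        h = art["sha256"].lower()
        if h in known or h in seen:
            continue  # already downloaded, or duplicated within this batch
        seen.add(h)
        fresh.append(art)
        if len(fresh) >= limit:
            break
    return fresh
```

Whatever `select_new_artifacts` returns is what would be downloaded and pushed onto the triage queue; `sha256_of` is there to verify a download against the hash the platform advertised.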
The first day I turned everything on, OpenCTI ate itself. Three connectors were each pulling on five-minute cooldowns, the message queue hit forty-one thousand deep, and the VM hit ninety-four percent memory and started thrashing swap. The culprit was AbuseIPDB: one connector pulling ten thousand IPs every two hours, mostly noise for a malware triage workflow. I disabled it, moved MalwareBazaar and URLhaus to thirty-minute cooldowns, and let the queue drain. The lesson is the one every threat intel platform operator learns: turn connectors on one at a time, watch the queue depth, and accept that "free intel" is not free if it costs you a VM hang.
The fast tier is a shell script called triage-fast.sh and it is the layer I am proudest of, because nothing in it is clever. It runs a battery of static unix tools on each sample in parallel — file, pefile, exiftool, rabin2 -I, binwalk, strings, yara, an entropy scan — and dumps their output into a per-sample directory. Then a small Python classifier scores the outputs against a ruleset: PE versus ELF, packed versus not, signed versus unsigned, suspicious section names, embedded payloads, known YARA hits. Each sample gets a tier assignment. Most get queued for deep dive. A few obvious-junk samples (corrupted, zero-byte, encrypted ZIPs without keys) get skipped. The whole pass takes well under a minute per sample on this hardware.
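To make the entropy scan and the tier assignment concrete, here is a small sketch. It is not the real classifier: the 7.2 bits-per-byte packing threshold, the magic-byte checks, and the function names are illustrative assumptions, and the real ruleset scores many more signals.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def classify(data: bytes, yara_hits=(), packed_threshold: float = 7.2) -> dict:
    """Assign a tier from a few cheap signals. Thresholds are illustrative."""
    if len(data) == 0:
        return {"format": "empty", "packed": False, "tier": "skip"}
    if data[:2] == b"MZ":
        fmt = "pe"
    elif data[:4] == b"\x7fELF":
        fmt = "elf"
    else:
        fmt = "other"
    # High whole-file entropy is a rough hint of packing or encryption.
    packed = shannon_entropy(data) >= packed_threshold
    # Recognized executables and anything with a YARA hit go to the deep queue.
    tier = "deep" if fmt in ("pe", "elf") or yara_hits else "skip"
    return {"format": fmt, "packed": packed, "tier": tier}
```

Whole-file entropy is a blunt instrument; per-section entropy is more useful on PE files, but the whole-file number is cheap and catches the obvious cases.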
One of the more useful pieces in the fast tier is something the agent actually requested. After processing a few 7-Zip SFX dropper samples, the agent's output kept noting "would benefit from extracting the SFX config." So I wrote sfx-config-extractor.py, a pure-Python tool that pulls the embedded ;!@Install@!UTF-8! config block out of 7-Zip self-extractors and surfaces it in tier-1 output. Now every PE sample that is a 7-Zip SFX exposes its install commands before it ever reaches the LLM. The deep tier gets a richer artifact, the static tier catches more, and the agent never has to ask for it again. That feedback loop — agent identifies a gap, I add a deterministic tool that fills it — is the model I want to keep extending.
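The core of that idea fits in a few lines. The sketch below is not sfx-config-extractor.py itself; it only shows the technique of scanning for the `;!@Install@!UTF-8!` start marker and the matching `;!@InstallEnd@!` terminator, then parsing `Key="value"` lines. The function names are mine, and it deliberately ignores multi-line values and escaped quotes, which real configs can contain.

```python
import re

START = b";!@Install@!UTF-8!"
END = b";!@InstallEnd@!"
KEY_VALUE = re.compile(r'^\s*(\w+)\s*=\s*"(.*)"\s*$')


def extract_sfx_config(blob: bytes):
    """Return the raw config text between the 7-Zip SFX markers, or None."""
    start = blob.find(START)
    if start == -1:
        return None
    start += len(START)
    end = blob.find(END, start)
    if end == -1:
        return None
    return blob[start:end].decode("utf-8", errors="replace").strip()


def parse_sfx_config(text: str) -> dict:
    """Parse Key="value" lines; repeated keys such as RunProgram accumulate."""
    config = {}
    for line in text.splitlines():
        m = KEY_VALUE.match(line)
        if m:
            config.setdefault(m.group(1), []).append(m.group(2))
    return config
```

Keeping every key as a list matters because an SFX config can carry several `RunProgram` entries, and those are exactly the install commands worth surfacing.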
The sandbox tier runs CAPEv2 on a separate VM, with an Ubuntu 22 guest as the current detonation target. ELF samples land cleanly: CAPE captures process trees, dropped files, and network traffic as a PCAP, and produces a structured report JSON that downstream tiers consume. I have detonated test samples (/bin/whoami as a sanity check, then real ELF samples) and the pipeline routes outputs into the same per-sample wiki directory the static tier writes to.
What does not work yet: the Windows 11 guest. I defined a Windows 11 LTSC EVAL domain on the CAPE host but the install never started — the OVMF firmware build that ships with Ubuntu 22.04 cannot walk Microsoft's El Torito boot catalog on the Win11 ISO. I have three documented recovery paths (upgrade OVMF, extract the ESP as a separate FAT32 disk, VNC-driven manual install) and I will pick one when I sit down to it. For now PE samples skip the sandbox stage and proceed to the agent tier with static enrichment only. The pipeline is built to handle missing guests gracefully — better to ship the platform and add the guest than to block on it.
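The "handle missing guests gracefully" behavior amounts to a small routing decision. A minimal sketch, assuming a hypothetical mapping from sample format to guest type and hypothetical stage names; the real pipeline does this in shell:

```python
def plan_stages(sample_format: str, available_guests: set) -> list:
    """Return the ordered stages for a sample.

    The sandbox stage is dropped when no matching guest is available,
    so the sample still reaches the agent with static enrichment only.
    """
    guest_for = {"elf": "linux", "pe": "windows"}  # hypothetical mapping
    stages = ["static"]
    if guest_for.get(sample_format) in available_guests:
        stages.append("sandbox")
    stages.append("agent")
    return stages
```

With only the Linux guest up, `plan_stages("pe", {"linux"})` yields `["static", "agent"]`, and adding `"windows"` to the set later restores the sandbox stage without touching anything else.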
The deep tier is where the LLM agent earns its rent. I am running Hermes — NousResearch's CLI agent — with kimi-k2.6 via Ollama Cloud as the model. The agent loads a system prompt that gives it a persona and a hard mandate: produce a report.md, append to the by-family index, write canonical wiki entries with evidence references, and post a summary to Discord. The agent has eighty-eight MCP tools available across five servers: pyghidra-mcp for Ghidra decompilation, radare2-mcp for fast disassembly, opencti-mcp for IOC enrichment and Note write-back, a CAPE wrapper for sandbox results, and a local-filesystem wrapper for the corpus and wiki. The agent decides which tools to call in what order. I do not script it.
What it actually looks like: yesterday it picked up an asgardprotector sample — a 1.87 MB PE32+ with an embedded Microsoft CAB and a forged 2085 timestamp. Six minutes of agent time produced a report.md with file metadata pulled via exiftool and pefile, the embedded CAB confirmed via binwalk at offset 0x688BC, the two embedded payloads (AutoIt3.exe and Terminals.a3x) identified as the AutoIt-compiled-script-dropper pattern, the signature directory confirmed empty (the original wextract.exe is Microsoft-signed; this copy was repacked), and a wikilink graph connecting the sample to a sibling using the same build pattern. Every claim in the report carries a footnote pointing at the tool output that produced it. That last detail is what convinced me to ship the architecture.
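The footnote discipline is also something a deterministic check can police. The sketch below assumes the reports use markdown footnote syntax (`[^id]` references with `[^id]: ...` definitions); that format is my assumption for illustration, as is the function name. It flags references that point at nothing.

```python
import re

# A reference is [^id] not immediately followed by a colon.
REF = re.compile(r"\[\^([^\]]+)\](?!:)")
# A definition is [^id]: at the start of a line.
DEF = re.compile(r"^\[\^([^\]]+)\]:", re.MULTILINE)


def dangling_footnotes(markdown: str) -> set:
    """Return footnote ids that are referenced but never defined."""
    refs = set(REF.findall(markdown))
    defs = set(DEF.findall(markdown))
    return refs - defs
```

An empty result does not prove the evidence is right, only that every claim points somewhere; it is a cheap gate to run before a report is published.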
The wiki is intentionally Karpathy-style — plain markdown files in a directory tree, no database, no application server, no migrations. Per-sample analyses live under raw/analyses/<sha256>/. Family-level canonicalizations live under by-family/<name>/. MITRE techniques get their own pages under techniques/. Everything is linked with wikilink syntax that an Obsidian-style renderer could resolve, but the source of truth is just markdown on disk. The agent reads existing entries before writing new ones, so the second sample of a known family lands as a "see also" link in the family page rather than as a duplicate report. There is no concept of "create new entry" in the agent prompt — only "find the right place, write or extend, link."
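The "find the right place, write or extend, link" rule can be illustrated with a small idempotent helper. This is a sketch under assumptions: the family page file name (`README.md`), the wikilink target format, and the function names are all hypothetical; only the `raw/analyses/<sha256>/` and `by-family/<name>/` directories come from the layout described above.

```python
from pathlib import Path


def analysis_dir(wiki_root: Path, sha256: str) -> Path:
    """Directory holding the per-sample analysis."""
    return wiki_root / "raw" / "analyses" / sha256


def link_sample_to_family(wiki_root: Path, family: str, sha256: str) -> bool:
    """Add a see-also wikilink to the family page, creating it only if absent.

    Returns True if the page changed, False if the link was already there.
    """
    page = wiki_root / "by-family" / family / "README.md"  # file name assumed
    link = f"[[raw/analyses/{sha256}/report]]"
    page.parent.mkdir(parents=True, exist_ok=True)
    if page.exists():
        existing = page.read_text(encoding="utf-8")
    else:
        existing = f"# {family}\n\n## See also\n"
    if link in existing:
        return False  # second run for the same sample is a no-op
    if not existing.endswith("\n"):
        existing += "\n"
    page.write_text(existing + f"- {link}\n", encoding="utf-8")
    return True
```

Because the function extends rather than overwrites, a second sample of a known family becomes one more bullet on the existing page instead of a duplicate report.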
After the wiki entry is written, three things happen. The agent posts a summary to a Discord channel via a webhook (an attached .md file lets analysts pull the full report into their own tools). The agent writes a structured Note back to the OpenCTI artifact the sample originated from — closing the loop so the platform that surfaced the sample also surfaces the analysis. And the agent leaves an entry in index.md so the wiki has a chronological log of analyses without depending on filesystem mtimes. The choreography is enforced in the system prompt; I have caught and fixed two cases where the agent skipped a step early in development.
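Since the agent has skipped steps before, a post-run check outside the prompt is a cheap safety net. The sketch below covers only the steps that can be verified from disk, assuming the report lives at `raw/analyses/<sha256>/report.md` and that an index.md entry mentions the sample hash; the hash-in-index convention and the function name are assumptions. The Discord post and the OpenCTI Note would need their own checks against those services.

```python
from pathlib import Path


def missing_publish_steps(wiki_root: Path, sha256: str) -> list:
    """List locally verifiable publish steps that have not happened."""
    missing = []
    report = wiki_root / "raw" / "analyses" / sha256 / "report.md"
    if not report.is_file() or report.stat().st_size == 0:
        missing.append("report.md")
    index = wiki_root / "index.md"
    # Assumption: each index.md entry includes the sample's SHA-256.
    if not index.is_file() or sha256 not in index.read_text(encoding="utf-8"):
        missing.append("index.md entry")
    return missing
```

An empty list means the on-disk half of the choreography completed; anything else is a reason to requeue the sample or flag it for a human.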
There is a fourth thing running on the same agent host that is not part of the malware loop but is worth mentioning: a Threat Watch generator that pulls ten general InfoSec RSS feeds, scores items deterministically (CVSS, named actors, critical-infra keywords), and uses the same Ollama Cloud connection to draft a single quality-gated post per day. The post lands on this blog in the Threat Watch column and on Discord. Same hardware, same model, completely different workload — what surprised me is how cleanly the static-scoring + LLM-drafting split worked. The LLM is good at writing; deterministic rules are good at picking. Mixing them lets each play to its strengths.
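The picking half of that split looks roughly like the sketch below. The watchlists, weights, and threshold are illustrative assumptions rather than the real scoring rules, and each feed item is assumed to be a dict with `title`, optional `summary`, and optional `cvss` keys.

```python
import re

ACTORS = ("APT29", "Lazarus", "Volt Typhoon")              # illustrative
INFRA = ("SCADA", "ICS", "power grid", "water utility")    # illustrative


def _mentions(text: str, terms) -> bool:
    """Case-insensitive whole-word match, so 'ICS' does not hit 'topics'."""
    return any(re.search(rf"\b{re.escape(t)}\b", text, re.IGNORECASE) for t in terms)


def score_item(title: str, summary: str = "", cvss=None) -> int:
    """Deterministic score from CVSS, named actors, and infra keywords."""
    text = f"{title} {summary}"
    score = 0
    if cvss is not None:
        score += 3 if cvss >= 9.0 else 2 if cvss >= 7.0 else 0
    if _mentions(text, ACTORS):
        score += 2
    if _mentions(text, INFRA):
        score += 2
    return score


def pick_daily(items, threshold: int = 4):
    """Return the single best item if it clears the quality gate, else None."""
    best = max(items, key=lambda it: score_item(**it), default=None)
    if best is None or score_item(**best) < threshold:
        return None
    return best
```

Only the item returned by `pick_daily` would be handed to the LLM for drafting, and a `None` result means no post that day.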
What this all produces, as of writing: 571 samples through the static triage pass, 122 OpenCTI Notes written back to the platform, four malware families canonicalized in the wiki (asgardprotector, chacha8, coinminer, meterpreter), seven completed deep agent analyses, seven CAPE detonations, 507 samples sitting in the deep queue, one published Threat Watch post. The deep queue is intentionally throttled — one sample every three hours — both because Ollama Cloud is rate-limited and because I want to keep an eye on quality before opening the floodgates. Most of the volume so far is opportunistic. The family-page count is the metric I actually care about, because it is the closest thing to "what does this stack know that I did not know yesterday."
Three things on deck. First, the Win11 CAPE guest — once OVMF cooperates, the PE sample volume goes through the same sandbox stage as ELF. Second, a novel-PoC generator that reads the wiki's family pages, identifies behaviors that have been documented across multiple samples, and proposes new PoC implementations for manual researcher review. The point there is not to ship offensive tooling; it is to use the wiki as a source of capability hypotheses that humans can validate. Third, expanding the agent's tool registry — floss for obfuscated string extraction, a YARA generator, a CAPE config extractor, and probably a Volatility wrapper for memory analysis if the sandbox grows up. None of these is novel on its own. The leverage comes from chaining them under one agent with consistent output formatting.
If you are thinking about building something like this, the unintuitive lesson from the last two weeks is that the LLM is the easiest part. Picking a model, wiring up MCP tools, writing a system prompt — that is one weekend. The hard parts are: deterministic tooling that the agent can rely on (the SFX extractor, the YARA rules, the static classifier); a sane place for the output to go (the markdown wiki, not a database); rate limits and error handling on the things you do not control (Ollama Cloud, OpenCTI's queue, GitHub pushes); and the discipline to throttle the agent to one sample at a time until you trust the output. Quality compounds slowly. Get the boring parts right and the agent will surprise you. Get the boring parts wrong and you will be debugging your own infrastructure instead of analyzing malware. I would rather analyze malware.