Building a Self-Improving Malware Classifier with Claude

We're building Sandstrike, a machine-learning classifier that scans newly published npm and PyPI packages for supply-chain malware. Every 15 minutes, it pulls the latest packages from both registries, extracts features, and runs inference. Suspicious packages get flagged to Slack.

The model is good, but it's not perfect. Every cycle flags 3-5 packages, and most of them are false positives: legitimate packages from real companies that happen to trip the classifier's heuristics. The manual review workflow looked like this:

  1. See Slack alert
  2. For each flagged package: check the registry metadata, find the GitHub repo, look at the org, evaluate legitimacy
  3. If it's a false positive: download the package, add it to the benign training set
  4. When enough FPs accumulate, retrain the model with the new data
  5. Export, deploy, and restart the monitor

I was doing this by hand, multiple times per day. The research step alone (checking npm/PyPI metadata, verifying GitHub repos, searching for evidence that a company is legitimate) takes 2-5 minutes per package. Multiply that by 15 packages, and you're looking at an hour of tedious triage work.

This is exactly the kind of task that AI should be doing.

The Architecture

The solution is embarrassingly simple: a bash script that runs claude -p every 30 minutes with a detailed prompt file explaining the review methodology.

Claude Code runs on a Mac (for authentication), then SSHs into the environment container that hosts the ML pipeline. It has access to:

  • Bash — for SSH commands to the training server
  • WebFetch — for registry metadata (PyPI JSON API, npm registry)
  • WebSearch — for researching unknown organizations
  • Read/Write/Grep/Glob — for local file operations and logging

That's it. No Agent spawning, no complex orchestration. Just Claude with a well-written prompt and the right tools.
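The driver itself is tiny. Here's a minimal sketch of the loop in Python (the real version is a bash script; the prompt filename, tool list, and exact CLI flag spelling here are assumptions, so check them against your Claude Code version):

```python
import subprocess
import time

PROMPT_FILE = "review_prompt.md"  # hypothetical name for the prompt file
INTERVAL_SECONDS = 30 * 60        # one review cycle every 30 minutes

def build_command(prompt_file: str) -> list:
    # Non-interactive Claude Code invocation; the tool list mirrors the
    # tools described above. Flag names may vary between CLI versions.
    return [
        "claude", "-p", "Follow the instructions in " + prompt_file,
        "--allowedTools", "Bash,WebFetch,WebSearch,Read,Write,Grep,Glob",
    ]

def run_forever() -> None:
    while True:
        subprocess.run(build_command(PROMPT_FILE), check=False)
        time.sleep(INTERVAL_SECONDS)
```

Everything interesting lives in the prompt file, not the loop.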

The Prompt

The prompt file is ~160 lines of structured instructions. It's the most important piece of the system. Here's what it covers:

Step 1: Query the database. Since the sqlite3 CLI isn't installed on the container, a Python script joins the seen_packages table against a new reviews table to find unreviewed flags.
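The query itself is a straightforward anti-join. A sketch of that script, assuming a minimal schema (the table and column names here are guesses, not the real ones):

```python
import sqlite3

def unreviewed_flags(db_path: str) -> list:
    """Return flagged packages that have no matching row in reviews yet."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            """
            SELECT s.ecosystem, s.name, s.version
            FROM seen_packages AS s
            LEFT JOIN reviews AS r
              ON r.ecosystem = s.ecosystem AND r.name = s.name
            WHERE s.flagged = 1
              AND r.name IS NULL
            """
        ).fetchall()
    finally:
        con.close()
```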

Step 2: Research each package. Specific instructions per ecosystem:

  • PyPI: fetch /pypi/{name}/json, check author, email, project URLs, version count
  • npm: fetch registry.npmjs.org/{name}, check maintainers, repo, downloads
  • Both: verify linked GitHub repos via gh api, search for org reputation
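Both registries expose everything needed as plain JSON, so the research step mostly reduces to fetching a URL and reading a handful of fields. A sketch (the field names match the public registry APIs; which signals to extract is my own guess at what the prompt asks for):

```python
import json
from urllib.request import urlopen

def pypi_signals(data: dict) -> dict:
    """Pull review signals out of a pypi.org/pypi/{name}/json response."""
    info = data.get("info") or {}
    return {
        "author": info.get("author"),
        "author_email": info.get("author_email"),
        "project_urls": info.get("project_urls") or {},
        "version_count": len(data.get("releases") or {}),
    }

def npm_signals(data: dict) -> dict:
    """Pull review signals out of a registry.npmjs.org/{name} response."""
    repo = data.get("repository") or {}
    return {
        "maintainers": [m.get("name") for m in data.get("maintainers") or []],
        # npm's repository field is sometimes a bare string, sometimes a dict
        "repository": repo.get("url") if isinstance(repo, dict) else repo,
        "version_count": len(data.get("versions") or {}),
    }

def fetch_json(url: str) -> dict:
    with urlopen(url) as resp:
        return json.load(resp)

# Usage: pypi_signals(fetch_json("https://pypi.org/pypi/" + name + "/json"))
```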

Step 3: Classify. Clear criteria for FP vs TP vs BORDERLINE:

  • FP signals: real GitHub repo, known org, corporate emails, many versions, meaningful downloads
  • TP signals: GitHub 404, empty metadata, tiny stub, no author identity, name squatting
  • BORDERLINE: when uncertain — these don't get added to training data
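Claude weighs these signals with judgment rather than a fixed formula, but the criteria can be approximated as a simple score. A hypothetical sketch (the signal names and thresholds are invented for illustration):

```python
def classify(signals: dict) -> str:
    """Map research signals to FP / TP / BORDERLINE, mirroring the criteria above."""
    fp_score = sum([
        signals.get("github_repo_exists", False),
        signals.get("known_org", False),
        signals.get("corporate_email", False),
        signals.get("version_count", 0) >= 5,
        signals.get("weekly_downloads", 0) >= 1000,
    ])
    tp_score = sum([
        signals.get("github_404", False),
        signals.get("empty_metadata", False),
        signals.get("tiny_stub", False),
        signals.get("no_author", False),
        signals.get("name_squatting", False),
    ])
    # Only call it when the evidence points one way; mixed or weak
    # evidence stays BORDERLINE and never enters the training data.
    if fp_score >= 3 and tp_score == 0:
        return "FP"
    if tp_score >= 3 and fp_score == 0:
        return "TP"
    return "BORDERLINE"
```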

Steps 4-6: Act on the decisions. Download FPs to the training directories, record each review in SQLite, and check whether the retrain threshold (10+ new FPs) has been reached.
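Recording the review and checking the threshold is another small piece of SQLite plumbing. A sketch, again with a guessed schema (the retrained flag here is my assumption for marking FPs already consumed by a previous retrain):

```python
import sqlite3

RETRAIN_THRESHOLD = 10  # retrain once 10+ new FPs have accumulated

def record_review(db_path: str, ecosystem: str, name: str, verdict: str) -> None:
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute(
            "INSERT INTO reviews (ecosystem, name, verdict, retrained) "
            "VALUES (?, ?, ?, 0)",
            (ecosystem, name, verdict),
        )
    con.close()

def retrain_due(db_path: str) -> bool:
    con = sqlite3.connect(db_path)
    try:
        (pending,) = con.execute(
            "SELECT COUNT(*) FROM reviews "
            "WHERE verdict = 'FP' AND retrained = 0"
        ).fetchone()
    finally:
        con.close()
    return pending >= RETRAIN_THRESHOLD
```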

The key insight: the prompt encodes the exact same decision process I was doing manually. It's not asking Claude to be creative or make judgment calls outside its competence. It's asking it to do web research and pattern matching, both of which it's extremely good at.

The Self-Improvement Loop

Here's where it gets interesting. The system doesn't just classify packages — it feeds corrections back into the model:

  1. FP packages get downloaded to the false positive directory
  2. Manifests get updated, so the feature extraction pipeline includes them
  3. When 10+ new FPs accumulate, the autopilot kicks off retraining:
    • Feature extraction
    • Model training
    • Export the model
    • The monitor will restart to pick up the new model
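Because extraction and training can outlive a single autopilot run, the retrain step needs a guard so two runs never stack. A lock-file sketch (the lock path and step scripts are hypothetical, and the monitor restart is left out):

```python
import os
import subprocess

LOCK_FILE = "/tmp/sandstrike_retrain.lock"  # hypothetical lock path

RETRAIN_STEPS = [
    ["python", "extract_features.py"],  # hypothetical script names
    ["python", "train_model.py"],
    ["python", "export_model.py"],
]

def retrain() -> bool:
    """Run the retraining pipeline; return True only if this run did the work."""
    if os.path.exists(LOCK_FILE):
        return False  # a previous autopilot run still owns the pipeline
    open(LOCK_FILE, "w").close()
    try:
        for step in RETRAIN_STEPS:
            subprocess.run(step, check=True)
    finally:
        os.remove(LOCK_FILE)
    return True
```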

The next autopilot run checks if extraction is still running and picks up where the previous one left off. The model improves with each cycle because each false positive it reviews serves as training data, helping prevent similar false positives in the future.

The Sandstrike monitor catches threats. The autopilot improves its ability to catch threats. That's the loop. Hopefully, we don't overfit, but that's another problem.


Curious what we're building? Sign up for a trial at https://dashboard.turen.io/.
