Knowledge distillation — also called model distillation — is a machine-learning technique that compresses a large AI model’s capabilities into a smaller, faster one. The big model (the “teacher”) passes its learned patterns to a smaller model (the “student”), which ends up performing nearly as well at a fraction of the size and cost. The technique underpins some of the most widely deployed AI today, and it is now at the center of a growing wave of corporate and geopolitical disputes.
How It Works: Teacher and Student
Training a neural network normally means showing it millions of labeled examples — “this image is a cat, that one is a dog” — and adjusting its parameters until it gets the right answers. Distillation works differently.
Instead of learning from raw labels, the student trains on the teacher’s probability outputs. When a well-trained teacher model looks at a photo of a cat, it doesn’t just output “cat”. It outputs something like: cat 92%, dog 5%, fox 2%, rabbit 1%. Those small non-zero probabilities are valuable because they reveal what the teacher has learned about similarity: cats and dogs share ears, fur, and whiskers, so the teacher sees structural resemblance between them.
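To make the difference between hard labels and soft targets concrete, here is a minimal Python sketch. It reuses the cat/dog/fox/rabbit numbers from the example above; the two student distributions and the helper name cross_entropy are made up for illustration and do not come from any particular library.

```python
import math

CLASSES = ["cat", "dog", "fox", "rabbit"]

# Hard label: all probability mass on the correct class.
hard_label = [1.0, 0.0, 0.0, 0.0]
# Soft target: the teacher's output from the example above.
teacher_probs = [0.92, 0.05, 0.02, 0.01]


def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats; zero-weight terms are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Two hypothetical students, equally confident about "cat",
# that disagree on which wrong class is most plausible.
student_a = [0.90, 0.06, 0.03, 0.01]  # ranks dog > fox > rabbit, like the teacher
student_b = [0.90, 0.01, 0.03, 0.06]  # ranks rabbit > fox > dog

for name, student in [("A", student_a), ("B", student_b)]:
    hard_loss = cross_entropy(hard_label, student)
    soft_loss = cross_entropy(teacher_probs, student)
    print(f"student {name}: hard-label loss={hard_loss:.4f}, soft-target loss={soft_loss:.4f}")
```

Under the hard label, both students receive the same loss, because only the probability assigned to “cat” counts. Under the teacher’s soft target, student A scores better than student B, since it also matches the teacher’s view of which mistakes are more plausible.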
Geoffrey Hinton — one of the pioneers of deep learning — called this “dark knowledge”: the structured reasoning hidden in a model’s probability distributions that hard labels never capture. When the student trains on these softer targets, it learns not just the right answers but the teacher’s understanding of why those answers make sense.
A key setting is temperature: raising it smooths out the probability distribution, surfacing the dark knowledge and making it easier for the student to absorb. A moderately higher temperature gives a richer learning signal per training example, while a very high one flattens the distribution toward uniform and washes that signal out.
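The effect of temperature can be sketched in a few lines of Python. The logits below are hypothetical, and the function names are illustrative only. The loss follows the commonly described formulation in which the teacher and student distributions are both softened with the same temperature T, compared with KL divergence, and scaled by T squared; treat it as a sketch under those assumptions rather than a reference implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)


# Hypothetical teacher logits for (cat, dog, fox, rabbit).
teacher_logits = [6.0, 3.0, 2.0, 1.0]

for T in (1.0, 2.0, 4.0, 100.0):
    probs = softmax(teacher_logits, T)
    print(f"T={T}: probs={[round(p, 3) for p in probs]}, entropy={entropy(probs):.3f}")
```

As T grows, the printed distribution flattens and its entropy rises. At T=100 it is close to uniform, which illustrates why temperature is tuned rather than simply maximized.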
Why It Matters: Smaller, Faster, Cheaper
The practical payoff is substantial. Large AI models require powerful servers, consume significant energy, and can be slow to respond. Distillation offers a way out:
- DistilBERT, a distilled version of Google’s BERT developed by Hugging Face, is 40% smaller and 60% faster than the original, while retaining 97% of its language-understanding ability.
- DeepSeek’s 2025 distillations demonstrated that a distilled 14-billion-parameter model can outperform an independently trained 32-billion-parameter model on reasoning benchmarks, by inheriting the teacher’s reasoning patterns.
This makes distillation essential for deploying AI on smartphones, in cars, in medical devices, or anywhere the full model would be too slow or expensive to run. It is also why the technique matters to AI businesses: a small company can use distillation to build a capable model without training from scratch.
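A rough back-of-the-envelope calculation shows why parameter count matters for deployment. The sketch below assumes weights stored at 16 bits (2 bytes) each and counts weights only, ignoring activations and any runtime overhead; the function name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Assumes 2 bytes per parameter (16-bit weights) unless told otherwise.
    """
    return num_params * bytes_per_param / 1e9


for name, params in [("32-billion-parameter model", 32e9),
                     ("14-billion-parameter model", 14e9)]:
    print(f"{name}: about {weight_memory_gb(params):.0f} GB of weights at 16-bit")
```

Under these assumptions the 14-billion-parameter model needs roughly 28 GB for its weights versus roughly 64 GB for the 32-billion-parameter one, which can decide whether a model fits on a single accelerator.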
The Dark Side: Distillation Attacks
Distillation becomes controversial when it crosses a line: using someone else’s proprietary model as the teacher without permission, by querying it through its public API at massive scale.
Here is how such an attack unfolds: the attacker creates thousands of fraudulent accounts, sends millions of carefully engineered queries to the target model, records every response, and uses those input-output pairs to train a competing model that imitates the original. At sufficient scale, the attacker can extract not just surface answers but deep reasoning patterns, safety behaviors, and specialized capabilities — the product of years of research and billions of dollars in compute.
Every major AI provider explicitly prohibits this in its terms of service. The core argument: when a customer buys API access, they are paying for inference, not for a copy of the model’s knowledge. From the providers’ perspective, distillation at scale is industrial espionage disguised as ordinary API traffic.
Detecting it is hard. Legitimate users and attackers look nearly identical from the outside — both send queries, both receive responses. Defenders look for behavioral signals: unusual systematic coverage of a model’s capability space, suspicious uniformity across thousands of accounts, or traffic timing that suggests coordinated orchestration rather than real users.
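Two of these signals, clockwork-regular timing and the same prompt template appearing across many accounts, can be illustrated with a toy Python heuristic. Everything here is an assumption made for illustration: the log format, the thresholds, the function names, and the sample data. Real detection systems are far more involved, and this sketch is not a description of any provider’s method.

```python
from collections import defaultdict
from statistics import mean, pstdev


def timing_regularity(timestamps):
    """Coefficient of variation of gaps between requests; near 0 means clockwork-regular."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # not enough data to judge
    return pstdev(gaps) / mean(gaps)


def prompt_template(prompt, prefix_words=4):
    """Crude template signature: the first few lowercased words of the prompt."""
    return " ".join(prompt.lower().split()[:prefix_words])


def flag_accounts(logs, cv_threshold=0.1, min_shared_accounts=3):
    """logs maps account_id -> list of (timestamp_seconds, prompt) tuples.

    Flags accounts that are both unusually regular in timing and share a
    prompt template with at least min_shared_accounts accounts.
    """
    accounts_by_template = defaultdict(set)
    for account, events in logs.items():
        for _, prompt in events:
            accounts_by_template[prompt_template(prompt)].add(account)

    flagged = set()
    for account, events in logs.items():
        cv = timing_regularity(sorted(t for t, _ in events))
        regular = cv is not None and cv < cv_threshold
        shared = any(
            len(accounts_by_template[prompt_template(p)]) >= min_shared_accounts
            for _, p in events
        )
        if regular and shared:
            flagged.add(account)
    return flagged


# Made-up sample data: three scripted accounts and one ordinary user.
example_logs = {
    "bot-1": [(t, f"Solve step by step: problem {t}") for t in (0, 10, 20, 30)],
    "bot-2": [(t, f"Solve step by step: problem {t}") for t in (5, 15, 25, 35)],
    "bot-3": [(t, f"Solve step by step: problem {t}") for t in (2, 12, 22, 32)],
    "user-1": [(0, "What is a good pasta recipe?"),
               (45, "Can you make it vegetarian?"),
               (400, "Explain recursion simply"),
               (410, "Thanks, one more example please")],
}

print(sorted(flag_accounts(example_logs)))
```

Requiring both signals at once is a deliberate choice in this sketch: regular timing alone would also flag legitimate batch jobs, and a shared prompt prefix alone would flag popular phrasing.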
In the News
The largest alleged case on record: in June 2026, Anthropic accused Alibaba of running a months-long distillation campaign targeting its Claude model — nearly 25,000 fraudulent accounts generating 28.8 million interactions between April and June 2026, systematically targeting Claude’s software-engineering, agentic reasoning, and cybersecurity capabilities.
FAQ
Is model distillation legal?
Legitimate distillation using your own data or open-source teacher models is entirely legal and widely practiced in industry and academia. Using another company’s proprietary model without permission violates their terms of service and may raise trade-secret and unfair-competition claims.
Can a distilled model outperform its teacher?
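Usually not: the student learns from the teacher’s outputs, so the teacher’s knowledge is generally its ceiling on the same task. But a student trained on a high-quality teacher can outperform independently trained models of the same size or even larger, as the DeepSeek example above shows, which is what makes distillation powerful.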
What is the difference between distillation and fine-tuning?
Fine-tuning adapts an existing model to a specific task by continuing to train it on new data. Distillation creates a new, smaller model by training it to mimic a larger one. They serve different goals and are often combined.
How is distillation different from model quantization?
Quantization compresses a model by reducing the numerical precision of its weights (for example, from 32-bit floating-point values to 8-bit integers). Distillation trains a brand-new, architecturally smaller model. Both reduce size and cost, and they can be combined: a distilled model can itself be quantized.
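For a sense of what reducing precision means in practice, here is a minimal Python sketch of one simple scheme, symmetric quantization with a single scale per tensor. The weights and function names are hypothetical, and production toolkits use more refined variants.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale per weight. The architecture and parameter count stay the same, which is the key contrast with distillation.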