Ari Karchmer — October 2023
Based on my papers [CK21] and [Kar23], presented at TCC '21 and SaTML '23.
TLDR
Can machine learning models be protected from theft when served through a query interface? Observational Model Extraction Defenses (OMEDs) try to detect adversarial clients by monitoring their query patterns. We show that OMEDs are fundamentally limited: using Covert Learning—a cryptographic learning paradigm where queries are indistinguishable from random—an attacker can extract a model while appearing completely benign to any efficient OMED. The upshot is that no polynomial-time observational defense can provably prevent model extraction for concept classes like decision trees, assuming standard cryptographic hardness (LPN). The result is theoretical in nature, but the attacks it implies are not.
INTRODUCTION
Consider LLM chatbots like OpenAI's ChatGPT or Anthropic's Claude. Could a prolonged interaction with ChatGPT be sufficient to "clone" it? In other words, could someone engage with Claude extensively enough to develop "Claudio," a reasonable impersonation of Claude?
In a model extraction attack, an adversary maliciously probes an interface to a machine learning (ML) model in an attempt to extract the ML model itself. Model extraction was first introduced as a risk by Tramer et al. [TZJ+16] at USENIX '16.
Besides impersonating chatbots, having an ML model stolen poses significant risks in any machine learning as a service (MLaaS) scenario. Two main concerns stand out: the model itself is valuable intellectual property, so a stolen copy undermines the pay-per-query business model, and a local copy gives the attacker unlimited offline access with which to stage further attacks, such as crafting adversarial examples.
DEFENDING AGAINST MODEL EXTRACTION
Clearly, preventing model extraction is important, and how to defend against it has been studied extensively in the academic literature (see e.g., [KMAM18], [TZJ+16], [CCG+20], [JSMA19], [PGKS21]).
Most model extraction defenses (MEDs) in the literature fall into two categories: observational model extraction defenses (OMEDs), which monitor the distribution of client queries to flag adversarial behavior, and noisy response defenses, which perturb the model's answers to degrade the quality of any extracted copy.
In this post, we focus on OMEDs rather than noisy response defenses, since the latter inherently sacrifice predictive accuracy—making them unsuitable for critical ML applications such as autonomous driving, medical diagnosis, or malware detection.
OMEDs preserve accuracy entirely since they do not modify the underlying model. A common implementation involves a "monitor" that analyzes a batch of client requests to compute a statistic measuring the likelihood of adversarial behavior. The goal is to reject a client's requests if their behavior surpasses a certain threshold. Examples can be found in [KMAM18], [JSMA19], and [PGKS21].
Essentially, observational defenses aim to control the distribution of client queries. They classify clients that deviate from acceptable distributions as adverse and subsequently restrict their access to the model.
Any client interacting with the ML model, whether adverse or benign, must query in accordance with an acceptable distribution. If they do not, the OMED will refuse service. But how should we determine the appropriate acceptable distributions? To date, these distributions have been chosen heuristically. For instance, in [JSMA19], an acceptable distribution is one where the distribution of Hamming distances between queries is normally distributed.
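As a toy illustration of such a monitor (a deliberately simplified sketch, not the actual statistic of [JSMA19]; all function names and thresholds below are hypothetical), one can compare the mean pairwise Hamming distance of a batch of binary queries against what uniformly random queries would produce:

```python
import random
from itertools import combinations

def mean_pairwise_hamming(queries):
    """Mean Hamming distance over all pairs of equal-length binary tuples."""
    dists = [sum(a != b for a, b in zip(q1, q2))
             for q1, q2 in combinations(queries, 2)]
    return sum(dists) / len(dists)

def monitor_accepts(queries, n, tolerance=None):
    """Accept the batch iff its mean pairwise distance looks 'uniform'.

    Uniformly random n-bit queries have pairwise distances concentrated
    around n/2; the tolerance of n/8 is an arbitrary toy choice.
    """
    if tolerance is None:
        tolerance = n / 8
    return abs(mean_pairwise_hamming(queries) - n / 2) <= tolerance

rng = random.Random(0)
n = 64
benign = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(20)]

# An extraction-style batch: a seed point plus single-bit flips of it.
seed = tuple(rng.randint(0, 1) for _ in range(n))
adverse = [seed] + [seed[:i] + (1 - seed[i],) + seed[i + 1:] for i in range(19)]

print(monitor_accepts(benign, n))   # benign batch passes
print(monitor_accepts(adverse, n))  # clustered batch is flagged
```

Here a batch of uniformly random queries passes, while a batch clustered around a single seed point, typical of naive membership-query attacks, is flagged. Real OMEDs use more refined statistics, but the accept/reject structure is the same.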
No formal definitions of security against model extraction have been proposed. Moreover, there has been limited formal work to understand the theoretical foundations of the observational defenses suggested in the literature. This gap was highlighted by Vinod Vaikuntanathan in his talk at the Privacy Preserving Machine Learning workshop at CRYPTO '21. As a result, a "cat-and-mouse" dynamic has developed between attacks and defenses, with no satisfying guarantees being established for either side.
CAN OMEDS PROVIDE PROVABLE SECURITY?
In [Kar23], we explore whether OMEDs can provide satisfying provable guarantees of security. Inspired by modern Cryptography, where the security of protocols is mathematically reduced to the infeasibility of solving certain computational problems, we sought to apply similar principles to model extraction defenses.
But how do we define security against model extraction? An initial approach might consider leveraging zero-knowledge style security, where a client learns nothing about the ML model beyond what they could have learned prior to interacting with it. While appealing, this is overly restrictive—at a minimum, the client should be able to learn some information "in good faith" from interacting with the model.
A revised security goal could instead ensure that a client learns nothing beyond what they could learn from a set of random queries to the model. This strikes a balance between the overly restrictive zero-knowledge guarantee and allowing complete query access.
One key insight of [Kar23] is that the notion of security implicitly underlying existing OMEDs aligns with this revised goal. While OMED literature typically cites the objective of detecting model extraction attempts, the downstream effect is confining client queries to specific, acceptable distributions by enforcing benign behavior. This enforcement assumes that any information a benign client can deduce from queries sampled from these distributions is considered secure.
Consider OMEDs for ML classifiers. The theory of statistical learning, particularly the VC dimension framework, suggests that a number of samples proportional to the VC dimension suffices for PAC learning. This seems to undermine OMED-based protection: an adversary with unbounded computational power could simply query according to an acceptable distribution sufficiently many times and apply a PAC-learning algorithm. The outcome would be a function that closely approximates the underlying ML model with high confidence.
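For concreteness, the standard agnostic PAC bound (a classical result, not specific to [Kar23]) states that a sample of size

```latex
m(\epsilon,\delta) \;=\; O\!\left(\frac{\mathrm{VC}(\mathcal{H}) + \log(1/\delta)}{\epsilon^{2}}\right)
```

i.i.d. examples suffices to output, with probability at least \(1-\delta\), a hypothesis whose error is within \(\epsilon\) of the best achievable in the class \(\mathcal{H}\). Note that this bounds only the number of samples; the runtime needed to find such a hypothesis may still be superpolynomial.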
However, this attack model does not account for the complexity of the implied extraction attack. Depending on the model's complexity, such an attack may require superpolynomial query and computational resources. For many important classifier families (e.g., boolean decision trees), no polynomial-time PAC-learning algorithms are known, despite extensive efforts from the learning theory community. This remains true even when queries are restricted to uniformly distributed examples and the decision tree is "typical" rather than worst-case hard.
This provides evidence that the security model implicitly considered by observational defenses may effectively prevent unwanted model extraction by computationally bounded adversaries. The OMED constrains the adversary to interact with the query interface in a manner that mimics an acceptable distribution, from which learning the underlying model is computationally challenging.
Security against model extraction by computationally bounded adversaries can thus be provable in a complexity-theoretic sense, akin to Cryptography. One could aim to establish a reduction from polynomial-time PAC-learning to polynomial-time model extraction in the presence of observational defenses: "any efficient algorithm capable of learning an approximation of a proprietary ML model while constrained by an observational defense would yield a PAC-learning algorithm, which is currently beyond all known techniques."
COVERT LEARNING
Having established that OMEDs could potentially provide provable security against computationally bounded adversaries, the next natural question is: can OMEDs be efficiently implemented to withstand these adversaries? To answer this, we need to introduce the concept of Covert Learning [CK21]—a learning paradigm that lies at the center of the impossibility results of [Kar23].
The Covert Learning model is a variant of Valiant's PAC-learning model, situated in the setting of agnostic learning with queries. In this setting, a learning algorithm has access to an oracle that labels points in the domain of the target concept, and these points can be chosen arbitrarily by the learning algorithm. In the "agnostic" setting, the goal is for the learning algorithm to identify a model, within some predefined "touchstone" class, that predicts the arbitrary target concept nearly as well as the best possible model in the touchstone class does.
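In symbols (notation mine, following the standard agnostic definition): writing \(\mathcal{T}\) for the touchstone class, \(f\) for the target concept, and \(\mathrm{err}_D(h) = \Pr_{x \sim D}[h(x) \neq f(x)]\) for the error of a hypothesis \(h\), the learner must output \(h\) satisfying

```latex
\Pr\big[\,\mathrm{err}_D(h) \;\le\; \min_{c \in \mathcal{T}} \mathrm{err}_D(c) + \epsilon \,\big] \;\ge\; 1 - \delta .
```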
The essence of the Covert Learning model is to formalize a new type of privacy in learning theory. Specifically, Covert Learning algorithms guarantee that their queries leak very little information about the concept to an adversary who can observe those queries. In effect, the algorithm keeps private the very thing it is learning.
In other words, Covert Learning allows one to learn a concept (using internal randomness known only to the learner),[1] while the adversary remains almost completely "in the dark" about the concept, even given a view of the entire transcript of membership queries and oracle responses. Crucially, the adversary is not privy to the learner's secret randomness.
An example to build intuition: imagine being in a course with one teacher and many classmates, where you are graded "on a curve." To finish ranked first, you need to learn the day's concept by asking the teacher yes-or-no questions. The catch is that every classmate observes your questions and the teacher's answers. Can you design a set of questions that allows you to learn the concept while preventing your classmates from learning it similarly? Covert Learning algorithms provide a solution under natural assumptions.
Mathematically, Covert Learning formalizes this using the simulation-based security paradigm from Cryptography. Roughly speaking, any membership query learning algorithm is a Covert Learning algorithm if it has an accompanying simulator (a polynomial-time algorithm) which, when given access only to random examples, emulates the distribution over the membership queries in a computationally indistinguishable way. The simulator can produce a "believable" transcript of the interaction without query access to the underlying concept.
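Schematically (notation mine): if \(L\) is the learner, \(\mathrm{MQ}(f)\) its membership-query oracle, and \(\mathrm{EX}(f, D)\) an oracle returning random labeled examples, the requirement is

```latex
\big\{ \mathrm{Transcript}\big(L^{\mathrm{MQ}(f)}(1^{n})\big) \big\}_{n \in \mathbb{N}}
\;\approx_{c}\;
\big\{ \mathrm{Sim}^{\mathrm{EX}(f,\,D)}(1^{n}) \big\}_{n \in \mathbb{N}},
```

where \(\approx_c\) denotes computational indistinguishability: no polynomial-time distinguisher can tell the two transcript distributions apart with non-negligible advantage.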
For a concept class that is computationally hard to learn from uniformly random examples alone, this proves that the learning transcript reveals very little information to a computationally bounded adversary. The argument has two parts: first, by the simulation guarantee, anything a bounded adversary can compute from the real transcript it could also compute from the simulated transcript, which is generated using only random examples; second, by assumption, learning the concept from random examples is computationally hard.
These two points together lead to a contradiction: an adversary that learned the concept from the transcript would constitute an efficient learner from random examples. Hence, Covert Learning algorithms are reserved for concept classes not (known to be) efficiently learnable in the original PAC-model with respect to a certain "hard example distribution"—such as decision trees and low-depth neural networks.
A special case worth highlighting is natural Covert Learning. Rather than requiring a simulator for the query interaction, natural Covert Learning requires that the distribution over the membership queries is computationally indistinguishable from a single pre-defined "hard example distribution" (e.g., the uniform distribution).
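In symbols (again notation mine), natural Covert Learning replaces the simulator requirement with the condition

```latex
(q_{1}, \ldots, q_{t}) \;\approx_{c}\; D^{\otimes t},
```

i.e., the joint distribution of the membership queries is computationally indistinguishable from \(t\) i.i.d. draws from the hard example distribution \(D\) (e.g., the uniform distribution).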
CONNECTING COVERT LEARNING TO MODEL EXTRACTION
The adversary in natural Covert Learning—a distinguisher that attempts to distinguish the membership queries from random examples—is essentially the same entity as the OMED in a model extraction attack.
See the resemblance between Figures 1 and 2.
This is the key insight that allows us to construct a provable and efficient attack on any OMED.
At first glance, this might seem confusing because normally the adversary in model extraction is the malicious client trying to reverse-engineer the model. However, consider that the job of the OMED is to distinguish benign clients from adverse clients. Thus, from the perspective of the actual adversary (the model extractor), the OMED is adversarial and has essentially the same job as the natural Covert Learning adversary: to detect some property about the client's distribution of queries. This becomes apparent when comparing Figures 1 and 2.
By exploiting this connection, we can demonstrate that natural Covert Learning algorithms form a recipe for an attack that fools the OMED in the same way it fools the Covert Learning adversary. The learning guarantee of natural Covert Learning lets us argue that the client can still extract the underlying model. Hence, a single Covert Learning attack achieves extraction while "fooling" any OMED: the defense outputs "accept" with high probability even though an extraction attack is underway.
THE RESULTS OF [KAR23]
In [Kar23], I provide a negative answer to the question of whether OMEDs can be efficiently implemented against computationally bounded adversaries. The main contributions are a formal, simulation-style definition of security against model extraction for observational defenses, and a proof that, assuming the standard cryptographic hardness of Learning Parity with Noise (LPN), there is a polynomial-time Covert Learning attack on concept classes such as decision trees that every efficient OMED accepts with high probability.
WHY COVERT LEARNING IS THE RIGHT TOOL
To wrap up, it is worth emphasizing why Covert Learning algorithms are at the crux of model extraction attacks in the presence of OMEDs.
All models are at risk of extraction due to their exposure via some interface. The VC dimension argument shows that, given enough random examples and sufficient computational power, an adversary can run a PAC-learning algorithm whose output suffices for extraction. Since this attack uses only random examples, the model extraction adversary can make its queries according to a distribution that is "accepted" by the OMED.
However, the VC dimension argument generally relies on unbounded computational power, and only very simple concept classes are known to be PAC-learnable in polynomial time from random examples (e.g., linear models). The natural question is whether there exist effective defenses against polynomial-time adversaries, as is customary in modern Cryptography. Many important models (e.g., decision trees, low-depth neural networks) are not known to be PAC-learnable in polynomial time from random examples. On the other hand, a class like decision trees is learnable in polynomial time when using carefully synthesized queries. But classic membership query algorithms like the Kushilevitz-Mansour algorithm [KM93] make queries that can be tested and identified as problematic by the OMED.
Covert Learning algorithms bridge the gap between these two settings. They constitute attacks that are benign-looking but actually adverse: polynomial-time membership query learning algorithms that synthesize queries in a special way that is provably hard to distinguish from the classic PAC-learning with random examples setting. This means that Covert Learning attacks are both undetectable by an efficient OMED and run in polynomial time. By contrast, the VC dimension argument is undetectable but not efficient, while the membership query attack of [KM93] is efficient but is detectable by the OMED.
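To make the "benign-looking yet adverse" phenomenon concrete, here is a toy end-to-end sketch using noise-free parity functions. Parities are a degenerate case: they are learnable in polynomial time from uniformly random examples via Gaussian elimination over GF(2), so no Covert Learning machinery is needed, and the classes covered by [Kar23] (e.g., decision trees, under LPN) are precisely those where this easy route is unavailable. Still, the sketch (all code hypothetical) shows how queries that are literally i.i.d. uniform, and hence pass any distributional test an OMED could run, can achieve exact extraction:

```python
import random

def parity_oracle(secret):
    """Black-box 'model': f(x) = <secret, x> mod 2."""
    return lambda x: sum(s & b for s, b in zip(secret, x)) % 2

def solve_gf2(rows, n):
    """Gaussian elimination over GF(2) on augmented rows [x | label].
    Returns the unique solution, or None if the system is rank-deficient."""
    rows = [r[:] for r in rows]
    sol_rows = []
    for col in range(n):
        piv = next((r for r in rows if r[col] == 1), None)
        if piv is None:
            return None  # need more queries to reach full rank
        rows.remove(piv)
        # Eliminate this column from all other rows, pivots included.
        rows = [[a ^ b for a, b in zip(r, piv)] if r[col] else r for r in rows]
        sol_rows = [[a ^ b for a, b in zip(r, piv)] if r[col] else r
                    for r in sol_rows]
        sol_rows.append(piv)
    # Each pivot row is now e_col | secret[col].
    secret = [0] * n
    for r in sol_rows:
        secret[r.index(1)] = r[-1]
    return secret

rng = random.Random(1)
n = 32
secret = [rng.randint(0, 1) for _ in range(n)]
model = parity_oracle(secret)

# The attacker's queries are i.i.d. uniform: indistinguishable from benign.
rows = []
while True:
    q = [rng.randint(0, 1) for _ in range(n)]
    rows.append(q + [model(q)])
    if len(rows) >= n:
        extracted = solve_gf2(rows, n)
        if extracted is not None:
            break

print(extracted == secret)  # True: exact extraction from uniform queries
```

The point of Covert Learning is to recover this same "uniform-looking queries, full extraction" behavior for classes like decision trees, where learning from truly random examples is believed to be hard, by hiding structured queries inside a distribution that is only computationally indistinguishable from uniform.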
[1] Note that the internal randomness is not even shared with the concept oracle. To use an analogy from Cryptography, Covert Learning is "public-key" in nature, as opposed to "symmetric-key," which might rely on shared randomness between the learner and the concept oracle.
REFERENCES