This is part 1 of a 3-part blog post (Part 2, Part 3 -- coming soon), based in part on my papers [CK21] and [Kar23], which appeared at TCC '21 and SaTML '23, respectively.
Consider LLM chatbots like OpenAI's ChatGPT or Anthropic's Claude. Could a long conversation with ChatGPT be enough to "clone" it? Said differently, could someone get to know Claude so well that they could program "Claudio," who performs a reasonable impersonation of Claude?
More generally, in a model extraction attack, an adversary maliciously probes an interface to a machine learning (ML) model in an attempt to extract the ML model itself. Model extraction was first introduced as a risk by Tramer et al. [TZJ+16] at USENIX '16.
(An imaginative depiction of Claude meeting Claudio.)
Besides impersonating chatbots, having an ML model stolen is a risk whenever interaction with the model is allowed, as in any machine learning as a service (MLaaS) scenario. So, let's highlight two main concerns that summarize why we need to take the model extraction scenario seriously.
In part 1 of this series, I'm going to spend more time introducing the model extraction scenario, and then give an overview of a specific type of model extraction defense, which I call an Observational Model Extraction Defense (OMED), with a focus on giving my interpretation of some of the theoretical underpinnings of these defenses. After that, I'll describe the technical contributions of [Kar23], which cast serious doubt on whether OMEDs can provide any provable security guarantees against model extraction.
So, clearly, preventing model extraction is important. Understanding how to defend against model extraction has received considerable attention (see e.g. [KMAM18][TZJ+16][CCG+20][JMAM19][PGKS21]).
Most model extraction defenses (MEDs) in the academic literature belong to one of two types: defenses that perturb the model's responses with noise, and observational defenses that monitor the client's queries.
In this blog post, we will not focus on the noisy responses defense, because it necessarily sacrifices the predictive accuracy of the ML model. In general, defenses are all about balancing security against model extraction with accuracy. The noisy responses defense trades accuracy for security so aggressively that it may not be a viable option at all for ML systems where accuracy is critical, such as autonomous driving, medical diagnosis, or malware detection.
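To make the trade-off concrete, here is a minimal sketch of what a noisy responses defense might look like for a binary classifier. The flip-based noise and the function names are illustrative assumptions for this sketch, not a specific proposal from the literature.

```python
import random

def noisy_response(model_predict, x, flip_prob=0.1):
    """Answer a query, but flip the returned label with probability flip_prob.

    `model_predict` is assumed to map an input to a label in {+1, -1}.
    This is a hypothetical sketch of the noisy responses idea, not a
    defense taken verbatim from the literature.
    """
    label = model_predict(x)
    if random.random() < flip_prob:
        return -label  # noise that frustrates extraction, but also hurts benign clients
    return label
```

Every client, benign or adversarial, now sees a degraded model, which is exactly the accuracy cost described above.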
OMEDs, on the other hand, fully preserve accuracy, since they do not change the underlying model.
A common implementation of the observational defense involves a so-called "monitor" that takes as input a batch of requests submitted by the client and computes some statistic that measures the likelihood of adversarial behavior. The goal is to reject a client's requests when the statistic exceeds a certain threshold. Some examples of proposals following this template are [KMAM18], [JMAM19], and [PGKS21].
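In pseudocode, the template looks roughly like the sketch below; `compute_statistic` and `threshold` are placeholders of my own naming for whatever measure of adversarial behavior a concrete proposal uses.

```python
def observational_defense(query_batch, model_predict, compute_statistic, threshold):
    """Serve a batch of queries only if the monitor's statistic stays below a threshold.

    A hedged sketch of the general OMED template: the concrete statistic and
    threshold vary across proposals such as [KMAM18], [JMAM19], and [PGKS21].
    """
    score = compute_statistic(query_batch)
    if score > threshold:
        return None  # reject: the batch looks adversarial, so refuse service
    return [model_predict(x) for x in query_batch]  # accept: answer as usual
```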
Essentially, observational defenses aim to control the distribution of the client's queries. They do this by classifying any clients that fail to conform to the appropriate distributions as adversarial, and then prohibiting them from accessing the model. See Figure 1 below.
Said differently, any client that wishes to interact with the ML model, adversarial or benign, must query according to an acceptable distribution, since otherwise the OMED will refuse service.
This prompts the question: how should we choose the appropriate acceptable distributions? To date, the choice of such distributions has been made heuristically. For instance, in [JMAM19], an acceptable distribution is one with the property that the Hamming distances between queries are normally distributed.
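As a hedged illustration of that kind of test (a stand-in for, not a reproduction of, the procedure in [JMAM19]), a monitor might collect pairwise Hamming distances over a batch of binary-valued queries and check whether they look normally distributed:

```python
import numpy as np
from scipy.stats import shapiro  # Shapiro-Wilk normality test

def looks_benign(queries, alpha=0.05):
    """Accept the batch if pairwise Hamming distances appear normally distributed.

    Illustrative only: the choice of Hamming distance, the Shapiro-Wilk test,
    and the significance level alpha are assumptions for this sketch.
    Requires a batch of at least a handful of queries.
    """
    queries = np.asarray(queries)
    dists = [np.count_nonzero(queries[i] != queries[j])
             for i in range(len(queries)) for j in range(i + 1, len(queries))]
    _, p_value = shapiro(dists)  # low p-value: distances do not look normal
    return p_value >= alpha      # False means the OMED refuses service
```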
However, no formal definitions of security against model extraction have even been suggested. Neither has there been much formal work done to understand the theoretical underpinnings of the observational defenses proposed in the literature. In fact, this was highlighted by Vinod Vaikuntanathan as an open problem in his talk at the Privacy Preserving Machine Learning workshop at CRYPTO '21.
As a result, a "cat-and-mouse" progression of attacks and defenses has developed, while no satisfying guarantees have been discovered ("for neither cat nor mouse").
In my recent paper [Kar23], I consider whether OMEDs can have satisfying provable guarantees of security. I am inspired by the modern theory of Cryptography, where the security of a protocol is mathematically guaranteed by the infeasibility of some very hard computational problem.
But, how do we even define security against model extraction? An initial attempt could try to leverage zero-knowledge style security. Zero-knowledge proofs and their simulation-based security could make up another blog post (or an entire textbook); but, essentially, a zero-knowledge guarantee would say that a client learns nothing about the ML model that they could not have already learned prior to the interaction with the model. Hence, the client learned nothing about the model from its queries.
This sounds great. However, it is too strong a security goal: at the very least, the client will learn the model's answers on the examples it queries. In the MLaaS setting, for example, the client must be granted at least some ability to learn information ("in good faith"). Otherwise, the client may take its business elsewhere, or not use the model at all.
What kind of zero-knowledge guarantees could we feasibly hope to obtain, then? One possible revised goal could be to guarantee that a client learns zero knowledge beyond whatever they could learn from a set of random queries on the model. This constitutes a middle ground between the overly restrictive fully zero-knowledge guarantee and allowing a client total query access to the model.
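One informal way to phrase this relaxed goal (my own sketch, not a verbatim definition from [Kar23]) is simulation-based: whatever an admitted client computes from its interaction with the model could also have been computed from a polynomial number of random labeled examples.

```latex
% Informal definition sketch: security relative to random examples.
% f is the deployed model, D an acceptable query distribution, m = poly(n).
% For every efficient client A served by the defense, there is an efficient
% simulator S, seeing only random labeled examples, whose output is
% indistinguishable from A's view of the interaction:
\[
\forall A \; \exists S : \quad
\mathrm{View}_A(f) \;\approx\; S\big((x_1, f(x_1)), \ldots, (x_m, f(x_m))\big),
\qquad x_1, \ldots, x_m \sim D .
\]
```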
One of the most important observations of my paper [Kar23] is that, in fact, this notion of security appears to be implicitly behind existing OMEDs! The literature on OMEDs tends to cite the goal of detecting model extraction, but the downstream effect is that the observational defenses confine the queries obtained by the client to some specific distributions (by enforcing a particular benign behavior). The benign behavior is enforced because the OMED will reject the client's queries if their distribution fails some chosen statistical test. Hence, the idea of serving only clients confined to these benign query distributions implicitly assumes that whatever can be deduced about the model by a benign client using queries sampled from such a distribution is indeed okay ("secure enough").
To think about this more deeply, let's focus on the case of OMEDs for ML classifiers (which output a prediction about whether some item, e.g. a picture, belongs to the "+" class or the "-" class). At first glance, the theory of statistical learning (Vapnik and Chervonenkis) --- which tells us that a number of samples proportional to the VC dimension of the hypothesis class suffices for PAC-learning --- seems to end any hope of using the current notion of security to obtain meaningful protection.
This is because an adversary with unbounded computational power could simply query the model according to one of the appropriate distributions of random examples a sufficient number of times, and then apply a PAC-learning algorithm (which is guaranteed to exist by the fundamental theorem of statistical learning). The output of this algorithm would be a function that, with high confidence, is a strong approximation of the underlying ML model.
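In code, this (possibly very inefficient) attack is simply "query at random, then run a learner." The hooks below, `sample_input` and `pac_learner`, are hypothetical placeholders: the fundamental theorem only promises that some learner with sample size m proportional to the VC dimension exists, not that it runs in polynomial time.

```python
def extract_by_pac_learning(query_model, sample_input, pac_learner, m):
    """Generic extraction attack sketch against a query interface.

    Draw m inputs from an acceptable distribution, label them with the
    deployed model, and feed the labeled sample to a PAC learner. With m
    proportional to the VC dimension, the returned hypothesis approximates
    the model with high probability -- but the learner itself may require
    superpolynomial time for rich hypothesis classes.
    """
    labeled_sample = [(x, query_model(x)) for x in (sample_input() for _ in range(m))]
    return pac_learner(labeled_sample)
```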
However, this argument doesn't account for the computational complexity of the implied model extraction attack: depending on the complexity of the model, this type of attack may have superpolynomial query and computational complexity. For instance, for many important families of classifiers (e.g. boolean decision trees), no polynomial time PAC-learning algorithms are known, despite intense effort from the learning theory community. This is true even when the queries are restricted to being uniformly distributed, and the decision tree is "typical" (not only worst-case hard).
All in all, this adds evidence for the claim that the notion of security implicitly considered by observational defenses might actually be effective in preventing unwanted model extraction by computationally bounded adversaries. The OMED forces the adversary to interact with the query interface in a way that mimics uniformly random examples (or some other distribution) from which it is hard to learn the underlying model.
Then, security against model extraction by computationally bounded adversaries could be proved in a complexity-theoretic way (as in Cryptography): one could hope to give a reduction from polynomial-time PAC-learning to polynomial-time model extraction in the presence of observational defenses. In other words, one could hope to prove a theorem that says "any efficient algorithm that learns an approximation of a proprietary ML model while constrained by an observational defense yields a PAC-learning algorithm (which is currently beyond all known techniques)."
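Schematically, the hoped-for statement is a reduction of the following shape (my paraphrase of the goal, not a theorem quoted from [Kar23]), where H is the hypothesis class of the deployed model and D an acceptable query distribution:

```latex
% Hoped-for reduction, read contrapositively: if PAC-learning H over D is
% intractable, then so is model extraction behind the OMED.
\[
\text{poly-time extraction of } f \in H \text{ behind the OMED}
\;\Longrightarrow\;
\text{poly-time PAC-learning of } H \text{ over } D .
\]
```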
So, we have established that OMEDs could possibly be used to obtain a notion of provable security against computationally bounded adversaries. The main question I tackle in [Kar23] is the natural follow-up: can OMEDs be efficiently implemented against these efficient adversaries?
In [Kar23], I provide a negative answer to this question. I do so via the following program: