This is part 1 of a 3-part blog post (Part 2, Part 3), which is in part based on my papers [CK21] and [Kar23], which appeared at TCC '21 and SaTML '23, respectively.

Consider LLM chatbots like OpenAI's ChatGPT or Anthropic's Claude. Could a long conversation with ChatGPT be enough to "clone" it? Said differently, could someone get to know Claude so well, that they could program "Claudio," who performs a reasonable impersonation of Claude?

More generally, in a **model extraction attack**, an adversary maliciously probes an interface to a machine learning (ML) model in an attempt to extract the ML model itself. Model extraction was first introduced as a risk by Tramèr et al. [TZJ+16] at USENIX '16.

(An imaginative depiction of Claude meeting Claudio.)

Besides impersonating chatbots, model theft is a risk whenever interaction with the model is allowed, as in any machine learning as a service (MLaaS) scenario. So, let's highlight two main concerns that summarize why model extraction deserves serious attention.

- **Intellectual property concerns.** Often, an ML model is considered confidential --- it can be extremely valuable intellectual property that should not become public. In the extreme case, as is arguably true for modern LLMs, the model in the wrong hands could cause serious harm.
- **Security and privacy concerns.** Preventing model extraction helps increase security and privacy. For instance, we know that adversarial example attacks become **easier** once the model is known, rather than being a black box (see e.g. [CJM20] and references therein). The same is true for model inversion attacks, where the adversary obtains information about the data used to train the model.

In part 1 of this series, I'm going to spend some time introducing the model extraction scenario first, and then give an overview of a specific type of model extraction defense, which I call an Observational Model Extraction Defense (OMED), with a focus on giving my interpretation of some of the theoretical underpinnings of these defenses. After that, I'll describe the technical contributions of [Kar23], which cast serious doubt on whether OMEDs can provide any **provable security guarantees** against model extraction.

- Defending against model extraction
- Towards provable security against model extraction
- The contributions of [Kar23]

So, clearly, preventing model extraction is important. Understanding how to **defend** against model extraction has received considerable attention (see e.g. [KMAM18], [TZJ+16], [CCG+20], [JSMA19], [PGKS21]).

Most model extraction defenses (MEDs) existing in the academic literature belong to two types.

- **Noisy responses.** The first type aims to limit the amount of information revealed by each client query. One intuitive proposal for this type of defense is to add independent noise to the responses (i.e., respond incorrectly with some probability).
- **Observational model extraction defenses (OMEDs).** The second type aims to distinguish "benign" clients, who want to obtain predictions but will not attempt to extract the model, from "adverse" clients, who aim to extract the model. To tell the two apart, the defense merely observes the behavior of the clients before outputting a decision. This defense does not affect the ML model at all.
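To make the first idea concrete, here is a minimal sketch (my own illustration, not any specific published defense) of a noisy-response wrapper that flips a binary classifier's answer with some small probability:

```python
import random

def noisy_response(model, x, epsilon, rng):
    """Answer a query, but flip the label with probability epsilon.
    (A minimal illustration of the 'noisy responses' idea, not any
    specific published defense; labels are assumed to be in {+1, -1}.)"""
    y = model(x)
    if rng.random() < epsilon:
        return -y  # respond incorrectly with probability epsilon
    return y

# Toy threshold classifier standing in for the proprietary model
model = lambda x: 1 if x >= 0.0 else -1
rng = random.Random(0)
answers = [noisy_response(model, 1.0, 0.2, rng) for _ in range(1000)]
flip_rate = answers.count(-1) / len(answers)  # close to epsilon = 0.2
```

Each query now leaks strictly less information about the true decision boundary, which is exactly the accuracy-for-security trade discussed next.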

In this blog post, we will **not** focus on the noisy responses defense, because it necessarily sacrifices the predictive accuracy of the ML model. In general, defenses are all about balancing security against model extraction with accuracy.
The noisy responses defense trades accuracy for security in a strong way, and may not be a viable option at all for ML systems where accuracy is critical, such as autonomous driving, medical diagnosis, or malware detection.

OMEDs, on the other hand, totally preserve accuracy, since they do not change the underlying model.

A common implementation of the **observational defense** involves a so-called "monitor" that takes as input a batch of requests submitted by the client and computes some statistic that measures the likelihood of adversarial behavior. The goal is to reject a client's requests when the statistic passes a certain threshold. Some examples of proposals following this template are [KMAM18], [JSMA19], and [PGKS21].

Essentially, observational defenses aim to control the **distribution** of the client's queries. They do this by classifying any clients that fail to conform to the appropriate distributions as adverse, and then prohibiting them from accessing the model. See Figure 1 below.
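The monitor template above can be sketched in a few lines. This is a generic illustration of the pattern, not the mechanism of any cited paper; in particular, `repeat_fraction` is a hypothetical statistic chosen only for readability:

```python
def monitor(queries, statistic, threshold):
    """Sketch of the common OMED template: compute a statistic over a
    batch of client queries and reject once it crosses a threshold.
    (Illustrative only; 'repeat_fraction' below is a hypothetical
    statistic, not one from the cited papers.)"""
    return "reject" if statistic(queries) > threshold else "accept"

# Hypothetical statistic: fraction of exactly repeated queries in a batch,
# a crude sign of systematic probing rather than natural use.
def repeat_fraction(queries):
    return 1 - len(set(queries)) / len(queries)

batch = [(0, 1), (0, 1), (0, 1), (1, 0)]   # three copies of the same query
decision = monitor(batch, repeat_fraction, threshold=0.3)
```

The real design work is entirely in choosing `statistic` and `threshold` so that benign clients pass and adverse clients fail, which is the distributional question discussed below.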

Figure 1: A depiction of the model extraction setting in the presence of an OMED. The adverse client queries the ML model, attempting to extract an approximation. The OMED watches over the interaction and outputs a decision to accept (and forward the labels) or reject the client, based on whether it is deemed benign or adverse.

Said differently, any client that wishes to interact with the ML model, adverse or benign, must query according to an acceptable distribution; otherwise, the OMED will refuse service.

This prompts the question: how should we choose the appropriate acceptable distributions? To date, the choice of such distributions has been made heuristically. For instance, in [JSMA19], an acceptable distribution is one with the property that the distribution over Hamming distances between queries is normally distributed.
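As a toy illustration of this kind of check (a crude sketch of my own, not the actual test from [JSMA19], which uses a proper statistical normality test), one could compute pairwise Hamming distances over binary queries and flag batches whose distance distribution looks far from normal:

```python
import statistics

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def pairwise_hamming_distances(queries):
    return [hamming(q1, q2)
            for i, q1 in enumerate(queries) for q2 in queries[i + 1:]]

def looks_normal(distances, max_abs_skew=1.0):
    """Crude normality proxy: sample skewness close to 0.
    (A simplified stand-in for a proper normality test.)"""
    mu = statistics.mean(distances)
    sd = statistics.pstdev(distances)
    if sd == 0:
        return False  # degenerate: all distances identical
    skew = statistics.mean([((d - mu) / sd) ** 3 for d in distances])
    return abs(skew) <= max_abs_skew

dists = pairwise_hamming_distances([(0, 1, 1, 0), (1, 0, 1, 1), (0, 0, 1, 0)])
ok = looks_normal([6, 7, 7, 8, 8, 8, 9, 9, 10])  # symmetric -> skew 0 -> passes
bad = looks_normal([2, 2, 2, 2, 2, 2])           # degenerate -> fails
```

An adversary probing the model with tightly clustered queries (e.g. single-bit perturbations of one input) produces a degenerate or heavily skewed distance distribution and would fail such a check.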

However, no **formal definitions** of security against model extraction have even been suggested, nor has there been much formal work to understand the theoretical underpinnings of the observational defenses proposed in the literature. In fact, this was highlighted by Vinod Vaikuntanathan as an open problem in his talk at the Privacy Preserving Machine Learning workshop at CRYPTO '21 [Vai21].

As a result, a "cat-and-mouse" progression of attacks and defenses has developed, while no satisfying guarantees have been discovered ("for neither cat nor mouse").

In my recent paper [Kar23], I consider whether OMEDs can have satisfying **provable guarantees** of security. The approach is inspired by the modern theory of cryptography, where the security of a protocol is mathematically guaranteed by the infeasibility of some very hard computational problem.

But how do we even define security against model extraction? An initial attempt could try to leverage **zero-knowledge** style security. Zero-knowledge proofs and their simulation-based security could fill another blog post (or an entire textbook); but, essentially, a zero-knowledge guarantee would say that a client learns **nothing** about the ML model that they could not have already learned **prior to the interaction with the model**. Hence, the client learns nothing about the model from its queries.

This sounds great. However, it is too strong a security goal: at the very least, the client will learn the model's answers on its queried examples. In the MLaaS setting, for example, the client must be granted at least some ability to learn information ("in good faith"); otherwise, the client may take their business elsewhere, or not use the model at all.

What kind of zero-knowledge guarantees could we feasibly hope to obtain, then? One possible revised goal could be to guarantee that a client learns zero knowledge beyond whatever they could learn from a set of **random queries** to the model. This privilege constitutes a middle ground between the overly restrictive fully zero-knowledge guarantee and allowing a client total query access to the model.

One of the most important observations of my paper [Kar23] is that, in fact, this notion of security appears to be **implicitly** behind existing OMEDs! The literature on OMEDs tends to cite the goal of **detecting** model extraction, but the downstream effect is that the observational defenses **confine** the queries obtained by the client to some **specific distributions** (by enforcing a particular benign behavior). The benign behavior is enforced because the OMED rejects the client's queries if their distribution fails some chosen statistical test. Hence, the idea of only serving clients confined to these benign query distributions implicitly assumes that whatever can be deduced about the model from queries sampled from such a distribution is indeed okay ("secure enough").

To think about this more deeply, let's focus on the case of OMEDs for ML classifiers (which output a prediction about whether some item, e.g. a picture, belongs to the "+" class or the "-" class). At first glance, the theory of statistical learning (Vapnik and Chervonenkis) --- which tells us that a number of samples proportional to the VC dimension of the hypothesis class suffices for PAC-learning --- seems to dash any hope of using this notion of security to obtain meaningful protection.

This is because an adversary with **unbounded** computational power could simply query the model according to one of the appropriate distributions of random examples a sufficient number of times, and then apply a PAC-learning algorithm (which is guaranteed to exist by the fundamental theorem of statistical learning). The output of this algorithm would be a function that is a strong approximation of the underlying ML model, with high confidence.
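To see the flavor of this learning-based attack in code, here is a toy sketch for a hypothesis class where ERM *is* efficient (one-dimensional thresholds); the names and setup are my own illustration, and the point of the next paragraph is precisely that no such efficient learner is known for richer classes like decision trees:

```python
import random

def extract_by_random_queries(query_model, num_queries, rng):
    """Toy model-extraction-as-learning attack: sample random examples
    from an 'acceptable' distribution, query the black-box model for
    labels, and run ERM over 1-D threshold classifiers."""
    xs = [rng.random() for _ in range(num_queries)]  # uniform random queries
    labels = [query_model(x) for x in xs]            # black-box label queries
    candidates = [0.0] + sorted(xs)
    # ERM: pick the candidate threshold minimizing empirical error
    def err(t):
        return sum((1 if x >= t else -1) != y for x, y in zip(xs, labels))
    return min(candidates, key=err)

secret_model = lambda x: 1 if x >= 0.37 else -1      # the "proprietary" model
rng = random.Random(0)
t_hat = extract_by_random_queries(secret_model, 200, rng)
# t_hat lands just above 0.37 with high probability
```

With 200 uniform queries, the recovered threshold is within the largest sample gap of the true one, mirroring the VC-dimension sample-complexity bound; the catch, as the next paragraph explains, is the *computational* cost for richer hypothesis classes.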

However, this attack doesn't account for the **complexity** of the implied model extraction attack: depending on the complexity of the model, this type of attack may have superpolynomial query and computational complexity. For instance, for many important families of classifiers (e.g. boolean decision trees), no polynomial time PAC-learning algorithms are known, despite intense effort from the learning theory community. Actually, this is true even when the queries are restricted to being uniformly distributed, and the decision tree is "typical" (not only worst-case hard).

All in all, this adds evidence for the claim that the model of security implicitly considered by observational defenses might actually be effective in preventing unwanted model extraction by **computationally bounded adversaries**. The OMED forces the adversary to interact with the query interface in a way that mimics uniformly random examples, or some other distribution, from which it is hard to learn the underlying model.

Then, security against model extraction by computationally bounded adversaries can be **provable** in a complexity-theoretic way (like in Cryptography): one could hope to give a **reduction** from polynomial time PAC-learning to polynomial time model extraction in the presence of observational defenses.
In other words, one could hope to prove a theorem that says "any efficient algorithm to learn an approximation of a proprietary ML model when constrained by an observational defense yields a PAC-learning algorithm (that is currently beyond all known techniques)."

So we have established that OMEDs could possibly be used to obtain a notion of provable security against computationally bounded adversaries. The main question tackled in [Kar23] is the natural follow-up: can OMEDs be **efficiently implemented** against these efficient adversaries?

In [Kar23], I provided a negative answer to this question. I did it via the following program:

- I formally defined an abstraction of OMEDs by unifying the observational defense techniques proposed in the literature.
- Then, I formalized the concepts of **complete** and **sound** OMEDs. Completeness is the **provable** guarantee that any benign client, defined as one interacting according to an acceptable distribution, is accepted and may interact with the ML model's interface (with high probability). Soundness is the provable guarantee that any adverse client, defined as one not querying according to an acceptable distribution, is rejected (with very high probability).
- Next, I introduced a method for generating provably good and efficient attacks on the abstract defense, granted that the defense runs in polynomial time and satisfies a basic form of provable completeness. This is done via a connection to the Covert Learning model of [CK21]. It turns out that this connection implies an attack on any decision tree model protected by any OMED, under a standard cryptographic assumption called **Learning Parity with Noise**.
- This attack was the first **provable and efficient** attack on a large class of MEDs, for a large class of ML models!
- Finally, I extended an algorithm from [CK21] to work in a more general setting. This work produced an even stronger attack on efficient OMEDs: the new attack works against OMEDs satisfying a much larger variety of forms of completeness.
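Informally, the completeness and soundness guarantees above can be sketched as follows (my paraphrase; the error parameters $\delta_c, \delta_s$ are hypothetical placeholders, and the precise definitions appear in [Kar23]):

```latex
% Completeness: a client B whose queries are i.i.d. from an
% acceptable distribution D is accepted with high probability:
\Pr\big[\text{OMED accepts } B \;\big|\; B \text{ queries i.i.d. from } D\big] \ge 1 - \delta_c

% Soundness: a client A whose queries do not follow any acceptable
% distribution is rejected with high probability:
\Pr\big[\text{OMED rejects } A\big] \ge 1 - \delta_s
```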

In the next blog post, we will dig into what exactly is Covert Learning, and further illuminate the connection to model extraction.

[BCK+22] Eric Binnendyk, Marco Carmosino, Antonina Kolokolova, Ramyaa Ramyaa, and Manuel Sabin. Learning with distributional inverters. In International Conference on Algorithmic Learning Theory, pages 90–106. PMLR, 2022.

[BFKL93] Avrim Blum, Merrick Furst, Michael Kearns, and Richard J Lipton. Cryptographic primitives based on hard learning problems. In Annual International Cryptology Conference, pages 278–291. Springer, 1993.

[CCG+20] Varun Chandrasekaran, Kamalika Chaudhuri, Irene Giacomelli, Somesh Jha, and Songbai Yan. Exploring connections between active learning and model extraction. In 29th USENIX Security Symposium (USENIX Security 20), pages 1309–1326, 2020.

[CJM20] Nicholas Carlini, Matthew Jagielski, and Ilya Mironov. Cryptanalytic extraction of neural network models. arXiv preprint arXiv:2003.04884, 2020.

[CK21] Ran Canetti and Ari Karchmer. Covert learning: How to learn with an untrusted intermediary. In Theory of Cryptography Conference, pages 1–31. Springer, 2021.

[CLP13] Kai-Min Chung, Edward Lui, and Rafael Pass. Can theories be tested? a cryptographic treatment of forecast testing. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 47–56, 2013.

[DKLP22] Adam Dziedzic, Muhammad Ahmad Kaleem, Yu Shen Lu, and Nicolas Papernot. In- creasing the cost of model extraction with calibrated proof of work. arXiv preprint arXiv:2201.09243, 2022.

[GBDL+16] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pages 201–210. PMLR, 2016.

[GL89] Oded Goldreich and Leonid A Levin. A hard-core predicate for all one-way functions. In Proceedings of the twenty-first annual ACM symposium on Theory of computing, pages 25–32, 1989.

[GMR89] Shafi Goldwasser, Silvio Micali, and Charles Rackoff. The knowledge complexity of interactive proof systems. SIAM Journal on computing, 18(1):186–208, 1989.

[GPV08] Craig Gentry, Chris Peikert, and Vinod Vaikuntanathan. Trapdoors for hard lattices and new cryptographic constructions. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 197–206, 2008.

[JCB+20] Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot. High accuracy and high fidelity extraction of neural networks. In 29th USENIX Security Symposium (USENIX Security 20), pages 1345–1362, 2020.

[JSMA19] Mika Juuti, Sebastian Szyller, Samuel Marchal, and N Asokan. PRADA: protecting against DNN model stealing attacks. In 2019 IEEE European Symposium on Security and Privacy (EuroS&P), pages 512–527. IEEE, 2019.

[Kar23] Ari Karchmer. Theoretical limits of provable security against model extraction by efficient observational defenses. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 605–621. IEEE, 2023.

[KM93] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the fourier spectrum. SIAM Journal on Computing, 22(6):1331–1348, 1993.

[KMAM18] Manish Kesarwani, Bhaskar Mukhoty, Vijay Arya, and Sameep Mehta. Model extraction warning in mlaas paradigm. In Proceedings of the 34th Annual Computer Security Applications Conference, pages 371–380, 2018.

[KST09] Adam Tauman Kalai, Alex Samorodnitsky, and Shang-Hua Teng. Learning and smoothed analysis. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pages 395–404. IEEE, 2009.

[Nan21] Mikito Nanashima. A theory of heuristic learnability. In Conference on Learning Theory, pages 3483–3525. PMLR, 2021.

[O’D14] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.

[PGKS21] Soham Pal, Yash Gupta, Aditya Kanade, and Shirish Shevade. Stateful detection of model extraction attacks. arXiv preprint arXiv:2107.05166, 2021.

[Pie12] Krzysztof Pietrzak. Cryptography from learning parity with noise. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 99–114. Springer, 2012.

[PMG+17] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.

[RWT+18] M Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M Songhori, Thomas Schneider, and Farinaz Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages 707–721, 2018.

[TZJ+16] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium (USENIX Security 16), pages 601–618, 2016.

[Vai21] Vinod Vaikuntanathan. Secure computation and ppml: Progress and challenges. https://www.youtube.com/watch?v=y2iYEHLY2xE&ab_channel=TheIACR, 2021.

[YZ16] Yu Yu and Jiang Zhang. Cryptography with auxiliary input and trapdoor from constant-noise lpn. In Annual International Cryptology Conference, pages 214–243. Springer, 2016.