Ari Karchmer; January 2025
As AI models become increasingly complex and powerful, one critical question keeps emerging: how exactly do these models use their training data? When a model makes a prediction, can we trace back which training examples were most influential in shaping that decision? These aren't just academic questions — they're crucial for understanding how our AI systems work and ensuring they're behaving as intended. This is the problem of data attribution.
Concretely, consider a medical diagnosis model — wouldn't it be valuable to know which training cases most influenced its decision about a particular patient?
In this post, we'll explore two fascinating approaches to answering these questions: classical influence functions (Hampel, 1974; Koh and Liang, 2017) and a newer method called datamodels (Ilyas et al., 2022). We'll see how these methods help us look inside the black box of machine learning models and understand their decision-making processes, and we'll also dive into a bit of the mathematics that links these two methods together.
Influence functions, which have roots in robust statistics, aim to measure how the parameters of a trained model (e.g., the weights of a neural network) would change if we adjusted the "importance" of a single training point ever so slightly. Concretely, if \(\theta^\star\) minimizes the empirical loss \(\frac{1}{N}\sum_{i=1}^N \ell(z_i, \theta)\), then the influence of infinitesimally upweighting a training point \(z_j\) is \[ \mathcal{I}_{\theta^\star}(z_j) \;\triangleq\; \left.\frac{d\,\theta^\star_{\epsilon, z_j}}{d\epsilon}\right|_{\epsilon = 0} \;=\; -\,\mathbf{H}^{-1}\,\nabla_\theta\, \ell(z_j, \theta^\star), \] where \(\theta^\star_{\epsilon, z_j}\) minimizes the empirical loss with \(z_j\) upweighted by \(\epsilon\), and \(\mathbf{H} = \frac{1}{N}\sum_{i=1}^N \nabla^2_\theta\, \ell(z_i, \theta^\star)\) is the Hessian of the empirical loss at \(\theta^\star\) (Koh and Liang, 2017).
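To make this concrete, here is a minimal sketch (my own toy illustration, not code from the papers above) that computes this formula for a small L2-regularized logistic regression, a convex setting where the Hessian is positive definite and cheap to invert exactly. All names and the synthetic data are purely illustrative.

```python
import numpy as np

# Minimal sketch: the classical influence of one training point,
#   I(z_j) = -H^{-1} @ grad_theta l(z_j, theta*),
# for an L2-regularized logistic regression, where the Hessian of the
# empirical loss is positive definite and small enough to invert exactly.

rng = np.random.default_rng(0)
N, d, lam = 200, 5, 1e-2
X = rng.normal(size=(N, d))
y = (X @ rng.normal(size=d) > 0).astype(float)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hessian(theta):
    # Hessian of (1/N) sum_i logloss_i + (lam/2) ||theta||^2 at theta.
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / N + lam * np.eye(d)

def fit(iters=25):
    # Newton's method on the regularized empirical risk.
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / N + lam * theta
        theta -= np.linalg.solve(hessian(theta), grad)
    return theta

theta_star = fit()

# Influence of infinitesimally upweighting training point j.
j = 7
grad_j = X[j] * (sigmoid(X[j] @ theta_star) - y[j])        # grad of l(z_j, theta*)
influence_j = -np.linalg.solve(hessian(theta_star), grad_j)
print("influence of z_j on theta*:", influence_j)
```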
As elegant as influence functions are, applying them to deep learning can be challenging because of non-unique optima, large/singular Hessians, and the fact that deep networks often aren't trained to exact convergence.
More details: First, because of overparameterization, the optimum of a neural network's parameters is typically not unique. This makes the very meaning of \(\mathcal{I}_{\theta^\star}(z_j)\) unclear, and it can make the Hessian \(\mathbf{H}\) singular (so that \(\mathcal{I}_{\theta^\star}(z_j)\) does not even exist). Second, deep learning models are not guaranteed to be (and rarely are) trained to true convergence. In that case, \(\theta^\star\) may not be a genuine optimum, so \(\mathbf{H}\) can have negative or near-zero eigenvalues, and the positive-definiteness assumed by the derivation above fails.
Due to these challenges, influence functions applied to large-scale deep learning settings have often failed to approximate the counterfactual effect of removing certain training data. In part, this is because one must often condition the Hessian (e.g., by adding a damping term) to ensure the influence function exists, or approximate the inverse-Hessian-vector product (IHVP) to handle the Hessian's infeasible dimensionality. As Bae et al. (2022) point out, both of these steps introduce approximation error.
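In practice this often looks something like the sketch below: add a damping term \(\lambda I\) and solve the linear system iteratively via Hessian-vector products instead of forming \(\mathbf{H}^{-1}\). This is only an illustration; `hvp` is a stand-in for whatever Hessian-vector-product routine your framework provides, and both the damping and the truncated iterative solve are precisely the sources of approximation error mentioned above.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Sketch of a damped inverse-Hessian-vector product (IHVP). `hvp(v)` stands in
# for a routine returning H @ v without materializing H (e.g., via double
# backprop in an autodiff framework).

def damped_ihvp(hvp, grad, dim, damping=0.01, maxiter=100):
    """Approximately solve (H + damping * I) x = grad with conjugate gradients."""
    A = LinearOperator((dim, dim), matvec=lambda v: hvp(v) + damping * v,
                       dtype=np.float64)
    x, _ = cg(A, grad, maxiter=maxiter)
    return x

# Toy usage with an explicit, indefinite "Hessian" standing in for hvp.
H_toy = np.array([[2.0, 0.5], [0.5, -0.1]])   # not positive definite
g_toy = np.array([1.0, -1.0])
print(damped_ihvp(lambda v: H_toy @ v, g_toy, dim=2, damping=0.5))
```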
Datamodels provide an alternative that sidesteps some of the pitfalls of calculating derivatives in high-dimensional, non-convex landscapes. Instead of analyzing changes in model parameters, datamodels look directly at how the model’s predictions change when different subsets of training data are used.
Consider a training set \(D\) of size \(N\). A key observation is that any learner's output on a test example \(y\) can be viewed as a function \[ f(y;\, S), \] where \(S \subseteq D\) is the subset used to train the model. A datamodel tries to predict \(f(y; S)\) by learning a simple linear function: \[ g_{\beta}(1_S) = \beta_0 \;+\; \sum_{i=1}^N \beta_i \,[1_S]_i, \] where \(1_S \in \{0,1\}^N\) is the indicator vector of \(S\), whose \(i\)-th coordinate is 1 if \(z_i \in S\) and 0 otherwise.
Even though this linear approximation might seem too simplistic, datamodels often predict \(f(y; S)\) with remarkable accuracy. They work especially well for predicting counterfactuals, such as what happens when you remove a particular point \(z_i\) from the full dataset.
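To give a rough sense of the mechanics, here is a sketch of datamodel estimation (again my own toy illustration, not the pipeline of Ilyas et al., who fit a regularized regression over many parallel training runs). `train_and_eval` is a placeholder for the expensive step of training on a subset and measuring the output on a fixed test example; here it is faked with a noisy linear ground truth so the script runs end to end.

```python
import numpy as np

# Sketch of datamodel estimation: sample random subsets S, record the model
# output f(y; S) for each, and regress the outputs on the indicator vectors.

rng = np.random.default_rng(0)
N = 100                                   # training set size
true_effect = rng.normal(scale=0.1, size=N)

def train_and_eval(mask):
    # Placeholder for "train on the subset indicated by mask, evaluate on y".
    return 1.0 + mask @ true_effect + 0.01 * rng.normal()

num_subsets, alpha = 2000, 0.5            # alpha = subset sampling fraction
masks = (rng.random((num_subsets, N)) < alpha).astype(float)
outputs = np.array([train_and_eval(m) for m in masks])

# Fit g_beta(1_S) = beta_0 + sum_i beta_i [1_S]_i by least squares.
design = np.hstack([np.ones((num_subsets, 1)), masks])
beta, *_ = np.linalg.lstsq(design, outputs, rcond=None)
beta_0, beta_coeffs = beta[0], beta[1:]
print("largest estimated influences:", np.argsort(-np.abs(beta_coeffs))[:5])
```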
Let me now turn to an insight from recent work studying these approaches. What I see is that datamodel coefficients end up playing a role very similar to what continuous influence functions measure. Indeed, datamodel coefficients can be viewed as approximating the average effect of infinitesimally upweighting a point, where the average is taken over a randomized training process. Classical continuous influence functions, on the other hand, measure the infinitesimal effect of upweighting a point on the optimal parameters of one particular model, and this is precisely what causes their issues in modern deep learning settings.
More formally, we can define a randomized training function \(\Phi_{\ell}(r; w)\) that trains a model with random seed \(r\) on a dataset weighted by \(w \in \mathbb{R}^N\). That is, the empirical loss carries the weight vector \(w\): \[ \theta^\star \;\triangleq\; \arg\min_{\theta\in\mathbb{R}^D} \frac{1}{N}\sum_{i=1}^N w_i\, \ell(z_i, \theta). \] Then the statistical response measurement is:
\[ F_{\Phi}(w) = \mathbb{E}_{r} \bigl[ f\bigl(\Phi_{\ell}(r; w)\bigr)\bigr]. \]
Under this viewpoint, a datamodel amounts to learning coefficients \(\beta\) that best predict \(F_\Phi(1_S)\), i.e., the expected model output when training on subset \(S\).
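In code, \(F_\Phi(w)\) is just a Monte Carlo average over training seeds. The sketch below assumes a hypothetical `train(seed, w)` routine that minimizes the \(w\)-weighted empirical loss with seed-dependent randomness (initialization, batch order, and so on) and a measurement function `f` on the trained model; the toy example at the bottom exists only to make the snippet runnable.

```python
import numpy as np

# Sketch of estimating F_Phi(w) = E_r[ f(Phi_l(r; w)) ] by averaging over seeds.

def estimate_F(train, f, w, num_seeds=16):
    outputs = [f(train(seed, w)) for seed in range(num_seeds)]
    return float(np.mean(outputs))

# Toy example: the "model" is just the w-weighted mean of scalar data,
# perturbed by seed-dependent noise, and the measurement f is the identity.
data = np.linspace(-1.0, 1.0, 10)

def toy_train(seed, w):
    rng = np.random.default_rng(seed)
    return np.average(data, weights=w) + 0.01 * rng.normal()

w_full = np.ones_like(data)
print(estimate_F(toy_train, lambda theta: theta, w_full))
```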
In essence, one can show that the optimal datamodel coefficients are exactly the coefficients of the first-order Taylor expansion of the statistical response measurement function around the full-weight vector! Concretely, consider:
\[ \beta \;\triangleq\; \arg\min_{\beta \in \mathbb{R}^{N+1}} \; \mathbb{E}_{S \sim 2^N}\Bigl[\,\bigl(F_\Phi(1_S) \;-\;\beta_0 - \sum_{i \in S} \beta_i\bigr)^2\Bigr]. \]
Then, for each \(i \in [N]\), we have that
\[ \beta_i \;=\; \left.\frac{d F_\Phi(w)}{d w_i}\right|_{w=1}, \]
which is the partial derivative of \(F_\Phi\) with respect to the weight \(w_i\), evaluated at the full-weight vector \(w=1\). Informally, this shows that learning a datamodel is akin to taking a first-order expansion of the model's output around the "train on all data" point.
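One can sanity-check this correspondence numerically on a toy response function: fit a datamodel by least squares on random subsets and compare one coefficient against a finite-difference estimate of \(\partial F_\Phi / \partial w_i\) at \(w = 1\). The toy response below is cheap, deterministic, and nearly linear in \(w\), which is the regime where the finite-sample regression and the derivative should roughly agree; this is an illustration under those assumptions, not a proof.

```python
import numpy as np

# Numerical sanity check of beta_i ~ dF_Phi/dw_i at w = 1, using a cheap,
# deterministic toy response F so both quantities are easy to compute.

rng = np.random.default_rng(1)
N = 20
a = rng.normal(scale=0.1, size=N)

def F(w):
    # Toy "statistical response": smooth and nearly linear in the weights w.
    return np.tanh(w @ a)

# Datamodel coefficients via least squares on random 50% subsets.
masks = (rng.random((5000, N)) < 0.5).astype(float)
design = np.hstack([np.ones((len(masks), 1)), masks])
outputs = np.array([F(m) for m in masks])
beta = np.linalg.lstsq(design, outputs, rcond=None)[0][1:]

# Central finite-difference estimate of the partial derivative at w = 1.
i, eps = 3, 1e-4
w1, e_i = np.ones(N), np.eye(N)[3]
fd = (F(w1 + eps * e_i) - F(w1 - eps * e_i)) / (2 * eps)

print(f"beta_{i} = {beta[i]:+.5f}   dF/dw_{i} at w=1: {fd:+.5f}")
```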
The immediate corollary is that if we want to approximate leave-one-out predictions or other small perturbations of the training set, the linear datamodel is effectively doing a linear (i.e., first-order) approximation of those counterfactual outputs.
Therefore, using an optimal datamodel \[ \beta \quad=\quad \arg\min_{\beta\in\mathbb{R}^{N+1}}\,\mathbb{E}_{S \sim 2^N}\Bigl[\bigl(F_\Phi(1_S) \;-\;\beta_0 \;-\;\sum_{i\in S}\beta_i\bigr)^2\Bigr] \] to evaluate something like the leave-one-out counterfactual \(g_\beta(1_{D\setminus\{i\}})\) is, in essence, the same as evaluating the degree-1 Taylor polynomial \[ F_\Phi\bigl(1_{D\setminus\{i\}}\bigr) \;\approx\; F_\Phi(1_{D}) \;-\; \tfrac{1}{N}\,\left.\frac{dF_\Phi}{dw_i}\right|_{w=1}. \] This illuminates exactly how datamodeling applies the continuous influence technique, while effectively bypassing the issues surrounding overparameterization and non-convergence (and their effect on the invertibility of the Hessian).
Datamodels obtain much better empirical performance because they do away with many of the issues that arise when applying classical influence functions to deep learning. Instead of assuming first- and second-order optimality (as classical influence functions do), we simply average over retraining runs, capturing the statistical effect of changing a point's weight. This inherently accounts for imperfect training, such as non-convergence or ill-conditioned Hessians, and is more robust in practice.
This also suggests that modeling the counterfactual with respect to the average effect of re-training is often easier than capturing the effect on one specific model (especially in adversarial or pathological cases). By focusing on statistical averages, datamodels bypass many of the pitfalls that plague influence functions in modern, overparameterized regimes.