Highlights of NeurIPS 2023

2023/12/12

The scale of NeurIPS 2023 is hard to overstate: 3,584 accepted papers, many workshops, and even more attendees made sifting through everything impossible. Still, it was interesting to see that some of the biology applications were consistently among the most visited posters across the conference! On this topic, please check out my colleagues’ great paper, repository, and website, where they present ProteinShake, an easy-to-use collection of protein structure datasets that works regardless of the modality you want to build a model on: be it point clouds, voxels, or graphs. v1 is coming out soon™, and I’m excited to contribute some of my research to it as well.

Now, here are some papers that caught my eye.

Feature Likelihood Score

Jiralerspong et al. ArXiV.

The authors tackle the task of crafting metrics that capture the quality of generated samples in the image domain. They introduce the feature likelihood score (FLS), a parametric score that combines kernel density estimation with image features obtained from a vision model. The authors sell this metric as satisfying the evaluation “trichotomy”, i.e. detecting fidelity, diversity and novelty. See below:

fig1_2

Computing the FLS involves the following steps illustrated below:

fig3

  1. Map the image to a feature space. Using Inception v3, CLIP or DINO v2 for instance, this reduces dimensionality and makes each dimension more meaningful. The same procedure is used in e.g. FID.

  2. Akin to kernel density estimation, model the density of the generated distribution with a mixture of isotropic Gaussians, one centered on each generated sample’s features. The likelihood of a new point is then the mean of its likelihoods under all Gaussians in the mixture.

  3. Use the training set to estimate the bandwidths. Instead of deriving the bandwidths statistically or minimizing a loss through cross-validation, a subset of the training set is sampled and the bandwidths are estimated by minimizing the negative log-likelihood (NLL) of this subset.

  4. Evaluate the mixture-of-Gaussians density on the test set. This yields the negative log-likelihood of $ \mathcal{D}_{\text{test}} $ under $ p_{\hat{\sigma}}(\mathbf{x}) $. This step matters because, if the generated samples are very similar to the training set, as with the simulated k approximate copies in the figure below (taken directly from Jiralerspong et al.), the NLL would be abnormally low compared to that of test samples, which would have a higher NLL. Evaluating on the test set is therefore a succinct way of checking the generalization of a generative model, something FID and IS usually overlook. (A code sketch of the full procedure follows the FLS formula below.)

fig4

The general formulation of the FLS score is given by: $$ \text{FLS}(\mathcal{D}_{\text{test}}, \mathcal{D}_{\text{gen}}) := -\frac{100}{d}\log p_{\hat{\sigma}} (\mathcal{D}_{\text{test}}|\mathcal{D}_{\text{gen}}) - C$$ $ C $ can be set to zero for an ideal generator (i.e. a generator that samples images from the true data distribution). In practice, you can estimate $C$ by splitting the training set in two: one part to stand in for a perfect $\mathcal{D}_{\text{gen}}$ and the other to compute $\hat{\sigma}$. Note that $C$ does not change the relative score between models.
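To make this concrete, here is a minimal NumPy sketch of steps 1–4 and the FLS formula above. It assumes features have already been extracted with a frozen backbone (Inception v3, CLIP, DINO v2, …), and it fits a single shared bandwidth by grid search rather than the per-Gaussian bandwidths the authors optimize by gradient descent; the function names and simplifications are mine, not the official implementation.

```python
import numpy as np
from scipy.special import logsumexp

def mog_loglik(x_feats, centers, log_sigma2):
    """Mean log-likelihood of each row of x_feats under a mixture of isotropic
    Gaussians centered on `centers`, all sharing variance exp(log_sigma2)."""
    d = centers.shape[1]
    # (n_x, n_centers) matrix of squared distances; fine for a sketch, memory-hungry at scale
    sq_dists = ((x_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    log_probs = -0.5 * (sq_dists / np.exp(log_sigma2) + d * (np.log(2 * np.pi) + log_sigma2))
    return logsumexp(log_probs, axis=1) - np.log(len(centers))  # uniform mixture weights

def fls(feat_test, feat_gen, feat_train, candidate_log_sigma2=np.linspace(-8.0, 2.0, 50)):
    """Feature Likelihood Score, up to the constant C.
    feat_* are (n, d) arrays of frozen-backbone features."""
    d = feat_gen.shape[1]
    # Step 3: pick the bandwidth that minimizes the NLL of a training subset.
    nlls = [-mog_loglik(feat_train, feat_gen, ls).mean() for ls in candidate_log_sigma2]
    best = candidate_log_sigma2[int(np.argmin(nlls))]
    # Step 4: evaluate the resulting density on the held-out test set.
    test_nll = -mog_loglik(feat_test, feat_gen, best).mean()
    return 100.0 / d * test_nll  # subtract an estimate of C to calibrate across feature spaces
```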

Higher FLS values can be due to:

  1. Poor sample fidelity (the Gaussian centers are far from the test set, leading to a higher NLL).
  2. Insufficient coverage of the data manifold, which leaves some test samples with a high NLL (mode dropping).
  3. Overfitting to the training set, where the MoG density fits near-copies of training samples and generalizes poorly, yielding a high NLL on the test set (memorization).

Specific failure modes of a generative model can be captured with FLS. For instance, we can define $\delta$-memorization of a given generated sample $x_j^{\text{gen}}$ (i.e. detect whether it is an approximate copy of a training sample) using: $$ \mathcal{O}_j:=\max_{i}\mathcal{N}(\varphi(x_i^{\text{train}})|\varphi(x_j^{\text{gen}});\hat{\sigma}_j^2I)>\delta $$

and overfitting using the generalization gap: $$ \text{FLS}(\mathcal{D}_{\text{train}}, \mathcal{D}_{\text{gen}}) - \text{FLS}(\mathcal{D}_{\text{test}}, \mathcal{D}_{\text{gen}}) < 0$$
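Both diagnostics reuse the same Gaussian machinery; the helpers below are my own shorthand built on the `mog_loglik` and `fls` functions sketched earlier, with `delta` a user-chosen threshold and `feat_fit` the training subset used to fit the bandwidths.

```python
import numpy as np

def is_memorized(feat_gen_j, feat_train, sigma2_j, delta):
    """delta-memorization check for one generated sample: does any training
    feature have density above delta under the Gaussian centered on the
    generated sample's features with its fitted bandwidth sigma2_j?"""
    d = feat_train.shape[1]
    sq = ((feat_train - feat_gen_j) ** 2).sum(axis=-1)
    log_dens = -0.5 * (sq / sigma2_j + d * np.log(2 * np.pi * sigma2_j))
    return np.exp(log_dens.max()) > delta

def generalization_gap(feat_train_eval, feat_test, feat_gen, feat_fit):
    """FLS(train, gen) - FLS(test, gen); a negative gap flags overfitting."""
    return (fls(feat_train_eval, feat_gen, feat_fit)
            - fls(feat_test, feat_gen, feat_fit))
```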

Individual sample fidelity can also be measured by centering the MoG on the test set, fitting the bandwidths on the training set, and then computing the likelihood of each generated sample as a measure of sample quality, i.e. by computing:

$$ \mathcal{Q_j} = \mathcal{N}(\varphi(\mathbf{x}_j^{\text{gen}})| \varphi(\mathcal{D_{\text{test}}}); \Sigma) $$
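This is the same computation as before with the roles of the test and generated features swapped; a sketch reusing the `mog_loglik` helper above (passing a single log-variance is a simplification of the paper's per-center bandwidths):

```python
def sample_quality(feat_gen, feat_test, log_sigma2):
    """Q_j: likelihood of each generated sample under a MoG centered on the
    test features, with bandwidths fitted on the training set in the paper."""
    return np.exp(mog_loglik(feat_gen, feat_test, log_sigma2))
```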

In their experiments, the authors seek to correlate FLS with sample fidelity, diversity and novelty.

The first set of experiments assesses sample fidelity: small transformations applied to the generated samples cause dramatic changes in FID but barely affect FLS, while FLS does change substantially when a large transformation is applied to the generated images.

fig5

To assess sample diversity, two cases are covered. First, an increasing number of classes is included in the generated samples; then, additional examples of the same classes are added to assess the ability of FLS to capture coverage of the data manifold. We expect FLS to decrease and increase, respectively, which is indeed what is observed.

fig6

To assess the ability of FLS to capture sample novelty, the generalization gap defined above is computed on progressively perturbed sets of generated images. This is where FLS shines, with consistently decreasing values as perturbations are applied, and values that start and stay below the overfitting line, as desired.

fig7

Interestingly, the authors also investigated the behaviour of FLS as it could be used in practice, i.e. to detect multiple failure modes simultaneously, checking that interaction effects would not leave FLS in a reasonable range despite compounding failures. I don’t think I can state it more clearly than the authors: “[f]or [assessing] fidelity, [they] use an increasing blur intensity. For diversity, [they] take a subset of the generated samples and replace the rest of the sample with copies of this subset. For novelty, [they] replace generated samples with copies of the training set.” The results are below and show little interaction between the perturbations, save perhaps a bit at extreme values of diversity and fidelity, and likewise for extreme values of diversity and novelty.

fig8

Great paper overall; the only caveat is that while good choices of $\varphi$ are well-established in computer vision, such featurizers still need to be standardized in other fields. The appendix investigates alternative featurizers, and such studies will need to take center stage for applications in other domains. Still, the framework as presented is very strong and promises to alleviate a substantial bottleneck in generative modeling today.

Efficient Learning of Linear Graph Neural Networks via Node Subsampling

Shin et al. OpenReview.

This paper considers the graph-weighted linear regression problem, which can be formulated as follows:

$$ \min_{\mathbf{w}}\frac{1}{2n} \left\Vert \mathbf{y} - AX \mathbf{w} \right\Vert^2_2 $$

where $A\in\mathbb{R}^{n\times n}$ is the adjacency matrix and $X\in \mathbb{R}^{n\times d}$ is the node feature matrix. They claim that a node sampling strategy based on leverage scores makes the computation of $AX$ dramatically more efficient: only $O(nd\varepsilon^{-1}\log n)$ entries of $A$ need to be observed, in time $O(nd^2\varepsilon^{-2}\log n)$, with the guarantee that the learned weights deviate by at most $\varepsilon$ in the $\ell_2$ norm from the model learned using the entire adjacency matrix $A$.
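For reference, the exact baseline simply materializes $AX$ (which requires reading every entry of $A$) and solves ordinary least squares; this is the computation the paper wants to avoid. A minimal sketch:

```python
import numpy as np

def exact_graph_regression(A, X, y):
    """Solve min_w (1/2n) ||y - A X w||^2 by materializing AX."""
    AX = A @ X                                    # O(n^2 d): touches all of A
    w, *_ = np.linalg.lstsq(AX, y, rcond=None)    # least-squares solution
    return w
```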

fig1_ih

More concretely, the speedups are achieved using the following steps:

  1. Estimation of leverage scores (in a uniform or data-dependent manner)

    The leverage score of the $i$th row of $X$ is given by $ \ell_i(X) :=\mathbf{x}_i^T(X^TX)^{\dagger}\mathbf{x}_i $, where $(\cdot)^{\dagger}$ is the Moore-Penrose pseudoinverse. The paper’s intuition is that the leverage score measures how important row $i$ is in composing the row space of $X$: if a row has a component orthogonal to all other rows, its leverage score is 1, and removing it would decrease the rank of $X$, completely changing its row space. The leverage scores are the diagonal entries of the projection matrix $X(X^TX)^{\dagger}X^T$, which can be used to show that $\sum_{i=1}^{n}\ell_i(X)=\text{rank}(X)$. The authors then describe ways to build an estimate of $AX$, denoted $\widehat{AX}$, by keeping each node $j$ according to a Bernoulli trial with probability $ p_j $. In the data-dependent case, $ p_j = \min \left[ B \cdot \frac{\lVert A_{:j}\rVert \lVert X_{j:} \rVert}{\sum_j\lVert A_{:j} \rVert\lVert X_{j:} \rVert} , 1 \right] $. Intuitively, this probability reflects each node’s importance in terms of its contribution to the product, i.e. the outer product of its adjacency-matrix column with its feature row. The authors prove that this estimate $\widehat{AX}$ comes with an approximation guarantee with respect to $AX$. As a result, $\widehat{AX}$ can be used to produce provably good overestimates (i.e. upper bounds) of the leverage scores of the augmented matrix $[AX|-\mathbf{y}]$.

  2. Perform leverage score sampling to obtain $\tilde{A}X$

    The overestimates obtained in the previous step can be used to get a matrix $\tilde{A}$ and scaled labels $ \tilde{\mathbf{y}} $ consisting of a reduced number of rows of $A$ and $\mathbf{y}$ respectively.

  3. Solve the regression problem on the sampled $\tilde{A}X$

    The regression problem is solved in this step using $ \tilde{A}X $, which the authors show is a spectral approximation of $AX$. The whole algorithm therefore never computes the full $AX$, and the end-to-end procedure has only an $O(n\log n)$ dependence on $n$. A sketch of the end-to-end procedure follows this list.
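Here is a rough NumPy sketch of the three steps. The sampling probabilities, rescaling constants, and budget parameters (`budget_cols`, `budget_rows`) are simplified stand-ins for the quantities in the paper’s algorithm, and the final solve reuses the estimated rows of $\widehat{AX}$ rather than following the paper’s exact procedure, so treat it as an illustration of the idea rather than a faithful reimplementation.

```python
import numpy as np

def approx_graph_regression(A, X, y, budget_cols=2000, budget_rows=2000, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Step 1a: estimate AX by importance-sampling nodes j, i.e. outer products
    # A[:, j] X[j, :], with probabilities proportional to column/row norms.
    col_w = np.linalg.norm(A, axis=0) * np.linalg.norm(X, axis=1)
    p = np.minimum(budget_cols * col_w / col_w.sum(), 1.0)
    keep = rng.random(n) < p
    AX_hat = (A[:, keep] / p[keep]) @ X[keep]     # unbiased estimate of AX

    # Step 1b: leverage-score (over)estimates of the augmented matrix [AX_hat | -y].
    M = np.hstack([AX_hat, -y[:, None]])
    Q, _ = np.linalg.qr(M)                        # row norms of Q give leverage scores
    lev = (Q ** 2).sum(axis=1)

    # Step 2: sample rows proportionally to the leverage scores and rescale them.
    q = np.minimum(budget_rows * lev / lev.sum(), 1.0)
    rows = rng.random(n) < q
    scale = 1.0 / np.sqrt(q[rows])

    # Step 3: solve the reduced regression on the rescaled subsample.
    w, *_ = np.linalg.lstsq(scale[:, None] * AX_hat[rows], scale * y[rows], rcond=None)
    return w
```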

The speedups and the performance at given budgets $B$, relative to using the full matrix, can be seen below:

fig1_ih

fig1_ih

The authors also discuss applications of this procedure to multi-layer and non-linear GCNs.

Overall, an interesting technical contribution; I am curious whether leverage scores can be adopted in other MPNNs, though.

ImageNet-Hard

Taesiri et al. ArXiV.

The premise the authors start from is that image classification models are inherently information-discarding machines. They hypothesize that zooming into the image and framing it properly leads to the correct classification of 98.91% of images by AlexNet. An overview of the approach is shown below.

fig1_ih
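As a rough illustration of the zoom-and-classify idea (not the authors’ code), here is a sketch that checks whether any crop from a simple sliding grid is classified correctly by a pretrained torchvision AlexNet. The crop grid, strides, and the `any_crop_correct` helper are mine; the paper uses a fixed set of 334 crops and more careful framing, and `label` is assumed to be an ImageNet class index.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.models import alexnet, AlexNet_Weights

weights = AlexNet_Weights.IMAGENET1K_V1
model = alexnet(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def any_crop_correct(img, label, crop_fracs=(1.0, 0.8, 0.6, 0.4), stride_frac=0.2):
    """Return True if at least one zoomed crop of `img` (a PIL image) is
    classified as `label`. The sliding grid below is illustrative only."""
    w, h = img.size
    for frac in crop_fracs:
        cw, ch = int(w * frac), int(h * frac)
        step_w, step_h = max(int(w * stride_frac), 1), max(int(h * stride_frac), 1)
        for top in range(0, h - ch + 1, step_h):
            for left in range(0, w - cw + 1, step_w):
                crop = TF.crop(img, top, left, ch, cw)
                logits = model(preprocess(crop).unsqueeze(0))
                if logits.argmax(dim=1).item() == label:
                    return True
    return False
```

Flipping the criterion (no crop is classified correctly) is essentially the filter used to assemble ImageNet-Hard, discussed below.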

They then discovered a bias in ObjectNet and ImageNet-A: both exhibit a strong centrality bias, i.e. zooming in on the center of the image yields the best classification results.

fig3_ih

When they force models to zoom in on particular patches of the image, they find they can beat MEMO, the current SOTA test-time image augmentation method.

Most interesting, however, is that they constructed ImageNet-Hard, a collection of images that CLIP-ViT-L/14 fails to classify correctly on any of the 334 crops of images drawn from IN-V2, IN-ReaL, IN-A, IN-R, IN-S, and ON. After manual curation of the labels, a total of 10,980 images remain. A version upsampled to 4K using GigaGAN is also synthesized. Some samples of the resulting dataset can be seen below.

fig5_ih

It’s an interesting case study in data curation and analysis as it pertains to ML. Similar studies should be done in other domains to identify difficult-to-classify samples. In tandem with models that produce confidence estimates, this would allow practitioners to evaluate models specifically on difficult datasets.