Paper: Predicting What You Already Know Helps: Provable Self-Supervised Learning
Why?
Previous work
Pretext tasks
- Reconstruct images from corrupted versions or from only a part of them: including denoising auto-encoders, image inpainting, and the split-brain autoencoder
- Use visual common sense: predicting rotation angle, relative patch position, recovering color channels, solving jigsaw puzzles, and discriminating images created from distortion
- Contrastive learning: learn representations that bring similar data points closer while pushing randomly selected points further away, or maximize a contrastive lower bound on the mutual information between different views (a minimal loss sketch follows this list)
- Create auxiliary tasks: the natural ordering or topology of data is also exploited in video-based, graph-based or map-based self-supervised learning. For instance, a pretext task can be to determine the correct temporal order of video frames.
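To make the contrastive objective in the list above concrete, here is a minimal NumPy sketch of an InfoNCE-style loss. The function name `info_nce_loss`, the temperature value, and the toy data are illustrative assumptions, not code from the paper or the cited works.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative sketch).

    anchors, positives: (n, d) arrays of representations; row i of
    `positives` is the augmented "view" paired with row i of `anchors`.
    All other rows act as negatives for anchor i.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    logits = a @ p.T / temperature               # (n, n) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The "correct" pair for anchor i is positive i (the diagonal).
    n = anchors.shape[0]
    return -log_probs[np.arange(n), np.arange(n)].mean()

# Toy usage: random 128-dim representations for a batch of 32 pairs.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 128))
loss = info_nce_loss(x, x + 0.01 * rng.normal(size=x.shape))
print(f"contrastive loss: {loss:.3f}")
```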
Theory for self-supervised learning: contrastive learning
- Contrastive learning may not work when conditional independence holds only with additional latent variables
Theory | Limitations |
---|---|
Shows guarantees for contrastive learning representations on linear classification tasks, using a class conditional independence assumption | Does not handle approximate conditional independence |
Contrastive learning representations can linearly recover any continuous function of the underlying topic posterior, under a topic modeling assumption for text | The assumption of independent sampling of words it exploits is strong and does not generalize to other domains such as images |
Studies contrastive learning on the hypersphere through intuitive properties such as alignment and uniformity of representations | No connection made to downstream tasks |
A mutual information maximization view of contrastive learning | Some issues pointed out by paper [45] |
Explains negative-sampling-based methods using the theory of noise contrastive estimation | Guarantees are only asymptotic and not for downstream tasks |
Conditional independence and redundancy assumptions on multiple views are used to analyze co-training | Not for downstream tasks |
Summary
- Observations
- Forming the pretext tasks:
- Colorization: can be interpreted as \(p(X_1,X_2|Y)=p(X_1|Y)\times p(X_2|Y)\), i.e., \(X_1\) and \(X_2\) are conditionally independent given \(Y\)
- Inpainting: \(p(X_1,X_2|Y,Z)=p(X_1|Y,Z)\times p(X_2|Y,Z)\), i.e., the inpainted patch \(X_2\) is conditionally independent of \(X_1\) (the remainder of the image) given \(Y,Z\)
- The only way to solve the pretext task is to first implicitly predict \(Y\) and then predict \(X_2\) from \(Y\) (see the worked derivation after this list)
- Limitations:
- The underlying principles of self-supervised learning are still mysterious, since it is a priori unclear why predicting what we already know should help.
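A worked sketch of why CI makes the pretext task useful, assuming a discrete label \(Y\) and exact CI (a simplified version of the paper's argument, not its full proof): the optimal pretext predictor satisfies

\[
\mathbb{E}[X_2|X_1=x_1]
=\sum_{y}P(Y=y|X_1=x_1)\,\mathbb{E}[X_2|Y=y,X_1=x_1]
\overset{\text{CI}}{=}\sum_{y}P(Y=y|X_1=x_1)\,\mathbb{E}[X_2|Y=y]
=A\,p(\cdot|x_1),
\]

where the columns of \(A\) are the class-conditional means \(\mathbb{E}[X_2|Y=y]\) and \(p(\cdot|x_1)\) is the label posterior. If \(A\) has full column rank, the posterior, and hence \(\mathbb{E}[Y|X_1]\), is a linear function of the pretext-optimal representation, which is why a linear layer on top of it suffices for the downstream task.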
Goals
What conceptual connection between pretext and downstream tasks ensures good representations?
What is a good way to quantify this?
How?
Notations
Symbol | Meaning |
---|---|
\(\mathbb{E}^L[Y|X]\) | the best linear predictor of \(Y\) given \(X\) |
\(\Sigma_{XY|Z}\) | partial covariance matrix between \(X\) and \(Y\) given \(Z\) |
\(X_1,X_2\) | the input variable and the target random variable for the pretext tasks |
\(Y\) | label for the downstream task |
\(P_{X_1X_2Y}\) | the joint distribution over \(\mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{Y}\) |
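For reference, the two less standard symbols above can be read with their usual definitions (stated here as a sketch; the paper's exact conventions on centering and label encoding may differ):

\[
\mathbb{E}^L[Y|X]=\mathbb{E}[Y]+\Sigma_{YX}\Sigma_{XX}^{-1}\big(X-\mathbb{E}[X]\big),
\qquad
\Sigma_{XY|Z}=\Sigma_{XY}-\Sigma_{XZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}.
\]

When \(Y\) is encoded so that \(\mathbb{E}[X|Y]\) is linear in \(Y\) (e.g., one-hot labels), exact conditional independence of \(X_1\) and \(X_2\) given \(Y\) makes \(\Sigma_{X_1X_2|Y}\) vanish, so the norm of this partial covariance matrix quantifies how far the data is from exact CI.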
Idea
- Under approximate conditional independence (CI), quantified by the norm of a certain partial covariance matrix, show sample-complexity improvements similar to those under exact CI (see the sketch after this list).
- Verify empirically that the pretext task helps when CI is approximately satisfied, in the text domain.
- Demonstrate on a real-world image dataset that a pretext task-based linear model outperforms or is comparable to many baselines.
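To make the "norm of a partial covariance matrix" measure concrete, below is a minimal NumPy sketch that estimates \(\Sigma_{X_1X_2|Y}\) from samples and reports its spectral norm. The helper `partial_covariance` and the toy data generator are illustrative assumptions, not the paper's code.

```python
import numpy as np

def partial_covariance(x1, x2, y):
    """Estimate the partial covariance Sigma_{X1 X2 | Y} (illustrative sketch).

    Linearly regress Y out of X1 and X2, then take the cross-covariance
    of the residuals. x1: (n, d1), x2: (n, d2), y: (n, k) (e.g. one-hot).
    """
    def residual(x, y):
        # Center, then remove the best linear prediction from y.
        xc = x - x.mean(axis=0)
        yc = y - y.mean(axis=0)
        beta, *_ = np.linalg.lstsq(yc, xc, rcond=None)
        return xc - yc @ beta

    r1, r2 = residual(x1, y), residual(x2, y)
    return r1.T @ r2 / len(y)

# Toy example: X1 and X2 are noisy views of a shared 3-class label Y,
# so the partial covariance given Y should be close to zero.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=2000)
y = np.eye(3)[labels]                      # one-hot encoding
means = rng.normal(size=(3, 5))
x1 = means[labels] + 0.5 * rng.normal(size=(2000, 5))
x2 = means[labels] + 0.5 * rng.normal(size=(2000, 5))

sigma = partial_covariance(x1, x2, y)
print("spectral norm of partial covariance:", np.linalg.norm(sigma, 2))
```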
Formalize SSL with pretext task
The downstream performance of the learned representation is analyzed via two terms:
- approximation error: how well the optimal predictor for the downstream task, \(f^*=\mathbb{E}[Y|X_1]\), can be approximated by a linear function of the pretext-task representation
- estimation error: the statistical error from fitting that linear function with finitely many labeled samples; the gain over predicting \(Y\) directly from \(X_1\) comes from this term shrinking drastically (see the sketch below)
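A minimal end-to-end sketch of the two-stage procedure being formalized: first fit a representation \(\psi\) that predicts \(X_2\) from \(X_1\) (here plain linear regression stands in for the pretext model), then train a linear layer from \(\psi(X_1)\) to \(Y\) on a small labeled set. The synthetic data, sample sizes, and variable names are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic data: X1 and X2 are conditionally independent given the label Y.
n, d1, d2, k = 5000, 20, 20, 3
labels = rng.integers(0, k, size=n)
mu1, mu2 = rng.normal(size=(k, d1)), rng.normal(size=(k, d2))
x1 = mu1[labels] + rng.normal(size=(n, d1))
x2 = mu2[labels] + rng.normal(size=(n, d2))
y = np.eye(k)[labels]

# --- Stage 1 (pretext, no labels needed): learn psi(X1) ~ E[X2 | X1].
w_pretext, *_ = np.linalg.lstsq(x1, x2, rcond=None)
psi = x1 @ w_pretext                    # learned representation

# --- Stage 2 (downstream, few labels): linear layer from psi(X1) to Y.
n_labeled = 100                         # drastically fewer labeled samples
w_down, *_ = np.linalg.lstsq(psi[:n_labeled], y[:n_labeled], rcond=None)

# --- Evaluate on the held-out remainder of the synthetic data.
pred = (psi[n_labeled:] @ w_down).argmax(axis=1)
print("downstream accuracy with a linear head on psi:",
      (pred == labels[n_labeled:]).mean())
```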
Experiments
Conclusion
- This paper posits a mechanism based on conditional independence to formalize how solving certain pretext tasks can learn representations that provably decrease the sample complexity of downstream supervised tasks.
- It quantifies how approximate independence between the components of the pretext task (conditional on the label and latent variables) allows learning representations that solve the downstream task with drastically reduced sample complexity, by just training a linear layer on top of the learned representation.