Paper: Colorful Image Colorization
Why?
Previous work
Predicting colors in a free-form way: take the image's \(L\) channel as input and its \(ab\) channels as the supervisory signal. However, the results tend to look desaturated; one explanation is the use of loss functions that encourage conservative predictions.
Non-parametric methods: given an input grayscale image, first define one or more color reference images. Then, transfer colors onto the input image from analogous regions of the reference image(s).
Parametric methods: learn prediction functions from large datasets of color images at training time, posing the problem as either regression onto continuous color space or classification of quantized color values. The work in this paper is also posed as a classification task.
Concurrent work on colorization
Paper | Loss | CNN | Dataset |
---|---|---|---|
Larsson et al. | un-rebalanced classification loss | hypercolumns on a VGG network | ImageNet |
Iizuka et al. | regression loss | two-stream architecture fusing global and local features | Places |
This paper | classification loss with rebalanced rare classes | single-stream, VGG-styled network with added depth and dilated convolutions | ImageNet |
Summary
- Observations
- Even in grayscale images, the semantics of the scene and its surface texture provide ample cues for the colors of many regions in each image
- Color prediction is inherently multimodal (many objects can plausibly take on several distinct colors), which motivates a loss tailored to this property
- Limitations:
- The L2 loss only cares about Euclidean distance: if an object can take on a set of distinct \(ab\) values, the optimal solution to the Euclidean loss is the mean of the set. In color prediction, this averaging effect favors grayish, desaturated results. Additionally, if the set of plausible colorizations is non-convex, the solution will in fact lie outside the set, giving implausible results (see the toy example below).
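A toy numeric illustration of this averaging effect; the two \(ab\) modes are hypothetical values chosen for clarity:

```python
import numpy as np

# If an object is plausibly either strongly red-ish or strongly blue-ish
# in ab space (hypothetical modes), the L2-optimal single prediction is
# the mean of the modes: a desaturated gray outside the plausible set.
plausible_ab = np.array([[60.0, -60.0], [-60.0, 60.0]])  # two plausible modes
l2_optimum = plausible_ab.mean(axis=0)  # minimizer of the sum of squared distances
print(l2_optimum)  # [0. 0.] -> (a, b) = (0, 0), i.e., gray
```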
Goals
Design a colorization-based pretext task that yields good semantic image representations: produce a plausible colorization that could potentially fool a human observer.
How?
Idea
- Produce vibrant colorization: Predict a distribution of possible colors for each pixel. Then, re-weight the loss at training time to emphasize rare colors. This encourages the model to exploit the full diversity of the large-scale data on which it is trained. Lastly, produce a final colorization by taking the annealed mean of the distribution.
- Evaluate synthesized images: set up a “colorization Turing test”.
Data Preparation
- Quantize the \(ab\) output space into bins with grid size \(10\) and keep the \(Q = 313\) values that are in-gamut; this gives the label \(Z\) for each pixel. Formally, denoting the raw label as \(Y\), \(Z = H^{-1}_{gt}(Y)\), where \(H^{-1}_{gt}\) converts the ground-truth color \(Y\) to the vector \(Z\).
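A minimal sketch of this encoding, assuming a precomputed `ab_bins` array of the \(Q = 313\) in-gamut bin centers. Note the paper actually soft-encodes each pixel over several nearest bins; this sketch uses a hard nearest-bin assignment for simplicity:

```python
import numpy as np

def encode_ab(Y, ab_bins):
    """Convert ground-truth ab values Y (N, 2) to labels Z (N, Q).

    ab_bins: (Q, 2) array of in-gamut bin centers (grid size 10, Q = 313).
    Hard nearest-bin assignment; the paper uses a soft encoding instead.
    """
    # Squared distance from every pixel to every bin center: (N, Q)
    d2 = ((Y[:, None, :] - ab_bins[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)                   # index of the closest bin
    Z = np.zeros((Y.shape[0], ab_bins.shape[0]))
    Z[np.arange(Y.shape[0]), nearest] = 1.0       # one-hot label
    return Z
```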
Implementation
Model
Loss function
\[
\mathcal{L}_{cl}(\hat{Z}, Z) = -\sum_{h,w} v(Z_{h,w}) \sum_q Z_{h,w,q} \log \hat{Z}_{h,w,q}
\]
where \(v(\cdot)\) is a weighting term that can be used to re-balance the loss based on color-class rarity.
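A minimal sketch of this loss, assuming `Z_hat` holds the predicted per-pixel distributions, `Z` the encoded labels, and `w` the per-bin rebalancing weights defined in the next subsection:

```python
import numpy as np

def rebalanced_cross_entropy(Z_hat, Z, w, eps=1e-10):
    """Class-rebalanced multinomial cross-entropy.

    Z_hat: (H, W, Q) predicted probabilities, Z: (H, W, Q) encoded labels,
    w: (Q,) rebalancing weights (w = 1 recovers the plain classification loss).
    """
    v = (Z * w).sum(axis=-1)                      # v(Z_{h,w}): weight of each pixel's label bin
    ce = -(Z * np.log(Z_hat + eps)).sum(axis=-1)  # per-pixel cross-entropy
    return (v * ce).sum()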
Re-balancing
The distribution of \(ab\) values in natural images is strongly biased towards desaturated values with low \(ab\) magnitude.
Each pixel is weighted by \(v(Z_{h,w}) = w_{q^*}\), where \(q^* = \arg\max_q Z_{h,w,q}\), and
\[
\mathbf{w} \propto \left( (1-\lambda)\,\tilde{\mathbf{p}} + \frac{\lambda}{Q} \right)^{-1}, \qquad \mathbb{E}_{\tilde{\mathbf{p}}}[\mathbf{w}] = \sum_q \tilde{p}_q w_q = 1,
\]
where \(\tilde{p} \in \Delta^Q\) is the empirical probability of colors in the quantized \(ab\) space, estimated from the full ImageNet training set and smoothed with a Gaussian kernel \(G_{\sigma}\). \(\lambda = \frac{1}{2}\) and \(\sigma = 5\) worked well.
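A minimal sketch of these weights, assuming `p_tilde` is the smoothed empirical distribution over the \(Q\) bins:

```python
import numpy as np

def rebalance_weights(p_tilde, lam=0.5):
    """Rebalancing weights w from the smoothed empirical distribution p_tilde (Q,)."""
    Q = p_tilde.shape[0]
    w = 1.0 / ((1.0 - lam) * p_tilde + lam / Q)  # mix with uniform, then invert
    return w / (p_tilde * w).sum()               # normalize so E_{p_tilde}[w] = 1
```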
Inferring point estimates
Map the predicted probability distribution \(\hat{Z}\) to color values \(\hat{Y}\) with a function \(\hat{Y} = H(\hat{Z})\).
They interpolate by re-adjusting the temperature \(T\) of the softmax distribution, and taking the mean of the result. Lowering the temperature \(T\) produces a more strongly peaked distribution, and setting \(T\rightarrow 0\) results in a 1-hot encoding at the distribution mode. They find that \(T=0.38\) captures the vibrancy of the mode while maintaining the spatial coherence of the mean.
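A minimal sketch of \(H\) as the annealed mean, assuming `Z_hat` is the predicted distribution and `ab_bins` the \(Q\) bin centers from the quantization step:

```python
import numpy as np

def annealed_mean(Z_hat, ab_bins, T=0.38, eps=1e-10):
    """Decode per-pixel distributions Z_hat (H, W, Q) to ab values (H, W, 2).

    Re-adjusts the softmax temperature, f_T(z) ∝ exp(log(z) / T), then takes
    the expectation over the bin centers ab_bins (Q, 2).
    """
    logits = np.log(Z_hat + eps) / T
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    z = np.exp(logits)
    z /= z.sum(axis=-1, keepdims=True)             # renormalized distribution
    return z @ ab_bins                             # expectation -> (H, W, 2)
```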
Experiments
- Dataset: ImageNet
- Base models
Model Name | Loss | Train |
---|---|---|
Ours (full) | classification loss with rebalancing | from scratch with k-means initialization, ADAM solver for about 450k iterations; \(\beta_1 = 0.9\), \(\beta_2 = 0.99\), weight decay \(10^{-3}\). Initial learning rate was \(3 \times 10^{-5}\), dropped to \(10^{-5}\) and \(3 \times 10^{-6}\) when the loss plateaued, at 200k and 375k iterations, respectively |
Ours (class) | classification loss without rebalancing (\(\lambda = 1\)) | same training protocol as Ours (full) |
Ours (L2) | L2 regression loss | same training protocol |
Ours (L2, ft) | L2 regression loss | fine-tuned from the full classification-with-rebalancing network |
Larsson et al. | un-rebalanced classification loss | hypercolumns on a VGG network |
Dahl | L2 regression loss | a Laplacian pyramid on VGG features |
Gray | -- | every pixel is gray, with \((a, b) = 0\) |
Random | -- | copies the colors from a random image in the training set |
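For reference, the step learning-rate schedule described for Ours (full) boils down to a small helper (iteration thresholds as given in the table):

```python
def learning_rate(iteration):
    """Step schedule for Ours (full): 3e-5, dropped at 200k and 375k iterations."""
    if iteration < 200_000:
        return 3e-5
    if iteration < 375_000:
        return 1e-5
    return 3e-6
```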
Colorization quality
- AMT "colorization Turing test": participants are shown a real photo and the re-colorized version and asked to pick the fake; the results confirm that the colorizations are often convincing. The authors argue their method can even produce a more prototypical appearance than ground-truth photos that are poorly white balanced.
- Semantic interpretability (VGG classification): are the colorizations realistic enough to be interpretable to an off-the-shelf object classifier? They check this by feeding the fake colorized images to a VGG network that was trained to predict ImageNet classes from real color photos.
- The result is \(3.4\%\) lower than Larsson et al.'s.
- Without any additional training or fine-tuning, one can improve performance on grayscale image classification simply by colorizing images with their algorithm and passing them to an off-the-shelf classifier.
- Raw accuracy (AuC):
- The L2 loss can achieve accurate colorizations, but has difficulty when optimized from scratch
- Class rebalancing in the training objective achieves its desired effect of improving results on underrepresented colors
- Compared with others
- LEARCH: on the SUN dataset, the authors achieve \(17.2\%\) on the AMT task while LEARCH achieves \(9.8\%\)
Cross-channel encoding as self-supervised feature learning
- Datasets: ImageNet; PASCAL (fine-tuned after training on ImageNet)
- Backbone: AlexNet
- Settings
- ImageNet: freeze the feature extractor and retrain only the classifier (a softmax layer) on ImageNet labels
- PASCAL: (1) keep the input grayscale by disregarding color information (Ours (gray)), and (2) modify conv1 to receive a full 3-channel \(Lab\) input, initializing the weights on the \(ab\) channels to zero (Ours (color)); a sketch of this conv1 modification follows the Summary below.
- Summary
- For ImageNet, there is a \(6\%\) performance gap between color and grayscale inputs. Except for the first layer, representations from the deeper layers catch up with and outperform most methods, indicating that solving the colorization task encourages representations that linearly separate semantic classes in the training data distribution
- On PASCAL, when conv1 is frozen, the network is effectively only able to interpret grayscale images.
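A minimal sketch of the conv1 modification used for Ours (color), referenced in the Settings above, assuming weights in a (out_channels, in_channels, kH, kW) layout; the helper name is illustrative:

```python
import numpy as np

def expand_conv1_to_lab(w_gray):
    """Expand a grayscale conv1 kernel (out_c, 1, kH, kW) to Lab input (out_c, 3, kH, kW).

    The L-channel weights are copied; the new ab-channel weights start at zero,
    so the network initially behaves exactly as it did on grayscale input.
    """
    out_c, _, kh, kw = w_gray.shape
    w_lab = np.zeros((out_c, 3, kh, kw), dtype=w_gray.dtype)
    w_lab[:, 0] = w_gray[:, 0]   # channel 0 = L; channels 1, 2 = a, b stay zero
    return w_lab
```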
Properties of the network
Is it exploiting low-level cues?
Given a grayscale Macbeth color chart as input, the system is unable to recover its colors. On the other hand, given two recognizable vegetables that are roughly isoluminant, it is able to recover their colors.
Does it learn multimodal color distributions?
Take effective dilation as the measurement: the spacing at which consecutive elements of a convolutional kernel are evaluated, relative to the input pixels, computed as the product of the accumulated stride and the layer's dilation. Through the blocks conv1 to conv5, the effective dilation of the convolutional kernel increases; from conv6 to conv8, it decreases.
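A minimal sketch of this effective-dilation computation; the layer configuration below is hypothetical, loosely mimicking the downsample-then-dilate design:

```python
def effective_dilations(layers):
    """layers: list of (stride, dilation) per conv layer, in order.

    Effective dilation of a layer's kernel, relative to input pixels,
    is the layer's dilation times the product of all preceding strides.
    """
    out, accumulated_stride = [], 1
    for stride, dilation in layers:
        out.append(accumulated_stride * dilation)
        accumulated_stride *= stride
    return out

# Hypothetical configuration: stride-2 stages early, dilated convs mid-network.
print(effective_dilations([(2, 1), (2, 1), (2, 1), (1, 2), (1, 2), (1, 1)]))
# -> [1, 2, 4, 16, 16, 8]: effective dilation rises, then falls at the end
```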
Conclusion
- Designing an appropriate objective function that handles the multimodal uncertainty of the colorization problem and captures a wide diversity of colors
- Introducing a novel framework for testing colorization algorithms, potentially applicable to other image synthesis tasks
- Setting a new high-water mark on the task by training on a million color photos.
- Introduce the colorization task as a competitive and straightforward method for self-supervised representation learning, achieving state-of-the-art results on several benchmarks.