A Simple Framework for Contrastive Learning of Visual Representations
1. Framework
T: family of augmentations. Sequentially apply three simple augmentations: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur.
The combination of random crop and color distortion is crucial for good performance.
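The pipeline above can be sketched in NumPy. This is a minimal stand-in, not the paper's implementation: the function names are mine, resizing is nearest-neighbor, and a box filter substitutes for a true Gaussian blur.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img, min_scale=0.5):
    """Crop a random square patch, then resize (nearest-neighbor) back to the original size."""
    h, w, _ = img.shape
    side = int(rng.uniform(min_scale, 1.0) * min(h, w))
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    patch = img[top:top + side, left:left + side]
    ys = np.arange(h) * side // h          # nearest-neighbor index maps
    xs = np.arange(w) * side // w
    return patch[np.ix_(ys, xs)]

def color_distort(img, strength=0.5):
    """Randomly jitter brightness and per-channel color."""
    img = img * rng.uniform(1 - strength, 1 + strength)           # brightness
    img = img * rng.uniform(1 - strength, 1 + strength, size=3)   # per-channel tint
    return np.clip(img, 0.0, 1.0)

def blur(img, k=3):
    """Box-filter stand-in for Gaussian blur: mean over a k x k window."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def augment(img):
    # t ~ T: crop + resize -> color distortion -> blur, in the order listed above
    return blur(color_distort(random_crop_resize(img)))

x = rng.uniform(size=(32, 32, 3))
x_i, x_j = augment(x), augment(x)   # two correlated views of the same image
```

Applying `augment` twice to the same image yields the positive pair (x_i, x_j) used by the contrastive loss.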
f(.): encoder, without any architectural constraint (a ResNet is used).
g(.): projection head, a learnable nonlinear transformation (an MLP with one hidden layer).
Loss function: NT-Xent (normalized temperature-scaled cross-entropy loss), l(i,j) = -log[ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ], where τ denotes a temperature parameter and sim(.,.) is cosine similarity. The loss decreases as the similarity of the positive pair increases and increases as the similarity of negative pairs increases.
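A minimal NumPy sketch of NT-Xent, assuming rows 2k and 2k+1 of the batch are a positive pair (a layout choice of mine, not fixed by the paper):

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N projection-head outputs; rows 2k and 2k+1 form a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize -> cosine similarity
    sim = z @ z.T / tau                                # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude s(i, i) from the softmax
    pos = np.arange(len(z)) ^ 1                        # index of each row's positive partner
    log_prob = sim[np.arange(len(z)), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
N, d = 8, 32
h = rng.normal(size=(N, d))
z_good = np.repeat(h, 2, axis=0)       # identical positive pairs -> low loss
z_rand = rng.normal(size=(2 * N, d))   # unrelated "pairs" -> higher loss
assert nt_xent(z_good) < nt_xent(z_rand)
```

The `-np.inf` diagonal makes exp(s(i,i)) contribute exactly 0 to the denominator, matching the k ≠ i condition.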
Evaluation protocol: linear evaluation protocol, where a linear classifier is trained on top of the frozen base network and test accuracy is used as a proxy for representation quality.
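The protocol reduces to: freeze f(.), train only a linear probe on its outputs. A toy sketch with a stand-in "frozen encoder" (a fixed random projection of my own choosing, not a real pretrained network) and a plain logistic-regression probe:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(x):
    """Stand-in for a frozen pretrained f(.): a fixed (non-trainable) projection + ReLU."""
    W = np.linspace(-1, 1, x.shape[1] * 16).reshape(x.shape[1], 16)
    return np.maximum(x @ W, 0.0)

# Toy data: two well-separated Gaussian blobs, so a linear probe can succeed.
X = np.vstack([rng.normal(-2, 1, size=(100, 8)), rng.normal(2, 1, size=(100, 8))])
y = np.array([0] * 100 + [1] * 100)

H = frozen_encoder(X)                # features are fixed; only the probe below is trained
w, b = np.zeros(H.shape[1]), 0.0
for _ in range(200):                 # logistic regression via plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))
    grad = p - y
    w -= 0.1 * H.T @ grad / len(y)
    b -= 0.1 * grad.mean()

acc = (((H @ w + b) > 0) == (y == 1)).mean()   # probe accuracy = representation-quality proxy
```

Only `w` and `b` receive gradients; the encoder's outputs `H` are computed once and never updated, which is the defining property of the protocol.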
2. Experiments
Data augmentation
Conclusions:
1. No single transformation suffices to learn good representations.
2. Color histograms alone suffice to distinguish images; neural nets may exploit this shortcut to solve the predictive task.
3. It is critical to compose cropping with color distortion in order to learn generalizable features.
4. Unsupervised contrastive learning benefits from stronger (color) data augmentation than supervised learning.
Architecture for encoder and head
Conclusion:
The gap between supervised models and linear classifiers trained on unsupervised models shrinks as the model size increases, suggesting that unsupervised learning benefits more from bigger models than its supervised counterpart.
Projection head
The hidden layer before the projection head is a better representation than the layer after it.
We conjecture that the importance of using the representation before the nonlinear projection is due to loss of information induced by the contrastive loss. In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects.
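The two layers in question can be made concrete. A sketch of g(.) as a one-hidden-layer MLP (widths chosen to be ResNet-50-like; the exact sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_hidden, d_z = 2048, 2048, 128   # illustrative widths; ResNet-50 gives d_h = 2048

# g(.): one-hidden-layer MLP, z = relu(h W1) W2
W1 = rng.normal(scale=0.01, size=(d_h, d_hidden))
W2 = rng.normal(scale=0.01, size=(d_hidden, d_z))

def projection_head(h):
    return np.maximum(h @ W1, 0.0) @ W2

h = rng.normal(size=(4, d_h))   # stand-in for encoder outputs h = f(x)
z = projection_head(h)          # z = g(h): used only inside the contrastive loss

# After pretraining, g is discarded: downstream tasks consume h, not z.
assert h.shape == (4, 2048) and z.shape == (4, 128)
```

The conjecture above says z is trained toward augmentation invariance, so information that augmentations destroy (color, orientation) survives in h but may be stripped from z.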
Loss Functions
Conclusions:
1) L2 normalization (i.e., cosine similarity) together with temperature effectively weights different examples, and an appropriate temperature can help the model learn from hard negatives.
2) Unlike cross-entropy, other objective functions do not weigh negatives by their relative hardness. As a result, one must apply semi-hard negative mining (Schroff et al., 2015) for these loss functions: instead of computing the gradient over all loss terms, one computes the gradient using semi-hard negative terms (i.e., those that are within the loss margin and closest in distance, but farther than positive examples).
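The hardness weighting in point 1) is just the softmax over scaled similarities: the weight a negative receives is proportional to exp(s_k/τ), so a lower temperature concentrates the gradient on the hardest (most similar) negatives. A quick numeric check with similarity values of my own choosing:

```python
import numpy as np

def negative_weights(sims, tau):
    """Softmax weight each negative receives in the NT-Xent gradient: proportional to exp(s_k / tau)."""
    e = np.exp(np.asarray(sims) / tau)
    return e / e.sum()

# Cosine similarities to three negatives: one hard (0.9), two easy.
sims = [0.9, 0.1, -0.5]
w_warm = negative_weights(sims, tau=1.0)   # mild concentration on the hard negative
w_cold = negative_weights(sims, tau=0.1)   # nearly all weight on the hard negative
assert w_cold[0] > w_warm[0]
```

This is why temperature matters: it interpolates between treating all negatives uniformly (large τ) and mining only the hardest one (small τ).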
Batch Size
1) When the number of training epochs is small (e.g., 100 epochs), larger batch sizes have a significant advantage over smaller ones.
2) With more training steps/epochs, the gaps between different batch sizes decrease or disappear, provided the batches are randomly resampled.
Explanation:
1) Larger batch sizes provide more negative examples, facilitating convergence (i.e., fewer epochs and steps are needed for a given accuracy).
2) Training longer also provides more negative examples, improving the results.
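The negative count behind both explanations can be checked directly. Assuming the (2k, 2k+1) pair layout used earlier in these notes, each anchor in a batch of N images has 2(N - 1) negatives, so batch size linearly scales negatives per step:

```python
import numpy as np

N = 4                                      # batch of N images -> 2N augmented views
mask = np.ones((2 * N, 2 * N), dtype=bool)
np.fill_diagonal(mask, False)              # an anchor is never contrasted with itself
idx = np.arange(2 * N)
mask[idx, idx ^ 1] = False                 # ...nor with its positive partner (pairs (2k, 2k+1))
negatives_per_anchor = mask.sum(axis=1)    # equals 2(N - 1) for every anchor
```

Doubling N doubles (minus a constant) the negatives seen per gradient step, which is the mechanism invoked in explanation 1).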
Comparison with State-of-the-art
Linear evaluation.
Semi-supervised learning
Transfer learning