A Simple Framework for Contrastive Learning of Visual Representations
1. Framework
T: family of augmentations. Sequentially apply three simple augmentations: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur.
The combination of random crop and color distortion is crucial for good performance.
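The pipeline above can be sketched in NumPy. This is a minimal stand-in, not the paper's implementation: the function names are mine, resizing is nearest-neighbor, and a box filter substitutes for a true Gaussian blur.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img, min_scale=0.5):
    """Crop a random square patch, then resize (nearest-neighbor) back to the original size."""
    h, w, _ = img.shape
    side = int(rng.uniform(min_scale, 1.0) * min(h, w))
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    patch = img[top:top + side, left:left + side]
    ys = np.arange(h) * side // h          # nearest-neighbor index maps
    xs = np.arange(w) * side // w
    return patch[np.ix_(ys, xs)]

def color_distort(img, strength=0.5):
    """Randomly jitter brightness and per-channel color."""
    img = img * rng.uniform(1 - strength, 1 + strength)           # brightness
    img = img * rng.uniform(1 - strength, 1 + strength, size=3)   # per-channel tint
    return np.clip(img, 0.0, 1.0)

def blur(img, k=3):
    """Box-filter stand-in for Gaussian blur: mean over a k x k window."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def augment(img):
    # t ~ T: crop + resize -> color distortion -> blur, in the order listed above
    return blur(color_distort(random_crop_resize(img)))

x = rng.uniform(size=(32, 32, 3))
x_i, x_j = augment(x), augment(x)   # two correlated views of the same image
```

Applying `augment` twice to the same image yields the positive pair (x_i, x_j) used by the contrastive loss.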
f(.): encoder, without any architectural constraint (a ResNet is used).
g(.): projection head, a learnable nonlinear transformation (an MLP with one hidden layer).
Loss function: NT-Xent (normalized temperature-scaled cross-entropy loss), l(i,j) = -log[ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ], where τ denotes a temperature parameter and sim(.,.) is cosine similarity. The loss decreases as the similarity of the positive pair increases and increases as the similarity of negative pairs increases.
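A minimal NumPy sketch of NT-Xent, assuming rows 2k and 2k+1 of the batch are a positive pair (a layout choice of mine, not fixed by the paper):

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N projection-head outputs; rows 2k and 2k+1 form a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize -> cosine similarity
    sim = z @ z.T / tau                                # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude s(i, i) from the softmax
    pos = np.arange(len(z)) ^ 1                        # index of each row's positive partner
    log_prob = sim[np.arange(len(z)), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
N, d = 8, 32
h = rng.normal(size=(N, d))
z_good = np.repeat(h, 2, axis=0)       # identical positive pairs -> low loss
z_rand = rng.normal(size=(2 * N, d))   # unrelated "pairs" -> higher loss
assert nt_xent(z_good) < nt_xent(z_rand)
```

The `-np.inf` diagonal makes exp(s(i,i)) contribute exactly 0 to the denominator, matching the k ≠ i condition.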
Evaluation protocol: linear evaluation protocol, where a linear classifier is trained on top of the frozen base network and test accuracy is used as a proxy for representation quality.
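The protocol reduces to: freeze f(.), train only a linear probe on its outputs. A toy sketch with a stand-in "frozen encoder" (a fixed random projection of my own choosing, not a real pretrained network) and a plain logistic-regression probe:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(x):
    """Stand-in for a frozen pretrained f(.): a fixed (non-trainable) projection + ReLU."""
    W = np.linspace(-1, 1, x.shape[1] * 16).reshape(x.shape[1], 16)
    return np.maximum(x @ W, 0.0)

# Toy data: two well-separated Gaussian blobs, so a linear probe can succeed.
X = np.vstack([rng.normal(-2, 1, size=(100, 8)), rng.normal(2, 1, size=(100, 8))])
y = np.array([0] * 100 + [1] * 100)

H = frozen_encoder(X)                # features are fixed; only the probe below is trained
w, b = np.zeros(H.shape[1]), 0.0
for _ in range(200):                 # logistic regression via plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))
    grad = p - y
    w -= 0.1 * H.T @ grad / len(y)
    b -= 0.1 * grad.mean()

acc = (((H @ w + b) > 0) == (y == 1)).mean()   # probe accuracy = representation-quality proxy
```

Only `w` and `b` receive gradients; the encoder's outputs `H` are computed once and never updated, which is the defining property of the protocol.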
2. Experiments
Data augmentation
Conclusions:
1. No single transformation suffices to learn good representations.
2. Color histograms alone suffice to distinguish images; neural nets may exploit this shortcut to solve the predictive task.
3. It is critical to compose cropping with color distortion in order to learn generalizable features.
4. Unsupervised contrastive learning benefits from stronger (color) data augmentation than supervised learning.
Architecture for encoder and head
Conclusion:
The gap between supervised models and linear classifiers trained on unsupervised models shrinks as the model size increases, suggesting that unsupervised learning benefits more from bigger models than its supervised counterpart.
Projection head
The hidden layer before the projection head is a better representation than the layer after it.
We conjecture that the importance of using the representation before the nonlinear projection is due to loss of information induced by the contrastive loss. In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects.
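The two layers in question can be made concrete. A sketch of g(.) as a one-hidden-layer MLP (widths chosen to be ResNet-50-like; the exact sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_hidden, d_z = 2048, 2048, 128   # illustrative widths; ResNet-50 gives d_h = 2048

# g(.): one-hidden-layer MLP, z = relu(h W1) W2
W1 = rng.normal(scale=0.01, size=(d_h, d_hidden))
W2 = rng.normal(scale=0.01, size=(d_hidden, d_z))

def projection_head(h):
    return np.maximum(h @ W1, 0.0) @ W2

h = rng.normal(size=(4, d_h))   # stand-in for encoder outputs h = f(x)
z = projection_head(h)          # z = g(h): used only inside the contrastive loss

# After pretraining, g is discarded: downstream tasks consume h, not z.
assert h.shape == (4, 2048) and z.shape == (4, 128)
```

The conjecture above says z is trained toward augmentation invariance, so information that augmentations destroy (color, orientation) survives in h but may be stripped from z.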
Loss Functions
Conclusions:
1) L2 normalization (i.e., cosine similarity) together with temperature effectively weights different examples, and an appropriate temperature can help the model learn from hard negatives.
2) Unlike cross-entropy, other objective functions do not weigh negatives by their relative hardness. As a result, one must apply semi-hard negative mining (Schroff et al., 2015) for these loss functions: instead of computing the gradient over all loss terms, one computes the gradient using semi-hard negative terms (i.e., those that are within the loss margin and closest in distance, but farther than positive examples).
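The hardness weighting in point 1) is just the softmax over scaled similarities: the weight a negative receives is proportional to exp(s_k/τ), so a lower temperature concentrates the gradient on the hardest (most similar) negatives. A quick numeric check with similarity values of my own choosing:

```python
import numpy as np

def negative_weights(sims, tau):
    """Softmax weight each negative receives in the NT-Xent gradient: proportional to exp(s_k / tau)."""
    e = np.exp(np.asarray(sims) / tau)
    return e / e.sum()

# Cosine similarities to three negatives: one hard (0.9), two easy.
sims = [0.9, 0.1, -0.5]
w_warm = negative_weights(sims, tau=1.0)   # mild concentration on the hard negative
w_cold = negative_weights(sims, tau=0.1)   # nearly all weight on the hard negative
assert w_cold[0] > w_warm[0]
```

This is why temperature matters: it interpolates between treating all negatives uniformly (large τ) and mining only the hardest one (small τ).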
Batch Size
1) When the number of training epochs is small (e.g., 100 epochs), larger batch sizes have a significant advantage over smaller ones.
2) With more training steps/epochs, the gaps between different batch sizes decrease or disappear, provided the batches are randomly resampled.
Explanation:
1) Larger batch sizes provide more negative examples, facilitating convergence (i.e., fewer epochs and steps are needed for a given accuracy).
2) Training longer also provides more negative examples, improving the results.
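The negative count behind both explanations can be checked directly. Assuming the (2k, 2k+1) pair layout used earlier in these notes, each anchor in a batch of N images has 2(N - 1) negatives, so batch size linearly scales negatives per step:

```python
import numpy as np

N = 4                                      # batch of N images -> 2N augmented views
mask = np.ones((2 * N, 2 * N), dtype=bool)
np.fill_diagonal(mask, False)              # an anchor is never contrasted with itself
idx = np.arange(2 * N)
mask[idx, idx ^ 1] = False                 # ...nor with its positive partner (pairs (2k, 2k+1))
negatives_per_anchor = mask.sum(axis=1)    # equals 2(N - 1) for every anchor
```

Doubling N doubles (minus a constant) the negatives seen per gradient step, which is the mechanism invoked in explanation 1).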
Comparison with State-of-the-art
Linear evaluation.
Semi-supervised learning
Transfer learning