Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision - 42Papers