Temporal Modeling via Video Transformer and Masked Visual-token Modeling