CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers