Make-A-Story: Visual Memory Conditioned Consistent Story Generation
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames. A sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed.
Our experiments for story generation on the MUGEN and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.
Multimodal deep learning approaches have pushed the quality and the breadth of conditional generation tasks such as image captioning and text-to-image synthesis.
One such task is that of story generation, the goal of which is to generate a sequence of illustrative image frames with coherent semantics given a sequence of sentences.
Recent work on story generation has made significant advances along these lines, showing high visual fidelity and character consistency for story sentences that are self-contained and unambiguous (explicitly mentioning the characters and the setting each time).
While impressive, this setup is fundamentally unrealistic: real story text routinely refers back to actors and settings through co-references (e.g., pronouns) rather than re-stating them in every sentence.
While it may be possible to first resolve ambiguous references in the story text and then generate the corresponding images using existing story generation approaches, such a two-stage pipeline is sub-optimal.
We address this by proposing a new autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames.
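As a rough illustration of the kind of mechanism described above, the sketch below shows a sentence-conditioned soft attention over a memory of previously generated frame latents: the query comes from the current sentence embedding, while keys and values come from the stored latents. The module name, dimensions, and projection layers are our own placeholders and not the paper's implementation.

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Illustrative sketch (not the paper's code): soft attention over a
    visual memory of past frame latents, with the query conditioned on the
    current sentence embedding."""

    def __init__(self, latent_dim: int, text_dim: int, attn_dim: int = 256):
        super().__init__()
        self.to_q = nn.Linear(text_dim, attn_dim)      # query from the sentence embedding
        self.to_k = nn.Linear(latent_dim, attn_dim)    # keys from memory latents
        self.to_v = nn.Linear(latent_dim, latent_dim)  # values from memory latents

    def forward(self, sentence_emb: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # sentence_emb: (B, text_dim); memory: (B, T, latent_dim) past frame latents
        q = self.to_q(sentence_emb).unsqueeze(1)                 # (B, 1, attn_dim)
        k = self.to_k(memory)                                    # (B, T, attn_dim)
        v = self.to_v(memory)                                    # (B, T, latent_dim)
        scores = (q @ k.transpose(1, 2)) / k.shape[-1] ** 0.5    # (B, 1, T)
        weights = scores.softmax(dim=-1)                         # soft attention over memories
        context = (weights @ v).squeeze(1)                       # (B, latent_dim) memory readout
        return context
```

In this sketch the readout would be injected as additional conditioning for the diffusion model when generating the next frame, which is how the attention could support reference resolution across frames.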
In summary, our contributions are as follows.
In this paper, we formulate consistent story generation in a more realistic setting, in which actors and backgrounds are co-referenced across the story descriptions.
We develop an autoregressive Story-LDM approach with memory attention that maintains consistency across frames based on the previously generated frames and their corresponding descriptions (a minimal sketch of this loop follows below).
We introduce modified versions of existing datasets to evaluate performance on reference resolution.
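To make the autoregressive use of the memory concrete, the following sketch shows one frame being generated per sentence while the memory of past frame latents grows. The callables `text_encoder`, `frame_generator`, and `memory_attention` are hypothetical interfaces assumed for illustration, not the paper's API; `frame_generator` stands in for a latent diffusion sampler conditioned on the sentence and the memory readout.

```python
import torch

def generate_story(sentences, text_encoder, frame_generator, memory_attention):
    """Hypothetical autoregressive loop: each frame is generated conditioned
    on its sentence and on a readout from the memory of previously generated
    frame latents. All callables here are placeholder interfaces."""
    frames, memory = [], []
    for sentence in sentences:
        sent_emb = text_encoder(sentence)                 # (B, text_dim)
        if memory:
            mem = torch.stack(memory, dim=1)              # (B, T, latent_dim)
            context = memory_attention(sent_emb, mem)     # readout for reference resolution
        else:
            context = None                                # first frame: no history yet
        latent = frame_generator(sent_emb, context)       # e.g., a latent diffusion sampler
        memory.append(latent)                             # extend the visual memory
        frames.append(latent)
    return frames
```
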