Document data have a 2-dimensional spatial layout: text content is structurally spread across different locations depending on the document type and format (e.g., invoices vs. tax forms), and formatted data such as figures, tables, and plots are laid out throughout the document. Effectively and efficiently modeling and understanding this layout is vital for document information extraction and content understanding, for example, title/signature extraction, fraudulent check detection, table processing, document classification, and automatic data entry from documents.

Document artificial intelligence studies information extraction, understanding, and analysis of digital documents, e.g., business invoices, tax forms, and academic papers. It is a multimodal task in which text is structurally embedded in documents together with other visual information such as symbols, figures, and style. Document data exhibit strong cross-modal interactions between the text and visual modalities, because the text modality is visually situated in an image.

To address these challenges, we propose Universal Document Processing (UDOP), a foundation Document AI model that unifies vision, text, and layout together with diverse document tasks. We model the three modalities with a unified layout-induced representation: in the input stage, the embeddings of text tokens are added to the features of the image patches in which the tokens are located (sketched below). To form a uniform paradigm for different vision, text, and layout tasks, UDOP first builds a homogeneous vocabulary for text and document layout that converts layout, i.e., bounding boxes, into discretized tokens. Second, we propose the Vision-Text-Layout (VTL) Transformer, consisting of a modality-agnostic encoder, a text-layout decoder, and a vision decoder.
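To make the layout-induced representation concrete, here is a minimal sketch in PyTorch, assuming normalized bounding boxes and a square grid of image patches; the function name and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def layout_induced_embeddings(token_emb, token_boxes, patch_feats, grid_size):
    """Sum each text token embedding with the feature of its enclosing patch.

    token_emb:   (num_tokens, dim)  text token embeddings
    token_boxes: (num_tokens, 4)    normalized boxes in [0, 1], (x0, y0, x1, y1)
    patch_feats: (grid_size, grid_size, dim)  image patch features
    """
    # Center of each token's bounding box, in normalized page coordinates.
    cx = (token_boxes[:, 0] + token_boxes[:, 2]) / 2
    cy = (token_boxes[:, 1] + token_boxes[:, 3]) / 2
    # Row/column index of the patch that contains the box center.
    col = (cx * grid_size).long().clamp(max=grid_size - 1)
    row = (cy * grid_size).long().clamp(max=grid_size - 1)
    # Joint representation: text embedding + feature of the patch it sits in.
    return token_emb + patch_feats[row, col]
```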
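The homogeneous text-layout vocabulary can be sketched as simple coordinate quantization: continuous box coordinates are binned, and each bin becomes a discrete token the model reads and generates like ordinary text. The bin count and the "<loc_k>" token format below are hypothetical choices for illustration.

```python
def box_to_layout_tokens(box, num_bins=500):
    """box: (x0, y0, x1, y1) with coordinates normalized to [0, 1]."""
    # Quantize each coordinate into one of num_bins discrete bins.
    bins = [min(int(v * num_bins), num_bins - 1) for v in box]
    # Render each bin index as a discrete vocabulary token.
    return [f"<loc_{b}>" for b in bins]

# Example: a box covering part of the upper-left region of the page.
print(box_to_layout_tokens((0.1, 0.25, 0.5, 0.75)))
# ['<loc_50>', '<loc_125>', '<loc_250>', '<loc_375>']
```

Because bounding boxes become ordinary tokens in a shared vocabulary, layout can be consumed and produced by the same sequence-to-sequence decoder as text.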
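Finally, a structural sketch of the Vision-Text-Layout Transformer, built from standard PyTorch transformer modules; the module names, layer counts, and query-based decoding interface are assumptions meant only to show the modality-agnostic-encoder/two-decoder split, not the paper's exact architecture.

```python
import torch.nn as nn

class VTLTransformer(nn.Module):
    def __init__(self, dim=768, heads=12, enc_layers=12, dec_layers=12):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Modality-agnostic encoder: consumes one fused sequence of text,
        # layout, and image-patch representations.
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        # Text-layout decoder: emits text tokens and discretized layout tokens.
        self.text_layout_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), dec_layers)
        # Vision decoder: generates image features, e.g., for reconstruction.
        self.vision_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), dec_layers)

    def forward(self, fused_inputs, text_queries, vision_queries):
        # Shared memory from the single encoder feeds both decoders.
        memory = self.encoder(fused_inputs)
        text_layout_out = self.text_layout_decoder(text_queries, memory)
        vision_out = self.vision_decoder(vision_queries, memory)
        return text_layout_out, vision_out
```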