A year after ChatGPT, OpenAI surprised the world again with its video generation model, Sora. Honestly, I was shocked after skimming the technical report, even though everyone is familiar with ChatGPT by now. I strongly recommend reading the report itself before all the news articles and tweets.
Although model and implementation details are not included in the report, I followed its references to learn the main ideas behind Sora. In this post, I'd like to share my understanding.
Architecture overview
Sora is a diffusion transformer. The remarkable scaling properties of the transformer are what made ChatGPT successful, and Sora is built on the same transformer backbone. The main idea of Sora is to embed video data into tokens that encode both spatial and temporal information, and to run the diffusion process over those tokens.
Since I already have a basic understanding of ChatGPT, I will focus on the differences: (1) what is a diffusion model? (2) how is a video embedded into tokens?
Diffusion model
[TBD]
- Background theory: “Deep unsupervised learning using nonequilibrium thermodynamics” (Sohl-Dickstein, Jascha, et al.)
- Diffusion model: “Denoising Diffusion Probabilistic Models” (Ho, Jonathan, Ajay Jain, and Pieter Abbeel); see the sketch after this list
- Diffusion model with transformers: “Scalable diffusion models with transformers” (Peebles, William, and Saining Xie)
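In short, a diffusion model learns to reverse a gradual noising process: the forward process corrupts data with Gaussian noise step by step, and a network is trained to predict that noise so it can be removed again at generation time. The sketch below illustrates only the closed-form forward step and the training target from Ho et al.; the noise schedule follows the paper's defaults, while the variable names and the placeholder `eps_theta` are my own assumptions, not Sora's (undisclosed) implementation.

```python
import numpy as np

# Linear noise schedule over T steps (default range from Ho et al., 2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = product of alphas up to step t

def q_sample(x0, t, rng):
    """Closed-form forward process q(x_t | x_0):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

# Training step (conceptually): pick a random t, noise the data, and regress the
# network's predicted noise eps_theta(x_t, t) onto the true eps with an MSE loss.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # stand-in for an image or latent patch
t = int(rng.integers(0, T))
xt, eps = q_sample(x0, t, rng)
# loss = np.mean((eps_theta(xt, t) - eps) ** 2)   # eps_theta: the denoising network
```

In a diffusion transformer (DiT), that denoising network is a transformer operating on a sequence of patch tokens rather than a convolutional U-Net, which is what lets the same scaling recipe as language models carry over.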
Embedding the video data
[TBD]
- Transformers for images: “An image is worth 16x16 words: Transformers for image recognition at scale” (Dosovitskiy, Alexey, et al.)
- Transformers for video: “ViViT: A Video Vision Transformer” (Arnab, Anurag, et al.); see the spacetime-patch sketch after this list
- Variable resolutions and aspect ratios: “Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution” (Dehghani, Mostafa, et al.)
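To make the token idea concrete, the sketch below cuts a video tensor into non-overlapping spacetime patches, ViT-style but with a temporal extent, and flattens each patch into one token. The patch sizes, shapes, and function name are illustrative assumptions rather than Sora's actual configuration; in a real model each flattened patch would additionally be linearly projected to the transformer's hidden dimension, and Sora reportedly patchifies a compressed latent representation of the video rather than raw pixels.

```python
import numpy as np

def video_to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video of shape (T, H, W, C) into non-overlapping spacetime patches.
    Each token covers pt frames and a ph x pw spatial block; returns an array of
    shape (num_tokens, pt * ph * pw * C)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # bring the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)    # one flattened patch per row (token)

video = np.random.rand(16, 128, 128, 3)       # toy example: 16 frames of 128x128 RGB
tokens = video_to_spacetime_patches(video)
print(tokens.shape)                           # (256, 3072): 4 x 8 x 8 spacetime patches
```

The NaViT-style trick is then to pack token sequences from videos of different resolutions, aspect ratios, and durations into the same training batch, which is why a fixed crop size is no longer required.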