A year after ChatGPT, OpenAI surprised the world again with its video generation model, Sora. Honestly, I was shocked after skimming the technical report, even though everyone is familiar with ChatGPT by now. I strongly recommend reading the report itself before all the news articles and tweets.
Although model and implementation details are not included in the report, I followed its references to learn the main ideas behind Sora. In this post, I'd like to share my understanding.
Architecture overview
Sora is a diffusion transformer. The remarkable scaling properties of the transformer are what made ChatGPT successful, and Sora is built on the same transformer backbone. The main idea of Sora is to embed video data into tokens that encode both spatial and temporal information, and to run the diffusion process over those tokens.
Since I already have a basic understanding of ChatGPT, I will focus on the differences: (1) what is a diffusion model? (2) how is a video embedded into tokens?
Diffusion model
[TBD]
- Background theory: “Deep unsupervised learning using nonequilibrium thermodynamics” (Sohl-Dickstein, Jascha, et al.)
- Diffusion model: “Denoising Diffusion Probabilistic Models” (Ho, Jonathan, Ajay Jain, and Pieter Abbeel); see the sketch after this list
- Diffusion model with transformers: “Scalable diffusion models with transformers” (Peebles, William, and Saining Xie)
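In short, a diffusion model learns to reverse a gradual noising process: the forward process corrupts data with Gaussian noise step by step, and a network is trained to predict that noise so it can be removed again at generation time. The sketch below illustrates only the closed-form forward step and the training target from Ho et al.; the noise schedule follows the paper's defaults, while the variable names and the placeholder `eps_theta` are my own assumptions, not Sora's (undisclosed) implementation.

```python
import numpy as np

# Linear noise schedule over T steps (default range from Ho et al., 2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = product of alphas up to step t

def q_sample(x0, t, rng):
    """Closed-form forward process q(x_t | x_0):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

# Training step (conceptually): pick a random t, noise the data, and regress the
# network's predicted noise eps_theta(x_t, t) onto the true eps with an MSE loss.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # stand-in for an image or latent patch
t = int(rng.integers(0, T))
xt, eps = q_sample(x0, t, rng)
# loss = np.mean((eps_theta(xt, t) - eps) ** 2)   # eps_theta: the denoising network
```

In a diffusion transformer (DiT), that denoising network is a transformer operating on a sequence of patch tokens rather than a convolutional U-Net, which is what lets the same scaling recipe as language models carry over.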
Embedding the video data
[TBD]
- Transformers for images: “An image is worth 16x16 words: Transformers for image recognition at scale” (Dosovitskiy, Alexey, et al.)
- Transformers for video: “ViViT: A Video Vision Transformer” (Arnab, Anurag, et al.); see the spacetime-patch sketch after this list
- Variable resolutions and aspect ratios: “Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution” (Dehghani, Mostafa, et al.)
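To make the token idea concrete, the sketch below cuts a video tensor into non-overlapping spacetime patches, ViT-style but with a temporal extent, and flattens each patch into one token. The patch sizes, shapes, and function name are illustrative assumptions rather than Sora's actual configuration; in a real model each flattened patch would additionally be linearly projected to the transformer's hidden dimension, and Sora reportedly patchifies a compressed latent representation of the video rather than raw pixels.

```python
import numpy as np

def video_to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video of shape (T, H, W, C) into non-overlapping spacetime patches.
    Each token covers pt frames and a ph x pw spatial block; returns an array of
    shape (num_tokens, pt * ph * pw * C)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # bring the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)    # one flattened patch per row (token)

video = np.random.rand(16, 128, 128, 3)       # toy example: 16 frames of 128x128 RGB
tokens = video_to_spacetime_patches(video)
print(tokens.shape)                           # (256, 3072): 4 x 8 x 8 spacetime patches
```

The NaViT-style trick is then to pack token sequences from videos of different resolutions, aspect ratios, and durations into the same training batch, which is why a fixed crop size is no longer required.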