From ChatGPT to Sora

A year after ChatGPT, OpenAI surprised the world again with its video generation model, Sora. Honestly, I was shocked after skimming the technical report, even though everyone is familiar with ChatGPT by now. I strongly recommend reading the report itself before turning to the news coverage and tweets.

Although the report does not include model and implementation details, I followed its references to learn the main ideas behind Sora. In this post, I’d like to share my understanding.

Architecture overview

Sora is a diffusion transformer. The remarkable scaling properties of transformers made ChatGPT successful, and Sora builds on the same transformer architecture. Its main idea is to embed video data into tokens that encode both temporal and spatial information.
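
To make the idea of "video as tokens" concrete, here is a minimal sketch of how a video tensor could be cut into spacetime patches, each flattened into one token and tagged with its (time, height, width) grid position. This is only my illustration, not Sora's actual implementation: the function name, the patch sizes, and the choice to work on raw pixels are all assumptions. The report describes first compressing the video into a latent space and only then extracting patches, and this sketch skips that step.

```python
import numpy as np

def video_to_spacetime_patches(video, pt=2, ph=4, pw=4):
    """Split a video of shape (T, H, W, C) into flattened spacetime patches.

    pt, ph, pw are hypothetical patch sizes along time, height and width.
    Returns:
      tokens:    (N, pt * ph * pw * C), one row per patch
      positions: (N, 3), the (t, h, w) grid index of each patch
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch sizes")
    nt, nh, nw = T // pt, H // ph, W // pw

    # Split each axis into (number of patches, patch size).
    x = video.reshape(nt, pt, nh, ph, nw, pw, C)
    # Bring the patch-grid axes to the front: (nt, nh, nw, pt, ph, pw, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten every patch into a single token vector.
    tokens = x.reshape(nt * nh * nw, pt * ph * pw * C)

    # Record where each token sits in time and space.
    grid = np.meshgrid(np.arange(nt), np.arange(nh), np.arange(nw), indexing="ij")
    positions = np.stack(grid, axis=-1).reshape(-1, 3)
    return tokens, positions

# Toy example: 8 frames of 16x16 RGB.
video = np.arange(8 * 16 * 16 * 3, dtype=np.float32).reshape(8, 16, 16, 3)
tokens, positions = video_to_spacetime_patches(video)
print(tokens.shape, positions.shape)
```

In a real model, each token would then go through a learned linear projection, and the positions would be turned into positional embeddings so the transformer knows when and where each patch came from.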

Since I already have a basic understanding of ChatGPT, I pay more attention to the differences: (1) What is a diffusion model? (2) How is a video embedded into tokens?

Diffusion model


Embedding the video data