Last week, Meta AI revealed a new model that can generate a video from a text prompt: makeavideo.studio/
How does it work?
Here's a 🧵 to explain it!
But first, let's think about the challenge.
There are billions of images with alt-text on the web.
This makes for a huge dataset, and it's the key to training networks that understand the relationship between images and text.
But there is no equivalent for videos.
The trick is that Make-A-Video learns the relationship between images & text from paired data, and learns movement from videos without any text.
This way, Make-A-Video generates videos from text without the need for paired text-video data.
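To make that concrete, here is a toy PyTorch sketch of the two training stages. The modules, losses and data in it are invented placeholders; only the ordering (paired text-image data first, then unlabeled video clips) reflects what the thread describes.

```python
# Toy sketch of the two training stages. Modules, losses and data below are
# invented placeholders; only the ordering (paired text-image data first,
# then unlabeled video clips) reflects the method described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 32
spatial_layers = nn.Linear(dim, dim)    # stand-in for the text-to-image model
temporal_layers = nn.Linear(dim, dim)   # stand-in for the added temporal layers

# Stage 1: paired (text, image) data, no videos involved.
opt = torch.optim.Adam(spatial_layers.parameters(), lr=1e-3)
for _ in range(10):
    text_emb, image_emb = torch.randn(8, dim), torch.randn(8, dim)
    loss = F.mse_loss(spatial_layers(text_emb), image_emb)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: unlabeled video clips. Only the new temporal layers are trained;
# the spatial layers are reused as-is (frozen here for simplicity).
opt = torch.optim.Adam(temporal_layers.parameters(), lr=1e-3)
for _ in range(10):
    clip = torch.randn(8, 16, dim)          # 16 "frames" per toy clip
    with torch.no_grad():
        per_frame = spatial_layers(clip)    # image model applied frame by frame
    loss = F.mse_loss(temporal_layers(per_frame), clip)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch: text only ever meets images; video clips only teach motion.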
Here is the architecture.
We can see it takes text as input and generates a video as output.
Let's have a closer look
The first part, the prior P, is the one doing text-to-image.
They train it on paired text-image data and do not fine-tune it on videos.
It doesn't output an image directly but a latent representation of it (it would need a decoder to turn that into an actual image).
But instead of decoding one image, the spatiotemporal decoder outputs a sequence of images: 16 RGB frames, each of size 64 × 64.
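In shapes, the first half of the pipeline looks roughly like this. It's a minimal mock: the function names and the embedding size are my assumptions, while the 16 frames at 64 × 64 are the numbers above.

```python
# Shape-level mock of the prior and the spatiotemporal decoder.
# The real versions are large generative networks; random tensors here just
# illustrate what goes in and what comes out.
import torch

def prior_P(text_prompt: str) -> torch.Tensor:
    """Text -> image embedding (trained on text-image pairs only)."""
    return torch.randn(1, 768)            # a latent, not pixels (embedding size assumed)

def spatiotemporal_decoder(image_embedding: torch.Tensor) -> torch.Tensor:
    """Image embedding -> a short clip instead of a single image."""
    return torch.randn(1, 3, 16, 64, 64)  # (batch, RGB, 16 frames, 64, 64)

keyframes = spatiotemporal_decoder(prior_P("a dog wearing a superhero cape"))
print(keyframes.shape)                    # torch.Size([1, 3, 16, 64, 64])
```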
That's the real innovation of this paper IMO.
After training on images, they add and initialize new temporal layers and fine-tune them on unlabeled video data.
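Concretely, a temporal layer can be a 1D convolution over the frame axis stacked on top of the existing 2D spatial convolution (roughly the paper's pseudo-3D idea, though the exact wiring in Make-A-Video may differ). Initializing the temporal conv as an identity means the block starts out behaving exactly like the per-frame image model. A sketch:

```python
# Sketch of a factorized spatial-then-temporal convolution block, in the
# spirit of pseudo-3D layers. Treat the details as an illustration, not as
# the paper's exact implementation.
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Pretrained 2D conv from the image model, applied to every frame.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # New 1D conv across the time axis, initialized as an identity so the
        # block initially behaves exactly like the per-frame image model.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Spatial conv on each frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        # Temporal conv along the frame axis at every pixel location.
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        # Back to (batch, channels, frames, height, width).
        x = x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x

block = Pseudo3DConv(channels=8)
clip = torch.randn(1, 8, 16, 64, 64)   # (batch, channels, frames, H, W)
print(block(clip).shape)               # torch.Size([1, 8, 16, 64, 64])
```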
That's the key to this paper: learning text to image and movement separately.
It is also its main limitation, but I will come back to that later.
What comes next is more classical: transforming the 16 keyframes at 64×64 into a full video.
The first step is frame interpolation, going from 16 frames to 76 frames.
Then two networks upscale the frames to 256 × 256 and 768 × 768 respectively.
The first does spatiotemporal super-resolution, to keep frames consistent over time.
The second does super-resolution frame by frame and outputs the final video.
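Again in shapes, with plain interpolation standing in for the learned interpolation and super-resolution networks; the frame counts and resolutions are the ones above.

```python
# Shape-level mock of the last three stages; F.interpolate is just a
# placeholder for the learned networks.
import torch
import torch.nn.functional as F

keyframes = torch.randn(1, 3, 16, 64, 64)   # output of the decoder

# Frame interpolation: 16 -> 76 frames.
frames = F.interpolate(keyframes, size=(76, 64, 64),
                       mode="trilinear", align_corners=False)

# Spatiotemporal super-resolution: 64x64 -> 256x256, all frames processed
# jointly, which helps temporal consistency.
frames = F.interpolate(frames, size=(76, 256, 256),
                       mode="trilinear", align_corners=False)

# Frame-by-frame super-resolution: 256x256 -> 768x768 (the final tensor is
# about 0.5 GB of float32, so this step is memory-hungry).
frames = torch.stack(
    [F.interpolate(f, size=(768, 768), mode="bilinear", align_corners=False)
     for f in frames.unbind(dim=2)],
    dim=2,
)
print(frames.shape)   # torch.Size([1, 3, 76, 768, 768])
```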
While this method is amazing, there are a couple of limitations:
The number of output frames is fixed by the network, so all generated videos have the same duration.
But the main bottleneck is that text is only used to generate images during training.
So I don't think this approach can be used to describe a movement.
For instance, I bet it can't generate something like "a ball moving left to right".
That's it for today, but there is another paper that seems to overcome this limitation.
Stay tuned :-)