Everyone’s favorite text-to-image generator Dall-E has a new competitor from Meta: a text-to-video generator called Make-A-Video. The tool generates short, soundless video snippets based on the same type of text prompts you feed to Dall-E.
But Dall-E is child’s play compared to Make-A-Video, at least according to Mark Zuckerberg. The Meta CEO noted in a Facebook post, “It’s much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they’ll change over time.” According to Zuckerberg, Make-A-Video tackles that problem because it can “understand motion in the physical world and apply it to traditional text-to-image generation.”
Another Make-A-Video feature is the ability to add motion to static images. Make-A-Video’s transformation of a static image of a woman doing a yoga pose, for example, has her leaning deeper into her stretch as a light flare shimmers on the lens. Other examples of the tool’s output are available on its website, which notes that you can also feed Make-A-Video an existing video and get back several new variations of it.
We’ll take all these examples with a grain of salt, since Make-A-Video isn’t yet available to the public, but it is a wild new potential development for artificial intelligence.
Meta has published a paper about the tool. It details how the model was trained, along with its technical limitations, which include an inability to generate clips longer than five seconds or to deliver resolutions higher than 768 by 768 pixels at 16 frames per second. The Verge notes that the only text-to-video model available to the public, called CogVideo, is burdened by the same limitations.