Sora: A Fascinating Text-to-Video Model With Limitations

OpenAI, an artificial intelligence (AI) firm, has introduced its first text-to-video model, called Sora. This new generative AI model transforms simple text prompts into detailed videos; it can also extend existing videos and generate scenes from still images. OpenAI claims that Sora can produce movie-like scenes at resolutions up to 1080p, complete with multiple characters, specific types of motion, and accurate details of the subject and background.

Sora operates using a diffusion model, similar to OpenAI’s image-generation predecessor, Dall-E 3. The model starts from a video or image that looks like “static noise” and gradually “removes the noise” over many steps until a coherent output emerges. OpenAI states that Sora builds on past research from models like ChatGPT and Dall-E 3, which helps it represent user inputs more faithfully. Sora still has weaknesses, particularly in accurately simulating the physics of complex scenes. For example, it may fail to show a bite mark on a cookie after a person takes a bite.
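The iterative denoising idea behind diffusion models can be illustrated with a toy sketch. This is not Sora’s actual implementation: the `toy_denoise_step` function and the fixed `CLEAN_TARGET` are stand-ins for what, in a real diffusion model, would be a learned neural network predicting noise to subtract at each step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the "clean" image the model converges toward
# (purely illustrative; a real model has no fixed target).
CLEAN_TARGET = np.zeros((4, 4))

def toy_denoise_step(noisy, step, total_steps):
    """One hypothetical denoising step.

    In a real diffusion model this would be a trained network
    estimating the noise component; here we just blend the sample
    toward the clean target, removing more noise in later steps.
    """
    alpha = 1.0 / (total_steps - step)
    return noisy * (1.0 - alpha) + CLEAN_TARGET * alpha

# Start from pure static noise, then iteratively remove noise.
sample = rng.normal(size=(4, 4))
TOTAL_STEPS = 10
for step in range(TOTAL_STEPS):
    sample = toy_denoise_step(sample, step, TOTAL_STEPS)

# After all steps, the sample sits at the clean target.
print(np.abs(sample - CLEAN_TARGET).max())  # → 0.0
```

The key property this sketch shares with real diffusion sampling is the loop: each pass makes the sample slightly less noisy, and only after many passes does a clean result appear.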

OpenAI has initially made the model available to cybersecurity researchers and select designers, visual artists, and filmmakers for evaluating potential risks. The company aims to gather feedback on improving the model. This cautious approach is a response to the ethical and legal concerns highlighted in a Stanford University report in December 2023. The report revealed that AI-powered image-generation tools were being trained on illegal child abuse material, raising serious concerns about the use of text-to-image or video models.

Videos showcasing Sora’s capabilities have been widely circulated on social media platform X. The platform is currently buzzing with over 173,000 posts about Sora. OpenAI CEO Sam Altman has even offered to fulfill custom video-generation requests from users on X. Altman shared seven Sora-generated videos, ranging from a duck riding a dragon to golden retrievers recording a podcast on a mountain top. The impressive demonstrations have left many users speechless.

AI commentator Mckay Wrigley and others have applauded Sora’s video generation capabilities. Nvidia senior researcher Jim Fan explicitly stated that Sora shouldn’t be perceived as just another creative tool, like Dall-E 3. According to Fan, Sora is more than that; it is a “data-driven physics engine.” In other words, the model doesn’t merely generate abstract videos; it learns to simulate the physics of the objects in a scene directly from data.

Pieter Kellerman
