ByteDance, the parent company of TikTok, has introduced a new artificial intelligence (AI) model for video generation called OmniHuman-1, capable of taking a single image together with a driving signal, such as an audio clip, and turning it into hyperrealistic videos of people talking, dancing, singing, or even playing instruments.
The model is trained to replicate human speech, movement, and gestures accurately. The company’s website indicates that OmniHuman can generate highly realistic motion, natural gestures, and exceptional detail, whether presented in portrait, half-body, or full-body shots. At its core, OmniHuman is a multimodality-conditioned human video-generation model: it takes several types of input, such as images and audio clips, and generates videos that closely resemble reality.
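Since the model itself has not been released, the snippet below is purely illustrative: a minimal Python sketch, using hypothetical names such as `Conditions` and `generate_video`, of how a multimodality-conditioned generator might accept a reference image plus an optional audio or pose signal and return video frames. None of this is ByteDance’s actual API.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class Conditions:
    """Bundle of conditioning inputs a multimodality-conditioned human
    video generator could accept (hypothetical structure, for illustration)."""
    reference_image: np.ndarray               # H x W x 3 portrait/half-body/full-body shot
    audio: Optional[np.ndarray] = None        # raw waveform driving lip sync and gestures
    pose_video: Optional[np.ndarray] = None   # T x J x 2 keypoint sequence driving motion


def generate_video(cond: Conditions, num_frames: int = 120) -> List[np.ndarray]:
    """Placeholder for the model call: returns `num_frames` RGB frames whose
    motion would follow whichever conditioning signals are present."""
    # A real model would run its generative backbone here; this stub simply
    # repeats the reference image so the sketch stays runnable.
    return [cond.reference_image.copy() for _ in range(num_frames)]


# Example: a single image driven by a five-second speech clip.
image = np.zeros((512, 512, 3), dtype=np.uint8)     # stand-in for a portrait photo
speech = np.zeros(16000 * 5, dtype=np.float32)      # stand-in for 5 s of 16 kHz audio
frames = generate_video(Conditions(reference_image=image, audio=speech))
print(f"{len(frames)} frames generated")
```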
Currently, OmniHuman is under research and is not available to the public. The developers have shared some demo videos and hinted at the possibility of some code release in the future.
OmniHuman-1 is one more breakthrough to hit the Chinese AI market after DeepSeek’s DeepSeek-V3 language model. It positions ByteDance against established video-generation rivals such as Runway’s Gen-3 Alpha, Luma AI’s Dream Machine, and OpenAI’s Sora, now that video generation is firmly embedded in the mainstream of AI.
How OmniHuman works:
OmniHuman is a state-of-the-art, multimodality-conditioned video-generation framework: it synthesises a human video from a single image, driven by motion signals that can be audio only, video only, or a combination of the two.
It incorporates a mixed training scheme in which motion-related conditions from different modalities are combined, allowing the model to scale up training on data that would be unusable under any single condition. This lets OmniHuman sidestep the scarcity of high-quality training data that limited earlier end-to-end methods.
Researchers used this multi-condition approach to train OmniHuman on a large corpus of video, reportedly more than 180,000 hours, teaching it to synchronise lip movements with audio and thereby synthesise natural-looking animation.
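As a rough illustration of the mixed-condition idea described above, the sketch below shows how a training loop might randomly keep or drop each conditioning modality at every step, so that clips lacking a given signal (say, a pose track) can still contribute to training. All names here (`model`, `batch`, `optimizer`, the loss) are assumptions for illustration; ByteDance has not published its training code.

```python
import random

import torch
import torch.nn.functional as F


def mixed_condition_step(model, batch, optimizer, p_keep=None):
    """One hypothetical training step with per-modality condition dropout.

    Each optional modality is kept with its own probability, so the model is
    exposed to every mixture of conditions during training and clips that lack
    a modality (e.g. no pose track) can still be used.
    """
    if p_keep is None:
        p_keep = {"audio": 0.5, "pose": 0.3, "text": 0.25}

    conditions = {"image": batch["reference_image"]}   # reference image is always kept
    for name in ("audio", "pose", "text"):
        if batch.get(name) is not None and random.random() < p_keep[name]:
            conditions[name] = batch[name]

    # Generic denoising/reconstruction objective; the actual objective used by
    # OmniHuman is not public, so a plain MSE stands in here.
    pred = model(conditions, noisy_video=batch["noisy_video"], timestep=batch["timestep"])
    loss = F.mse_loss(pred, batch["target_video"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```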
Characteristics: Key features include multimodality motion conditioning, realistic lip sync and gestures, and support for varied inputs and aspect ratios, all while maintaining high animation quality that extends to other visual domains as well.
Top Competitors of OmniHuman
Sora is the text-to-video model from OpenAI, capable of generating high-quality videos up to one minute long from textual prompts. It stands out for maintaining remarkable spatial and temporal consistency, giving it a strong grasp of 3D environments, physics, and realistic motion.
Sora can extend existing videos and fill in missing video information, and it supports storytelling and creative video production with dynamic camera movements. While the exact training details remain undisclosed, OpenAI has stated that the model was trained on a mix of publicly available and licensed datasets, giving its outputs broad visual diversity.
Runway’s Gen-3 Alpha is an improved version of the company’s earlier video models, built for fast, high-quality video generation. It greatly enhances its predecessors’ handling of form, shape, and motion, and can generate visually coherent, highly intricate sequences of video shots from short text prompts.
The model excels at maintaining character consistency and smooth motion, which makes it valuable for professional content creators. Built on an assortment of 240 million images and 6.4 million video clips, Gen-3 Alpha uses that data to generate realistic, detailed, and coherent videos in record time.
Luma AI’s Dream Machine is a transformer-based video model designed for both scalability and efficiency, generating physically plausible and visually consistent shots. It accepts multimodal input, allowing users to create realistic videos guided by both textual prompts and images, which gives greater flexibility in content creation.
It offers a user-friendly interface with storyboard generation and style references, delivering far more predictable and customisable results. While the precise training details have not been fully shared, Dream Machine has been trained on a large-scale dataset of video clips, giving it a strong grasp of common motion patterns and visual dynamics.
OmniHuman-1 vs. Sora
Architecturally, OpenAI’s Sora and ByteDance’s OmniHuman-1 take different approaches to video generation. Sora uses a transformer-based model incorporating physics-aware simulation and temporal continuity, and it excels at realistic scene synthesis and spatial consistency in highly complex three-dimensional environments.
OmniHuman-1 takes the other route, using a diffusion-transformer-based, multi-condition approach focused on human movement and character continuity, producing high-fidelity, lifelike character animation.
Whereas Sora aims for general environmental realism, OmniHuman-1 excels in detailed character dynamics; the two models thus bring distinct strengths to video generation.
In an announcement, ByteDance touted OmniHuman’s capabilities, saying, “The performance outstrips the state-of-the-art generation, materialising ultra-realistic human videos from weak signal inputs, and audio in particular.”
In a paper on arXiv, the company says the model supports input images of any aspect ratio, from portrait to half-body to full-body shots, and produces largely high-fidelity results, showcasing its versatility across different use cases.
Still, a clean comparison is impossible for now, as neither company has released scores on common benchmarks. Any comparison at this point rests on user impressions, which are subjective at best.