Deepfakes are getting weirder (and better)
Some big developments in generative AI make it clear we will soon have a hard time distinguishing between synthetic media and real life
The earliest versions of aiEDU’s professional learning presentations featured an AI-generated deepfake video.
At the time, this was novel stuff — and my team generally had the opportunity to watch people’s eyes widen as they witnessed, for the first time ever, an extremely compelling (if not entirely indistinguishable!) likeness of someone created with AI.
This was back in 2020-21, a simpler time before ChatGPT ushered in the AI zeitgeist that we find ourselves in today.
It was a time when most of the AI that people interacted with was invisible — content algorithms, mapping, fraud detection, stuff like that. That deepfake of Morgan Freeman was by far the most compelling part of my presentation, and I often opened with the short video to hook audiences and convince them that this “AI thing” was something to pay attention to.
It was a time before generative AI, and I led many discussions about the implications of this technology being in the hands of thousands or even millions of people. Most agreed that would be a dystopian world.
Then came the era of LLMs and generative AI.
Now, the world has been flooded with an endless array of synthetic content. I’m guessing most of you will have tried your hand at creating AI images and artwork. GenAI is in the palm of our hands, right on our phones – it’s part of Instagram, TikTok, and content creation apps for making your own photos or videos.
And… everything seems fine?
There were some troubling reports of deepfakes being used in elections throughout 2024 and the rise of ‘nudification’ apps in schools across the U.S. These are indeed troubling, but not apocalyptic.
Deepfakes haven’t become a widespread crisis, and that can be attributed to a few things:
Good deepfakes are still hard to create.
Most of the really impressive stuff that is still available on YouTube is some form of face-swapping and requires a fair amount of expertise to pull off.

Social norms make them difficult to share widely.

Social platforms haven’t necessarily removed all AI-generated content, but their algorithms do penalize it. YouTube has pulled down many deepfake videos, and companies are getting ahead of legislation that seems, to some degree, inevitable.

Most deepfakes aren’t very good.

This fake video of Ukrainian President Volodymyr Zelensky telling his forces to “lay down arms and surrender to Russia” is scary in concept but a far cry from a perfect replica. That’s the case with most of the deepfakes making their way past content filters: good enough to spark a conversation about the technology, but not yet indistinguishable.
Of course, it’s worth considering whether deepfakes have already crossed the Rubicon and are so good that we simply can’t identify them. But I don’t think we’re there yet.
So, how far are we from the real deal?
This stuff is hard to predict, but I’ve seen some recent developments that are worth paying attention to — not just from the perspective of disinformation, but also with an eye to the future of media and entertainment. We are certainly heading toward a stranger place, one where it will be hard to tell what’s real from what’s fake.
You’ve already seen text-to-video generators like OpenAI’s Sora and Google’s Veo 2, which aren’t necessarily deepfakes but still are on the continuum of synthetic content. But this next wave of AI video and deepfake technology could put us over the top.
Here are a few new AI video models that may end up doing just that:
OmniHuman-1
OmniHuman is an AI video creation framework developed by ByteDance, the maker of TikTok and the popular video-editing app CapCut. And from what they’ve shown so far, it represents a significant leap forward in AI video.
OmniHuman’s breakthrough lies in its multimodality-conditioned learning, which is a fancy way of saying that it combines different types of inputs (video, photo, and audio) to create more realistic videos.
In the past, deepfake models generally relied on a single type of input to guide movement and expression in their output. This resulted in videos that looked stiff and unrealistic because the underlying AI lacked enough information to generate natural motion. But by mixing different input types, OmniHuman learns from a wider range of human movement patterns, which makes the final video more fluid, expressive, and contextually accurate.
For example: An audio-driven AI video model might excel at lip-syncing and basic facial expressions, but OmniHuman can use pose and motion inputs to create full-body movements, gestures, and even interactions with other objects. Check out this video of Albert Einstein (made with this photo from 1921) reciting words spoken by neuroscientist Jaak Panksepp during a TEDx Talk from 2017:
If given just an image and an audio clip, OmniHuman can animate a person speaking with natural facial expressions and subtle head movements. If pose data is added (via a wider-angle photo or video), it can generate full-body motion that aligns with the rhythm of speech or specific physical actions. OmniHuman’s framework doesn’t just learn how to copy specific movements; it learns the underlying patterns of human motion, making them applicable across different scenarios.
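To make the idea of multimodal conditioning concrete, here is a toy sketch. This is not ByteDance’s actual architecture — the function names, the shared width, and the random “projections” standing in for learned weights are all illustrative assumptions — but it shows the core idea: each available modality (image, audio, optional pose) is mapped into a shared conditioning space and combined, so the generator can draw on whichever signals are present.

```python
import numpy as np

def fuse_conditions(image_emb, audio_emb, pose_emb=None):
    """Toy multimodal conditioning: project each available modality
    into a shared space and average them into one conditioning vector.

    Hypothetical sketch -- real systems learn these projections; here
    random matrices with a fixed seed stand in for trained weights.
    """
    width = 8  # shared conditioning width (illustrative)
    rng = np.random.default_rng(0)

    def project(emb):
        # stand-in for a learned linear projection to the shared width
        w = rng.standard_normal((emb.shape[-1], width))
        return emb @ w

    parts = [project(image_emb), project(audio_emb)]
    if pose_emb is not None:  # pose is optional, as described above
        parts.append(project(pose_emb))
    return np.mean(parts, axis=0)

# Works with just image + audio, or with pose data added:
cond_basic = fuse_conditions(np.ones(4), np.ones(6))
cond_full = fuse_conditions(np.ones(4), np.ones(6), pose_emb=np.ones(3))
```

The point of the averaging step is that the downstream generator sees one conditioning vector of a fixed shape regardless of which modalities were supplied, which is what lets a single model handle audio-only, audio-plus-pose, and other input combinations.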
As AI-generated videos continue to improve, they won’t just look more realistic — they’ll also feel more convincing in the way they move and interact with their surroundings.
One other issue with OmniHuman is its ownership, and specifically how it might (or might not) be regulated. ByteDance is, of course, a Chinese company. We’ve already seen (ironically, through ByteDance’s own TikTok) how hard it can be to regulate Chinese-owned companies. That means OmniHuman is not only potentially difficult to regulate but may require something even harder to achieve — international norms on its use.
VideoJAM
Usually, you can spot AI-created videos by the awkward movements of their on-screen subjects. But Meta AI’s new VideoJAM framework goes the extra mile by implementing realistic physics that prior video generators have struggled with.
The difference is pretty clear when you see them side-by-side:
VideoJAM uses a Joint Appearance-Motion representation to better capture how on-screen elements both look and move together. Instead of focusing only on visual details, it also learns to predict motion, using its own mechanism called Inner-Guidance to steer generation toward natural, coherent movement. The combination of the JAM representation and Inner-Guidance lets VideoJAM create videos that look realistic and self-correct its outputs based on its learned understanding of motion.
Whereas previous AI video models prioritized the appearance of their output, VideoJAM emphasizes natural motion and physical interactions to create videos that feel real beyond just a crisp, high-quality picture.
Goku
AI video technology is moving fast — so fast that ByteDance released yet another video generator in the past couple of days, while I was writing this post!
The fittingly named Goku currently seems to have claimed the ‘Final Boss’ crown of AI video technology. And it is, as the Internet meme suggests, tough to beat. Like OmniHuman, Goku is a multimodal framework that can create outputs from text prompts or still images. So far, the results have been better than anything we’ve seen from Sora and other AI video generators:
Goku uses a Rectified Flow Transformer (RFT) architecture that is designed to generate high-quality video with smoother and more natural human motion. RFT differs from the more common Diffusion Transformer (DiT) architecture used in AI video models.
In DiT, the output process gradually adds more detail to an image over many steps, which often produces unnatural or jerky movements. RFT, by contrast, uses a “straight line” process that generates outputs more efficiently and keeps them more stable and visually consistent. RFT’s linear workflow lets Goku avoid the telltale glitches of AI video and create more realistic images that aren’t instantly recognizable as artificial… For better or worse. 😬
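The “straight line” idea behind rectified flow can be sketched in a few lines. This is a simplified illustration, not Goku’s implementation — `velocity_fn` stands in for the learned model, and the toy constant velocity field is an assumption chosen so the math is checkable — but it shows why a near-linear path from noise to sample allows few, large integration steps.

```python
import numpy as np

def rectified_flow_sample(x_noise, velocity_fn, steps=4):
    """Integrate a flow from noise toward data with simple Euler steps.

    Rectified flow trains velocity_fn(x, t) so the noise-to-sample path
    is close to a straight line, which is why a handful of coarse steps
    suffices (vs. the many denoising steps of diffusion-style sampling).
    """
    x = x_noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

# Toy velocity field that always points at a fixed target: the path is
# exactly straight, so coarse integration still lands on the target.
target = np.array([2.0, -1.0])
sample = rectified_flow_sample(np.zeros(2), lambda x, t: target, steps=4)
```

The contrast with diffusion-style sampling is in the step budget: when the trajectory is curved, coarse Euler steps drift off the path, so you need many small steps; when it is (near-)straight, a few large steps arrive at the same endpoint.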
A new era of verisimilitude
When you add up these AI video tools, you end up somewhere quite different from where we started. Many of our deepest fears about deepfakes have not come to fruition so far, but the technology on display today goes well beyond the “wow, that’s pretty neat” quality of the examples my team and I first began showing.
Though it’s hard to say how quickly, it is clear — especially after diving deeper into some of the newest deepfake technology — that it will become harder and harder to tell what’s real from what’s fake. As someone who works day in, day out with school districts and educators to support teachers and students in a future powered by AI, I know that shift will have significant ramifications.
What’s clear though is that we’re still in the early chapters of a longer science fiction novel.