
Speaker-aware talking head animation for high-fidelity facial synchronization from a single image.

MakeItTalk is an AI framework for speaker-aware talking-head animation, originally introduced at SIGGRAPH Asia 2020. Unlike simple warping methods, MakeItTalk predicts a sequence of 3D facial landmarks from the input audio, disentangling speech content from speaker identity, and then animates a single portrait image to follow those landmarks. In the 2026 landscape, MakeItTalk serves as a lightweight baseline for developers who need real-time, landmark-based animation on edge devices where heavy diffusion-based models (such as EMO or LivePortrait) are computationally prohibitive. The architecture captures not just lip movement but also non-verbal cues such as head tilts, eye blinks, and brow movements, synchronized with the audio's prosody. It is particularly valued in the research community for its ability to animate diverse subjects, including oil paintings, sketches, and 2D cartoon characters, making it a versatile tool for stylized digital content creation and legacy photo revitalization.
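The pipeline described above can be sketched in miniature: per-frame audio features are encoded into a speech-content stream and a single speaker-identity vector, which are decoded into landmark displacements applied to the neutral face. This is a toy illustration with random projection matrices standing in for the learned networks; all dimensions and names are illustrative, not MakeItTalk's real ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not MakeItTalk's real sizes).
T, D_AUDIO, D_CONTENT, D_SPK, N_LM = 40, 80, 16, 8, 68

# Stand-ins for the learned encoders/decoder: random projections.
W_content = rng.standard_normal((D_AUDIO, D_CONTENT)) * 0.1
W_speaker = rng.standard_normal((D_AUDIO, D_SPK)) * 0.1
W_decode = rng.standard_normal((D_CONTENT + D_SPK, N_LM * 3)) * 0.1

def animate(audio_feats, neutral_landmarks):
    """Map per-frame audio features to displaced 3D landmarks."""
    content = audio_feats @ W_content            # speech content, per frame
    speaker = (audio_feats @ W_speaker).mean(0)  # one identity vector per clip
    speaker = np.broadcast_to(speaker, (len(audio_feats), D_SPK))
    disp = np.concatenate([content, speaker], axis=1) @ W_decode
    return neutral_landmarks[None] + disp.reshape(-1, N_LM, 3)

audio = rng.standard_normal((T, D_AUDIO))
neutral = rng.standard_normal((N_LM, 3))
frames = animate(audio, neutral)
print(frames.shape)  # (40, 68, 3): one 68-point 3D landmark set per frame
```

The real system replaces the random projections with trained recurrent/attention networks and renders pixels from the displaced landmarks, but the content-plus-speaker factorization is the core idea.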
MakeItTalk specializes in audio-driven lip syncing; this narrow domain focus helps it deliver optimized results for that specific requirement.
Uses a deep neural network to predict 3D facial landmarks from audio features, disentangling speaker identity from the speech content.
Represents facial geometry with a sparse set of 3D landmarks, enabling realistic head rotation and perspective changes without dense 3D reconstruction.
Trained on diverse datasets, allowing the model to interpret non-photorealistic faces such as sketches and paintings.
Predicts rhythmic head tilts and rotations based on the prosody and energy of the audio input.
Separates the audio signal into content (phonemes) and speaker-identity (pitch/tone) components using a voice-conversion style encoder.
The landmark-based approach is significantly faster than pixel-level diffusion generation.
Implements a smoothing filter over the predicted sequence of landmarks to prevent jitter.
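The anti-jitter smoothing mentioned above could be as simple as a moving average over the predicted landmark sequence. The sketch below is an assumption for illustration; the repository's actual filter may differ (e.g. Savitzky-Golay or Gaussian smoothing).

```python
import numpy as np

def smooth_landmarks(seq, window=5):
    """Moving-average filter over a (frames, points, 3) landmark sequence.

    Edge-pads the sequence so the output keeps the same frame count,
    then convolves each coordinate trajectory with a box kernel.
    """
    kernel = np.ones(window) / window
    pad_before = window // 2
    pad_after = window - 1 - pad_before
    padded = np.pad(seq, ((pad_before, pad_after), (0, 0), (0, 0)), mode="edge")
    return np.apply_along_axis(
        lambda track: np.convolve(track, kernel, mode="valid"), 0, padded)

# Demo: a jittery random-walk landmark sequence, then its smoothed version.
rng = np.random.default_rng(1)
noisy = np.cumsum(rng.standard_normal((60, 68, 3)) * 0.1, axis=0)
smoothed = smooth_landmarks(noisy)
```

Frame-to-frame displacements of the smoothed sequence are markedly smaller than those of the raw predictions, which is exactly what removes visible jitter in the rendered video.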
Clone the official repository from GitHub.
Install Python 3.8+ environment using Conda or venv.
Install PyTorch and torchvision with CUDA support.
Download the pre-trained facial landmark predictor weights.
Download the speech-to-landmark content predictor weights.
Prepare a source portrait image with a clear frontal face.
Prepare a high-quality mono audio file (16kHz recommended).
Run the inference script pointing to image and audio paths.
Adjust the 'speaker-awareness' weight to fine-tune motion intensity.
Export the generated frames or concatenated MP4 video.
Verified feedback from other users.
"Highly regarded for its technical elegance and ability to handle non-human faces, though users note it lacks the cinematic realism of 2025/2026 diffusion models."
