GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]
- Inference Code
- Pretrained Models
- A web demo
- Training Code
- Code cleanup for readability
- Support for MeanFlow
- Unified training and testing pipeline
- MeanFlow Training Code (Coming Soon)
- Merge with Intentional-Gesture
Latest Update: The codebase has been cleaned and restructured. For legacy or historical information, please check out the old branch.
New Features:
- Added MeanFlow model support
- Unified training and testing pipeline using `train.py`
- New configuration files in the `configs_new/` directory
- Updated checkpoint files with improved performance
conda create -n gesturelsm python=3.12
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
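As a quick sanity check after installation (a minimal sketch assuming the conda environment created above), you can confirm that PyTorch sees the GPU:

```python
# Quick environment sanity check (assumes the gesturelsm environment above).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```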
Understanding the codebase structure will help you navigate and customize the project effectively.
GestureLSM/
├── 📁 configs_new/                  # New unified configuration files
│   ├── diffusion_rvqvae_128.yaml    # Diffusion model config
│   ├── shortcut_rvqvae_128.yaml     # Shortcut model config
│   └── meanflow_rvqvae_128.yaml     # MeanFlow model config
├── 📁 configs/                      # Legacy configuration files (deprecated)
├── 📁 ckpt/                         # Pretrained model checkpoints
│   ├── new_540_diffusion.bin        # Diffusion model weights
│   ├── shortcut_reflow.bin          # Shortcut model weights
│   ├── meanflow.pth                 # MeanFlow model weights
│   └── net_300000_*.pth             # RVQ-VAE model weights
├── 📁 models/                       # Model implementations
│   ├── Diffusion.py                 # Diffusion model
│   ├── LSM.py                       # Latent Shortcut Model
│   ├── MeanFlow.py                  # MeanFlow model
│   ├── 📁 layers/                   # Neural network layers
│   ├── 📁 vq/                       # Vector quantization modules
│   └── 📁 utils/                    # Model utilities
├── 📁 dataloaders/                  # Data loading and preprocessing
│   ├── beat_sep_lower.py            # Main dataset loader
│   ├── 📁 pymo/                     # Motion processing library
│   └── 📁 utils/                    # Data utilities
├── 📁 trainer/                      # Training framework
│   ├── base_trainer.py              # Base trainer class
│   └── generative_trainer.py        # Generative model trainer
├── 📁 utils/                        # General utilities
│   ├── config.py                    # Configuration management
│   ├── metric.py                    # Evaluation metrics
│   └── rotation_conversions.py      # Rotation utilities
├── 📁 demo/                         # Demo and visualization
│   ├── examples/                    # Sample audio files
│   └── install_mfa.sh               # MFA installation script
├── 📁 datasets/                     # Dataset storage
│   ├── BEAT_SMPL/                   # Original BEAT dataset
│   ├── beat_cache/                  # Preprocessed cache
│   └── hub/                         # SMPL models and pretrained weights
├── 📁 outputs/                      # Training outputs and logs
│   └── weights/                     # Saved model weights
├── train.py                         # Unified training/testing script
├── demo.py                          # Web demo script
├── rvq_beatx_train.py               # RVQ-VAE training script
└── requirements.txt                 # Python dependencies
- `models/Diffusion.py`: Denoising diffusion model for high-quality generation
- `models/LSM.py`: Latent Shortcut Model for fast inference
- `models/MeanFlow.py`: Flow-based model for single-step generation
- `models/vq/`: Vector quantization modules for latent space compression

- `configs_new/`: New unified configuration files for all models
- `configs/`: Legacy configuration files (deprecated)
- Each config file contains model parameters, training settings, and data paths

- `dataloaders/beat_sep_lower.py`: Main dataset loader for the BEAT dataset
- `dataloaders/pymo/`: Motion processing library for gesture data
- `datasets/beat_cache/`: Preprocessed data cache for faster loading

- `train.py`: Unified script for training and testing all models
- `trainer/`: Training framework with base and generative trainers
- `optimizers/`: Optimizer and scheduler implementations

- `utils/config.py`: Configuration management and validation
- `utils/metric.py`: Evaluation metrics (FGD, etc.)
- `utils/rotation_conversions.py`: 3D rotation utilities
- For Training: Use `train.py` with configs from `configs_new/`
- For Inference: Use `demo.py` for the web interface or `train.py --mode test`
- For Customization: Modify config files in the `configs_new/` directory
- For New Models: Add model implementations in the `models/` directory
This table shows the results of the 1-speaker and all-speaker comparisons. RAG-Gesture refers to "Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis", accepted by CVPR 2025. The 1-speaker stats are based on speaker ID 2 ('scott') to stay consistent with previous SOTA methods. The RAG-Gesture numbers are copied directly from its repo and differ from the stats in the current paper.
- The statistics reported in the paper are based on 1-speaker with speaker ID 2 ('scott'), to stay consistent with previous SOTA methods.
- The pretrained models (RVQ-VAEs, Diffusion, Shortcut, MeanFlow) are trained on 1-speaker.
- If you want to use all-speaker, modify the config files to include all speaker IDs (see the sketch after this list).
- April 16, 2025: updated the pretrained model to include all speakers. (RVQ-VAEs, Shortcut)
- No hyperparameter tuning was done for all-speaker - same settings as 1-speaker are used.
- No speaker embedding is included to make the model capable of generating gestures for novel speakers.
- No gesture type information is used in the current version. This is intentional as gesture types are typically unknown for novel speakers and settings, making this approach more realistic for real-world applications.
- If you want to see better FGD scores, you can try adding gesture type information.
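As a minimal sketch of the all-speaker change: the snippet below copies a config and edits a speaker-list field. The key name `training_speakers` and the speaker ID range are assumptions, so check the actual field names in `configs_new/` before using it.

```python
# Hypothetical sketch: switch a config from 1-speaker to all-speaker training.
# The key "training_speakers" and the ID range are assumptions; check the
# actual field names in configs_new/ before relying on this.
import yaml

with open("configs_new/shortcut_rvqvae_128.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["training_speakers"] = list(range(1, 31))  # assumed BEAT2 speaker ID range

with open("configs_new/shortcut_rvqvae_128_all_speakers.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```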
- Current Version: Clean, unified codebase with MeanFlow support
- Legacy Code: Available in the `old` branch for historical reference
- Accepted to ICCV 2025 - Thanks to all co-authors!
# Option 1: From Google Drive
# Download the pretrained models (Diffusion + Shortcut + MeanFlow + RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder
# Option 2: From Huggingface Hub
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt
# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
- Diffusion Model: `ckpt/new_540_diffusion.bin`
- Shortcut Model: `ckpt/shortcut_reflow.bin`
- MeanFlow Model: `ckpt/meanflow.pth`
- RVQ-VAE Models: `ckpt/net_300000_upper.pth`, `ckpt/net_300000_hands.pth`, `ckpt/net_300000_lower.pth`
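A small sketch, using the file names listed above, to verify that all checkpoints are in place before running inference:

```python
# Verify that the expected pretrained checkpoints exist under ./ckpt.
from pathlib import Path

expected = [
    "ckpt/new_540_diffusion.bin",
    "ckpt/shortcut_reflow.bin",
    "ckpt/meanflow.pth",
    "ckpt/net_300000_upper.pth",
    "ckpt/net_300000_hands.pth",
    "ckpt/net_300000_lower.pth",
]

missing = [p for p in expected if not Path(p).is_file()]
print("All checkpoints found." if not missing else f"Missing: {missing}")
```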
Required for evaluation and training; not necessary for running the web demo or inference.
The original dataset download method is no longer available. Please use the Hugging Face dataset:
# Download BEAT2 dataset from Hugging Face
huggingface-cli download H-Liu1997/BEAT2 --local-dir ./datasets/BEAT2

Dataset Information:
- Source: H-Liu1997/BEAT2 on Hugging Face
- Size: ~4.1K samples
- Format: CSV with train/test splits
- License: Apache 2.0
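If you prefer the Python API over the CLI, here is a minimal sketch using `huggingface_hub` (assuming the repo is hosted as a dataset repo on the Hub):

```python
# Download BEAT2 via the huggingface_hub Python API instead of the CLI.
# repo_type="dataset" assumes the repo is hosted as a dataset on the Hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="H-Liu1997/BEAT2",
    repo_type="dataset",
    local_dir="./datasets/BEAT2",
)
```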
The original download method is no longer working
# This command is deprecated and no longer works
# bash preprocess/bash_raw_cospeech_download.sh

Note: Requires dataset download for evaluation. For inference only, see the Demo section below.
The codebase now uses a unified train.py script for both training and testing. Use the --mode test flag for evaluation:
# Test Diffusion Model (20 steps)
python train.py --config configs_new/diffusion_rvqvae_128.yaml --ckpt ckpt/new_540_diffusion.bin --mode test
# Test Shortcut Model (2-step reflow)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test
# Test MeanFlow Model (1-step flow-based)
python train.py --config configs_new/meanflow_rvqvae_128.yaml --ckpt ckpt/meanflow.pth --mode test

| Model | Steps | Description | Key Features | Use Case |
|---|---|---|---|---|
| Diffusion | 20 | Denoising diffusion model | High quality, slower inference | High-quality generation |
| Shortcut | 2-4 | Latent shortcut with reflow | Fast inference, good quality | Recommended for most users |
| MeanFlow | 1 | Flow-based generation | Fastest inference, single step | Real-time applications |
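The step counts above translate directly into network evaluations per generated clip. Below is a conceptual sketch of that trade-off, not the repository's actual samplers; `velocity_model` stands in for any of the trained networks.

```python
# Conceptual sketch of sampling cost: an Euler-style flow integration whose
# cost scales with the number of steps. Not the repo's actual samplers.
import torch

def sample(velocity_model, shape, num_steps, device="cpu"):
    x = torch.randn(shape, device=device)          # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * velocity_model(x, t)          # one network evaluation per step
    return x

# Diffusion ~ sample(model, shape, num_steps=20)
# Shortcut  ~ sample(model, shape, num_steps=2)   (after reflow)
# MeanFlow  ~ sample(model, shape, num_steps=1)   (single evaluation)

if __name__ == "__main__":
    toy = lambda x, t: -x  # toy velocity field, for illustration only
    print(sample(toy, shape=(4, 128), num_steps=2).shape)
```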
| Model | Steps | FGD Score ↓ | Beat Constancy ↑ | L1Div Score ↑ | Inference Speed |
|---|---|---|---|---|---|
| MeanFlow | 1 | 0.4031 | 0.7489 | 12.4631 | Fastest |
| Diffusion | 20 | 0.4100 | 0.7384 | 12.5752 | Slowest |
| Shortcut | 20 | 0.4040 | 0.7144 | 13.4874 | Fast |
| Shortcut-ReFlow | 2 | 0.4104 | 0.7182 | 13.678 | Fast |
Legend:
- FGD Score (↓): Lower is better - measures gesture quality
- Beat Constancy (↑): Higher is better - measures audio-gesture synchronization
- L1Div Score (↑): Higher is better - measures diversity of generated gestures
Recommendation: MeanFlow offers the best FGD score and the fastest (single-step) inference; the Shortcut variants trade slightly higher FGD for better L1Div (diversity).
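For reference, FGD is a Fréchet distance (as in FID) between Gaussian fits of features from real and generated gestures. Below is a generic sketch, not the exact implementation in `utils/metric.py`; in practice the features come from a pretrained gesture autoencoder.

```python
# Generic Fréchet (Gesture) Distance between two feature sets, in the spirit of FID.
# Not the exact implementation in utils/metric.py.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    # Fit a Gaussian (mean, covariance) to each feature set.
    mu1, sigma1 = feats_real.mean(0), np.cov(feats_real, rowvar=False)
    mu2, sigma2 = feats_gen.mean(0), np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Illustration with random features only:
# print(frechet_distance(np.random.randn(500, 64), np.random.randn(500, 64)))
```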
For reference only - use the unified pipeline above instead
# Old testing commands (deprecated)
python test.py -c configs/shortcut_rvqvae_128.yaml
python test.py -c configs/shortcut_reflow_test.yaml
python test.py -c configs/diffuser_rvqvae_128.yaml

Requires dataset download.
bash train_rvq.sh
Note: Requires dataset download for training.
The codebase now uses a unified train.py script for training all models. Use the new configuration files in configs_new/:
# Train Diffusion Model
python train.py --config configs_new/diffusion_rvqvae_128.yaml
# Train Shortcut Model
python train.py --config configs_new/shortcut_rvqvae_128.yaml
# Train MeanFlow Model
python train.py --config configs_new/meanflow_rvqvae_128.yaml

- Config Directory: Use `configs_new/` for the latest configurations
- Output Directory: Models are saved to `./outputs/weights/`
- Logging: Supports Weights & Biases integration (configure in config files)
- GPU Support: Configure GPU usage in the config files
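To evaluate a freshly trained run, here is a small sketch that grabs the newest checkpoint from `./outputs/weights/` and launches the unified test mode; the checkpoint glob pattern is an assumption about file naming, so adjust it to your run.

```python
# Find the most recently written checkpoint under ./outputs/weights/ and run
# evaluation with the unified pipeline. The "*.pth" pattern is an assumption
# about checkpoint naming; adjust it (e.g. "*.bin") to match your run.
import subprocess
from pathlib import Path

ckpts = sorted(Path("outputs/weights").rglob("*.pth"), key=lambda p: p.stat().st_mtime)
if not ckpts:
    raise SystemExit("No checkpoints found under outputs/weights/")

subprocess.run(
    [
        "python", "train.py",
        "--config", "configs_new/shortcut_rvqvae_128.yaml",
        "--ckpt", str(ckpts[-1]),
        "--mode", "test",
    ],
    check=True,
)
```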
For reference only - use the unified pipeline above instead
# Old training commands (deprecated)
python train.py -c configs/shortcut_rvqvae_128.yaml
python train.py -c configs/diffuser_rvqvae_128.yaml

# Run the web demo with Shortcut model
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml

# Test with your own audio and text (requires pretrained models)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test

The demo provides a web interface for gesture generation. It uses the Shortcut model by default for fast inference.

python demo.py -c configs/shortcut_rvqvae_128_hf.yaml

Features:
- Web-based interface for easy interaction
- Real-time gesture generation
- Support for custom audio and text input
- Visualization of generated gestures
Thanks to SynTalker, EMAGE, and DiffuseStyleGesture; our code partially borrows from them. Please check out these useful repos.
If you find our code or paper helpful, please consider citing:
@inproceedings{liu2025gesturelsmlatentshortcutbased,
title={{GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling}},
author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
booktitle={IEEE/CVF International Conference on Computer Vision},
year={2025},
}