Add LTX-2.3 text-to-video generation support (#402)
Conversation
🤖 Hi @Perseus14, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
This Pull Request successfully introduces support for LTX-2.3 text-to-video generation. It includes significant updates to the transformer architecture (gated attention, cross-modal modulation) and the denoising pipeline (4-way batched denoising for STG/CFG/MIG). The implementation is high-quality and integrates well with the existing LTX-2 infrastructure.
🔍 General Feedback
- Redundant Patch File: The `scratch_diff.patch` file was likely added by mistake and should be removed before merging.
- Robustness: A few areas in the pipeline (like the `audio_channels` fallback and upsampler parameter inference) could be made more robust to handle different model versions and naming conventions.
- Optimization: The use of `nnx.jit` for the vocoder and the optimized sequence length in smoke tests are excellent additions for performance and stability.
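The robustness point about the `audio_channels` fallback could look something like the sketch below. The helper name `infer_audio_channels`, the default of 2, and the `DummyCfg` stand-in are hypothetical illustrations, not the PR's actual code:

```python
def infer_audio_channels(config, default: int = 2) -> int:
    """Hypothetical helper: read `audio_channels` from a model config,
    falling back to a default when the attribute is absent or None
    (e.g. a checkpoint from a model version that predates the audio branch)."""
    value = getattr(config, "audio_channels", None)
    return value if value is not None else default


class DummyCfg:  # stand-in for a loaded model config
    audio_channels = 1


print(infer_audio_channels(DummyCfg()))  # config provides the value -> 1
print(infer_audio_channels(object()))    # attribute missing -> default 2
```

The same `getattr`-with-default pattern covers both older checkpoints and renamed config keys without branching on model version strings.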
This Pull Request introduces comprehensive support for LTX-2.3 text-to-video generation, including the end-to-end pipeline, model updates, and a new vocoder with bandwidth extension (BWE). The implementation correctly handles complex features like Spatio-Temporal Guidance (STG) and Modality Isolation Guidance (MIG) using a 4-way batched denoising approach in JAX.
🔍 General Feedback
- STG/MIG Logic: The implementation of the 4-way split denoising logic and the corresponding delta formulations for guidance is impressive and aligns well with the LTX-2.3 technical requirements.
- Efficiency: Utilizing
nnx.scanfor the denoising loop ensures optimal performance on TPU/GPU hardware. - Redundancy: I identified some redundant initializations and assignments in the transformer and autoencoder models that should be cleaned up.
- Parameter Initialization: Double-check the usage of
nnx.Paramwithkernel_init, asnnx.Paramtypically only accepts the data tensor and might ignore additional keyword arguments.
```python
num_mod_params=num_mod_params,
use_additional_conditions=False,
dtype=self.dtype,
weights_dtype=self.weights_dtype,
```
🟡 This block is redundant as it exactly duplicates the initialization of `prompt_adaln` and `audio_prompt_adaln` already performed in lines 743-756.
```python
num_mel_bins = self.audio_vae.config.mel_bins if getattr(self, "audio_vae", None) is not None else 64
num_mel_bins = self.audio_vae.config.mel_bins
```
🟠 Similar to the `__init__` check, this will crash if `audio_vae` is None. A fallback value (e.g., 64 or 128) or a conditional check is needed.

```diff
- num_mel_bins = self.audio_vae.config.mel_bins
+ num_mel_bins = self.audio_vae.config.mel_bins if self.audio_vae is not None else 128
```
```python
def convert_to_vel(lat, x0, sigma_t):
    return (lat - x0) / sigma_t


def scan_body(carry, inputs):
```
🟡 The current logic ties the 4-way guidance pass (STG + MIG) strictly to `do_cfg` and `do_stg`. If a user enables `stg_scale > 0` but sets `guidance_scale = 1.0`, the pipeline will fall back to a 1-pass (or 2-pass if CFG is somehow active elsewhere) execution, and the STG/MIG masks will not be applied. Consider decoupling these or adding a check if either guidance is requested.

```diff
  def scan_body(carry, inputs):
+     do_cfg = guidance_scale > 1.0
+     do_stg = stg_scale > 0.0
```
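One way to decouple the two guidance paths is sketched below in plain Python. The flag names follow the PR's `guidance_scale`/`stg_scale` naming, but the `guidance_passes` helper and the pass labels are illustrative assumptions, not the PR's implementation:

```python
def guidance_passes(guidance_scale: float, stg_scale: float) -> list:
    """Illustrative helper: decide which denoising passes to batch, treating
    CFG and STG/MIG as independent switches instead of requiring both."""
    do_cfg = guidance_scale > 1.0
    do_stg = stg_scale > 0.0
    passes = ["cond"]                       # the conditional pass always runs
    if do_cfg:
        passes.append("uncond")             # CFG needs the unconditional pass
    if do_stg:
        passes += ["perturb", "isolated"]   # STG and MIG perturbation passes
    return passes


print(guidance_passes(1.0, 0.0))  # ['cond']
print(guidance_passes(4.0, 0.0))  # ['cond', 'uncond']
print(guidance_passes(1.0, 1.0))  # ['cond', 'perturb', 'isolated']
print(guidance_passes(4.0, 1.0))  # full 4-way batch
```

With this structure, `stg_scale > 0` alone is enough to trigger the perturbation passes, addressing the fallback concern in the comment above.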
```python
encoder_attention_mask=encoder_attention_mask,
audio_encoder_attention_mask=audio_encoder_attention_mask,
perturbation_mask=mask,
)
```
🔴 The `modality_mask` is missing in the non-scan (else) path of the transformer forward pass. This will prevent Modality Isolation Guidance (MIG) from working correctly when `scan_layers=False` is set in the configuration.

```diff
  audio_encoder_attention_mask=audio_encoder_attention_mask,
  perturbation_mask=mask,
+ modality_mask=modality_mask,
  )
```
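One way to keep the scan and non-scan call sites from drifting apart is to assemble the keyword arguments once and unpack them in both branches. This is a hedged sketch; the argument names mirror the PR, but the `build_block_kwargs` helper is hypothetical:

```python
def build_block_kwargs(encoder_attention_mask, audio_encoder_attention_mask,
                       perturbation_mask, modality_mask):
    """Collect the mask arguments once so both the scan and non-scan paths
    of the transformer forward pass receive the identical set, avoiding
    bugs where one branch silently drops an argument."""
    return dict(
        encoder_attention_mask=encoder_attention_mask,
        audio_encoder_attention_mask=audio_encoder_attention_mask,
        perturbation_mask=perturbation_mask,
        modality_mask=modality_mask,
    )


kwargs = build_block_kwargs("enc_mask", "aud_mask", "pert_mask", "mod_mask")
# Either branch would then call: block(hidden_states, **kwargs)
print(sorted(kwargs))  # all four mask arguments present in both branches
```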
```python
v2a_attention_kernel: str = "dot_product",
flash_block_sizes: BlockSizes = None,
flash_min_seq_length: int = 4096,
perturbed_attn: bool = False,
```
```python
config_path = config.upsampler_model_path
if config_path == "Lightricks/LTX-2.3":
    config_path = "Lightricks/LTX-2"
```
```python
filename = getattr(config, "upsampler_filename", None)
```
```diff
- k1, k2, k3, k4 = jax.random.split(key, 4)
+ k1, k2, k3, k4, k5, k6 = jax.random.split(key, 6)

  self.cross_attn_mod = cross_attn_mod
```
```diff
  self.cross_attn_mod = cross_attn_mod
+ self.scale_shift_table = nnx.Param(
+     jax.random.normal(k1, (table_size, self.dim), dtype=weights_dtype) / jnp.sqrt(self.dim)
+ )
```
```python
inject_noise = tuple(reversed(inject_noise))
upsample_residual = tuple(reversed(upsample_residual))
upsample_factor = tuple(reversed(upsample_factor))
upsample_type = upsample_type
```
🟡 The `upsample_type = upsample_type` self-assignment is a no-op and can be removed.

```diff
- upsample_type = upsample_type
```
```python
)

# Two independent connectors
self.per_modality_projections = per_modality_projections
```
🟡 Suggested change: also store `caption_channels` as an attribute here.

```diff
- self.per_modality_projections = per_modality_projections
+ self.caption_channels = caption_channels
+ self.per_modality_projections = per_modality_projections
```
This PR introduces end-to-end pipeline and model changes to support the LTX-2.3 multi-modal (audio-video) transformer model. It enables integrated text-to-audio-video generation using Gemma-based text conditioning, latent upsamplers, and vocoders.
Key architectural changes
- Gated attention (`to_gate_logits`) applied to all attention operations in the block (Self-Video, Self-Audio, Prompt-Cross, and Modal-Cross).
- Prompt adaptive layer norm (`self.prompt_adaln`). For this specific cross-attention modulation, it derives scale and shift parameters directly from the continuous noise level (sigma).
- Per-modality text projections (`per_modality_projections=True`). Instead of a shared feature extractor, it applies per-token RMS normalization to the raw hidden states and passes them through two separate linear projection layers (`video_text_proj_in` and `audio_text_proj_in`) before sending them to the respective video and audio connectors.
- Vocoder with bandwidth extension (`LTX2VocoderWithBWE`).

Files added/modified
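The gated-attention idea can be sketched in plain Python. The name `gate_logits` echoes the PR's `to_gate_logits` projection, but the sigmoid gate and the elementwise scaling shown here are assumptions for illustration, not the PR's verbatim implementation:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_output(attn_out, gate_logits):
    """Sketch of output gating: each attention output element is scaled by a
    learned gate in (0, 1) computed from gate logits (mirroring the role of
    the PR's `to_gate_logits` projection; the sigmoid here is an assumption)."""
    return [a * sigmoid(g) for a, g in zip(attn_out, gate_logits)]


out = gated_attention_output([1.0, 1.0, 2.0], [0.0, 10.0, 0.0])
print(out)  # zero logit halves the output; a large logit passes it nearly unchanged
```

Because the gate saturates smoothly between 0 and 1, the block can learn to suppress individual attention paths (e.g. a cross-modal path) without a hard on/off switch.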
- `ltx2_3_video.yml`: New config file for LTX-2.3
- `vocoder_ltx2.py`: Added support for the BWE vocoder
- `ltx2_pipeline.py`: Enabled 4-way sliced batched inference (Uncond, Cond, Perturb, Isolated) and integrated velocity/x0 conversion delta equations with guidance rescaling.
- `transformer_ltx2.py`: Propagated modality/perturbation masks to transformer blocks and integrated prompt adaptive layer norms.
- `generate_ltx2.py`, `pyconfig.py`, `common_types.py`: Added support for LTX-2.3
- `ltx2_utils.py`: Added support to load new LTX-2.3-specific weights
- `attention_ltx2.py`: Added support for gated attention and perturbed attention
- `autoencoder_kl_ltx2.py`: Added support for different `upsample_type` values
- `embeddings_connector_ltx2.py`: Added gated attention configuration (`gated_attn`) support to intermediate transformer block connectors.
- `feature_extractor_ltx2.py`: Added support for the `per_modality_projections` parameter
- `text_encoders.py`: Implemented dual-modality parallel text connector routing, token-wise RMS scaling, and independent video-audio linear projections.

Sample outputs
In addition, we also tested with `scan_diffusion_loop = True` and `scan_diffusion_loop = False`.
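The two settings should be numerically interchangeable: one unrolls the denoising loop as an ordinary Python loop, the other threads a carry through a scan. A minimal pure-Python sketch of that equivalence, with a hypothetical `step` function standing in for one denoising step:

```python
from functools import reduce


def step(latents, sigma):
    """Hypothetical stand-in for one denoising step: shrink the latents
    by a sigma-dependent factor (purely illustrative dynamics)."""
    return latents * (1.0 - 0.5 * sigma)


sigmas = [1.0, 0.5, 0.25]

# Unrolled loop (scan_diffusion_loop = False): a plain Python for-loop.
latents = 1.0
for s in sigmas:
    latents = step(latents, s)

# Scan-style loop (scan_diffusion_loop = True): the same computation as a
# fold over the sigma schedule, mirroring how a scan threads the carry.
scanned = reduce(step, sigmas, 1.0)

print(latents, scanned)  # both paths produce the identical result
```

Testing both paths is worthwhile because the scan version compiles the loop body once, which can surface shape or dtype mismatches that an unrolled loop hides.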