
ltx.cpp

Portable C++17 inference of LTX-Video (Lightricks) using GGML / GGUF.
Text-to-video generation runs on CPU, with build-time support for CUDA, Metal, and Vulkan.

No Python at inference time.

On the audio-video branch, the same DiT handles audio+video: video and audio latent tokens are concatenated, denoised in one loop, then split and decoded to frames + WAV. See Audio-video (AV) below and docs/AV_PIPELINE.md.

Inspired by llama.cpp and acestep.cpp.


Status

  • Experimental, Unstable, Slow. Contributors welcome 👋

Features

  • Text-to-video inference with the LTX-Video 2.3 DiT
  • Image-to-video (I2V): animate a reference image (--start-frame)
  • Keyframe interpolation: provide both start and end frames to interpolate between them (--start-frame + --end-frame)
  • Quantised GGUF weights (Q4_K_M → Q8_0 → BF16)
  • Classifier-free guidance + flow-shift Euler sampler
  • PPM frame output (pipe to ffmpeg for MP4)
  • Audio-video (AV) pipeline: the same DiT sees a concatenated video+audio latent; output is video frames + WAV (see docs/AV_PIPELINE.md)
  • Single ltx-generate binary, no Python at runtime

Build

Build flags select the backend (same pattern as acestep.cpp). One backend per build; the resulting binary is optimized for that target.

git submodule update --init
mkdir build && cd build

# macOS (Metal + Accelerate BLAS auto-enabled)
cmake ..

# Linux with NVIDIA GPU
cmake .. -DLTX_CUDA=ON

# Linux with AMD GPU (ROCm)
cmake .. -DLTX_HIP=ON

# Linux / Windows with Vulkan
cmake .. -DLTX_VULKAN=ON

# macOS CPU-only (disable Metal)
cmake .. -DLTX_METAL=OFF

cmake --build . --config Release -j$(nproc)
Platform         Recommended cmake            Backend
macOS            cmake ..                     Metal
Linux (NVIDIA)   cmake .. -DLTX_CUDA=ON       CUDA
Linux (AMD)      cmake .. -DLTX_HIP=ON        ROCm/HIP
Linux / Win      cmake .. -DLTX_VULKAN=ON     Vulkan

Builds two binaries:

Binary         Purpose
ltx-generate   Text-to-video inference
ltx-quantize   Re-quantize GGUF files

Models

Option A – Download pre-quantised GGUFs (recommended)

pip install huggingface_hub          # for hf_hub_download

./models.sh                          # Dev DiT (default) + T5 + VAE + extras
./models.sh --distilled              # Distilled DiT (few-step) instead of dev
./models.sh --quant Q4_K_M           # smaller, faster
./models.sh --all                    # every quant (dev or distilled)

Downloads three GGUF files into models/:

File                  Contents                Size (Q8_0)
ltxv-2b-*-Q8_0.gguf   Video DiT (2B params)   ~2.1 GB
ltxv-vae-Q8_0.gguf    CausalVideoVAE          ~400 MB
t5-xxl-Q8_0.gguf      T5-XXL text encoder     ~4.6 GB

LTX-2.3 (22B): all files come from unsloth/LTX-2.3-GGUF. The DiT lives at the repo root (dev) or under distilled/, the VAE under vae/ (video + audio safetensors), and the text encoders under text_encoders/ (embeddings_connectors for Gemma). Use ./models.sh for the dev DiT (default) or ./models.sh --distilled for the distilled DiT plus matching VAE and connectors. See docs/LTX_COMFY_REFERENCE.md for the full file list.

Option B – Convert from safetensors

pip install gguf safetensors transformers

./checkpoints.sh                     # download raw HF checkpoints

python3 convert.py --model dit \
    --input  checkpoints/ltxv-2b-0.9.6-dev.safetensors \
    --output models/ltxv-2b-BF16.gguf

python3 convert.py --model vae \
    --input  checkpoints/ltxv-vae.safetensors \
    --output models/ltxv-vae-BF16.gguf

python3 convert.py --model t5 \
    --input  checkpoints/t5-xxl/ \
    --output models/t5-xxl-BF16.gguf

./quantize.sh Q8_0                   # BF16 → Q8_0

Quick Start

Text-to-video

mkdir -p output

./build/ltx-generate \
    --dit    models/ltxv-2b-0.9.6-dev-Q8_0.gguf \
    --vae    models/ltxv-vae-Q8_0.gguf \
    --t5     models/t5-xxl-Q8_0.gguf \
    --prompt "A peaceful waterfall in a lush forest, cinematic, 4K" \
    --frames 25 \
    --height 480 --width 704 \
    --steps  40  --cfg 3.0  --shift 3.0 \
    --seed   42  --out output/frame

Image-to-video (I2V): animate a reference image

Provide a PNG, JPG, BMP, TGA, or PPM image as --start-frame. The video will start from (and be strongly conditioned on) that image and animate from there based on the prompt. No conversion step is needed β€” standard image formats are supported natively.

./build/ltx-generate \
    --dit    models/ltxv-2b-0.9.6-dev-Q8_0.gguf \
    --vae    models/ltxv-vae-Q8_0.gguf \
    --t5     models/t5-xxl-Q8_0.gguf \
    --prompt "Camera slowly pans right, birds fly overhead" \
    --start-frame photo.jpg \
    --frames 25 --height 480 --width 704 \
    --steps 40 --cfg 3.0 --out output/frame

Keyframe interpolation: animate between two images

Provide both --start-frame and --end-frame to generate a video that transitions smoothly from the first image to the last.

./build/ltx-generate \
    --dit    models/ltxv-2b-0.9.6-dev-Q8_0.gguf \
    --vae    models/ltxv-vae-Q8_0.gguf \
    --t5     models/t5-xxl-Q8_0.gguf \
    --prompt "A serene forest scene, gentle breeze, cinematic" \
    --start-frame beginning.png \
    --end-frame   ending.png \
    --frames 33 --height 480 --width 704 \
    --steps 40 --cfg 3.0 --out output/frame

Use --frame-strength (0..1) to control how strongly the reference frame(s) constrain the generation. Default is 1.0 (fully pinned). Lower values give the model more creative freedom around the reference.

Supported input image formats: PNG, JPEG/JPG, BMP, TGA, PPM/PGM (powered by stb_image; no additional libraries required).

Convert the PPM output frames to MP4:

ffmpeg -framerate 24 -i output/frame_%04d.ppm -c:v libx264 -pix_fmt yuv420p output.mp4

Audio-video (AV): video + WAV from the same DiT

The LTX 2.3 GGUF DiT is a full audio-video model: it expects concatenated video + audio latent tokens and outputs both. Use --av to run the full AV path (same denoise loop, then decode video and synthesize audio).

./build/ltx-generate \
    --dit    models/ltx-2.3-22b-dev-Q4_K_M.gguf \
    --vae    models/ltx-2.3-22b-dev_video_vae.safetensors \
    --t5     models/t5-xxl-Q8_0.gguf \
    --av --out output/av --out-wav output/av.wav \
    --prompt "Ocean waves, seagulls, wind" \
    --frames 25 --height 480 --width 704 --steps 20 --cfg 4.0

You get output/av_0000.ppm … and output/av.wav. Mux video + audio with ffmpeg:

ffmpeg -framerate 24 -i output/av_%04d.ppm -i output/av.wav -c:v libx264 -c:a aac -shortest output_av.mp4

Design details (token concat, shapes, audio VAE): docs/AV_PIPELINE.md.
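The concat-then-split flow can be sketched as follows (illustrative Python; token layout and helper names are assumptions, see docs/AV_PIPELINE.md for the real shapes):

```python
import numpy as np

def av_denoise_step(dit, video_tokens, audio_tokens, text_emb, t):
    """One AV step, as described above: the DiT sees a single concatenated
    token sequence, and its output is split back into video and audio parts.

    `dit(tokens, emb, t)` is a hypothetical stand-in for the transformer
    forward pass; real token counts and dims come from the latent shapes.
    """
    n_video = video_tokens.shape[0]
    tokens = np.concatenate([video_tokens, audio_tokens], axis=0)  # [Nv+Na, D]
    v = dit(tokens, text_emb, t)                                   # same shape out
    return v[:n_video], v[n_video:]                                # split back
```

The video half then goes to the CausalVideoVAE decoder and the audio half to WAV synthesis.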


Command-Line Reference

ltx-generate [options]

Required:
  --dit    <path>   DiT model GGUF file
  --vae    <path>   VAE decoder GGUF file
  --t5     <path>   T5 text encoder GGUF file

Generation:
  --prompt  <text>  Positive text prompt
  --neg     <text>  Negative prompt (default: empty)
  --frames  <N>     Number of output video frames   (default: 25)
  --height  <H>     Frame height in pixels           (default: 480)
  --width   <W>     Frame width in pixels            (default: 704)
  --steps   <N>     Denoising steps                  (default: 40)
  --cfg     <f>     Classifier-free guidance scale   (default: 3.0)
  --shift   <f>     Flow-shift parameter             (default: 3.0)
  --seed    <N>     RNG seed                         (default: 42)
  --out     <pfx>   Output frame file prefix         (default: output/frame)

Audio-video (AV) pipeline:
  --av              Enable audio+video (concat latent → DiT → split → decode both)
  --audio-vae <path>  Audio VAE safetensors (optional with --av; for full decoder when implemented)
  --out-wav  <path>   Output WAV path (default: <out prefix>.wav when --av)

Image-to-video (I2V) conditioning:
  --start-frame  <path>  PNG/JPG/BMP/TGA/PPM image: animate from this reference frame
  --end-frame    <path>  PNG/JPG/BMP/TGA/PPM image: end at this frame (keyframe interp)
  --frame-strength <f>   Conditioning strength [0..1]  (default: 1.0)
                          1.0 = fully pin frame, 0.5 = soft guidance

Performance:
  --threads <N>     CPU worker threads               (default: 4)
  -v                Verbose logging per step

Architecture

Text-to-video

Text prompt
    │
    ▼
T5-XXL encoder          (GGUF: t5-xxl-*.gguf)
    │  [seq_len × 4096 embeddings]
    │
    ▼
LTX-Video DiT           (GGUF: ltxv-2b-*.gguf)
  ┌─────────────────────────────────────────┐
  │  Random noise latent                    │
  │  [T_lat × H_lat × W_lat × 128]          │
  │       │                                 │
  │  ┌────┴──────────────────────────┐      │
  │  │  N × Transformer block        │      │
  │  │    self-attn  (3D RoPE)       │      │
  │  │    cross-attn (text cond.)    │      │
  │  │    FFN (SwiGLU)               │      │
  │  │    AdaLN (timestep cond.)     │      │
  │  └──────────────────────────┬────┘      │
  │  Euler ODE (flow matching)  │           │
  └─────────────────────────────┼───────────┘
    │  [T_lat × H_lat × W_lat × 128]
    ▼
CausalVideoVAE decoder  (GGUF: ltxv-vae-*.gguf)
    │  [T_vid × H_vid × W_vid × 3] pixels
    ▼
PPM frames  →  ffmpeg  →  MP4
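The Euler ODE step pairs classifier-free guidance with a flow-shifted timestep schedule. A minimal sketch of one plausible form (illustrative Python, not this repo's C++ code; `dit` and the exact shift warp are assumptions):

```python
import numpy as np

def euler_cfg_denoise(dit, latent, text_emb, null_emb, steps=40, cfg=3.0, shift=3.0):
    """Illustrative flow-matching Euler loop with classifier-free guidance.

    `dit(latent, emb, t)` is a hypothetical stand-in for the transformer's
    velocity prediction; the shift warp shown is one common LTX-style
    schedule, not necessarily this repo's exact one.
    """
    # Uniform timesteps 1 -> 0, warped by the flow-shift parameter:
    # t' = shift*t / (1 + (shift-1)*t) spends more steps near t = 1 (noisy end).
    ts = np.linspace(1.0, 0.0, steps + 1)
    ts = shift * ts / (1.0 + (shift - 1.0) * ts)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        v_cond = dit(latent, text_emb, t)         # velocity with text conditioning
        v_uncond = dit(latent, null_emb, t)       # velocity with empty prompt
        v = v_uncond + cfg * (v_cond - v_uncond)  # classifier-free guidance
        latent = latent + (t_next - t) * v        # Euler step along the flow
    return latent
```

With --cfg 1.0 the two DiT evaluations per step collapse to one, which is why distilled (few-step) models are much cheaper to run.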

Image-to-video (I2V) / Keyframe interpolation

Reference image(s) (PNG/JPG/BMP/TGA/PPM)
    │
    ▼
VaeEncoder.encode_frame()     pixel [H×W×3] → latent [H_lat×W_lat×128]
    │  start_lat / end_lat
    │
    ├───────────────────────────────────────────────┐
    ▼                                               ▼
Random noise latent                        Frame conditioning
[T_lat × H_lat × W_lat × 128]              per denoising step:
    │                                        lat[T=0]  ← blend(start_lat, t)
    │  Denoising loop (same as T2V)          lat[T=-1] ← blend(end_lat,   t)
    │        +
    │  frame-pinning after each Euler step
    ▼
VAE decode + PPM output

The conditioning blend weight increases as the timestep approaches 0 (clean signal), so early steps use mostly noise for global structure while later steps are progressively more pinned to the reference image(s).
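One plausible form of that per-step blend (illustrative Python; `blend_weight`, `pin_frames`, and the linear schedule are assumptions, not this repo's identifiers):

```python
import numpy as np

def blend_weight(t, frame_strength=1.0):
    """Illustrative pinning weight: near 0 at t = 1 (pure noise, global
    structure forming), rising toward `frame_strength` at t = 0 (clean)."""
    return frame_strength * (1.0 - t)

def pin_frames(latent, start_lat, end_lat, t, frame_strength=1.0):
    """After each Euler step, blend the first/last temporal latent frames
    toward the encoded reference frames, a sketch of the scheme above.

    latent: [T_lat, ...] array; start_lat/end_lat: single-frame latents.
    """
    w = blend_weight(t, frame_strength)
    latent[0] = (1 - w) * latent[0] + w * start_lat      # lat[T=0]
    if end_lat is not None:
        latent[-1] = (1 - w) * latent[-1] + w * end_lat  # lat[T=-1]
    return latent
```

Lowering --frame-strength scales the whole schedule down, which is what gives the model more freedom around the reference image(s).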

Dimension   Formula
T_lat       (frames − 1) ÷ 4 + 1
H_lat       height ÷ 8
W_lat       width ÷ 8
T_vid       (T_lat − 1) × 4 + 1
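As a quick sanity check, applying the formulas above to the quick-start settings (25 frames at 704×480):

```python
def latent_dims(frames, height, width):
    """Latent grid sizes per the table above: temporal stride 4 (causal,
    first frame unpaired), spatial stride 8 in each direction."""
    t_lat = (frames - 1) // 4 + 1
    h_lat = height // 8
    w_lat = width // 8
    t_vid = (t_lat - 1) * 4 + 1   # frames the VAE reconstructs
    return t_lat, h_lat, w_lat, t_vid

print(latent_dims(25, 480, 704))  # -> (7, 60, 88, 25)
```

This is why --frames values of the form 4k+1 (25, 33, ...) round-trip exactly through the latent space.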
