Replace naive whitespace tokenizer with SentencePiece unigram (Viterbi + greedy longest-match) by Copilot · Pull Request #1 · audiohacking/ltx.cpp

Copilot · 2026-03-17T10:42:33Z

The T5 tokenizer was a whitespace split with a per-character fallback, which mishandled subwords, multi-byte UTF-8, and any token not aligned with word boundaries. This PR replaces it with a SentencePiece unigram implementation and fixes two supporting gaps.

`src/t5_encoder.hpp` — tokenizer rewrite

- preprocess(): collapses whitespace runs, strips leading/trailing whitespace, prepends ▁, and replaces spaces with ▁, matching SentencePiece normalization
- viterbi(): Viterbi DP over byte positions that maximizes the sum of unigram log-probs; used when tokenizer.ggml.scores is present in the GGUF
- greedy(): greedy longest-match scan; the fallback when scores are absent, which still matches whole subwords where the old tokenizer fell back to single characters
- utf8_char_len(): the unk fallback advances one full UTF-8 character (not one byte) to avoid corrupting subsequent lookups
- tok2id switched from std::map → std::unordered_map (O(log n) → average O(1) lookup)
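The normalization and the score-free fallback path can be sketched roughly as follows. This is a minimal illustration, not the code in `src/t5_encoder.hpp`: the signatures, the `kMeta` constant, the `max_token_bytes` parameter, and the `unk_id` argument are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// All names here are illustrative; this is not the code in src/t5_encoder.hpp.
static const std::string kMeta = "\xE2\x96\x81";  // U+2581 (▁) encoded as UTF-8

// Byte length of the UTF-8 character whose lead byte is c.
static size_t utf8_char_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c >> 5) == 0x06) return 2;  // 110xxxxx
    if ((c >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((c >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;                        // stray continuation byte: skip one byte
}

// Collapse whitespace runs, trim both ends, and mark each word start with ▁.
static std::string preprocess(const std::string& text) {
    std::string out;
    bool at_word_start = true;  // true initially so the first word gets its ▁ prefix
    for (unsigned char c : text) {
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            at_word_start = true;  // trailing whitespace never emits anything
        } else {
            if (at_word_start) out += kMeta;
            at_word_start = false;
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Greedy longest-match: at each position take the longest vocabulary entry.
static std::vector<int> greedy(const std::string& s,
                               const std::unordered_map<std::string, int>& tok2id,
                               size_t max_token_bytes, int unk_id) {
    assert(max_token_bytes > 0);
    std::vector<int> ids;
    size_t pos = 0;
    while (pos < s.size()) {
        size_t len = std::min(max_token_bytes, s.size() - pos);
        for (; len > 0; --len) {
            auto it = tok2id.find(s.substr(pos, len));
            if (it != tok2id.end()) { ids.push_back(it->second); break; }
        }
        if (len == 0) {  // nothing matched: emit unk and skip one whole character
            ids.push_back(unk_id);
            len = std::min(utf8_char_len(static_cast<unsigned char>(s[pos])),
                           s.size() - pos);
        }
        pos += len;
    }
    return ids;
}
```

Capping the scan at the longest vocabulary entry (`max_token_bytes` in this sketch) keeps the inner loop bounded, and skipping a whole character on a miss means the next lookup never starts in the middle of a multi-byte sequence.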

"A cinematic shot" (old) → ["▁A", "▁", "c", "i", "n", …]  ← per-char fallback
"A cinematic shot" (new) → ["▁A", "▁cinematic", "▁shot"]   ← greedy/Viterbi subword
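When scores are available, the segmentation is chosen by dynamic programming instead of a local longest match. The sketch below shows one way such a Viterbi pass over byte positions could look. The `Piece` struct, the `unk_score` penalty, and the function signature are assumptions for illustration and are not taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical vocabulary entry: token id plus unigram log-probability.
struct Piece { int id; float score; };

// Byte length of the UTF-8 character whose lead byte is c.
static size_t utf8_char_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c >> 5) == 0x06) return 2;
    if ((c >> 4) == 0x0E) return 3;
    if ((c >> 3) == 0x1E) return 4;
    return 1;
}

// best[j] holds the highest total score of any segmentation of s[0..j).
static std::vector<int> viterbi(const std::string& s,
                                const std::unordered_map<std::string, Piece>& vocab,
                                size_t max_token_bytes, int unk_id, float unk_score) {
    const size_t n = s.size();
    const float kNegInf = -std::numeric_limits<float>::infinity();
    std::vector<float> best(n + 1, kNegInf);
    std::vector<size_t> prev(n + 1, 0);   // start offset of the token ending at j
    std::vector<int> tok(n + 1, unk_id);  // id of the token ending at j
    best[0] = 0.0f;

    auto relax = [&](size_t from, size_t to, float score, int id) {
        const float cand = best[from] + score;
        if (cand > best[to]) { best[to] = cand; prev[to] = from; tok[to] = id; }
    };

    for (size_t i = 0; i < n; ++i) {
        if (best[i] == kNegInf) continue;  // unreachable, e.g. mid-character
        const size_t limit = std::min(max_token_bytes, n - i);
        for (size_t len = 1; len <= limit; ++len) {
            auto it = vocab.find(s.substr(i, len));
            if (it != vocab.end()) relax(i, i + len, it->second.score, it->second.id);
        }
        // unk edge over one whole character, so every input has at least one path
        const size_t clen = std::min(utf8_char_len(static_cast<unsigned char>(s[i])), n - i);
        relax(i, i + clen, unk_score, unk_id);
    }

    std::vector<int> ids;  // walk back from the end, then reverse
    for (size_t j = n; j > 0; j = prev[j]) ids.push_back(tok[j]);
    std::reverse(ids.begin(), ids.end());
    return ids;
}
```

With a toy vocabulary where "ab" and "c" each score -1 and "abc" scores -5, this pass segments "abc" as "ab" + "c" (total -2), whereas a greedy longest-match scan would pick "abc". That difference is the reason the scores are worth carrying in the GGUF.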

`src/ltx-quantize.cpp` — array KV preservation

The manual KV-copy switch silently dropped all GGUF_TYPE_ARRAY entries (tokenizer vocab, scores). Replaced with gguf_set_kv(out_ctx, src.gguf_ctx) — copies all KV pairs including arrays in one call.

`convert.py` — scores round-trip

convert_t5 --tokenizer now writes tokenizer.ggml.scores via tok.sp_model.GetScore(i) using writer.add_token_scores(). GGUFs produced by the converter will carry scores, automatically enabling Viterbi mode in the C++ tokenizer.

`DEV.md`

Tokenizer and quantizer-metadata entries in the known-limitations table marked as resolved; §8 updated with Viterbi/greedy mode description and GGUF key table.

Original prompt

Inspired by llama.cpp, ltx.cpp is a portable C++17 implementation of LTX 2.3 generation using GGML/GGUF.

Our references are:

ComfyUI Unsloth LTX 2.3 workflow: https://gist.github.com/lmangani/e441d3b57e3fe623b67e4490b5a77c33

Unsloth LTX 2.3 GGUF models: https://huggingface.co/unsloth/LTX-2.3-GGUF

acestep.cpp an excellent reference for GGML: https://github.com/ServeurpersoCom/acestep.cpp

GOAL: ltx.cpp tools to run video inference using the GGUF models without Python or other dependencies, replicating what the reference ComfyUI workflow is capable of, but stand-alone.


Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>


Initial plan

a884b0e

Copilot AI assigned Copilot and lmangani Mar 17, 2026

Copilot started work on behalf of lmangani March 17, 2026 10:42

Implement ltx.cpp: portable C++17 LTX-Video inference with GGML/GGUF

0ebb322

Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add ltx.cpp tools for video inference with GGUF models~~ feat: ltx.cpp — portable C++17 LTX-Video 2.3 inference via GGML/GGUF Mar 17, 2026

Copilot AI requested a review from lmangani March 17, 2026 11:02

Copilot finished work on behalf of lmangani March 17, 2026 11:02

Copilot started work on behalf of lmangani March 17, 2026 11:03

Copilot AI and others added 2 commits March 17, 2026 11:12

Plan: add PNG/JPG support via stb_image

96d625f

Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>

feat: PNG/JPG/BMP/TGA image input support for I2V via stb_image

37a9f6b

Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>

Copilot AI changed the title ~~feat: ltx.cpp — portable C++17 LTX-Video 2.3 inference via GGML/GGUF~~ feat: PNG/JPG/BMP/TGA image input for I2V start/end frame conditioning Mar 17, 2026

Copilot finished work on behalf of lmangani March 17, 2026 11:19

Copilot started work on behalf of lmangani March 17, 2026 11:32

docs: add DEV.md developer guide

02c2048

Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>

Copilot AI changed the title ~~feat: PNG/JPG/BMP/TGA image input for I2V start/end frame conditioning~~ feat: start/end frame I2V conditioning + PNG/JPG input support + DEV.md Mar 17, 2026

Copilot finished work on behalf of lmangani March 17, 2026 11:43

Copilot started work on behalf of lmangani March 17, 2026 11:45

Copilot AI and others added 2 commits March 17, 2026 11:59

plan: proper SentencePiece unigram tokenizer

0f38fb7

Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>

fix: proper SentencePiece unigram tokenizer (Viterbi + greedy longest…

e24bbc9

…-match) Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>

Copilot AI changed the title ~~feat: start/end frame I2V conditioning + PNG/JPG input support + DEV.md~~ Replace naive whitespace tokenizer with SentencePiece unigram (Viterbi + greedy longest-match) Mar 17, 2026

Copilot finished work on behalf of lmangani March 17, 2026 12:09

Update build.yml

11789dd

lmangani marked this pull request as ready for review March 17, 2026 14:52

lmangani merged commit 04a047c into main Mar 17, 2026
2 checks passed

lmangani merged 8 commits into main from copilot/add-video-inference-tools
