Replace naive whitespace tokenizer with SentencePiece unigram (Viterbi + greedy longest-match) #1

Merged: lmangani merged 8 commits into main from copilot/add-video-inference-tools on Mar 17, 2026

Copilot AI commented Mar 17, 2026

The T5 tokenizer was a whitespace split with a per-character fallback; it mishandled subwords, multi-byte UTF-8, and any token not aligned with word boundaries. This PR replaces it with a SentencePiece unigram implementation and fixes two supporting gaps.

src/t5_encoder.hpp — tokenizer rewrite

  • preprocess(): collapses whitespace, strips leading/trailing whitespace, prepends ▁, and replaces spaces with ▁, matching SentencePiece normalization
  • viterbi(): Viterbi DP over byte positions maximizing the sum of unigram log-probs; activated when tokenizer.ggml.scores is present in the GGUF
  • greedy(): greedy longest-match scan; fallback when scores are absent, and it still yields subword pieces where the old approach fell back to single characters
  • utf8_char_len(): the unk fallback advances one full UTF-8 character (not one byte) to avoid corrupting subsequent lookups
  • tok2id switched from std::map to std::unordered_map (O(log n) → average O(1) lookup)

"A cinematic shot" (old) → ["▁A", "▁", "c", "i", "n", …]  ← per-char fallback
"A cinematic shot" (new) → ["▁A", "▁cinematic", "▁shot"]   ← greedy/Viterbi subword

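The four tokenizer pieces listed above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the code in src/t5_encoder.hpp: the `Vocab` map (piece to log-probability), the `<unk>` label, the `max_len` byte limit, the `unk_score` penalty, and all function signatures are hypothetical and chosen only to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical vocabulary for illustration: piece -> unigram log-probability.
using Vocab = std::unordered_map<std::string, float>;
static const std::string kSpace = "\xE2\x96\x81";  // U+2581, the SentencePiece space marker
static const std::string kUnk = "<unk>";           // assumed label for unknown characters

// Byte length of the UTF-8 character whose lead byte is c (1 for ASCII or invalid bytes).
inline std::size_t utf8_char_len(unsigned char c) {
    if ((c >> 5) == 0x06) return 2;
    if ((c >> 4) == 0x0E) return 3;
    if ((c >> 3) == 0x1E) return 4;
    return 1;
}

// Collapse whitespace runs, trim both ends, and mark each word start with U+2581.
inline std::string preprocess(const std::string& text) {
    std::string out;
    bool at_word_start = true;  // true initially so the first word gets the marker
    for (char c : text) {
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') { at_word_start = true; continue; }
        if (at_word_start) { out += kSpace; at_word_start = false; }
        out += c;
    }
    return out;
}

// Greedy longest-match: at each byte offset take the longest piece found in the vocabulary.
inline std::vector<std::string> greedy(const std::string& s, const Vocab& vocab,
                                       std::size_t max_len) {
    assert(max_len > 0);
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = std::min(max_len, s.size() - i);
        while (len > 0 && vocab.count(s.substr(i, len)) == 0) --len;
        if (len == 0) {  // no piece matches: emit unk and skip one whole character
            out.push_back(kUnk);
            i += utf8_char_len(static_cast<unsigned char>(s[i]));
        } else {
            out.push_back(s.substr(i, len));
            i += len;
        }
    }
    return out;
}

// Viterbi: best[j] is the highest total log-probability of any segmentation of s[0, j).
inline std::vector<std::string> viterbi(const std::string& s, const Vocab& vocab,
                                        std::size_t max_len, float unk_score = -100.0f) {
    const std::size_t n = s.size();
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> best(n + 1, neg_inf);
    std::vector<std::size_t> prev(n + 1, 0);
    std::vector<bool> via_unk(n + 1, false);
    best[0] = 0.0f;
    auto relax = [&](std::size_t i, std::size_t j, float score, bool unk) {
        if (best[i] + score > best[j]) { best[j] = best[i] + score; prev[j] = i; via_unk[j] = unk; }
    };
    for (std::size_t i = 0; i < n; ++i) {
        if (best[i] == neg_inf) continue;  // offset not reachable by any segmentation
        for (std::size_t len = 1; len <= std::min(max_len, n - i); ++len) {
            auto it = vocab.find(s.substr(i, len));
            if (it != vocab.end()) relax(i, i + len, it->second, false);
        }
        // Unknown fallback keeps every character boundary reachable.
        relax(i, std::min(n, i + utf8_char_len(static_cast<unsigned char>(s[i]))), unk_score, true);
    }
    std::vector<std::string> out;
    for (std::size_t j = n; j > 0; j = prev[j])  // walk back-pointers from the end
        out.push_back(via_unk[j] ? kUnk : s.substr(prev[j], j - prev[j]));
    std::reverse(out.begin(), out.end());
    return out;
}
```

The two modes can disagree: with a toy vocabulary of ab, abc, c, cd, and d, greedy takes the longest prefix abc and is then forced into d, while Viterbi can choose ab + cd if that pair has the higher total score. That difference is presumably why Viterbi is preferred whenever scores are available.
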
src/ltx-quantize.cpp — array KV preservation

The manual KV-copy switch silently dropped all GGUF_TYPE_ARRAY entries (tokenizer vocab, scores). Replaced with gguf_set_kv(out_ctx, src.gguf_ctx) — copies all KV pairs including arrays in one call.
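
The failure mode described above can be reproduced with a toy model. The sketch below does not use the ggml API; `Value`, `Metadata`, `copy_known_types`, and `copy_all` are hypothetical stand-ins that only show how a per-type switch with no array case drops array entries without any error, whereas a copy that does not inspect types keeps them.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Toy stand-in for GGUF metadata; these types are illustrative, not the ggml API.
using Value = std::variant<std::int32_t, float, std::string,
                           std::vector<std::string>, std::vector<float>>;
using Metadata = std::map<std::string, Value>;

// Per-type copy that only knows about scalar and string entries.
inline Metadata copy_known_types(const Metadata& src) {
    Metadata out;
    for (const auto& [key, value] : src) {
        switch (value.index()) {
            case 0:  // int32
            case 1:  // float
            case 2:  // string
                out.emplace(key, value);
                break;
            default:  // array entries land here and are silently skipped
                break;
        }
    }
    return out;
}

// Copy every entry regardless of type, the behaviour the PR attributes to gguf_set_kv().
inline Metadata copy_all(const Metadata& src) {
    return src;
}
```

In this model a source with one string entry and two array entries comes out of `copy_known_types` with a single key, which mirrors a quantized file losing its vocab and scores.
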

convert.py — scores round-trip

convert_t5 --tokenizer now writes tokenizer.ggml.scores via tok.sp_model.GetScore(i) using writer.add_token_scores(). GGUFs produced by the converter will carry scores, automatically enabling Viterbi mode in the C++ tokenizer.

DEV.md

Tokenizer and quantizer-metadata entries in the known-limitations table marked as resolved; §8 updated with Viterbi/greedy mode description and GGUF key table.

Original prompt

Inspired by llama.cpp, ltx.cpp is a portable C++17 implementation of LTX 2.3 generation using GGML/GGUF.

Our references are:

GOAL: ltx.cpp tools to run video inference using the GGUF models without Python or other dependencies, replicating what the reference ComfyUI workflow is capable of, but stand-alone.



Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add ltx.cpp tools for video inference with GGUF models" to "feat: ltx.cpp — portable C++17 LTX-Video 2.3 inference via GGML/GGUF" on Mar 17, 2026
Copilot AI requested a review from lmangani March 17, 2026 11:02
Copilot AI and others added 2 commits March 17, 2026 11:12
Copilot AI changed the title from "feat: ltx.cpp — portable C++17 LTX-Video 2.3 inference via GGML/GGUF" to "feat: PNG/JPG/BMP/TGA image input for I2V start/end frame conditioning" on Mar 17, 2026
Copilot AI changed the title from "feat: PNG/JPG/BMP/TGA image input for I2V start/end frame conditioning" to "feat: start/end frame I2V conditioning + PNG/JPG input support + DEV.md" on Mar 17, 2026
Copilot AI and others added 2 commits March 17, 2026 11:59
…-match)

Copilot AI changed the title from "feat: start/end frame I2V conditioning + PNG/JPG input support + DEV.md" to "Replace naive whitespace tokenizer with SentencePiece unigram (Viterbi + greedy longest-match)" on Mar 17, 2026
lmangani marked this pull request as ready for review March 17, 2026 14:52
lmangani merged commit 04a047c into main Mar 17, 2026
2 checks passed