AI & Tech·Jun 8, 2026

mtmd: add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp

r/LocalLLaMA · Jun 8 · 3 min read · Single source


Overview

Fix #18389

Goals of this PR:

- Allow video file input via mtmd-cli and via /chat/completions (which automatically enables it on the web UI).
- Invoke ffmpeg via a subprocess (NOT pre-bundled; users need to install it manually) --> this avoids tricky legal problems with linking against proprietary video codecs, see: https://www.ffmpeg.org/legal.html
- Only take image input into account for now; audio input is easy to implement in the future.
- Be model-agnostic --> the prompt format is not specific to any model, but that can be improved in the future if needed.

NON-goals (please do not ask about these, I already explained):

- Using a custom video decoder, as suggested in "mtmd: plan to add video input support #18389 (comment)" --> out of scope; this implementation is already at the mtmd-helper level, so it is trivial for downstream code to link against libmtmd and then provide a custom video handler.
  Edit: we could also allow "probing" multiple programs to see if an alternative to ffmpeg is installed on the system, but that is still out of scope for the current PR.
- No audio for now --> planned for a future iteration.
- Avoiding storing all video frames in memory before decoding --> needs yet another refactoring, planned for the future.
- 3D conv frame "merging" (qwen-vl-based models) --> already supported via "mtmd: support "frame merge" for qwen-vl-based models #21858".

TODO in future PRs:

- Add --video-ffmpeg-path and --video-fps arguments --> I already have a branch locally and will push it after this PR is merged.
- Optimize memory usage --> I need to study further what the best approach is.

Design choices

This implementation splits into 2 main parts:

- mtmd_bitmap_init_lazy
- mtmd_helper_video_context

Upon receiving a new video file:

- mtmd_helper_bitmap_init_from_file is called and tries to decode the file as audio/image/video.
- If a video is detected, an mtmd_helper_video_context is created.
- mtmd_bitmap_init_lazy creates a new "lazy" bitmap; its callback gets a new bitmap/text each time it is called.
- Upon the mtmd_tokenize() call, the callback is called and returns the list of bitmaps and text (timestamps) in the correct order.

Note about mtmd_bitmap_init_lazy

mtmd_bitmap_init_lazy is not an addition, but it is important because it lets downstream code (server/CLI) change as little as possible while still supporting video input.

In the input prompt, an audio or image input requires a marker to identify its placement inside the prompt. The logic is different for video, however: a video can be "expanded" into multiple markers (multiple images, multiple audio chunks) and text prompts (timestamps), so we would need to know the number of markers beforehand. This is possible, but very complicated if done purely at the mtmd-helper level.

The logic of mtmd_bitmap_init_lazy is simple:

- An input media item identifies its placement in the prompt via a single marker (usually the default one).
- A callback is provided, and it is called repeatedly during the tokenize call. This way, we can "expand" one single input bitmap into multiple media chunks.

On server and CLI, since each marker == a file, this makes the code trivial to implement; almost no changes are required.

Testing

A short clip, tools/mtmd/test-3.mp4, is added. It is an extract from Blender's Agent 327, trimmed and compressed using HandBrake. I selected this 10s clip because it is fast-moving action, which lets the test check whether the model can actually see the movement.

- On CLI (tested with Qwen3-VL-2B)
- On webui (tested with gemma-4-E4B)

Requirements

- I have read and agree with the contributing guidelines
- AI usage disclosure: most of the ffmpeg invocation code was written by AI; the rest is hand-written
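
The marker-expansion idea described under "Note about mtmd_bitmap_init_lazy" can be illustrated with a small language-neutral model. The sketch below is written in Python for brevity and is not the mtmd C API: the marker string, the chunk types, and the functions make_video_source and tokenize are all hypothetical names chosen for illustration. It only shows how a single marker in the prompt can expand into an interleaved run of timestamp text and frames when a callback is polled until it is exhausted.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Union

# Hypothetical placeholder marker; not necessarily the string mtmd uses.
MARKER = "<media>"


@dataclass
class TextChunk:
    text: str


@dataclass
class ImageChunk:
    frame_id: int  # stands in for decoded pixel data


Chunk = Union[TextChunk, ImageChunk]
# A lazy source returns the next chunk on each call, or None when exhausted.
LazySource = Callable[[], Optional[Chunk]]


def make_video_source(n_frames: int, fps: float) -> LazySource:
    """Toy stand-in for a video context: emits a timestamp, then a frame."""
    def generate() -> Iterator[Chunk]:
        for i in range(n_frames):
            yield TextChunk(f"[{i / fps:.1f}s]")
            yield ImageChunk(frame_id=i)

    it = generate()
    return lambda: next(it, None)


def tokenize(prompt: str, sources: List[LazySource]) -> List[Chunk]:
    """Split the prompt on MARKER and expand each marker via its lazy source."""
    parts = prompt.split(MARKER)
    if len(parts) - 1 != len(sources):
        raise ValueError("number of markers must match number of media inputs")
    chunks: List[Chunk] = []
    for i, part in enumerate(parts):
        if part:
            chunks.append(TextChunk(part))
        if i < len(sources):
            # Poll the callback repeatedly until it signals exhaustion.
            while (chunk := sources[i]()) is not None:
                chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    video = make_video_source(n_frames=2, fps=1.0)
    for c in tokenize(f"Describe {MARKER} briefly", [video]):
        print(c)
```

In this model the caller still supplies exactly one marker per file, and the number of chunks the video produces never has to be known in advance, which matches the motivation given in the PR for keeping server and CLI changes minimal.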

