Guillem Rodríguez Pascual

engineering student, entrepreneur and tech enthusiast

Enhancing Video Understanding Capabilities in Next-Generation LLMs

@guillehub - 16/12/2024

LLMs are evolving fast, really fast, and their capabilities are being pushed to their limits. In the last couple of years, we have seen a clear race between major AI labs trying to claim pole position with new features and models. Meanwhile, we, as customers, are benefiting from it.


Multimodality as the New Frontier


The final sprint of 2023 introduced a new paradigm to the public: multimodality, as OpenAI released image comprehension in their powerful GPT-4 model. This new feature allowed the model to "see" and "understand" our surroundings and solve problems that are difficult to explain with words alone. Indie hackers like me were really excited to probe what this feature could and couldn't comprehend, and it performed well. It was able to understand everyday situations and problems; this feature was a real game-changer.


You may also wonder how this works. I won't go into exhaustive detail, but it's really interesting to understand, because it shows how you can optimize the input to get better results.


The common AI models for image identification and classification, like the models that extract text from images (OCR) or those that classify and distinguish specific objects, often use a convolutional neural network under the hood. Imagine these neural networks as a system that analyzes the image in small sections, searching for patterns. With these patterns, the network can understand the image in a way that allows it to identify specific objects or features within it.


Convolutional Neural Network (CNN)

Source: https://arxiv.org/pdf/1511.08458 - Paper: An Introduction to Convolutional Neural Networks


And if you were wondering, no, LLMs don't use convolutional neural networks to understand images. They take a different approach with their core technology, transformers, in the form of Vision Transformers (ViT). This technology splits the image into small patches, and each patch is converted into a vector of numbers that represents that part of the image, along with positional data. This positional information allows the model to understand the relative coordinates of each patch within the context of the entire image. The transformer then processes these vectors in parallel, capturing dependencies and patterns between patches, which enables it to make sense of the image as a whole, just as transformers do with text in natural language processing.
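The patching step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not a real ViT: an actual model would then project each flattened patch through a learned linear layer and add learned positional embeddings, rather than using raw pixel values and bare indices.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an image (H, W, C) into flattened non-overlapping patches,
    the way a Vision Transformer tokenizes its input."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)          # group pixels by patch
        .reshape(-1, patch_size * patch_size * c)
    )
    # Positional information: the index of each patch in reading order.
    positions = np.arange(patches.shape[0])
    return patches, positions

# A 224x224 RGB image becomes 196 patches of 16*16*3 = 768 values each.
patches, positions = image_to_patches(np.zeros((224, 224, 3)), patch_size=16)
```

Each of those 196 vectors plays the same role for the transformer as a word token does in text: the attention layers can then relate any patch to any other, regardless of where they sit in the image.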


Why Is Video Still in Its Early Stages?


With the progress of multimodality, the next logical step I thought companies would take was video analysis in all their LLM products, but surprisingly, most of them didn't. My guess is that video processing is really expensive in terms of computation, so they decided to hold off. This meant that if people wanted to send a video to an AI, they had to split the video into a small number of frames and send those as images to the model. However, that wasn't ideal in terms of UX, and a lot of information was lost due to the limit on how many images you could send to, for example, ChatGPT.


The only company that took the risk with video was Google, with their Gemini 1.0 Pro Vision model, and to be honest, it was quite bad. It used to hallucinate a lot of details, videos were limited to two minutes, and the model's answers weren't great. But it set a precedent, and Google wasn't going to surrender that easily. They kept upgrading their models in terms of reasoning capabilities and visual understanding. I remember when they launched Gemini, some reviews claimed that all Google did was the same thing I mentioned previously, splitting the video into images and sending random frames to the model, which is not fully true. They take advantage of their huge context window, which in some models is up to 2M tokens (which is huge!). They are able to process videos frame by frame, as a human would. Each frame is converted into 258 tokens, and the model can also process audio, at 32 tokens per second. So their 2M context window can handle roughly two hours of video.
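The per-frame and per-second figures above make for a simple back-of-the-envelope budget. This sketch assumes Gemini's documented sampling rate of one video frame per second; the constants are the ones quoted in the paragraph above.

```python
TOKENS_PER_FRAME = 258      # one sampled video frame
TOKENS_PER_SEC_AUDIO = 32   # one second of audio
FRAMES_PER_SEC = 1          # Gemini samples video at ~1 fps
CONTEXT_WINDOW = 2_000_000  # tokens, for the largest Gemini models

def video_token_cost(duration_s, with_audio=True):
    """Approximate token cost of a video of the given duration in seconds."""
    tokens = duration_s * TOKENS_PER_FRAME * FRAMES_PER_SEC
    if with_audio:
        tokens += duration_s * TOKENS_PER_SEC_AUDIO
    return tokens

# Longest video (with audio) that fits in the 2M-token window:
max_seconds = CONTEXT_WINDOW // (TOKENS_PER_FRAME + TOKENS_PER_SEC_AUDIO)
print(max_seconds / 3600)  # ≈ 1.9 hours
```

Dropping the audio track stretches that budget a bit further, which is one more reason behind tip 3 below.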


Tips for Effective Input

All this information is based on my personal experience and knowledge from using Gemini and similar models.


So yeah, as customers we now have a vast number of options related to multimodality, but how can we squeeze those video processing features to take full advantage of them? Here are some techniques we can apply with Gemini.


1. Slowing Down Videos


Think of the AI as a person examining a video in great detail. If the video plays too fast, it may miss important details or information. Slowing down the video, and telling the model that the video has been slowed, enhances its reasoning and gives it a lot more context, allowing it to pick out key points and important elements. Technically, this happens because a slowed-down video gives the AI many more tokens to analyze than the original. For instance, a 19-second video recorded with an iPhone produces 5,901 tokens of data. When slowed to 0.75x, the model has 7,996 tokens to analyze. Isn't that a big difference?
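One way to do the slowing is with FFmpeg's setpts (video) and atempo (audio) filters. This sketch just builds the command; the filenames are placeholders, and the 0.75 factor matches the example above.

```python
def slowdown_cmd(src, dst, speed=0.75):
    """Build an ffmpeg command that slows a video (and its audio) to `speed`x."""
    assert 0.5 <= speed < 1.0, "atempo accepts factors down to 0.5"
    return [
        "ffmpeg", "-i", src,
        # setpts stretches video timestamps; atempo slows audio without pitch shift.
        "-filter_complex", f"[0:v]setpts=PTS/{speed}[v];[0:a]atempo={speed}[a]",
        "-map", "[v]", "-map", "[a]",
        dst,
    ]

cmd = slowdown_cmd("clip.mp4", "clip_075x.mp4", speed=0.75)
# Run with: subprocess.run(cmd, check=True)
```

Since the model samples frames at a fixed rate, stretching 19 seconds to about 25 gives it more sampled frames, which is exactly where the extra tokens come from.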

2. The Importance of Video Clarity


Again, to make the LLM understand videos more easily, I've found that increasing the sharpness and brightness of a video helps it understand much better what's happening. I can't say for certain why, but after a few tests, the video with more brightness and sharpness was always the one with the best analysis. You can apply such filters with FFmpeg before sending the video to the model. Remember that all frames are converted into the same format, so make sure those frames have the best possible quality.
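A minimal FFmpeg preprocessing step along those lines, using the eq and unsharp filters. The specific values here are just a starting point I'd try, not tuned settings; filenames are placeholders.

```python
def enhance_cmd(src, dst, brightness=0.06, contrast=1.1):
    """Build an ffmpeg command that brightens and lightly sharpens a video."""
    # eq adjusts brightness/contrast; unsharp applies a mild sharpening kernel.
    filters = f"eq=brightness={brightness}:contrast={contrast},unsharp=5:5:1.0"
    return ["ffmpeg", "-i", src, "-vf", filters, dst]

cmd = enhance_cmd("clip.mp4", "clip_enhanced.mp4")
# Run with: subprocess.run(cmd, check=True)
```

As with any enhancement, overdoing it can clip highlights or add halo artifacts, so it's worth eyeballing the output before sending it to the model.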


3. Avoiding Audio if Not Necessary


Audio can sometimes make the LLM struggle with important details and cause the model to second-guess itself. When the audio and the images of the video contradict each other, the AI starts hallucinating and giving incorrect information about both. This is very important, so make sure you aren't mixing audio tokens and video tokens when you don't need to.
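Dropping the audio track is a one-flag FFmpeg job. Again, this only builds the command, and filenames are placeholders:

```python
def strip_audio_cmd(src, dst):
    """Build an ffmpeg command that keeps the video stream and drops audio."""
    # -an drops all audio; -c:v copy re-muxes without re-encoding the video.
    return ["ffmpeg", "-i", src, "-c:v", "copy", "-an", dst]

cmd = strip_audio_cmd("clip.mp4", "clip_noaudio.mp4")
# Run with: subprocess.run(cmd, check=True)
```

Besides avoiding conflicting signals, this also saves 32 tokens per second of video, which adds up on longer clips.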


4. Segmenting the Video into Key Scenes


The duration of a video is key. While slowing down a video can make fast sequences easier to understand, really long videos, those over two minutes, make the AI struggle to process all the frames and may cause important tokens to get lost in its long context window. Make sure you only show what needs to be shown and avoid irrelevant details that could make the AI lose focus on what truly matters.
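Cutting out just the relevant scene is another small FFmpeg command. Timestamps and filenames below are hypothetical; -c copy keeps the cut fast by avoiding re-encoding (at the cost of the cut snapping to the nearest keyframe).

```python
def trim_cmd(src, dst, start, end):
    """Build an ffmpeg command that extracts the segment between start and end."""
    # Times can be plain seconds ("35") or "HH:MM:SS"; -c copy avoids re-encoding.
    return ["ffmpeg", "-ss", str(start), "-to", str(end), "-i", src, "-c", "copy", dst]

cmd = trim_cmd("match.mp4", "goal.mp4", "00:12:05", "00:12:35")
# Run with: subprocess.run(cmd, check=True)
```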


5. Timestamps to Highlight Important Moments


The video analysis embeds metadata into each frame, indicating the exact time at which that frame occurs. This ensures the system always knows the specific second, minute, or hour when an action takes place. If you'd like the AI to analyze a particular moment in the video, simply specify the exact time, and it will focus in-depth on the selected sequence while maintaining context for the rest of the video.
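Gemini's documentation suggests referring to moments in MM:SS form. A tiny helper for building such a prompt; the prompt wording itself is just an illustration, not a required format.

```python
def mmss(seconds):
    """Format a time offset in seconds as MM:SS, the form used for video timestamps."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"

# A hypothetical prompt pointing the model at one moment of the video:
prompt = f"Focus on the action at {mmss(95)} and describe it in detail."
```

Because every frame carries its timestamp, the model can anchor its answer to that exact second while still using the rest of the video as context.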


Conclusion


Video is one of the most underused capabilities of LLMs. Just imagine: what can you NOT do with video? Our life, for example, is a constant video! Teaching our LLMs to recognize what they are seeing in a sequence of frames is very important, as it gives them awareness of a context that often can't be conveyed through a single picture. I hope 2025 will be the year of video, and that the other major AI labs start putting in the work to compete for the best video analysis out there.


It's important to give LLMs quality input, to set them up for the best possible understanding and make sure they really comprehend what's happening. That's why I wrote these tips, so you can squeeze the most out of today's multimodal video capabilities. It's simple: provide clear, concise videos with a specific goal, and guide the LLM closely while it analyzes the frames!


Written with ❤️ by Guillem