While recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to input-side multimodal comprehension and lack the capacity for multimodal content generation. To fill this gap, we present GPT4Video, a unified multimodal framework that empowers Large Language Models (LLMs) with the capability of both video understanding and generation. Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has been shown to handle video generation scenarios effectively and securely. GPT4Video offers the following benefits: 1) It exhibits impressive capabilities in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley by 11.8% on the Video Question Answering task, and surpasses NExT-GPT by 2.3% on the Text-to-Video generation task. 2) It endows the LLM/MLLM with video generation capabilities without requiring additional training parameters and can flexibly interface with a wide range of models to perform video generation. 3) It maintains a safe and healthy conversation not only on the output side but also on the input side in an end-to-end manner. Quantitative and qualitative experiments demonstrate that GPT4Video holds the potential to function as an effective, safe, and humanoid-like video assistant that can handle both video understanding and generation scenarios.
Video Encoding stage: The video encoding module employs a frozen ViT-L/14 model to capture raw video features, while the video abstraction module utilizes a transformer-based cross-attention layer and two novel learnable tokens, designed to condense information along the temporal and spatial axes.
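To make the abstraction step concrete, below is a minimal PyTorch sketch of a cross-attention pooling module with two learnable query tokens, one for the temporal axis and one for the spatial axis. The class name, hidden size, and pooling details are our assumptions, not the released implementation.

```python
# Hypothetical sketch of the video abstraction module (names and dimensions
# are assumptions, not the authors' released code).
import torch
import torch.nn as nn

class VideoAbstraction(nn.Module):
    """Condenses frame-level ViT features with cross-attention and two
    learnable query tokens: one pooling over time, one over space."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.temporal_token = nn.Parameter(torch.randn(1, 1, dim))  # learnable temporal query
        self.spatial_token = nn.Parameter(torch.randn(1, 1, dim))   # learnable spatial query
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, patches, dim) from a frozen ViT-L/14
        b, t, p, d = frame_feats.shape

        # Temporal axis: one query attends over the sequence of frame features.
        temporal_ctx = frame_feats.mean(dim=2)                       # (b, t, d)
        q_t = self.temporal_token.expand(b, -1, -1)                  # (b, 1, d)
        temporal_summary, _ = self.cross_attn(q_t, temporal_ctx, temporal_ctx)

        # Spatial axis: one query attends over the patch tokens.
        spatial_ctx = frame_feats.mean(dim=1)                        # (b, p, d)
        q_s = self.spatial_token.expand(b, -1, -1)                   # (b, 1, d)
        spatial_summary, _ = self.cross_attn(q_s, spatial_ctx, spatial_ctx)

        # Concatenate the two condensed tokens as the compact video representation.
        return torch.cat([temporal_summary, spatial_summary], dim=1)  # (b, 2, d)
```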
LLM reasoning: The core of GPT4Video is powered by a frozen LLaMA model that is parameter-efficiently fine-tuned via LoRA. The LLM is trained with custom video-centric and safety-aligned data, enabling it to comprehend videos and generate appropriate video prompts (indicated by underlined text).
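A minimal sketch of this parameter-efficient setup using the Hugging Face peft library is shown below; the checkpoint name, LoRA rank, and target modules are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: attach LoRA adapters to a frozen LLaMA backbone with `peft`.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint

lora_config = LoraConfig(
    r=16,                                   # low-rank adapter dimension (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)   # base weights stay frozen
model.print_trainable_parameters()          # only the LoRA matrices are trainable
```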
Video Generation stage: The prompts generated by the LLM are then used as text inputs for models in the Text-to-Video Model Gallery to create videos. We use ZeroScope as our video generation model in this work.
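For example, a generated prompt can be passed to ZeroScope through the diffusers text-to-video pipeline roughly as follows; the checkpoint id, sampling settings, and example prompt are assumptions and not tied to our exact configuration.

```python
# Sketch: feed an LLM-generated prompt to ZeroScope via the diffusers pipeline.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # keeps GPU memory use manageable

prompt = "a golden retriever surfing a wave at sunset"  # hypothetical prompt produced by the LLM
# On older diffusers versions the output is a flat frame list (.frames).
video_frames = pipe(prompt, num_inference_steps=40, num_frames=24).frames[0]
export_to_video(video_frames, "output.mp4")
```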
We meticulously crafted an instruction-following dataset to train the LLM to generate suitable prompts for text-to-video models at appropriate times.
We used the string "<video> Video Caption </video>" as a placeholder for the actual video in the prompt to GPT-4, and asked GPT-4 to construct three dialogues between two individuals (rather than between a person and an assistant, as in most other LLM instruction datasets) centered around the provided video caption.
For dialogues involving multiple videos, we used the strings "<videoX> Video Caption </videoX>" to denote the actual videos, where "X" is the video index and "Video Caption" describes the content of video X. We employed a text-retrieval-based approach to ensure semantic similarity among the selected videos, and then asked GPT-4 to construct dialogues around them, as sketched below.
If you find our work useful, please cite:
@article{wang2023gpt4video,
  title   = {GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation},
  author  = {Zhanyu Wang and Longyue Wang and Minghao Wu and Zhen Zhao and Chenyang Lyu and Huayang Li and Deng Cai and Luping Zhou and Shuming Shi and Zhaopeng Tu},
  journal = {CoRR},
  year    = {2023}
}