Oryx MLLM: On-Demand Spatial-Temporal
Understanding at Arbitrary Resolution

Zuyan Liu1,2,  Yuhao Dong2,3,  Ziwei Liu3,  Winston Hu2,  Jiwen Lu1,  Yongming Rao2,1
1 Tsinghua University  2 Tencent  3 S-Lab, NTU

https://oryx-mllm.github.io
Authors contributed equally to this research.  Corresponding authors.
Abstract

Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-optimal for multimodal understanding and inefficient for processing inputs with long and short visual contents. To solve the problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression on visual tokens by request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, with lower resolution and high compression while maintaining high recognition precision for tasks like document understanding with native resolution and no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at https://github.com/Oryx-mllm/Oryx.

1 Introduction

Multi-Modal Large Language Models (MLLMs) have made significant strides in processing and integrating visual and linguistic inputs to generate coherent and contextually relevant responses. Proprietary models (OpenAI, 2023b; 2024; GeminiTeam, 2024) exemplify the cutting-edge capabilities of MLLMs. Concurrently, the open-source community is actively advancing MLLMs by enhancing their ability to understand diverse visual content (Tong et al., 2024; Liu et al., 2024g; Yang et al., 2023a), including images (Li et al., 2024a; Chen et al., 2024b), videos (Lin et al., 2023a; Cheng et al., 2024; Qian et al., 2024), and 3D data (Hong et al., 2023). As MLLMs become stronger, there is a growing need for more general and unified MLLMs that can process visual content in more diverse forms and solve more challenging multimodal understanding problems.

One core challenge on the path to more general MLLMs is to develop better visual representations for diverse visual data. Visual data exhibit significant complexity and diversity, characterized by variations in collection sources, targeted visual tasks, specific contents, and resolution qualities. Existing approaches often treat all kinds of visual inputs uniformly, overlooking the variations in visual content and the specific demands of different applications. For example, early MLLMs (Alayrac et al., 2022; Li et al., 2023; Bai et al., 2023) attempt to standardize these diverse visual inputs by converting them into a fixed resolution so that pre-trained CLIP encoders can be used to extract high-quality visual representations that are well aligned with language contents. Recent advancements in MLLMs (Liu et al., 2024c; Xu et al., 2024b; Yao et al., 2024) extend the idea by introducing dynamic partitioning (Liu et al., 2024c) as a means to produce high-resolution visual representations while utilizing the strong CLIP models for encoding. However, the solution remains a compromise due to the lack of high-quality multi-modal encoders that support native resolution inputs. Supporting native resolution in an on-demand manner for visual inputs emerges as a more generalized and effective solution for visual understanding in MLLMs, offering several advantages: it prevents information loss by utilizing the entire image as input, thereby resolving extreme corner cases, and it enhances efficiency and naturalness, resulting in better overall performance. As illustrated in Figure 1, optimizing for resolution and compression can lead to greater efficiency and meet practical needs: high resolution is crucial for text-relevant tasks, while object-level tasks may require only simple images; some applications may need to summarize extremely long videos, while others must maintain high precision for each frame.

In this paper, we explore on-demand MLLMs for comprehensive spatial-temporal understanding by introducing evolved architectural designs and propose the new Oryx model, which aims to address these challenges and enhance the functionality of MLLMs. Oryx is a unified spatial-temporal understanding MLLM framework that adeptly handles arbitrary visual resolutions, varying temporal lengths, and a diverse range of tasks in an on-demand manner. Oryx is characterized by the following key contributions: 1) A pre-trained visual encoder, OryxViT, is developed to generate LLM-friendly visual representations at native resolutions. Equipped with adaptive positional embeddings and variable-length self-attention, OryxViT can efficiently process visual data with different sizes in parallel; 2) A dynamic compression technique adjusts downsampling ratios arbitrarily while fusing the information through a shared projector, thereby supporting a seamless switch between 1x and 16x compression. The new design enables Oryx to easily process extremely long inputs with up to 16x compression while maintaining high recognition precision for inputs that do not require compression; 3) Enhanced data curation and training strategies help Oryx achieve pioneering performance in multimodal image, video, and 3D data understanding and easily adapt to arbitrary input resolutions and tasks simultaneously.

Figure 1: Our main idea of on-demand multimodal understanding. Different visual data and tasks may require different input resolutions and compression ratios on visual tokens. Supporting arbitrary resolution in an on-demand manner for visual inputs emerges as a more general and effective solution for visual understanding in MLLMs.

We evaluate the Oryx model on a wide range of multi-modal benchmarks, demonstrating remarkable performance in both spatial and temporal understanding across image, video, and multi-view 3D data. Notably, the Oryx model excels in general and long-form video comprehension, achieving competitive results with a 7B model size and surpassing models up to 72B in size with our 34B variant. This has led to new state-of-the-art results among open-source models on several benchmarks, including NextQA (Xiao et al., 2021), Perception Test (Patraucean et al., 2024), MMBench-Video (Fang et al., 2024), and MVBench (Li et al., 2024c) for general video understanding, and MLVU (Zhou et al., 2024) and LongVideoBench (Wu et al., 2024) for long-form video understanding. Additionally, the Oryx model shows strong performance in 2D and 3D spatial understanding, outperforming mainstream image-based MLLMs and 3D-specific LLMs, respectively, benefiting from its unified training strategy.

2 Related Work

Visual Encoding in Multi-Modal LLMs. Multi-modal LLMs depend on visual encoders to extract visual features and employ connectors for aligning visual features with the LLMs. Alayrac et al. (2022) and Li et al. (2023) utilize attention to capture visual features and align the visual encoder with LLMs through learnable queries, which may struggle when not adequately trained. LLaVA (Liu et al., 2024d; b; f) utilizes a simple MLP to connect the visual encoder with LLMs, while Ranzinger et al. (2024) combines visual features from different encoders for enhancement. However, they are limited to fixed resolutions, which may hinder their ability to capture detailed information and restrict their flexibility in understanding images with varying aspect ratios. Recent advancements in high-resolution perception (Liu et al., 2024c; Xu et al., 2024b; Yao et al., 2024) have primarily been driven by dynamic partitioning, which divides an image into multiple patches of equal resolution. While this method can manage high-resolution images, it is inefficient, and the partitioning process may result in the loss of critical information present in the original image. In this paper, we introduce OryxViT, an innovative step in visual encoding that enables native resolution perception, allowing the processing of images at any resolution while preserving their original size.

Multi-modal LLMs Supporting Diverse Contexts and Tasks. Recent advancements in MLLMs have enabled them to comprehend a wide range of complex visual inputs from different tasks with various contexts. As open-source models strive to match the performance of proprietary models (OpenAI, 2024), they are also trying to become more versatile. Lin et al. (2023a); Cheng et al. (2024); Qian et al. (2024) try to combine image and video perception, and Zhang et al. (2024a) focuses on long-form video analysis with extended context lengths. 3D-LLM (Hong et al., 2023) made the first attempt to enable MLLMs to comprehend 3D environments.  Li et al. (2024b); Jiang et al. (2024) investigate interleaved data training to handle multi-image scenarios, and Li et al. (2024a) unifies single-image, multi-image, and video settings through improved data curation and training strategies. While previous approaches relied heavily on enhanced data curation to achieve multi-task comprehension, we propose a novel framework that represents complex visual inputs with cohesive representations. Our model is capable of processing visual contexts of arbitrary sizes, videos of varying lengths, and 3D data seamlessly, supporting various context lengths and versatile tasks.

3 Methods

In this section, we provide a detailed explanation of Oryx’s contributions. Our design is divided into two primary components: the architecture and the training pipeline, which are elaborated upon in Sections 3.1 and 3.2, respectively. We describe our architecture for processing native and on-demand visual inputs within MLLMs, as illustrated in Figure 2, which enables the development of a model capable of generalizing across image, video, and 3D data. Furthermore, we outline the simple yet effective training pipeline of the Oryx model.

3.1 Oryx Architecture: MLLM with Native and Flexible Visual Inputs

3.1.1 Visual Representations with Native Resolution

Resizing and regularizing visual inputs, including images and videos, is a necessary and effective preprocessing step. Common practice typically involves resizing and cropping visual inputs to a fixed resolution with a square shape. However, such processes may negatively impact the performance of vision backbones, as previous studies on vision recognition have demonstrated the effectiveness of maintaining visual content in its original form. NaViT (Dehghani et al., 2024) leverages the characteristics of the vanilla ViT (Dosovitskiy, 2020), introducing a pack sequence operation that accommodates images of any aspect ratio and resolution for efficient training. Similarly, FlexiViT (Beyer et al., 2023) and ViTAR (Fan et al., 2024) incorporate randomly resized images during training to develop a Vision Transformer capable of handling inputs of varying resolutions.

Figure 2: Overview of Oryx architecture. Oryx offers two options to process visual inputs with arbitrary spatial sizes and temporal lengths in an on-demand manner. 1) A pre-trained OryxViT equipped with variable-length self-attention to encode visual features with native aspect ratios and resolution. 2) A dynamic compressor offering on-demand compression on visual tokens while maintaining a unified token form.

Despite these advancements, the effectiveness of native or arbitrary resolution in the realm of MLLM has barely been explored. Most existing MLLMs integrate original image-text visual encoders such as CLIP (Radford et al., 2021) and SigLIP (Zhai et al., 2023) to encode input visual data. We posit that MLLMs provide an optimal environment for processing visual representations at their native resolution for two primary reasons: (1) the sources and tasks associated with visual inputs are diverse, necessitating varying demands and formats; (2) the token lengths in MLLMs are inherently dynamic, particularly in the language component. Consequently, the dynamic representation of visual context aligns seamlessly with subsequent processing stages.

In Vision Transformer (ViT) models (we omit the class token here for simplification), given the visual input $\{x\}^{\in H\times W}$, where typically $H\neq W$, the ViT first resizes the visual input into $\{x\}^{\in N\times N}$. The resized image is then passed through patch embedding layers, which partition the image into patches of size $p\times p$, resulting in a sequence of patches $\{x\}^{\in (N/p)\times(N/p)}$. Conventional Vision Transformers utilize a fixed-size position embedding matrix $P$ corresponding to the predefined image size $N\times N$. However, when processing visual inputs at their native resolution $\{x\}^{\in \lfloor H/p\rfloor\times\lfloor W/p\rfloor}$, directly resizing $P$ to $\lfloor H/p\rfloor\times\lfloor W/p\rfloor$ can lead to a significant drop in accuracy, as demonstrated in previous works (Dehghani et al., 2024; Beyer et al., 2023).

To address the issue of native resolution processing, we introduce a visual encoder named OryxViT, which builds upon the advanced SigLIP (Zhai et al., 2023) models and is based on the Vision Transformer (Dosovitskiy, 2020) architecture. We modify the vision encoder by incorporating a sufficiently large position embedding matrix $P$ that accommodates the maximum target input sizes (2048$\times$2028 in our models, which can also be further interpolated for even larger inputs). For each visual input, we rescale the original position embeddings into $P^{\in \lfloor H/p\rfloor\times\lfloor W/p\rfloor}$ using bilinear interpolation and apply the transformation $x = x + P$. The adaptation strategy for the newly defined $P$ under native input resolution follows the training format of common MLLMs. We employ a relatively lightweight LLM as the language interface, keeping the vision encoder’s parameters unfrozen while freezing most of the other parameters. We collect training data pairs from multiple vision-language tasks including captioning, OCR, visual question answering, etc.
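The rescaling step can be illustrated with a minimal pure-Python sketch. It assumes the stored position embedding is a dense grid of shape [in_h][in_w][C] and uses an align-corners sampling convention; both the function names and that convention are illustrative assumptions, not details confirmed by the paper.

```python
def resize_pos_embed(pos, out_h, out_w):
    """Bilinearly resample a position-embedding grid (hypothetical helper).

    pos: nested list of shape [in_h][in_w][C].
    Returns a nested list of shape [out_h][out_w][C].
    """
    in_h, in_w, channels = len(pos), len(pos[0]), len(pos[0][0])

    def source_coord(i, n_out, n_in):
        # Align-corners mapping from output index to source coordinate (assumption).
        return 0.0 if n_out == 1 else i * (n_in - 1) / (n_out - 1)

    out = []
    for i in range(out_h):
        y = source_coord(i, out_h, in_h)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = source_coord(j, out_w, in_w)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            vec = []
            for c in range(channels):
                top = pos[y0][x0][c] * (1 - wx) + pos[y0][x1][c] * wx
                bottom = pos[y1][x0][c] * (1 - wx) + pos[y1][x1][c] * wx
                vec.append(top * (1 - wy) + bottom * wy)
            row.append(vec)
        out.append(row)
    return out


def add_pos_embed(patches, pos):
    """Apply x = x + P to a patch grid of shape [h][w][C] at its native size."""
    h, w = len(patches), len(patches[0])
    p = resize_pos_embed(pos, h, w)
    return [[[a + b for a, b in zip(patches[i][j], p[i][j])]
             for j in range(w)] for i in range(h)]
```

Because the output grid size is taken from the patch grid itself, the same stored matrix serves inputs of any aspect ratio, which is the property the paragraph above relies on.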

A significant challenge is managing the dynamic sequence length $N=\lfloor H/p\rfloor\times\lfloor W/p\rfloor$ for the Vision Transformer during batch processing. For visual patches with lengths $N_1, N_2, \ldots, N_b$ in a batch of size $b$, we concatenate the patches across the sequence dimensions into a shape of $[1, \sum_{i=1}^{b} N_i, C]$ before feeding them into the transformer blocks. We utilize the variable-length attention operator provided in flash attention (Dao et al., 2022) to compute the attention for each visual input within the batch independently. With these designs, our OryxViT can efficiently process visual signals of varying aspect ratios in batch mode, maintaining a forward speed comparable to that of conventional fixed-resolution visual encoders.
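The packing scheme can be sketched as follows. This is a slow reference loop in pure Python that only mimics the masking behaviour of a variable-length attention kernel; it is not the flash-attention operator itself, and the helper names, the cumulative-length list, and the use of Q = K = V without projections are simplifying assumptions.

```python
import math


def pack_sequences(seqs):
    """Concatenate per-image token lists and record cumulative boundaries."""
    packed, cu_seqlens = [], [0]
    for tokens in seqs:
        packed.extend(tokens)
        cu_seqlens.append(cu_seqlens[-1] + len(tokens))
    return packed, cu_seqlens


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def varlen_self_attention(packed, cu_seqlens):
    """Self-attention where each token attends only within its own segment."""
    dim = len(packed[0])
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        segment = packed[start:end]
        for q in segment:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                      for k in segment]
            weights = softmax(scores)
            out.append([sum(w * v[c] for w, v in zip(weights, segment))
                        for c in range(dim)])
    return out
```

The packed tensor has a single batch entry of length $\sum_i N_i$, so no padding tokens are computed, while the segment boundaries keep images from attending to one another.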

3.1.2 On-Demand Dynamic Compression Supporting Long Visual Context

With visual inputs varying in temporal length and resolution, such as some video data lasting tens of minutes, treating all inputs equally, as in most previous works (Zhang et al., 2024a; Xue et al., 2024), leads to inefficient computational costs. To address this, we propose a Dynamic Compressor, which is capable of performing higher compression ratios for longer contexts. Our design unifies visual contexts with different compression ratios into a consistent pattern, allowing us to control the overall visual sequence length on demand.

Using the visual representation feature map $f$, the compression serves as the bridge between vision and language modalities. We implement downsample layers with varying ratios to accommodate different input lengths. Specifically, we categorize the visual context into pure images, short videos, and long videos, applying downsample layers $d_1, d_2, d_3$ respectively. In our implementation, we set $d_3 = 4d_2 = 16d_1$, therefore the token length of long videos is reduced to $\frac{1}{16}$ of that of images with the same resolution.
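As a rough worked example of these ratios, the sketch below counts the visual tokens passed to the language model. It assumes spatial pooling windows of 1×1, 2×2, and 4×4 for the three categories (consistent with the 1x, 4x, and 16x token ratios); the patch size of 16, the ceiling division for partial windows, and the function name are hypothetical choices made only for illustration.

```python
# Side length of the pooling window per input category (assumed mapping).
POOL = {"image": 1, "short_video": 2, "long_video": 4}


def visual_token_count(height, width, patch_size, pool, num_frames=1):
    """Tokens sent to the LLM after a pool x pool downsample of each frame."""
    grid_h = height // patch_size
    grid_w = width // patch_size
    # Ceiling division so that a partial pooling window still yields one token
    # (an assumption; the paper does not specify the boundary handling).
    pooled_h = -(-grid_h // pool)
    pooled_w = -(-grid_w // pool)
    return num_frames * pooled_h * pooled_w


image_tokens = visual_token_count(448, 448, 16, POOL["image"])
long_video_frame_tokens = visual_token_count(448, 448, 16, POOL["long_video"])
```

Under these assumptions a 448×448 input yields 784 tokens as an image and 49 tokens per frame as a long video, a 16x reduction. It also follows that 256 frames at 16x compression occupy the same token budget as 64 frames at 4x compression.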

We obtain the low-resolution feature map $f_L = d_i(f_H),\ i = 1, 2, 3$ from the high-resolution feature map $f_H$. To mitigate the effects of downsampling, we employ an attention operation to facilitate interaction between $f_L$ and $f_H$. Specifically, for a downsample ratio $r$, we treat $f_L \in \mathbb{R}^{N\times C}$ as the query tensor $\mathbf{Q}$ and $f_H \in \mathbb{R}^{N\times r^2\times C}$ as the key tensor $\mathbf{K}$ and value tensor $\mathbf{V}$. Each patch in the low-resolution $f_L$ interacts with $r^2$ neighboring patches in the high-resolution $f_H$ through a cross-attention operation, formulated as follows:

$$f_L = f_L + \text{Softmax}\left(\frac{\phi_q(\mathbf{Q})\,\phi_k(\mathbf{K}^T)}{\sqrt{d_k}}\right)\mathbf{V} \tag{1}$$

where we define the query and key projection layers, denoted as $\phi_q$ and $\phi_k$, to project the query and key tensors into lower dimensions. To maintain the original features from the visual encoder and limit the number of linear projection layers, we omit the value and output projection layers commonly used in attention modules. Then we utilize a 2-layer MLP to project the compressed low-resolution features into the embedding space of the language model. Upon completion of the dynamic compression module, the final visual representation features are flattened and integrated into the sequence of visual tokens among the text tokens. This combined sequence is then fed into the language model for token prediction.
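Equation (1) can be sketched in pure Python as below. The sketch assumes average pooling as the downsample layer, non-overlapping $r\times r$ windows as the "neighboring patches", and single-head attention with plain weight matrices standing in for the query and key projections; the 2-layer MLP that follows is omitted. All names are illustrative rather than taken from the released code.

```python
import math


def matvec(x, w):
    """Project vector x (length C) with a weight matrix w of shape [C][D]."""
    return [sum(x[c] * w[c][d] for c in range(len(x))) for d in range(len(w[0]))]


def compress(f_high, r, w_q, w_k):
    """Region-wise cross-attention compression of a [H][W][C] feature map."""
    h, w, ch = len(f_high), len(f_high[0]), len(f_high[0][0])
    assert h % r == 0 and w % r == 0, "sketch assumes divisible grid sizes"
    d_k = len(w_q[0])
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            # K = V = the r*r high-resolution patches under this window.
            window = [f_high[i + di][j + dj] for di in range(r) for dj in range(r)]
            # Q = the average-pooled low-resolution feature for the window.
            f_low = [sum(v[c] for v in window) / len(window) for c in range(ch)]
            q = matvec(f_low, w_q)
            scores = [sum(a * b for a, b in zip(q, matvec(k, w_k))) / math.sqrt(d_k)
                      for k in window]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            attn = [e / z for e in exps]
            # No value or output projection: mix raw encoder features, add residual.
            row.append([f_low[c] + sum(a * v[c] for a, v in zip(attn, window))
                        for c in range(ch)])
        out.append(row)
    return out
```

Since the projections act only on queries and keys, the same weights can be reused for every ratio $r$, and the output always has the channel width of the encoder features regardless of how many tokens remain.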

3.1.3 One Model for All: Image, Video, and 3D Understanding

Previous work (Li et al., 2024a; Chen et al., 2024b; QwenTeam, 2024b) has demonstrated the coexistence of MLLMs that support both image and video modalities. Building on this foundation, our research aims to extend the capabilities of these models to handle more diverse contexts, varying lengths of content, and a broader range of tasks. To achieve this, we meticulously curate a training dataset specifically designed for extremely long-form videos. Additionally, we further incorporate spatial-relevant knowledge through coarse correspondence markers among multi-frame visual inputs to make Oryx 3D-aware.

Long-Form Temporal Training with Needle-In-A-Haystack. The key ability for processing long-form video inputs is the identification of specific information within an extensive context, akin to the “needle-in-a-haystack” task in the NLP field. To enhance the Oryx model’s capability to pinpoint details, we prepare long-form temporal needle-in-a-haystack training data. Specifically, we source video samples from the MovieNet (Huang et al., 2020) dataset, which comprises an average of 1000 frames per movie and an average duration of 45 minutes, thereby providing a natural setting for retrieving designated targets. We devise two tasks to train the model: captioning and differing. The captioning task requires the model to generate captions for frames at specific indices, while the differing task involves identifying differences between two frames given their indices. The training corpus is generated using GPT-4o, which produces captions for single frames or frame pairs. These captioned frames are then reinserted into the overall movie sequences, ensuring the training data maintains contextual integrity.

Learning Spatial-Aware Knowledge via Coarse Correspondences. Recent advancements have focused on enhancing multi-modal LLMs with 3D understanding capabilities. These approaches primarily treat 3D tasks as multi-image inputs. However, unlike video inputs, multi-view images generated from 3D environments lack temporal or trajectory cues, which are essential for MLLMs to accurately process sequential data. As a result, previous methods often struggle to achieve correct spatial understanding when evaluated against 3D benchmarks.

Building on the work of Liu et al. (2024a), we introduce coarse correspondences into our training dataset. The core concept is to assign a consistent label to the same object across different frames, allowing the model to better capture spatial correlations across multiple views. This approach aims to enhance the model’s ability to develop a more accurate 3D spatial understanding of the scene. Specifically, we utilize Track-Anything (Yang et al., 2023b) as our tracking model to generate coarse correspondences for the ScanQA training set. These data are then incorporated into the final training set.
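The labeling idea can be sketched as follows. The input format (one dictionary of tracked object positions per frame), the rule of keeping objects seen in at least two frames, and the `top_k` cut-off are all hypothetical; the sketch does not reproduce the output format of Track-Anything or the exact selection rule used for the ScanQA data.

```python
from collections import Counter


def assign_coarse_labels(frame_tracks, top_k=3):
    """Give each tracked object one marker label shared across all frames.

    frame_tracks: one dict per frame mapping track_id -> (x, y) object centre.
    Returns one dict per frame mapping marker label -> (x, y).
    """
    counts = Counter(tid for frame in frame_tracks for tid in frame)
    # Keep objects seen in at least two frames, most frequent first
    # (ties broken by track id so the result is deterministic).
    kept = sorted((tid for tid, n in counts.items() if n >= 2),
                  key=lambda tid: (-counts[tid], tid))[:top_k]
    label_of = {tid: idx + 1 for idx, tid in enumerate(kept)}
    return [{label_of[tid]: pos for tid, pos in frame.items() if tid in label_of}
            for frame in frame_tracks]
```

Each returned dictionary lists the markers to draw on that frame, so the same numeric label always refers to the same physical object across views.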

3.2 Training Pipeline & Data Mixture

The training pipeline of Oryx is lightweight and direct, following a two-stage strategy. We start from a well-trained vision tower, OryxViT, and a Large Language Model. The first stage involves only image data, following common practice (Liu et al., 2024d; b). The second stage uses a mixture of data from images, videos, and corresponding 3D frames, and we train on the multi-source data jointly thanks to our unified design. All of our training data are collected from open-source datasets, which ensures the reproducibility of the Oryx model and leaves room for improvement with better data curation.

Stage 1: Text-Image Pre-training and Supervised Fine-tuning. In the first stage of our training process, we focus on developing the foundational vision-language capabilities of the Oryx model using image data. This stage begins with a pre-training phase to train the dynamic compressor component with a dataset of 558k images in LLaVA-1.5 (Liu et al., 2024b). Following this, we gather a collection of 4 million supervised fine-tuning image-text pairs that focus on high-quality knowledge learning. This data is sourced from various open-source academic datasets, including LLaVA-NeXt (Liu et al., 2024c), Cauldron (Laurençon et al., 2024), and Cambrian-1 (Tong et al., 2024). It is important to note that we do not incorporate large-scale pre-training stages as described in (Li et al., 2024a) or employ exclusive supervised fine-tuning data such as those in (Lin et al., 2023b; Bai et al., 2023), as our primary objective is to validate the effectiveness of our unified Oryx architecture.

Stage 2: Joint Supervised Fine-tuning. In Stage 2, we further conduct a supervised fine-tuning procedure following the initial stage, aiming to jointly train the Oryx model with image, video, and 3D-aware visual inputs. The image training data is sampled from the dataset collected during the supervised fine-tuning phase of Stage 1, ensuring a balanced ratio of image and video data by utilizing around 600k image-text pairs. For video data, we source both comprehensive and multiple-choice datasets from open-source video repositories. Comprehensive datasets, which include question-answering and captioning tasks, are integrated using VideoChatGPT-Plus (Maaz et al., 2024), ShareGPT4Video (Chen et al., 2024a) and LLaVA-Hound (Zhang et al., 2024b). To enhance performance on multiple-choice benchmarks, we further incorporate Cinepile (Rawal et al., 2024), NextQA (Xiao et al., 2021) and PerceptionTest (Patraucean et al., 2024) into our training dataset. Additionally, we include video samples of needle-in-a-haystack data generated by GPT-4o (OpenAI, 2024) for long-form video learning and spatial-aware 3D multi-frame samples from the ScanQA (Azuma et al., 2022) training dataset, culminating in a total of around 650k video samples. The supervised fine-tuning strategy in this stage mirrors that of Stage 1, ensuring consistency in the training approach.

4 Experiments

We conduct comprehensive experiments across multiple vision-language benchmarks to demonstrate the effectiveness of our method. In this section, we present the main results on general video understanding benchmarks (Sec. 4.2), long-form video benchmarks (Sec. 4.3), 2D & 3D spatial understanding benchmarks (Sec. 4.4) and compare our method with other state-of-the-art video MLLMs. Finally, we provide analysis experiments and critical ablation studies on design elements.

4.1 Implementation Details

Our implementation integrates the Oryx model with two LLMs, Qwen-2-7B (QwenTeam, 2024a) and Yi-1.5-34B (Young et al., 2024), to demonstrate generalizability across different model sizes. For the visual encoder, we use our pre-trained OryxViT to support arbitrary-resolution visual inputs. During the pre-training stage, we utilize 558k captioning data from LLaVA-1.5 (Liu et al., 2024b), unfreezing the parameters of the dynamic compression module. The image SFT stage involves curating an open-source dataset of around 4M images. In the joint training stage, we incorporate approximately 1.2M data consisting of images sampled from the previous stage and the video/3D data described in Sec. 3.2. For video data, we restrict the frame number to 64 for standard videos with a low compression ratio and 256 for long videos with a high compression ratio. We use $2\times 2$ average downsampling for low compression and $4\times 4$ average downsampling for high compression. Image data are maintained at their native resolution, with a maximum size of 1536 pixels, while video data resolutions are confined to a range of 288 to 480 pixels. The rest of the training details are provided in the appendix.

4.2 General Temporal Understanding

Table 1: General Temporal Understanding. We conduct experiments on four multiple-choice benchmarks and three generation benchmarks and report the main score for each dataset. Oryx exhibits superior performance compared with a wide range of open-sourced video MLLMs.

| Model | Size | VideoMME | NextQA | MVBench | PercepTest | MMB-Video | VCG | VDC |
|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | |
| GPT-4V (OpenAI, 2023b) | - | 59.9/63.3 | - | 43.7 | - | 1.53 | 4.06 | 4.00 |
| GPT-4o (OpenAI, 2024) | - | 71.9/77.2 | - | - | - | 1.63 | - | - |
| Gemini-1.5-Pro (GeminiTeam, 2024) | - | 75.0/81.3 | - | - | - | 1.30 | - | - |
| Open-Sourced Video MLLMs | | | | | | | | |
| VideoChat2-HD (Li et al., 2024c) | 7B | 45.3/55.7 | 79.5 | 62.3 | 47.3 | 1.18 | 3.10 | - |
| VideoLLaMA2 (Cheng et al., 2024) | 7B | 47.9/50.3 | - | 54.6 | 51.4 | - | 3.13 | - |
| LLaVA-OneVision (Li et al., 2024a) | 7B | 58.2/61.5 | 79.4 | 56.7 | 49.7 | - | 3.51 | 3.75 |
| Kangaroo (Liu et al., 2024e) | 8B | 56.0/57.6 | - | 61.1 | - | 1.44 | - | - |
| VideoCCAM (Fei et al., 2024) | 9B | 53.9/56.1 | - | 64.6 | - | - | - | - |
| LLaVA-Next-Video (Zhang et al., 2024c) | 34B | 52.0/54.9 | 70.2 | - | 51.6 | - | 3.34 | 3.48 |
| PLLaVA (Xu et al., 2024a) | 34B | - | - | 58.1 | - | - | 3.48 | - |
| VILA-1.5 (Lin et al., 2023b) | 40B | 60.1/61.1 | 67.9 | - | 54.0 | - | 3.36 | 3.37 |
| VideoLLaMA2 (Cheng et al., 2024) | 72B | 61.4/63.1 | - | 62.0 | 57.5 | - | 3.16 | - |
| LLaVA-OneVision (Li et al., 2024a) | 72B | 66.2/69.5 | 80.2 | 59.4 | 66.9 | - | 3.62 | 3.60 |
| Oryx | 7B | 58.3/62.6 | 81.9 | 63.9 | 68.6 | 1.47 | 3.53 | 3.76 |
| Oryx | 34B | 63.2/67.4 | 83.5 | 64.7 | 71.4 | 1.49 | 3.51 | 3.66 |

Setup. We present the experimental results on general multi-modal video understanding datasets, as video data provides comprehensive insights into visual-language abilities, especially when dealing with complex and diverse visual inputs. We select several representative and popular benchmarks, encompassing both multiple-choice and generation tasks for evaluation. We conduct evaluations on four multiple-choice benchmarks. VideoMME (Fu et al., 2024) covers a full spectrum of diverse videos with varying temporal lengths. NextQA (Xiao et al., 2021) is a classic benchmark for video reasoning. MVBench (Li et al., 2024c) comprises 20 challenging video comprehension tasks. Perception Test (Patraucean et al., 2024) focuses on the perception and reasoning skills of MLLMs. For generation-relevant benchmarks scored by advanced proprietary models, we integrate evaluations on MMBench-Video (Fang et al., 2024), Video-ChatGPT (VCG) (Maaz et al., 2023), and Video Detailed Caption (VDC) benchmarks. Following common practice, GPT-4-1106 (OpenAI, 2023c) is used as the evaluator for MMBench-Video (Fang et al., 2024), and GPT-3.5-0613 (OpenAI, 2023a) is employed for Video-ChatGPT (Maaz et al., 2023) and Video Detailed Caption.

Results. The experimental results, as detailed in Table 1, demonstrate that the Oryx model achieves highly competitive outcomes in general video understanding tasks. Oryx surpasses a broad spectrum of recent video-specific MLLMs and establishes a new state of the art on several benchmarks. The Oryx model attains top-tier performance among small-sized MLLMs (approximately 7B parameters) and exhibits competitive performance when compared to larger MLLMs (exceeding 30B parameters), even rivaling models with 72B parameters. On the VideoMME benchmark (Fu et al., 2024) with subtitles, the Oryx models achieve mean accuracies of 62.6 and 67.4. Oryx also demonstrates robust performance across various multiple-choice datasets, surpassing previous state-of-the-art results by 3.3 and 4.5 percentage points on NextQA (Xiao et al., 2021) and Perception Test (Patraucean et al., 2024), respectively. Additionally, the Oryx model performs convincingly on GPT-eval benchmarks, with an average score of 1.49 on MMBench-Video (Fang et al., 2024), and 3.53 and 3.76 on VideoChatGPT (Maaz et al., 2023) and Video Detailed Caption, respectively. Remarkably, the Oryx model outperforms advanced proprietary models such as GPT-4V (OpenAI, 2023b) and Gemini-1.5-Pro (GeminiTeam, 2024) on several of the most challenging benchmarks.

4.3 Long-Form Temporal Understanding

To further demonstrate the exceptional long-context understanding capability of our method, we conduct experiments on benchmarks specifically designed for long video evaluation. Additionally, we employ the video needle-in-a-haystack task to illustrate our model’s ability to handle extremely lengthy video content.

Table 2: Long-Form Temporal Understanding. We show results on three mainstream long-form temporal understanding datasets, each featuring video inputs of tens of minutes in duration. Oryx demonstrates superior performance, achieving state-of-the-art results and surpassing several proprietary models across various benchmarks.

| Model | Size | MLVU | LongVideoBench | VideoMME-Long (w/o subs) | VideoMME-Long (w/ subs) |
|---|---|---|---|---|---|
| Proprietary Models | | | | | |
| GPT-4V (OpenAI, 2023b) | - | 49.2 | 60.7 | 53.5 | 56.9 |
| GPT-4o (OpenAI, 2024) | - | 64.6 | 66.7 | 65.3 | 72.1 |
| Gemini-1.5-Pro (GeminiTeam, 2024) | - | - | 64.4 | 67.4 | 77.4 |
| Open-Sourced Video MLLMs | | | | | |
| VideoLLaMA2 (Cheng et al., 2024) | 7B | 48.5 | - | 42.1 | 43.8 |
| LongVA (Zhang et al., 2024a) | 7B | 56.3 | - | 46.2 | 47.6 |
| LLaVA-OneVision (Li et al., 2024a) | 7B | 64.7 | - | - | - |
| Kangaroo (Liu et al., 2024e) | 8B | 61.0 | 54.8 | 46.6 | 49.3 |
| LongVILA (Xue et al., 2024) | 8B | - | - | 39.7 | - |
| VideoCCAM (Fei et al., 2024) | 14B | 63.1 | - | 46.7 | 49.9 |
| LLaVA-Next-Video (Zhang et al., 2024c) | 34B | - | 50.5 | - | - |
| PLLaVA (Xu et al., 2024a) | 34B | - | 53.2 | - | - |
| VILA-1.5 (Lin et al., 2023b) | 40B | 56.7 | - | 53.8 | 55.7 |
| LLaVA-OneVision (Li et al., 2024a) | 72B | 66.4 | 61.3 | 60.0 | 62.4 |
| Oryx | 7B | 67.5 | 55.3 | 50.3 | 55.8 |
| Oryx | 34B | 70.8 | 62.2 | 53.9 | 58.0 |

4.3.1 Long-Form Video Benchmarks

Setup. We select three mainstream and representative benchmarks specifically designed for long video understanding, ensuring a comprehensive evaluation on long video inputs. MLVU (Zhou et al., 2024) encompasses videos ranging from 3 minutes to 2 hours and includes 9 distinct tasks that assess both global and local information within the video content. LongVideoBench (Wu et al., 2024) focuses on retrieval and reasoning over a dataset of 3k long video inputs. Additionally, we use the long video subset of the VideoMME (Fu et al., 2024) benchmark, which features videos ranging from 30 to 60 minutes in length.

Results. Results are shown in Table 2, which highlights the efficacy of our unified and on-demand design across varying temporal lengths and our further efforts in the context of long video retrieval. The Oryx model exhibits a remarkable capability in understanding long-form video content. Specifically, our Oryx-7B model surpasses all existing 7B model series on long video benchmarks. Furthermore, the Oryx-34B model showcases strong performance across larger MLLMs, achieving a mean accuracy improvement of 4.4% and 0.9% over previous state-of-the-art models equipped with 72B parameter LLMs on the MLVU (Zhou et al., 2024) and LongVideoBench (Wu et al., 2024) benchmarks, respectively. Notably, the Oryx-34B model also outperforms GPT-4o (OpenAI, 2024) on the challenging MLVU (Zhou et al., 2024) benchmark by a margin of 6.2%, underscoring its advanced capabilities in long video understanding.
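The margins quoted above can be re-derived from Table 2. The short sketch below does that arithmetic; the dictionary layout and the helper name `margin` are illustrative choices, and the differences are absolute percentage points rather than relative improvements.

```python
# Accuracy values (%) copied from Table 2. The data layout is illustrative only.
TABLE2 = {
    "Oryx-34B": {"MLVU": 70.8, "LongVideoBench": 62.2},
    "LLaVA-OneVision-72B": {"MLVU": 66.4, "LongVideoBench": 61.3},
    "GPT-4o": {"MLVU": 64.6, "LongVideoBench": 66.7},
}


def margin(model: str, baseline: str, benchmark: str) -> float:
    """Absolute accuracy difference (model minus baseline) in percentage points."""
    diff = TABLE2[model][benchmark] - TABLE2[baseline][benchmark]
    return round(diff, 1)  # one decimal place, matching the table precision


if __name__ == "__main__":
    for baseline, benchmark in [
        ("LLaVA-OneVision-72B", "MLVU"),
        ("LLaVA-OneVision-72B", "LongVideoBench"),
        ("GPT-4o", "MLVU"),
    ]:
        print(f"Oryx-34B vs {baseline} on {benchmark}: "
              f"{margin('Oryx-34B', baseline, benchmark):+.1f}")
```

The same table also shows that GPT-4o remains ahead of Oryx-34B on LongVideoBench (66.7 versus 62.2), so the comparison with proprietary models is benchmark-dependent.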

Figure 3: Visualization Results on Video Needle-In-A-Haystack Experiments. We compare Oryx-7B (right subfigure) with LLaVA-Next-Video-7B (left subfigure) on the frame retrieval task. The results are shown for inserted depths ranging from 0.0 to 1.0 and the number of frames ranging from 0.1k to 1.6k. The Oryx model demonstrates superior performance in long-form understanding tasks, providing precise results even when a single relevant frame is embedded within over 1k frames of irrelevant information.

4.3.2 Video Needle-In-A-Haystack

To demonstrate the retrieval ability on long-form visual inputs and test the quality of the dynamic compression module, we design a video needle-in-a-haystack experiment under extreme conditions, following the methodology of previous work (Zhang et al., 2024a; Xue et al., 2024). For this experiment, we select an extremely long video and insert unrelated image question-answering data as a single frame at an arbitrary depth within the video. The model is tasked with answering questions about the inserted image. We use LLaVA-Next-Video (Zhang et al., 2024c) of comparable size as our baseline. As depicted in Figure 3, the baseline model, trained with 32 frames, fails to identify the images and suffers from severe information loss. In contrast, our method successfully retrieves the inserted images and accurately answers the questions, even with frame counts of 1.6k. This outcome strongly demonstrates the model's ability in long-form temporal understanding, facilitated by the on-demand compression module.
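The construction of a single needle-in-a-haystack sample can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the function names, the rounding rule for the insertion position, and the grid values in the usage example are all assumptions. Frames are represented by arbitrary Python objects.

```python
from typing import List, Sequence, Tuple


def insert_needle(haystack: Sequence, needle, depth: float) -> Tuple[List, int]:
    """Insert `needle` into a copy of `haystack` at a relative depth.

    depth=0.0 places the needle before the first frame and depth=1.0 after the
    last one. Returns the new frame list and the needle's index.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must lie in [0.0, 1.0]")
    position = round(depth * len(haystack))  # rounding rule is an assumption
    frames = list(haystack)
    frames.insert(position, needle)
    return frames, position


def evaluation_grid(frame_counts: Sequence[int], depths: Sequence[float]):
    """Enumerate every (frame count, depth) cell of a retrieval heatmap."""
    return [(n, d) for n in frame_counts for d in depths]


if __name__ == "__main__":
    video = [f"frame_{i}" for i in range(1600)]  # stand-in for decoded frames
    frames, pos = insert_needle(video, "needle_image", depth=0.5)
    print(len(frames), pos)
    # Hypothetical grid spanning the ranges shown in Figure 3.
    print(len(evaluation_grid([100, 400, 800, 1600], [0.0, 0.25, 0.5, 0.75, 1.0])))
```

Each grid cell would then be scored by asking the model the question attached to the inserted image and checking the answer.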

4.4 2D & 3D Spatial Understanding

Table 3: Image Understanding. We conduct 2D spatial understanding tasks on six representative image benchmarks, including general and task-specific benchmarks. Our Oryx model achieves tier-1 performance across a wide range of MLLMs.

| Model | Size | MMBench | MMMU | DocVQA | OCRBench | AI2D | TextVQA |
|---|---|---|---|---|---|---|---|
| Deepseek-VL (Lu et al., 2024) | 7B | 73.2 | 36.6 | - | 456 | - | 64.7 |
| Monkey (Li et al., 2024d) | 7B | 72.4 | 40.7 | - | 534 | 68.5 | - |
| LLaVA-NeXT (Liu et al., 2024c) | 8B | 72.1 | 41.7 | 78.2 | 531 | 71.6 | - |
| Bunny-LLama3 (He et al., 2024) | 8B | 77.2 | 43.3 | - | 444 | 69.4 | - |
| Cambrian-1 (Tong et al., 2024) | 8B | 75.9 | 42.7 | 77.8 | 624 | 73.6 | 71.7 |
| VILA-1.5 (Lin et al., 2023b) | 8B | 75.3 | 38.6 | - | - | - | 68.5 |
| Idefics2 (Laurençon et al., 2024) | 8B | 76.7 | 43.0 | - | - | - | 73.0 |
| Yi-VL (Young et al., 2024) | 34B | - | 45.1 | - | 290 | 65.9 | - |
| LLaVA-NeXT (Liu et al., 2024c) | 34B | 79.3 | 49.7 | 84.0 | 574 | 74.9 | - |
| Cambrian-1 (Tong et al., 2024) | 34B | 81.4 | 49.7 | 75.5 | 600 | 79.7 | 76.7 |
| VILA-1.5 (Lin et al., 2023b) | 40B | 82.4 | 51.9 | - | - | - | 73.4 |
| Oryx | 7B | 81.4 | 43.9 | 89.0 | 672 | 78.5 | 75.0 |
| Oryx | 34B | 84.5 | 50.3 | 91.4 | 743 | 81.0 | 77.8 |

Because Oryx is a general solution for both spatial and temporal understanding, we also evaluate it on image and 3D benchmarks. These results show the foundational multi-modal capabilities of the Oryx model and its potential for extension to more visual tasks, formats, and scenarios.

Image Benchmarks. We select a diverse set of mainstream and representative image benchmarks to evaluate the model's proficiency in image understanding. Specifically, we include MMBench (Liu et al., 2023a) and MMMU (Yue et al., 2024) to assess general image understanding, and DocVQA (Mathew et al., 2021), OCRBench (Liu et al., 2023b), AI2D (Kembhavi et al., 2016), and TextVQA (Singh et al., 2019) to evaluate performance on specific tasks such as document recognition, OCR, and text understanding. The results are summarized in Table 3. Notably, the Oryx model achieves leading results on image benchmarks, such as an 84.5% mean accuracy on MMBench (Liu et al., 2023a) and a 91.4% accuracy on DocVQA (Mathew et al., 2021). These results demonstrate the effectiveness of our method in comprehending images with simpler and more lightweight training pipelines, data curation, and strategies than concurrent works.

3D Spatial Understanding. We conduct the 3D spatial understanding experiments on the classic ScanQA validation set, following the protocol established by previous work (Azuma et al., 2022; Hong et al., 2023; Liu et al., 2024a). For a comprehensive comparison, we include advanced baselines covering both 3D-specific models and general open-source MLLMs that support 3D spatial tasks. As shown in Table 4, Oryx-34B outperforms the earlier 3D-specific models and the recent general MLLMs on every reported metric, and it surpasses the specially designed 3D-LLM (Hong et al., 2023) on METEOR, ROUGE-L, and CIDEr while trailing it slightly on BLEU-1 and BLEU-2. These results underscore the robust adaptability of our method in addressing 3D spatial tasks.

Table 4: 3D Spatial Understanding. We use the popular ScanQA (Azuma et al., 2022) dataset and evaluate the relevant scores. We compare the Oryx model with 3D-specific models together with general open-source MLLMs. Oryx excels in 3D spatial understanding tasks, highlighting its versatility across various applications.

| Model | Size | METEOR | ROUGE-L | CIDEr | BLEU-1 | BLEU-2 |
|---|---|---|---|---|---|---|
| 3D-Specific Models | | | | | | |
| VoteNet (Qi et al., 2019) | - | 11.4 | 29.8 | 54.7 | 28.0 | 16.7 |
| ScanQA (Azuma et al., 2022) | - | 11.5 | 30 | 55.4 | 26.9 | 16.6 |
| ScanRefer (Chen et al., 2020) | - | 13.1 | 33.3 | 64.9 | 30.2 | 20.4 |
| 3D-LLM (Hong et al., 2023) | - | 14.5 | 35.7 | 69.4 | 39.3 | 25.2 |
| General Open-Source MLLMs | | | | | | |
| BLIP2 (Li et al., 2023) | - | 11.3 | 26.6 | 45.7 | 29.7 | 16.2 |
| Flamingo (Alayrac et al., 2022) | 7B | 11.3 | 31.1 | 55 | 25.6 | 15.2 |
| Mantis (Jiang et al., 2024) | 7B | - | 16.1 | - | - | - |
| LLaVA-Next-Interleave (Li et al., 2024b) | 14B | - | 34.5 | - | - | - |
| LLaVA-OneVision (Li et al., 2024a) | 72B | - | 35.8 | - | - | - |
| Oryx | 7B | 14.0 | 35.1 | 68.7 | 35.3 | 22.3 |
| Oryx | 34B | 15.0 | 37.3 | 72.3 | 38.0 | 24.6 |

4.5 Analysis

Effects of resolution and resize strategy across benchmarks. To illustrate the effectiveness of the native representation for visual inputs, we conduct an ablation on the effects of resolution across multi-modal benchmarks in Figure 4. We compare inputs at native resolution with inputs rescaled to a specific overall number of pixels while maintaining the original aspect ratio. The left figure presents the scores on several benchmarks, where we use images scaled to $768^2$ pixels, images scaled to $1024^2$ pixels, images at native resolution, and larger images (2x the area of the native resolution). The results indicate that native resolution consistently outperforms fixed sizes, with the performance gap becoming more pronounced on the DocVQA and OCRBench datasets. These datasets require the visual encoder to process images in their natural form for text understanding. Additionally, further enlarging the resolution does not yield significant gains on most benchmarks. The right figure illustrates the performance trends on MMBench (Liu et al., 2023a) and OCRBench (Liu et al., 2023b) with varying visual input resolutions. Our findings suggest that while larger images generally lead to better performance, maintaining the native resolution emerges as a simple yet effective strategy for optimizing performance.
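The fixed-size baseline in this ablation rescales each image to a target pixel count while keeping its aspect ratio. A minimal sketch of that computation is given below; the function name and the rounding to whole pixels are assumptions, and any further alignment of the sides to the encoder's patch size is deliberately left out.

```python
import math
from typing import Tuple


def rescale_to_area(width: int, height: int, target_side: int) -> Tuple[int, int]:
    """Return (new_width, new_height) whose product is close to target_side**2.

    Both sides are multiplied by the same factor, so the aspect ratio is
    preserved up to rounding.
    """
    scale = math.sqrt(target_side ** 2 / (width * height))
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height


if __name__ == "__main__":
    # A 16:9 image rescaled to the 768^2 pixel budget.
    print(rescale_to_area(1600, 900, 768))
    # A wide 4:1 image that already has exactly 1024^2 pixels is left unchanged.
    print(rescale_to_area(2048, 512, 1024))
```

Under this scheme a wide document page and a square photo receive the same number of pixels, which is exactly the constraint that native-resolution input removes.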

Figure 4: Effects of resolution and resize strategy across benchmarks. The left figure shows the performance across benchmarks with fixed size, native size, and larger images. The right figure shows the trend of performance with varying resolutions, where we include the performance at native resolution for reference. The text-related benchmarks are more sensitive to the resolution scale, while all the benchmarks benefit from visual inputs at native resolution.
Table 5: Ablations on the Oryx Architecture. We evaluate our design of two core architectures within the Oryx model. (a) examines the impact of the visual encoder and the method of processing visual inputs, demonstrating the superiority of native visual representations compared with dynamic partition and the strong visual-text alignment capability of OryxViT. (b) assesses the influence of dynamic compression modules in comparison to conventional MLP connectors, revealing significant performance gains due to improved fusion of image and video data. Various downsampling approaches were tested, with average pooling yielding the best performance.
(a) Ablation study on Visual Encoder.

| Visual Enc. | Res. | DocVQA | OCRBench | MMBench |
|---|---|---|---|---|
| SigLIP | Partition | 74.8 | 531 | 68.0 |
| SigLIP | Native | 17.1 | 67 | 15.8 |
| OryxViT | Partition | 76.3 | 549 | 68.9 |
| OryxViT | Native | 78.5 | 572 | 69.3 |
| OryxViT | Optimal | 79.2 | 572 | 69.9 |

(b) Ablation Study on Compression Module.

| Connector | Downsample | VideoMME | MLVU |
|---|---|---|---|
| MLP | Avg Pool | 54.6 | 57.5 |
| Dy.Compressor | Avg Pool | 55.4 | 59.3 |
| Dy.Compressor | DWConv | 55.0 | 58.9 |
| Dy.Compressor | Conv-MLP | 54.7 | 58.5 |

Effectiveness of the Oryx Architecture. We conduct further ablation experiments on the design of the Oryx architecture in Table 5. For the visual representation, we compare OryxViT with the mainstream SigLIP visual encoder (Zhai et al., 2023), and we compare previous dynamic partition approaches with native-resolution visual inputs under the same settings. The results highlight the superior alignment performance of OryxViT. They also show that SigLIP fails to process native-resolution input and only works at a fixed resolution with the dynamic partition trick. In contrast, OryxViT benefits from visual inputs at native resolution, which outperform the partition approach. Since OryxViT accepts arbitrary resolutions, we also explore the limits of resolution and find that searching for the optimal anchor resolution leads to better performance (the last row in Table 5 (a)). However, for the sake of fairness and efficiency, we do not employ this optimization in our primary evaluations. We report results on several representative image benchmarks, including DocVQA, OCRBench, and the general MMBench dataset, using a subset of the image training data for efficient training.

For the connector module, we compare the proposed dynamic compressor with the popular and straightforward MLP architecture. The dynamic compressor demonstrates superior performance on both general and long temporal benchmarks by better fusing multi-modal data. Furthermore, our analysis reveals that average pooling yields better results on highly compressed visual inputs than parameter-reliant approaches such as DWConv and Conv-MLP. This improvement is likely because average pooling is parameter-free and therefore preserves the distribution of visual features, whereas more complex downsampling layers may not be trained effectively by the current training pipeline. Our analysis of the connector module is conducted on a subset of the video training data to maintain training efficiency.
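To make the parameter-free downsampling concrete, the sketch below average-pools a 2D grid of visual tokens by an integer factor. It is a plain-Python illustration of the pooling step only, not the full dynamic compressor; the function name and the nested-list token layout are assumptions chosen to keep the example dependency-free.

```python
from typing import List

Token = List[float]          # one feature vector
Grid = List[List[Token]]     # tokens arranged as rows x columns


def avg_pool_tokens(grid: Grid, factor: int) -> Grid:
    """Average non-overlapping factor x factor windows of a token grid.

    A factor of 2 reduces the token count 4x and a factor of 4 reduces it 16x.
    No learnable parameters are involved.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid height and width must be divisible by factor")
    dim = len(grid[0][0])
    pooled: Grid = []
    for i in range(0, rows, factor):
        pooled_row = []
        for j in range(0, cols, factor):
            window = [grid[i + di][j + dj]
                      for di in range(factor) for dj in range(factor)]
            pooled_row.append(
                [sum(tok[k] for tok in window) / len(window) for k in range(dim)]
            )
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    # 4 x 4 grid of 1-dimensional tokens holding the values 0..15.
    grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
    print(avg_pool_tokens(grid, 2))
    print(avg_pool_tokens(grid, 4))
```

Because each output token is a plain mean of its window, the pooled features stay within the range of the inputs, which is consistent with the explanation above that pooling preserves the feature distribution.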

5 Conclusion

In this paper, we introduce the Oryx series, a novel approach designed to handle diverse visual inputs across varying tasks, temporal lengths, and resolutions in an on-demand manner. The Oryx model stands out as a unified multi-modal framework for spatial-temporal understanding, leveraging the innovative design of OryxViT for native resolution processing, the Dynamic Compressor for efficient data compression, and a robust joint training strategy. Our extensive evaluations demonstrate that the Oryx model achieves outstanding performance across a wide array of image, video, and 3D mainstream benchmarks. We hope that our work offers a novel perspective on multi-modal learning and paves the way for the development of more general MLLMs in future research endeavors.

References

  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022.
  • Azuma et al. (2022) Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In CVPR, pp.  19129–19139, 2022.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023.
  • Beyer et al. (2023) Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. Flexivit: One model for all patch sizes. In CVPR, pp.  14496–14506, 2023.
  • Chen et al. (2020) Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In European conference on computer vision, pp.  202–221. Springer, 2020.
  • Chen et al. (2024a) Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024a.
  • Chen et al. (2024b) Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pp.  24185–24198, 2024b.
  • Cheng et al. (2024) Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024.
  • Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS, 35:16344–16359, 2022.
  • Dehghani et al. (2024) Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. NeurIPS, 36, 2024.
  • Dosovitskiy (2020) Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Fan et al. (2024) Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. Vitar: Vision transformer with any resolution. arXiv preprint arXiv:2403.18361, 2024.
  • Fang et al. (2024) Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515, 2024.
  • Fei et al. (2024) Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024.
  • Fu et al. (2024) Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024.
  • GeminiTeam (2024) GeminiTeam. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  • He et al. (2024) Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530, 2024.
  • Hong et al. (2023) Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. NeurIPS, 36:20482–20494, 2023.
  • Huang et al. (2020) Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. In ECCV, pp.  709–727. Springer, 2020.
  • Jiang et al. (2024) Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024.
  • Kembhavi et al. (2016) Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pp.  235–251. Springer, 2016.
  • Laurençon et al. (2024) Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024.
  • Li et al. (2024a) Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a.
  • Li et al. (2024b) Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024b.
  • Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pp.  19730–19742. PMLR, 2023.
  • Li et al. (2024c) Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, pp.  22195–22206, 2024c.
  • Li et al. (2024d) Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024d.
  • Lin et al. (2023a) Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023a.
  • Lin et al. (2023b) Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023b.
  • Liu et al. (2024a) Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondence elicit 3d spacetime understanding in multimodal language model. arXiv preprint arXiv:2408.00754, 2024a.
  • Liu et al. (2024b) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pp.  26296–26306, 2024b.
  • Liu et al. (2024c) Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024c.
  • Liu et al. (2024d) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024d.
  • Liu et al. (2024e) Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. arXiv preprint arXiv:2408.15542, 2024e.
  • Liu et al. (2023a) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023a.
  • Liu et al. (2023b) Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023b.
  • Liu et al. (2024f) Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models. arXiv preprint arXiv:2403.12966, 2024f.
  • Liu et al. (2024g) Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, and Jiwen Lu. Efficient inference of vision instruction-following models with elastic cache. arXiv preprint arXiv:2407.18121, 2024g.
  • Lu et al. (2024) Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
  • Maaz et al. (2023) Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
  • Maaz et al. (2024) Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Videogpt+: Integrating image and video encoders for enhanced video understanding. arXiv preprint arXiv:2406.09418, 2024.
  • Mathew et al. (2021) Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  2200–2209, 2021.
  • OpenAI (2023a) OpenAI. Openai gpt-3.5 api. OpenAI API, 2023a.
  • OpenAI (2023b) OpenAI. Gpt-4v(ision) system card. OpenAI Blog, 2023b. URL https://openai.com/index/gpt-4v-system-card/.
  • OpenAI (2023c) OpenAI. Gpt-4 technical report. ArXiv:abs/2303.08774, 2023c. URL https://arxiv.org/abs/2303.08774.
  • OpenAI (2024) OpenAI. Hello gpt-4o — openai. OpenAI Blog, 2024. URL https://openai.com/index/hello-gpt-4o/.
  • Patraucean et al. (2024) Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36, 2024.
  • Qi et al. (2019) Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  9277–9286, 2019.
  • Qian et al. (2024) Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. arXiv preprint arXiv:2405.16009, 2024.
  • QwenTeam (2024a) QwenTeam. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a.
  • QwenTeam (2024b) QwenTeam. Qwen2-vl: To see the world more clearly. Qwen Blog, 2024b. URL https://qwenlm.github.io/blog/qwen2-vl/.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pp.  8748–8763. PMLR, 2021.
  • Ranzinger et al. (2024) Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. In CVPR, pp.  12490–12500, 2024.
  • Rawal et al. (2024) Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813, 2024.
  • Singh et al. (2019) Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8317–8326, 2019.
  • Tong et al. (2024) Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024.
  • Wu et al. (2024) Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024.
  • Xiao et al. (2021) Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, pp.  9777–9786, 2021.
  • Xu et al. (2024a) Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024a.
  • Xu et al. (2024b) Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703, 2024b.
  • Xue et al. (2024) Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024.
  • Yang et al. (2023a) Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, et al. Octopus: Embodied vision-language programmer from environmental feedback. arXiv preprint arXiv:2310.08588, 2023a.
  • Yang et al. (2023b) Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023b.
  • Yao et al. (2024) Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
  • Young et al. (2024) Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652, 2024.
  • Yue et al. (2024) Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  9556–9567, 2024.
  • Zhai et al. (2023) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  11975–11986, 2023.
  • Zhang et al. (2024a) Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024a.
  • Zhang et al. (2024b) Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language model reward. arXiv preprint arXiv:2404.01258, 2024b.
  • Zhang et al. (2024c) Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024c. URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/.
  • Zhou et al. (2024) Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024.

Appendix A Generation Results

Video Summarization and Detailed Description. As shown in Fig. 5, the Oryx model effectively generates a comprehensive and detailed caption that accurately summarizes the input video. It captures the main event while preserving essential information.

Figure 5: Oryx is able to make a comprehensive video summary and detailed caption.

Video Multiple Choice and Reasoning. Oryx is also capable of reasoning based on the input video. As demonstrated in Fig. 6, Oryx can answer questions through analogy and generate well-reasoned responses.

Figure 6: Oryx learns to reason through the input video.

Skill Learning From Videos. Oryx can acquire useful skills from the input video. As demonstrated in Fig. 7, Oryx learns to use Google Scholar to cite a paper by following the steps shown in the video. It illustrates all the necessary steps to complete the citation, highlighting its strong skill-learning capability and potential for agent-based tasks and task execution.

Figure 7: Oryx learns useful skills from the input video.

Understanding 3D with Coarse Correspondences. Oryx enhances its 3D spatial understanding using coarse correspondences. Fig. 8 illustrates Oryx’s reasoning process, demonstrating its ability to improve 3D comprehension through these correspondences and generate accurate reasoning outcomes.

Figure 8: Oryx understands 3D spatial information through coarse correspondences.

Appendix B Training Details

Stage 1. In stage 1, we first pre-train the connector module between the visual encoder and the Large Language Model to establish an initial alignment between the image and text modalities. We conduct our experiments on the 558k caption data from the BLIP (Li et al., 2023) model, following LLaVA-1.5 (Liu et al., 2024b). We unfreeze only the connector parameters and keep all other parameters fixed. We use a total training batch size of 256 and a learning rate of 1e-3. We maintain the aspect ratio of each input image while adjusting its total number of pixels to $768^{2}$ to reduce the computational cost. The pre-training alignment is lightweight thanks to the small number of connector parameters and the relatively small number of image-text pairs. Subsequently, we conduct the supervised fine-tuning stage with 4.1M image data. We freeze the visual encoder while unfreezing the connector and the Large Language Model, following common practice. In this stage, we use the native resolution of each image while restricting the maximum number of pixels to $1280^{2}$ for efficiency. Images larger than $1280^{2}$ pixels are scaled down to fit this pixel budget. We set the learning rate to 2e-5 for Oryx-7B and 1e-5 for Oryx-34B. We use a total batch size of 128 and conduct our experiments on 64 NVIDIA A100-40G GPUs for Oryx-7B and 64 NVIDIA A800-80G GPUs for Oryx-34B, as larger models need more GPU memory. The maximum sequence length is set to 8192.
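
The pixel restriction used in the supervised fine-tuning stage can be illustrated with a minimal sketch. The function name `fit_to_pixel_budget` is hypothetical, and both the aspect-ratio-preserving rescale and the rounding down to whole pixels are assumptions made for illustration; the released Oryx preprocessing may differ in these details.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels=1280 ** 2):
    """Return (new_width, new_height) with at most `max_pixels` pixels.

    Images already within the budget keep their native resolution.
    Larger images are scaled down uniformly, so the aspect ratio is
    preserved up to rounding (an assumption of this sketch).
    """
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    # Uniform scale factor that brings the area down to the budget.
    scale = math.sqrt(max_pixels / pixels)
    # Round down so the result never exceeds the budget.
    new_width = max(1, math.floor(width * scale))
    new_height = max(1, math.floor(height * scale))
    return new_width, new_height


# A small image is left untouched; a large one is shrunk to the budget.
small = fit_to_pixel_budget(640, 480)
large = fit_to_pixel_budget(4000, 3000)
```

Under the same assumptions, passing `max_pixels=1536 ** 2` would model the looser image limit used in stage 2.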

Stage 2. In stage 2, we continue training the Oryx model from the multi-modal LLM obtained in stage 1. We randomly sample around 600k image data from the supervised fine-tuning data of stage 1 and add 650k temporal and 3D data from open-source multi-modal datasets, resulting in about 1.2M supervised fine-tuning samples in total. In this more general stage, we raise the limit on image pixels to $1536^{2}$ to match the longer sequence lengths of temporal data. We maintain the aspect ratio of video data while normalizing each frame to a minimum size of $288^{2}$ pixels and a maximum size of $480^{2}$ pixels, so the token length before the compression module ranges from 324 to 900. We adopt the $1\times 1$ path for image data, the $2\times 2$ pooling path for multi-frame data, including video and 3D-relevant data, and the $4\times 4$ pooling path for the extremely long video needle-in-the-haystack retrieval data. We keep most training hyper-parameters identical to stage 1, with a total batch size of 128, a learning rate of 2e-5 for Oryx-7B, and a learning rate of 1e-5 for Oryx-34B. We sample video data at 1 frame per second and cap the number of frames at 64. If a video exceeds this upper bound, we sample frames uniformly across all of its frames. The maximum sequence length is set to 16384.
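
The frame sampling rule and the token counts quoted above can be checked with a short sketch. It is not the authors' implementation: the function names are hypothetical, the 16-pixel patch size is inferred from the stated range ($288/16 = 18$ and $18^{2} = 324$; $480/16 = 30$ and $30^{2} = 900$), and rounding the token grid up when it is not divisible by the pooling factor is an assumption.

```python
import math


def sample_frame_times(duration_s, fps=1.0, max_frames=64):
    """Timestamps in seconds: one frame per 1/fps seconds, capped at max_frames.

    When the video yields more candidates than max_frames, indices are
    spread uniformly over all candidate frames.
    """
    n = int(duration_s * fps)
    times = [i / fps for i in range(n)]
    if n <= max_frames:
        return times
    indices = [(i * n) // max_frames for i in range(max_frames)]
    return [times[i] for i in indices]


def tokens_per_frame(side, patch=16, pool=1):
    """Visual tokens for a square side x side frame.

    patch=16 is inferred from the 324-to-900 range, not stated in the text.
    pool is the side of the pooling window (1, 2, or 4); the grid is
    rounded up when it is not divisible by pool (an assumption).
    """
    grid = side // patch
    return math.ceil(grid / pool) ** 2


# A 10-minute video has 600 candidate frames at 1 fps and is cut to 64.
times = sample_frame_times(600)

# Before compression: 324 tokens at 288^2 and 900 tokens at 480^2.
low, high = tokens_per_frame(288), tokens_per_frame(480)

# With the 2x2 path, 64 frames at 480^2 give 64 * 225 = 14400 visual tokens.
video_tokens = len(times) * tokens_per_frame(480, pool=2)
```

Under these assumptions, the largest video input on the $2\times 2$ path uses 14400 visual tokens, which fits within the stated maximum sequence length of 16384.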