VL Model Behind Doubao AI Phone


According to public reports, the model behind the Doubao AI phone is a closed-source version of UI-TARS optimized for mobile devices. UI-TARS is obtained by SFT on Alibaba's Qwen2 VL, and a 7B version is currently open-sourced (Qwen2 VL itself has open-source models ranging from 2B to 72B). We won't delve into Qwen here (Qwen2 VL already has UI-operation capabilities), but will focus on the further improvements UI-TARS makes on top of Qwen2 VL, split into a data section and a training section.

Data

The core lies in more refined data construction: detailed annotations are built for each UI screenshot from the bottom up, from the smallest button to the overall layout, and even captions of the interface before and after state changes.

| Data Type | Description |
| --- | --- |
| Element Description | Information about a single element: type (button, input box, etc., similar to frontend component classification), visual description (color, appearance, etc.), positional information (relative spatial position), and element function (e.g., delete an email) |
| Dense Caption | A detailed paragraph describing the entire interface |
| State Transition Caption | A set of images describing what changed between the before and after states, and whether an operation such as a button press was performed |
| QA | Questions and answers about the UI interface |
| Set of Mark | Marks drawn on the UI (e.g., a box around a region), with QA constructed based on these marks |
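As a rough illustration of the per-element annotation described above, a single record might look like the sketch below. The field names and values are hypothetical; the paper does not publish a concrete schema.

```python
# Hypothetical element-description record for one UI screenshot.
# Field names are illustrative; UI-TARS does not publish an exact schema.
element_annotation = {
    "screenshot_id": "screen_0001",
    "element_type": "button",                       # button, input box, checkbox, ...
    "visual_description": "red circular button with a white trash-can icon",
    "position": {                                   # relative spatial information
        "bbox": [0.82, 0.05, 0.95, 0.12],           # normalized (x1, y1, x2, y2)
        "relative": "top-right corner of the toolbar",
    },
    "function": "delete the currently selected email",
}

# A dense caption and a QA pair built on top of the same screenshot.
dense_caption = (
    "An email client showing an inbox list on the left and the opened message "
    "on the right; a toolbar with reply, forward and delete buttons sits above "
    "the message."
)
qa_pair = {
    "question": "Which control removes the open email?",
    "answer": "The red trash-can button in the top-right corner of the toolbar.",
}
```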


The paper mentions constructing a total of 50 billion tokens of data to train the 7B and 72B models (Qwen2 VL’s pre-training already used 1.4 trillion tokens of data).

In addition to this data, during the subsequent SFT training phase, error correction data pairs (error + correction) were also constructed, instructing the Agent on how to recover after making a mistake on the UI, which is a major highlight (constructing these complex and deeply annotated datasets seems to cost a lot…).

Training

The training process of UI-TARS can be divided into three steps: pre-training, SFT, and DPO.

1. Pre-training

Pre-training uses all the data described above, essentially continued pre-training on top of Qwen2 VL with this domain-specific data. Using ChatGPT's estimate, pre-training the 7B and 72B models on 50 billion tokens translates to approximately:

  • 7B: ≈ 49.2 – 70.2 H200 GPU-days
  • 72B: ≈ 505.6 – 722.2 H200 GPU-days

It looks manageable; with 128 cards it would finish quite quickly.
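These numbers can be reproduced with the common FLOPs ≈ 6 · N · D rule of thumb. The sketch below assumes an H200 BF16 dense peak of roughly 989 TFLOPS and a 35–50% model FLOPs utilization (MFU); both figures are assumptions, not numbers from the paper.

```python
# Back-of-envelope GPU-days for continued pre-training, using the common
# FLOPs ~= 6 * params * tokens approximation. Peak throughput and MFU are
# assumptions, not numbers taken from the UI-TARS paper.
H200_PEAK_FLOPS = 989e12          # assumed BF16 dense peak, FLOP/s
MFU_RANGE = (0.35, 0.50)          # assumed model FLOPs utilization
TOKENS = 50e9                     # 50B tokens of UI-specific data

def gpu_days(params: float) -> tuple[float, float]:
    """Return (optimistic, pessimistic) H200 GPU-days for one training pass."""
    total_flops = 6 * params * TOKENS
    fast = total_flops / (H200_PEAK_FLOPS * MFU_RANGE[1]) / 86400
    slow = total_flops / (H200_PEAK_FLOPS * MFU_RANGE[0]) / 86400
    return fast, slow

for name, params in [("7B", 7e9), ("72B", 72e9)]:
    fast, slow = gpu_days(params)
    print(f"{name}: ~{fast:.0f}-{slow:.0f} H200 GPU-days")
    # 7B:  ~49-70 GPU-days   -> well under a day on 128 cards
    # 72B: ~506-722 GPU-days -> a few days on 128 cards
```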

2. SFT

This stage is more refined, using not only the high-quality parts of the data mentioned above but also semi-automatically generated trace data + error correction data to further enhance sequence operation capabilities.

Trace

For trace (action-sequence) data, naturally occurring data is very sparse, so a semi-automatic generate-and-iterate approach is used. Each iteration creates a batch of tasks for the model to run; through manual annotation, model scoring, etc., high-quality traces are selected and fed into the next round of training, so the model is iteratively improved on high-quality data it generated itself.
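The paper describes this loop only at a high level. A minimal sketch of the data flywheel, with every component passed in as a placeholder callable, might look like this:

```python
from typing import Any, Callable, Iterable, List

Trace = Any  # stand-in for an (observation, thought, action) sequence

def bootstrap_traces(
    model: Any,
    sample_tasks: Callable[[], Iterable[Any]],     # produce a batch of UI tasks
    run_agent: Callable[[Any, Any], Trace],        # (model, task) -> executed trace
    score_trace: Callable[[Any, Trace], float],    # automatic model scoring
    human_accepts: Callable[[Any, Trace], bool],   # manual annotation / spot check
    finetune: Callable[[Any, List[Trace]], Any],   # (model, traces) -> new model
    rounds: int = 3,
    threshold: float = 0.8,
) -> Any:
    """Each round, the model runs a batch of tasks, only traces that pass both
    automatic scoring and human review are kept, and the model is retrained on
    the surviving traces before the next round."""
    for _ in range(rounds):
        kept: List[Trace] = []
        for task in sample_tasks():
            trace = run_agent(model, task)
            if score_trace(task, trace) >= threshold and human_accepts(task, trace):
                kept.append(trace)
        model = finetune(model, kept)  # the improved model generates the next batch
    return model
```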

Reflection Tuning

Error correction data involves taking the model’s erroneous trace, re-annotating it to obtain positive samples, and using these positive samples as SFT training data. There are two ways to construct positive samples:

  • Directly correct the wrong operation to the right one, so the model avoids making mistakes:

    $$
    \left\{
    \begin{aligned}
    \mathcal{T}_{-} &= \bigl( \text{instruction},\ (o_1, t_1, a_1),\ (o_2, t_2, a_2),\ \ldots,\ (o_\tau, \textcolor{red}{t_\tau}, \textcolor{red}{a_\tau}) \bigr) \\
    \mathcal{T}_{+} &= \bigl( \text{instruction},\ (o_1, t_1, a_1),\ (o_2, t_2, a_2),\ \ldots,\ (o_\tau, \textcolor{green}{t_\tau^{*}}, \textcolor{green}{a_\tau^{*}}) \bigr)
    \end{aligned}
    \right.
    $$
  • Change the next step of the wrong operation to a corrective action, so the model knows how to correct after making a mistake:

    $$
    \left\{
    \begin{aligned}
    \mathcal{T}_{-} &= \bigl( \text{instruction},\ (o_1, t_1, a_1),\ (o_2, t_2, a_2),\ \ldots,\ (o_\tau, \textcolor{red}{t_\tau}, \textcolor{red}{a_\tau}),\ (o_{\tau+1}, t_{\tau+1}, a_{\tau+1}) \bigr) \\
    \mathcal{T}_{+} &= \bigl( \text{instruction},\ (o_1, t_1, a_1),\ (o_2, t_2, a_2),\ \ldots,\ (o_\tau, \textcolor{red}{t_\tau}, \textcolor{red}{a_\tau}),\ (o_{\tau+1}, \textcolor{green}{t_{\tau+1}^{*}}, \textcolor{green}{a_{\tau+1}^{*}}) \bigr)
    \end{aligned}
    \right.
    $$
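In code, the two constructions differ only in where the trace is cut and which step is replaced. The sketch below uses plain tuples purely to illustrate the pairing logic; the paper's actual data format is not public.

```python
from typing import List, Tuple

Step = Tuple[str, str, str]  # (observation, thought, action)

def correct_error_step(trace: List[Step], err_idx: int,
                       corrected: Step) -> Tuple[List[Step], List[Step]]:
    """Way 1: the erroneous step itself is replaced by the annotated correction."""
    negative = trace[: err_idx + 1]            # trace ending in the wrong step
    positive = trace[:err_idx] + [corrected]   # same prefix, wrong step replaced
    return negative, positive

def correct_after_error(trace: List[Step], err_idx: int,
                        corrective_next: Step) -> Tuple[List[Step], List[Step]]:
    """Way 2: the error is kept, and the *next* step becomes a corrective action."""
    negative = trace[: err_idx + 2]                      # error + original next step
    positive = trace[: err_idx + 1] + [corrective_next]  # error + corrective step
    return negative, positive
```

Only the positive traces are used in this SFT stage; the negative ones are not wasted, since they come back in the DPO stage described next.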

3. DPO

In the previous SFT stage, the erroneous samples were only corrected into positive samples for training; the information in the negative samples themselves was not used. The idea of DPO is somewhat like an SVM: not only separate positive and negative samples, but also maximize the distance between them.

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{\tau} \Big[ \log \sigma \Big( \beta \log \frac{\pi_\theta(a'_\tau \mid s_\tau)}{\pi_{\text{SFT}}(a'_\tau \mid s_\tau)} - \beta \log \frac{\pi_\theta(a_\tau \mid s_\tau)}{\pi_{\text{SFT}}(a_\tau \mid s_\tau)} \Big) \Big]
$$

DPO constructs the loss above for training, where each log-ratio measures the preference of the training model $\pi_\theta$ relative to the SFT model $\pi_{\text{SFT}}$: the first term is that preference on positive samples, the second on negative samples. The optimization maximizes the first and minimizes the second, i.e., compared with the old SFT model, the trained model should lean further towards positive samples and further away from negative ones, making its preference between the two clearer and more distinct.
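For reference, the objective fits in a few lines of PyTorch. This is a generic DPO loss over per-step action log-probabilities, not UI-TARS's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos: torch.Tensor,  # log pi_theta(a'_t | s_t), corrected action
             policy_logp_neg: torch.Tensor,  # log pi_theta(a_t  | s_t), erroneous action
             ref_logp_pos: torch.Tensor,     # log pi_SFT(a'_t | s_t)
             ref_logp_neg: torch.Tensor,     # log pi_SFT(a_t  | s_t)
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy's preference for the corrected action
    above its preference for the erroneous one, relative to the frozen SFT model."""
    pos_logratio = policy_logp_pos - ref_logp_pos  # log(pi_theta / pi_SFT) on positives
    neg_logratio = policy_logp_neg - ref_logp_neg  # log(pi_theta / pi_SFT) on negatives
    return -F.logsigmoid(beta * (pos_logratio - neg_logratio)).mean()
```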

The experimental section is not elaborated here; the original paper provides detailed experiments on perception, grounding, etc. Overall, there is a significant improvement compared to the original Qwen2 VL. Additionally, the paper uses reasoning and other methods to further enhance the effect, which are relatively general techniques and are not repeated here.