Tech

VL Model Behind Doubao AI Phone

According to public reports, the model behind the Doubao AI phone is a closed-source, mobile-optimized version of UI-TARS. UI-TARS is obtained via SFT on Alibaba's Qwen2-VL, and a 7B version has been open-sourced (Qwen2-VL itself is open-sourced in sizes from 2B to 72B). This post will not delve into Qwen (Qwen2-VL already ships with UI-operation capabilities), but focuses on the further improvements UI-TARS makes on top of Qwen2-VL, covering both data and training.

Quantitative Analysis of PyTorch Training Acceleration

This article starts from a baseline and progressively optimizes training speed through a series of software and hardware techniques, ultimately cutting training time to 1/8 of the original.

Milestones in Neural Architecture Search (NAS)

Neural Architecture Search (NAS) has been extremely popular this year. This post briefly outlines some of the works I find particularly representative; feel free to point out any errors or omissions.

Feeding the GPU in Deep Learning

Recently, I trained several models and found that more GPUs don't always lead to better results. Sometimes, there's no difference between using one V100 and two V100s. I later discovered the bottleneck was elsewhere. This article summarizes some tricks I've used.
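One of the simplest fixes when the GPU starves is to overlap data loading with computation, which is what PyTorch's `DataLoader` with `num_workers > 0` does for real. Below is a minimal pure-Python sketch of the prefetching idea; the `sleep` calls are stand-ins for I/O and GPU time, and the batch values are arbitrary:

```python
import threading, queue, time

def loader(batches, q):
    # Simulates a slow data pipeline (disk reads, decoding, augmentation).
    for b in batches:
        time.sleep(0.01)          # pretend I/O latency
        q.put(b)
    q.put(None)                   # sentinel: no more data

def train(batches, prefetch=4):
    # A background thread fills a bounded queue while the main thread
    # "computes", so loading and compute overlap instead of alternating.
    q = queue.Queue(maxsize=prefetch)
    t = threading.Thread(target=loader, args=(batches, q), daemon=True)
    t.start()
    processed = []
    while True:
        b = q.get()
        if b is None:
            break
        time.sleep(0.01)          # pretend GPU compute
        processed.append(b * 2)   # stand-in for a training step
    t.join()
    return processed

print(train(list(range(8))))      # [0, 2, 4, 6, 8, 10, 12, 14]
```

With the queue in place the total wall time approaches max(load, compute) per batch rather than their sum.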

Learning to Push by Grasping: Using Multiple Tasks for Effective Learning

Currently, end-to-end learning frameworks are becoming popular in the field of robotic control. These frameworks take states/images as direct input and output predicted torque and action parameters. However, they have been criticized for their high data demands, sparking discussions about their scalability. Specifically, does end-to-end learning require a separate model for each task? Intuitively, sharing between tasks is beneficial because they require some common understanding of the environment. This paper explores the next step in data-driven end-to-end learning frameworks, moving from task-specific models to joint models for multiple robotic tasks, yielding surprising results: multi-task learning outperforms single-task learning with the same amount of data. For example, in the grasp task, a model trained with 2.5k grasp data and 2.5k push data performs better than a model trained with 5k grasp data alone.

Playing Atari with Deep Reinforcement Learning

This paper by Mnih et al., presented at the NIPS 2013 Deep Learning Workshop, is the pioneering work on DQN, together with its follow-up published in Nature in 2015.

Cityscapes Dataset

Cityscapes is typically used for semantic segmentation. Its annotations are organized into 8 categories (one of which is named "void"), and each category contains several classes, for 30 classes in total. After numbering, however, there are 35 labeled types, because labels such as "unlabeled" appear in the label list but are not counted as classes.

Efficient Dense Modules of Asymmetric Convolution for Real-Time Semantic Segmentation

Previous segmentation networks were either slow or low in accuracy. This paper designs the EDANet module, which combines asymmetric convolution, dilated convolution, and dense connectivity. It outperforms FCN in all respects and requires no decoder structure, context module, post-processing scheme, or pretrained model. Experiments were conducted on Cityscapes and CamVid.

Darts: Differentiable Architecture Search

This paper tackles the scalability challenge of architecture search by formulating the task in a differentiable way, rather than relying on traditional methods that search a discrete, non-differentiable space with reinforcement learning or evolution. The approach is based on a continuous relaxation of the architecture representation, which allows efficient gradient descent to be used for architecture search. Experiments show that the algorithm excels at finding high-performance CNN architectures for image recognition and RNN architectures for language modeling, while being much faster than state-of-the-art non-differentiable search methods.
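The core trick of the continuous relaxation is easy to sketch: instead of picking one operation per edge, every candidate operation is applied and the outputs are combined with softmax weights over learnable architecture parameters. A toy sketch (the three lambda "operations" are stand-ins for real convolutions and poolings):

```python
import math

# Three toy candidate operations on an edge (stand-ins for conv, pool, zero).
ops = [lambda x: x, lambda x: 2 * x, lambda x: 0.0]

def mixed_op(x, alpha):
    # Continuous relaxation: a softmax over architecture parameters alpha
    # replaces the discrete choice of one operation with a weighted sum,
    # so alpha can be optimized by plain gradient descent.
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * op(x) for w, op in zip(weights, ops))

# Equal alphas -> equal weights -> the average of x, 2x, and 0.
print(mixed_op(3.0, [0.0, 0.0, 0.0]))
```

After search converges, the final discrete architecture is read off by keeping the operation with the largest weight on each edge.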

Compressing Neural Networks with the Hashing Trick

Deep networks are increasingly applied on mobile devices, highlighting a dilemma: while deep learning trends toward developing models that can absorb larger datasets, mobile devices have limited storage and cannot accommodate overly large models. HashedNets are introduced to reduce model size by minimizing inherent redundancy within neural networks. HashedNets use a low-cost hash function to randomly group connection weights into different hash buckets, where all connections in the same bucket share a single parameter value, adjusted during standard backpropagation. This hashing process does not incur additional memory overhead. Performance on various benchmark datasets demonstrates that HashedNets can significantly reduce storage requirements while maintaining generalization performance.
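The weight-sharing mechanism can be sketched in a few lines. The modulo expression below is a cheap deterministic stand-in for the paper's hash function, and the matrix sizes are arbitrary:

```python
def bucket(i, j, K):
    # Cheap deterministic "hash" of a connection (i, j) into one of K buckets;
    # a stand-in for the low-cost hash function used by HashedNets.
    return (i * 31 + j * 17) % K

def virtual_weight(i, j, params):
    # Every connection reads its weight from a shared bucket, so a
    # rows x cols "virtual" matrix costs only len(params) real parameters.
    return params[bucket(i, j, len(params))]

params = [0.5, -1.0, 2.0]                 # K = 3 true parameters
rows, cols = 4, 5                         # 20 virtual weights
W = [[virtual_weight(i, j, params) for j in range(cols)] for i in range(rows)]

# During backprop, each bucket accumulates the gradients of all its
# connections, so the shared parameters are trained by standard SGD.
grad_W = [[1.0] * cols for _ in range(rows)]
grad_params = [0.0] * len(params)
for i in range(rows):
    for j in range(cols):
        grad_params[bucket(i, j, len(params))] += grad_W[i][j]
print(sum(grad_params))   # 20.0 : every virtual weight contributes once
```

No extra memory is needed because the bucket index is recomputed from (i, j) on the fly rather than stored.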

ShuffleNetV2

Many network designs today focus on non-direct metrics like FLOPs for computational complexity, but direct metrics such as speed are influenced by more than just FLOPs, including MAC (memory access cost) and platform characteristics. This article aims to measure directly on specific platforms, which is more effective than only considering FLOPs. Through a series of controlled experiments, it proposes guidelines for efficient networks, leading to the development of a new architecture, ShuffleNetV2. Comprehensive ablation experiments demonstrate that this model achieves state-of-the-art performance in balancing efficiency and accuracy.

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

This article introduces an efficient network, ShuffleNet, which primarily uses pointwise group convolution and channel shuffle operations. These techniques significantly reduce computational costs while maintaining accuracy, outperforming previous networks on ImageNet and COCO.
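The channel shuffle operation itself is just a reshape and a transpose. A minimal NumPy sketch (tensor shape and group count chosen purely for illustration):

```python
import numpy as np

def channel_shuffle(x, groups):
    # Reshape channels into (groups, channels_per_group), swap those two
    # axes, then flatten back: channels from different groups interleave,
    # letting information flow across groups after grouped convolutions.
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)

x = np.arange(6).reshape(1, 6, 1, 1)    # channels [0, 1, 2, 3, 4, 5]
y = channel_shuffle(x, groups=2)
print(y.ravel())                         # [0 3 1 4 2 5]
```

With groups=2, channels [0, 1, 2] and [3, 4, 5] become [0, 3, 1, 4, 2, 5], so the next grouped convolution sees inputs from both groups.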

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

For mobile and embedded vision applications, this post introduces MobileNets, a class of efficient, lightweight models built from depthwise separable convolutions. Two hyperparameters trade off accuracy against latency, and extensive experiments on ImageNet demonstrate strong performance compared with other models. Further experiments showcase MobileNets' effectiveness across applications including object detection, fine-grained classification, facial attributes, and large-scale geolocation.
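The saving from depthwise separable convolutions is easy to verify with a back-of-the-envelope count of multiplications; the layer sizes below are arbitrary examples:

```python
def standard_conv_mults(k, c_in, c_out, h, w):
    # Multiplications for a standard k x k convolution over an h x w map.
    return k * k * c_in * c_out * h * w

def separable_conv_mults(k, c_in, c_out, h, w):
    # Depthwise stage (one k x k filter per input channel) plus
    # pointwise stage (1 x 1 convolution mixing channels).
    return k * k * c_in * h * w + c_in * c_out * h * w

std = standard_conv_mults(3, 256, 256, 14, 14)
sep = separable_conv_mults(3, 256, 256, 14, 14)
print(round(std / sep, 1))   # 8.7 : matches the 1/(1/c_out + 1/k^2) ratio
```

For 3x3 kernels the reduction is roughly 8-9x, which is where most of MobileNets' efficiency comes from.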

InceptionV4 Summary

In recent years, very deep convolutional neural networks have significantly enhanced image recognition performance. The Inception network structure offers excellent performance with relatively low computational cost. The combination of recent residual connections with traditional structures achieved the best results at the 2015 ILSVRC, comparable to InceptionV3. Integrating Inception networks with residual connections has been shown to significantly accelerate the training of Inception networks. There is also evidence that Inception networks with residual connections perform slightly better than those without, despite having nearly the same computational load. This article introduces some new Inception networks with and without residual connections, which also noticeably improved single-frame classification performance in the 2012 ILSVRC. Lastly, it mentions that using appropriate activation scaling can make training very wide residual connection Inception networks more stable.

Derivatives of Vectors and Matrices

In machine learning algorithms, you'll encounter numerous matrix-related differentiation and derivation tasks. Here, we introduce some common differentiation formulas related to matrices and vectors.
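As a taste of the formulas covered, here are two of the most frequently used identities, written in the denominator-layout convention (conventions differ between texts, so signs of transposes may vary), with a a constant vector and A a constant matrix:

```latex
\frac{\partial\, (a^{\top} x)}{\partial x} = a,
\qquad
\frac{\partial\, (x^{\top} A x)}{\partial x} = (A + A^{\top})\, x
```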

A General Solution to Stock Problems in Dynamic Programming

There is a type of dynamic programming problem where you are given a stock price sequence and need to calculate the maximum profit from buying and selling stocks. These problems often have many variations, such as allowing only one transaction, multiple transactions, or imposing a transaction tax. The maximum profit is usually determined by the timing of the trades and the allowed maximum number of transactions (each transaction being a combination of one buy and one sell).
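The two simplest variants can be written in a few lines each, which also shows the general pattern of tracking state across days:

```python
def max_profit_one(prices):
    # At most one transaction: track the lowest price seen so far and the
    # best profit achievable by selling today.
    best, low = 0, float("inf")
    for p in prices:
        low = min(low, p)
        best = max(best, p - low)
    return best

def max_profit_unlimited(prices):
    # Unlimited transactions: collect every positive day-over-day rise.
    return sum(max(b - a, 0) for a, b in zip(prices, prices[1:]))

prices = [7, 1, 5, 3, 6, 4]
print(max_profit_one(prices), max_profit_unlimited(prices))   # 5 7
```

The harder variants (at most k transactions, cooldown, transaction fee) add extra dimensions to the same state, which is what the general solution in the post organizes.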

Definition of Convex Sets and Common Convex Sets

A set C is convex if, for any two points x, y in C and any θ in [0, 1], the point θx + (1 − θ)y also lies in C; in other words, the line segment between any two points of the set stays inside the set. This post introduces this definition along with some commonly encountered convex sets.

Derivation of SVM (3)

In the previous post, we introduced the derivation of hard-margin SVM. This article will continue with the mathematical derivation of soft-margin SVM, which allows for some misclassification when samples are not linearly separable.
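In the usual notation, with slack variables ξ_i measuring each sample's margin violation and a penalty coefficient C, the soft-margin primal problem that this derivation starts from is:

```latex
\min_{w,\, b,\, \xi}\ \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\quad i = 1, \dots, n
```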

Derivation of SVM (2)

In the previous article (1), we discussed the derivation of hard-margin SVM and arrived at its dual form, which this post simplifies further.
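For reference, the hard-margin dual in standard notation (Lagrange multipliers α_i, labels y_i in {−1, +1}) is:

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
- \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
\alpha_i \alpha_j\, y_i y_j\, x_i^{\top} x_j
\quad \text{s.t.} \quad
\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \ge 0
```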

Derivation of SVM (1)

SVM is a classic method in machine learning. Besides hard-margin SVM, it includes variants like soft-margin SVM and kernel tricks. This article mainly introduces the derivation of **hard-margin SVM**.
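In standard notation, the hard-margin primal problem that the derivation builds toward is maximizing the margin, which is equivalent to:

```latex
\min_{w,\, b}\ \frac{1}{2}\|w\|^{2}
\quad \text{s.t.} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \quad i = 1, \dots, n
```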

Solving Systems of Linear Equations (3)

The pseudoinverse discussed here is the **Moore-Penrose pseudoinverse**.
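NumPy computes the Moore-Penrose pseudoinverse via SVD with `np.linalg.pinv`; a quick check of its defining properties (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])          # non-square: no ordinary inverse exists

A_pinv = np.linalg.pinv(A)          # Moore-Penrose pseudoinverse (via SVD)

# One of the four Penrose conditions: A A+ A = A.
assert np.allclose(A @ A_pinv @ A, A)

# When A has full column rank, A+ reduces to (A^T A)^{-1} A^T.
assert np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T)
print(A_pinv)
```

The same pseudoinverse unifies the least-squares and minimum-norm solutions discussed in the two posts below.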

Solving Systems of Linear Equations (2)

In the previous blog post, we discussed one scenario of linear equations where the number of unknowns is less than the number of equations, introducing the least squares method. In this post, we will cover another scenario where the number of equations is less than the number of unknowns. In this case, the system has infinitely many solutions, but there is only one solution closest to the origin, known as the **minimum norm solution** of the linear equations.
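A tiny NumPy example of the minimum-norm solution, using the closed form for a full-row-rank A (the 1x2 system below is just for illustration):

```python
import numpy as np

# Underdetermined system: 1 equation, 2 unknowns -> infinitely many solutions.
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

# Minimum-norm solution x = A^T (A A^T)^{-1} b, valid when A has full row rank.
x = A.T @ np.linalg.inv(A @ A.T) @ b
print(x)                                        # [1. 1.]

assert np.allclose(A @ x, b)                    # it does solve the system
assert np.allclose(x, np.linalg.pinv(A) @ b)    # pinv gives the same answer
```

Among all solutions x1 + x2 = 2, the point (1, 1) is the one closest to the origin, as expected.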

207. Course Schedule

This problem uses DFS and BFS to determine whether the course graph can be topologically sorted, i.e., whether it contains no cycle.
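The BFS approach is Kahn's algorithm: repeatedly remove nodes with in-degree 0; the graph is acyclic exactly when every node gets removed. A sketch of that solution:

```python
from collections import deque

def can_finish(num_courses, prerequisites):
    # Kahn's algorithm (BFS topological sort). Each pair [course, pre]
    # means `pre` must be taken before `course`.
    adj = [[] for _ in range(num_courses)]
    indeg = [0] * num_courses
    for course, pre in prerequisites:
        adj[pre].append(course)
        indeg[course] += 1

    q = deque(i for i in range(num_courses) if indeg[i] == 0)
    seen = 0
    while q:
        u = q.popleft()
        seen += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:       # all prerequisites of v are done
                q.append(v)
    return seen == num_courses      # all removed <=> no cycle

print(can_finish(2, [[1, 0]]))           # True
print(can_finish(2, [[1, 0], [0, 1]]))   # False: 0 and 1 form a cycle
```

The DFS variant instead detects a back edge by three-coloring nodes during traversal.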

Solving Linear Equations (1)

In this post, we discuss a specific case of linear equations: systems where the number of equations exceeds the number of unknowns. Such overdetermined systems generally have no exact solution, so we look for the least-squares solution instead.

Numerical Computation in Machine Learning (1)

Machine learning algorithms often require extensive numerical computations, solving for approximations through iteration rather than analytical solutions. These algorithms typically involve optimization and solving linear equations. Since computers represent various floating-point numbers with limited precision, certain methods are needed to ensure computational accuracy.
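A classic example of such a precaution is the numerically stable softmax: shifting by the maximum prevents `exp` from overflowing without changing the result, since the common factor cancels in the ratio.

```python
import math

def softmax(xs):
    # Subtracting the max before exponentiating keeps every exponent <= 0,
    # so exp() cannot overflow; mathematically the shift cancels out.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1000.0, 1000.0]))   # [0.5, 0.5]; naive exp(1000.0) overflows
```

The same shift trick underlies the log-sum-exp used to stabilize log-likelihood computations.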

Training a Simple Neural Network with TensorFlow

In this blog post, we use TensorFlow's Eager Execution to build models, eliminating the need to create Graphs and Sessions as before, making neural network training more convenient and faster. We will train a neural network using the Iris dataset as an example, with code from Google's tutorial.

Deep Learning on GeekCloud

Recently, I've been working on an image-related deep learning task assigned by my teacher. After debugging the code, I realized my laptop's memory (8GB) wasn't sufficient. Later, I discovered a very useful deep learning cloud service platform.

LiDAR + Camera Data Fusion in KITTI

The KITTI dataset offers a variety of data; here, we select the raw_data for integration.

Solving Optimization Problems with Inequality Constraints

Similar to solving optimization problems with only equality constraints discussed earlier, optimization problems with inequality constraints can also be solved using the Lagrange multiplier method.
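For a problem min f(x) subject to inequality constraints g_i(x) ≤ 0, the generalized Lagrangian and the resulting KKT conditions (standard notation; equality constraints omitted for brevity) are:

```latex
L(x, \mu) = f(x) + \sum_{i} \mu_i\, g_i(x),
\qquad
\nabla_x L = 0, \quad
g_i(x) \le 0, \quad
\mu_i \ge 0, \quad
\mu_i\, g_i(x) = 0
```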

Constructors in C++

Each class defines how its objects are initialized through one or more special member functions called **constructors**. The constructor's task is to initialize the data members of the class object, and it is executed whenever a class object is created.

Associative Containers in C++

Associative containers support efficient key-based lookup and access. The two primary associative containers are `map` and `set`. Elements in a `map` are key-value pairs: the key serves as an index, and the value is the data associated with that index. Elements in a `set` contain only a key, and `set` supports efficient queries for whether a given key is present. Note that the standard `set` and `map` keep their elements in sorted order and are typically implemented as balanced binary search trees (red-black trees), not hash tables; the hash-based counterparts are `unordered_set` and `unordered_map`.

Derivation of Neural Network Backpropagation

The backpropagation algorithm is at the core of neural network training.
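The heart of the derivation is the layer-wise recurrence for the error term δ^l, in standard notation (weights W^l, pre-activations z^l, activations a^l, output layer L, loss C, elementwise product ⊙):

```latex
\delta^{L} = \nabla_{a} C \odot \sigma'(z^{L}),
\qquad
\delta^{l} = \left( (W^{l+1})^{\top} \delta^{l+1} \right) \odot \sigma'(z^{l}),
\qquad
\frac{\partial C}{\partial W^{l}} = \delta^{l} \left( a^{l-1} \right)^{\top}
```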

Sequential Containers in C++

A container is a collection of objects of a specific type. Sequence containers provide the ability to control the order of storage and access of elements.

Introduction to Decision Tree and Random Forest Algorithms

Decision trees are a method for classification and regression. This post focuses on decision trees used for classification. A decision tree has a tree-like structure and represents the process of classifying data based on features. It can be seen as a collection of if-then rules or as a conditional probability distribution defined over feature and class spaces. The main advantages are good model interpretability and fast classification speed. During training, a decision tree model is built using training data by minimizing a loss function. For prediction, new data is classified using the decision tree. Learning a decision tree typically involves three steps: feature selection, tree generation, and tree pruning. The concepts of decision trees mainly originate from Quinlan's ID3 algorithm (1986) and C4.5 algorithm (1993), as well as the CART algorithm proposed by Breiman et al. in 1984.
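The feature-selection step of ID3 picks the split with the highest information gain, which is simple to compute. A small sketch (the toy labels and split are made up for illustration):

```python
import math

def entropy(labels):
    # Shannon entropy of a label list, in bits.
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(labels, groups):
    # ID3 criterion: parent entropy minus the size-weighted entropy of
    # the child nodes produced by splitting on a feature.
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

labels = [1, 1, 0, 0]
print(information_gain(labels, [[1, 1], [0, 0]]))   # 1.0 : a perfect split
print(information_gain(labels, [[1, 0], [1, 0]]))   # 0.0 : a useless split
```

C4.5 refines this criterion to the gain ratio, and CART uses Gini impurity instead of entropy.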

I/O Classes in C++

C++ does not handle input and output directly; instead, it uses a set of types defined in the standard library for IO operations. These types support reading from and writing to devices like files and console windows. Some types also allow memory IO, such as reading from and writing to strings.

Solving Optimization Problems with Equality Constraints

This article discusses optimization problems of this form: minimizing an objective function subject to equality constraints.
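In standard notation, the problem and its Lagrangian are the following; stationary points are found by setting both gradients to zero:

```latex
\min_{x}\ f(x) \quad \text{s.t.} \quad h_i(x) = 0,
\qquad
L(x, \lambda) = f(x) + \sum_{i} \lambda_i\, h_i(x),
\qquad
\nabla_x L = 0, \quad \nabla_\lambda L = 0
```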

Dual Problems in Linear Programming

Every linear programming problem has a corresponding dual problem, which is also a linear programming problem. The dual of the dual problem is the original problem. The optimal solution of the original problem can be obtained from the dual problem. Sometimes, using dual theory to solve linear programming problems is simpler and provides a deeper understanding of the problem's nature. Inspired by dual theory, the performance of the simplex method has been improved, and some non-simplex methods for solving linear programming problems have emerged, which will not be detailed in this article.
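In the symmetric form, the primal-dual pair looks as follows; note how the constraint matrix transposes and the roles of the objective vector c and right-hand side b swap:

```latex
\text{Primal:} \quad \min_{x}\ c^{\top} x
\quad \text{s.t.} \quad A x \ge b,\ \ x \ge 0
\qquad\qquad
\text{Dual:} \quad \max_{y}\ b^{\top} y
\quad \text{s.t.} \quad A^{\top} y \le c,\ \ y \ge 0
```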

Parameter Passing in C++ Functions

When calling a function in a C++ program, you pass arguments to its parameters. Aside from functions with an empty (void) parameter list, argument passing takes two forms: **pass by reference** and **pass by value**.

Simplex Algorithm for Solving Linear Programming Problems

In 1947, Dantzig introduced a method for solving linear programming problems, known today as the simplex method. This concise and efficient algorithm is hailed as one of the top ten algorithms of the 20th century with the greatest impact on scientific development and engineering practice.

Overview of Linear Programming

In optimization problems, there is a category known as linear programming problems, which are constrained optimization problems. Linear programming involves finding the extremum of a linear objective function under **linear constraints** (equalities or inequalities).
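Any such problem can be brought to the standard form below (inequalities become equalities via slack variables), which is the form the simplex method works on:

```latex
\min_{x}\ c^{\top} x
\quad \text{s.t.} \quad A x = b, \qquad x \ge 0
```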

The `const` Keyword in C++

When programming, we often need to define a variable whose value doesn't change, such as pi=3.14, e=2.72, or the elastic modulus of a material. In these cases, the const keyword is used.