Learning to Push by Grasping: Using multiple tasks for effective learning
Abstract
End-to-end learning frameworks have become popular in robotic control: they take states or images as input and directly output predicted torques and action parameters. However, they have been criticized for their heavy data requirements, which has sparked debate about their scalability: does the end-to-end approach require a separate model for every task? Intuitively, sharing across tasks should help, since all tasks require some common understanding of the environment. This paper attempts the next step for data-driven end-to-end learning frameworks: moving from task-specific models to a joint model across multiple robotic tasks. The results are surprising: given the same amount of data, multi-task learning outperforms single-task learning. For the grasp task, for example, a model trained on 2.5k grasp samples plus 2.5k push samples outperforms a model trained on 5k grasp samples.
Introduction
When a robot prepares to perform a manipulation (such as grasping), it needs to complete the following steps: (a) infer the properties of the object, (b) have some understanding of its own configuration, and (c) understand what constitutes a successful grasp and know how to achieve it. Many analytical frameworks define (c) mathematically and rely on separate perception modules for (a) and (b), usually with some degree of simplification. This reliance on simplified perception limits their performance. End-to-end models learn a joint model of (a)–(c) in a data-driven manner and have shown strong performance.
Despite the strong performance of end-to-end models, they have drawn many criticisms, chiefly that they require a separate model for each task and that training each model requires large amounts of data. Is there a way to share across tasks that reduces the data requirement? Intuitively there is, because all tasks require parts (a) and (b). For example, push-task data also helps train the perception module of the grasp task.
As mentioned above, given the same amount of data, a combination of data from different tasks outperforms data from a single task. The conjecture is that performing multiple tasks exposes object properties and patterns that the original task alone never encounters. This can be viewed as a form of regularization that allows the model to learn more generalizable features.
Related Work
Grasping is a relatively old problem. Earlier approaches were mostly based on analytical methods and 3D reasoning to predict grasp position and configuration; only recently have data-driven learning approaches begun to emerge. Pushing is another fundamental robotic task, enabling target objects to be moved without grasping. In addition, we use tactile feedback prediction (poking) as an auxiliary task for learning pushing and grasping.
As mentioned earlier, collecting large amounts of data can be useful for robotic tasks, but data collection is time-intensive. Multi-task learning (MTL) uses a sharing scheme to exploit the commonalities among tasks, for example by using previously learned tasks to initialize the parameters for the current one. This paper focuses on exploring an end-to-end MTL model for grasping and pushing.
For robotic tasks, prior frameworks were mostly task-specific. This is the first paper to report that sharing across multiple tasks can improve task performance. We find that an additional data point of the original task is worth less than a data point of an alternate task, probably because such sharing enhances robustness and regularizes feature learning, improving performance across multiple tasks. Through our exploration of MTL, we find that even with only a small amount of data for a task, an effective model can be obtained by leveraging large amounts of data from other tasks.
Overview
Our goal is to explore whether data collected for one specific task can be used for other tasks. Current research directions focus on training a specific model for each task; however, we believe most of these problems require learning how the world works, so data can be shared to speed up learning. More concretely, certain parameters in a CNN correspond to visual features, some to low-level structure and physics, some to the robot’s configuration and control information, and only the remaining parameters are specific to a particular task. If this is indeed the case, then sharing data across tasks is crucial.
This paper investigates whether multi-task learning for grasping and pushing can yield a better control model. We collected data for three tasks—grasping, pushing, and poking—and finally compared the performance of the Grasp ConvNet when using only grasping data versus using fused grasping, pushing, and poking data. The same comparison was conducted for the Push ConvNet.
Approach
Here we formalize the three tasks: planar grasping, planar pushing, and the use of poking data in the framework.
Planar Grasps
A grasp can be defined by three parameters $(x, y, \theta)$: the grasp point $(x, y)$ and the grasp orientation $\theta$. The training data contains 37k failure samples and 3k success samples; the test data contains 2.8k failure samples and 0.2k success samples.

A grasp problem can be defined as predicting a successful grasp configuration from an input image. However, this problem is ill-posed, because an object can have multiple possible grasp positions. Therefore, we sample image patches whose grasp point lies at the center of the patch, so that only the grasp angle needs to be predicted. The angle is discretized into 18 bins, so the problem can be converted into 18 binary classification problems, since the evaluation criterion is binary: grasp success or failure.
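The 18-way binary formulation can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes equal-width bins over [0°, 180°) (so 10° per bin), assumes the gripper is symmetric so angles wrap modulo 180°, and assumes that only the bin of the angle that was actually executed contributes to the loss. The names `angle_to_bin` and `grasp_loss` are hypothetical.

```python
import math

NUM_ANGLE_BINS = 18                      # from the text: 18 binary classifiers
BIN_WIDTH_DEG = 180.0 / NUM_ANGLE_BINS   # assumption: equal bins over [0, 180)


def angle_to_bin(theta_deg):
    """Map a grasp angle in degrees to a bin index in 0..17 (angles wrap mod 180)."""
    return int((theta_deg % 180.0) // BIN_WIDTH_DEG)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grasp_loss(logits, theta_deg, success):
    """Binary cross-entropy on the single angle bin that was executed.

    logits:  list of 18 raw scores, one per angle bin
    success: 1 if the executed grasp succeeded, 0 otherwise
    """
    k = angle_to_bin(theta_deg)
    p = sigmoid(logits[k])
    eps = 1e-12  # guards against log(0)
    return -(success * math.log(p + eps) + (1 - success) * math.log(1.0 - p + eps))
```

Masking the loss to one bin reflects the fact that each training sample only tells us the outcome for the angle that was tried; the other 17 outputs receive no supervision from that sample.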
Planar Push
Each push data point consists of three parts: the initial image, the final image, and the push action. The entire dataset contains 5k actions over 70 objects.

The task of the push learner is to predict the push action given the initial and final images. We use a siamese network with shared weights: one branch takes the initial image and the other takes the final image. The outputs of the two branches are concatenated, and an fc layer outputs the predicted action. The loss function is Euclidean distance.
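The weight sharing in the siamese design can be illustrated with a toy sketch. Here a single linear layer with ReLU stands in for the conv stack, and all sizes (`PIXELS`, `FEATS`) and function names are illustrative assumptions rather than the paper's architecture; only the 5-dimensional output and the Euclidean loss come from the text.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def encode(W_shared, image):
    """Shared feature extractor (a stand-in for the shared conv layers)."""
    return relu(matvec(W_shared, image))


def predict_push(W_shared, W_fc, img_begin, img_final):
    # Both branches use the SAME W_shared; that is what makes the network siamese.
    features = encode(W_shared, img_begin) + encode(W_shared, img_final)  # concatenation
    return matvec(W_fc, features)  # 5-dimensional push action


def euclidean_loss(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))


def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
PIXELS, FEATS, ACTION_DIM = 16, 8, 5   # toy sizes, chosen only for illustration
W_shared = init(FEATS, PIXELS, rng)
W_fc = init(ACTION_DIM, 2 * FEATS, rng)
img_a = [rng.random() for _ in range(PIXELS)]
img_b = [rng.random() for _ in range(PIXELS)]
action = predict_push(W_shared, W_fc, img_a, img_b)
```

Because `W_shared` appears in both branches, any gradient step on it is driven by both images at once, which is also what lets the same layers be reused by the grasp and poke networks.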
Planar Poke
This dataset contains an image of the target and the tactile feedback felt during poking. The learner's task is to predict this feedback from the input image.

We use two parameters to represent the feedback, namely the slope and intercept of the voltage increase. Here we adopt a structure similar to the grasp network: the first three layers share parameters with grasp, while the remaining layers are dedicated to the poking task. The loss function here is likewise Euclidean distance.
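One plausible way to obtain a (slope, intercept) target from a raw sensor trace is an ordinary least-squares line fit. The text does not say how the two parameters are extracted, so the fit below, the function name `fit_slope_intercept`, and the choice of a time axis are all assumptions made for illustration.

```python
def fit_slope_intercept(times, voltages):
    """Least-squares line fit: voltage ~ slope * time + intercept."""
    n = len(times)
    if n < 2 or n != len(voltages):
        raise ValueError("need at least two (time, voltage) pairs of equal length")
    mean_t = sum(times) / n
    mean_v = sum(voltages) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("time values must not all be identical")
    cov_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, voltages))
    slope = cov_tv / var_t
    intercept = mean_v - slope * mean_t
    return slope, intercept


# Usage: a trace that rises by 2 units per step, starting at 1.
slope, intercept = fit_slope_intercept([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])
```

The resulting pair would then serve as the 2-dimensional regression target for the poke network.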
Network Architecture
There are two kinds of feature transfer here: one is poking+pushing->grasping, and the other is poking+grasping->pushing. The network architecture is shown below.

Among the first three conv layers, the upper and lower branches share weights. The numbers on each conv indicate the size and number of convolutional kernels, and each conv layer is immediately followed by BN and ReLU. For the grasp task, the first two fc layers are each immediately followed by a dropout layer with p=0.5 and a ReLU; the same goes for poking. For the pushing task, since two images must be input, the two convolutional pipelines shown in the figure are used, but their parameters are shared. Their outputs are then concatenated and passed through the same fc layers, finally outputting the predicted 5-dimensional action vector.
Training
Here we mainly describe the training details. For the loss function, grasping uses cross-entropy, while both pushing and poking use Euclidean distance.
For the entire joint training process, we use a batch size of 128 for each of the three tasks. For the fully connected layers, the gradient computation is the same as in the ordinary case. For the shared conv layers, the gradient computation is slightly different, because the shared weights $w_{shared}$ receive gradients from all three task losses:
$$\nabla_{w_{shared}} L = \nabla_{w_{shared}} L_{grasp} + \nabla_{w_{shared}} L_{push} + \nabla_{w_{shared}} L_{poke}$$
where $L_{grasp}$, $L_{push}$, and $L_{poke}$ are the loss functions of the grasping, pushing, and poking tasks, respectively. During training we use the RMSProp algorithm, a gradient-descent-based algorithm, with a learning rate of 0.002, momentum=0.9, and decay=0.9. The learning rate decays by a factor of 0.1 every 5000 steps.
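The training schedule can be sketched in a few lines. The step-decay function follows the stated hyperparameters (0.002, multiplied by 0.1 every 5000 steps). The gradient combination assumes the shared-layer gradient is the elementwise sum of the three task gradients, and the RMSProp step is the textbook update with the momentum term omitted for brevity; all function names are hypothetical.

```python
BASE_LR = 0.002        # from the text
DECAY_FACTOR = 0.1     # from the text
DECAY_EVERY = 5000     # steps, from the text


def learning_rate(step):
    """Step decay: multiply the learning rate by 0.1 every 5000 steps."""
    return BASE_LR * DECAY_FACTOR ** (step // DECAY_EVERY)


def shared_gradient(grad_grasp, grad_push, grad_poke):
    """Assumed combination: elementwise sum of the per-task gradients
    with respect to the shared conv weights."""
    return [g + p + t for g, p, t in zip(grad_grasp, grad_push, grad_poke)]


def rmsprop_step(weights, grads, cache, lr, decay=0.9, eps=1e-8):
    """Textbook RMSProp update (momentum omitted in this sketch)."""
    new_cache = [decay * c + (1.0 - decay) * g * g for c, g in zip(cache, grads)]
    new_weights = [w - lr * g / (c ** 0.5 + eps)
                   for w, g, c in zip(weights, grads, new_cache)]
    return new_weights, new_cache


# Usage: one update of a single shared weight at step 0.
grads = shared_gradient([0.5], [1.0], [0.5])
weights, cache = rmsprop_step([1.0], grads, [0.0], learning_rate(0))
```

Task-specific fc layers would call `rmsprop_step` with only their own task's gradient, which is the "ordinary case" mentioned above.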
Results
For validation, the grasping task uses classification error, while the pushing task uses mean squared error.
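The two validation metrics can be written out as below. This is a small sketch: averaging the squared error over every action dimension is an assumption, since the text does not say whether the mean is taken per element or per sample, and both function names are hypothetical.

```python
def classification_error(predicted_labels, true_labels):
    """Fraction of grasp predictions (success=1 / failure=0) that are wrong."""
    if not true_labels or len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(1 for p, t in zip(predicted_labels, true_labels) if p != t)
    return wrong / len(true_labels)


def mean_squared_error(predicted_actions, true_actions):
    """Mean of squared differences over all action dimensions and samples."""
    total, count = 0.0, 0
    for pred, true in zip(predicted_actions, true_actions):
        for p, t in zip(pred, true):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no values to compare")
    return total / count
```

Lower is better for both metrics, so the two numbers can be compared across single-task and multi-task models, but not with each other.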
Evaluating Multi-Task vs. Task-Specific
This section mainly compares the performance of multi-task training versus task-specific training. The multi-task setting here ignores poking, and the training data is split evenly between pushing and grasping. The conclusion reached is that with small amounts of data, task-specific data performs better, while with large amounts of data, multi-task performs better. This may be because multi-task data provides a kind of diversity that has a regularizing effect, preventing overfitting.

Multitask: Data Ratio
Here we vary the ratio of grasping data to pushing data in the training set. The experimental results are shown below, and it can be seen that pushing transfers more easily than grasping.
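A fixed data budget split by ratio might be assembled as in the sketch below. The helper `mix_tasks`, its parameters, and the random sampling strategy are hypothetical; the only grounded detail is the idea of holding the total number of samples constant while changing the share contributed by each task (for example 2.5k grasp plus 2.5k push out of 5k).

```python
import random


def mix_tasks(grasp_data, push_data, total, grasp_fraction, seed=0):
    """Draw `total` samples, a given fraction from grasp and the rest from push."""
    if not 0.0 <= grasp_fraction <= 1.0:
        raise ValueError("grasp_fraction must be in [0, 1]")
    n_grasp = round(total * grasp_fraction)
    n_push = total - n_grasp
    if n_grasp > len(grasp_data) or n_push > len(push_data):
        raise ValueError("not enough data for the requested split")
    rng = random.Random(seed)
    mixed = [("grasp", s) for s in rng.sample(grasp_data, n_grasp)]
    mixed += [("push", s) for s in rng.sample(push_data, n_push)]
    rng.shuffle(mixed)  # interleave the two tasks
    return mixed


# Usage: an even split of a 5k budget, with integers standing in for samples.
mixed = mix_tasks(list(range(5000)), list(range(5000)), 5000, 0.5)
```

Sweeping `grasp_fraction` while keeping `total` fixed isolates the effect of the task mix from the effect of simply having more data.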

Multitask: 3-task performance
Here we experiment to verify whether adding poking helps, and we also experiment with different data ratios. The results show that poking is quite useful, as follows.

Discussion
The current research trend is toward task-specific learning, and discussions in many places hold that multi-task learning is useless. However, this gives rise to a problem: for end-to-end learning approaches, large amounts of training data are required. This paper shows that multi-task training is not only feasible but can also achieve better results under the same amount of data. We hypothesize that this is because the diversity of the data has a regularizing effect. This paper opens a new subfield of multi-task learning in robotics, especially with regard to sharing across different tasks.