AutoDL Image Competition Tuning Log

I’ve been spending this period on AutoDL’s image-classification track (Auto_Image), with all offline test data drawn from the Pedro dataset. The overall idea is to explore, within a limited time budget, how different backbones, normalization layers, activation functions, precision formats, and pretrained weights affect the final score. What follows is a systematic write-up in the order the experiments were run.

Backbone Choice: ResNet18 vs. ResNet34

The first question is the choice of backbone. Looking at how the online scores trend, the score rises quickly early on, and ResNet18 is a reasonable fit for this phase; later the rise flattens out and the increments get small, so in theory it makes sense to try a larger backbone or switch to a different pipeline.

But the offline test results show that ResNet34 has no clear advantage over ResNet18. In the figure below, ResNet18 is on the left and ResNet34 is on the right:

The gap is not significant, so for now we keep ResNet18 as the baseline.

Activation Function: Replacing ReLU with CELU

I tried replacing all the ReLUs in the network with CELU. On the offline data the two are roughly comparable, with CELU slightly worse; on the online data the gap widens noticeably, and the CELU variant actually scores considerably lower.

Offline CELU result:

Online comparison (the new baseline on the left, CELU on the right):

CELU brings no benefit in this setting, so it’s dropped.

Normalization-Layer Experiments

Resetting All BN Layers

I tried resetting all the Batch Normalization layers of the pretrained model, hoping the statistics would adapt to the new data faster. Offline, resetting BN does perform better at the first checkpoint and converges faster early on, but it ends with lower final accuracy. The reset-BN result is on the right:

The online result is also slightly below the no-reset ResNet18 baseline:
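
To make the idea concrete, here is a toy sketch of what a reset does to the running statistics of a single BN channel. It assumes that "reset" restores the running mean and variance to 0 and 1, and that they are then updated as an exponential moving average with momentum 0.1 (the PyTorch default). The class `RunningStats`, the helper `mean_gap_after`, and all numbers are hypothetical, and population variance is used for simplicity.

```python
from statistics import fmean, pvariance
from typing import Sequence


class RunningStats:
    """Running mean/variance for one BN channel (toy model, not a real layer)."""

    def __init__(self, mean: float = 0.0, var: float = 1.0, momentum: float = 0.1):
        self.mean, self.var, self.momentum = mean, var, momentum

    def reset(self) -> None:
        # Throw away the pretrained statistics and start again from (0, 1).
        self.mean, self.var = 0.0, 1.0

    def update(self, batch: Sequence[float]) -> None:
        # Exponential moving average towards the current batch statistics.
        m = self.momentum
        self.mean = (1 - m) * self.mean + m * fmean(batch)
        self.var = (1 - m) * self.var + m * pvariance(batch)


def mean_gap_after(stats: RunningStats, batch: Sequence[float], steps: int) -> float:
    """Distance between the running mean and the batch mean after `steps` updates."""
    for _ in range(steps):
        stats.update(batch)
    return abs(stats.mean - fmean(batch))


if __name__ == "__main__":
    new_domain_batch = [0.2, 0.4, 0.6]       # hypothetical target-domain activations
    kept = RunningStats(mean=3.0, var=2.0)   # hypothetical pretrained statistics
    fresh = RunningStats(mean=3.0, var=2.0)
    fresh.reset()
    print("kept :", mean_gap_after(kept, new_domain_batch, steps=10))
    print("reset:", mean_gap_after(fresh, new_domain_batch, steps=10))
```

In this toy the gap shrinks by a factor of 0.9 per update from either starting point, so a reset only gives a head start when (0, 1) happens to be closer to the new data than the pretrained statistics are. It says nothing about the learned features themselves, which is where the final accuracy is decided.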

ResNet34 + Resetting All BN

This run combines the two, hoping to get both a stronger backbone and faster early convergence. The offline first checkpoint was much slower, which I suspect was caused by unstable system IO, so I don’t treat it as a reference. Online, the early stage was indeed much faster, but the later accuracy still falls short; this looks like overfitting.

Online test result:

Precise Norm (Group Norm) Replacing Batch Norm

Another direction is to fully replace BN with Group Norm (Precise Norm):

The final result is poor, so it’s not worth considering.
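
For readers unfamiliar with it, Group Norm normalizes each sample on its own, over groups of channels, instead of over the batch. The following is a minimal sketch in plain Python, assuming the standard Group Norm definition; the function name, the group count, and the sample values are illustrative, and the learnable scale and shift are omitted.

```python
import math
from statistics import fmean, pvariance
from typing import List


def group_norm(sample: List[List[float]], num_groups: int, eps: float = 1e-5) -> List[List[float]]:
    """Normalize one sample, given as a list of channels (each a list of values).

    Channels are split into `num_groups` contiguous groups; every group is
    normalized with the mean and variance of its own values only.
    """
    channels = len(sample)
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = channels // num_groups
    out: List[List[float]] = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        values = [v for channel in group for v in channel]
        mu = fmean(values)
        sigma = math.sqrt(pvariance(values, mu) + eps)
        out.extend([(v - mu) / sigma for v in channel] for channel in group)
    return out


if __name__ == "__main__":
    # Four channels with two spatial positions each, split into two groups.
    sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
    for channel in group_norm(sample, num_groups=2):
        print([round(v, 3) for v in channel])
```

Because no batch statistics are involved, the output for a sample does not depend on the batch size or on the other samples in the batch. It also means the BN statistics stored in the pretrained weights have no counterpart in the replaced layers.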

Mixed-Precision Training (FP16)

I tested FP16 locally. The result was unexpected: the version without FP16 (left) is actually faster than the one with FP16 enabled (right), and enabling FP16 also costs a fair amount of accuracy.

FP16 isn’t worth it under the current configuration, so I’m not introducing it for now.
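
One well-known way FP16 can cost accuracy is gradient underflow: values below the smallest FP16 subnormal (about 6e-8) round to zero. Whether that is what happened in these runs is not established by the log. The sketch below only illustrates the effect and the usual loss-scaling workaround, using the half-precision format from Python's `struct` module; the gradient value and the scale of 1024 are illustrative assumptions.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scaled_fp16(grad: float, loss_scale: float) -> float:
    """Scale before the FP16 cast, then unscale again in higher precision."""
    return to_fp16(grad * loss_scale) / loss_scale


if __name__ == "__main__":
    tiny_grad = 1e-8  # illustrative gradient below FP16's smallest subnormal
    print("plain FP16 :", to_fp16(tiny_grad))              # underflows to 0.0
    print("with scale :", scaled_fp16(tiny_grad, 1024.0))  # stays close to 1e-8
```

If FP16 is revisited later, checking whether loss scaling is active would be a cheap first step before ruling it out.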

Miscellaneous Other Experiments

Scan loop unrolling (unroll): I tested the time cost of unrolling the loop in scan (unroll=2). It did not help: the original test scan takes about 43s, and unrolling changes this very little.

Increasing batch size: Moderately increasing the batch size adds some time overhead but does little for stability. The root cause of the online instability is still overfitting, so tuning the batch size treats the symptom rather than the cause.

Learning-rate adjustment: I’m still trying the idea of using different pipelines for the early and later stages, and haven’t reached a firm conclusion yet.

Pretrained Models: ReID and Pedestrian-Attribute Datasets

I tried replacing ImageNet pretraining with person re-identification (ReID) pretrained weights, testing versions pretrained on the MSMT dataset and the Market dataset respectively.

Offline result for MSMT-pretrained:

I also tested a PA-100K pretrained model (pedestrian-attribute recognition), which worked poorly both online and offline; I tried pretraining from several datasets and none were satisfactory.

Applicability of ImageNet Pretraining Across Scenarios

I tested the effect of ImageNet pretraining on three categories of scenarios: pedestrian, medical, and satellite imagery.

Satellite imagery:

Medical imagery:

The differences across domains are pronounced; ImageNet pretraining cannot maintain a consistent advantage across all scenarios.

Predicting the CV Final Datasets

Based on what has shown up in the current feedback phase, here is a prediction of the datasets that might appear in the final phase:

Loukoum (handwriting): appeared in both the cv and cv2 finals; AutoCLINT does well on it, so no worries.
Tim (general objects): ImageNet pretraining is already sufficient.
Apollon (pedestrian): already appeared in the feedback phase.
Ideal (aerial/satellite): already appeared in the feedback phase.
Ray (medical): only appeared in the cv final, worth keeping an eye on.

Online-Data Reshuffle Experiment

Testing whether reshuffling the online data makes a difference:

Finally, I tried using self-attention to handle the data-distribution problem, but it didn’t work.

Hyperparameter Tuning
Image Classification
Experiment Log

2019 · 12 · 31