Differential Learning Rate

Gradual Freezing and Unfreezing


Using a pre-trained model (from the Gluon model zoo) and fine-tuning it on the user's dataset is a common way to get results quickly without training an entire model from scratch. However, the current API only supports training the last fully connected layer. To unfreeze other layers, the developer needs to imperatively configure each individual layer.

Open Issue:

for param in net.features[1].collect_params().values():

Source: https://discuss.mxnet.io/t/finetuning-in-mxnet-for-convnet-blocks/839/2

Additionally, there is no easy way to set different learning rates for different layers. A multiplier needs to be applied to each layer.


Source: https://github.com/apache/incubator-mxnet/issues/8623

The above snippet would be used to adjust one layer's learning rate, and it would need to be applied to each layer in turn. There is no way to set a custom schedule for each layer.
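The per-layer bookkeeping described above can be sketched as follows. This is a minimal sketch, assuming each parameter exposes a writable lr_mult attribute, as Gluon Parameter objects do. The helper name apply_lr_mults, the prefix-matching rule, and the parameter names are hypothetical, and plain stand-in objects are used so the example runs without MXNet.

```python
from types import SimpleNamespace


def apply_lr_mults(params, mults, default=1.0):
    """Set lr_mult on every parameter whose name starts with a known layer prefix.

    params: mapping of parameter name -> object with a writable `lr_mult`
            attribute (hypothetical stand-ins here).
    mults:  mapping of layer-name prefix -> learning rate multiplier.
    """
    for name, param in params.items():
        # The longest matching prefix wins, so 'conv1_' can override 'conv'.
        matches = [prefix for prefix in mults if name.startswith(prefix)]
        param.lr_mult = mults[max(matches, key=len)] if matches else default
    return params


# Stand-in parameters with hypothetical names; with MXNet these would come
# from net.collect_params().
params = {
    "conv0_weight": SimpleNamespace(lr_mult=1.0),
    "conv1_weight": SimpleNamespace(lr_mult=1.0),
    "dense0_weight": SimpleNamespace(lr_mult=1.0),
}
apply_lr_mults(params, {"conv": 0.1, "conv1_": 0.5})
print({name: p.lr_mult for name, p in params.items()})
```

Even with such a helper, the multiplier only scales one shared schedule, which is why a per-layer schedule still cannot be expressed this way.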

List of resources for fine-tuning:

Goals/Use cases

The goal of this project is to provide an API to easily configure the number of layers to be tuned and a way to apply different learning rates and schedules to different layers.

Proposed Approach

Make it easier to unfreeze layers

Currently we fetch pre-trained models with the model_zoo API.

model_zoo.vision.squeezenet1_1(prefix='deep_dog_', classes=2)

The only parameter the user can pass is classes, which sets the number of output classes of the trainable fully connected layer. We could allow an extra argument to be passed that would automatically unfreeze the last n layers.

model_zoo.vision.squeezenet1_1(prefix='deep_dog_', classes=2, unfreeze=5)
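One way the proposed unfreeze argument could behave is sketched below. It assumes the network is an ordered sequence of layers whose parameters carry a grad_req attribute, where 'null' disables gradient updates and 'write' enables them, as in Gluon. The function name unfreeze_last_n and the stand-in layer objects are hypothetical and not part of any existing API.

```python
from types import SimpleNamespace


def unfreeze_last_n(layers, n):
    """Freeze every layer except the last n.

    layers: ordered list of layers, each a list of parameter objects with a
            writable `grad_req` attribute (hypothetical stand-ins here).
    n:      number of trailing layers to leave trainable.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    first_trainable = max(len(layers) - n, 0)
    for index, layer in enumerate(layers):
        grad_req = "write" if index >= first_trainable else "null"
        for param in layer:
            param.grad_req = grad_req
    return layers


# Four stand-in layers with two parameters each (for example weight and bias).
layers = [[SimpleNamespace(grad_req="write") for _ in range(2)] for _ in range(4)]
unfreeze_last_n(layers, 1)
print([layer[0].grad_req for layer in layers])
```

With a real Gluon network, each layer's parameters could be obtained from block.collect_params().values() instead of the stand-in lists.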

Make it easier to apply different learning rate / learning rate schedule to different layers

As mentioned above, we currently need to manually extract each layer and apply a different learning rate multiplier.

Currently the trainer is initialized by:

gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': learning_rate, 'lr_scheduler' : lr_scheduler})

We could optionally allow a different dictionary to be passed into the optimizer that maps each layer name to its own learning rate and schedule:

gluon.Trainer(net.collect_params(), 'sgd', {
    'layer_1_name': {'learning_rate': 0.01, 'lr_scheduler': lr_scheduler},
    'layer_2_name': {'learning_rate': 0.001, 'lr_scheduler': lr_scheduler2},
})
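To make the proposed mapping concrete, the sketch below resolves a per-layer options dictionary into the learning rate each parameter would use at a given update. It is an illustration under assumptions, not the Trainer's behaviour: the functions resolve_lr and step_decay, the prefix-matching rule, and the layer names are hypothetical, and each scheduler is modelled as a plain callable that takes the update count and returns a learning rate.

```python
def step_decay(base_lr, step, factor):
    """Return a schedule that multiplies base_lr by factor every `step` updates."""
    def schedule(num_update):
        return base_lr * factor ** (num_update // step)
    return schedule


def resolve_lr(param_name, layer_options, defaults, num_update):
    """Pick the learning rate for one parameter at a given update count.

    layer_options: mapping of layer-name prefix -> options dict, mirroring
                   the proposed Trainer argument.
    defaults:      options used when no layer prefix matches.
    """
    options = defaults
    for prefix, candidate in layer_options.items():
        if param_name.startswith(prefix):
            options = candidate
            break
    scheduler = options.get("lr_scheduler")
    if scheduler is not None:
        # A schedule, when present, takes precedence over the fixed rate.
        return scheduler(num_update)
    return options["learning_rate"]


layer_options = {
    "features_": {"learning_rate": 0.001},
    "output_": {"learning_rate": 0.01, "lr_scheduler": step_decay(0.01, 100, 0.5)},
}
defaults = {"learning_rate": 0.1}

for name in ("features_conv0_weight", "output_dense0_weight", "other_bias"):
    print(name, resolve_lr(name, layer_options, defaults, num_update=250))
```

Falling back to a default options dictionary keeps the existing single-rate usage working when no layer-specific entry matches.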

Current Implementations

  • PyTorch

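PyTorch has the most intuitive way of applying different learning rates to different layers:

optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)

Here the parameters of model.base use the default learning rate of 1e-2, while those of model.classifier use 1e-3. The momentum of 0.9 applies to both groups.

  • TensorFlow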

TensorFlow has a method of applying different learning rates; however, it requires the gradients to be calculated an additional time for each optimizer added.

var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
train_op1 = GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
train_op2 = GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
train_op = tf.group(train_op1, train_op2)

Source: https://stackoverflow.com/questions/34945554/how-to-set-layer-wise-learning-rate-in-tensorflow

  • Keras

Keras does not have an out-of-the-box solution for applying different learning rates to different layers. Two workarounds have been proposed.

One involves updating Keras core:

updates = self.optimizer.get_updates(self.params, self.constraints, self.depths, train_loss)

Source: https://github.com/keras-team/keras/issues/414

The other involves creating a custom optimizer and passing a custom dictionary of learning rate multipliers to it:

# Setting the Learning rate multipliers
LR_mult_dict = {}

Source: https://ksaluja15.github.io/Learning-Rate-Multipliers-in-Keras/
