The optimizer is a crucial part of machine learning, yet across many machine learning and deep learning applications the one we see used most often is Adam. Why is that? How many optimizers does PyTorch actually provide, and when should one of the others be used instead? This article goes through them in detail.
The torch.optim package contains the following optimizers:
torch.optim.adam.Adam
torch.optim.adadelta.Adadelta
torch.optim.adagrad.Adagrad
torch.optim.sparse_adam.SparseAdam
torch.optim.adamax.Adamax
torch.optim.asgd.ASGD
torch.optim.sgd.SGD
torch.optim.rprop.Rprop
torch.optim.rmsprop.RMSprop
torch.optim.optimizer.Optimizer
torch.optim.lbfgs.LBFGS
torch.optim.lr_scheduler.ReduceLROnPlateau
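Whichever one you choose, every optimizer is driven the same way: clear the old gradients, run backward, then call step(). A minimal sketch of that loop (the model and data below are placeholders, not from the original text):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(16, 4), torch.randn(16, 1)
for _ in range(100):
    opt.zero_grad()                                # clear previous gradients
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                                # compute fresh gradients
    opt.step()                                     # apply the optimizer's update rule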
All of them derive from Optimizer, the base class of every optimizer. Let's look at that base class first:
class Optimizer(object):
    def __init__(self, params, defaults):
        self.defaults = defaults
        self.state = defaultdict(dict)
        self.param_groups = []

        param_groups = list(params)
        if not isinstance(param_groups[0], dict):
            param_groups = [{'params': param_groups}]
        for param_group in param_groups:
            self.add_param_group(param_group)
- params: the network's parameters, an iterable such as net.parameters()
- defaults: a dict that stores default values for hyperparameters such as the learning rate
The constructor's most important job is to wrap params into param groups and register them in self.param_groups via add_param_group.
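params can also be a list of dicts, each becoming its own param group with its own hyperparameters, while defaults fills in anything a group leaves out. A small sketch (the two-layer model is just an illustration):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 1))

# two param groups: the second layer overrides the default learning rate
opt = torch.optim.SGD([
    {'params': model[0].parameters()},
    {'params': model[1].parameters(), 'lr': 1e-3},
], lr=1e-2)

print([g['lr'] for g in opt.param_groups])   # [0.01, 0.001]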
zero_grad
def zero_grad(self):
r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
for group in self.param_groups:
for p in group['params']:
if p.grad is not None:
p.grad.detach_()
p.grad.zero_()
It iterates over param_groups and, for every parameter that has a gradient, detaches the gradient from the graph and zeroes it.
state_dict
def state_dict(self):
......
param_groups = [pack_group(g) for g in self.param_groups]
# Remap state to use ids as keys
packed_state = {(id(k) if isinstance(k, torch.Tensor) else k): v
for k, v in self.state.items()}
return {
'state': packed_state,
'param_groups': param_groups,
}
It packs the optimizer's current state together with param_groups and returns them as a dict.
def load_state_dict(self, state_dict):
state = defaultdict(dict)
for k, v in state_dict['state'].items():
if k in id_map:
param = id_map[k]
state[param] = cast(param, v)
else:
state[k] = v
# Update parameter groups, setting their 'params' value
param_groups = [
update_group(g, ng) for g, ng in zip(groups, saved_groups)]
self.__setstate__({'state': state, 'param_groups': param_groups})
It unpacks the saved dict and updates state and param_groups accordingly.
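These two methods are what make optimizer checkpointing possible. A minimal sketch (the file name and model are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# save model and optimizer state together
torch.save({'model': model.state_dict(),
            'optimizer': opt.state_dict()}, 'checkpoint.pt')

# ...later, restore both to resume training
ckpt = torch.load('checkpoint.pt')
model.load_state_dict(ckpt['model'])
opt.load_state_dict(ckpt['optimizer'])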
SGD
This is the most basic optimizer:
d_p = p.grad.data  # the gradient
...
p.data.add_(-group['lr'], d_p)  # the update is just the learning rate times the gradient
Subtract the learning rate times the gradient from the parameter; it really is that simple.
In pseudocode, the update rule is:
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad
SGD simply computes a stochastic gradient on the current mini-batch and uses it to update the parameters.
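As a toy numerical example (the numbers are made up), a single SGD step on one scalar weight looks like this:

lr = 0.1
weight = 2.0
grad = 0.5                     # gradient of the loss w.r.t. weight

weight = weight - lr * grad    # 2.0 - 0.1 * 0.5
print(weight)                  # 1.95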
Adam
The name Adam comes from adaptive moment estimation. In probability theory, if a random variable X follows some distribution, its first moment is E(X), i.e. the sample mean, and its second moment is E(X^2), the mean of the squared samples. Adam uses estimates of the first and second moments of each parameter's gradient to adapt that parameter's learning rate individually. Adam is still a gradient-descent method, but the step taken at each iteration stays within a bounded range, so a very large gradient does not translate into a very large step and the parameter values remain comparatively stable.
exp_avg.mul_(beta1).add_(1 - beta1, grad)
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
if amsgrad:
torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
denom = max_exp_avg_sq.sqrt().add_(group['eps'])
else:
denom = exp_avg_sq.sqrt().add_(group['eps'])
bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']
step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1  # bias-corrected step size
p.data.addcdiv_(-step_size, exp_avg, denom)  # parameter update
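Written out for a single scalar parameter, the same update looks like the sketch below; the hyperparameter values are the usual defaults, assumed here rather than taken from the excerpt:

import math

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
exp_avg, exp_avg_sq, step, param = 0.0, 0.0, 0, 1.0

def adam_step(grad):
    global exp_avg, exp_avg_sq, step, param
    step += 1
    # decay the running first- and second-moment estimates
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    # bias correction, as in the excerpt above
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    step_size = lr * math.sqrt(bias_correction2) / bias_correction1
    param -= step_size * exp_avg / (math.sqrt(exp_avg_sq) + eps)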
Adagrad
state['sum'].addcmul_(1, grad, grad)
std = state['sum'].sqrt().add_(1e-10)
p.data.addcdiv_(-clr, grad, std)
The effective learning rate here adapts on its own: addcmul_ accumulates the squared gradients, the square root is taken, and the 1e-10 smoothing term guards against division by zero. The code really doesn't lie.
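In other words, each parameter is scaled by the root of its own accumulated squared gradients. A scalar sketch of the idea (not the library code; lr is the usual default):

import math

lr = 0.01
sum_sq, param = 0.0, 1.0

def adagrad_step(grad):
    global sum_sq, param
    sum_sq += grad * grad                 # accumulate squared gradients
    std = math.sqrt(sum_sq) + 1e-10       # smoothing term avoids division by zero
    param -= lr * grad / std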
Adadelta
Adagrad's ever-growing sum of squared gradients also drives the effective learning rate down sharply. Adadelta instead restricts the accumulated history to an exponentially decaying window, so the scale adapts on its own during training. Can you read that out of the code below?
square_avg.mul_(rho).addcmul_(1 - rho, grad, grad)
std = square_avg.add(eps).sqrt_()
delta = acc_delta.add(eps).sqrt_().div_(std).mul_(grad)
p.data.add_(-group['lr'], delta)
acc_delta.mul_(rho).addcmul_(1 - rho, delta, delta)
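The two running averages, square_avg over squared gradients and acc_delta over squared updates, replace Adagrad's unbounded sum. A scalar sketch with the usual defaults (rho=0.9, eps=1e-6, assumed):

import math

lr, rho, eps = 1.0, 0.9, 1e-6
square_avg, acc_delta, param = 0.0, 0.0, 1.0

def adadelta_step(grad):
    global square_avg, acc_delta, param
    square_avg = rho * square_avg + (1 - rho) * grad * grad
    std = math.sqrt(square_avg + eps)
    delta = math.sqrt(acc_delta + eps) / std * grad
    param -= lr * delta
    acc_delta = rho * acc_delta + (1 - rho) * delta * delta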
SparseAdam
This implements a lazy version of Adam suited to sparse tensors. In this variant the moment estimates are updated only at the positions that appear in the gradient, and only those portions of the gradient are applied to the parameters.
exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
beta1, beta2 = group['betas']
# Decay the first and second moment running average coefficient
# old <- b * old + (1 - b) * new
# <==> old += (1 - b) * (new - old)
old_exp_avg_values = exp_avg._sparse_mask(grad)._values()
exp_avg_update_values = grad_values.sub(old_exp_avg_values).mul_(1 - beta1)
exp_avg.add_(make_sparse(exp_avg_update_values))
old_exp_avg_sq_values = exp_avg_sq._sparse_mask(grad)._values()
exp_avg_sq_update_values = grad_values.pow(2).sub_(old_exp_avg_sq_values).mul_(1 - beta2)
exp_avg_sq.add_(make_sparse(exp_avg_sq_update_values))
# Dense addition again is intended, avoiding another _sparse_mask
numer = exp_avg_update_values.add_(old_exp_avg_values)
exp_avg_sq_update_values.add_(old_exp_avg_sq_values)
denom = exp_avg_sq_update_values.sqrt_().add_(group['eps'])
del exp_avg_update_values, exp_avg_sq_update_values
bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']
step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
p.data.add_(make_sparse(-step_size * numer.div_(denom)))
The formula looks complicated, but it is still the Adam update, just carried out only on the sparse entries present in the gradient before the result is scattered back into the parameters.
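SparseAdam is normally paired with a module that produces sparse gradients, such as an embedding layer built with sparse=True. A minimal sketch (sizes are arbitrary):

import torch
import torch.nn as nn

emb = nn.Embedding(10000, 64, sparse=True)          # sparse=True yields sparse gradients
opt = torch.optim.SparseAdam(emb.parameters(), lr=1e-3)

ids = torch.randint(0, 10000, (32,))
opt.zero_grad()
loss = emb(ids).sum()
loss.backward()                                      # emb.weight.grad is a sparse tensor
opt.step()                                           # only the rows that were touched get updated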
Adamax
torch.max(norm_buf, 0, keepdim=False, out=(exp_inf, exp_inf.new().long()))
bias_correction = 1 - beta1 ** state['step']
clr = group['lr'] / bias_correction
p.data.addcdiv_(-clr, exp_avg, exp_inf)
Once you see torch.max, the name Adamax makes sense: the second-moment average is replaced by an exponentially weighted infinity norm, exp_inf, which puts a ceiling on the denominator and therefore on each per-parameter step.
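A scalar sketch of the same idea, with the usual defaults assumed:

lr, beta1, beta2, eps = 2e-3, 0.9, 0.999, 1e-8
exp_avg, exp_inf, step, param = 0.0, 0.0, 0, 1.0

def adamax_step(grad):
    global exp_avg, exp_inf, step, param
    step += 1
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    exp_inf = max(beta2 * exp_inf, abs(grad) + eps)   # infinity norm instead of a squared average
    clr = lr / (1 - beta1 ** step)                    # bias-corrected learning rate
    param -= clr * exp_avg / exp_inf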
ASGD
state['step'] += 1
if group['weight_decay'] != 0:
grad = grad.add(group['weight_decay'], p.data)
# decay term
p.data.mul_(1 - group['lambd'] * state['eta'])
# update parameter
p.data.add_(-state['eta'], grad)
# averaging
if state['mu'] != 1:
state['ax'].add_(p.data.sub(state['ax']).mul(state['mu']))
else:
state['ax'].copy_(p.data)
# update eta and mu
state['eta'] = (group['lr'] /
math.pow((1 + group['lambd'] * group['lr'] * state['step']), group['alpha']))
state['mu'] = 1 / max(1, state['step'] - group['t0'])
Look closely and the "averaged" in Averaged SGD shows up in the ax buffer: after the ordinary SGD-style update, ax keeps a running average of the parameter values, while eta (the step size) and mu (the averaging coefficient) decay as the step count grows.
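A scalar sketch of that averaging, following the excerpt's variable names with the usual defaults assumed:

lr, lambd, alpha, t0 = 1e-2, 1e-4, 0.75, 1e6
eta, mu, step = lr, 1.0, 0
param, ax = 1.0, 1.0

def asgd_step(grad):
    global eta, mu, step, param, ax
    step += 1
    param *= 1 - lambd * eta              # decay term
    param -= eta * grad                   # ordinary SGD-style update
    if mu != 1:
        ax += (param - ax) * mu           # running average of the iterates
    else:
        ax = param
    eta = lr / (1 + lambd * lr * step) ** alpha
    mu = 1 / max(1, step - t0)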
Rprop
# update stepsizes with step size updates
step_size.mul_(sign).clamp_(step_size_min, step_size_max)
# for dir<0, dfdx=0
# for dir>=0 dfdx=dfdx
grad = grad.clone()
grad[sign.eq(etaminus)] = 0
# update parameters
p.data.addcmul_(-1, grad.sign(), step_size)
Each parameter keeps its own step size, clamped to [step_size_min, step_size_max], which grows or shrinks depending on whether the sign of the gradient agrees with the previous step; only the sign of the gradient is used for the actual update.
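A scalar sketch of that sign rule, using the common default values for the growth/shrink factors and step-size bounds (assumed):

etaplus, etaminus = 1.2, 0.5
step_min, step_max = 1e-6, 50.0
step_size, prev_grad, param = 0.01, 0.0, 1.0

def rprop_step(grad):
    global step_size, prev_grad, param
    if grad * prev_grad > 0:                              # same sign: grow the step
        step_size = min(step_size * etaplus, step_max)
    elif grad * prev_grad < 0:                            # sign flipped: shrink and skip
        step_size = max(step_size * etaminus, step_min)
        grad = 0.0
    if grad > 0:
        param -= step_size                                # move against the gradient sign
    elif grad < 0:
        param += step_size
    prev_grad = grad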
RMSprop
square_avg = state['square_avg']
alpha = group['alpha']
state['step'] += 1
if group['weight_decay'] != 0:
grad = grad.add(group['weight_decay'], p.data)
square_avg.mul_(alpha).addcmul_(1 - alpha, grad, grad)
if group['centered']:
grad_avg = state['grad_avg']
grad_avg.mul_(alpha).add_(1 - alpha, grad)
avg = square_avg.addcmul(-1, grad_avg, grad_avg).sqrt().add_(group['eps'])
else:
avg = square_avg.sqrt().add_(group['eps'])
if group['momentum'] > 0:
buf = state['momentum_buffer']
buf.mul_(group['momentum']).addcdiv_(grad, avg)
p.data.add_(-group['lr'], buf)
else:
p.data.addcdiv_(-group['lr'], grad, avg)
RMSprop keeps an exponentially decaying average of squared gradients and divides each update by its square root, so parameters with large recent gradients take smaller steps; the centered and momentum variants in the code above build on that same average.
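The plain path of that code (no centering, no momentum), as a scalar sketch with the usual defaults assumed:

import math

lr, alpha, eps = 1e-2, 0.99, 1e-8
square_avg, param = 0.0, 1.0

def rmsprop_step(grad):
    global square_avg, param
    square_avg = alpha * square_avg + (1 - alpha) * grad * grad
    param -= lr * grad / (math.sqrt(square_avg) + eps)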
LBFGS
It approximates the (inverse) Hessian with a short history of vectors instead of storing the full matrix and iterates with that approximation. The code is fairly opaque; if you are interested, see:
https://www.cnblogs.com/ljy2013/p/5129294.html
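One practical note: unlike the other optimizers, LBFGS may re-evaluate the loss several times per step, so step() has to be given a closure. A minimal sketch:

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
x, y = torch.randn(16, 4), torch.randn(16, 1)
opt = torch.optim.LBFGS(model.parameters(), lr=0.1)

def closure():
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

opt.step(closure)    # LBFGS calls closure() internally, possibly multiple times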