Summary of Optimization Techniques for Training BERT with the Hugging Face Transformers Library

Experiment Overview

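Why? Why use the Hugging Face software stack?

  • As AI infrastructure software and frameworks have evolved, the training process has become very complex. AI training often involves combinations of many different configurations, such as mixed-precision training, low-bit optimizers, and quantized gradients, so training directly in PyTorch can be tedious and complicated.
  • For this reason, Hugging Face, Microsoft, and others have released higher-level, more user-friendly AI training frameworks. These frameworks closely follow the academic frontier and keep integrating the latest results into their libraries, which strengthens their competitiveness and influence.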

Main content: use the Hugging Face Transformers library to configure different AI training parameters/options, and understand the meaning and principles behind these training parameters and optimizations.

(I am also learning by doing, so some of my understanding here may be incomplete.)


Software and Hardware Environment

  • Ubuntu 22.04
  • 1 * RTX 3080 10GB GPU
  • PyTorch 2.0
  • CUDA-12.0
  • Hugging Face libraries: transformers, datasets, accelerate

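Basic Training Process (Baseline)

The following is the most basic model training process, without any of the memory or speed optimizations discussed later.

  • Training data: randomly generated dummy data
  • To monitor GPU memory during training, the pynvml library API is used to print the amount of GPU memory in use
  • Model: BERT-base. Since training runs on a single GPU, a smaller model makes the effects easier to observe
  • Training API: mainly the Trainer API of the Hugging Face Transformers library, which already encapsulates the training loop
  • Results to observe: GPU memory usage and training throughput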
import numpy as np
from datasets import Dataset
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
import torch
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()

seq_len, dataset_size = 512, 512
dummy_data = {
    'input_ids': np.random.randint(100, 30000, (dataset_size, seq_len)),
    # Note: randint's upper bound is exclusive, so every label generated here is 0.
    'labels': np.random.randint(0,1, (dataset_size))
}

ds = Dataset.from_dict(dummy_data)
ds.set_format('pt')

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f'GPU memory occupied: {info.used // 1024**2} MB')

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

print_gpu_utilization()

default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

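# Load BERT-base onto the GPU; `model` is required by the Trainer below.
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased').to('cuda')
print_gpu_utilization()

# Note: optim='adafactor' replaces the Trainer's default AdamW optimizer.
training_args = TrainingArguments(per_device_train_batch_size=4,
                                  optim='adafactor',
                                  **default_args)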
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)

Output:
{'train_runtime': 16.0498, 'train_samples_per_second': 31.901, 'train_steps_per_second': 7.975, 'train_loss': 0.013442776165902615, 'epoch': 1.0}
Time: 16.05
Samples/second: 31.90
GPU memory occupied: 5790 MB

Optimization 1: + Gradient Accumulation

Gradient accumulation trades time for space, where space means GPU memory: it allows training with a larger batch size under a limited GPU memory budget. In ordinary training, the gradients are computed and the weights are updated after every batch. With gradient accumulation, the gradient is still computed for every batch, but the gradients of several batches are summed and the weights are updated only once every few batches.

Comparison:

  • Without gradient accumulation
for idx, batch in enumerate(dataloader):
     # Forward
     loss = model(**batch).loss
     # Backward
     loss.backward()
     ...

     # Optimizer update: apply the gradients first, then clear them
     optimizer.step()
     optimizer.zero_grad()
     ...
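  • + Gradient accumulation:
    • You might wonder where the accumulation happens in the code. It comes from how PyTorch works: backward() adds the newly computed gradients to the existing .grad values, so if optimizer.zero_grad() is not called after a batch, the gradients of the current batch are accumulated onto those of the previous batches.
    • Parameter: gradient_accumulation_steps is the number of batches between optimizer updates. The effective training batch size is therefore training_batch_size = per_device_train_batch_size * gradient_accumulation_steps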
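for idx, batch in enumerate(dataloader):
     # Forward
     loss = model(**batch).loss
     # Scale the loss so the accumulated gradient is an average over the batches
     loss = loss / training_args.gradient_accumulation_steps
     # Backward: gradients are added to the existing .grad buffers
     loss.backward()
     ...
     if (idx + 1) % training_args.gradient_accumulation_steps == 0:
         # Optimizer update, once every gradient_accumulation_steps batches
         optimizer.step()
         optimizer.zero_grad()
     ...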
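
The claim that accumulated gradients match a single large-batch gradient can be checked on a toy problem. The following is a minimal sketch, not part of the original experiment: the tiny linear model, the random data, and all variable names are assumptions chosen only for illustration, and it runs on the CPU.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression model and data (hypothetical, for illustration only).
model = nn.Linear(8, 1)
inputs = torch.randn(4, 8)
targets = torch.randn(4, 1)
loss_fn = nn.MSELoss()  # averages the squared error over the batch

# Reference: one backward pass over the full batch of 4 samples.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Gradient accumulation: 4 micro-batches of 1 sample each.
accumulation_steps = 4
model.zero_grad()
for idx in range(accumulation_steps):
    micro_x = inputs[idx:idx + 1]
    micro_y = targets[idx:idx + 1]
    # Divide by the number of steps so the sum of micro-batch losses is the batch mean.
    loss = loss_fn(model(micro_x), micro_y) / accumulation_steps
    loss.backward()  # adds to .grad instead of overwriting it
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

Dividing each micro-batch loss by the number of accumulation steps is what keeps the two gradients on the same scale; without it the accumulated gradient would be 4 times larger here.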

Test code:

training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)

Keeping the effective training_batch_size unchanged, the output shows that GPU memory usage drops noticeably (5790 MB --> 4169 MB), while training throughput decreases slightly.
per_device_train_batch_size=1, gradient_accumulation_steps=4
{'train_runtime': 19.7445, 'train_samples_per_second': 25.931, 'train_steps_per_second': 6.483, 'train_loss': 0.01618509739637375, 'epoch': 1.0}
Time: 19.74
Samples/second: 25.93
GPU memory occupied: 4169 MB
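
The numbers in this log can be cross-checked against the batch-size formula. This is a small worked calculation, assuming a single GPU and using the dataset size (512) and train_runtime reported in this run; the variable names are only for illustration.

```python
import math

dataset_size = 512
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
train_runtime = 19.7445  # seconds, taken from the log of this run

# Effective batch size seen by each optimizer update (single GPU assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Forward/backward passes per epoch, and how many of them end in an optimizer update.
micro_batches = math.ceil(dataset_size / per_device_train_batch_size)
optimizer_steps = micro_batches // gradient_accumulation_steps

print(effective_batch_size)  # 4
print(optimizer_steps)       # 128
print(round(optimizer_steps / train_runtime, 3))  # compare with train_steps_per_second
print(round(dataset_size / train_runtime, 3))     # compare with train_samples_per_second
```

The effective batch size is the same 4 as in the baseline, so both runs perform 128 optimizer updates per epoch; only the memory needed per forward/backward pass differs.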


Optimization 2: + Gradient Checkpointing

Why? During the backward pass, computing the gradient of a layer's weights requires the activations that layer produced in the forward pass. The activations of every layer therefore have to stay in GPU memory until they are used, which clearly increases memory usage.

How gradient checkpointing works: only the activations of selected layers are kept (the layers chosen for saving are called checkpoint nodes). During backpropagation, the activations needed by the current layer are recomputed from the nearest saved checkpoint.

Advantages vs. disadvantages:

  • Advantage: only the activations of some layers are stored, so GPU memory usage drops
  • Disadvantage: recomputation adds extra compute, so training throughput decreases.
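
The recomputation idea can be tried directly with PyTorch's torch.utils.checkpoint. The following is a minimal sketch on the CPU, not the code path the Trainer uses internally: the small stack of linear layers stands in for Transformer blocks, and all names are hypothetical. It only checks that checkpointing changes memory behaviour, not the gradients.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# A small stack of layers standing in for Transformer blocks (hypothetical).
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(16, 16), nn.Tanh()) for _ in range(4)]
)
x = torch.randn(2, 16)

def forward(inputs, use_checkpointing):
    h = inputs
    for layer in layers:
        if use_checkpointing:
            # Intermediate activations inside `layer` are not stored;
            # they are recomputed from `h` during the backward pass.
            h = checkpoint(layer, h, use_reentrant=False)
        else:
            h = layer(h)
    return h.sum()

def weight_grads(use_checkpointing):
    layers.zero_grad()
    forward(x, use_checkpointing).backward()
    return [p.grad.clone() for p in layers.parameters()]

plain = weight_grads(use_checkpointing=False)
recomputed = weight_grads(use_checkpointing=True)
same = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(plain, recomputed))
print(same)
```

Each checkpointed layer runs its forward computation a second time during backward, which is where the throughput cost comes from.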

Code:

training_args = TrainingArguments(
    per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args
)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)

Output: GPU memory usage drops further (4169 MB --> 3706 MB), and throughput decreases: 25.93 --> 20.40
{'train_runtime': 25.1014, 'train_samples_per_second': 20.397, 'train_steps_per_second': 5.099, 'train_loss': 0.015386142767965794, 'epoch': 1.0}
Time: 25.10
Samples/second: 20.40
GPU memory occupied: 3706 MB


Optimization 3: + Mixed-Precision Training (low precision)

Core idea: use low-precision numeric formats to store the weights, activations, and gradients, and perform the computation in low precision.

Advantages vs. disadvantages:

  • Advantage: low precision reduces the memory footprint and the computational cost, improving training speed and throughput
  • Disadvantage: if used carelessly, it can cause numerical overflow and make training diverge
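
The numerical risk is easy to reproduce with NumPy's float16 type. The sketch below is an illustration only and is not taken from the original experiment: it shows a large value overflowing to infinity, a small gradient-like value underflowing to zero, and how multiplying by a scale factor before the cast (the idea behind loss scaling, which is an assumption about how FP16 training is usually stabilised) preserves the small value. The scale of 1024 is an arbitrary choice.

```python
import numpy as np

# Largest finite value representable in FP16.
fp16_max = float(np.finfo(np.float16).max)
print(fp16_max)  # 65504.0

# Overflow: a value above the FP16 range becomes infinity.
with np.errstate(over="ignore"):
    overflowed = np.float32(70000.0).astype(np.float16)
print(np.isinf(overflowed))

# Underflow: a very small gradient-like value becomes exactly zero in FP16.
tiny_grad = np.float32(1e-8)
print(tiny_grad.astype(np.float16) == 0)

# Scaling before the cast keeps the value representable;
# dividing by the same factor afterwards (in FP32) recovers it approximately.
loss_scale = np.float32(1024.0)
scaled = (tiny_grad * loss_scale).astype(np.float16)
recovered = scaled.astype(np.float32) / loss_scale
print(scaled > 0, recovered)
```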

AI training generally stores and computes with floating-point data types. The low-bit floating-point formats currently supported on NVIDIA GPUs (availability depends on the GPU architecture): TF32 --> FP16 --> BF16 --> FP8


Code: for example, fp16=True or bf16=True enables mixed precision with the corresponding data type

training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)

Output: throughput improves (20.40 --> 25.91), but GPU memory usage actually increases (3706 MB --> 3829 MB), because a master copy of the weights is still stored in FP32
{'train_runtime': 19.76, 'train_samples_per_second': 25.911, 'train_steps_per_second': 6.478, 'train_loss': 0.010953620076179504, 'epoch': 1.0}
Time: 19.76
Samples/second: 25.91
GPU memory occupied: 3829 MB
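
One way to see why the weights cost more memory under mixed precision is to count bytes per parameter. The sketch below is not from the original experiment. It assumes the usual accounting (an FP32 master copy plus an FP16 working copy of the weights, FP32 gradients, and 8 bytes of optimizer state per parameter for a standard Adam-style optimizer) and a rounded count of 110 million parameters for BERT-base; the function name is hypothetical. Activations are deliberately left out, and that is where mixed precision can save memory.

```python
def training_bytes_per_param(mixed_precision: bool, optimizer_state_bytes: int = 8) -> int:
    """Bytes per parameter for weights, gradients and optimizer state (activations excluded)."""
    # FP32 master weights, plus an extra FP16 copy when mixed precision is on.
    weights = 4 + (2 if mixed_precision else 0)
    # Gradients are assumed to be kept in FP32 in both cases.
    gradients = 4
    return weights + gradients + optimizer_state_bytes

num_params = 110_000_000  # rounded size of BERT-base (assumption)

for mixed in (False, True):
    per_param = training_bytes_per_param(mixed)
    total_mb = per_param * num_params / 1024**2
    print(f"mixed_precision={mixed}: {per_param} bytes/param, about {total_mb:.0f} MB")
```

Under these assumptions the extra FP16 copy adds 2 bytes per parameter, roughly 210 MB for a model of this size, which is the right order of magnitude for the small increase observed here.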


Optimization 4: Low-Precision Optimizer (8-bit Adam)

# 8bit Adam
import numpy as np
from datasets import Dataset
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
import torch
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, logging

# Extra imports needed for the 8-bit Adam optimizer
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names

# https://huggingface.co/docs/transformers/perf_train_gpu_one

logging.set_verbosity_error()

seq_len, dataset_size = 512, 512
dummy_data = {
    'input_ids': np.random.randint(100, 30000, (dataset_size, seq_len)),
    'labels': np.random.randint(0,1, (dataset_size))
}

ds = Dataset.from_dict(dummy_data)
ds.set_format('pt')

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f'GPU memory occupied: {info.used // 1024**2} MB')

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


print_gpu_utilization()

torch.ones((1, 1)).to("cuda")
print_gpu_utilization()

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased').to('cuda')
print_gpu_utilization()

default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}


# First, group the model's parameters into two groups: weight decay is applied
# to one group and not to the other. Usually, biases and layer norm parameters
# are not weight decayed. Then, in a second step, do some argument housekeeping
# so that the same hyperparameters as the AdamW optimizer are used.

decay_parameters = get_parameter_names(model, forbidden_layer_types=[nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if 'bias' not in name]


training_args = TrainingArguments(per_device_train_batch_size=1, 
                                  gradient_accumulation_steps=4, 
                                  gradient_checkpointing=True, 
                                  fp16=True, 
                                  # Not used for the update: the optimizer passed to Trainer below takes precedence.
                                  optim='adafactor',
                                  **default_args)

optimizer_grouped_parameters = [
    {
        'params': [p for n,p in model.named_parameters() if n in decay_parameters],
        'weight_decay': training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]

optimizer_kwargs = {
    "betas": (training_args.adam_beta1, training_args.adam_beta2),
    "eps": training_args.adam_epsilon,
    "lr": training_args.learning_rate,
}

# 8-bit Adam from bitsandbytes: the optimizer states are stored in 8 bits.
adam_bnb_optim = bnb.optim.Adam8bit(optimizer_grouped_parameters, **optimizer_kwargs)

trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None))
result = trainer.train()
print_summary(result)

Output:
{'train_runtime': 17.5487, 'train_samples_per_second': 29.176, 'train_steps_per_second': 7.294, 'train_loss': 0.015325695276260376, 'epoch': 1.0}
Time: 17.55
Samples/second: 29.18
GPU memory occupied: 3161 MB
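
The memory saving comes from storing the optimizer states in 8 bits instead of 32. The following NumPy sketch illustrates the general idea with a simple block-wise absmax quantization. It is a simplified stand-in and not the bitsandbytes implementation, which uses a more elaborate block-wise scheme; the function names, the block size of 256, and the random "state" tensor are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one optimizer state tensor, e.g. Adam's first moment (hypothetical data).
state_fp32 = rng.normal(size=4096).astype(np.float32)

def quantize_blockwise(values, block_size=256):
    """Store each block as int8 codes plus one FP32 scale per block."""
    blocks = values.reshape(-1, block_size)
    scales = (np.abs(blocks).max(axis=1, keepdims=True) / 127.0).astype(np.float32)
    codes = np.round(blocks / scales).astype(np.int8)
    return codes, scales

def dequantize_blockwise(codes, scales):
    """Recover an FP32 approximation of the original values."""
    return (codes.astype(np.float32) * scales).reshape(-1)

codes, scales = quantize_blockwise(state_fp32)
restored = dequantize_blockwise(codes, scales)

print(state_fp32.nbytes, codes.nbytes + scales.nbytes)   # bytes before and after
print(float(np.abs(state_fp32 - restored).max()))        # worst-case rounding error
```

Keeping one scale per block limits the damage a single outlier can do: it only reduces the resolution of its own block rather than of the whole tensor.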


