Basic usage of DeepSpeed

1. DeepSpeed basics

1.1 Installing DeepSpeed

Installing DeepSpeed is straightforward; just run the following command:

pip install deepspeed

Before this you also need a basic environment with Python, PyTorch, and so on, which we will not cover here.

1.2 Writing the JSON config file

Using DeepSpeed is also straightforward. First prepare a JSON file; here we create a config.json to hold the settings the training run needs:

{
    "train_batch_size": 4,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 1000
        }
    },
    "wall_clock_breakdown": false
}
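DeepSpeed parses this file as plain JSON (no comments, no trailing commas), so it can save a failed launch to sanity-check it first. A minimal sketch using only the standard library; the field checks mirror the config above:

```python
import json

# Write the config above to disk, then re-read it to confirm it is valid JSON.
config = {
    "train_batch_size": 4,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 0.001, "betas": [0.8, 0.999],
                   "eps": 1e-8, "weight_decay": 3e-7},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": 0, "warmup_max_lr": 0.001,
                   "warmup_num_steps": 1000},
    },
    "wall_clock_breakdown": False,
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)

with open("config.json") as f:
    cfg = json.load(f)

# The scheduler should warm up to the optimizer's learning rate.
assert cfg["scheduler"]["params"]["warmup_max_lr"] == cfg["optimizer"]["params"]["lr"]
print("config.json is valid JSON")
```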

1.3 Running with DeepSpeed

1. Configure the training arguments

import argparse
import deepspeed

def add_argument():
    parser = argparse.ArgumentParser(description='CIFAR')
    parser.add_argument('-b', '--batch_size', default=32, type=int, help='mini-batch size (default: 32)')
    parser.add_argument('-e', '--epochs', default=30, type=int, help='number of total epochs (default: 30)')
    parser.add_argument('--local_rank', type=int, default=-1, help='local rank passed from distributed launcher')
    parser.add_argument('--log-interval', type=int, default=2000, help='output logging information at a given interval')

    parser = deepspeed.add_config_arguments(parser)  # add DeepSpeed's own arguments
    args = parser.parse_args()
    return args

args = add_argument()

2. Define the network

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
parameters = filter(lambda p: p.requires_grad, net.parameters())
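The 16 * 5 * 5 fed into fc1 is determined by how a 32×32 CIFAR image shrinks through the conv/pool stack; the arithmetic can be checked without torch:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a conv or pool layer (square input)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                      # CIFAR-10 images are 32x32
size = conv_out(size, 5)       # conv1 (5x5 kernel): 32 -> 28
size = conv_out(size, 2, 2)    # 2x2 max-pool: 28 -> 14
size = conv_out(size, 5)       # conv2 (5x5 kernel): 14 -> 10
size = conv_out(size, 2, 2)    # 2x2 max-pool: 10 -> 5

channels = 16                  # conv2 output channels
print(channels * size * size)  # 400, i.e. the 16 * 5 * 5 input size of fc1
```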

3. Load the data

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=16, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)
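Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) applies (x - mean) / std per channel, so ToTensor's [0, 1] range becomes [-1, 1]; the arithmetic in plain Python:

```python
def normalize(x, mean=0.5, std=0.5):
    # Same per-channel formula torchvision's Normalize applies.
    return (x - mean) / std

print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
```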

4. Initialize DeepSpeed

model_engine, optimizer, trainloader, __ = deepspeed.initialize(
    args=args, model=net, model_parameters=parameters, training_data=trainset)

5. Train and test

criterion = nn.CrossEntropyLoss()
for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(trainloader):
        inputs, labels = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
        outputs = model_engine(inputs)
        loss = criterion(outputs, labels)
        model_engine.backward(loss)
        model_engine.step()

        # print statistics
        running_loss += loss.item()
        if i % args.log_interval == (args.log_interval - 1):
            print('[%d %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / args.log_interval))
            running_loss = 0.0


correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images.to(model_engine.local_rank))
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels.to(model_engine.local_rank)).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))

Once each step is written, launch the program with the following command:

deepspeed --include localhost:7 test.py --deepspeed_config config.json

You will then see the training run:

[2023-08-11 15:28:13,128] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-11 15:28:14,693] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-11 15:28:14,694] [INFO] [runner.py:555:main] cmd = /home/wangyh/miniconda3/envs/llava2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test.py --deepspeed_config config.json
[2023-08-11 15:28:15,862] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-11 15:28:17,418] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [7]}
[2023-08-11 15:28:17,418] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-08-11 15:28:17,418] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-08-11 15:28:17,418] [INFO] [launch.py:163:main] dist_world_size=1
[2023-08-11 15:28:17,418] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=7
[2023-08-11 15:28:18,729] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Files already downloaded and verified
Files already downloaded and verified
[2023-08-11 15:28:21,691] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown
[2023-08-11 15:28:21,691] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-11 15:28:21,691] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-11 15:28:21,691] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-11 15:28:22,465] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.0886220932006836 seconds
[2023-08-11 15:28:22,849] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
......

2. Advanced DeepSpeed tips

2.1 Specifying particular GPUs

We only cover the single-node multi-GPU case here. Run deepspeed --include localhost:4,5,6,7 train.py --deepspeed_config config.json. The --include argument selects which GPUs to use; localhost:4,5,6,7 trains on cards 4, 5, 6, and 7. Without it, DeepSpeed automatically uses every available GPU.
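The value passed to --include is a resource string of the form host:gpu,gpu,...; in DeepSpeed's syntax multiple hosts are separated by @. An illustrative parser (DeepSpeed's real one lives in its launcher) showing how the string maps to GPU indices:

```python
def parse_include(resource):
    """Parse a DeepSpeed-style resource string like 'localhost:4,5,6,7'
    (or 'host1:0,1@host2:0,1' for several hosts) into a dict.
    Illustrative sketch only -- not DeepSpeed's actual parser."""
    mapping = {}
    for part in resource.split("@"):
        host, _, gpus = part.partition(":")
        mapping[host] = [int(g) for g in gpus.split(",")] if gpus else []
    return mapping

print(parse_include("localhost:4,5,6,7"))  # {'localhost': [4, 5, 6, 7]}
```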

If your script takes additional arguments, simply append them after --deepspeed_config config.json.

2.2 Debugging a DeepSpeed program in VS Code

Change launch.json to the following; in the "program" line, replace llava2 with the name of your own environment:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "/home/wangyh/miniconda3/envs/llava2/bin/deepspeed",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--include", "localhost:7",
                "test.py",
                "--deepspeed_config", "/data/wangyh/mllms/deepspeed_test/config.json"
            ]
        }
    ]
}

2.3 Basic DeepSpeed launch command

deepspeed --master_port 29500 \
--num_gpus 2 \
train.py \
--deepspeed ds_config.json
  • master_port: port for the distributed backend; default 29500
  • num_gpus: number of GPUs to use; by default all available GPUs are used
  • deepspeed: the config file specifying DeepSpeed's parameters

2.4 Example configs for Stage 2 and Stage 3

A basic Stage 2 configuration:

{
    "bfloat16": {
        "enabled": "auto"
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1e5
}

A basic Stage 3 configuration:

{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

3. DeepSpeed in practice

3.1 Overview

DeepSpeed official tutorial
Huggingface official tutorial
DeepSpeed paper

  1. Optimizer state partitioning (ZeRO stage 1)
  2. Gradient partitioning (ZeRO stage 2)
  3. Parameter partitioning (ZeRO stage 3)
  4. Custom mixed precision training handling
  5. A range of fast CUDA-extension-based optimizers
  6. ZeRO-Offload to CPU and NVMe

We mostly use the first three. Stage 1 partitions the optimizer states; stage 2 additionally partitions the gradients (so it has no effect at inference time); stage 3 additionally partitions the model parameters (which lets a large model be split across multiple cards).
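The savings from each stage can be estimated with the ZeRO paper's accounting: under fp16 mixed precision with Adam, each parameter costs about 16 bytes of model state (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master copy plus Adam momentum and variance), and each successive stage shards one more of these across the N GPUs. A back-of-the-envelope sketch (model states only; real usage also includes activations and buffers):

```python
def zero_mem_per_gpu(params, n_gpus, stage):
    """Approximate model-state bytes per GPU under ZeRO (fp16 + Adam),
    using the 2/2/12-bytes-per-parameter split from the ZeRO paper."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        optim /= n_gpus       # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus       # stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus     # stage 3: also shard parameters
    return weights + grads + optim

gb = 1024 ** 3
for stage in (0, 1, 2, 3):
    mem = zero_mem_per_gpu(7e9, 8, stage) / gb   # e.g. a 7B model on 8 GPUs
    print(f"stage {stage}: {mem:.1f} GB per GPU")
```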

Here we only cover using DeepSpeed from Huggingface.

  • Huggingface integrates DeepSpeed into its Trainer class: you only need to supply a custom config file, with no other changes (recommended)
  • You can also write a custom Trainer of your own, which we do not cover here

Training: supports ZeRO stages 1, 2, and 3, as well as Infinity
Inference: supports ZeRO stage 3 and Infinity

Launching

# regular PyTorch DDP-style launch
python -m torch.distributed.run --nproc_per_node=2 your_program.py <normal cl args> --deepspeed ds_config.json

# DeepSpeed's own launcher
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json

Note that if you want to run on specific GPUs, say GPU 1, setting CUDA_VISIBLE_DEVICES has no effect; use the localhost syntax instead:

deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ...

Note that many values in a DeepSpeed config file overlap with the Trainer's own arguments. To avoid conflicts, set those values to "auto" so they are automatically replaced with the correct or most efficient value (if the two sides disagree, training may fail).
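To illustrate the kind of conflict meant here, a small sketch that flags config keys the Trainer also controls when they are hard-coded instead of "auto" (the key list is illustrative, not the Trainer's actual list):

```python
# Keys the HF Trainer also sets from TrainingArguments (illustrative subset).
HF_MANAGED = {"train_batch_size", "train_micro_batch_size_per_gpu",
              "gradient_accumulation_steps", "gradient_clipping"}

def conflicting_keys(ds_config):
    """Return top-level keys with hard-coded values that risk disagreeing
    with TrainingArguments; "auto" values are resolved by the Trainer."""
    return sorted(k for k in HF_MANAGED
                  if k in ds_config and ds_config[k] != "auto")

cfg = {"train_batch_size": 32, "gradient_clipping": "auto"}
print(conflicting_keys(cfg))  # ['train_batch_size']
```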

Stage two

{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    }
}
  • stage: which ZeRO stage to run: 0, 1, 2, or 3, i.e. disabled, optimizer-state partitioning, optimizer + gradient partitioning, optimizer + gradient + parameter partitioning; default 0
  • offload_optimizer: offload optimizer state to reduce GPU memory usage
  • allgather_partitions: all-gather the updated parameters from all GPUs at the end of each step; default true
  • allgather_bucket_size: number of elements all-gathered at a time; smaller buckets need less GPU memory but cost more communication rounds; default 5e8
  • overlap_comm: attempt to overlap gradient reduction with the backward computation; default false

Stage three

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
  • offload_optimizer: offload the optimizer states and optimizer computation to the CPU, freeing GPU memory for larger models; default false
      1. device: device to offload the optimizer states to; default cpu
      2. pin_memory: offload to page-locked CPU memory, raising throughput at the cost of extra host memory; default false
  • offload_param: offload the model parameters to the CPU, freeing GPU memory for larger models (only effective in stage 3); default false
      1. device: device to offload the parameters to; default cpu
      2. pin_memory: offload to page-locked CPU memory, raising throughput at the cost of extra host memory; default false
  • overlap_comm: overlap gradient communication with the backward computation, reducing communication time; default false
  • contiguous_gradients: copy gradients into a contiguous buffer, avoiding memory fragmentation during the backward pass; default true
  • sub_group_size: increasing it can improve bandwidth utilization; keep the default unless you are using NVMe; default 1e9
  • reduce_bucket_size: larger values make the reduction faster but need more memory for intermediate results; default 5e8

Others

{
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "consecutive_hysteresis": false,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    }
}
  • fp16: mixed-precision training
      1. enabled: whether fp16 training is enabled; default false
      2. auto_cast: whether inputs are automatically cast to fp16; default false
      3. loss_scale: static loss scale value; 0 selects dynamic loss scaling; default 0
      4. initial_scale_power: the initial dynamic loss scale is 2^initial_scale_power; default 16
      5. loss_scale_window: number of clean steps before the dynamic loss scale is raised; default 1000
      6. hysteresis: how many overflows are tolerated before the loss scale is lowered; default 2
      7. consecutive_hysteresis: whether the hysteresis is refilled on steps without overflow; default false
      8. min_loss_scale: minimum dynamic loss scale; default 1
  • bf16: whether to train in bf16; cannot be used together with amp mode or fp16 mode; default false
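To see how loss_scale=0 (dynamic loss scaling), loss_scale_window, hysteresis, and min_loss_scale interact, here is a toy simulation of such a policy; it mimics the general idea (back off when overflows persist, scale back up after a clean window), not DeepSpeed's exact implementation:

```python
def simulate(overflows, init_scale=2 ** 16, window=1000, hysteresis=2, min_scale=1):
    """Toy dynamic loss scaler: `overflows` is one bool per training step."""
    scale, good_steps, tolerance = init_scale, 0, hysteresis
    for ovf in overflows:
        if ovf:
            tolerance -= 1
            good_steps = 0
            if tolerance == 0:            # overflows persisted: back off
                scale = max(scale / 2, min_scale)
                tolerance = hysteresis
        else:
            good_steps += 1
            if good_steps == window:      # a clean window: scale back up
                scale *= 2
                good_steps = 0
    return scale

# Two consecutive overflows (hysteresis=2) halve the scale once.
print(simulate([True, True]))   # 32768.0
```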