Basic usage of DeepSpeed

1. DeepSpeed basics

1.1 Installing DeepSpeed

Installing DeepSpeed is straightforward; just run the following command:

pip install deepspeed

Before this you also need a basic environment with Python, PyTorch, and so on, which we will not cover here.

1.2 Writing the JSON config file

Using DeepSpeed is also straightforward. First prepare a JSON file; here we create a config.json to hold the settings the training run needs:

{
    "train_batch_size": 4,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 1000
        }
    },
    "wall_clock_breakdown": false
}
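DeepSpeed parses this file as plain JSON (no comments, no trailing commas), so it can save a failed launch to sanity-check it first. A minimal sketch using only the standard library; the field checks mirror the config above:

```python
import json

# Write the config above to disk, then re-read it to confirm it is valid JSON.
config = {
    "train_batch_size": 4,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 0.001, "betas": [0.8, 0.999],
                   "eps": 1e-8, "weight_decay": 3e-7},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": 0, "warmup_max_lr": 0.001,
                   "warmup_num_steps": 1000},
    },
    "wall_clock_breakdown": False,
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)

with open("config.json") as f:
    cfg = json.load(f)

# The scheduler should warm up to the optimizer's learning rate.
assert cfg["scheduler"]["params"]["warmup_max_lr"] == cfg["optimizer"]["params"]["lr"]
print("config.json is valid JSON")
```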

1.3 Running with DeepSpeed

1. Configure the training arguments

import argparse
import deepspeed

def add_argument():
    parser = argparse.ArgumentParser(description='CIFAR')
    parser.add_argument('-b', '--batch_size', default=32, type=int, help='mini-batch size (default: 32)')
    parser.add_argument('-e', '--epochs', default=30, type=int, help='number of total epochs (default: 30)')
    parser.add_argument('--local_rank', type=int, default=-1, help='local rank passed from distributed launcher')
    parser.add_argument('--log-interval', type=int, default=2000, help='output logging information at a given interval')

    parser = deepspeed.add_config_arguments(parser)  # add DeepSpeed's own arguments
    args = parser.parse_args()
    return args

args = add_argument()

2. Define the network

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
parameters = filter(lambda p: p.requires_grad, net.parameters())
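The 16 * 5 * 5 fed into fc1 is determined by how a 32×32 CIFAR image shrinks through the conv/pool stack; the arithmetic can be checked without torch:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a conv or pool layer (square input)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                      # CIFAR-10 images are 32x32
size = conv_out(size, 5)       # conv1 (5x5 kernel): 32 -> 28
size = conv_out(size, 2, 2)    # 2x2 max-pool: 28 -> 14
size = conv_out(size, 5)       # conv2 (5x5 kernel): 14 -> 10
size = conv_out(size, 2, 2)    # 2x2 max-pool: 10 -> 5

channels = 16                  # conv2 output channels
print(channels * size * size)  # 400, i.e. the 16 * 5 * 5 input size of fc1
```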

3. Load the data

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=16, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)
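Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) applies (x - mean) / std per channel, so ToTensor's [0, 1] range becomes [-1, 1]; the arithmetic in plain Python:

```python
def normalize(x, mean=0.5, std=0.5):
    # Same per-channel formula torchvision's Normalize applies.
    return (x - mean) / std

print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
```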

4. Initialize DeepSpeed

model_engine, optimizer, trainloader, __ = deepspeed.initialize(
    args=args, model=net, model_parameters=parameters, training_data=trainset)

5. Train and test

criterion = nn.CrossEntropyLoss()
for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(trainloader):
        inputs, labels = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
        outputs = model_engine(inputs)
        loss = criterion(outputs, labels)
        model_engine.backward(loss)
        model_engine.step()

        # print statistics
        running_loss += loss.item()
        if i % args.log_interval == (args.log_interval - 1):
            print('[%d %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / args.log_interval))
            running_loss = 0.0


correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images.to(model_engine.local_rank))
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels.to(model_engine.local_rank)).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))

Once each step is written, launch the program with the following command:

deepspeed --include localhost:7 test.py --deepspeed_config config.json

You will then see the training run:

[2023-08-11 15:28:13,128] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-11 15:28:14,693] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-11 15:28:14,694] [INFO] [runner.py:555:main] cmd = /home/wangyh/miniconda3/envs/llava2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test.py --deepspeed_config config.json
[2023-08-11 15:28:15,862] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-11 15:28:17,418] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [7]}
[2023-08-11 15:28:17,418] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-08-11 15:28:17,418] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-08-11 15:28:17,418] [INFO] [launch.py:163:main] dist_world_size=1
[2023-08-11 15:28:17,418] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=7
[2023-08-11 15:28:18,729] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Files already downloaded and verified
Files already downloaded and verified
[2023-08-11 15:28:21,691] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown
[2023-08-11 15:28:21,691] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-11 15:28:21,691] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-11 15:28:21,691] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-11 15:28:22,465] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.0886220932006836 seconds
[2023-08-11 15:28:22,849] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
......

2. Advanced DeepSpeed tips

2.1 Specifying particular GPUs

We only cover the single-node multi-GPU case here. Run deepspeed --include localhost:4,5,6,7 train.py --deepspeed_config config.json. The --include argument selects which GPUs to use; localhost:4,5,6,7 trains on cards 4, 5, 6, and 7. Without it, DeepSpeed automatically uses every available GPU.
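The value passed to --include is a resource string of the form host:gpu,gpu,...; in DeepSpeed's syntax multiple hosts are separated by @. An illustrative parser (DeepSpeed's real one lives in its launcher) showing how the string maps to GPU indices:

```python
def parse_include(resource):
    """Parse a DeepSpeed-style resource string like 'localhost:4,5,6,7'
    (or 'host1:0,1@host2:0,1' for several hosts) into a dict.
    Illustrative sketch only -- not DeepSpeed's actual parser."""
    mapping = {}
    for part in resource.split("@"):
        host, _, gpus = part.partition(":")
        mapping[host] = [int(g) for g in gpus.split(",")] if gpus else []
    return mapping

print(parse_include("localhost:4,5,6,7"))  # {'localhost': [4, 5, 6, 7]}
```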

If your script takes additional arguments, simply append them after --deepspeed_config config.json.

2.2 Debugging a DeepSpeed program in VS Code

Change launch.json to the following; in the "program" line, replace llava2 with the name of your own environment:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "/home/wangyh/miniconda3/envs/llava2/bin/deepspeed",
            "console": "integratedTerminal",
            "justMyCode": true,
            "args": [
                "--include", "localhost:7",
                "test.py",
                "--deepspeed_config", "/data/wangyh/mllms/deepspeed_test/config.json"
            ]
        }
    ]
}

2.3 Basic DeepSpeed launch command

deepspeed --master_port 29500 \
--num_gpus 2 \
train.py \
--deepspeed ds_config.json
  • master_port: port for the distributed backend; default 29500
  • num_gpus: number of GPUs to use; by default all available GPUs are used
  • deepspeed: the config file specifying DeepSpeed's parameters

2.4 Example configs for Stage 2 and Stage 3

A basic Stage 2 configuration:

{
    "bfloat16": {
        "enabled": "auto"
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1e5
}

A basic Stage 3 configuration:

{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

3. DeepSpeed in practice

3.1 Overview

DeepSpeed official tutorial
Huggingface official tutorial
DeepSpeed paper

  1. Optimizer state partitioning (ZeRO stage 1)
  2. Gradient partitioning (ZeRO stage 2)
  3. Parameter partitioning (ZeRO stage 3)
  4. Custom mixed precision training handling
  5. A range of fast CUDA-extension-based optimizers
  6. ZeRO-Offload to CPU and NVMe

We mostly use the first three. Stage 1 partitions the optimizer states; stage 2 additionally partitions the gradients (so it has no effect at inference time); stage 3 additionally partitions the model parameters (which lets a large model be split across multiple cards).
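The savings from each stage can be estimated with the ZeRO paper's accounting: under fp16 mixed precision with Adam, each parameter costs about 16 bytes of model state (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master copy plus Adam momentum and variance), and each successive stage shards one more of these across the N GPUs. A back-of-the-envelope sketch (model states only; real usage also includes activations and buffers):

```python
def zero_mem_per_gpu(params, n_gpus, stage):
    """Approximate model-state bytes per GPU under ZeRO (fp16 + Adam),
    using the 2/2/12-bytes-per-parameter split from the ZeRO paper."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        optim /= n_gpus       # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus       # stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus     # stage 3: also shard parameters
    return weights + grads + optim

gb = 1024 ** 3
for stage in (0, 1, 2, 3):
    mem = zero_mem_per_gpu(7e9, 8, stage) / gb   # e.g. a 7B model on 8 GPUs
    print(f"stage {stage}: {mem:.1f} GB per GPU")
```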

Here we only cover using DeepSpeed from Huggingface.

  • Huggingface integrates DeepSpeed into its Trainer class: you only need to supply a custom config file, with no other changes (recommended)
  • You can also write a custom Trainer of your own, which we do not cover here

Training: supports ZeRO stages 1, 2, and 3, as well as Infinity
Inference: supports ZeRO stage 3 and Infinity

Launching

# regular PyTorch DDP-style launch
python -m torch.distributed.run --nproc_per_node=2 your_program.py <normal cl args> --deepspeed ds_config.json

# DeepSpeed's own launcher
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json

Note that if you want to run on specific GPUs, say GPU 1, setting CUDA_VISIBLE_DEVICES has no effect; use the localhost syntax instead:

deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ...

Note that many values in a DeepSpeed config file overlap with the Trainer's own arguments. To avoid conflicts, set those values to "auto" so they are automatically replaced with the correct or most efficient value (if the two sides disagree, training may fail).
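To illustrate the kind of conflict meant here, a small sketch that flags config keys the Trainer also controls when they are hard-coded instead of "auto" (the key list is illustrative, not the Trainer's actual list):

```python
# Keys the HF Trainer also sets from TrainingArguments (illustrative subset).
HF_MANAGED = {"train_batch_size", "train_micro_batch_size_per_gpu",
              "gradient_accumulation_steps", "gradient_clipping"}

def conflicting_keys(ds_config):
    """Return top-level keys with hard-coded values that risk disagreeing
    with TrainingArguments; "auto" values are resolved by the Trainer."""
    return sorted(k for k in HF_MANAGED
                  if k in ds_config and ds_config[k] != "auto")

cfg = {"train_batch_size": 32, "gradient_clipping": "auto"}
print(conflicting_keys(cfg))  # ['train_batch_size']
```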

Stage two

{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    }
}
  • stage: which ZeRO stage to run: 0, 1, 2, or 3, i.e. disabled, optimizer-state partitioning, optimizer + gradient partitioning, optimizer + gradient + parameter partitioning; default 0
  • offload_optimizer: offload optimizer state to reduce GPU memory usage
  • allgather_partitions: all-gather the updated parameters from all GPUs at the end of each step; default true
  • allgather_bucket_size: number of elements all-gathered at a time; smaller buckets need less GPU memory but cost more communication rounds; default 5e8
  • overlap_comm: attempt to overlap gradient reduction with the backward computation; default false

Stage three

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
  • offload_optimizer: offload the optimizer states and optimizer computation to the CPU, freeing GPU memory for larger models; default false
      1. device: device to offload the optimizer states to; default cpu
      2. pin_memory: offload to page-locked CPU memory, raising throughput at the cost of extra host memory; default false
  • offload_param: offload the model parameters to the CPU, freeing GPU memory for larger models (only effective in stage 3); default false
      1. device: device to offload the parameters to; default cpu
      2. pin_memory: offload to page-locked CPU memory, raising throughput at the cost of extra host memory; default false
  • overlap_comm: overlap gradient communication with the backward computation, reducing communication time; default false
  • contiguous_gradients: copy gradients into a contiguous buffer, avoiding memory fragmentation during the backward pass; default true
  • sub_group_size: increasing it can improve bandwidth utilization; keep the default unless you are using NVMe; default 1e9
  • reduce_bucket_size: larger values make the reduction faster but need more memory for intermediate results; default 5e8

Others

{
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "consecutive_hysteresis": false,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    }
}
  • fp16: mixed-precision training
      1. enabled: whether fp16 training is enabled; default false
      2. auto_cast: whether inputs are automatically cast to fp16; default false
      3. loss_scale: static loss scale value; 0 selects dynamic loss scaling; default 0
      4. initial_scale_power: the initial dynamic loss scale is 2^initial_scale_power; default 16
      5. loss_scale_window: number of clean steps before the dynamic loss scale is raised; default 1000
      6. hysteresis: how many overflows are tolerated before the loss scale is lowered; default 2
      7. consecutive_hysteresis: whether the hysteresis is refilled on steps without overflow; default false
      8. min_loss_scale: minimum dynamic loss scale; default 1
  • bf16: whether to train in bf16; cannot be used together with amp mode or fp16 mode; default false
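To see how loss_scale=0 (dynamic loss scaling), loss_scale_window, hysteresis, and min_loss_scale interact, here is a toy simulation of such a policy; it mimics the general idea (back off when overflows persist, scale back up after a clean window), not DeepSpeed's exact implementation:

```python
def simulate(overflows, init_scale=2 ** 16, window=1000, hysteresis=2, min_scale=1):
    """Toy dynamic loss scaler: `overflows` is one bool per training step."""
    scale, good_steps, tolerance = init_scale, 0, hysteresis
    for ovf in overflows:
        if ovf:
            tolerance -= 1
            good_steps = 0
            if tolerance == 0:            # overflows persisted: back off
                scale = max(scale / 2, min_scale)
                tolerance = hysteresis
        else:
            good_steps += 1
            if good_steps == window:      # a clean window: scale back up
                scale *= 2
                good_steps = 0
    return scale

# Two consecutive overflows (hysteresis=2) halve the scale once.
print(simulate([True, True]))   # 32768.0
```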