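Install Docker 19.03 or later.
Find the Docker icon in the Windows Start menu and launch it.
Manually change the Docker image storage path and make sure at least 20 GB of disk space is available for Docker images and operations.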
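The disk-space requirement can be checked before any image is pulled. The following is a minimal sketch, assuming that "20G" means 20 GiB and that the path you pass lies on the drive configured for Docker image storage; the function names are illustrative and not part of Docker or linger.

```python
import shutil

REQUIRED_GIB = 20  # taken from the requirement above; read as GiB (assumption)


def free_gib(path: str) -> float:
    """Return the free space, in GiB, of the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path: str, required_gib: float = REQUIRED_GIB) -> bool:
    """True if the drive holding `path` has at least `required_gib` GiB free."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    target = "."  # replace with your Docker image storage path
    print(f"{free_gib(target):.1f} GiB free, enough: {has_enough_space(target)}")
```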
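After installation, run the command below to check the version and confirm that Docker was installed successfully and that you have sufficient permissions.
Enter the command: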
docker --version
Output:
Docker version 24.0.5, build ced0996
The Docker environment is now installed.
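The version requirement can also be checked programmatically. This is a small sketch that parses a string in the format shown above and compares it with the 19.03 minimum; the helper names are illustrative.

```python
import re


def parse_docker_version(output: str) -> tuple:
    """Extract (major, minor, patch) from `docker --version` output."""
    match = re.search(r"Docker version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"unrecognized version string: {output!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def meets_minimum(output: str, minimum=(19, 3)) -> bool:
    """True if the reported major.minor version is at least `minimum`."""
    return parse_docker_version(output)[:2] >= tuple(minimum)


if __name__ == "__main__":
    print(meets_minimum("Docker version 24.0.5, build ced0996"))
```

To check a live installation, the string could be obtained with `subprocess.run(["docker", "--version"], capture_output=True, text=True).stdout`.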
Choose the image that matches your needs:
docker pull listenai/linger:1.1.1 # CPU-only image
docker pull listenai/linger_gpu:1.1.1 # CUDA 11.2 image
linger_gpu is intended for machines with a GPU; if you have no GPU, use the CPU version.
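The choice between the two images can be expressed as a small helper. This is only a sketch: the image tags come from the commands above, while detecting a GPU by looking for `nvidia-smi` on the PATH is a heuristic assumed here, not something the linger documentation prescribes.

```python
import shutil

CPU_IMAGE = "listenai/linger:1.1.1"
GPU_IMAGE = "listenai/linger_gpu:1.1.1"


def select_linger_image(has_gpu: bool) -> str:
    """Return the GPU image when a GPU is available, otherwise the CPU image."""
    return GPU_IMAGE if has_gpu else CPU_IMAGE


def probably_has_nvidia_gpu() -> bool:
    """Heuristic (assumption): treat a discoverable nvidia-smi as 'GPU present'."""
    return shutil.which("nvidia-smi") is not None


def pull_command(has_gpu: bool) -> list:
    """Argument list for pulling the selected image."""
    return ["docker", "pull", select_linger_image(has_gpu)]


if __name__ == "__main__":
    print(" ".join(pull_command(probably_has_nvidia_gpu())))
```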
Output of a successful pull:
...
Digest: sha256:0f20199ac2c3892159ac7cc6e9de5efc37f450921a97370c4aec43d4c01e4185
Status: Downloaded newer image for listenai/linger:1.1.1
docker.io/listenai/linger:1.1.1
What's Next?
View summary of image vulnerabilities and recommendations → docker scout quickview listenai/linger:1.1.1
docker pull listenai/thinker:2.1.1
Output of a successful pull:
...
Status: Downloaded newer image for listenai/thinker:2.1.1
docker.io/listenai/thinker:2.1.1
What's Next?
View summary of image vulnerabilities and recommendations → docker scout quickview listenai/thinker:2.1.1
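Pulling the images takes a while; please be patient.
The linger (CPU version) image is used as the example.
The example mounts the linger container's /home directory to e:/linger on the local host so that files can be synchronized and edited easily; adjust the mount to suit your setup.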
docker container run -it -v e:/linger:/home listenai/linger:1.1.1 /bin/bash
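Output of the run command:
root@66d80f4aaf1e:/linger#
Here 66d80f4aaf1e is the container ID. The prompt shown is the shell prompt inside the container, which means you are now in the container. The shell starts in the linger directory, where linger commands can be run.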
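The run command and the resulting prompt can be handled from a script as well. The sketch below builds the same argument list as the command above and extracts the container ID from a prompt of the form shown; both helper names are illustrative.

```python
import re


def build_run_command(host_dir: str, image: str, container_dir: str = "/home") -> list:
    """Argument list for an interactive container with one bind mount."""
    return [
        "docker", "container", "run", "-it",
        "-v", f"{host_dir}:{container_dir}",
        image, "/bin/bash",
    ]


def container_id_from_prompt(prompt: str):
    """Return the ID in a prompt such as 'root@<id>:/linger#', or None."""
    match = re.match(r"root@([0-9a-f]+):", prompt.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    print(" ".join(build_run_command("e:/linger", "listenai/linger:1.1.1")))
    print(container_id_from_prompt("root@66d80f4aaf1e:/linger#"))
```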
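Note:
The linger Docker image already contains the linger environment, so linger does not need to be installed again. Run the install.sh script, then verify that the environment is ready.
In the linger directory, run: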
sh install.sh
Then start the Python interpreter:
python
Output:
Python 3.7.0 (default, Jul 8 2020, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import linger
# No error is reported, which means linger is installed correctly
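The same readiness check can be scripted without opening an interactive interpreter. This sketch only tests whether a module can be located; actually running `import linger` as shown above remains the definitive check. The function name is illustrative.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """True if `module_name` can be located, without actually importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("linger", "torch"):
        print(f"{name}: {'found' if is_importable(name) else 'missing'}")
```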
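Change to the container's /home directory (the directory mounted to the local host) and clone the example project pytorch-cifar100:
git clone https://cloud.listenai.com/listenai_xqqin/pytorch-cifar100.git /home/pytorch-cifar100
The project is cloned into /home/pytorch-cifar100, the path used in the rest of this walkthrough.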
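This example project is based on the official pytorch-cifar100 project, modified for the characteristics and requirements of LNN. We recommend reading "cifar100示例项目修改内容" (Changes Made to the cifar100 Example Project) to see what was changed; this will help when you train and deploy your own models.
Modify the train.py file of pytorch-cifar100 as required for floating-point training.
Before the line loss_function = nn.CrossEntropyLoss(), add the linger constraints used during floating-point training: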
# Apply linger's constraints during floating-point training
import linger
net = net.to(device)
dummy_input = torch.randn(8, 3, 32, 32, requires_grad=True).to(device)  # sets the model input size
train_mode = "clamp"  # clamp: range constraint during floating-point training; quant: quantization training
linger.trace_layers(net, net, dummy_input, fuse_bn=True)  # net is the initial model; dummy_input is the model input
# linger.disable_normalize(net.fc)  # uncomment to exclude a layer from quantization
normalize_modules = (nn.Conv2d, nn.Linear, nn.BatchNorm2d)  # layers to quantize; the default can be used
net = linger.normalize_layers(net, normalize_modules=normalize_modules, normalize_weight_value=8, normalize_bias_value=None, normalize_output_value=8)  # quantization parameters
In the pytorch-cifar100 directory, run:
python train.py -net resnet18 # run training
Output at the end of floating-point training:
Files already downloaded and verified
Files already downloaded and verified
/root/anaconda3/lib/python3.7/site-packages/torch/_tensor.py:579: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(other, self)
/home/pytorch-cifar100/models/resnet.py:133: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
output = output.view(-1,int(output.numel()//output.size(0)))
Training Epoch: 1 [128/50000] Loss: 4.7146 LR: 0.000000
Training Epoch: 1 [256/50000] Loss: 4.8907 LR: 0.000256
...
Training Epoch: 1 [49792/50000] Loss: 4.1798 LR: 0.099233
Training Epoch: 1 [49920/50000] Loss: 4.0861 LR: 0.099488
Training Epoch: 1 [50000/50000] Loss: 4.2186 LR: 0.099744
epoch 1 training time consumed: 147.89s
Evaluating Network.....
Test set: Epoch: 1, Average loss: 0.0329, Accuracy: 0.0454, Time consumed:3.80s
For this walkthrough you can stop training manually (Ctrl+C) once the first training epoch has finished. Example path of the generated *.pth file:
/home/pytorch-cifar100/checkpoint/resnet18/Saturday_09_September_2023_18h_58m_16s/resnet18-1-regular.pth
In an actual run the directory name Saturday_09_September_2023_18h_58m_16s will differ; use the directory generated by your own run.
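Because the timestamped directory changes with every run, it can be convenient to locate the newest checkpoint automatically rather than copying the path by hand. This is a minimal sketch, assuming checkpoints are laid out as `<root>/<timestamp>/<name>.pth` as in the example path above; the function name is illustrative.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(root: str, pattern: str = "*/*.pth") -> Optional[Path]:
    """Return the most recently modified file matching `pattern` under `root`."""
    candidates = list(Path(root).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Root directory taken from the example path above.
    print(latest_checkpoint("/home/pytorch-cifar100/checkpoint/resnet18"))
```

The returned path could then be passed to `torch.load` in place of a hard-coded string.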
Before starting quantization training, modify train.py again so that it applies linger's quantization operations:
# Apply linger's operations during quantization training
import linger
net = net.to(device)
dummy_input = torch.randn(8, 3, 32, 32, requires_grad=True).to(device)  # sets the model input size
train_mode = "quant"  # clamp: range constraint during floating-point training; quant: quantization training
linger.trace_layers(net, net, dummy_input, fuse_bn=True)  # net is the initial model; dummy_input is the model input
# linger.disable_normalize(net.fc)  # uncomment to exclude a layer from quantization
normalize_modules = (nn.Conv2d, nn.Linear, nn.BatchNorm2d)  # layers to quantize; the default can be used
replace_tuple = (nn.Conv2d, nn.Linear, nn.BatchNorm2d, nn.AvgPool2d)  # module types replaced by quantized versions
net = linger.normalize_layers(net, normalize_modules=normalize_modules, normalize_weight_value=8, normalize_bias_value=None, normalize_output_value=8)  # quantization parameters
net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue)
# Load the *.pth file generated by floating-point training
net.load_state_dict(torch.load("/home/pytorch-cifar100/checkpoint/resnet18/Saturday_09_September_2023_18h_58m_16s/resnet18-1-regular.pth"))
In the pytorch-cifar100 directory, run training again:
python train.py -net resnet18
Output at the end of quantization training:
Files already downloaded and verified
Files already downloaded and verified
/root/anaconda3/lib/python3.7/site-packages/torch/_tensor.py:579: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(other, self)
/home/pytorch-cifar100/models/resnet.py:133: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
output = output.view(-1,int(output.numel()//output.size(0)))
Training Epoch: 1 [128/50000] Loss: 4.6930 LR: 0.000000
Training Epoch: 1 [256/50000] Loss: 4.6886 LR: 0.000256
Training Epoch: 1 [384/50000] Loss: 4.7367 LR: 0.000512
Training Epoch: 1 [512/50000] Loss: 4.7378 LR: 0.000767
Training Epoch: 1 [640/50000] Loss: 4.7736 LR: 0.001023
Training Epoch: 1 [768/50000] Loss: 4.8131 LR: 0.001279
Training Epoch: 1 [896/50000] Loss: 4.7823 LR: 0.001535
Training Epoch: 1 [1024/50000] Loss: 4.7550 LR: 0.001790
Training Epoch: 1 [1152/50000] Loss: 4.7108 LR: 0.002046
Training Epoch: 1 [1280/50000] Loss: 4.7074 LR: 0.002302
Training Epoch: 1 [1408/50000] Loss: 4.8324 LR: 0.002558
Training Epoch: 1 [1536/50000] Loss: 4.6873 LR: 0.002813
Training Epoch: 1 [1664/50000] Loss: 4.7840 LR: 0.003069
Training Epoch: 1 [1792/50000] Loss: 4.7409 LR: 0.003325
Training Epoch: 1 [1920/50000] Loss: 4.6880 LR: 0.003581
Training Epoch: 1 [2048/50000] Loss: 4.7332 LR: 0.003836
Training Epoch: 1 [2176/50000] Loss: 4.6932 LR: 0.004092
Training Epoch: 1 [2304/50000] Loss: 4.7095 LR: 0.004348
Training Epoch: 1 [2432/50000] Loss: 4.7338 LR: 0.004604
Training Epoch: 1 [2560/50000] Loss: 4.7003 LR: 0.004859
Training Epoch: 1 [2688/50000] Loss: 4.7005 LR: 0.005115
Training Epoch: 1 [2816/50000] Loss: 4.6649 LR: 0.005371
Training Epoch: 1 [2944/50000] Loss: 4.7591 LR: 0.005627
Training Epoch: 1 [3072/50000] Loss: 4.6761 LR: 0.005882
Training Epoch: 1 [3200/50000] Loss: 4.7718 LR: 0.006138
Training Epoch: 1 [3328/50000] Loss: 4.7878 LR: 0.006394
Training Epoch: 1 [3456/50000] Loss: 4.7317 LR: 0.006650
Training Epoch: 1 [3584/50000] Loss: 4.6645 LR: 0.006905
Training Epoch: 1 [3712/50000] Loss: 4.6986 LR: 0.007161
...
Training Epoch: 1 [49664/50000] Loss: 4.1578 LR: 0.098977
Training Epoch: 1 [49792/50000] Loss: 4.1697 LR: 0.099233
Training Epoch: 1 [49920/50000] Loss: 4.2625 LR: 0.099488
Training Epoch: 1 [50000/50000] Loss: 4.3276 LR: 0.099744
epoch 1 training time consumed: 180.79s
Evaluating Network.....
Test set: Epoch: 1, Average loss: 0.0341, Accuracy: 0.0329, Time consumed:3.81s
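Training logs in this format are easy to turn into numbers, for example to compare the floating-point and quantization runs. The sketch below assumes the line layout shown in the logs above; the function names and dictionary keys are illustrative.

```python
import re

TRAIN_LINE = re.compile(
    r"Training Epoch: (\d+) \[(\d+)/(\d+)\]\s+Loss: ([\d.]+)\s+LR: ([\d.]+)"
)
TEST_LINE = re.compile(
    r"Test set: Epoch: (\d+), Average loss: ([\d.]+), Accuracy: ([\d.]+)"
)


def parse_train_line(line: str):
    """Parse one 'Training Epoch' line into a dict, or return None."""
    m = TRAIN_LINE.search(line)
    if m is None:
        return None
    epoch, seen, total, loss, lr = m.groups()
    return {"epoch": int(epoch), "seen": int(seen), "total": int(total),
            "loss": float(loss), "lr": float(lr)}


def parse_test_line(line: str):
    """Parse one 'Test set' summary line into a dict, or return None."""
    m = TEST_LINE.search(line)
    if m is None:
        return None
    epoch, avg_loss, accuracy = m.groups()
    return {"epoch": int(epoch), "avg_loss": float(avg_loss),
            "accuracy": float(accuracy)}


if __name__ == "__main__":
    print(parse_train_line("Training Epoch: 1 [50000/50000] Loss: 4.3276 LR: 0.099744"))
```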
For this walkthrough you can stop training manually (Ctrl+C) once the first training epoch has finished. Example path of the generated *.pth file:
/home/pytorch-cifar100/checkpoint/resnet18/Saturday_09_September_2023_19h_03m_58s/resnet18-1-regular.pth
In an actual run the directory name Saturday_09_September_2023_19h_03m_58s will differ; use the directory generated by your own run.
In the pytorch-cifar100 root directory, locate the pt2onnx.py script.
Modify pt2onnx.py to set the checkpoint path and the ONNX output path:
# Set the checkpoint path and the ONNX output path
ch_path="/home/pytorch-cifar100/checkpoint/resnet18/Saturday_09_September_2023_19h_03m_58s/resnet18-1-regular.pth"
onnx_path ='/home/resnet18_shape.onnx'
Here /home/ is the target directory for the generated ONNX file.
In the pytorch-cifar100 root directory, run:
python pt2onnx.py
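The quantized model file has now been exported.
Run the thinker container and mount its /home directory to e:\thinker on the local host to make file synchronization and editing easier.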
docker container run -it -v e:\thinker:/home listenai/thinker:2.1.1 /bin/bash
Output:
root@38ceeca52613:/thinker#
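The prompt above shows that you are inside the thinker environment.
Copy the training result resnet18_shape.onnx into the container's /home directory, then pack it.
In the thinker root directory, run: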
tpacker -g ../home/resnet18_shape.onnx -d True -o model.bin
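Here model.bin is the packed .bin model file produced as output.
The packing tool optimizes the computation graph, simulates engine execution to plan memory usage, and serializes the analysis results into the resource file. This step also checks operator input sizes and overall memory usage.
Output: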
===================================================================================
****** load model:../home/resnet18_shape.onnx ******
===================================================================================
****** graph optimizer ******
---- op fusion success ----
---- op fusion success ----
---- convert layout success ----
===================================================================================
****** graph adjust device ******
set linearint threshold and analyze memory begin
try threshold:655360
memory allocate on SHARE_MEM:[8192, 8192, 8192, 8192, 1200, 608],total:34576
set linearint threshold and analyze memory end
===================================================================================
****** generate memory plan report ******
memory.txt generated success!
===================================================================================
****** graph serialize ******
pack param begin
pack param success
pack tensor begin
pack tensor success
pack operator begin
pack operator success
pack io begin
pack io success
MemType.PSRAM need capacity: 13376 Bytes
MemType.SHARE_MEM need capacity: 34576 Bytes
resource_total_size:27424
===================================================================================
****** pack success ******
===================================================================================
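The memory figures at the end of the packing report can be extracted and compared against a target budget. This is a sketch that assumes the `MemType.<REGION> need capacity: <N> Bytes` line format shown above; the budget values in the usage example are hypothetical placeholders, not limits of any particular chip.

```python
import re

CAPACITY_LINE = re.compile(r"MemType\.(\w+) need capacity: (\d+) Bytes")


def parse_memory_requirements(log: str) -> dict:
    """Map each memory region named in the tpacker log to its size in bytes."""
    return {region: int(size) for region, size in CAPACITY_LINE.findall(log)}


def regions_over_budget(requirements: dict, budgets: dict) -> list:
    """Regions whose requirement exceeds the budget (a missing budget counts as 0)."""
    return sorted(
        region for region, need in requirements.items()
        if need > budgets.get(region, 0)
    )


if __name__ == "__main__":
    report = (
        "MemType.PSRAM need capacity: 13376 Bytes\n"
        "MemType.SHARE_MEM need capacity: 34576 Bytes\n"
    )
    needs = parse_memory_requirements(report)
    # Hypothetical budgets, for illustration only.
    print(regions_over_budget(needs, {"PSRAM": 20000, "SHARE_MEM": 30000}))
```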
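At this point you can run regular floating-point training and quantization training until the model converges, then export and pack the converged model to produce model.bin.
Enter the thinker directory and build thinker.
Before building, change the CMake path in scripts/x86_linux.sh and test/auto_test.sh to the directory that contains anaconda/bin.
In both scripts, find the line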
CMAKE_ROOT=/home/bitbrain/bzcai/anaconda3/bin
and change it to:
CMAKE_ROOT=/root/anaconda3/bin
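The same substitution can be applied to both scripts without editing them by hand. The following is a minimal sketch, assuming each script contains a plain `CMAKE_ROOT=<path>` assignment on its own line as shown above; the function name is illustrative.

```python
import re


def set_cmake_root(script_text: str, new_root: str = "/root/anaconda3/bin") -> str:
    """Replace the value of every `CMAKE_ROOT=` assignment in a shell script."""
    return re.sub(
        r"^([ \t]*CMAKE_ROOT=).*$",
        lambda m: m.group(1) + new_root,  # keep the indentation and the key
        script_text,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    before = "CMAKE_ROOT=/home/bitbrain/bzcai/anaconda3/bin\n"
    print(set_cmake_root(before), end="")
```

To update a file in place, read it with `pathlib.Path(path).read_text()`, pass the text through `set_cmake_root`, and write the result back with `write_text()`.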
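For details, see the example file x86_linux.sh; you can replace the contents of your x86_linux.sh with it.
After completing the steps above, run the script to build.
In the thinker root directory, run: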
sh ./scripts/x86_linux.sh
Output:
Stored in directory: /root/.cache/pip/wheels/69/2c/c0/4c2b698c1654288fbd7f0304087c70ba578480141dd972ca21
Successfully built pythinker
Installing collected packages: pythinker
Attempting uninstall: pythinker
Found existing installation: pythinker 2.1.1
Uninstalling pythinker-2.1.1:
Successfully uninstalled pythinker-2.1.1
Successfully installed pythinker-2.1.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
/thinker
Verify that the thinker environment is ready.
In the thinker root directory, run:
python
Output:
Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import thinker
>>>
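If no error is reported, the thinker environment is ready.
The simulation test in the next step needs an image as model input so that it can produce a recognition result. Here the image_preprocess.py script converts a JPG image into a binary .bin file, as follows.
In image_preprocess.py, set the input and output file paths:
image = Image.open("/thinker/demo/resnet18/apple.jpg")
preprocessed_image_int8.tofile("/thinker/demo/resnet18/apple_after_resize.bin")
python tools/image_preprocess.py # run the script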
You may encounter this error:
Traceback (most recent call last):
File "image_preprocess.py", line 4, in <module>
import torch
ModuleNotFoundError: No module named 'torch'
Solution:
# Install the missing dependencies:
pip install torch -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install torchvision -i https://pypi.tuna.tsinghua.edu.cn/simple
After a successful conversion, the .bin file is saved at /thinker/demo/resnet18/apple_after_resize.bin and can be used as the input file for the simulation test.
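A quick sanity check on the generated file is to compare its size with the expected tensor size. This sketch assumes the file holds a single 3 x 32 x 32 input (the per-sample shape used for `dummy_input` during training) stored as int8, one byte per element, as the variable name `preprocessed_image_int8` suggests; if the script writes a different layout, the expected size changes accordingly. The function names are illustrative.

```python
import os


def expected_input_bytes(channels: int = 3, height: int = 32, width: int = 32) -> int:
    """Size in bytes of a C x H x W tensor stored as int8 (one byte per element)."""
    return channels * height * width


def input_file_looks_valid(path: str, channels: int = 3, height: int = 32,
                           width: int = 32) -> bool:
    """True if the file size equals the expected int8 tensor size."""
    return os.path.getsize(path) == expected_input_bytes(channels, height, width)


if __name__ == "__main__":
    print(expected_input_bytes())  # 3 * 32 * 32 bytes
```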
Run the simulation.
In the thinker root directory, run:
chmod +x ./bin/test_thinker
./bin/test_thinker ./demo/resnet18/apple_after_resize.bin model.bin output.bin 3 32 32 6
Output:
./bin/test_thinker ./demo/resnet18/apple_after_resize.bin model.bin output.bin 3 32 32 6
init model successful!
create executor successful!
forward successful!
Predicted category index: 53
Predicted label: orange
This result shows that the model has finished computing and the example project ran end to end.
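When the simulation is run from a script, the prediction can be pulled out of the captured output. This is a small sketch that assumes the two `Predicted ...` lines appear exactly as in the output above; the function name is illustrative.

```python
import re


def parse_prediction(output: str):
    """Return (category_index, label) from test_thinker output, or None."""
    index = re.search(r"Predicted category index: (\d+)", output)
    label = re.search(r"Predicted label: (.+)", output)
    if index is None or label is None:
        return None
    return int(index.group(1)), label.group(1).strip()


if __name__ == "__main__":
    sample = (
        "forward successful!\n"
        "Predicted category index: 53\n"
        "Predicted label: orange\n"
    )
    print(parse_prediction(sample))
```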
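./demo/resnet18/apple_after_resize.bin is the input image file, generated by the image_preprocess.py step described earlier. Predicted category index is the model's classification result for the input, reported as a class index together with its label; the example model has 100 classes. Predicted label: orange is the name of the object that was finally recognized.
docker ps lists the running containers: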
CONTAINER ID   IMAGE                   COMMAND       CREATED              STATUS              PORTS     NAMES
747686929bd4   listenai/linger:1.1.1   "/bin/bash"   About a minute ago   Up About a minute             trusting_mirzakhani
4ede768d6d73   listenai/linger:1.1.1   "/bin/bash"   49 minutes ago       Up 49 minutes                 priceless_joliot
docker images lists the Docker images that have already been downloaded.
docker start CONTAINER_ID
Starts a Docker container, where CONTAINER_ID is the container ID.
docker exec -it 747686929bd4 /bin/bash
If the container is not running, the following message is returned:
docker exec -it 747686929bd4 /bin/bash
Error response from daemon: Container 747686929bd42169e43d52bb69027385af000ca1be118cc00251e5b36c2a70de is not running
In that case, start the container first with docker start CONTAINER_ID.
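The start-then-exec sequence described above can be captured in a helper that decides which commands are needed. This is only a sketch: it builds argument lists and does not run Docker itself, and the function names are illustrative.

```python
def commands_to_enter(container_id: str, is_running: bool) -> list:
    """Commands needed to open a shell in an existing container."""
    commands = []
    if not is_running:
        # A stopped container must be started before `docker exec` works.
        commands.append(["docker", "start", container_id])
    commands.append(["docker", "exec", "-it", container_id, "/bin/bash"])
    return commands


def is_not_running_error(message: str) -> bool:
    """True if a daemon error message reports a stopped container."""
    return "is not running" in message


if __name__ == "__main__":
    for cmd in commands_to_enter("747686929bd4", is_running=False):
        print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run` to execute it.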
docker cp model 66d80f4aaf1e:linger
That said, mounting a host directory is the more convenient way to share files.
docker cp 66d80f4aaf1e:/models /opt
docker container kill [CONTAINER_ID]