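SNPE post-training quantization: overriding DLC quantization parameters

Introduction to quantization-aware training:

A detailed walkthrough of code that calls Qualcomm Snapdragon SNPE to run neural networks on different cores

Post-training quantization in SNPE:

SNPE post-training quantization: overriding DLC quantization parameters

1. Convert the PB containing FakeQuant nodes to DLC

2. Use these encodings with the snpe-dlc-quantize --override_params option

3. Check the encodings in the quantized DLC

II. Overriding DLC quantization parameters with encoding.json

1. The SNPE tools have a quantization_overrides option

2. Format of the required JSON file

3. Usage

III. How snpe-dlc-quantize handles quantization parameters

1. How scale and offset are computed from min and max

2. use_symmetric_quantize_weights

Summary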

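Introduction to quantization-aware training:

Introduction to Neural Network Quantization: Quantization-Aware Training - BBuf's personal space - OSCHINA - Chinese Open Source Technology Community

A detailed walkthrough of code that calls Qualcomm Snapdragon SNPE to run neural networks on different cores

A detailed walkthrough of code that calls Qualcomm Snapdragon SNPE to run neural networks on different cores - qq798446835's blog - CSDN

Post-training quantization in SNPE:

Post-Training Quantization in SNPE - Zhihu

SNPE post-training quantization: overriding DLC quantization parameters

Getting Started with SNPE: Overriding DLC Quantization Parameters - Zhihu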

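The ability to override DLC quantization parameters was added to SNPE long ago, but perhaps because the documentation is incomplete, many people still do not know how to use it.

Here I briefly describe how to override DLC quantization parameters and the points to watch out for:

How to write the min/max parameters of TF FakeQuant nodes into the DLC
How to write encodings into the DLC through an encoding.json file
How snpe-dlc-quantize handles the quantization parameters written in this way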

This article is based on SNPE 1.59.0.

Qualcomm Neural Processing SDK for AI: developer.qualcomm.com/software/qualcomm-neural-processing-sdk

I. Writing TF FakeQuant min/max into the DLC

1. Convert the PB containing FakeQuant nodes to DLC

snpe-tensorflow-to-dlc --input_network test.pb --input_dim noisy_image 1,512,512,3 --out_node downsample_0/conv2d_0/Relu  --output_path test.dlc

During this conversion, SNPE also writes the FakeQuant min, max and bitwidth into the float DLC; see the code below.

#<snpe-sdk>\lib\python\qti\aisw\converters\tensorflow\layers\fake_quant.py
class FakeQuantLayerBuilder(LayerBuilder):
...
            # save quantization encodings for previous layer. ie quantization is done on the outputs of the previous
            # layer. node_x -> fakequant_node -> node_y
                # save quantization encodings for next layer. ie quantization is done on the const inputs of the next
                # layer. weights_node -> fakequant_node -> node_x
....

The float DLC stores the output of FakeQuant, that is, the weights after quantization followed by dequantization; see the code below.

#<snpe-sdk>\lib\python\qti\aisw\converters\tensorflow\layers\convolution.py
def get_weights_tensor(self, graph_helper, weights_source_op):
    if graph_helper.check_tensor_const_origin(weights_source_op.outputs[0])[0]:
        return graph_helper.evaluate_tensor_output(weights_source_op.outputs[0])
    return None
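The effect described above can be sketched in plain Python: a FakeQuant node snaps each weight onto a uniform grid defined by min, max and bitwidth and then converts it back to float, so the weights stored in the float DLC already sit on that grid. This is a simplified sketch, not TensorFlow's exact kernel; the function name `fake_quant` and the sample weights are made up for illustration, and the min/max are the conv1 weight range from the encoding.json example later in this article.

```python
import math


def fake_quant(values, min_val, max_val, num_bits=8):
    """Quantize-dequantize `values` onto a uniform grid (simplified FakeQuant sketch)."""
    num_steps = 2 ** num_bits - 1
    scale = (max_val - min_val) / num_steps
    # Nudge the range so that 0.0 falls exactly on a grid point.
    zero_point = min(num_steps, max(0, math.floor(-min_val / scale + 0.5)))
    nudged_min = -zero_point * scale
    nudged_max = nudged_min + num_steps * scale
    out = []
    for v in values:
        clamped = min(max(v, nudged_min), nudged_max)
        q = math.floor((clamped - nudged_min) / scale + 0.5)  # integer code in [0, num_steps]
        out.append(nudged_min + q * scale)                     # dequantize back to float
    return out


weights = [-0.70, -0.1234, 0.0, 0.3333, 0.71]   # hypothetical sample weights
dq = fake_quant(weights, -0.7028865169076359, 0.7084210564108456)
print([round(v, 6) for v in dq])
```

Running `fake_quant` a second time on `dq` with the same range returns the same values up to float rounding, which is why quantizing those stored weights again with the same encoding loses nothing further.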

2. Use these encodings with the snpe-dlc-quantize --override_params option

Without --override_params, SNPE performs ordinary post-training quantization:

snpe-dlc-quantize --input_dlc test.dlc  --input_list raw_list.txt --output_dlc test_quantized.dlc
# Dump the DLC information to a text file
snpe-dlc-info -i test_quantized.dlc > test_quantized.dlc.info.txt

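With --override_params, SNPE still performs post-training quantization, but then overwrites, in the quantized DLC, the parameters of every layer that carries a FakeQuant encoding: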

snpe-dlc-quantize --input_dlc test.dlc  --input_list raw_list.txt --output_dlc test_quantized_override.dlc --override_params
snpe-dlc-info -i test_quantized_override.dlc > test_quantized_override.dlc.info.txt

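3. Check the encodings in the quantized DLC

a. Weight encodings of the Conv2D node

Conv2d Weights Encoding

The min/max boxed in green, obtained with --override_params, essentially match the min/max of FakeQuant_1 in the PB.
The small differences come from zero-point alignment, which is explained later.

b. Activation encodings of the ReLU node

Relu Activation Encoding

The min/max boxed in green, obtained with --override_params, match the min/max of FakeQuant_2 in the PB.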

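II. Overriding DLC quantization parameters with encoding.json

1. The SNPE tools have a quantization_overrides option

The SNPE converters have a --quantization_overrides option for supplying the quantization parameters of a model; the parameters are passed in as a JSON file.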

╰─ snpe-tensorflow-to-dlc -h
╰─ snpe-pytorch-to-dlc -h
....
Quantizer Options:
  --quantization_overrides QUANTIZATION_OVERRIDES
                        Use this option to specify a json file with parameters
                        to use for quantization. These will override any
                        quantization data carried from conversion (eg TF fake
                        quantization) or calculated during the normal
                        quantization process. Format defined as per AIMET
                        specification.

2. Format of the required JSON file

"activation_encodings" holds the quantization parameters of activation tensors, keyed by tensor name.
"param_encodings" holds the quantization parameters of weight and bias tensors.

{
    "activation_encodings": {
       "inference/coefficients/splat_cond/conv1/BiasAdd:0": [
            {
                "bitwidth": 16,
                "is_symmetric": "False",
                "max": 263.1574777999113,
                "min": -3.2438893875886934,
                "offset": -798,
                "scale": 0.0040650242952239264
            }
        ],
	   "inference/coefficients/splat_cond/conv1/LeakyRelu:0": [
            {
                "bitwidth": 16,
                "is_symmetric": "False",
                "max": 176.4537167516027,
                "min": -0.6377291712488437,
                "offset": -236,
                "scale": 0.002702242251054422
            }
        ],
		"inference/coefficients/splat_cond/res1_1/conv1/BiasAdd:0": [
            {
                "bitwidth": 16,
                "is_symmetric": "False",
                "max": 297.32563562059215,
                "min": -2.0971243891734557,
                "offset": -459,
                "scale": 0.004568898451358291
            }
        ]
    },
    "param_encodings": {
        "inference/coefficients/splat_cond/conv1/weights/read:0": [
            {
                "bitwidth": 8,
                "is_symmetric": "False",
                "max": 0.7084210564108456,
                "min": -0.7028865169076359,
                "offset": -127,
                "scale": 0.0055345395032097315
            }
        ],
        "inference/coefficients/splat_cond/res1_1/conv1/weights/read:0": [
            {
                "bitwidth": 8,
                "is_symmetric": "False",
                "max": 1.921162235035616,
                "min": -1.4808958895066204,
                "offset": -111,
                "scale": 0.013341404409969554
            }
        ],
        "inference/coefficients/splat_cond/res1_1/conv2/weights/read:0": [
            {
                "bitwidth": 8,
                "is_symmetric": "False",
                "max": 1.702315330505371,
                "min": -2.4318790435791016,
                "offset": -150,
                "scale": 0.01621252695719401
            }
        ]
    }
}
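Before handing a hand-written encoding.json to the converter, a quick sanity check can catch inconsistent entries. The sketch below assumes the asymmetric relations min = scale * offset and max = min + scale * (2^bitwidth - 1); every entry in the example above satisfies them to within floating-point rounding (for instance 0.0040650242952239264 × -798 ≈ -3.2439). Whether SNPE itself validates these relations is not covered in this article, and the helper names are hypothetical.

```python
import json
import math

REQUIRED_KEYS = ("bitwidth", "is_symmetric", "max", "min", "offset", "scale")


def encoding_problems(name, enc, rel_tol=1e-6):
    """Return human-readable problems found in one asymmetric encoding entry."""
    missing = [k for k in REQUIRED_KEYS if k not in enc]
    if missing:
        return [f"{name}: missing {missing}"]
    problems = []
    num_steps = 2 ** enc["bitwidth"] - 1
    expected_min = enc["scale"] * enc["offset"]
    expected_max = expected_min + enc["scale"] * num_steps
    # abs_tol covers entries whose min is exactly 0 (offset 0).
    if not math.isclose(enc["min"], expected_min, rel_tol=rel_tol, abs_tol=1e-9):
        problems.append(f"{name}: min {enc['min']} != scale*offset {expected_min}")
    if not math.isclose(enc["max"], expected_max, rel_tol=rel_tol, abs_tol=1e-9):
        problems.append(f"{name}: max {enc['max']} != min + scale*(2^bw-1) {expected_max}")
    return problems


def check_encodings(data):
    problems = []
    for section in ("activation_encodings", "param_encodings"):
        for tensor_name, entries in data.get(section, {}).items():
            for enc in entries:
                problems.extend(encoding_problems(tensor_name, enc))
    return problems


# Two entries copied from the example file; in practice use json.load(open("encoding.json")).
sample = json.loads("""
{
  "activation_encodings": {
    "inference/coefficients/splat_cond/conv1/BiasAdd:0": [
      {"bitwidth": 16, "is_symmetric": "False", "max": 263.1574777999113,
       "min": -3.2438893875886934, "offset": -798, "scale": 0.0040650242952239264}
    ]
  },
  "param_encodings": {
    "inference/coefficients/splat_cond/conv1/weights/read:0": [
      {"bitwidth": 8, "is_symmetric": "False", "max": 0.7084210564108456,
       "min": -0.7028865169076359, "offset": -127, "scale": 0.0055345395032097315}
    ]
  }
}
""")
print(check_encodings(sample) or "all entries consistent")
```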

3. Usage

Here we use test2.pb, an example without FakeQuant nodes, together with the encoding.json above.

test2.pb

Run snpe-tensorflow-to-dlc with --quantization_overrides encoding.json:

╰─ snpe-tensorflow-to-dlc --input_network test2.pb --quantization_overrides encoding.json --input_dim Placeholder 1,256,256,4  --out_node inference/coefficients/splat_cond/res1_1/conv1/BiasAdd  --output_path test2.dlc
....
2022-02-10 19:52:09,963 - 214 - INFO - Processing user provided quantization encodings:
2022-02-10 19:52:10,008 - 214 - INFO - INFO_ALL_BUILDING_NETWORK:

Then run snpe-dlc-quantize with --override_params:

╰─ snpe-dlc-quantize --input_dlc test2.dlc  --input_list net_run_test2/raw_list.txt --output_dlc test_quantized_json.dlc --override_params --act_bitwidth 16
[INFO] Setting activation for layer: inference/coefficients/splat_cond/conv1/Conv2D and buffer: inference/coefficients/splat_cond/conv1/BiasAdd:0
[INFO] bw: 16, min: -3.243889, max: 263.157471, delta: 0.004065, offset: -798.000000
[INFO] Setting activation for layer: inference/coefficients/splat_cond/conv1/LeakyRelu and buffer: inference/coefficients/splat_cond/conv1/LeakyRelu:0
[INFO] bw: 16, min: -0.637729, max: 176.453720, delta: 0.002702, offset: -236.000000
[INFO] Setting activation for layer: inference/coefficients/splat_cond/res1_1/conv1/Conv2D and buffer: inference/coefficients/splat_cond/res1_1/conv1/BiasAdd:0
[INFO] bw: 16, min: -2.097124, max: 297.325623, delta: 0.004569, offset: -459.000000
[INFO] Writing quantized model to: test_quantized_json.dlc
[INFO] DebugLog shutting down.

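The DLC info shows quantization parameters that essentially match the JSON; the differences in the trailing decimal places are probably caused by a different number of digits being kept when the values are stored.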

Layer	Output encodings	Weights encodings
inference/coefficients/splat_cond/conv1/Conv2D (bias folded into Conv2D)	min -3.243889331818, max 263.157470703125, delta 0.004065024201, offset -798.00000000, bitwidth 16	min -0.702886521816, max 0.708421051502, delta 0.005534539465, offset -127.000000000000, bitwidth 8
inference/coefficients/splat_cond/conv1/LeakyRelu	min -0.637729167938, max 176.453720092773, delta 0.002702242229, offset -236.000000000000, bitwidth 16	-
inference/coefficients/splat_cond/res1_1/conv1/Conv2D	min -2.097124338150, max 297.325622558594, delta 0.004568898585, offset -459.000000000000, bitwidth 16	min -1.480895876884, max 1.921162247658, delta 0.013341404498, offset -111.000000000000, bitwidth 8
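The guess about storage precision can be checked: the DLC values in the table are consistent with the JSON values rounded to 32-bit floats, which suggests the DLC stores these parameters as float32. A minimal check with Python's `struct` module, using three pairs copied from the JSON and the table:

```python
import math
import struct


def as_float32(x):
    """Round a Python float (IEEE double) to the nearest float32 and return it as a float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# (value in encoding.json, value printed by snpe-dlc-info)
pairs = [
    (263.1574777999113, 263.157470703125),    # conv1/BiasAdd:0 max
    (-3.2438893875886934, -3.243889331818),   # conv1/BiasAdd:0 min
    (-0.7028865169076359, -0.702886521816),   # conv1 weights min
]

for json_val, dlc_val in pairs:
    f32 = as_float32(json_val)
    # dlc-info prints 12 decimals; 1e-9 is far below the float32 spacing at these magnitudes.
    print(f"json={json_val!r} float32={f32!r} dlc={dlc_val!r} "
          f"match={math.isclose(f32, dlc_val, abs_tol=1e-9)}")
```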

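ONNX, PyTorch and TFLite models use the same two options when being converted to DLC.
The most direct way to generate encoding.json is AIMET, which provides an API for exporting it; you can also write it yourself following the format above.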
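For the write-it-yourself route, here is a minimal sketch of a writer for the format shown above. The helper `write_encodings` and its input layout (tensor name mapped to a dict with min, max, scale, offset, bitwidth and an optional boolean is_symmetric) are hypothetical; the encodings themselves must come from your own training pipeline, and writing is_symmetric as the string "True"/"False" simply mirrors the example file.

```python
import json
import os
import tempfile


def _entry(enc):
    """One encoding in the list-of-one-dict layout used by encoding.json."""
    return [{
        "bitwidth": int(enc["bitwidth"]),
        "is_symmetric": "True" if enc.get("is_symmetric", False) else "False",
        "max": float(enc["max"]),
        "min": float(enc["min"]),
        "offset": int(enc["offset"]),
        "scale": float(enc["scale"]),
    }]


def write_encodings(path, activations, params):
    data = {
        "activation_encodings": {name: _entry(e) for name, e in activations.items()},
        "param_encodings": {name: _entry(e) for name, e in params.items()},
    }
    with open(path, "w") as f:
        json.dump(data, f, indent=4, sort_keys=True)
    return data


acts = {"inference/coefficients/splat_cond/conv1/LeakyRelu:0": {
    "bitwidth": 16, "min": -0.6377291712488437, "max": 176.4537167516027,
    "scale": 0.002702242251054422, "offset": -236}}
params = {"inference/coefficients/splat_cond/conv1/weights/read:0": {
    "bitwidth": 8, "min": -0.7028865169076359, "max": 0.7084210564108456,
    "scale": 0.0055345395032097315, "offset": -127}}

out_path = os.path.join(tempfile.mkdtemp(), "encoding.json")
data = write_encodings(out_path, acts, params)
print("wrote", out_path)
```

The resulting file is what gets passed to the converter via --quantization_overrides.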

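III. How snpe-dlc-quantize handles quantization parameters

1. How scale and offset are computed from min and max

In the first approach, where the FakeQuant encodings from the PB are written into the DLC, no scale, offset or is_symmetric is provided.

In that case the following computation is performed: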

https://github.com/quic/aimet/blob/develop/ModelOptimizations/DlQuantization/src/TfQuantizer.cpp#L133

void TfQuantizer<DTYPE>::MinAndMaxToFxpFormat(const StatsTf& stats, int bw, TfEncoding& encoding)
{
    double num_steps = pow(2, bw) - 1;
    // Make sure zero value is within the range
    double new_min = std::min(0.0, stats.min);
    double new_max = std::max(0.0, stats.max);

    // When the min and max are too close together, nudge the maximum to meet the
    // minimum range requirement
    // This also handles the case where min==max==0 to avoid division by zero
    new_max = std::max(new_max, new_min + MIN_RANGE);

    encoding.delta = (new_max - new_min) / num_steps;
    if (new_min < 0 && new_max > 0)
    {
        // Need to make sure 0-value is exactly quantizable
        // Quantization of q into b is given by:
        //     b = q / delta - offset, where
        //                             delta = (max - min)/#steps
        //                             offset = min / delta
        // For q = 0: b = -min / delta
        // Find the closest round b, and set q=0 for it
        double b_zero   = round(-new_min / encoding.delta);
        b_zero          = std::min(num_steps, std::max(0.0, b_zero));   // just to be safe
        encoding.offset = -b_zero;
    }
    else
    {
        // One of min or max is guaranteed to be zero, so 0 is exactly quantizable already
        encoding.offset = round(new_min / encoding.delta);
    }

    // Calculate 'min' and 'max' based on 'delta' and 'offset'.
    // Note this min and max can vary from the one in 'stats'. This min and max
    // can really be represented with the integer offset.
    encoding.min = encoding.delta * encoding.offset;
    // We want to calculate: max = delta * num_steps + min.
    // To avoid numerical accuracy issues on Linaro, we simplify the math.
    encoding.max = new_max - new_min + encoding.min;
    encoding.bw  = bw;
}

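As you can see, min and max are adjusted, mainly so that zero is exactly representable on the quantization grid (zero-point alignment).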
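A Python port of the C++ function above makes the adjustment easy to try out. It is a sketch for experimentation, not SNPE's own code: `MIN_RANGE` is not shown in the excerpt, so the value 0.01 is an assumption, and `c_round` reproduces C++ `round` (half away from zero) because Python's built-in `round` rounds half to even.

```python
import math

MIN_RANGE = 0.01  # assumption: the real constant is not shown in the excerpt above


def c_round(x):
    """Round half away from zero, like C++ round(); Python's round() is half-to-even."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


def min_max_to_encoding(stats_min, stats_max, bw):
    num_steps = 2 ** bw - 1
    new_min = min(0.0, stats_min)            # make sure 0 lies inside the range
    new_max = max(0.0, stats_max)
    new_max = max(new_max, new_min + MIN_RANGE)
    delta = (new_max - new_min) / num_steps
    if new_min < 0 < new_max:
        b_zero = c_round(-new_min / delta)   # integer code that represents 0.0
        b_zero = min(num_steps, max(0.0, b_zero))
        offset = -b_zero
    else:
        offset = c_round(new_min / delta)
    enc_min = delta * offset                 # min now lies exactly on the grid
    enc_max = new_max - new_min + enc_min    # range width is preserved
    return {"bitwidth": bw, "min": enc_min, "max": enc_max,
            "scale": delta, "offset": int(offset)}


# Reproduces the 16-bit conv1/BiasAdd:0 entry of encoding.json.
act = min_max_to_encoding(-3.2438893875886934, 263.1574777999113, 16)
# A range that is not zero-aligned: min and max both shift slightly.
w = min_max_to_encoding(-1.0, 2.5, 8)
print(act)
print(w)
```

For the hypothetical range [-1.0, 2.5] at 8 bits, zero falls between grid points, so the encoding shifts min slightly below -1.0 and max slightly below 2.5; this is the same kind of small difference seen between the FakeQuant min/max and the DLC encodings in section I.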

2. use_symmetric_quantize_weights

snpe-dlc-quantize has a --use_symmetric_quantize_weights option.

  [ --use_symmetric_quantize_weights ]
Use the symmetric quantizer feature when quantizing the weights of the model. It makes sure min and max have the
same absolute values about zero. Symmetrically quantized data will also be stored as int#_t data such that the offset is always 0.

If is_symmetric in your encodings is False but you also pass this option, SNPE reports an error. So to use this option, set is_symmetric to True in the encodings.

snpe-dlc-quantize --input_dlc test2.dlc  --input_list net_run_test2/raw_list.txt --output_dlc test_quantized_json.dlc --override_params --act_bitwidth 16 --debug3 --use_symmetric_quantize_weights
[ERROR] Requested symmetric weights but instead got is_symmetric==False.
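A small pre-flight check can catch this before running snpe-dlc-quantize. The sketch below lists param encodings whose is_symmetric is not "True" and produces a copy with the flag set; the helper names are hypothetical, and comparing against the string "True" follows the example file (whether SNPE also accepts a JSON boolean there is not covered in this article). Note that flipping the flag only changes how SNPE treats the entry; it does not make the weights themselves symmetric.

```python
import json


def non_symmetric_params(encodings):
    """Names of param_encodings entries whose is_symmetric is not "True"."""
    names = []
    for name, entries in encodings.get("param_encodings", {}).items():
        if any(str(e.get("is_symmetric", "False")) != "True" for e in entries):
            names.append(name)
    return names


def with_symmetric_params(encodings):
    """Return a copy in which every param encoding has is_symmetric = "True"."""
    fixed = json.loads(json.dumps(encodings))  # deep copy through a JSON round-trip
    for entries in fixed.get("param_encodings", {}).values():
        for e in entries:
            e["is_symmetric"] = "True"
    return fixed


enc = {
    "activation_encodings": {},
    "param_encodings": {
        "inference/coefficients/splat_cond/res1_1/conv2/weights/read:0": [
            {"bitwidth": 8, "is_symmetric": "False", "max": 1.702315330505371,
             "min": -2.4318790435791016, "offset": -150, "scale": 0.01621252695719401}
        ]
    },
}
print("not symmetric:", non_symmetric_params(enc))
fixed = with_symmetric_params(enc)
print("after fix:", non_symmetric_params(fixed))
```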

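If the min and max in the encoding are not symmetric, then even with is_symmetric set to True the computation forces min = -max = -max(abs(min), abs(max)).

A reminder: for the Snapdragon 888, training the model with symmetric weight quantization gives somewhat better accuracy on the HTP.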
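The forced range can be illustrated on the res1_1/conv2 weights from encoding.json, whose range [-2.432, 1.702] is lopsided. This is a minimal sketch; for the step-size comparison it assumes that the symmetric and asymmetric encodings spread their range over the same number of steps, which this article does not confirm for SNPE. One plausible reading of the HTP remark is that weights trained to be symmetric avoid this wasted part of the range.

```python
def symmetric_range(enc_min, enc_max):
    """min = -max = -max(|min|, |max|), as described above."""
    m = max(abs(enc_min), abs(enc_max))
    return -m, m


# res1_1/conv2 weights from encoding.json
w_min, w_max = -2.4318790435791016, 1.702315330505371
s_min, s_max = symmetric_range(w_min, w_max)

# Assumption: same number of steps in both schemes, so the step size
# grows by the same factor as the width of the range.
growth = (s_max - s_min) / (w_max - w_min)
print(f"symmetric range [{s_min:.4f}, {s_max:.4f}], step size x{growth:.3f}")
```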

Summary

We have now seen how to overwrite the quantization parameters in a DLC with parameters obtained from quantization-aware training.
