MLIR Operator Quantization (2)

$$ real\_value_{Single} = roundToNearestFloat((affine\_value_{uint8 \, or \, uint16} - zero\_point_{uint8 \, or \, uint16})_{sint32})_{Single} * scale_{Single} $$

In the above, we assume that the result of the subtraction is in 32-bit signed integer format, and that $$roundToNearestFloat$$ returns a Single-precision value.
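As a concrete illustration, here is a minimal numpy sketch of this dequantization step. The scale and zero point values are made-up examples, and the helper function is not part of any MLIR or TensorFlow API:

```python
import numpy as np

def dequantize(affine_value: np.ndarray, scale: np.float32,
               zero_point: np.int32) -> np.ndarray:
    """Recover approximate real values from uint8 affine values.

    The subtraction is widened to int32 so it cannot overflow; the
    result is then converted to float32 (the roundToNearestFloat step)
    before scaling, mirroring the formula above.
    """
    scaled = affine_value.astype(np.int32) - zero_point   # sint32 intermediate
    return scaled.astype(np.float32) * scale              # Single-precision result

# Example with assumed scale = 0.5 and zero_point = 128:
print(dequantize(np.array([128, 130], dtype=np.uint8),
                 np.float32(0.5), np.int32(128)))
# -> [0. 1.]
```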

Affine to fixed point

When the affine and fixed point scales are the same, subtracting the zero point from the affine value yields the equivalent fixed point value.

$$ scaled\_value = affine\_value_{non\mbox{-}negative} - zero\_point_{non\mbox{-}negative} $$

Fixed point to affine 

When the affine and fixed point scales are the same, adding the zero point to the fixed point value yields the equivalent affine value.

$$ affine\_value_{non\mbox{-}negative} = scaled\_value + zero\_point_{non\mbox{-}negative} $$
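A small numpy sketch of these two identities; the concrete zero point of 128 is just an example:

```python
import numpy as np

# Assuming the affine and fixed point scales are identical, the two
# representations differ only by the zero point.
zero_point = np.int32(128)

affine_value = np.array([100, 128, 200], dtype=np.int32)  # non-negative values
scaled_value = affine_value - zero_point                  # affine -> fixed point
round_trip   = scaled_value + zero_point                  # fixed point -> affine

assert (round_trip == affine_value).all()
print(scaled_value)  # [-28   0  72]
```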

Usage within MLIR 

The quantization system being developed within MLIR has several components:

Quantization dialect containing:

A family of QuantizedTypes which represent the mapping between expressed values (typically of a floating point computer type) and storage values (typically of an integral computer type).

Type conversion operations for converting between types based on a QuantizedType and its expressed and storage sub-types.

Instrumentation operations for assigning instrumentation points within the computation where runtime statistics may help guide the quantization process.

Integration with simulated quantization at training time

The TFLite op-set natively supports uniform-quantized variants.

Passes and tools exist to convert directly from the TensorFlow dialect to the TFLite quantized operation set.

Not every application of quantization will use all of these facilities. The TensorFlow to TensorFlow Lite conversion uses the QuantizedTypes but has its own operations for type conversion and for expressing the supporting math.

Quantization Dialect 

Quantized type 

TODO: Flesh this section out.

QuantizedType base class

UniformQuantizedType
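Until this section is fleshed out, the following Python dataclass sketches the parameters a UniformQuantizedType carries (storage and expressed types, scale, zero point, and storage bounds). The class itself is purely illustrative, not an MLIR API:

```python
from dataclasses import dataclass

@dataclass
class UniformQuantizedTypeInfo:
    """Illustrative mirror of the parameters of a UniformQuantizedType."""
    storage_type: str        # integral computer type, e.g. "i8"
    expressed_type: str      # floating point computer type, e.g. "f32"
    scale: float             # multiplier from storage units to expressed units
    zero_point: int          # storage value that represents expressed 0.0
    storage_type_min: int    # smallest usable storage value
    storage_type_max: int    # largest usable storage value

    def dequantize(self, storage_value: int) -> float:
        """expressed = (storage - zero_point) * scale"""
        return (storage_value - self.zero_point) * self.scale

# e.g. roughly what the MLIR type !quant.uniform<i8:f32, 0.5:-1> denotes:
t = UniformQuantizedTypeInfo("i8", "f32", 0.5, -1, -128, 127)
print(t.dequantize(5))  # 3.0
```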

Quantized type conversion operations 

qcast : Convert from an expressed type to QuantizedType

dcast : Convert from a QuantizedType to its expressed type

scast : Convert between a QuantizedType and its storage type
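A hedged numpy sketch of the semantics of these three casts, assuming an example uniform type with scale 0.1 and zero point -30; the helper functions are illustrative, not dialect APIs:

```python
import numpy as np

SCALE, ZERO_POINT = np.float32(0.1), np.int32(-30)  # assumed example parameters

def qcast(expressed: np.ndarray) -> np.ndarray:
    """Expressed (f32) -> quantized storage (i8): scale, round, shift, clamp."""
    q = np.round(expressed / SCALE).astype(np.int32) + ZERO_POINT
    return np.clip(q, -128, 127).astype(np.int8)

def dcast(storage: np.ndarray) -> np.ndarray:
    """Quantized storage (i8) -> expressed (f32)."""
    return (storage.astype(np.int32) - ZERO_POINT).astype(np.float32) * SCALE

def scast(storage: np.ndarray) -> np.ndarray:
    """Reinterpret between a QuantizedType and its bare storage type.

    The bits are unchanged; only the static type differs, so in numpy
    terms this is just a view of the same buffer.
    """
    return storage.view(np.int8)

x = np.array([0.0, 1.0, -1.0], dtype=np.float32)
print(dcast(qcast(x)))  # approximately [0. 1. -1.]
```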

Instrumentation and constraint operations 

const_fake_quant : Emulates the logic of the historic TensorFlow fake_quant_with_min_max_args operation.

stats_ref : Declares that statistics should be gathered at this point with a unique key and made available to future passes of the solver.

stats : Declares inline statistics (per layer and per axis) for the point in the computation. stats_ref ops are generally converted to statistical operations once trial runs have been performed.

coupled_ref : Declares points in the computation to be coupled from a type inference perspective based on a unique key.
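To make the per-layer versus per-axis distinction concrete, here is an illustrative numpy computation of the kind of min/max statistics involved; the choice of the last axis for per-axis stats is an assumption:

```python
import numpy as np

activations = np.random.randn(8, 16, 3).astype(np.float32)  # example tensor

# Per-layer statistics: a single (min, max) over the whole tensor.
layer_stats = (activations.min(), activations.max())

# Per-axis statistics: one (min, max) per slice along the chosen axis,
# as used for per-channel quantization.
axis_stats = (activations.min(axis=(0, 1)), activations.max(axis=(0, 1)))

print(layer_stats)
print(axis_stats[0].shape)  # (3,) -- one minimum per channel
```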

Integration with simulated quantization at training time 


TensorFlow has historically used the tf.quantization.fake_quant_* family of operations to simulate the effect of quantization at training time.

As originally implemented, TensorFlow Lite was the primary user of such operations at inference time. When quantized inference is enabled, if every eligible tensor passes through an appropriate fake_quant node (the rules for which tensors can have fake_quant applied are somewhat involved), then TensorFlow Lite uses the attributes of the fake_quant operations to decide how to convert the computation to use kernels from its quantized operator subset.

In MLIR-based quantization, fake_quant operations are handled by converting them to a sequence of *qcast* (quantize) followed by *dcast* (dequantize), with an appropriate *UniformQuantizedType* as the target of the *qcast* operation.

This allows subsequent compiler passes to preserve the knowledge that quantization was simulated in a certain way, while giving the compiler flexibility to move the casts as it simplifies the computation and converts it to a form based on integral arithmetic.

This scheme also naturally allows partially quantized computations, where the parts that cannot be reduced to integral operations are still carried out in floating point, with appropriate conversions at the boundaries.
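A minimal numpy sketch of why this rewrite is behavior-preserving: the fake_quant round trip through the storage type is exactly dcast(qcast(x)). The min/max-to-parameters derivation below follows the standard affine formulas and omits the exact nudging rules of the TensorFlow op:

```python
import numpy as np

def fake_quant_minmax(x: np.ndarray, vmin: float, vmax: float,
                      num_bits: int = 8) -> np.ndarray:
    """Emulate fake_quant_with_min_max_args as dcast(qcast(x))."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (vmax - vmin) / (qmax - qmin)
    zero_point = int(round(qmin - vmin / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)  # qcast
    return (q - zero_point) * scale                            # dcast

x = np.array([-1.05, 0.0, 0.37, 2.0], dtype=np.float32)
print(fake_quant_minmax(x, vmin=-1.0, vmax=1.0))
# each value is snapped to the nearest point of the quantized grid
```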

TFLite native quantization 

TODO: Flesh this out

General algorithm 

Take input min/max information and set the ArrayInfo (which really is InputOrOutputArrayInfo).

In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize nodes (or tf.FakeQuant). Convert all constant FakeQuants to (tf.FQ -> tfl.Q -> tfl.DQ).

Hardcode logic/propagation needs to happen here.

Run TF constant folding.

In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).

Run quantization pass that takes (tfl.DQ (for both input and weights) -> op -> tfl.Q) and replaces it with (op). Also replace (constant_float -> tfl.Q) with (constant_quant).
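The (constant_float -> tfl.Q) to (constant_quant) replacement in the last step amounts to evaluating the quantize cast at compile time. A hedged numpy illustration, with assumed weight quantization parameters:

```python
import numpy as np

SCALE, ZERO_POINT = np.float32(0.02), np.int32(0)  # assumed example parameters

# Before: a float constant feeding a tfl.Q op at runtime.
weights_float = np.array([[0.30, -0.14], [0.08, 0.51]], dtype=np.float32)

# After: the pass evaluates the quantize cast once and stores the result
# as a quantized constant, so no cast remains in the graph.
weights_quant = np.clip(
    np.round(weights_float / SCALE) + ZERO_POINT, -128, 127).astype(np.int8)

print(weights_quant)
# [[ 15  -7]
#  [  4  26]]
```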
