Porting TensorFlow Lite to an i.MX6 ARM Board

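As described in the previous post, the port to the LC1860C board failed, so I switched to another board with a newer and more complete set of libraries and carried on.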

Running label_image

 ./label_image -v 1 -m ./mobilenet_v1_1.0_224.tflite  -i ./grace_hopper.jpg -l ./imagenet_slim_labels.txt

alloc failure

The first problem I hit was a failed allocation.

...
83: MobilenetV1/MobilenetV1/Conv2d_9_depthwise/weights_quant/FakeQuantWithMinMaxVars, 1152, 3, 0.0212288, 120
84: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Conv2D_Fold_bias, 512, 2, 0.000260965, 0
85: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6, 8192, 3, 0.0235285, 0
86: MobilenetV1/MobilenetV1/Conv2d_9_pointwise/weights_quant/FakeQuantWithMinMaxVars, 16384, 3, 0.0110914, 146
87: MobilenetV1/Predictions/Reshape_1, 1001, 3, 0.00390625, 0
88: input, 49152, 3, 0.0078125, 128
len: 61306
width, height, channels: -16842752, 1766213120, 246279780
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

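At first I did not think much of it. I assumed the board was simply too weak and that the label_image example needed more memory than it had, so I planned to write a simpler example of my own to see whether that would crash too. While reading up on TensorFlow for that, I noticed that the width, height, channels values in the log looked wrong: they were huge, and one was negative. Reading the label_image code more carefully, I found that execution had not even reached Invoke at this point, i.e. TensorFlow Lite had not started running yet. The program had stopped inside read_bmp. Only then did I realize that I had passed a JPG image rather than a BMP. read_bmp takes the image dimensions from fixed offsets in the BMP header, so JPEG bytes give it garbage dimensions, and allocating a buffer of that size throws std::bad_alloc. After switching to a BMP file, the alloc problem went away.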

  std::vector<uint8_t> in = read_bmp(s->input_bmp_name, &image_width,
                                     &image_height, &image_channels, s);
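A cheap guard against this mistake is to check the file signature before decoding. The following is a minimal sketch, not code from label_image: LooksLikeBmp and LooksLikeJpeg are hypothetical helper names, and the 54-byte minimum assumes the common 14-byte file header followed by a 40-byte info header.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of label_image): true when the buffer starts
// with the two-byte BMP signature "BM" and is long enough to hold a 14-byte
// file header plus a 40-byte info header.
bool LooksLikeBmp(const std::vector<uint8_t>& bytes) {
  constexpr std::size_t kMinHeaderSize = 54;
  return bytes.size() >= kMinHeaderSize && bytes[0] == 'B' && bytes[1] == 'M';
}

// Hypothetical helper: true when the buffer starts with the JPEG
// start-of-image marker 0xFF 0xD8.
bool LooksLikeJpeg(const std::vector<uint8_t>& bytes) {
  return bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xD8;
}
```

Calling LooksLikeBmp on the raw file contents before parsing would have turned the confusing bad_alloc into an immediate "wrong file format" error.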

illegal instruction

The next problem was an illegal instruction:

...
Node  29 Operator Builtin Code  22
  Inputs: 1 5
  Outputs: 4
Node  30 Operator Builtin Code  25
  Inputs: 4
  Outputs: 87
Illegal instruction

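I had never run into Illegal instruction before. At first I took it for a TensorFlow log message, but a search through the code turned up nothing, and a web search told me that it is reported by Linux. It usually means the CPU on the board does not recognize one of the program's instructions, typically because the architecture selected at compile time does not match the board's actual ARM architecture. So the environment was to blame again.
I also learned that an illegal instruction ends in a core dump, so the core file was the place to start.
Credit where it is due: https://blog.csdn.net/chyxwzn/article/details/8879750?utm_source=tuicool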

ulimit -c unlimited

Run label_image again to get a core file, then load it into gdb.

(gdb) bt
#0  0x0001f030 in tflite::optimized_ops::ResizeBilinear(tflite::ResizeBilinearParams const&, tflite::RuntimeShape const&, float const*, tflite::RuntimeShape const&, int const*, tflite::RuntimeShape const&, float*) ()
#1  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p $pc
$1 = (void (*)()) 0x1f030 <tflite::optimized_ops::ResizeBilinear(tflite::ResizeBilinearParams const&, tflite::RuntimeShape const&, float const*, tflite::RuntimeShape const&, int const*, tflite::RuntimeShape const&, float*)+7720>
(gdb) p $sp
$2 = (void *) 0x7e9e43c8
(gdb) x/5i $pc
=> 0x1f030 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7720>:
    vfma.f32    s14, s13, s15
   0x1f034 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7724>:
    vstmia      r2!, {s14}
   0x1f038 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7728>:
    bgt 0x1f01c <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7700>
   0x1f03c <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7732>:
    vorr        d30, d16, d16
   0x1f040 <_ZN6tflite13optimized_ops14ResizeBilinearERKNS_20ResizeBilinearParamsERKNS_12RuntimeShapeEPKfS6_PKiS6_Pf+7736>:
    vorr        d31, d17, d17

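This shows that the crash is on the vfma.f32 s14, s13, s15 instruction. It looks like the board I have does not support the VFMA (fused multiply-add) floating-point operation.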

First, check the platform attributes of the compiled binary:

root@imx6dl-albatross2:~/march_build# readelf -A label_image
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7-A"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv4
  Tag_Advanced_SIMD_arch: NEONv1 with Fused-MAC
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_rounding: Needed
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_align_preserved: 8-byte, except leaf SP
  Tag_ABI_enum_size: int
  Tag_ABI_VFP_args: VFP registers
  Tag_CPU_unaligned_access: v6
...

So the binary was built for the VFPv4 instruction set.

Then check the board:

root@imx6dl-albatross2:~# gcc -march=native -Q --help=target|grep march
  -march=                               armv7-a
  Known ARM architectures (for use with the -march= option):

root@imx6dl-albatross2:~# cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 3.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

processor       : 1
model name      : ARMv7 Processor rev 10 (v7l)
BogoMIPS        : 3.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

Hardware        : Freescale i.MX6 Quad/DualLite (Device Tree)
Revision        : 0000
Serial          : 0000000000000000

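The board, however, only supports VFPv3 (the Features line lists vfpv3 but not vfpv4). This mismatch is most likely what causes the unrecognized instruction.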
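The comparison above was done by eye. As a minimal sketch, the same check can be expressed in code, assuming the Features value from /proc/cpuinfo has already been read into a string. HasCpuFeature is a hypothetical helper name, not an existing API.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true when `feature` appears as a whole
// whitespace-separated token in a /proc/cpuinfo "Features" value.
// Whole-token matching keeps "vfp" from matching "vfpv3" by accident.
bool HasCpuFeature(const std::string& features, const std::string& feature) {
  std::istringstream tokens(features);
  std::string token;
  while (tokens >> token) {
    if (token == feature) return true;
  }
  return false;
}
```

With the Features line from this board, HasCpuFeature(features, "vfpv3") is true and HasCpuFeature(features, "vfpv4") is false, which matches the diagnosis.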
So the FPU compile option has to be changed.
At first I added -mfpu=vfpv3 to CXXFLAGS in \tensorflow\contrib\lite\tools\make\Makefile, but the result was still built for VFPv4. The compile log shows why:

arm-poky-linux-gnueabi-g++  -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a9 --sysroot=/opt/fsl-imx-x11/4.1.15-1.2.0/sysroots/cortexa9hf-vfp-neon-poky-linux-gnueabi -O3 -DNDEBUG -mfpu=vfpv3 -march=armv4t --std=c++11 -march=armv7-a -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize -fPIC -I. -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/../../../../../ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/../../../../../../ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/ -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/eigen -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/absl -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/gemmlowp -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/neon_2_sse -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/farmhash/src -I/home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/downloads/flatbuffers/include -I -I/usr/local/include -c tensorflow/contrib/lite/kernels/slice.cc -o /home/alcht0/share/project/tensorflow-v1.12.0/tensorflow-v1.12.0/tensorflow/contrib/lite/tools/make/gen/rpi_armv7l/obj/tensorflow/contrib/lite/kernels/slice.o

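It still contains -mfpu=neon-vfpv4, and that flag comes after my -mfpu=vfpv3, so it is the one that takes effect. The flag was evidently being set somewhere else, outside the Makefile. I searched the whole project for -mfpu and found another definition in \tensorflow\contrib\lite\tools\make\target\rpi_makefile.inc, a file whose name makes it obvious that it gets included. I commented out every -mfpu=neon-vfpv4 \ line in it.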

    CXXFLAGS += \
      -march=armv7-a \
      -mfpu=neon-vfpv4 \
      -funsafe-math-optimizations \
      -ftree-vectorize \
      -fPIC
    CCFLAGS += \
      -march=armv7-a \
      -mfpu=neon-vfpv4 \
      -funsafe-math-optimizations \
      -ftree-vectorize \
      -fPIC
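The extra -mfpu=vfpv3 had no effect because, when the same option is given several times, the last occurrence is the one that counts. The sketch below applies that rule to a command line. EffectiveMfpu is a hypothetical helper written for illustration, and the last-one-wins behaviour is an assumption about how GCC treats repeated -mfpu options.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: scans a compiler command line and returns the value of
// the last -mfpu= option, or an empty string when there is none.
// Assumption: a later -mfpu overrides an earlier one.
std::string EffectiveMfpu(const std::string& command_line) {
  const std::string prefix = "-mfpu=";
  std::istringstream args(command_line);
  std::string arg;
  std::string result;
  while (args >> arg) {
    // Keep overwriting so that the final match is what remains.
    if (arg.compare(0, prefix.size(), prefix) == 0) {
      result = arg.substr(prefix.size());
    }
  }
  return result;
}
```

For the flags in the log above (-mfpu=neon, then -mfpu=vfpv3, then -mfpu=neon-vfpv4) this returns neon-vfpv4. Once the rpi_makefile.inc entries are removed, it returns vfpv3.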

After rebuilding, VFPv4 no longer showed up in the compile log, and the new label_image ran on the board without problems:

root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 1 -m ./mobilenet_v1_0.25_128_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node  30 Operator Builtin Code  25
  Inputs: 4
  Outputs: 87
invoked
average time: 380.068 ms
0.164706: 401 academic gown
0.145098: 835 suit
0.0745098: 668 mortarboard
0.0745098: 458 bow tie
0.0509804: 653 military uniform

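The result does not look great, though. Other people get military uniform with a high probability. Perhaps the mobilenet_v1_0.25_128_quant.tflite model I used is not good enough, so let's try mobilenet_v1_1.0_224.tflite: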

root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 1 -m ./mobilenet_v1_1.0_224.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node  30 Operator Builtin Code  25
  Inputs: 31
  Outputs: 86
invoked
average time: 2784.13 ms
0.860174: 653 military uniform
0.0481022: 907 Windsor tie
0.007867: 466 bulletproof vest
0.00644933: 514 cornet
0.00608031: 543 drumstick

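Sure enough, the result is now correct, but the run time is also much longer.
For comparison, https://blog.csdn.net/computerme/article/details/80345065 reports only a little over 800 ms, so this board probably still lacks the performance.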

Switching to the quantized mobilenet_v1_1.0_224_quant.tflite:

root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 4 -m ./mobilenet_v1_1.0_224_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node  30 Operator Builtin Code  25
  Inputs: 4
  Outputs: 87
invoked
average time: 2311.57 ms
0.780392: 653 military uniform
0.105882: 907 Windsor tie
0.0156863: 458 bow tie
0.0117647: 466 bulletproof vest
0.00784314: 835 suit
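A side note on the scores printed for the quantized models: they are all multiples of 1/255, which is why values such as 0.0745098 repeat. This is consistent with a uint8 output being divided by 255 before printing. The sketch below assumes that conversion. ScoreFromUint8 is a hypothetical name, not a function taken from label_image.

```cpp
#include <cstdint>

// Hypothetical helper: maps a raw uint8 output value to a score in [0, 1].
// Assumption: the score is raw / 255, which matches the values in the logs.
float ScoreFromUint8(uint8_t raw) {
  return static_cast<float>(raw) / 255.0f;
}
```

Under this assumption, 0.780392 corresponds to a raw value of 199 and 0.105882 to a raw value of 27.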

In the linked post, quantization cut the run time significantly. On my board it only dropped from 2784.13 ms to 2311.57 ms...

mobilenet_v2_1.0_224_quant.tflite does not seem to improve things much either:

root@imx6dl-albatross2:~/vfpv3_build#  ./label_image -v 4 -m ./mobilenet_v2_1.0_224_quant.tflite  -i ./grace_hopper.bmp -l ./imagenet_slim_labels.txt
...
Node  64 Operator Builtin Code  22
  Inputs: 7 10
  Outputs: 172
invoked
average time: 2073.31 ms
0.717647: 653 military uniform
0.560784: 835 suit
0.533333: 458 bow tie
0.52549: 907 Windsor tie
0.517647: 753 racket

I will leave the timing problem for later. At least TensorFlow Lite is running now, so I can go on to write my own example.
A perfect end to the workday!
