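Deploying Auto-TVM on a phone was a painful experience: I hit a number of pitfalls, tried several approaches, and needed nearly three weeks to get it working. I had assumed that once the host and the phone were connected over RPC the job was almost done, but I celebrated too early. The pitfalls are described one by one below.
Pitfall 1: the RPC test script android_rpc_test.py fails
Start the RPC service on the host, bound to port 9191. In the RPC app on the phone, register the host's IP address and port and set the key to android. The phone and the host then connect successfully.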

Running the official android_rpc_test.py after that kept failing with an error.

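# Establish remote connection with target hardware
tracker = rpc.connect_tracker(tracker_host, tracker_port)
remote = tracker.request(key, priority=0,
                         session_timeout=10, max_retry=2)
The problem was in the code above: the error message showed that the socket connection between the host and the phone had failed.
My first suspicion was that, although the host and the phone could connect over RPC, the ports bound by the sockets (5001 on the phone, 9191 on the host) might not be reachable. A telnet check confirmed that the ports were reachable in both directions. I then added log statements to the TVM and RPC source code and found that the sockets could also receive and send data.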
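The telnet check can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name port_is_open and the IP addresses in the usage comment are hypothetical and need to be replaced with your own.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the address and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out
        return False


# Example usage (hypothetical addresses):
#   port_is_open("192.168.1.10", 9191)  # host side
#   port_is_open("192.168.1.20", 5001)  # phone side
```

As happened here, a reachable port only rules out a network problem; the RPC session can still fail for other reasons.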

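The last resort: connect the phone by USB cable and debug the RPC app source in Android Studio. When I ran android_rpc_test.py, the error log in Android Studio showed that libtvm4j.so and libtvm_runtime.so could not be found. This had been the cause of the failed RPC connection all along. The code that loads these two libraries is in tvm/jvm/core/src/main/java/org/apache/tvm/Base.java in the RPC source: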
static {
  boolean loadNativeRuntimeLib = true;
  try {
    try {
      tryLoadLibraryOS("tvm4j");
    } catch (UnsatisfiedLinkError e) {
      System.err.println("[WARN] TVM native library not found in path. ");
      NativeLibraryLoader.loadLibrary("tvm4j");
    }
  } catch (Throwable e) {
    System.err.println("[WARN] Couldn't find native library tvm4j.");
    e.printStackTrace();
    System.err.println("Try to load tvm4j (runtime packed version) ...");
    try {
      System.loadLibrary("tvm4j_runtime_packed");
      // if tvm runtime is packed in libtvm4j, we do not need to dlopen libtvm_runtime.so.
      loadNativeRuntimeLib = false;
    } catch (UnsatisfiedLinkError errFull) {
      System.err.println("[ERROR] Couldn't find native library tvm4j_runtime_packed.");
      throw new RuntimeException(errFull);
    }
  }
  System.err.println("libtvm4j loads successfully.");
  if (loadNativeRuntimeLib) {
    String tvmLibFilename = System.getProperty("libtvm.so.path");
    if (tvmLibFilename == null || !new File(tvmLibFilename).isFile()
        || _LIB.nativeLibInit(tvmLibFilename) != 0) {
      try {
        String runtimeLibname;
        String os = System.getProperty("os.name");
        // ref: http://lopica.sourceforge.net/os.html
        if (os.startsWith("Linux")) {
          runtimeLibname = "libtvm_runtime.so";
        } else if (os.startsWith("Mac")) {
          runtimeLibname = "libtvm_runtime.dylib";
        } else {
          // TODO(yizhi) support windows later
          throw new UnsatisfiedLinkError(os + " not supported currently");
        }
        NativeLibraryLoader.extractResourceFileToTempDir(runtimeLibname, new Action() {
          @Override public void invoke(File target) {
            System.err.println("Loading tvm runtime from " + target.getPath());
            checkCall(_LIB.nativeLibInit(target.getPath()));
          }
        });
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  } else {
    _LIB.nativeLibInit(null);
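  }
As the code shows, it first tries to load libtvm4j.so from the local library path; if that is not found, it looks for libtvm4j_runtime_packed.so. That library is built by the build.sh script in the jni directory of the RPC app project, which packs libtvm4j.so and libtvm_runtime.so into a single library file. If libtvm4j_runtime_packed.so loads successfully, libtvm_runtime.so does not need to be loaded separately.
Running build.sh in the jni directory of the RPC app project builds libtvm4j_runtime_packed.so for different platforms, such as arm64-v8a and x86_64. The target platforms can be changed by editing Application.mk in the same directory.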
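Before installing the app, it can help to confirm that the packed library really ended up in the APK for the phone's ABI. The sketch below relies on the fact that an APK is a zip archive with native libraries stored under lib/<abi>/; the helper names are hypothetical, and so is any APK path you pass in.

```python
import zipfile


def native_libs_in_apk(apk, abi):
    """Return the sorted .so file names packaged under lib/<abi>/ in an APK."""
    prefix = f"lib/{abi}/"
    with zipfile.ZipFile(apk) as zf:
        return sorted(
            name[len(prefix):]
            for name in zf.namelist()
            if name.startswith(prefix) and name.endswith(".so")
        )


def has_packed_runtime(apk, abi="arm64-v8a"):
    """Check that libtvm4j_runtime_packed.so is present for the given ABI."""
    return "libtvm4j_runtime_packed.so" in native_libs_in_apk(apk, abi)


# Example usage (hypothetical path):
#   has_packed_runtime("app/build/outputs/apk/release/app-release.apk")
```

If the function returns False for the phone's ABI, the library was either not built for that ABI or not packaged into the app.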

With the library built, one last step remains: add the library-loading code to app/src/main/java/org/apache/tvm/tvmrpc/RPCProcessor.java.
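public RPCProcessor(Activity activity) {
  super();
  rpc_activity = activity;
  System.loadLibrary("c++_shared");
  System.loadLibrary("tvm4j_runtime_packed");
  // System.loadLibrary("tvm4j");
  // System.loadLibrary("tvm_runtime");
}
With this change android_rpc_test.py ran successfully and returned correct results. A problem that had blocked me for two weeks was finally solved!
In addition, deploying to the phone requires exporting the model as a library file for the ARM architecture, which needs the path to the NDK compiler. The path can be set through the TVM_NDK_CC environment variable; I simply hard-coded it in tvm/python/tvm/contrib/ndk.py.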
# if "TVM_NDK_CC" not in os.environ:
#     raise RuntimeError("Require environment variable TVM_NDK_CC"
#                        " to be the NDK standalone compiler")
# compiler = os.environ["TVM_NDK_CC"]
compiler = '/data_1/Projects/android/android-ndk/android-ndk-r21/opt2/android-toolchain-arm64/bin/aarch64-linux-android-clang'
cmd = [compiler]

Pitfall 2: the Auto-TVM test script deploy_model_on_android.py fails
With the RPC connection and data transfer working and android_rpc_test.py running, I was a little excited. Then I tried the official deploy_model_on_android.py, and it failed with yet another error!

That log did not explain much; it only narrowed the failure down to the RPC load_module call. The runtime log of the RPC app in Android Studio showed the following:


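The first log entry shows that the exported resnet18.so was uploaded to the phone successfully. The second shows: Binary was created using GraphRuntimeFactory but a loader of that name is not registered. In other words, GraphRuntimeFactory, the loader needed to parse resnet18.so, is not registered in tvm_runtime. The error is raised in tvm/src/runtime/library_module.cc in the TVM source: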
for (uint64_t i = 0; i < size; ++i) {
  std::string tkey;
  CHECK(stream->Read(&tkey));
  // Currently, _lib is for DSOModule, but we
  // don't have loadbinary function for it currently
  VLOG(true) << " ProcessModuleBlob tkey: " << tkey << "\n";
  if (tkey == "_lib") {
    auto dso_module = Module(make_object<LibraryModuleNode>(lib));
    modules.emplace_back(dso_module);
  } else if (tkey == "_import_tree") {
    CHECK(stream->Read(&import_tree_row_ptr));
    CHECK(stream->Read(&import_tree_child_indices));
  } else {
    std::string loadkey = "runtime.module.loadbinary_";
    std::string fkey = loadkey + tkey;
    // std::string fkey = "runtime.module.loadbinary_GraphRuntimeFactory";
    VLOG(true) << " ProcessModuleBlob fkey: " << fkey << "\n";
    const PackedFunc* f = Registry::Get(fkey);
    if (f == nullptr) {
      std::string loaders = "";
      for (auto name : Registry::ListNames()) {
        if (name.rfind(loadkey, 0) == 0) {
          if (loaders.size() > 0) {
            loaders += ", ";
          }
          loaders += name.substr(loadkey.size());
        }
      }
      VLOG(true) << " ProcessModuleBlob loaders: " << loaders << "\n";
      CHECK(f != nullptr)
          << "Binary was created using " << tkey
          << " but a loader of that name is not registered. Available loaders are " << loaders
          << ". Perhaps you need to recompile with this runtime enabled.";
    }
    Module m = (*f)(static_cast<void*>(stream));
    modules.emplace_back(m);
  }
}
The cause is that const PackedFunc* f = Registry::Get(fkey) finds no function for the key "runtime.module.loadbinary_GraphRuntimeFactory". Yet tvm/src/runtime/graph/graph_runtime_factory.cc does register that key:
TVM_REGISTER_GLOBAL("runtime.module.loadbinary_GraphRuntimeFactory")
.set_body_typed(GraphRuntimeFactoryModuleLoadBinary);
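So the loader for GraphRuntimeFactory is in fact registered in the tvm_runtime source. Where was the problem, then?!
It turned out that tvm_runtime.h in the RPC project (tvm/apps/android_rpc/app/src/main/jni/tvm_runtime.h) does not include graph_runtime_factory.cc, which is why the function was reported as missing. The fix is to add one line that includes graph_runtime_factory.cc: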
#include "../src/runtime/graph/graph_runtime_factory.cc"
After rerunning build.sh in the jni directory, the generated libtvm4j_runtime_packed.so contains the function, and deploy_model_on_android.py runs successfully.
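The failing lookup is easier to see in a toy model. The sketch below is not TVM code: Registry and find_loader are hypothetical Python stand-ins that mirror the name-based lookup in the C++ excerpt above. A runtime built without graph_runtime_factory.cc behaves like a registry in which the corresponding register call never ran.

```python
LOAD_PREFIX = "runtime.module.loadbinary_"


class Registry:
    """Toy stand-in for a global function registry keyed by name."""

    def __init__(self):
        self._funcs = {}

    def register(self, name, func):
        self._funcs[name] = func

    def get(self, name):
        # Returns None when the name was never registered
        return self._funcs.get(name)

    def list_names(self):
        return sorted(self._funcs)


def find_loader(registry, tkey):
    """Look up the loader for a module type key, or fail listing what exists."""
    fkey = LOAD_PREFIX + tkey
    func = registry.get(fkey)
    if func is None:
        loaders = ", ".join(
            name[len(LOAD_PREFIX):]
            for name in registry.list_names()
            if name.startswith(LOAD_PREFIX)
        )
        raise LookupError(
            f"Binary was created using {tkey} but a loader of that name "
            f"is not registered. Available loaders are {loaders}."
        )
    return func
```

The point of the model is that registration happens only if the source file containing the register call is compiled into the library, which is exactly what the added #include line ensures.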

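This is a bug in TVM, and it took nearly a week to resolve. With that, the official Auto-TVM and RPC model deployment code all ran successfully!
Pitfall 3: the Auto-TVM test script tune_relay_arm.py fails
When actually running Auto-TVM on the phone to tune the convolution operators of mobilenet_v2, the output values stayed at 0, as shown in the figure below: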

Debugging traced the problem to tvm/python/tvm/autotvm/measure/measure.py.
try:
    random_fill = remote.get_function("tvm.contrib.random.random_fill")
except AttributeError:
    raise AttributeError("Please make sure USE_RANDOM is ON in the config.cmake "
                         "on the remote devices")
remote.get_function("tvm.contrib.random.random_fill") has to fetch the tvm.contrib.random.random_fill function from tvm_runtime on the phone. If it cannot, it raises "Please make sure USE_RANDOM is ON in the config.cmake on the remote devices", meaning that USE_RANDOM=ON must be set when building tvm_runtime. The function is registered in tvm/src/runtime/contrib/random/random.cc:
TVM_REGISTER_GLOBAL("tvm.contrib.random.random_fill").set_body([](TVMArgs args, TVMRetValue* ret) {
  RandomThreadLocalEntry* entry = RandomThreadLocalEntry::ThreadLocal();
  DLTensor* out = args[0];
  entry->random_engine.RandomFill(out);
});
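The fix: include tvm/src/runtime/contrib/random/random.cc in the JNI header tvm_runtime.h of the RPC Android project, then rerun build.sh to rebuild the TVM runtime library.
#include "../src/runtime/contrib/random/random.cc"
Rerunning tune_relay_arm.py then gave the output shown in the figure below, which is normal.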

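To sum up: when running TVM on a phone, if fetching a function fails even though that function is registered in TVM, first check whether the JNI header tvm_runtime.h of the RPC Android project includes the file that implements it. The header currently does not include all the required .cc files, so you have to add the missing ones yourself.
Next, I plan to study the TVM code itself.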
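This check can be run from the host before starting a long tuning job. The sketch below assumes, as the measure.py excerpt above does, that remote.get_function raises AttributeError for a name that is not registered on the device. The helper name and the list REQUIRED_ON_DEVICE are hypothetical; the two entries are simply the functions that were missing in this article.

```python
# Functions that turned out to be missing from the Android runtime in this article
REQUIRED_ON_DEVICE = [
    "runtime.module.loadbinary_GraphRuntimeFactory",
    "tvm.contrib.random.random_fill",
]


def missing_remote_functions(remote, names):
    """Return the names that the remote session cannot resolve."""
    missing = []
    for name in names:
        try:
            remote.get_function(name)
        except AttributeError:
            # Assumed failure mode for an unregistered function
            missing.append(name)
    return missing


# Example usage, with `remote` obtained from tracker.request(...):
#   print(missing_remote_functions(remote, REQUIRED_ON_DEVICE))
```

Each name reported as missing points at a .cc file that probably still needs to be included in tvm_runtime.h before rebuilding.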