TensorFlow debugger (tfdbg) is a specialized debugger for TensorFlow. It lets you view the internal structure and states of running TensorFlow graphs during training and inference. Because of TensorFlow's computation-graph paradigm, these are difficult to inspect with general-purpose debuggers such as Python's pdb.
NOTE: The system requirements of tfdbg on supported external platforms include the following. On Mac OS X, the ncurses library is required. It can be installed with brew install homebrew/dupes/ncurses. On Windows, pyreadline is required. If you use Anaconda3, you can install it with a command such as "C:\Program Files\Anaconda3\Scripts\pip.exe" install pyreadline.
This tutorial demonstrates how to use the tfdbg command-line interface (CLI) to debug the appearance of nans and infs, a frequently-encountered type of bug in TensorFlow model development. The following example is for users who use the low-level Session API of TensorFlow. A later section of this document describes how to use tfdbg with a higher-level API, namely tf-learn Estimators and Experiments. To observe such an issue, run the following command without the debugger (the source code can be found here):
python -m tensorflow.python.debug.examples.debug_mnist
This code trains a simple neural network for MNIST digit image recognition. Notice that the accuracy increases slightly after the first training step, but then gets stuck at a low (near-chance) level:
Accuracy at step 0: 0.1113
Accuracy at step 1: 0.3183
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Wondering what might have gone wrong, you suspect that certain nodes in the training graph generated bad numeric values such as infs and nans, because this is a common cause of this type of training failure. Let’s use tfdbg to debug this issue and pinpoint the exact graph node where this numeric problem first surfaced.
1. Wrapping TensorFlow Sessions with tfdbg
To add support for tfdbg in our example, all that is needed is to add the following lines of code and wrap the Session object with a debugger wrapper. This code is already added in debug_mnist.py, so you can activate the tfdbg CLI with the --debug flag at the command line.
# Let your BUILD target depend on "//tensorflow/python/debug:debug_py"
# (You don't need to worry about the BUILD dependency if you are using a pip
# install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
This wrapper has the same interface as Session, so enabling debugging requires no other changes to the code. The wrapper provides additional features, including:
- Bringing up a CLI before and after Session.run() calls, to let you control the execution and inspect the graph's internal state.
- Allowing you to register special filters for tensor values, to facilitate the diagnosis of issues.
In this example, we have already registered a tensor filter called tfdbg.has_inf_or_nan, which simply determines if there are any nan or inf values in any intermediate tensors (tensors that are neither inputs nor outputs of the Session.run() call, but are in the path leading from the inputs to the outputs). This filter for nans and infs is a common enough use case that we ship it with the debug_data module.
Note: You can also write your own custom filters. See the API documentation of DebugDumpDir.find() for additional information.
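For intuition, a filter in the same spirit as has_inf_or_nan could be written as follows. This is a simplified sketch (the function name is ours), not the actual implementation shipped in the debug_data module, which also handles non-float and uninitialized tensors:

import numpy as np

def rough_has_inf_or_nan(datum, tensor):
  # "datum" carries metadata about the dumped tensor; only the value is used here.
  # Flag the tensor if any of its elements is inf or nan.
  return np.any(np.isinf(tensor)) or np.any(np.isnan(tensor))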
2. Debugging Model Training with tfdbg
Let's try training the model again, but with the --debug flag added this time:
python -m tensorflow.python.debug.examples.debug_mnist --debug
The debug wrapper session will prompt you when it is about to execute the first Session.run() call, with information regarding the fetched tensors and feed dictionaries displayed on the screen.
This is what we refer to as the run-start CLI. It lists the feeds and fetches of the current Session.run() call before executing anything.
If the screen size is too small to display the content of the message in its entirety, you can resize it.
Use the PageUp / PageDown / Home / End keys to navigate the screen output. On most keyboards lacking those keys, Fn + Up / Fn + Down / Fn + Right / Fn + Left will work.
Enter the run command (or just r) at the command prompt:
tfdbg> run
The run command causes tfdbg to execute until the end of the next Session.run() call, which calculates the model's accuracy using a test data set. tfdbg augments the runtime Graph to dump all intermediate tensors. After the run ends, tfdbg displays all the dumped tensor values in the run-end CLI.
This list of tensors can also be obtained by running the command lt after you execute run.
2.1 tfdbg CLI Frequently-Used Commands
Try the following commands at the tfdbg> prompt (referencing the code at tensorflow/python/debug/examples/debug_mnist.py):
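The full command reference is available via the help command inside the CLI; the following are the commands this tutorial relies on later:

- lt: list the intermediate tensors dumped in the Session.run() call
- pt <tensor_name>: print the value of a dumped tensor, e.g., pt cross_entropy/Log:0
- ni <node_name>: show information about a graph node; add -t to include the construction traceback
- run: proceed to the next Session.run() call
- help: print general help information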
Note that each time you enter a command, a new screen output will appear. This is somewhat analogous to web pages in a browser. You can navigate between these screens by clicking the <-- and --> text arrows near the top-left corner of the CLI.
2.2 Other Features of the tfdbg CLI
In addition to the commands listed above, the tfdbg CLI provides the following additional features:
- To navigate through previous tfdbg commands, type in a few characters followed by the Up or Down arrow keys. tfdbg will show you the history of commands that started with those characters.
- To navigate through the history of screen outputs, do either of the following:
  - Use the prev and next commands.
  - Click the underlined <-- and --> links near the top left corner of the screen.
- Tab completion of commands and some command arguments.
- To redirect the screen output to a file instead of the screen, end the command with bash-style redirection. For example, the following command redirects the output of the pt command to the /tmp/xent_value_slices.txt file:
tfdbg> pt cross_entropy/Log:0[:, 0:10] > /tmp/xent_value_slices.txt
2.3 Finding nans and infs
In this first Session.run() call, there happen to be no problematic numerical values. You can move on to the next run by using the command run or its shorthand r.
TIP: If you enter run or r repeatedly, you will be able to move through the Session.run() calls in a sequential manner.
You can also use the -t flag to move ahead a number of Session.run() calls at a time, for example:
tfdbg> run -t 10
Instead of entering run repeatedly and manually searching for nans and infs in the run-end UI after every Session.run() call (for example, by using the pt command shown above), you can use the following command to let the debugger repeatedly execute Session.run() calls without stopping at the run-start or run-end prompt, until the first nan or inf value shows up in the graph. This is analogous to conditional breakpoints in some procedural-language debuggers:
tfdbg> run -f has_inf_or_nan
NOTE: The preceding command works properly because a tensor filter called has_inf_or_nan has been registered for you when the wrapped session is created. This filter detects nans and infs (as explained previously). If you have registered any other filters, you can use "run -f" to have tfdbg run until any tensor triggers that filter (cause the filter to return True).
For example, you can define and register a custom filter like the following:

def my_filter_callable(datum, tensor):
  # A filter that detects zero-valued scalars.
  return len(tensor.shape) == 0 and tensor == 0.0

sess.add_tensor_filter('my_filter', my_filter_callable)
Then, at the tfdbg run-start prompt, run until your filter is triggered:
tfdbg> run -f my_filter
See this API document for more information on the expected signature and return value of the predicate Callable used with add_tensor_filter().
As the screen display indicates on the first line, the has_inf_or_nan filter is first triggered during the fourth Session.run() call: an Adam optimizer forward-backward training pass on the graph. In this run, 36 (out of the total 95) intermediate tensors contain nan or inf values. These tensors are listed in chronological order, with their timestamps displayed on the left. At the top of the list, you can see the first tensor in which the bad numerical values first surfaced: cross_entropy/Log:0.
To view the value of the tensor, click the underlined tensor name cross_entropy/Log:0 or enter the equivalent command:
tfdbg> pt cross_entropy/Log:0
Scroll down a little and you will notice some scattered inf values. If the instances of inf and nan are difficult to spot by eye, you can use the following command to perform a regex search and highlight the output:
tfdbg> /inf
Or, alternatively:
tfdbg> /(inf|nan)
You can also use the -s or --numeric_summary flag of the pt command to get a quick summary of the types of numeric values in the tensor:
tfdbg> pt -s cross_entropy/Log:0
From the summary, you can see that several of the 1000 elements of the cross_entropy/Log:0 tensor are -infs (negative infinities).
Why did these infinities appear? To further debug, display more information about the node cross_entropy/Log by clicking the underlined node_info menu item on the top or entering the equivalent node_info (ni) command:
tfdbg> ni cross_entropy/Log
You can see that this node has the op type Log and that its input is the node softmax/Softmax. Run the following command to take a closer look at the input tensor:
tfdbg> pt softmax/Softmax:0
Examine the values in the input tensor, searching for zeros:
tfdbg> /0\.000
Indeed, there are zeros. Now it is clear that the origin of the bad numerical values is the node cross_entropy/Log taking logs of zeros. To find out the culprit line in the Python source code, use the -t flag of the ni command to show the traceback of the node’s construction:
tfdbg> ni -t cross_entropy/Log
If you click “node_info” at the top of the screen, tfdbg automatically shows the traceback of the node’s construction.
From the traceback, you can see that the op is constructed at the following line in debug_mnist.py:
diff = y_ * tf.log(y)
tfdbg has a feature that makes it easy to trace Tensors and ops back to lines in Python source files. It can annotate lines of a Python file with the ops or Tensors created by them. To use this feature, simply click the underlined line numbers in the stack trace output of the ni -t command, or use the ps (or print_source) command, such as: ps /path/to/source.py. For example, the following screenshot shows the output of a ps command.
tfdbg run-end UI: annotated Python source file
2.4 Fixing the problem
To fix the problem, edit debug_mnist.py, changing the original line:
diff = -(y_ * tf.log(y))
to the built-in, numerically-stable implementation of softmax cross-entropy:
diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits)
Rerun with the --debug flag as follows:
python -m tensorflow.python.debug.examples.debug_mnist --debug
At the tfdbg> prompt, enter the following command:
run -f has_inf_or_nan
Confirm that no tensors are flagged as containing nan or inf values, and accuracy now continues to rise rather than getting stuck. Success!
3. Debugging tf-learn Estimators and Experiments
This section explains how to debug TensorFlow programs that use the Estimator and Experiment APIs. Part of the convenience provided by these APIs is that they manage Sessions internally. This makes the LocalCLIDebugWrapperSession described in the preceding sections inapplicable. Fortunately, you can still debug them by using special hooks provided by tfdbg.
3.1 Debugging tf.contrib.learn Estimators
Currently, tfdbg can debug the fit() and evaluate() methods of tf-learn Estimators. To debug Estimator.fit(), create a LocalCLIDebugHook and supply it in the monitors argument. For example:
# First, let your BUILD target depend on "//tensorflow/python/debug:debug_py"
# (You don't need to worry about the BUILD dependency if you are using a pip
# install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug
# Create a LocalCLIDebugHook and use it as a monitor when calling fit().
hooks = [tf_debug.LocalCLIDebugHook()]
classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=1000,
               monitors=hooks)
To debug Estimator.evaluate(), assign hooks to the hooks parameter, as in the following example:
accuracy_score = classifier.evaluate(x=test_set.data,
                                     y=test_set.target,
                                     hooks=hooks)["accuracy"]
debug_tflearn_iris.py, based on tf-learn's iris tutorial, contains a full example of how to use tfdbg with Estimators. To run this example, do:
python -m tensorflow.python.debug.examples.debug_tflearn_iris --debug
3.2 Debugging tf.contrib.learn Experiments
Experiment is a construct in tf.contrib.learn at a higher level than Estimator. It provides a single interface for training and evaluating a model. To debug the train() and evaluate() calls to an Experiment object, you can use the keyword arguments train_monitors and eval_hooks, respectively, when calling its constructor. For example:
# First, let your BUILD target depend on "//tensorflow/python/debug:debug_py"
# (You don't need to worry about the BUILD dependency if you are using a pip
# install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug
hooks = [tf_debug.LocalCLIDebugHook()]
ex = experiment.Experiment(classifier,
                           train_input_fn=iris_input_fn,
                           eval_input_fn=iris_input_fn,
                           train_steps=FLAGS.train_steps,
                           eval_delay_secs=0,
                           eval_steps=1,
                           train_monitors=hooks,
                           eval_hooks=hooks)
ex.train()
accuracy_score = ex.evaluate()["accuracy"]
To build and run the debug_tflearn_iris example in the Experiment mode, do:
python -m tensorflow.python.debug.examples.debug_tflearn_iris \
--use_experiment --debug
The LocalCLIDebugHook also allows you to configure a watch_fn that can be used to flexibly specify what Tensors to watch on different Session.run() calls, as a function of the fetches and feed_dict and other states. See this API doc for more details.
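As an illustration, a watch_fn receives the fetches and feed_dict of each run() call and returns a tf_debug.WatchOptions object; the function name and the filtering criterion below are purely hypothetical:

def watch_softmax_nodes(fetches, feed_dict):
  # Ignore the fetches/feeds and always watch nodes whose names match "softmax.*".
  return tf_debug.WatchOptions(
      debug_ops=["DebugIdentity"],
      node_name_regex_whitelist=r"softmax.*")

hooks = [tf_debug.LocalCLIDebugHook(watch_fn=watch_softmax_nodes)]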
4. Debugging Keras Models with TFDBG
To use TFDBG with Keras, let the Keras backend use a TFDBG-wrapped Session object. For example, to use the CLI wrapper:
import tensorflow as tf
from keras import backend as keras_backend
from tensorflow.python import debug as tf_debug
keras_backend.set_session(tf_debug.LocalCLIDebugWrapperSession(tf.Session()))
# Define your keras model, called "model".
model.fit(...) # This will break into the TFDBG CLI.
5. Debugging tf-slim with TFDBG
TFDBG currently supports only training with tf-slim. To debug the training process, provide LocalCLIDebugWrapperSession to the session_wrapper argument of slim.learning.train(). For example:
import tensorflow as tf
from tensorflow.python import debug as tf_debug
# ... Code that creates the graph and the train_op ...
tf.contrib.slim.learning.train(
    train_op,
    logdir,
    number_of_steps=10,
    session_wrapper=tf_debug.LocalCLIDebugWrapperSession)
6. Offline Debugging of Remotely-Running Sessions
Often, your model is running on a remote machine or in a process that you don't have terminal access to. To perform model debugging in such cases, you can use the offline_analyzer binary of tfdbg (described below). It operates on dumped data directories. This works with both the lower-level Session API and the higher-level Estimator and Experiment APIs.
6.1 Debugging Remote tf.Sessions
If you interact directly with the tf.Session API in Python, you can configure the RunOptions proto that you pass to your Session.run() call by using the method tfdbg.watch_graph. This will cause the intermediate tensors and runtime graphs to be dumped to a shared storage location of your choice when the Session.run() call occurs (at the cost of slower performance). For example:
from tensorflow.python import debug as tf_debug
# ... Code where your session and graph are set up...
run_options = tf.RunOptions()
tf_debug.watch_graph(
    run_options,
    session.graph,
    debug_urls=["file:///shared/storage/location/tfdbg_dumps_1"])
# Be sure to specify different directories for different run() calls.
session.run(fetches, feed_dict=feeds, options=run_options)
Later, in an environment that you have terminal access to (for example, a local computer that can access the shared storage location specified in the code above), you can load and inspect the data in the dump directory on the shared storage by using the offline_analyzer binary of tfdbg. For example:
python -m tensorflow.python.debug.cli.offline_analyzer \
--dump_dir=/shared/storage/location/tfdbg_dumps_1
The Session wrapper DumpingDebugWrapperSession offers an easier and more flexible way to generate file-system dumps that can be analyzed offline. To use it, simply wrap your session in a tf_debug.DumpingDebugWrapperSession. For example:
# Let your BUILD target depend on "//tensorflow/python/debug:debug_py
# (You don't need to worry about the BUILD dependency if you are using a pip
# install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug
sess = tf_debug.DumpingDebugWrapperSession(
    sess, "/shared/storage/location/tfdbg_dumps_1/", watch_fn=my_watch_fn)
The watch_fn argument accepts a Callable that allows you to configure what tensors to watch on different Session.run() calls, as a function of the fetches and feed_dict to the run() call and other states.
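For example, the my_watch_fn referenced above could be defined along these lines (the dtype-based criterion is only an illustration):

def my_watch_fn(fetches, feed_dict):
  # Dump only float32 tensors to keep the amount of data written to disk manageable.
  return tf_debug.WatchOptions(tensor_dtype_regex_whitelist="float32")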
6.2 C++ and other languages
If your model code is written in C++ or other languages, you can also modify the debug_options field of RunOptions to generate debug dumps that can be inspected offline. See the proto definition for more details.
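For reference, these are the same fields that tfdbg.watch_graph populates from Python. The sketch below shows the manual equivalent; the node name and dump location are hypothetical, and the field names should be verified against debug.proto for your TensorFlow version:

run_options = tf.RunOptions()
watch = run_options.debug_options.debug_tensor_watch_opts.add()
watch.node_name = "softmax/Softmax"  # hypothetical node to watch
watch.output_slot = 0                # index of the node's output tensor
watch.debug_ops.append("DebugIdentity")
watch.debug_urls.append("file:///shared/storage/location/tfdbg_dumps_1")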
6.3 Debugging Remotely-Running tf-learn Estimators and Experiments
If your remote TensorFlow server runs Estimators, you can use the non-interactive DumpingDebugHook. For example:
# Let your BUILD target depend on "//tensorflow/python/debug:debug_py
# (You don't need to worry about the BUILD dependency if you are using a pip
# install of open-source TensorFlow.)
from tensorflow.python import debug as tf_debug
hooks = [tf_debug.DumpingDebugHook("/shared/storage/location/tfdbg_dumps_1")]
Then this hook can be used in the same way as the LocalCLIDebugHook examples described earlier in this document. As the training and/or evaluation of the Estimator or Experiment happens, tfdbg creates directories having the following name pattern: /shared/storage/location/tfdbg_dumps_1/run_<epoch_timestamp_microsec>_<uuid>. Each directory corresponds to a Session.run() call that underlies the fit() or evaluate() call. You can load these directories and inspect them in a command-line interface in an offline manner using the offline_analyzer offered by tfdbg. For example:
python -m tensorflow.python.debug.cli.offline_analyzer \
--dump_dir="/shared/storage/location/tfdbg_dumps_1/run_<epoch_timestamp_microsec>_<uuid>"
7. Frequently Asked Questions
Q: Do the timestamps on the left side of the lt output reflect actual performance in a non-debugging session?
A: No. The debugger inserts additional special-purpose debug nodes to the graph to record the values of intermediate tensors. These nodes slow down the graph execution. If you are interested in profiling your model, check out:
- The profiling mode of tfdbg: tfdbg> run -p.
- tfprof and other profiling tools for TensorFlow.
Q: How do I link tfdbg against my Session in Bazel? Why do I see an error such as “ImportError: cannot import name debug”?
A: In your BUILD rule, declare dependencies: "//tensorflow:tensorflow_py" and "//tensorflow/python/debug:debug_py". The first is the dependency that you include to use TensorFlow even without debugger support; the second enables the debugger. Then, in your Python file, add:
from tensorflow.python import debug as tf_debug
# Then wrap your TensorFlow Session with the local-CLI wrapper.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
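For the BUILD rule itself, a minimal sketch might look like the following (target and file names are hypothetical):

py_binary(
    name = "debug_my_model",
    srcs = ["debug_my_model.py"],
    deps = [
        "//tensorflow:tensorflow_py",
        "//tensorflow/python/debug:debug_py",
    ],
)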
Q: Does tfdbg help debug runtime errors such as shape mismatches?
A: Yes. tfdbg intercepts errors generated by ops during runtime and presents the errors with some debug instructions to the user in the CLI. See examples:
# Debugging shape mismatch during matrix multiplication.
python -m tensorflow.python.debug.examples.debug_errors \
--error shape_mismatch --debug
# Debugging uninitialized variable.
python -m tensorflow.python.debug.examples.debug_errors \
--error uninitialized_variable --debug
Q: How can I let my tfdbg-wrapped Sessions or Hooks run the debug mode only from the main thread?
A: This is a common use case, in which the Session object is used from multiple threads concurrently. Typically, the child threads take care of background tasks such as running enqueue operations. Often, you want to debug only the main thread (or less frequently, only one of the child threads). You can use the thread_name_filter keyword argument of LocalCLIDebugWrapperSession to achieve this type of thread-selective debugging. For example, to debug from the main thread only, construct a wrapped Session as follows:
sess = tf_debug.LocalCLIDebugWrapperSession(sess, thread_name_filter="MainThread$")
The above example relies on the fact that main threads in Python have the default name MainThread.
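Similarly, to debug only one of the child threads, pass a regex matching that thread's name. This example assumes Python's default child-thread naming ("Thread-1", "Thread-2", and so on):

sess = tf_debug.LocalCLIDebugWrapperSession(sess, thread_name_filter=r"Thread-1$")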
Q: The model I am debugging is very large. The data dumped by tfdbg fills up the free space of my disk. What can I do?
A: You might encounter this problem in any of the following situations:
- models with many intermediate tensors
- very large intermediate tensors
- many tf.while_loop iterations
There are three possible workarounds or solutions:
- The constructors of LocalCLIDebugWrapperSession and LocalCLIDebugHook provide a keyword argument, dump_root, to specify the path to which tfdbg dumps the debug data. You can use it to let tfdbg dump the debug data on a disk with larger free space.
- Reduce the batch size used during the runs.
- Use the filtering options of tfdbg's run command to watch only specific nodes in the graph, for example run --node_name_filter, run --op_type_filter, and run --tensor_dtype_filter.

Q: Why do I see no dumped tensors when I debug code like the following?
a = tf.ones([10], name="a")
b = tf.add(a, a, name="b")
sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.run(b)
A: The reason why you see no data dumped is that every node in the executed TensorFlow graph is constant-folded by the TensorFlow runtime. In this example, a is a constant tensor; therefore, the fetched tensor b is effectively also a constant tensor. TensorFlow's graph optimization folds the graph that contains a and b into a single node to speed up future runs of the graph, which is why tfdbg does not generate any intermediate tensor dumps. However, if a were a tf.Variable, as in the following example:
import numpy as np
a = tf.Variable(np.ones(10), name="a")
b = tf.add(a, a, name="b")
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.run(b)
the constant-folding would not occur and tfdbg should show the intermediate tensor dumps.