【转】如何比较两个浮点数

https://bot-man-jl.github.io/articles/?post=2020/Comparing-Floating-Point-Numbers

比较两个浮点数是否相等 并不是一个简单的问题：

由于浮点数的精度误差，一般不能使用 绝对相等 比较；
基于 绝对误差 的 近似相等 比较要求使用者对允许的误差范围有明确的预期，并不通用；
基于 相对误差 的 近似相等 比较看似通用，但又可能会在 0 附近 “栽跟头”。。。

本文主要参考了 “浮点数专家” Bruce Dawson 写的 Comparing Floating Point Numbers, 2012 Edition，也推荐他关于浮点数的其他文章（最近还一直在更新 ?）。

本文提到的 Equals() / AlmostEqualsAbs() / AlmostEqualsRel() / AlmostEqualsUlp() 函数的具体实现参考 almost_equals.h（在线演示）?

绝对相等？不靠谱

比较两个数是否相等，最直观的 比较方法 就是直接使用 == 进行比较：

f1 == f2  // Equals()

根据 IEEE 754 浮点数标准，浮点数 \pm p \cdot b^e±p⋅be（其中 bb 为固定的基数，\pm± 为符号位，pp 为尾数，ee 为指数）表示为 符号位-指数-尾数 形式：

十进制浮点数（b == 10）就是数学中常用的 科学计数法 (scientific notation)
二进制浮点数（b = 2）常用于计算机中的表示和存储，其中 单精度 (single-precision) 对应 32 位，双精度 (double-precision) 对应 64 位

在 C++ 中，0.1 属于双精度浮点数，0.1f 属于单精度浮点数。
为什么计算机使用二进制浮点数？因为基于现有的 CPU 架构，二进制运算效率更高。

虽然在数学上，实数是 连续的 —— 任意十进制数和二进制数可以 相互转换，它们的浮点数表示形式也可以相互转换：

十进制定点数	十进制浮点数	二进制定点数	二进制浮点数
`0.125`	1.25 \times 10^{-1}1.25×10−1	`0.001`	(2^{0} + 0) \times 2^{-3}20+0)×2−3
`0.1`	1.0 \times 10^{-1}1.0×10−1	`0.000110011...` （`0011` 循环）	(2^{0} + 2^{-1} + 2^{-4} + 2^{-5} + ...) \times 2^{-4}(20+2−1+2−4+2−5+...)×2−4

但是在计算机中，数值是 离散的 —— 有限精度的二进制浮点数 不能准确表示 所有的十进制数（在线转换工具）：

十进制数 0.1 对应的二进制浮点数的尾数 pp 是 1.100110011...（0011 循环）
由于 32 位单精度浮点数的尾数 pp 只有 23 位，只能保留小数点后 23 位（即 1.10011001100110011001101）
所以，近似表示的二进制数 0.000110011001100110011001101 约等于十进制数 0.10000000149，出现误差

存在问题 —— 有限精度 的浮点数转换 近似表示 会带来误差，而浮点数运算则会 放大误差，所以判断两个浮点数（其中至少一个是运算结果）的 绝对相等 往往是不可靠的：

float sum = 0.0f;
for (int i = 0; i < 10; ++i) { sum += 0.1f; }
assert(Equals(sum, 1.0f) == false);  // expect true

为此，编译器会警告浮点数的 ==/!= 比较：

warning: comparing floating point with == or != is unsafe [-Wfloat-equal]

解决办法 —— 由于误差的存在，只能比较两个浮点数是否 近似相等：

assert(AlmostEqualsAbs(sum, 1.0f) == true);
assert(AlmostEqualsRel(sum, 1.0f) == true);
assert(AlmostEqualsUlp(sum, 1.0f) == true);

近似相等 —— 绝对误差

最简单的 比较方法 是直接比较两数差值的绝对值是否小于一个 允许的误差值 epsilon：

fabs(f1 - f2) <= epsilon  // AlmostEqualsAbs()

其中，epsilon 常用 FLT_EPSILON / DBL_EPSILON，表示在单精度 1.0f / 双精度 1.0 附近的最小误差。

存在问题 —— 由于 绝对误差 是一个固定值，不一定适用于任意浮点数的比较：

assert(AlmostEqualsAbs(67329.234f, 67329.242f) == false);  // expect true
assert(AlmostEqualsAbs(1.2e-32f, 2.4e-32f) == true);       // expect false

当比较的两个数远大于 1 时，FLT_EPSILON/DBL_EPSILON 允许的误差过小（例如，虽然 67329.234f 产生误差后会被近似表示为最近的下一个浮点数 67329.242f，但差值远大于 FLT_EPSILON，所以被误判为不相等 ?）
当比较的两个数远小于 1 时，FLT_EPSILON/DBL_EPSILON 允许的误差过大（例如，尽管 1.2e-32f 和 2.4e-32f 相差了一倍，应该被认为不相等，但差值远小于 FLT_EPSILON，所以被误判为近似相等 ?）

解决办法 —— 需要借助 相对误差 弥补上述局限性：

assert(AlmostEqualsRel(67329.234f, 67329.242f) == true);
assert(AlmostEqualsUlp(67329.234f, 67329.242f) == true);
assert(AlmostEqualsRel(1.2e-32f, 2.4e-32f) == false);
assert(AlmostEqualsUlp(1.2e-32f, 2.4e-32f) == false);

近似相等 —— 相对误差

a) 一种 比较方法 是把绝对的 epsilon 放缩到待比较数的量级 (magnitude)，即 epsilon 乘以两数绝对值的较大值：

fabs(f1 - f2) <= epsilon * std::max(fabs(f1), fabs(f2))  // AlmostEqualsRel()

然而，这种比较方法：

一方面可能下溢 (underflow) —— 如果两数的绝对值都很小，那么乘上 epsilon 后可能得到 0.0，导致判断失败
另一方面 并不高效 —— 需要进行多次 浮点数运算（除了求绝对值外，两次比较、一次乘法、一次减法）

所以，有没有更好的比较方法呢？

b) 另一种方法是基于浮点数二进制表示的 ULP (unit in the last place) 进行比较：

根据 IEEE 754 标准，相邻两个浮点数的 二进制表示 恰好也是相邻的 —— 相邻两个浮点数之间的距离就是 “它们的 ULP”（定义参考：What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg）
例如，距离 1.0f 最近的下一个浮点数是 nextafter(1.0f)，恰好等于 1.0f + FLT_EPSILON，所以 ULP(1.0f) 等于 FLT_EPSILON（即 2^{-23}2−23）：

十进制定点数	二进制浮点数	二进制表示
`1.0f`	(2^{0} + 0) \times 2^{0}(20+0)×20	`0 01111111 00000000000000000000000`
`nextafter(1.0f)`	(2^{0} + 2^{-23}) \times 2^{0}(20+2−23)×20	`0 01111111 00000000000000000000001`

直观上，比较方法 很简单 —— 比较浮点数的 二进制表示 的差值的绝对值是否小于一个 允许的 ULP 误差 max_ulp：

abs(Biased(Bits(f1)) - Biased(Bits(f2))) <= max_ulp  // AlmostEqualsUlp()

首先，将两个单精度/双精度 浮点数 直接按位转成 32/64 位整数
然后，由于这两个整数表示为 符号位-数值 (sign-magnitude) 格式，导致 +0/-0 的表示形式不一致，所以需要转换为连续范围上的 无符号整数
最后，再比较两个无符号整数 差值的绝对值 是否小于 max_ulp

相对于放缩 epsilon 的方法，这种比较方法：

一方面没有浮点数乘法的下溢问题
另一方面在多数现代处理器上，运算 效率更高（计算 次数更少，只需要进行 整数运算，不涉及任何 浮点数运算）

根据 gtest-internal.h，一般 max_ulp 取值为 4：
The maximum error of a single floating-point operation is 0.5 units in the last place. On Intel CPU's, all floating-point calculations are done with 80-bit precision, while double has 64 bits. Therefore, 4 should be enough for ordinary use.
另外，在实际代码中，还需要处理 +∞/-∞ 和 NaN (not a number) 等情况。（测试用例参考 gtest_unittest.cc）

再举个例子，4.0f 和它的下一个浮点数 nextafter(4.0f) 之间相差了 4 倍的 FLT_EPSILON（即 ULP(4.0f) == 4 FLT_EPSILON）：

十进制定点数	二进制浮点数	二进制表示
`4.0f`	(2^{0} + 0) \times 2^{2}(20+0)×22	`0 10000001 00000000000000000000000`
`nextafter(4.0f)`	(2^{0} + 2^{-23}) \times 2^{2}(20+2−23)×22	`0 10000001 00000000000000000000001`

ulp-vs-epsilon

所以，无法使用 绝对误差 判断 4.0f 和 nextafter(4.0f) 是否近似相等，只能使用 相对误差：

const float next_after_4 = nextafter(4.0f, FLT_MAX);
assert(AlmostEqualsAbs(4.0f, next_after_4) == false);  // expect true
assert(AlmostEqualsRel(4.0f, next_after_4) == true);
assert(AlmostEqualsUlp(4.0f, next_after_4) == true);

存在问题 —— 如果需要比较的数值 非常接近 0，就无法使用 相对误差 和 0 比较：

assert(AlmostEqualsRel(FLT_EPSILON, 0.0f) == false);  // expect true
assert(AlmostEqualsUlp(FLT_EPSILON, 0.0f) == false);  // expect true

虽然 nextafter(1.0f) 和 1.0f 相差 FLT_EPSILON，两者近似相等
但是 FLT_EPSILON（即 nextafter(1.0f) - 1.0f）和 0 相比，并不近似相等 ?

即使放大允许的误差范围，也不一定可行 —— 因为和 0 比起来，“相对误差” 还是太大了 —— 在这种情况下，基于相对误差的比较方法往往是 没有实际意义 的：

assert(AlmostEqualsRel(FLT_EPSILON, 0.0f, 0.1f) == false);  // expect true
assert(AlmostEqualsUlp(FLT_EPSILON, 0.0f, 1000) == false);  // expect true

解决办法 —— 只能使用 绝对误差 进行 “有意义” 的比较：

assert(AlmostEqualsAbs(FLT_EPSILON, 0.0f) == true);

常量的陷阱

需要注意，从低精度转为高精度（从 float 转为 double）会导致 “巨大” 的误差：

assert(Equals<double>(0.1f, 0.1) == false);           // expect true
assert(AlmostEqualsAbs<double>(0.1f, 0.1) == false);  // expect true
assert(AlmostEqualsRel<double>(0.1f, 0.1) == false);  // expect true
assert(AlmostEqualsUlp<double>(0.1f, 0.1) == false);  // expect true

为此，可以通过 放大允许的误差，缓解精度丢失的问题：

assert(AlmostEqualsAbs<double>(0.1f, 0.1, static_cast<double>(FLT_EPSILON)) ==
       true);
assert(AlmostEqualsRel<double>(0.1f, 0.1, static_cast<double>(FLT_EPSILON)) ==
       true);
assert(AlmostEqualsUlp<double>(0.1f, 0.1, FLT_EPSILON / DBL_EPSILON) == true);

相反，从高精度转为低精度（从 double 转为 float）则会抹除细微的误差：

assert(Equals(static_cast<float>(0.1), 0.1f) == true);

另外，由于精度的限制，两个 非常接近（相差 0 ULP）的十进制定点数 字面量 (literal)，只能表示为相同的二进制浮点数：

assert(Equals(67329.234f, 67329.235f) == true);  // expect false

写在最后

Bruce Dawson 的总结是 “没有银弹” —— 在比较两个浮点数是否近似相等时，需要根据具体场景，选择更合适的比较方法：

如果要判断一个浮点数 是否接近 0
- 应该使用 绝对误差
- 因为此时的 相对误差 没有实际意义
如果要比较的两个浮点数 都不是 0
- 一般使用更通用的 相对误差
- 但如果能 提前判断 允许的误差范围，也可以使用 绝对误差
所以，并不存在 判断任意两个浮点数的通用方法 ?

现实世界也一样 ——

对于有钱的人来说，AlmostEquals(百元钞票, 0) == true
对于没钱的人来说，AlmostEquals(一元硬币, 0) == false

最后，今年双十一将抽取一名在 下方留言 的幸运读者，送出一张 玛莎拉蒂 5 元代金券 ?

Maserati-Voucher

（图片来自网络）

如果有什么问题，欢迎交流。?

// https://bot-man-jl.github.io/articles/?post=2020/Comparing-Floating-Point-Numbers

#include <math.h>
#include <stddef.h>
#include <stdint.h>

#include <algorithm>
#include <limits>
#include <type_traits>

template <class Flp>
bool Equals(Flp f1, Flp f2) {
  return f1 == f2;
}

template <class Flp>
bool AlmostEqualsAbs(Flp f1,
                     Flp f2,
                     Flp epsilon = std::numeric_limits<Flp>::epsilon()) {
  return fabs(f1 - f2) <= epsilon;
}

template <class Flp>
bool AlmostEqualsRel(Flp f1,
                     Flp f2,
                     Flp epsilon = std::numeric_limits<Flp>::epsilon()) {
  return fabs(f1 - f2) <= epsilon * std::max(fabs(f1), fabs(f2));
}

namespace detail {

// Reference:
// https://github.com/google/googletest/blob/master/googletest/include/gtest/internal/gtest-internal.h

template <typename Flp>
class FloatingPoint {
 public:
  FloatingPoint(const Flp& x, size_t max_ulps = 4) : max_ulps_(max_ulps) {
    u_.value_ = x;
  }

  // - returns false if either number is (or both are) NAN.
  // - treats really large numbers as almost equal to infinity.
  // - thinks +0.0 and -0.0 are 0 DLP's apart.
  bool AlmostEquals(const FloatingPoint& rhs) const {
    if (is_nan() || rhs.is_nan())
      return false;

    const Bits biased1 = SignAndMagnitudeToBiased();
    const Bits biased2 = rhs.SignAndMagnitudeToBiased();
    return (std::max(biased1, biased2) - std::min(biased1, biased2)) <=
           max_ulps_;
  }

 private:
  using Bits = std::conditional_t<
      sizeof(Flp) == 4,
      uint32_t,
      std::conditional_t<sizeof(Flp) == 8, uint64_t, uint8_t>>;
  static_assert(sizeof(Bits) == 4 || sizeof(Bits) == 8, "only float or double");

  static constexpr size_t kBitCount = 8 * sizeof(Flp);
  static constexpr size_t kFractionBitCount =
      std::numeric_limits<Flp>::digits - 1;
  static constexpr size_t kExponentBitCount = kBitCount - 1 - kFractionBitCount;

  static constexpr Bits kSignBitMask = static_cast<Bits>(1) << (kBitCount - 1);
  static constexpr Bits kFractionBitMask = ~static_cast<Bits>(0) >>
                                           (kExponentBitCount + 1);
  static constexpr Bits kExponentBitMask = ~(kSignBitMask | kFractionBitMask);

  bool is_nan() const {
    // It's a NAN if the exponent bits are all ones and the fraction
    // bits are not entirely zeros.
    return ((kExponentBitMask & u_.bits_) == kExponentBitMask) &&
           ((kFractionBitMask & u_.bits_) != 0);
  }

  // Converts an integer from the sign-and-magnitude representation to
  // the biased representation.  More precisely, let N be 2 to the
  // power of (kBitCount - 1), an integer x is represented by the
  // unsigned number x + N.
  //
  // For instance,
  //
  //   -N + 1 (the most negative number representable using
  //          sign-and-magnitude) is represented by 1;
  //   0      is represented by N; and
  //   N - 1  (the biggest number representable using
  //          sign-and-magnitude) is represented by 2N - 1.
  //
  // Read http://en.wikipedia.org/wiki/Signed_number_representations
  // for more details on signed number representations.
  Bits SignAndMagnitudeToBiased() const {
    return (kSignBitMask & u_.bits_)
               ? (~u_.bits_ + 1)             // represents a negative number.
               : (kSignBitMask | u_.bits_);  // represents a positive number.
  }

  union FloatingPointUnion {
    Flp value_;
    Bits bits_;
  };

  FloatingPointUnion u_;
  size_t max_ulps_ = 0;
};

}  // namespace detail

template <class Flp>
bool AlmostEqualsUlp(Flp lhs, Flp rhs, size_t max_ulps = 4) {
  using FloatingPoint = detail::FloatingPoint<Flp>;
  return FloatingPoint(lhs, max_ulps).AlmostEquals(FloatingPoint(rhs));
}

"如果要判断一个浮点数是否接近 0
应该使用绝对误差
因为此时的相对误差没有实际意义"
猜测“因为此时的相对误差没有实际意义”中，“没有实际意义”的原因：
是不是因为“相对误差”中的“相对”其实是不成立的？
因为，0与任何数的减法都是这个数它自己，这个过程没有产生出“相对”，所以，“没有实际意义”。
任何数“相对”于0的失效。（包含0自己吗？）