将32位循环计数器替换为64位会在Intel CPU上使用_mm_popcnt_u64引起疯狂的性能偏差

本文翻译自:Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

I was looking for the fastest way to popcount large arrays of data. 我一直在寻找最快的方法来popcount大量数据的数量。 I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC. 我遇到了一个非常奇怪的效果:将循环变量从unsigned更改为uint64_t使PC上的性能下降了50%。

The Benchmark 基准测试

#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    free(charbuffer);
}

As you see, we create a buffer of random data, with the size being x megabytes where x is read from the command line. 如您所见,我们创建了一个随机数据缓冲区,大小为x兆字节,其中从命令行读取x Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. 之后,我们遍历缓冲区并使用x86 popcount内部版本的展开版本来执行popcount。 To get a more precise result, we do the popcount 10,000 times. 为了获得更精确的结果,我们将弹出次数进行了10,000次。 We measure the times for the popcount. 我们计算弹出次数的时间。 In the upper case, the inner loop variable is unsigned , in the lower case, the inner loop variable is uint64_t . 在大写情况下,内部循环变量是unsigned ,在小写情况下,内部循环变量是uint64_t I thought that this should make no difference, but the opposite is the case. 我认为这应该没有什么区别,但情况恰恰相反。

The (absolutely crazy) results (绝对疯狂)结果

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1): 我这样编译(g ++版本:Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data): 这是我的Haswell Core i7-4770K CPU @ 3.50 GHz,运行test 1 (因此有1 MB随机数据)的结果:

  • unsigned 41959360000 0.401554 sec 26.113 GB/s 未签名41959360000 0.401554秒26.113 GB / s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s uint64_t 41959360000 0.759822秒13.8003 GB / s

As you see, the throughput of the uint64_t version is only half the one of the unsigned version! 如您所见, uint64_t版本的吞吐量仅为 unsigned版本的吞吐量的一半 The problem seems to be that different assembly gets generated, but why? 问题似乎在于生成了不同的程序集,但是为什么呢? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3): 首先,我想到了编译器错误,因此尝试了clang++ (Ubuntu Clang版本3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 teest.cpp -o test

Result: test 1 结果: test 1

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s 无符号41959360000 0.398293秒26.3267 GB / s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s uint64_t 41959360000 0.680954秒15.3986 GB / s

So, it is almost the same result and is still strange. 因此,它几乎是相同的结果,但仍然很奇怪。 But now it gets super strange. 但是现在变得超级奇怪。 I replace the buffer size that was read from input with a constant 1 , so I change: 我用常量1替换了从输入中读取的缓冲区大小,因此我进行了更改:

uint64_t size = atol(argv[1]) << 20;

to

uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. 因此,编译器现在在编译时就知道缓冲区的大小。 Maybe it can add some optimizations! 也许它可以添加一些优化! Here are the numbers for g++ : 这是g++的数字:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s 未签名41959360000 0.509156秒20.5944 GB / s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s uint64_t 41959360000 0.508673秒20.6139 GB / s

Now, both versions are equally fast. 现在,两个版本都同样快。 However, the unsigned got even slower ! 但是, unsigned 变得更慢 It dropped from 26 to 20 GB/s , thus replacing a non-constant by a constant value lead to a deoptimization . 它从26 20 GB/s下降到20 GB/s ,因此用恒定值替换非恒定值会导致优化不足 Seriously, I have no clue what is going on here! 说真的,我不知道这是怎么回事! But now to clang++ with the new version: 但是现在使用新版本的clang++

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s 未签名41959360000 0.677009秒15.4884 GB / s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s uint64_t 41959360000 0.676909秒15.4906 GB / s

Wait, what? 等一下 Now, both versions dropped to the slow number of 15 GB/s. 现在,两个版本的速度都降低到了15 GB / s的缓慢速度 Thus, replacing a non-constant by a constant value even lead to slow code in both cases for Clang! 因此,在两种情况下,用常数替换非常数都会导致Clang!代码缓慢。

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. 我要求具有Ivy Bridge CPU的同事来编译我的基准测试。 He got similar results, so it does not seem to be Haswell. 他得到了类似的结果,因此似乎不是Haswell。 Because two compilers produce strange results here, it also does not seem to be a compiler bug. 因为两个编译器在这里产生奇怪的结果,所以它似乎也不是编译器错误。 We do not have an AMD CPU here, so we could only test with Intel. 我们这里没有AMD CPU,因此只能在Intel上进行测试。

More madness, please! 请更加疯狂!

Take the first example (the one with atol(argv[1]) ) and put a static before the variable, ie: 以第一个示例(带有atol(argv[1])示例)为例,然后在变量前放置一个static变量,即:

static uint64_t size=atol(argv[1])<<20;

Here are my results in g++: 这是我在g ++中的结果:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s 无符号41959360000 0.396728秒26.4306 GB / s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s uint64_t 41959360000 0.509484秒20.5811 GB / s

Yay, yet another alternative . 是的,另一种选择 We still have the fast 26 GB/s with u32 , but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! 我们仍然有快26 GB / s的u32 ,但我们设法u64从13 GB至少/ S到20 GB / s的版本! On my collegue's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. 在我collegue的PC中, u64版本成为速度甚至超过了u32的版本,产生所有的最快的结果。 Sadly, this only works for g++ , clang++ does not seem to care about static . 可悲的是,这仅适用于g++clang++似乎并不关心static

My question 我的问题

Can you explain these results? 您能解释这些结果吗? Especially: 特别:

  • How can there be such a difference between u32 and u64 ? 哪有之间的这种差异u32u64
  • How can replacing a non-constant by a constant buffer size trigger less optimal code ? 如何用恒定的缓冲区大小替换非常数会触发次优代码
  • How can the insertion of the static keyword make the u64 loop faster? 插入static关键字如何使u64循环更快? Even faster than the original code on my collegue's computer! 甚至比同事计算机上的原始代码还要快!

I know that optimization is a tricky territory, however, I never thought that such small changes can lead to a 100% difference in execution time and that small factors like a constant buffer size can again mix results totally. 我知道优化是一个棘手的领域,但是,我从未想到过如此小的更改会导致执行时间差异100% ,而诸如恒定缓冲区大小之类的小因素又会完全混和结果。 Of course, I always want to have the version that is able to popcount 26 GB/s. 当然,我一直希望拥有能够以26 GB / s的速度计数的版本。 The only reliable way I can think of is copy paste the assembly for this case and use inline assembly. 我能想到的唯一可靠的方法是针对这种情况复制粘贴程序集并使用内联程序集。 This is the only way I can get rid of compilers that seem to go mad on small changes. 这是我摆脱似乎对微小更改感到恼火的编译器​​的唯一方法。 What do you think? 你怎么看? Is there another way to reliably get the code with most performance? 还有另一种方法可以可靠地获得性能最高的代码吗?

The Disassembly 拆卸

Here is the disassembly for the various results: 这是各种结果的反汇编:

26 GB/s version from g++ / u32 / non-const bufsize : 来自g ++ / u32 / non-const bufsize的 26 GB / s版本:

0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8

13 GB/s version from g++ / u64 / non-const bufsize : 来自g ++ / u64 / non-const bufsize的 13 GB / s版本:

0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize : 来自clang ++ / u64 / non-const bufsize的 15 GB / s版本:

0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50

20 GB/s version from g++ / u32&u64 / const bufsize : 来自g ++ / u32&u64 / const bufsize的 20 GB / s版本:

0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68

15 GB/s version from clang++ / u32&u64 / const bufsize : 来自clang ++ / u32&u64 / const bufsize的 15 GB / s版本:

0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! 有趣的是,最快的版本(26 GB / s)也是最长的! It seems to be the only solution that uses lea . 它似乎是使用lea的唯一解决方案。 Some versions use jb to jump, others use jne . 一些版本使用jb跳转,另一些版本使用jne But apart from that, all versions seem to be comparable. 但是除此之外,所有版本似乎都是可比的。 I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. 我看不出100%的性能差距可能源于何处,但我不太擅长破译汇编。 The slowest (13 GB/s) version looks even very short and good. 最慢的版本(13 GB / s)看起来非常短而且很好。 Can anyone explain this? 谁能解释一下?

Lessons learned 得到教训

No matter what the answer to this question will be; 不管这个问题的答案是什么; I have learned that in really hot loops every detail can matter, even details that do not seem to have any association to the hot code . 我了解到,在真正的热循环中, 每个细节都可能很重要, 即使那些似乎与该热代码没有任何关联的细节也是如此 I have never thought about what type to use for a loop variable, but as you see such a minor change can make a 100% difference! 我从未考虑过要为循环变量使用哪种类型,但是如您所见,如此小的更改可能会产生100%的差异! Even the storage type of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! 正如我们在size变量前面插入static关键字所看到的那样,甚至缓冲区的存储类型也可以产生巨大的变化! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance. 将来,在编写对系统性能至关重要的紧密而又热的循环时,我将始终在各种编译器上测试各种替代方案。

The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. 有趣的是,尽管我已经四次展开循环,但性能差异仍然很高。 So even if you unroll, you can still get hit by major performance deviations. 因此,即使展开,仍然会受到重大性能偏差的影响。 Quite interesting. 挺有意思。


#1楼

参考:https://stackoom.com/question/1hE0T/将-位循环计数器替换为-位会在Intel-CPU上使用-mm-popcnt-u-引起疯狂的性能偏差


#2楼

This is not an answer, but it's hard to read if I put results in comment. 这不是答案,但是如果我将结果放入评论中则很难阅读。

I get these results with a Mac Pro ( Westmere 6-Cores Xeon 3.33 GHz). 我使用Mac ProWestmere 6核Xeon 3.33 GHz)获得了这些结果。 I compiled it with clang -O3 -msse4 -lstdc++ a.cpp -oa (-O2 get same result). 我用clang -O3 -msse4 -lstdc++ a.cpp -oa编译了它(-O2得到了相同的结果)。

clang with uint64_t size=atol(argv[1])<<20; 带有uint64_t size=atol(argv[1])<<20; clang uint64_t size=atol(argv[1])<<20;

unsigned    41950110000 0.811198 sec    12.9263 GB/s
uint64_t    41950110000 0.622884 sec    16.8342 GB/s

clang with uint64_t size=1<<20; uint64_t size=1<<20; clang uint64_t size=1<<20;

unsigned    41950110000 0.623406 sec    16.8201 GB/s
uint64_t    41950110000 0.623685 sec    16.8126 GB/s

I also tried to: 我还试图:

  1. Reverse the test order, the result is the same so it rules out the cache factor. 颠倒测试顺序,结果相同,因此排除了缓存因素。
  2. Have the for statement in reverse: for (uint64_t i=size/8;i>0;i-=4) . 反向使用for语句: for (uint64_t i=size/8;i>0;i-=4) This gives the same result and proves the compile is smart enough to not divide size by 8 every iteration (as expected). 这给出了相同的结果,并证明了编译足够聪明,不会在每次迭代中都将大小除以8(如预期的那样)。

Here is my wild guess: 这是我的疯狂猜测:

The speed factor comes in three parts: 速度因数分为三个部分:

  • code cache: uint64_t version has larger code size, but this does not have an effect on my Xeon CPU. 代码缓存: uint64_t版本具有更大的代码大小,但这对我的Xeon CPU没有影响。 This makes the 64-bit version slower. 这会使64位版本变慢。

  • Instructions used. 使用说明。 Note not only the loop count, but the buffer is accessed with a 32-bit and 64-bit index on the two versions. 请注意,不仅循环计数,而且在两个版本上还使用32位和64位索引访问缓冲区。 Accessing a pointer with a 64-bit offset requests a dedicated 64-bit register and addressing, while you can use immediate for a 32-bit offset. 访问具有64位偏移量的指针需要专用的64位寄存器和地址,而您可以将Instant用于32位偏移量。 This may make the 32-bit version faster. 这可能会使32位版本更快。

  • Instructions are only emitted on the 64-bit compile (that is, prefetch). 指令仅在64位编译器上发出(即预取)。 This makes 64-bit faster. 这使64位速度更快。

The three factors together match with the observed seemingly conflicting results. 这三个因素与观察到的看似矛盾的结果相吻合。


#3楼

Have you tried moving the reduction step outside the loop? 您是否尝试过将还原步骤移出循环? Right now you have a data dependency that really isn't needed. 现在,您确实不需要数据依赖。

Try: 尝试:

  uint64_t subset_counts[4] = {};
  for( unsigned k = 0; k < 10000; k++){
     // Tight unrolled loop with unsigned
     unsigned i=0;
     while (i < size/8) {
        subset_counts[0] += _mm_popcnt_u64(buffer[i]);
        subset_counts[1] += _mm_popcnt_u64(buffer[i+1]);
        subset_counts[2] += _mm_popcnt_u64(buffer[i+2]);
        subset_counts[3] += _mm_popcnt_u64(buffer[i+3]);
        i += 4;
     }
  }
  count = subset_counts[0] + subset_counts[1] + subset_counts[2] + subset_counts[3];

You also have some weird aliasing going on, that I'm not sure is conformant to the strict aliasing rules. 您还会遇到一些奇怪的别名,但我不确定这是否符合严格的别名规则。


#4楼

I can't give an authoritative answer, but provide an overview of a likely cause. 我无法给出权威性的答案,但会概述可能的原因。 This reference shows pretty clearly that for the instructions in the body of your loop there is a 3:1 ratio between latency and throughput. 该参考资料清楚地表明,对于循环主体中的指令,延迟和吞吐量之间的比例为3:1。 It also shows the effects of multiple dispatch. 它还显示了多个调度的效果。 Since there are (give-or-take) three integer units in modern x86 processors, it's generally possible to dispatch three instructions per cycle. 由于现代x86处理器中有(允许或不采取)三个整数单元,因此通常每个周期可以调度三个指令。

So between peak pipeline and multiple dispatch performance and failure of these mechanisms, we have a factor of six in performance. 因此,在高峰管道和多种调度性能以及这些机制的失败之间,我们的性能有六分之一。 It's pretty well known that the complexity of the x86 instruction set makes it quite easy for quirky breakage to occur. 众所周知,x86指令集的复杂性使得它很容易发生古怪的破坏。 The document above has a great example: 上面的文档有一个很好的例子:

The Pentium 4 performance for 64-bit right shifts is really poor. 奔腾4的64位右移性能确实很差。 64-bit left shift as well as all 32-bit shifts have acceptable performance. 64位左移以及所有32位移都具有可接受的性能。 It appears that the data path from the upper 32 bits to the lower 32 bit of the ALU is not well designed. 看来从ALU的高32位到低32位的数据路径设计得不好。

I personally ran into a strange case where a hot loop ran considerably slower on a specific core of a four-core chip (AMD if I recall). 我个人遇到了一个奇怪的情况,即热循环在四核芯片(如果我还记得的话,AMD)的特定核心上运行得相当慢。 We actually got better performance on a map-reduce calculation by turning that core off. 通过关闭该核心,我们实际上在映射减少计算上获得了更好的性能。

Here my guess is contention for integer units: that the popcnt , loop counter, and address calculations can all just barely run at full speed with the 32-bit wide counter, but the 64-bit counter causes contention and pipeline stalls. 在这里,我的猜测是争用整数单位: popcnt ,循环计数器和地址计算都只能使用32位宽的计数器全速运行,但是64位计数器会引起争用和流水线停顿。 Since there are only about 12 cycles total, potentially 4 cycles with multiple dispatch, per loop body execution, a single stall could reasonably affect run time by a factor of 2. 因为每个循环体执行总共只有大约12个周期,可能有4个周期有多个调度,所以单个停顿可以合理地将运行时间影响2倍。

The change induced by using a static variable, which I'm guessing just causes a minor reordering of instructions, is another clue that the 32-bit code is at some tipping point for contention. 我猜想使用静态变量引起的更改会导致对指令进行较小的重新排序,这是32位代码处于竞争的临界点的另一个线索。

I know this is not a rigorous analysis, but it is a plausible explanation. 我知道这不是严格的分析,但这一个合理的解释。


#5楼

Culprit: False Data Dependency (and the compiler isn't even aware of it) 罪魁祸首:错误的数据依赖关系 (并且编译器甚至没有意识到)

On Sandy/Ivy Bridge and Haswell processors, the instruction: 在Sandy / Ivy Bridge和Haswell处理器上,指令如下:

popcnt  src, dest

appears to have a false dependency on the destination register dest . 似乎对目标寄存器dest有错误的依赖性。 Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. 即使指令仅写入指令,指令也会等到dest准备好后再执行。 This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake) 英特尔现已(现在)将这种错误依赖性记录为勘误表HSD146(Haswell)SKL029(Skylake)

Skylake fixed this for lzcnt and tzcnt . Skylake将其固定为lzcnttzcnt
Cannon Lake (and Ice Lake) fixed this for popcnt . 坎农湖(和冰湖) popcnt修复为popcnt
bsf / bsr have a true output dependency: output unmodified for input=0. bsf / bsr具有真正的输出依赖性:输入= 0时输出未修改。 (But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.) (但是没有办法利用内在函数来利用它 -只有AMD记录了它,编译器没有公开它。)

(Yes, these instructions all run on the same execution unit ). (是的,这些指令都在同一执行单元上运行)。


This dependency doesn't just hold up the 4 popcnt s from a single loop iteration. 这种依赖关系不只是从单个循环迭代中容纳4个popcnt It can carry across loop iterations making it impossible for the processor to parallelize different loop iterations. 它可以进行循环迭代,从而使处理器无法并行化不同的循环迭代。

The unsigned vs. uint64_t and other tweaks don't directly affect the problem. unsigned vs. uint64_t和其他调整不会直接影响问题。 But they influence the register allocator which assigns the registers to the variables. 但是它们会影响寄存器分配器,后者将寄存器分配给变量。

In your case, the speeds are a direct result of what is stuck to the (false) dependency chain depending on what the register allocator decided to do. 在您的情况下,速度是卡在(假)依赖链上的直接结果,具体取决于寄存器分配器决定执行的操作。

  • 13 GB/s has a chain: popcnt - add - popcnt - popcnt → next iteration 13 GB / s有一条链: popcnt add popcnt - popcnt →下一次迭代
  • 15 GB/s has a chain: popcnt - add - popcnt - add → next iteration 15 GB / s有一条链: popcnt add popcnt add →下一个迭代
  • 20 GB/s has a chain: popcnt - popcnt → next iteration 20 GB / s有一条链: popcnt - popcnt →下一次迭代
  • 26 GB/s has a chain: popcnt - popcnt → next iteration 26 GB / s有一条链: popcnt - popcnt →下一次迭代

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. 20 GB / s和26 GB / s之间的差异似乎只是间接寻址的一个小问题。 Either way, the processor starts to hit other bottlenecks once you reach this speed. 无论哪种方式,一旦达到此速度,处理器就会开始遇到其他瓶颈。


To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. 为了测试这一点,我使用了内联程序集来绕过编译器并获得我想要的程序集。 I also split up the count variable to break all other dependencies that might mess with the benchmarks. 我还拆分了count变量,以打破可能与基准混乱的所有其他依赖项。

Here are the results: 结果如下:

Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom) Sandy Bridge Xeon @ 3.5 GHz :(完整的测试代码可在底部找到)

  • GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
  • Ubuntu 12 Ubuntu 12

Different Registers: 18.6195 GB/s 不同的寄存器: 18.6195 GB / s

.L4:
    movq    (%rbx,%rax,8), %r8
    movq    8(%rbx,%rax,8), %r9
    movq    16(%rbx,%rax,8), %r10
    movq    24(%rbx,%rax,8), %r11
    addq    $4, %rax

    popcnt %r8, %r8
    add    %r8, %rdx
    popcnt %r9, %r9
    add    %r9, %rcx
    popcnt %r10, %r10
    add    %r10, %rdi
    popcnt %r11, %r11
    add    %r11, %rsi

    cmpq    $131072, %rax
    jne .L4

Same Register: 8.49272 GB/s 相同寄存器: 8.49272 GB / s

.L9:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # This time reuse "rax" for all the popcnts.
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L9

Same Register with broken chain: 17.8869 GB/s 链断裂的同一个寄存器: 17.8869 GB / s

.L14:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # Reuse "rax" for all the popcnts.
    xor    %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L14

So what went wrong with the compiler? 那么编译器出了什么问题?

It seems that neither GCC nor Visual Studio are aware that popcnt has such a false dependency. 似乎GCC和Visual Studio都不知道popcnt具有这种错误的依赖性。 Nevertheless, these false dependencies aren't uncommon. 但是,这些错误的依赖关系并不少见。 It's just a matter of whether the compiler is aware of it. 只是编译器是否知道它。

popcnt isn't exactly the most used instruction. popcnt并不是最常用的指令。 So it's not really a surprise that a major compiler could miss something like this. 因此,主要的编译器可能会错过这样的内容,这并不奇怪。 There also appears to be no documentation anywhere that mentions this problem. 似乎也没有任何文档提到此问题。 If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance. 如果英特尔不公开,那么除非有人偶然碰到它,否则外界不会知道。

( Update: As of version 4.9.2 , GCC is aware of this false-dependency and generates code to compensate it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.) 更新: 从4.9.2版开始 ,GCC意识到了这种虚假依赖关系,并在启用优化后生成了代码来对其进行补偿。其他厂商的主要编译器,包括Clang,MSVC甚至是英特尔自己的ICC都尚未意识到此微体系结构勘误表,并且不会发出补偿它的代码。)

Why does the CPU have such a false dependency? 为什么CPU具有如此错误的依赖性?

We can speculate: it runs on the same execution unit as bsf / bsr which do have an output dependency. 我们可以推测:它运行相同的执行单元作为bsf / bsr确实有一个输出的依赖。 ( How is POPCNT implemented in hardware? ). 如何在硬件中实现POPCNT? )。 For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. 对于这些指令,Intel将input = 0的整数结果记录为“未定义”(ZF = 1),但是Intel硬件实际上为避免破坏旧软件提供了更强的保证:输出未修改。 AMD documents this behaviour. AMD记录了这种行为。

Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not. 可能不方便地为此执行单元的输出依赖于输出,而另一些则不然。

AMD processors do not appear to have this false dependency. AMD处理器似乎没有这种错误的依赖性。


The full test code is below for reference: 完整的测试代码如下:

#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

   using namespace std;
   uint64_t size=1<<20;

   uint64_t* buffer = new uint64_t[size/8];
   char* charbuffer=reinterpret_cast<char*>(buffer);
   for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;

   uint64_t count,duration;
   chrono::time_point<chrono::system_clock> startP,endP;
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %4  \n\t"
                "add %4, %0     \n\t"
                "popcnt %5, %5  \n\t"
                "add %5, %1     \n\t"
                "popcnt %6, %6  \n\t"
                "add %6, %2     \n\t"
                "popcnt %7, %7  \n\t"
                "add %7, %3     \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Chain 4   \t"  << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }
   {
      uint64_t c0 = 0;
      uint64_t c1 = 0;
      uint64_t c2 = 0;
      uint64_t c3 = 0;
      startP = chrono::system_clock::now();
      for( unsigned k = 0; k < 10000; k++){
         for (uint64_t i=0;i<size/8;i+=4) {
            uint64_t r0 = buffer[i + 0];
            uint64_t r1 = buffer[i + 1];
            uint64_t r2 = buffer[i + 2];
            uint64_t r3 = buffer[i + 3];
            __asm__(
                "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                "popcnt %4, %%rax   \n\t"
                "add %%rax, %0      \n\t"
                "popcnt %5, %%rax   \n\t"
                "add %%rax, %1      \n\t"
                "popcnt %6, %%rax   \n\t"
                "add %%rax, %2      \n\t"
                "popcnt %7, %%rax   \n\t"
                "add %%rax, %3      \n\t"
                : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                : "rax"
            );
         }
      }
      count = c0 + c1 + c2 + c3;
      endP = chrono::system_clock::now();
      duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
      cout << "Broken Chain\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
            << (10000.0*size)/(duration) << " GB/s" << endl;
   }

   free(charbuffer);
}

An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si 一个同样有趣的基准可以在这里找到: http : //pastebin.com/kbzgL8si
This benchmark varies the number of popcnt s that are in the (false) dependency chain. 此基准会popcnt (假)依赖关系链中的popcnt的数量。

False Chain 0:  41959360000 0.57748 sec     18.1578 GB/s
False Chain 1:  41959360000 0.585398 sec    17.9122 GB/s
False Chain 2:  41959360000 0.645483 sec    16.2448 GB/s
False Chain 3:  41959360000 0.929718 sec    11.2784 GB/s
False Chain 4:  41959360000 1.23572 sec     8.48557 GB/s

#6楼

I coded up an equivalent C program to experiment, and I can confirm this strange behaviour. 我编写了一个等效的C程序进行实验,可以证实这种奇怪的行为。 What's more, gcc believes the 64-bit integer (which should probably be a size_t anyway...) to be better, as using uint_fast32_t causes gcc to use a 64-bit uint. 更重要的是, gcc认为64位整数(无论如何应该应该是size_t ...)会更好,因为使用uint_fast32_t会导致gcc使用64位uint。

I did a bit of mucking around with the assembly: 我对程序集做了一些修改:
Simply take the 32-bit version, replace all 32-bit instructions/registers with the 64-bit version in the inner popcount-loop of the program. 只需采用32位版本,即可在程序内部popcount循环中将所有32位指令/寄存器替换为64位版本。 Observation: the code is just as fast as the 32-bit version! 观察:代码与32位版本一样快!

This is obviously a hack, as the size of the variable isn't really 64 bit, as other parts of the program still use the 32-bit version, but as long as the inner popcount-loop dominates performance, this is a good start. 显然,这是一个hack,因为变量的大小并不是真正的64位,因为程序的其他部分仍使用32位版本,但是只要内部popcount循环主导性能,这就是一个好的开始。

I then copied the inner loop code from the 32-bit version of the program, hacked it up to be 64 bit, fiddled with the registers to make it a replacement for the inner loop of the 64-bit version. 然后,我从程序的32位版本复制了内部循环代码,将其修改为64位,并在寄存器中进行了修饰,以使其替换64位版本的内部循环。 This code also runs as fast as the 32-bit version. 此代码也可以与32位版本一样快地运行。

My conclusion is that this is bad instruction scheduling by the compiler, not actual speed/latency advantage of 32-bit instructions. 我的结论是,这是编译器不好的指令调度,而不是32位指令的实际速度/延迟优势。

(Caveat: I hacked up assembly, could have broken something without noticing. I don't think so.) (注意:我搞砸了程序集,可能在不注意的情况下破坏了某些东西。我不这样认为。)