Sometimes you need to count the number of 1’s in a stream of bits. The most direct application would be summarizing yes/no data packed into bits. It’s also useful in writing efficient, low-level bit twiddling code. But there are less direct applications as well. For example, three weeks ago this came up in a post I wrote about Pascal’s triangle.
The number of odd integers in the nth row of Pascal’s triangle equals 2^b, where b is the number of 1’s in the binary representation of n.
The function that takes a bit stream and returns the number of 1’s is commonly called popcount, short for population count.
Formula using floor
Here’s an interesting formula for popcount taken from Hacker’s Delight:

\[ \mathrm{popcount}(x) = x - \sum_{n=1}^{\infty} \left\lfloor \frac{x}{2^n} \right\rfloor \]

The sum is actually finite since after a certain point all the terms are zero. For example, if x = 13 (1101 in binary), then the right side is
13 − 6 − 3 − 1
which of course is 3.
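Here’s a small C sketch of that formula (my own illustration, not code from Hacker’s Delight). The running right shift produces ⌊x/2⌋, ⌊x/4⌋, … until the terms reach zero:

unsigned pop_floor(unsigned x) {
    unsigned n = x;
    while (x >>= 1)   /* x becomes floor(x/2), floor(x/4), ... */
        n -= x;       /* subtract each term of the sum */
    return n;         /* for an input of 13: 13 - 6 - 3 - 1 = 3 */
}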
Computing with gcc extensions
The gcc compiler has a function __builtin_popcount that takes an unsigned integer x and returns the number of bits in x set to 1. If the target platform has a chip instruction for computing popcount, then the compiler will generate code to call this instruction. Otherwise it uses library code.
For example, the following code prints 6, 1, and 2.
#include <stdio.h>

int main() {
    for (unsigned int x = 63; x < 66; x++)
        printf("%d\n", __builtin_popcount(x));
}
There are also functions __builtin_popcountl and __builtin_popcountll for unsigned long and unsigned long long arguments.
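For example (a quick illustration of my own), the long long variant handles 64-bit values:

#include <stdio.h>

int main() {
    unsigned long long x = 0xFFFFFFFFFFFFFFFFULL;  /* all 64 bits set */
    printf("%d\n", __builtin_popcountll(x));       /* prints 64 */
}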
Computing in C
If you want to use your own C code to compute popcount rather than relying on compiler extensions, here are a couple of possibilities taken from Hacker’s Delight. First, one that only requires unsigned ints.
int pop1(unsigned x) {
    /* Each 2-bit field now holds the count of its two original bits. */
    x -= ((x >> 1) & 0x55555555);
    /* Sum adjacent 2-bit counts into 4-bit fields. */
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    /* Sum adjacent 4-bit counts into bytes. */
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    /* Accumulate the four byte counts into the low byte. */
    x += (x >> 8);
    x += (x >> 16);
    return x & 0x0000003F;
}
And here is a more elegant version that uses an unsigned long long.
int pop2(unsigned x) {
    unsigned long long y;
    /* Spread four shifted copies of x across the 64-bit product. */
    y = x * 0x0002000400080010ULL;
    /* Keep every fourth bit of each copy. */
    y = y & 0x1111111111111111ULL;
    /* This multiply sums all the kept bits into the top nibble. */
    y = y * 0x1111111111111111ULL;
    y = y >> 60;
    return y;
}
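A quick sanity check (a throwaway harness of my own, assuming pop1 and pop2 above are in the same file) comparing both against the gcc builtin:

#include <stdio.h>

int pop1(unsigned x);  /* defined above */
int pop2(unsigned x);  /* defined above */

int main() {
    unsigned tests[] = { 0, 1, 13, 255, 32767 };
    for (int i = 0; i < 5; i++)
        printf("%u: %d %d %d\n", tests[i], pop1(tests[i]),
               pop2(tests[i]), __builtin_popcount(tests[i]));
}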
The builtin popcount intrinsic is nice, but be sure that your compilation flags let the compiler assume the POPCNT hardware instruction is present; otherwise there’s some run-time performance overhead.
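For instance, with gcc you might pass (adjust to your own toolchain and target):

gcc -O2 -mpopcnt popcount.c        # enable POPCNT explicitly
gcc -O2 -march=native popcount.c   # or target the build machine's ISA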
If your bit stream is long enough (1024 bits or multiples thereof), then there’s an AVX2 solution that is faster than successive native POPCNT instructions. See https://arxiv.org/pdf/1611.07612.pdf, which also describes some of the other algorithms.
Extremely new hardware supports a 512-bit popcount via the _mm512_popcnt_epi64 intrinsic.
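As a sketch (a hypothetical function of mine; it requires a CPU and compiler supporting the AVX512VPOPCNTDQ extension, e.g. gcc with -mavx512vpopcntdq), counting 512 bits at a time looks like:

#include <immintrin.h>
#include <stdint.h>

uint64_t pop512(const uint64_t p[8]) {
    __m512i v = _mm512_loadu_si512(p);   /* load 512 bits */
    __m512i c = _mm512_popcnt_epi64(v);  /* popcount of each 64-bit lane */
    return _mm512_reduce_add_epi64(c);   /* sum the eight lane counts */
}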
With SIMD instructions, you can do better; see our paper:

Wojciech Muła, Nathan Kurz, Daniel Lemire, “Faster Population Counts Using AVX2 Instructions,” The Computer Journal 61(1), 2018. https://arxiv.org/abs/1611.07612
As with most math/logic functions, there’s the classical trade-off between how much time you have and how much space (hardware) you have to perform the computation.
What’s the fastest way to count and sum the bits in a bit-serial stream of an unknown but bounded length n? Brute force says two counters are needed, each of width log2(n): one to count the clock and one to count the set bits. Ideally, the count and sum would be available one clock period after the last bit arrived.
How best to implement the counters? Clock in j bits (j ≤ n), then do a wider popcount_j()? Accumulating the bits increases the minimum delay, but lets you use slower circuits.
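As a rough software model of that brute-force scheme (hypothetical names; in hardware each field would be a register roughly log2(n) bits wide):

typedef struct {
    unsigned clocks;  /* counts every incoming bit period */
    unsigned ones;    /* counts only the set bits */
} bit_serial_counter;

void clock_in(bit_serial_counter *c, unsigned bit) {
    c->clocks += 1;
    c->ones   += bit & 1;
}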
What if instead we focus on counting the bits using minimal energy? In hardware such as an FPGA, this means minimizing total bit-flips within the popcounter, on the assumption that a static state costs vastly less energy than a dynamic one. Under that assumption there is no bound on the number of gates used if only energy is being minimized.
Unfortunately, that’s where my recollection falters: the last time I helped implement this was over two decades ago for a novel high-speed codec, and I forget the specific implementation we finally used. Optimizing the trade-off between time and circuit area was difficult, especially when faster and larger FPGAs were being introduced seemingly daily, so we also optimized for scalability: in one late iteration we used an external 8-bit shift register that was 8x faster than our fastest FPGA, gambling that the next “large, fast and cheap” FPGA would be 8x faster and available in time to go to market.
On that project (which was a key part of a much larger product) I worked on the requirements, algorithms, interfaces and the testbench, and not within the FPGA itself (other than some floorplan assistance, essentially graph optimizations). I also kept track of time, power and BOM costs, evaluating the implementation candidates every step of the way.
For the project as a whole, I also got to optimize the trade-offs between using multiple smaller FPGAs versus fewer giant ones, as well as the hardware/software trade-offs. It was challenging (and fun!) to help design interfaces and algorithms that could be efficiently implemented in either hardware or software. My earliest “deliverable” was a functionally complete system simulator.
The initial product release was nearly pure bleeding-edge FPGA hardware, as we had a “first mover” market advantage that let us set ludicrous initial prices. As embedded processors got faster and ASICs became cheaper, software content increased and the hardware content (COGS) steadily fell, keeping competitors out of that niche.
I left the company soon after the initial product debut: they kept the PhDs and laid off the engineers (giving us stock and an awesome severance package), focusing on IP and letting the fab houses do the heavy lifting for subsequent implementations. Today’s implementation is a semi-custom cell in a commodity ARM SoC.
The money scaled perhaps the fastest: from a $20K initial system cost (~$30K in 2020 dollars) to a $0.30 per-chip IP license fee, a scaling factor of 100,000x. And the single fastest thing done was count bits. Our core “value-add” was minimizing the number of bits needing to be counted: by itself a significant but incremental advance, but debuting our tech on the fastest available FPGA hardware shook the market. Within a year that initial price dropped by 10x when the single-chip ASIC launched (basically unchanged from the FPGA version, but faster still).
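For context on the listing below: it appears to be unoptimized x86-64 compiler output for a byte-table popcount. Hypothetical C source along these lines would produce it (my reconstruction, using a standard macro trick to build the 256-entry table):

/* table[i] holds the popcount of byte i */
#define B2(n) n, n+1, n+1, n+2
#define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
#define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)
static const char table[256] = { B6(0), B6(1), B6(1), B6(2) };

int pop6(unsigned x) {
    /* one table lookup per byte of x */
    return table[x & 0xFF] + table[(x >> 8) & 0xFF]
         + table[(x >> 16) & 0xFF] + table[x >> 24];
}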
pop6: # @pop6
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
and eax, 255
mov eax, eax
mov ecx, eax
movsx eax, byte ptr [rcx + pop6.table]
mov edx, dword ptr [rbp - 4]
shr edx, 8
and edx, 255
mov edx, edx
mov ecx, edx
movsx edx, byte ptr [rcx + pop6.table]
add eax, edx
mov edx, dword ptr [rbp - 4]
shr edx, 16
and edx, 255
mov edx, edx
mov ecx, edx
movsx edx, byte ptr [rcx + pop6.table]
add eax, edx
mov edx, dword ptr [rbp - 4]
shr edx, 24
mov edx, edx
mov ecx, edx
movsx edx, byte ptr [rcx + pop6.table]
add eax, edx
pop rbp
ret
pop6.table:
.ascii "\000\001\001\002\001\002\002\003\001\002\002\003\002\003\003\004\001\002\002\003\002\003\003\004\002\003\003\004\003\004\004\005\001\002\002\003\002\003\003\004\002\003\003\004\003\004\004\005\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\001\002\002\003\002\003\003\004\002\003\003\004\003\004\004\005\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\003\004\004\005\004\005\005\006\004\005\005\006\005\006\006\007\001\002\002\003\002\003\003\004\002\003\003\004\003\004\004\005\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\003\004\004\005\004\005\005\006\004\005\005\006\005\006\006\007\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\003\004\004\005\004\005\005\006\004\005\005\006\005\006\006\007\003\004\004\005\004\005\005\006\004\005\005\006\005\006\006\007\004\005\005\006\005\006\006\007\005\006\006\007\006\007\007\b"
Last year I saw this video with the clickbaity “One Weird CPU Instruction” title. Some of this stuff seems like conspiracy theory, but it has an interesting history regardless.
https://www.youtube.com/watch?v=bLFqLfz2Fmc
The “more elegant” version returns only 4 bits, and it actually only works for inputs 0-32767.
Trivia: popcount instructions were reportedly first added to CPUs at the request of the NSA, as a quick check as to whether a text was possibly a plaintext or not.