Popcount: counting 1’s in a bit stream

Sometimes you need to count the number of 1’s in a stream of bits. The most direct application would be summarizing yes/no data packed into bits. It’s also useful in writing efficient, low-level bit twiddling code. But there are less direct applications as well. For example, three weeks ago this came up in a post I wrote about Pascal’s triangle.

The number of odd integers in the nth row of Pascal’s triangle equals 2b where b is the number of 1’s in the binary representation of n.

The function that takes a bit stream and returns the number of 1’s is commonly called popcount, short for population count.

Formula using floor

Here’s an interesting formula for popcount taken from Hacker’s Delight:

\mbox{popcount}(x) = x - \sum_{n=1}^\infty \left\lfloor \frac{x}{2^n} \right\rfloor

The sum is actually finite since after a certain point all the terms are zero. For example, if x = 13 (1101 in binary), then the right side is

13 − 6 − 3 − 1

which of course is 3.

Computing with gcc extensions

The gcc compiler has a function __builtin_popcount that takes an unsigned integer x and returns the number of bits in x set to 1. If the target platform has a chip instruction for computing popcount then compiler will generate code to call this instruction. Otherwise it uses library code.

For example, the following code prints 6, 1, and 2.

    #include <stdio.h>

    int main() {
        for (unsigned int x = 63; x < 66; x++)
            printf("%d\n", __builtin_popcount(x));
    }

There are also functions __builtin_popcountl and __builtin_popcountll for unsigned long and unsigned long long arguments.

Computing in C

If you want to use your own C code to compute popcount rather than relying on compiler extensions, here are a couple possibilities taken from Hacker’s Delight. First, one that only requires unsigned ints.

    int pop1(unsigned x) {
        x -= ((x >> 1) & 0x55555555);
        x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
        x = (x + (x >> 4)) & 0x0F0F0F0F;
        x += (x >> 8);
        x += (x >> 16);    
        return x & 0x0000003F;
    }

And here is a more elegant version that uses an unsigned long long.

    int pop2(unsigned x) {
        unsigned long long y;
        y = x * 0x0002000400080010ULL;
        y = y & 0x1111111111111111ULL;
        y = y * 0x1111111111111111ULL;
        y = y >> 60;
        return y;
    }

7 thoughts on “Popcount: counting 1’s in a bit stream

  1. The builtin popcount intrinsic is nice, but be sure that your compilation flags let the compiler assume the POPCNT hardware instruction is present otherwise there’s some run-time performance overhead.

    If your bit stream is long enough (1024 bits or multiples thereof), then there’s an AVX2 solution which is faster than successive native POPCNT instructions. See https://arxiv.org/pdf/1611.07612.pdf , which also describes some of the other algorithms.

    Extremely new hardware supports a 512-bit popcount via the _mm512_popcnt_epi64 intrinsic.

  2. Like most math/logic functions, there’s the classical trade-off between how much time you have and how much space (hardware) you have to perform the function.

    What’s the fastest way to count and sum the bits in a bit-serial serial stream of an unknown but bounded length n? Brute-force says two counters are needed, each with width log2(n): One to count the clock and one to count the set bits. Ideally, the count and sum would be available one clock period after the last bit arrived.

    How best to implement the counters? Clock in j bits (j ≤ n), then do a wider popcount_j()? Accumulating the bits increases the minimum delay, but lets you use slower circuits.

    What if instead we focus on counting the bits using minimal energy? In hardware such as an FPGA, this means minimizing total bit-flips within the popcounter, given the assumption that a static state costs vastly less energy than a dynamic one, meaning no bounds on the number of gates used if only energy is being minimized.

    Unfortunately, that’s where my recollection falters: The last time I helped implement this was over 2 decades ago for a novel high-speed codec, and I forget the specific initial implementation we finally used. Optimizing the trade-off between time and circuit area was difficult, especially when faster and larger FPGAs were being introduced seemingly daily, so we also optimized for scalability: In one late iteration we did use an external 8-bit shift register that was 8x faster than our fastest FPGA, gambling that the next “large, fast and cheap” FPGA would be 8x faster and available in time to go to market.

    On that project (which was a key part of a much larger product) I worked on the requirements, algorithms, interfaces and the testbench, and not within the FPGA itself (other than some floorplan assistance, essentially graph optimizations). I also kept track of time, power and BOM costs, evaluating the implementation candidates every step of the way.

    For the project as a whole, I also got to optimize the trade-offs between using multiple smaller FPGAs versus fewer giant ones, as well as the hardware/software trade-offs. It was challenging (and fun!) to help design interfaces and algorithms that could be efficiently implemented in either hardware or software. My earliest “deliverable” was a functionally complete system simulator.

    The initial product release was nearly pure bleeding-edge FPGA hardware, as we had a “first mover” market advantage that let us set ludicrous initial prices. As embedded processors got faster and ASICs became cheaper, software content increased and the hardware content (COGS) steadily fell, keeping competitors out of that niche.

    I left the company soon after the initial product debut: They kept the PhDs and laid-off the engineers (giving us stock and an awesome severance package), focusing on IP and letting the fab houses do the heavy lifting for subsequent implementations. Today’s implementation is a semi-custom cell in a commodity ARM SoC.

    The money scaled perhaps the fastest: From a $20K initial system cost (~$30K in 2020 dollars) to a $0.30 per-chip IP licence fee, a scaling factor of 100,000x. And the single fastest thing done was count bits. Our core “value-add” was minimizing the number of bits needing to be counted: By itself a significant but incremental advance, quickly debuting our tech using the fastest available FPGA hardware shook the market. Within a year that initial price dropped by 10x when the single-chip ASIC launched (basically unchanged from the FPGA version, but faster still).

  3. pop6: # @pop6
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp – 4], edi
    mov eax, dword ptr [rbp – 4]
    and eax, 255
    mov eax, eax
    mov ecx, eax
    movsx eax, byte ptr [rcx + pop6.table]
    mov edx, dword ptr [rbp – 4]
    shr edx, 8
    and edx, 255
    mov edx, edx
    mov ecx, edx
    movsx edx, byte ptr [rcx + pop6.table]
    add eax, edx
    mov edx, dword ptr [rbp – 4]
    shr edx, 16
    and edx, 255
    mov edx, edx
    mov ecx, edx
    movsx edx, byte ptr [rcx + pop6.table]
    add eax, edx
    mov edx, dword ptr [rbp – 4]
    shr edx, 24
    mov edx, edx
    mov ecx, edx
    movsx edx, byte ptr [rcx + pop6.table]
    add eax, edx
    pop rbp
    ret

    pop6.table:
    .ascii “\000\001\001\002\001\002\002\003\001\002\002\003\002\003\003\004\001\002\002\003\002\003\003\004\002\003\003\004\003\004\004\005\001\002\002\003\002\003\003\004\002\003\003\004\003\004\004\005\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\001\002\002\003\002\003\003\004\002\003\003\004\003\004\004\005\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\003\004\004\005\004\005\005\006\004\005\005\006\005\006\006\007\001\002\002\003\002\003\003\004\002\003\003\004\003\004\004\005\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\003\004\004\005\004\005\005\006\004\005\005\006\005\006\006\007\002\003\003\004\003\004\004\005\003\004\004\005\004\005\005\006\003\004\004\005\004\005\005\006\004\005\005\006\005\006\006\007\003\004\004\005\004\005\005\006\004\005\005\006\005\006\006\007\004\005\005\006\005\006\006\007\005\006\006\007\006\007\007\b”

  4. The “more elegant” version returns only 4 bits, and it actually only works for 0-32767

  5. trivia – the original popcount instructions were added to Intel CPUs at the request of the NSA as a quick check as to whether a text was possibly a plaintext or not.

Comments are closed.