Shuffle product

The shuffle product of two words, w1 and w2, written

w1 Ш w2,

is the set of all words formed by interleaving the letters of w1 and w2, using each letter exactly once and preserving the order of the letters within each word. The name comes from the analogy with doing a riffle shuffle of two decks of cards.

For example, bcd Ш ae, the shuffle product of bcd and ae, would be all permutations of abcde in which the consonants appear in alphabetical order and the vowels are also in alphabetical order. So abecd and baecd would be included, but badec would not be because the d and c are in the wrong order.

Side note on Ш

Incidentally, the symbol for shuffle product is the Cyrillic letter sha (Ш, U+0428), the only Cyrillic letter commonly used in mathematics, at least internationally. Presumably Russian mathematicians use other Cyrillic letters, but the only Cyrillic letter an American mathematician, for example, is likely to use is Ш.

The uses of Ш that I’m aware of are the Dirac comb distribution, the Tate–Shafarevich group, and the shuffle product.

Duplicate letters

What is the shuffle product of words containing duplicate letters? For example, what about the shuffle product of bread and crumb? Each word contains an r. The shuffle product, defined above as a set, doesn’t distinguish between the two rs. But another way to define the shuffle product is as a formal sum, with coefficients that count duplicates.

Imagine coloring the letters in abc blue and the letters in cde red. Then the word abccde arises in two different ways: one with the blue c before the red c, and one the other way around. This term in the formal sum would be 2abccde, the coefficient 2 capturing that there are two ways to arrive at this word.

You could also have duplicate letters within a single word. So in banana, for example, you could imagine coloring each a a different color and coloring the two ns different colors.

Mathematica code

This page gives an implementation of the shuffle product in Mathematica.

shuffleW[s1_, s2_] := 
     Module[{p, tp, ord}, 
          p = Permutations@Join[1 & /@ s1, 0 & /@ s2]\[Transpose];
          tp = BitXor[p, 1];
          ord = Accumulate[p] p + (Accumulate[tp] + Length[s1]) tp;
          Outer[Part, {Join[s1, s2]}, ord, 1][[1]]\[Transpose]]

This code takes two lists of characters and returns a list of lists of characters. You can use this to compute both senses of the shuffle product. For example, let’s compute abc Ш ac.

The Mathematica command

    shuffleW[{a, b, c}, {a, c}]

returns a list of 10 lists:

    {{a, b, c, a, c}, {a, b, a, c, c}, {a, b, a, c, c}, 
     {a, a, b, c, c}, {a, a, b, c, c}, {a, a, c, b, c}, 
     {a, a, b, c, c}, {a, a, b, c, c}, {a, a, c, b, c}, 
     {a, c, a, b, c}}

If we remove the duplicates from the list above with Union[%] we get

    {{a, a, b, c, c}, {a, a, c, b, c}, {a, b, a, c, c}, 
     {a, b, c, a, c}, {a, c, a, b, c}}

So using the set definition, we could say

abc Ш ac = {aabcc, aacbc, abacc, abcac, acabc}.

Using the formal sum definition we could say

abc Ш ac = 4aabcc + 2aacbc + 2abacc + abcac + acabc.
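
If you’d rather experiment in Python, here’s a short sketch of my own (not taken from the page linked above) that enumerates shuffles with itertools and tallies multiplicities.

    from itertools import combinations
    from collections import Counter

    def shuffles(w1, w2):
        # interleave w1 and w2 in every way that preserves each word's letter order
        n, m = len(w1), len(w2)
        for positions in combinations(range(n + m), n):
            word = [None] * (n + m)
            for letter, p in zip(w1, positions):
                word[p] = letter
            rest = iter(w2)
            word = [c if c is not None else next(rest) for c in word]
            yield "".join(word)

    print(Counter(shuffles("abc", "ac")))
    # 4 aabcc + 2 aacbc + 2 abacc + 1 abcac + 1 acabc, matching the formal sum above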


Reverse engineering options

There are 221,184 ways to have a Whopper. You Rule.

This weekend I saw a sign in the window of a Burger King™ that made me think of an interesting problem. If you know the number of possibilities, like the number on this sign, how would you reverse engineer the options that created those possibilities?

In the example above, there are 221,184 = 2¹³ × 3³ possible answers, and so most likely there are 13 questions that have a yes or no answer, and 3 questions that have 3 possible answers each. But there are other possibilities in principle, a lot of other possibilities.

Maybe there are 15 questions: 1 with 4 possible answers, 12 with 2 possible answers, and 3 with 3 possible answers. Or maybe there are again 15 questions, but one of the questions has 9 answers. If there are 14 questions, maybe one question has 8 answers, or 27 answers, or maybe two questions have 6 answers, … it gets complicated.

More general problem

How would you address this kind of problem in general?

Suppose there is some list of questions Q1, Q2, …, Qk where there are qi possible answers to question Qi. Then the product of all the qi equals the total number of possible responses P to the questions. Note that we don’t know the number of questions k; all we know is the product P. Let’s call the ordered list {q1, q2, …, qk} of number of responses to questions a survey.

There are as many possible surveys as there are multiplicative partitions of P. The number of possible surveys only depends on the exponents in the prime factorization of P, not on the prime factors themselves. It would make no difference if in the example above we had 5¹³ × 7³ possibilities rather than 2¹³ × 3³ possibilities. We’d reason the same way when counting the possible surveys. The number of questions is somewhere between 1 and the sum of the exponents in the prime factorization.

We have to decide how we want to count surveys. For example, if we do have 13 questions with binary answers and 3 questions with trinary answers, does it matter which questions have three answers? If it does, we multiply the number of possibilities by the binomial coefficient C(16, 3) because we choose which of the 16 total questions have three possible answers. And then we have to reason about the case of 15 questions, 14 questions, etc.

Multiplicative partitions are complicated, and it’s not possible to write down simple expressions for the number of multiplicative partitions of numbers with many prime factors. But in the special case of only two distinct prime factors, i.e.

P = p_1^i \, p_2^j

then there is a relatively simple expression for the total number of possible surveys, assuming order matters [1]:

2^{i+j-1} \sum_{k=0}^j {i \choose k}{j \choose k} 2^{-k}

This works out to 3,760,128 in the example above. Curiously,

3760128 = 221184 × 17.

But note that our result does not depend on the primes in the factorization of P, only on their exponents. There are no p’s in the summation above, only i’s and j’s. As we noted earlier, we’d come to the same count if we’d started with P = 5¹³ × 7³. The factors of 2 and 3 in our count arise for reasons other than being the primes in the factorization of P.
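
As a sanity check, here’s a short Python sketch of my own. It assumes, as above, that a survey is an ordered factorization of P into factors that are all at least 2, and it compares a brute force count against the closed form from [1].

    from functools import lru_cache
    from math import comb

    def closed_form(i, j):
        # 2^(i+j-1) * sum over k of C(i,k) C(j,k) / 2^k, computed with exact integers
        return sum(comb(i, k) * comb(j, k) * 2**(i + j - 1 - k) for k in range(j + 1))

    @lru_cache(maxsize=None)
    def surveys(n):
        # number of ordered factorizations of n into factors >= 2
        if n == 1:
            return 1
        return sum(surveys(n // d) for d in range(2, n + 1) if n % d == 0)

    print(closed_form(13, 3))     # 3760128
    print(surveys(2**13 * 3**3))  # 3760128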


[1] A. Knopfmacher and M. E. Mays. A survey of factorization counting functions, International Journal of Number Theory. Vol. 01, No. 04, pp. 563-581 (2005)

Design of experiments and design theory

Design of experiments is a branch of statistics, and design theory is a branch of combinatorics, and yet they overlap quite a bit.

It’s hard to say precisely what design theory is, but it’s concerned with whether objects can be arranged in certain ways, and if so, how many ways this can be done. Design theory is pure mathematics, but it is of interest to people working in areas of applied mathematics such as coding theory and statistics.

Here’s a recap of posts I’ve written recently related to design of experiments and design theory.

Design of Experiments

A few weeks ago I wrote about fractional factorial design. Then later I wrote about response surface models. Then a diagram of a central composite design, a popular design in response surface methodology, was one of the diagrams in a post I wrote about visually similar diagrams from separate areas of application.

I wrote two posts about pitfalls with A/B testing. One shows how play-the-winner sequential testing has the same problems as Condorcet’s voter paradox, with the order of the tests potentially determining the final winner. More seriously, A/B testing cannot detect interaction effects which may be critical.

ANSI and Military Standards

There are several civilian and military standards related to design of experiments. The first of these was MIL-STD-105. The US military has retired this standard in favor of the civilian standard ASQ/ANSI Z1.4 which is virtually identical.

Similarly, the US military standard MIL-STD-414 was replaced by the very similar civilian standard ASQ/ANSI Z1.9. This post looks at the mean-range method for estimating variation which these two standards reference.

Design Theory

I wrote a couple posts on Room squares, one on Room squares in general and one on Thomas Room’s original design now known as a Room square. Room squares are used in tournament designs.

I wrote a couple posts about Costas arrays, an introduction and a post on creating Costas arrays in Mathematica.

Latin Squares

Latin squares and Greco-Latin squares are part of design theory and part of design of experiments. Here are several posts on Latin and Greco-Latin squares.

The Chicken McNugget Monoid

When McDonald’s first introduced Chicken McNuggets, you could buy McNuggets in boxes of 6, 9, or 20. The Chicken McNugget problem is to determine which numbers of McNuggets you can and cannot buy. A number n is a McNugget number if it is possible to buy exactly that many McNuggets (using the original boxes).

Box of 6 Chicken McNuggets

There are only finitely many non-McNugget numbers, and these are listed in OEIS sequence A065003.

Note that if you order eight or more boxes of any one size, you get at least 8 × 6 = 48 nuggets. So to find every McNugget number less than 48 it’s enough to let the number of boxes of each size range from 0 to 7, which is what the following program does.

    # totals from 0 to 7 boxes of each size, keeping those that are at most 48
    ns = set()
    for i in range(8):
        for j in range(8):
            for k in range(8):
                n = 6*i + 9*j + 20*k
                if n <= 48:
                    ns.add(n)

    # the non-McNugget numbers less than 48
    comp = set(range(48)) - ns

The non-McNugget numbers less than 48 are stored in comp, the set complement of ns. These numbers are

1, 2, 3, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 22, 23, 25, 28, 31, 34, 37, 43.

It turns out 48 and 49 are also McNugget numbers. You can verify this by changing “48” to “50” in the code above and looking at ns. This means that 44, 45, 46, 47, 48, and 49, a run of six consecutive numbers, are all McNugget numbers, and so you can add boxes of 6 to these numbers to get any greater number. This shows that 43 is the largest non-McNugget number.
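
Here’s a quick direct check of that run, a small variation on the code above with larger bounds.

    # check that 44 through 49 are all McNugget numbers
    ns = set()
    for i in range(10):
        for j in range(10):
            for k in range(4):
                ns.add(6*i + 9*j + 20*k)

    print(all(n in ns for n in range(44, 50)))  # True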

The set of McNugget numbers contains 0, my favorite number of McNuggets, and is closed under addition, and so it forms a monoid.

Update: The next post generalizes this one, giving a little theory and more general code.

Source: Factoring in the Chicken McNugget monoid. arXiv:1709.01606

Efficiently testing a black box

Suppose you have a black box that takes three bits as input and produces one bit as output. You could think of the input bits as positions of toggle switches, and the output bit as a light attached to the box that is either on or off.

Full factorial design

Now suppose that only one combination of 3 bits produces a successful output. There’s one way to set the switches that makes the light turn on. You can find the successful input by brute force if you test all 2³ = 8 possible inputs. In statistical lingo, you are conducting an experiment with a factorial design, i.e. you test all combinations of inputs.

See the Python code below for a text version of this design

In the chart above, each row is an experimental run and each column is a bit. I used − and + rather than 0 and 1 because it is conventional in this topic to use a minus sign to indicate that a factor is not present and a plus sign to indicate that it is present.
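To make the brute force search concrete, here’s a small sketch in Python with a made-up black box, using 0s and 1s rather than − and +; which combination succeeds is arbitrary.

    from itertools import product

    def light(bits):
        # hypothetical black box: exactly one setting turns the light on
        return bits == (1, 0, 1)

    # full factorial design: test every combination of the three bits
    for bits in product([0, 1], repeat=3):
        if light(bits):
            print("light turns on at", bits)
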

Fractional factorial design

Now suppose your black box takes 4 bits as inputs, but only 3 of them matter. One of the bits does nothing, but you don’t know which bit that is. You could use a factorial design again, testing all 2⁴ = 16 possible inputs. But there’s a more clever approach that requires only 8 guesses. In statistical jargon, this is a fractional factorial design.

See the Python code below for a text version of this design

No matter which three bits the output depends on, all combinations of those three bits will be tested. Said another way, if you delete any one of the four columns, the remaining columns contain all combinations of those bits.

Replications

Now suppose your black box takes 8 bits. Again only 3 bits matter, but you don’t know which 3. How many runs do you need to do to be certain of finding the successful combination of 3 bits? It’s clear that you need at least 8 runs: if you know that the first three bits are the important ones, for example, you still need 8 runs. And it’s also clear that you could go back to brute force and try all 2⁸ = 256 possible inputs, but the example above raises your hopes that you could get by with fewer than 256 runs. Could you still get by with 8? That’s too much to hope for, but you could use only 16 runs.

    --------
    +----+++
    -+--+-++
    ++--++--
    --+-+++-
    +-+-+--+
    -++--+-+
    +++---+-
    ---+++-+
    +--++-+-
    -+-+-++-
    ++-+---+
    --++--++
    +-++-+--
    -++++---
    ++++++++

Note that since this design works for 8 factors, it also works for fewer factors. If you had 5, 6, or 7 factors, you could use the first 5, 6, or 7 columns of the design above.

This design has some redundancy: every combination of three particular bits is tested twice. This is unfortunate in our setting because we are assuming the black box is deterministic: the right combination of switches will always turn the light on. But what if the right combination of switches only probably turns on the light? Then redundancy is a good thing. If there’s an 80% chance that the right combination will work, then there’s a 96% chance that at least one of the two tests of the right combination will work.

Fractional factorial experiment designs are usually used with the assumption that there are random effects, and so redundancy is a good thing.

You want to test each main effect, i.e. each single bit, and interaction effects, i.e. combinations of bits, such as pairs of bits or triples of bits. But you assume that not all possible interactions are important; otherwise you’d need a full factorial design. You typically hit diminishing returns with interactions quickly: pairs of effects are often important, combinations of three effects are less common, and rarely would an experiment consider fourth order interactions.

If only main effects and pairs of main effects matter, and you have a moderately large number of factors n, a fractional factorial design can let you use far fewer than 2ⁿ runs while giving you as many replications of main and interaction effects as you want.

Verification

The following Python code verifies that the designs above have the claimed properties.

    import numpy as np
    from itertools import combinations
    
    def verify(matrix, k):
        "verify that every choice of k columns has 2^k unique rows"
        nrows, ncols = matrix.shape
        for cols in combinations(range(ncols), k):
            s = set()
            for r in range(nrows):
                s.add(tuple(matrix[r, c] for c in cols))
            if len(s) != 2**k:
                print("problem with columns", cols)
                print("number of unique elements:", len(s))
                print("should be", 2**k)
                return
        print("pass")
    
    m = [
        [-1, -1, -1, -1],
        [-1, -1, +1, +1],
        [-1, +1, -1, +1],
        [-1, +1, +1, -1],
        [+1, -1, -1, +1],
        [+1, -1, +1, -1],
        [+1, +1, -1, -1],
        [+1, +1, +1, +1]
    ]
    
    verify(np.matrix(m), 3)
    
    m = [
        [-1, -1, -1, -1, -1, -1, -1, -1],
        [+1, -1, -1, -1, -1, +1, +1, +1],
        [-1, +1, -1, -1, +1, -1, +1, +1],
        [+1, +1, -1, -1, +1, +1, -1, -1],
        [-1, -1, +1, -1, +1, +1, +1, -1],
        [+1, -1, +1, -1, +1, -1, -1, +1],
        [-1, +1, +1, -1, -1, +1, -1, +1],
        [+1, +1, +1, -1, -1, -1, +1, -1],
        [-1, -1, -1, +1, +1, +1, -1, +1],
        [+1, -1, -1, +1, +1, -1, +1, -1],
        [-1, +1, -1, +1, -1, +1, +1, -1],
        [+1, +1, -1, +1, -1, -1, -1, +1],
        [-1, -1, +1, +1, -1, -1, +1, +1],
        [+1, -1, +1, +1, -1, +1, -1, -1],
        [-1, +1, +1, +1, +1, -1, -1, -1],
        [+1, +1, +1, +1, +1, +1, +1, +1],
    ]
    
    verify(np.matrix(m), 3)


The original Room square

A few days ago I wrote about Room squares, squares named after Thomas Room. This post will be about Room’s original square.

You could think of a Room square as a tournament design in which the rows represent locations and the columns represent rounds (or vice versa). Every team plays every other team exactly once, and only one pair of teams can play each other at the same location in the same round.

Here is an image created from Room’s original square using the visualization approach from the earlier post.

This is not a trivial variation on the square in the earlier post. Notice that this square has a block of four contiguous squares in the middle row and middle column. The previous image does not.

Here is Room’s original square, presented as in his half-page announcement in [1] with the rows and columns labeled with binary numbers.

A plain text version of the image above is available below [2].

Room points out an interesting pattern: a cell is filled in if and only if the “dot product” of its coordinates is odd. Think of each coordinate label as a three-dimensional vector of bits and take the dot product of the vector representing the row with the vector representing the column.

Room didn’t refer to dot products. In his words

(αβγ, λμν) are the coordinates of one of the points marked rs if and only if αλ + βμ + γν is odd.
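
Here’s a quick Python check of that parity rule; it reproduces the pattern of filled cells in the table in [2] below.

    # a cell at (row, column) is filled iff the bitwise dot product
    # of the 3-bit row and column labels is odd
    def filled(row, col):
        return bin(row & col).count("1") % 2 == 1

    for row in range(1, 8):
        # prints an x for each filled cell, matching the table below
        print("".join("x" if filled(row, col) else "." for col in range(1, 8)))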


[1] T. G. Room. A new type of magic square. Mathematical Gazette. Volume 39 (1955), p. 307.

[2] Here is Room’s square as a plain text (org-mode) table.

|-----+-----+-----+-----+-----+-----+-----+-----|
| 001 |  12 |     |  34 |     |  56 |     |  78 |
| 010 |     |  37 |  25 |     |     |  48 |  16 |
| 011 |  47 |  15 |     |     |  38 |  26 |     |
| 100 |     |     |     |  68 |  14 |  57 |  23 |
| 101 |  58 |     |  67 |  24 |     |  13 |     |
| 110 |     |  46 |  18 |  35 |  27 |     |     |
| 111 |  36 |  28 |     |  17 |     |     |  45 |
|-----+-----+-----+-----+-----+-----+-----+-----|
|     | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|-----+-----+-----+-----+-----+-----+-----+-----|

Costas arrays

The famous n queens problem is to find a way to position n queens on an n×n chessboard so that no queen attacks any other. That is, no two queens can be in the same row, the same column, or on the same diagonal. Here’s an example solution:

Costas arrays

In this post we’re going to look at a similar problem, weakening one requirement and adding another. First we’re going to remove the requirement that no two pieces can be on the same diagonal. This turns the n queens problem into the n rooks problem because rooks cannot move diagonally.

Next, imagine running stiff wires from every rook to every other rook. We will require that no two wires have the same length and run in the same direction. In more mathematical terms, we require that the displacement vectors between the rooks are all distinct.

A solution to this problem is called a Costas array.  In 1965, J. P. Costas invented what are now called Costas arrays to solve a problem in radar signal processing.

Why do we call a solution a Costas array rather than a Costas matrix? Because a matrix solution can be described by recording for each column the row number of the occupied cell. For example, we could describe the eight queens solution above as

(2, 4, 6, 8, 3, 1, 7, 5).

Here I’m numbering rows from the bottom up, like points in the first quadrant, rather than top to bottom as one does with matrices.

Note that the solution to the eight queens problem above is not a Costas array because some of the displacement vectors between queens are the same. For example, if we number the queens from left to right, the displacement vector between the first and second queens is the same as the displacement vector between the second and third queens.

Visualizing a Costas array

Here’s a visualization of a Costas array of order 5.

It’s clear from the plot that no two dots are in the same row or column, but it’s not obvious that all the connecting lines are different. To see that the lines are different, we move all the tails of the connecting lines to the origin, keeping their length and direction the same.

There are 10 colored lines in the first plot, but at first you may only see 9 in the second plot. That’s because two of the lines have the same direction but different length. If you look closely, you’ll see that there’s a short purple line on top of the long red line. That is, one line runs from the origin to (1, -1) and another from the origin to (4, -4).

Here is a visualization of a Costas array of order 9.

And here are its displacement vectors translated to the origin.

Generating Costas arrays

There are algorithms for generating some Costas arrays, but not all. Every known algorithm [1] leaves out some solutions, and it is not known whether Costas arrays exist for some values of n.

The Costas arrays above were generated using the Lempel construction algorithm. Given a prime p [2] and a primitive root mod p [3], the following Python code will generate a Costas array of order p – 2.

    p = 11 # prime
    a = 2 # primitive element

    # verify a is a primitive element mod p
    s = {a**i % p for i in range(p)}
    assert( len(s) == p-1 )

    n = p-2
    dots = []

    for i in range(1, n+1):
        for j in range(1, n+1):
            if (pow(a, i, p) + pow(a, j, p)) % p == 1:
                dots.append((i,j))
                break
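
And here’s a quick check, continuing the code above, that the construction really produces a Costas array, and that the eight queens solution earlier in the post does not.

    from itertools import combinations

    def is_costas(points):
        # the points are listed left to right, so each pair's displacement
        # vector appears once; a Costas array requires them all to be distinct
        vecs = [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in combinations(points, 2)]
        return len(vecs) == len(set(vecs))

    print(is_costas(dots))   # True

    queens = list(enumerate([2, 4, 6, 8, 3, 1, 7, 5], start=1))
    print(is_costas(queens)) # False: repeated displacement vectors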


[1] Strictly speaking, no scalable algorithm will enumerate all Costas arrays. You could enumerate all permutation matrices of order n and test whether each is a Costas array, but this requires generating and testing n! matrices and so is completely impractical for moderately large n.

[2] More generally, the Lempel algorithm can generate a solution of order q − 2 where q is a prime power. The code above only works for primes, not prime powers. For prime powers, you have to work with finite fields of order q and the code would be more complicated.

[3] A primitive root for a finite field of order q is a generator of the multiplicative group of the field, i.e. an element x such that every non-zero element of the field is some power of x.

Balanced tournament designs

Suppose you have an even number of teams that you’d like to schedule in a Round Robin tournament. This means each team plays every other team exactly once.

Denote the number of teams as 2n. You’d like each team to play in each round, so you need n locations for the games to be played.

Each game will choose 2 teams from the set of 2n teams, so the number of games will be n(2n − 1). And this means you’ll need 2n − 1 rounds. A tournament design will be a grid with n rows, one for each location, and 2n − 1 columns, one for each round.

Let’s pause there for just a second and make sure this is possible. We can certainly assign n(2n − 1) games to a grid of size n by 2n − 1. But can we do it in such a way that no team needs to be in two locations at the same time? It turns out we can.

Now we’d like to introduce one more requirement: we’d like to balance the games over the locations. By that we mean we’d like a team to play no more than twice in the same location. A team has to play at least once in each location, but not every team can play twice in each location. If every team played once in each location, we’d have n² games, and if every team played twice in each location we’d have 2n² games, but

n² < n(2n − 1) < 2n²

for n greater than 1. If n = 1, this is all trivial, because in that case we would have two teams. They simply play each other and that’s the tournament!

We can’t have each team play exactly the same number of times in each location, but can we have each team play either once or twice in each location? If so, we call that a balanced tournament design.

Now the question becomes for which values of n can we have a balanced tournament design involving 2n teams. This is called a BTD(n) design.

We said above that this is possible if n = 1: two teams just play each other somewhere. For n = 2, it is not possible to have a balanced tournament design: some team will have to play all three games in the same location.

It is possible to have a balanced tournament design for n = 3. Here’s a solution:

{{af, ce, bc, de, bd}, {be, df, ad, ac, cf}, {cd, ab, ef, bf, ae}}

In fact this is the only solution, aside from relabeling the teams. That is, given any balanced tournament design involving six teams, there is a way to label the teams so that you have the design above. Said another way, there is exactly one design in BTD(3) up to isomorphism.
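
Here’s a quick Python check of the three conditions for the grid above; the code is just my own sketch of the definitions, not anything from [1].

    from itertools import combinations

    design = [["af", "ce", "bc", "de", "bd"],
              ["be", "df", "ad", "ac", "cf"],
              ["cd", "ab", "ef", "bf", "ae"]]
    teams = "abcdef"

    # every pair of teams plays exactly once
    games = [g for row in design for g in row]
    assert sorted(games) == [a + b for a, b in combinations(teams, 2)]

    # every team plays exactly once in each round (column)
    for rnd in zip(*design):
        assert sorted("".join(rnd)) == list(teams)

    # every team plays once or twice at each location (row)
    for row in design:
        assert all("".join(row).count(t) in (1, 2) for t in teams)

    print("all checks pass")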

There are balanced tournament designs for all positive n except for n = 2. And in general there are a lot of them. There are 47 designs in BTD(4) up to isomorphism [1].


[1] The CRC Handbook of Combinatorial Designs. Colbourn and Dinitz editors. CRC Press, 1996.


Generating functions for polynomial sequences

The previous post looked at a generating function for a specific polynomial sequence. This post will look at generating functions for polynomial sequences in general. (There’s an alternating term in the previous post that isn’t polynomial, but we’ll address that too.)

The starting point for this post is a simple observation:

x\left(\frac{d}{dx} x^n \right) = n x^n

If we let xD be the operator that differentiates a function then multiplies the result by x, we have

(xD) x^n = n x^n

We can apply xD m times, each time multiplying xⁿ by a factor of n.

(xD)^m x^n = n^m x^n

And more generally, for any polynomial p(x) we have

p(xD) x^n = p(n) x^n

Now let S be a set of integers and form a generating function F(x) by summing xⁿ over n in S.

F(x) = \sum_{n \in S} x^n

Then we have

\begin{align*} p(xD)F(x) &= p(xD)\sum_S x^n \\ &= \sum_S p(xD)x^n \\ &= \sum_S p(n)x^n \end{align*}

In words, multiplying the nth term of a generating function by p(n) is the same as operating on the generating function by p(xD).

Example

The previous post computed the generating function of

Z_n = \frac{(-1)^n(3n+6) + 2n^3 + 12n^2 + 25n - 6}{12}

using Mathematica. Here we will compute the generating function again using the result derived below.

Earlier we computed

g(x) = \sum_{n=1}^\infty Z_nx^n

by summing over the positive integers. But Zn is not quite a polynomial function of n. Aside from the alternating term it is a cubic polynomial in n. The alternating term is a polynomial in n if we restrict ourselves to even values of n, and it is another polynomial if we restrict ourselves to odd values of n.

Define

\begin{align*} F_1(n) &= \frac{2n^3 + 12n^2 + 25n - 6}{12} \\ F_2(n) &= \frac{3n+6}{12} \\ F_3(n) &= -\frac{3n+6}{12} \\ \end{align*}

Then we have

\sum_{n} Z_nx^n = \sum_{n} F_1(n)x^n + \sum_{n \text{ even}} F_2(n)x^n + \sum_{n \text{ odd}} F_3(n)x^n

for positive integer n, splitting our original generating function into three generating functions, each summed over a different set of integers.

Define

\begin{align*} g_1(x) &= \sum_{n \ge 1} x^n = \frac{x}{1-x}\\ g_2(x) &= \sum_{n \ge 1,\, n \text{ even}} x^n = \frac{x^2}{1 - x^2}\\ g_3(x) &= \sum_{n \ge 1,\, n \text{ odd}} x^n = \frac{x}{1 - x^2}\\ \end{align*}

Then

g(x) = F_1(xD)g_1(x) + F_2(xD)g_2(x) + F_3(xD)g_3(x)

If we expand the line above, we should get the same expression for g(x) as in the previous post.
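
Here’s one way to carry out that check with SymPy, comparing series coefficients rather than expanding symbolically. This is a sketch of my own; the coefficient lists below are just the polynomials F1, F2, and F3 written in ascending powers of n.

    from sympy import symbols, diff, Rational, series

    x = symbols('x')

    def xD(f):
        # differentiate, then multiply by x
        return x*diff(f, x)

    def apply_poly(coeffs, f):
        # apply p(xD) to f, where coeffs are the coefficients of p in ascending order
        total = 0
        for c in coeffs:
            total += c*f
            f = xD(f)
        return total

    F1 = [Rational(-6, 12), Rational(25, 12), Rational(12, 12), Rational(2, 12)]
    F2 = [Rational(6, 12), Rational(3, 12)]
    F3 = [Rational(-6, 12), Rational(-3, 12)]

    g1 = x/(1 - x)
    g2 = x**2/(1 - x**2)
    g3 = x/(1 - x**2)

    g = apply_poly(F1, g1) + apply_poly(F2, g2) + apply_poly(F3, g3)
    gs = series(g, x, 0, 10).removeO()

    for n in range(1, 10):
        Zn = Rational((-1)**n*(3*n + 6) + 2*n**3 + 12*n**2 + 25*n - 6, 12)
        assert gs.coeff(x, n) == Zn
    print("coefficients match")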

Sampling with replacement until you’ve seen everything

Suppose you have a standard deck of 52 cards. You pull out a card, put it back in the deck, shuffle, and pull out another card. How long would you expect to do this until you’ve seen every card?

Here’s a variation on the same problem. Suppose you’re a park ranger keeping data on tagged wolves in a national park. Each day you catch a wolf at random, collect some data, and release it. If there are 100 wolves in the park, how long until you’ve seen every wolf at least once?

We’ll solve this problem via simulation, then exactly.

Simulation

The Python code to solve the problem via simulation is straightforward. Here we’ll simulate our ranger catching and releasing one of 100 wolves. We’ll repeat our simulation five times to get an idea of how much the results vary.

    import numpy as np
    from numpy.random import default_rng

    rng = default_rng()
    num_runs = 5
    N = 100 # number of unique items

    for run in range(num_runs):
        tally = np.zeros(N, dtype=int)  # times each wolf has been caught
        trials = 0
        while min(tally) == 0:          # until every wolf has been caught at least once
            tally[rng.integers(N)] += 1
            trials += 1
        print(trials)

When I ran this, I got 846, 842, 676, 398, and 420.

Exact solution

Suppose we have a deck of N cards, and we sample the deck with replacement M times. We can select the first card N ways. Ditto for the second, third, and all the rest of the cards. So there are N^M possible sequences of card draws.

How many of these sequences include each card at least once? To answer the question, we look at Richard Stanley’s twelvefold way. We’re looking for the number of ways to allocate M labeled balls into N labeled urns such that each urn contains at least one ball. The answer is

N! S(M, N)

where S(M, N) is the Stirling number of the second kind with parameters M and N [1].

So the probability of having seen each card at least once after selecting M cards randomly with replacement is

N! S(M, N) / N^M.

Computing Stirling numbers

We’ve reduced our problem to the problem of computing Stirling numbers (of the second kind), so how do we do that?

We can compute Stirling numbers in terms of binomial coefficients as follows.

S(n,k) = \frac{1}{k!}\sum_{i = 0}^k (-1)^i {k \choose i} (k-i)^n

Now we have a practical problem if we want to use this formula for larger values of n and k. If we work with floating point numbers, we’re likely to run into overflow. This is a perennial problem with probability calculations. You may need to compute a moderate-sized number as the ratio of two enormous numbers. Even though the final result is between 0 and 1, the intermediate results may be too large to store in a floating point number. And even if we don’t overflow, we may lose precision because we’re working with an alternating sum.

The usual way to deal with overflow is to work on a log scale. But that won’t work for Stirling numbers because we’re adding terms together, not multiplying them.

One way to solve our problem—not the most efficient way but a way that will work—is to work with integers. This is easy in Python because integers by default can be arbitrarily large.

There are two ways to compute binomial coefficients in Python: scipy.special.binom and math.comb. The former uses floating point operations and the latter uses integers, so we need to use the latter.

We can compute the numerator of our probability with

    from math import comb
    def k_factorial_stirling(n, k):
        return sum((-1)**i * comb(k, i)*(k-i)**n for i in range(k+1))    

Then our probability is

    k_factorial_stirling(M, N) / N**M

If we compute the probability that our ranger has seen all 100 wolves after catching and releasing 500 times, we get 0.5116. So the median number of catch and release cycles is somewhere around 500, which is consistent with our simulation above.
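
Putting the pieces together for the numbers quoted above (M = 500 draws, N = 100 wolves):

    M, N = 500, 100
    print(k_factorial_stirling(M, N) / N**M)  # about 0.5116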

Note that the numerator and denominator in our calculation are both on the order of 10^1000 while the largest floating point number is on the order of 10^308, so we would have overflowed if we’d used binom instead of comb.


[1] It’s unfortunate that these were called the “second kind” because they’re the kind that come up most often in application.

Image: “Fastening a Collar” by USFWS Mountain Prairie is licensed under CC BY 2.0 .