Power method and centrality

A few days ago I wrote about eigenvector centrality, a way of computing which nodes in a network are most important. Rather than simply looking for the most highly connected nodes, it looks for nodes that are highly connected to nodes that are highly connected. It’s the idea behind Google’s PageRank algorithm.

Adjacency matrices

One way to capture the structure of a graph is its adjacency matrix A. Each element aij of this matrix equals 1 if there is an edge between the ith and jth node and a 0 otherwise. If you square the matrix A, the (i, j) entry of the result tells you how many paths of length 2 are between the ith and jth nodes. In general, the (i, j) entry of An tells you how many paths of length n there are between the corresponding nodes.

Power method

Calculating eigenvector centrality requires finding an eigenvector associated with the largest eigenvalue of A. One way to find such an eigenvector is the power method. You start with a random initial vector and repeatedly multiply it by A. This produces a sequence of vectors that converges to the eigenvector we’re after.

Conceptually this is the same as computing An first and multiplying it by the random initial vector. So not only does the matrix An count paths of length n, for large n it helps us compute eigenvector centrality.

Theoretical details

Now for a little fine print. The power method will converge for any starting vector that has some component in the direction of the eigenvector you’re trying to find. This is almost certainly the case if you start with a vector chosen at random. The power method also requires that the matrix has a single eigenvector of largest magnitude and that its associated eigenspace have dimension 1. The post on eigenvector centrality stated that these conditions hold, provided the network is connected.

Computational practicalities

In principle, you could use calculate eigenvector centrality by computing An for some large n. In practice, you’d never do that. For a square matrix of size N, a matrix-vector product takes O(N²) operations, whereas squaring A requires O(N³) operations. So you would repeatedly apply A to a vector rather than computing powers of A.

Also, you wouldn’t use the power method unless A is sparse, making it relatively efficient to multiply by A. If most of the entries of A are zeros, and there is an exploitable pattern to where the non-zero elements are located, you can multiply A by a vector with far less than N² operations.

The search for the perfect prompt


Anyone with more than casual experience with ChatGPT knows that prompt engineering is a thing. Minor or even trivial changes in a chatbot prompt can have significant effects, sometimes even dramatic ones, on the output [1]. For simple requests it may not make much difference, but for detailed requests it could matter a lot.

Industry leaders said they thought this would be a temporary limitation. But we are now a year and a half into the GPT-4 era, and it’s still a problem. And since the number of possible prompts has scaling that is exponential in the prompt length, it can sometimes be hard to find a good prompt given the task.

One proposed solution is to use search procedures to automate the prompt optimization / prompt refinement process. Given a base large language model (LLM) and an input (a prompt specification, commonly with a set of prompt/answer pair samples for training), a search algorithm seeks the best form of a prompt to use to elicit the desired answer from the LLM.

This approach is sometimes touted [2] as a possible solution to the problem. However, it is not without  limitations.

A main one is cost. With this approach, one search for a good prompt can take many, many trial-and-error invocations of the LLM, with cost measured in dollars compared to the fraction of a cent cost of a single token of a single prompt. I know of one report of someone who does LLM prompting with such a tool full time for his job, at cost of about $1,000/month (though, for certain kinds of task, one might alternatively seek a good prompt “template” and reuse that across many near-identical queries, to save costs).

This being said, it would seem that for now (depending on budget) our best option for difficult prompting problems is to use search-based prompt refinement methods. Various new tools have come come out recently (for example, [3-6]). The following is a report on some of my (very preliminary) experiences with a couple of these tools.

PromptAgent

The first is PromptAgent [5]. It’s a research code available on GitHub. The method is based on Monte Carlo tree search (MCTS), which tries out multiple chains of modification of a seed prompt and pursues the most promising. MCTS can be a powerful method, being part of the AlphaGo breakthrough result in 2016.

I ran one of the PromptAgent test problems using GPT-4/GPT-3.5 and interrupted it after it rang up a couple of dollars in charges. Looking at the logs, I was somewhat amazed that it generated long detailed prompts that included instructions to the model for what to pay close attention to, what to look out for, and what mistakes to avoid—presumably based on inspecting previous trial prompts generated by the code.

Unfortunately, PromptAgent is a research code and not fully productized, so it would take some work to adapt to a specific user problem.

DSPy

DSPy [6] on the other hand is a finished product available for general users. DSPy is getting some attention lately not only as a prompt optimizer but also more generally as a tool for orchestrating multiple LLMs as agents. There is not much by way of simple examples for how to use the code. The web site does have an AI chatbot that can generate sample code, but the code it generated for me required significant work to get it to behave properly.

I ran with the MIPRO optimizer which is most well-suited to prompt optimization. My experience with running the code was that it generated many random prompt variations but did not do in-depth prompt modifications like PromptAgent. PromptAgent does one thing, prompt refinement, and must do it well, unlike DSPy which has multiple uses. DSPy would be well-served to have implemented more powerful prompt refinement algorithms.

Conclusion

I would wholeheartedly agree that it doesn’t seem right for an LLM would be so dependent on the wording of a prompt. Hopefully, future LLMs, with training on more data and other improvements, will do a better job without need for such lengthly trial-and-error processes.

References

[1]  “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting,” https://openreview.net/forum?id=RIu5lyNXjT

[2] “AI Prompt Engineering Is Dead” (https://spectrum.ieee.org/prompt-engineering-is-dead, March 6, 2024

[3]  “Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing,” https://openreview.net/forum?id=OXv0zQ1umU

[4] “Large Language Models as Optimizers,” https://openreview.net/forum?id=Bb4VGOWELI

[5] “PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization,” https://openreview.net/forum?id=22pyNMuIoa

[6] “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines,” https://openreview.net/forum?id=sY5N0zY5Od

 

Eigenvector centrality

A basic question to ask about a network is which nodes are most important. A simple way of answering this question would be to say the most well-connected nodes, i.e. the nodes with the most edges. This approach is known as degree centrality. It’s not a bad place to start. It’s easy to understand and easy to compute. It may be adequate for some purposes, but it’s often helpful to take a more sophisticated approach known as eigenvector centrality.

Examples

Let’s look at a few motivating examples before we get into the details.

Social network marketing

If you wanted to advertise something, say a book you’ve written, and you’re hoping people will promote it on Twitter. Would you rather get a shout out from someone with more followers or someone with less followers? All else being equal, more followers is better. But even better would be a shout out from someone whose followers have a lot of followers.

Deans and flight attendants

Suppose you’re at a graduation ceremony. Your mind starts to wander after the first few hundred people walk across the stage, and you start to think about how a cold might spread through the crowd. The dean handing out diplomas could be a superspreader because he’s shaking hands with everyone as they receive their diplomas. But an inconspicuous parent in the audience may also be a superspreader because she’s a flight attendant and will be in a different city every day for the next few days. And not only is she a traveler, she’s in contact with travelers.

Search engines

Ranking web pages according to the number of inbound links they have was a good idea because this takes advantage of revealed preferences: instead of asking people what pages they recommend, you can observe what pages they implicitly recommend by linking to them. An even better idea was Page Rank, weighing inbound links by how many links the linking pages have.

Eigenvector centrality

The idea of eigenvector centrality is to give each node a rank proportional to the sum of the ranks of the adjacent nodes. This may seem circular, and it kinda is.

To know the rank of a node, you have to know the ranks of the nodes linking to it. But to know their ranks, you have to know the ranks of the nodes linking to them, etc. There is no sequential solution to the problem because you’d end up in an infinite regress, even for a finite network. But their is a simultaneous solution, considering all pages at once.

You want to assign to the ith node a rank xi proportional to the sum of the ranks of all adjacent nodes. A more convenient way to express this to compute the weighted sum of the ranks of all nodes, with weight 1 for adjacent nodes and weight 0 for non-adjacent nodes. That is, you want to compute

x_i = \frac{1}{\lambda} \sum a_{ij} x_k

where the a‘s are the components of the adjacency matrix A. Here aij equals 1 if there is an edge between nodes i and j and it equals 0 otherwise.

This is equivalent to looking for solutions to the system of equations

Ax = \lambda x

where A is the adjacency matrix and x is the vector of node ranks. If there are N nodes, then A is an N × N matrix and x is a column vector of length N.

In linear algebra terminology, x is an eigenvector of the adjacency matrix A, hence the name eigenvector centrality.

Mathematical questions

There are several mathematical details to be concerned about. Does the eigenvalue problem defining x have a solution? Is the solution unique (up to scalar multiples)? What about the eigenvalue λ? Does the adjacency matrix have multiple eigenvalues, which would mean there are multiple eigenvectors?

If A is the adjacency matrix for a connected graph, then there is a unique eigenvalue λ of largest magnitude, there is a unique corresponding eigenvector x, and all the components of x are non-negative. It is often said that this is a result of the Perron–Frobenius theorem, but there’s a little more to it than that because you need the hypothesis that the graph is connected.

The matrix A is non-negative, but not necessarily positive: some entries may be zero. But if the graph is connected, there’s a path between any node i and another node j, which means for some power of A, the ij entry of A is not zero. So although A is not necessarily positive, some power of A is positive. This puts us in the positive case of the Perron–Frobenius theorem, which is nicer than the non-negative case.

This section has discussed graphs, but Page Rank applies to directed graphs because links have a direction. But similar theorems hold for directed graphs.

Related posts

Fitting a line without an intercept term

The other day I was looking at how many lumens LED lights put out per watt. I found some data on Wikipedia, and as you might expect the relation between watts and lumens is basically linear, though not quite.

If you were to do a linear regression on the data you’d get a relation

lumens = a × watts + b

where the intercept term b is not zero. But this doesn’t make sense: a light bulb that is turned off doesn’t produce light, and it certainly doesn’t produce negative light. [1]

You may be able fit the regression and ignore b; it’s probably small. But what if you wanted to require that b = 0? Some regression software will allow you to specify zero intercept, and some will not. But it’s easy enough to compute the slope a without using any regression software.

Let x be the vector of input data, the wattage of the LED bulbs. And let y be the corresponding light output in lumens. The regression line uses the slope a that minimizes

(a xy)² = a² x · x − 2a x · y + y · y.

Setting the derivative with respect to a to zero shows

a = x · y / x · x

Now there’s more to regression than just line fitting. A proper regression analysis would look at residuals, confidence intervals, etc. But the calculation above was good enough to conclude that LED lights put out about 100 lumens per watt.

It’s interesting that making the model more realistic, i.e. requiring b = 0, is either a complication or a simplification, depending on your perspective. It complicates using software, but it simplifies the math.

Related posts

[1] The orange line in the image above is the least squares fit for the model y = ax, but it’s not quite the same line you’d get if you fit the model y = ax + b.

Solutions to tan(x) = x

I read something recently that said in passing that the solutions to the equation tan x = x are the zeros of the Bessel function J3/2. That brought two questions to mind. First, where have I seen the equation tan x = x before? And second, why should its solutions be the roots of a Bessel function.

The answer to the first question is that I wrote about the local maxima of the sinc function three years ago. That post shows that the derivative of the sinc function sin(x)/x is zero if and only if x is a fixed point of the tangent function.

As for why that should be connected to zeros a Bessel function, that one’s pretty easy. In general, Bessel functions cannot be expressed in terms of elementary functions. But the Bessel functions whose order is an integer plus ½ can.

For integer n,

J_{n+{\frac{1}{2}}}(x)= (-1)^n \left(\frac{2}{\pi}\right)^{\frac{1}{2}} x^{n+{\frac{1}{2}}} \left(\frac{1}{x}\frac{d}{dx}\right)^n\left(\frac{\sin x}{x}\right)

So when n = 1, we’ve got the derivative of sinc right there in the definition.

Computing logarithms of complex numbers

The previous post showed how to compute logarithms using tables. It gives an example of calculating a logarithm to 15 figures precision using tables that only allow 4 figures of precision for inputs.

Not only can you bootstrap tables to calculate logarithms of real numbers not given in the tables, you can also bootstrap a table of logarithms and a table of arctangents to calculate logarithms of complex numbers.

One of the examples in Abramowitz and Stegun (Example 7, page 90) is to compute log(2 + 3i). How could you do that with tables? Or with a programming language that doesn’t support complex numbers?

What does this even mean?

Now we have to be a little careful about what we mean by the logarithm of a complex number.

In the context of real numbers, the logarithm of a real number x is the real number y such that ey = x. This equation has a unique solution if x is positive and no solution otherwise.

In the context of complex numbers, a logarithm of the complex number z is any complex number w such that ew = z. This equation has no solution if z = 0, and it has infinitely many solutions otherwise: for any solution w, w + 2nπi is also a solution for all integers n.

Solution

If you write the complex number z in polar form

z = r eiθ

then

log(z) = log(r) + iθ.

The proof is immediate:

elog(r) + iθ = elog(r) eiθ = r eiθ.

So computing the logarithm of a complex number boils down to computing its magnitude r and its argument θ.

The equation defining a logarithm has a unique solution if we make a branch cut along the negative real axis and restrict θ to be in the range −π < θ ≤ π. This is called the principal branch of log, sometimes written Log. As far as I know, every programming language that supports complex logarithms uses the principal branch implicitly. For example, in Python (NumPy), log(x) computes the principal branch of the log function.

Example

Going back to the example mentioned above,

log(2 + 3i) = log( √(2² + 3²) ) + arctan(3/2) = ½ log(13) + arctan(3/2) i.

This could easily be computed by looking up the logarithm of 13 and the arc tangent of 3/2.

The exercise in A&S actually asks the reader to calculate log(±2 ± 3i). The reason for the variety of signs is to require the reader to pick the value of θ that lies in the range −π < θ ≤ π. For example,

log(−2 + 3i) =  = ½ log(13) + (π − arctan(3/2)) i.

Using a table of logarithms

My favorite quote from Richard Feynman is his remark that “nearly everything is really interesting if you go into it deeply enough.” This post will look at something that seems utterly trivial—looking up numbers in a table—and show that there’s much more to it when you dig a little deeper.

More than just looking up numbers

Before calculators were common, function values would be looked up in a table. For example, here is a piece of a table of logarithms from Abramowitz and Stegun, affectionately known as A&S.

But you wouldn’t just “look up” logarithm values. If you needed to know the value of a logarithm at a point where it is explicitly tabulated, then yes, you’d simply look it up. If you wanted to know the log of 1.754, then there it is in the table. But what if, for example, you wanted to know the log of 1.7543?

Notice that function values are given to 15 significant figures but input values are only given to four significant figures. If you wanted 15 sig figs in your output, presumably you’d want to specify your input to 15 sig figs as well. Or maybe you only needed 10 figures of precision, in which case you could ignore the rightmost column of decimal places in the table, but you still can’t directly specify input values to 10 figures.

Lagrange interpolation

If you go to the bottom of the column of A&S in the image above, you see this:

What’s the meaning of the mysterious square bracket expression? It’s telling you that for the input values in the range of this column, i.e. between 1.750 and 1.800, the error using linear interpolation will be less than 4 × 10−8, and that if you want full precision, i.e. 15 sig figs, then you’ll need to use Lagrange interpolation with 5 points.

So going back to the example of wanting to know the value of log(1,7543), we could calculate it using

0.7 × log(1.754) + 0.3 × log(1.755)

and expect the error to be less than 4 × 10−8.

We can confirm this with a little Python code.

>>> from math import log
>>> exact = log(1.7543)
>>> approx = 0.7*log(1.754) + 0.3*log(1.755)
>>> exact - approx
3.411265947494968e-08

Python uses double precision arithmetic, which is accurate to between 15 and 16 figures—more on that here—and so the function calls above are essentially the same as the tabulated values.

Now suppose we want the value of x = 1.75430123456789. The hint in square brackets says we should use Lagrange interpolation at five points, centered at the nearest tabulated value to x. That is, we’ll use the values of log at 1.752, 1.753, 1.754, 1.755, and 1.756 to compute the value of log(x).

Here’s the Lagrange interpolation formula, given in A&S as equation 25.2.15.

We illustrate this with the following Python code.

def interpolate(fs, p, h):
    s = (p**2 - 1)*p*(p-2)*fs[0]/24
    s -= (p - 1)*p*(p**2 - 4)*fs[1]/6
    s += (p**2 - 1)*(p**2 - 4)*fs[2]/4
    s -= (p + 1)*p*(p**2 - 4)*fs[3]/6
    s += (p**2 - 1)*p*(p + 2)*fs[4]/24
    return s

xs = np.linspace(1.752, 1.756, 5)
fs = np.log(xs)
h = 0.001
x = 1.75430123456789
p = (x - 1.754)/h

print(interpolate(fs, p, h))
print(np.log(x))

This prints

0.5620706206909348
0.5620706206909349

confirming that the interpolated value is indeed accurate to 15 figures.

Lagrange interpolation takes a lot of work to carry out by hand, and so sometimes you might use other techniques, such as transforming your calculation into one for which a Taylor series approximation converges quickly. In any case, sophisticated use of numerical tables was not simply a matter of looking things up.

Contemporary applications

A book of numerical tables enables you to do calculations without a computer. More than that, understanding how to do calculations without a computer helps you program calculations with a computer. Computers have to evaluate functions somehow, and one way is interpolating tabulated values.

For example, you could think of a digital image as a numerical table, the values of some ideal analog image sampled at discrete points. The screenshots above are interpolated: the HTML specifies the width to be less than that of the original screenshots,. You’re not seeing the original image; you’re seeing a new image that your computer has created for you using interpolation.

Interpolation is a kind of compression. A&S would be 100 billion times larger if it tabulated functions at 15 figure inputs. Instead, it tabulated functions for 4 figure inputs and gives you a recipe (Lagrange interpolation) for evaluating the functions at 15 figure inputs if you desire. This is a very common pattern. An SVG image, for example, does not tell you pixel values, but gives you equations for calculating pixel values at whatever scale is needed.

Related posts

Duct tape value creation

Excerpt from from John Carmack’s review of the book Bullshit Jobs.

He talks about how software developers bemoan duct taping systems together, and would rather work on core technologies. He thinks it is some tragic failure, that if only wise system design was employed, you wouldn’t be doing all the duct taping.

Wrong.

Every expansion in capabilities opens up the opportunity to duct tape it to new areas, and this is where a lot of value creation happens. Eventually, when a sufficient amount of duct tape is found in an area, it is an opportunity for systemic redesigns, but you don’t wait for that before grabbing newly visible low hanging fruit!

The realistic alternative to duct tape and other aesthetically disappointing code is often no code.

Related posts

Condition on your data

Suppose you design an experiment, an A/B test of two page designs, randomizing visitors to Design A or Design B. You planned to run the test for 800 visitors and you calculated some confidence level α for your experiment.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Common sense says it matters what you saw. If 148 people were randomized to Design A, and every single one of them bought your product, while 10 out of the 152 people randomized to Design B bought your product, common sense says you should call Design A the winner and push it into production ASAP.

But what if you saw somewhat better results for Design A? You can have some confidence that Design A is better, though not as much as the nominal confidence level of the full experiment. This is what your (frequentist) statistician was trying to protect you from.

When your statistician designed your experiment, he obviously didn’t know what data you’d see, so he designed a process that would be reliable in a certain sense. When you looked at the data early, you violated the process, and so now your actual practice no longer has the probability of success initially calculated.

But you don’t care about the process; you want to know whether to deploy Design A or Design B. And you saw the data that you saw. Particularly in the case where the results were lopsidedly in favor of Design A, your gut tells you that you know what to do next. You might reasonably say “I get what you’re saying about repeated experiments and all that. (OK, not really, but let’s say I do.) But look what happened? Design A is a runaway success!”

In formal terms, your common sense is telling you to condition on the observed data. If you’ve never studied Bayesian statistics you may not know exactly what that means and how to calculate it, but it’s intuitively what you’ve done. You’re making a decision based on what you actually saw, not on the basis of a hypothetical sequence of experiments you didn’t run and won’t run.

Bayesian statistics does formally what your intuition does informally. This is important because even though your intuition is a good guide in extreme cases, it can go wrong when things are less obvious. As I wrote about recently, smart people can get probability very wrong, even when their intuition is correct in some instances.

If you’d like help designing experiments or understanding their results, we can help.

Can you look at experimental results along the way or not?

A B testing

Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

Of course there’s nothing morally or legally wrong with looking at interim data, but is there anything statistically wrong? That depends on what you do with your observation.

There are basically two statistical camps, Frequentist and Bayesian. (There are others, but these two account for the lion’s share.) Frequentists say no, you should not look at interim data, unless you apply something called alpha spending. Bayesians, on the other hand, say go ahead. Why shouldn’t you look at your data? I remember one Bayesian colleague mocking alpha spending as being embarrassing.

So who is right, the Frequentists or the Bayesians? Both are right, given their respective criteria.

If you want to achieve success, as defined by Frequentists, you have to play by Frequentist rules, alpha spending and all. Suppose you design a hypothesis test to have a confidence level α, and you look at the data midway through an experiment. If the results are conclusive at the midpoint, you stop early. This procedure does not have a confidence level α. You would have to require stronger evidence for early stopping, as specified by alpha spending.

The Bayesian interprets the data differently. This approach says to quantify what is believed before conducting the experiment in the form of a prior probability distribution. Then after each data point comes in, you update your prior distribution using Bayes’ theorem to produce a posterior distribution, which becomes your new prior distribution. That’s it. At every point along the way, this distribution captures all that is known about what you’re testing. Your planned sample size is irrelevant, and the fact that you’ve looked at your data is irrelevant.

Now regarding your A/B test, why are you looking at the data early? If it’s simply out of curiosity, and cannot effect your actions, then it doesn’t matter. But if you act on your observation, you change the Frequentist operating characteristics of your experiment.

Stepping back a little, why are you conducting your test in the first place? If you want to make the correct decision with probability 1 − α in an infinite sequences of experiments, then you need to take into account alpha spending, or else you will lower that probability.

But if you don’t care about a hypothetical infinite sequence of experiments, you may find the Bayesian approach more intuitive. What do you know right at this point in the experiment? It’s all encapsulated in the posterior distribution.

Suppose your experiment size is a substantial portion of your potential visitors. You want to experiment with graphics, yes, but you also want to make money along the way. You’re ultimately concerned with profit, not publishing a scientific paper on your results. Then you could use a Bayesian approach to maximize your expected profit. This leads to things like “bandits,” so called by analogy to slot machines (“one-armed bandits”).

If you want to keep things simple, and if the experiment size is negligible relative to your expected number of visitors, just design your experiment according to Frequentist principles and don’t look at your data early.

But if you have good business reasons to want to look at the data early, not simply to satisfy curiosity, then you should probably interpret the interim data from a Bayesian perspective. As the next post explains, the Bayesian approach aligns well with common sense.

I’d recommend taking either a Frequentist approach or a Bayesian approach, but not complicating things by hybrid approaches such as alpha spending or designing Bayesian experiments to have desired Frequentist operating characteristics. The middle ground is more complicated and prone to subtle mistakes, though we can help you navigate this middle ground if you need to.

If you need help designing, conducting, or interpreting experiments, we can help. If you want/need to look at interim results, we can show you how to do it the right way.

Related posts