Looking at Your Data

What to do first after scoping out and starting a data science project?

I’ve started an unsupervised learning project based on textual data. The first thing I like to do is actually look at the data. Is it noisy? What are the features, and is complex feature engineering needed? How heterogeneous is the data? What generalization and overfitting challenges might arise?

Analysis can take many forms: looking directly at the numbers, using visualization tools, opening the data in an Excel spreadsheet or a Jupyter notebook with Matplotlib, or computing various statistics on the whole dataset or portions of it.

Some may believe this step is unimportant: just throw a barrage of classification or regression methods at the data and treat the data as a black box. Of course testing a suite of ML methods is not a bad thing. But I can’t imagine not using every avenue available, including looking at the data. I’m certainly not alone in this view (see for example here, here, and here).

I spent a few hours developing a simple custom data viewer for my problem that colored different parts of the textual data to give insight as to what was going on. I used ChatGPT to develop parts of this tool; some of it was incorrect and needed fixing, but having at least a draft of the code definitely saved time. Seeing the actual data in person was insightful and generated ideas for solving the problem.
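
As a purely hypothetical sketch of what such a viewer might look like (not the tool I actually built), a few lines of Python can highlight chosen patterns in each text record with ANSI colors, which is often enough to start seeing structure. The patterns below are placeholders; substitute whatever matters in your data.

    import re

    # Hypothetical patterns to highlight; replace with whatever matters in your data.
    PATTERNS = {
        r"\d+": "\033[93m",        # numbers in yellow
        r"[A-Z]{2,}": "\033[96m",  # all-caps tokens in cyan
    }
    RESET = "\033[0m"

    def colorize(text: str) -> str:
        # Wrap each pattern match in an ANSI color code for terminal display.
        for pattern, color in PATTERNS.items():
            text = re.sub(pattern, lambda m, c=color: f"{c}{m.group(0)}{RESET}", text)
        return text

    if __name__ == "__main__":
        for record in ["Sensor 17 reported ERROR at 03:14", "All 9 units OK"]:
            print(colorize(record))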

While inspecting the data can help identify issues, it also risks biasing the modeling process by imposing assumptions that a flexible model might otherwise uncover on its own. One must also beware of data leakage. That being said—in general I think understanding as much as you can about the data is not a bad thing.

Practical consequences of tokenization details

I recently ran across the article Something weird is happening with LLMs and chess. One of the things it mentions is how a minor variation in a prompt can have a large impact on the ability of an LLM to play chess.

One extremely strange thing I noticed was that if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1. e4 e5 2.” (without a space) and let the model generate the space itself. Huh?

The author goes on to explain that tokenization probably explains the difference. The intent is to get the LLM to predict the next move, but the extra space confuses the model because it is tokenized differently than the spaces in front of the e’s. The trailing space is tokenized as an individual character, but the spaces in front of the e’s are tokenized with the e’s. I wrote about this a couple days ago in the post The difference between tokens and words.

For example, ChatGPT will tokenize “hello world” as [15339, 1917] and “world hello” as [14957, 24748]. The difference is that the first string is parsed as “hello” and “ world” while the latter is parsed as “world” and “ hello”. Note the spaces attached to the second word in each case.

The previous post was about how ChatGPT tokenizes individual Unicode characters. It mentions UTF-16, which is itself an example of how tokenization matters. The string “UTF-16” will be represented by three tokens, one each for “UTF”, “-”, and “16”. But the string “UTF16” will be represented by two tokens, one for “UTF” and one for “16”. The string “UTF16” might be more likely to be interpreted as a unit, a Unicode encoding.
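
Here is a minimal sketch, using OpenAI’s tiktoken library (the tokenizer discussed in the next post, assuming you have it installed), that lets you see these effects directly; decoding each token individually shows where the spaces and hyphens end up.

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    for s in ["1. e4 e5 2.", "1. e4 e5 2. ", "UTF-16", "UTF16"]:
        tokens = enc.encode(s)
        pieces = [enc.decode([t]) for t in tokens]  # text fragment behind each token
        print(repr(s), tokens, pieces)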

ChatGPT tokens and Unicode

I mentioned in the previous post that not every Unicode character corresponds to a token in ChatGPT. Specifically, I’m looking at gpt-3.5-turbo in tiktoken. There are 100,256 possible tokens and 155,063 Unicode characters, so the pigeonhole principle says not every character corresponds to a token.

I was curious about the relationship between tokens and Unicode so I looked into it a little further.

Low codes

The Unicode characters U+D800 through U+DFFF all map to a single token, 5809. This is because these are not really characters per se but “surrogates,” code points that are used in pairs to represent other code points [1]. They don’t make sense in isolation.

The character U+FFFD, the replacement character �, also corresponds to 5809. It’s also not a character per se but a way to signal that another character is not valid.

Aside from the surrogates and the replacement character, every Unicode character in the BMP, i.e. characters up to U+FFFF, has a unique representation in tokens. However, most require two or three tokens. For example, the snowman character ☃ is represented by two tokens: [18107, 225].

Note that this discussion is about single characters, not words. As the previous post describes, many words are tokenized as entire words, or broken down into units larger than single characters.

High codes

The rest of the Unicode characters, those outside the BMP, all have unique token representations. Of these, 3,404 are represented by a single token, but the rest require 2, 3, or 4 tokens. The rocket emoji, U+1F680, for example, is represented by three tokens: [9468, 248, 222].
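
A brute-force sketch of this count, again assuming tiktoken: loop over the code points above the BMP and tally how many tokens each one needs. It is slow but simple, and it loops over all code points, assigned or not, so the totals will not exactly match counts of assigned characters; the text above reports 3,404 single-token characters.

    import tiktoken
    from collections import Counter

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    # Tokens per code point above the BMP (U+10000 through U+10FFFF).
    counts = Counter(len(enc.encode(chr(cp))) for cp in range(0x10000, 0x110000))
    print(counts)                    # distribution of tokens per code point

    print(enc.encode("\U0001F680"))  # rocket emoji, three tokens per the text above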

[1] Unicode was originally limited to 16 bits, and UTF-16 represented each character with a 16-bit integer. When Unicode expanded beyond 2^16 characters, UTF-16 used pairs of surrogates, one high surrogate and one low surrogate, to represent code points higher than U+FFFF.

The difference between tokens and words

Large language models operate on tokens, not words, though tokens roughly correspond to words.

Operating on a list of words would not be practical: there is no definitive list of all English words, much less of all words in all languages. Tokens, on the other hand, correspond roughly to words while being more flexible.

Words are typically turned into tokens using BPE (byte pair encoding). There are multiple implementations of this algorithm, giving different tokenizations. Here I use the tokenizer for the gpt-3.5-turbo model, the tokenizer used by GPT-3.5 and GPT-4.

Hello world!

If we look at the sentence “Hello world!” we see that it turns into three tokens: 9906, 1917, and 0. These correspond to “Hello”, ” world”, and “!”.

In this example, each token corresponds to a word or punctuation mark, but there’s a little more going on. While it is true that 0 is simply the token for the exclamation mark—we’ll explain why in a moment—it’s not quite true to say that 9906 is the token for “Hello” and 1917 is the token for “world”.

Many to one

In fact 1917 is the token for ” world”. Note the leading space. The token 1917 represents the word “world,” not capitalized and not at the beginning of a sentence. At the beginning of a sentence, “World” would be tokenized as 10343. So one word may correspond to several different tokens, depending on how the word is used.
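
A quick way to see this many-to-one behavior, assuming the tiktoken package; the token IDs noted in the comments are the ones quoted above.

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    print(enc.encode("Hello world!"))  # [9906, 1917, 0] per the text above
    print(enc.encode(" world"))        # mid-sentence form of "world"
    print(enc.encode("World"))         # capitalized, sentence-initial form
    print(enc.encode("world"))         # lowercase, no leading space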

One to many

It’s also true that a word may be broken into several tokens. Consider the sentence “Chuck Mangione plays the flugelhorn.” This sentence turns into 9 tokens, corresponding to

“Chuck”, “Mang”, “ione”, ” plays”, ” fl”, “ug”, “el”, “horn”, “.”

So while there is a token for the common name “Chuck”, there is no token for the less common name “Mangione”. And while there is a single token for “ trumpet”, there is no token for the less common “flugelhorn”.
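
To see how a word gets split, you can decode each token individually. A small sketch, again assuming tiktoken:

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    tokens = enc.encode("Chuck Mangione plays the flugelhorn.")
    print(len(tokens))                                       # 9 per the text above
    print([enc.decode_single_token_bytes(t) for t in tokens])  # raw bytes behind each token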

Characters

The tokenizer will break words down as far as necessary to represent them, down to single letters if need be.

Every ASCII character can be represented as a single token, as can many Unicode characters. (There are 100,256 total tokens, but currently 154,998 Unicode characters, so not all Unicode characters can be represented as tokens.)

Update: The next post dives into the details of how Unicode characters are handled.

The first 32 ASCII characters, codes 0 through 31, are non-printable control characters, and ASCII character 32 is a space. So the exclamation point is the first printable, non-space character, with ASCII code 33. The printable ASCII characters are tokenized as their ASCII value minus 33. So, for example, the letter A, ASCII 65, is tokenized as 65 − 33 = 32.
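
A three-line check of this pattern with tiktoken, printing the claimed token value alongside what the tokenizer returns:

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    for c in "!A~":
        # Per the pattern above, each printable character should map to ord(c) - 33.
        print(c, ord(c), enc.encode(c), ord(c) - 33)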

Tokenizing a dictionary

I ran every line of the american-english word list on my Linux box through the tokenizer, excluding possessives. There are 6,015 words that correspond to a single token, 37,012 that require two tokens, 26,283 that require three tokens, and so on. The maximum was a single word, netzahualcoyotl, that required 8 tokens.

The 6,015 words that correspond to a single token are the most common words in English, and so quite often a token does represent a word. (And maybe a little more, such as whether the word is capitalized.)
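
A sketch of this experiment follows. The word list path is an assumption (the usual location of the american-english list on Debian-style systems), and possessives are skipped as described above.

    import tiktoken
    from collections import Counter

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    counts = Counter()

    # Assumed path to the word list; adjust for your system.
    with open("/usr/share/dict/american-english") as f:
        for line in f:
            word = line.strip()
            if not word or "'" in word:  # skip blanks and possessives
                continue
            counts[len(enc.encode(word))] += 1

    for n in sorted(counts):
        print(f"{counts[n]} words require {n} token(s)")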

A simpler GELU activation function approximation

The GELU (Gaussian Error Linear Units) activation function was proposed in [1]. This function is x Φ(x) where Φ is the CDF of a standard normal random variable. As you might guess, the motivation for the function involves probability. See [1] for details.

The GELU function is not too far from the more familiar ReLU, but it has advantages that we won’t get into here. In this post I wanted to look at approximations to the GELU function.

Since an implementation of Φ is not always available, the authors provide the following approximation:

\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} (x + 0.044715x^3) \right) \right)

I wrote about a similar but simpler approximation for Φ a while back, namely Φ(x) ≈ (1 + tanh 0.8x)/2, and multiplying by x gives the approximation

\text{GELU}(x) \approx 0.5x(1 + \tanh 0.8x)

The approximation in [1] is more accurate, though the difference between the exact values of GELU(x) and those of the simpler approximation is hard to see in a plot.

Since model weights are not usually needed to high precision, the simpler approximation may be indistinguishable in practice from the more accurate approximation.
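
As a sanity check, here is a short sketch comparing the exact GELU with the two approximations at a few points; the values are close enough that, as noted above, the curves are hard to tell apart in a plot.

    import math

    def gelu_exact(x):
        # GELU(x) = x * Phi(x), where Phi is the standard normal CDF
        return x * 0.5 * (1 + math.erf(x / math.sqrt(2)))

    def gelu_paper(x):
        # Approximation from Hendrycks & Gimpel [1]
        return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

    def gelu_simple(x):
        # Simpler approximation: 0.5 x (1 + tanh 0.8x)
        return 0.5 * x * (1 + math.tanh(0.8 * x))

    for x in [-3, -1, -0.5, 0, 0.5, 1, 3]:
        print(f"{x:5.1f}  {gelu_exact(x):9.5f}  {gelu_paper(x):9.5f}  {gelu_simple(x):9.5f}")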

[1] Dan Hendrycks, Kevin Gimpel. Gaussian Error Linear Units (GELUs). Available on arXiv.

Can AI Models Reason: Is Data All You Need?

Many are voicing concern that the world is running out of data and that this will be a blocker to progress toward smarter AI models. One paper in fact projects timelines for when we will run out.

AI researchers are looking for ways to adapt.  Nvidia has trained a specific model to generate synthetic data for training other models. Some use this approach, though using AI-generated data to train AI is not without risk.

Others have asked a bigger question: is something fundamentally missing in an approach that relies so heavily on data? Certainly the bitter lesson thesis and the position long advocated by Geoffrey Hinton argue for a data-first approach with as few prior assumptions as possible (though every model has some bias).

But it’s currently simply unknown whether just adding more data and compute will do the trick for achieving general intelligence or whether something else is needed. Neurosymbolic approaches are being experimented with, in various forms. But it’s unclear whether these can scale up to the level needed. And the frontier labs, laser-focused on the current paradigm, may not have adequate time or resources to investigate high-risk/high-reward alternatives.

From a theoretical standpoint, sometimes more data is simply not enough. As discussed in a previous post, some problems in mathematics and engineering require exponentially large amounts of data to train neural network models. Exponentials can work in your favor, but they can also work against you (think of the Tower of Hanoi problem or the Wheat and Chessboard problem). Some problems, for certain models, could not be solved with all the data available in the entire universe.

The requirements for solving these problems can grow much more quickly than expected. The strength of neural networks, their flexibility, their universal approximation property, can also be a weakness: it can take an enormous amount of data to nail down all the parameters so that the model is completely error free. Thankfully, many other problems that people want to solve (such as modeling human language) are fundamentally lower dimensional and thus less vulnerable to this problem.

We just don’t know whether the current data-hungry approach will be enough—or whether we’ll need to learn another bitter lesson.

Can AI models reason like a human?

We’re awaiting the release of OpenAI’s o3 model later this month. Its performance is impressive on very hard benchmarks like SWE-bench Verified, Frontier Math and the ARC AGI benchmark (discussed previously in this blog).

And yet at the same time some behaviors of the frontier AI models are very concerning.

Their performance on assorted math exams is outstanding, but they make mistakes doing simple arithmetic, like wrongly multiplying numbers that are a few digits long. Performance of the o1 preview model on the difficult Putnam math exam is excellent but drops precipitously under simple changes like renaming constants and variables in the problem statement.

Similarly, when o1 is applied to a planning benchmark expressed in standardized language, it performs well, but accuracy falls apart when it is applied to a mathematically equivalent planning problem in a different domain. Likewise, a given AI model applied to the simple ROT13 cipher can have wildly different performance depending on the value of the cipher key, suggesting the models don’t really understand the algorithm.

It was the best of times, it was the worst of times, . . .

What is going on here?

For years now, some have made claims of “human-level performance” for various deep learning algorithms. And as soon as one party starts making claims like this, it’s hard for the others to resist doing the same.

The confusion is that, from a certain point of view, the claim of “human-level” is true—but the definition of “human-level” is fraught.

Here, “human-level” is taken to mean achievement of some high score on a benchmark set, ostensibly exceeding some human performance measure on the same benchmark. However, a single AI model can vary wildly in capability across behaviors—“smart” compared to humans in some ways, “dumb” in others.

For humans, test-taking is a proxy for measuring a range of skills and abilities. And even for humans it is not always an accurate proxy. A person can perform very well on academic tests and very poorly on the job, or vice versa.

And the capability ratios for AI models are very different still, in ways we don’t fully understand. So, outscoring humans on a software engineering benchmark doesn’t mean the AI has the whole panoply of coding skills, decision-making abilities, software architecture design savvy, etc., needed to be a competent software engineer.

It’s no surprise that recent articles (below) show a growing perception of the limitations of AI benchmarks as currently conceived.

Ways forward

Perhaps we should consider developing requirements like the following before claiming human-level reasoning performance of an AI model:

  • It should be able to “explain its work” at any level of detail to another human (just like a human can), in a way that that human can understand.
  • It should be able to give answers without “hallucinating” or “confabulating” (yes, humans can hallucinate too, but most occupations would not be well-served by an employee who hallucinates on the job).
  • It should be able to reliably and consistently (100% of the time) do things that we routinely expect a human or computer to do accurately (like add or multiply two numbers accurately, for things like filling out tax returns or doing engineering calculations to build an airplane).
  • It should be frank and honest in assessing its level of certainty about an answer it gives (no gaslighting).
  • It should be able to solve a trivial perturbation of a given problem with the same ease as the original problem (to the same extent that a human can).
  • As someone has said, it should be able to do, without specific training, what a 5 year old can do without specific training.
  • This one sounds good, from Emmett Shear: “AGI [artificial general intelligence] is the ability to generalize [without special training by a human] to an adversarially chosen new benchmark.”

AI models are fantastic and amazing tools—and best used when one has eyes wide open about their limitations.

Have you had problems with AI model performance? If so, please share in the comments.

References

Rethinking AI benchmarks: A new paper challenges the status quo of evaluating artificial intelligence, https://venturebeat.com/ai/rethinking-ai-benchmarks-a-new-paper-challenges-the-status-quo-of-evaluating-artificial-intelligence/

Rethink reporting of evaluation results in AI, https://www.science.org/doi/10.1126/science.adf6369, https://eprints.whiterose.ac.uk/198211/

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence, https://arxiv.org/abs/2402.09880

Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless, https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless

Why we must rethink AI benchmarks, https://bdtechtalks.com/2021/12/06/ai-benchmarks-limitations/

AI and the Everything in the Whole Wide World Benchmark, https://arxiv.org/abs/2111.15366

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices, https://arxiv.org/abs/2411.12990

Goodhart’s Law states that when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.

Can AI models reason: Just a stochastic parrot?

OpenAI has just released its full o1 model—a new kind of model that is more capable of multi-step reasoning than previous models. Anthropic, Google and others are no doubt working on similar products. At the same time, it’s hotly debated in many quarters whether AI models actually “reason” in a way similar to humans.

Emily Bender and her colleagues famously described large language models as nothing more than “stochastic parrots”—systems that simply repeat their training data blindly, based on a statistical model, with no real understanding (reminiscent of the Chinese Room thought experiment). Others have made similar comments, describing LLMs as “n-gram models on steroids” or a “fancy extrapolation algorithm.”

There is of course some truth to this. AI models sometimes generate remarkable results, yet they lack certain basic aspects of understanding that might otherwise keep them from generating nonsensical results. More to the point of “parroting” the training data, recent work from Yejin Choi’s group has shown how LLMs at times will cut and paste snippets from their various training documents, almost verbatim, to formulate their outputs.

Are LLMs (just) glorified information retrieval tools?

The implication of these concerns is that an LLM can “only” repeat back what it was taught (albeit with errors). However, this view does not align with the evidence. LLM training is a compression process in which new connections between pieces of information are formed, connections that were not present in the original data. This is evidenced both mathematically and anecdotally. In my own experience, I’ve gotten valid answers to obscure and detailed technical questions that I find hard to believe exist in any training data in exactly that form. Whether you would call this “reasoning” might be open to debate, but whatever you call it, it is something more than the unadorned information retrieval of a “stochastic parrot.”

What is your experience? Let us know in the comments.