Natural language processing represents words as high-dimensional vectors, on the order of 100 dimensions. For example, the glove-wiki-gigaword-50
set of word vectors contains 50-dimensional vectors, and the glove-wiki-gigaword-200
set of word vectors contains 200-dimensional vectors.
The intent is to represent words in such a way that the angle between vectors is related to similarity between words. Closely related words would be represented by vectors that are close to parallel. On the other hand, words that are unrelated should have large angles between them. The metaphor of two independent things being orthogonal holds almost literally as we’ll illustrate below.
Cosine similarity
For vectors x and y in two dimensions,

x · y = ‖x‖ ‖y‖ cos θ

where θ is the angle between the vectors. In higher dimensions, this relation defines the angle θ in terms of the dot product and norms:

cos θ = x · y / (‖x‖ ‖y‖)
The right-hand side of this equation is the cosine similarity of x and y. NLP usually speaks of cosine similarity rather than θ, but you could always take the inverse cosine of cosine similarity to compute θ. Note that cos(0) = 1, so small angles correspond to large cosines.
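As a quick illustration of that correspondence, here is a tiny NumPy snippet (not part of the setup code below) converting a few cosine values back to angles in degrees.

    import numpy as np

    # cos(0) = 1, so a similarity near 1 means a small angle,
    # and a similarity near 0 means nearly orthogonal vectors.
    print(np.degrees(np.arccos(1.0)))   # 0 degrees
    print(np.degrees(np.arccos(0.5)))   # 60 degrees
    print(np.degrees(np.arccos(0.0)))   # 90 degrees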
Examples
For our examples we’ll use gensim with word vectors from the glove-twitter-200
model. As the name implies, this data set maps words to 200-dimensional vectors.
Note that word embeddings differ in the data they were trained on and the algorithm used to produce the vectors. The examples below could be very different using a different source of word vectors.
First some setup code.
    import numpy as np
    import gensim.downloader as api

    word_vectors = api.load("glove-twitter-200")

    def norm(word):
        # Euclidean norm of a word's vector
        v = word_vectors[word]
        return np.dot(v, v)**0.5

    def cosinesim(word0, word1):
        # cosine similarity between the vectors for two words
        v = word_vectors[word0]
        w = word_vectors[word1]
        return np.dot(v, w)/(norm(word0)*norm(word1))
Using this model, the cosine similarity between “dog” and “cat” is 0.832, which corresponds to about a 34° angle. The cosine similarity between “dog” and “wrench” is 0.145, which corresponds to an angle of 82°. A dog is more like a cat than like a wrench.
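For reference, here is a short sketch of how those numbers can be checked with the cosinesim function from the setup code above; the exact values may vary slightly with the gensim and model versions.

    import numpy as np

    # Values quoted above; treat them as approximate.
    print(cosinesim("dog", "cat"))                            # about 0.832
    print(np.degrees(np.arccos(cosinesim("dog", "cat"))))     # about 34 degrees
    print(cosinesim("dog", "wrench"))                         # about 0.145
    print(np.degrees(np.arccos(cosinesim("dog", "wrench"))))  # about 82 degrees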
The similarity between “dog” and “leash” is 0.487, not because a dog is like a leash, but because the word “leash” is often used in the same context as the word “dog.” The similarity between “cat” and “leash” is only 0.328 because people speaking of leashes are more likely to also be speaking about a dog than a cat.
The cosine similarity between “uranium” and “walnut” is only 0.0054, corresponding to an angle of 89.7°. The vectors associated with the two words are very nearly orthogonal because the words are orthogonal in the metaphorical sense.
Note that opposites are somewhat similar. Uranium is not the opposite of walnut because things have to have something in common to be opposites. The cosine similarity of “expensive” and “cheap” is 0.706. Both words are adjectives describing prices and so in some sense they’re similar, though they have opposite valence. “Expensive” has more in common with “cheap” than with “pumpkin” (similarity 0.192).
The similarity between “admiral” and “general” is 0.305, maybe less than you’d expect. But the word “general” is kinda general: it can be used in many contexts other than military rank. If you add the vectors for “army” and “general”, you get a vector that has cosine similarity 0.410 with “admiral.”
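The cosinesim function above only accepts words, so checking that vector sum requires a small helper that works on raw vectors. A minimal sketch, reusing the word_vectors object from the setup code:

    import numpy as np

    def cosine(v, w):
        # cosine similarity for raw vectors rather than words
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    combined = word_vectors["army"] + word_vectors["general"]
    print(cosine(combined, word_vectors["admiral"]))  # reported above as about 0.410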
Seems like you’d hope for the cosine similarity between “expensive” and “cheap” to be -1. They are one and the same concept seen from opposite directions. Since the vectors are derived empirically, you might have to put up with some inaccuracy, so that the cosine similarity might end up as -0.706. But it’s strange that it should be +0.706. Is there something restricting word vectors to a portion of space where their cosine similarities must be positive? (If not, …) What are some examples of vectors with an angle of more than 90 degrees?
It does seem at first that the vector for “cheap” might be the negative of the vector for “expensive.” But word vectors classify words by hundreds of criteria. Then these are projected down to a lower-dimensional space where the vectors are not so sparse.
“Cheap” and “expensive” are opposites as far as they are applied to prices. But they’re similar in many ways. They’re both adjectives. They’re both financial terms. They’re likely to be used in the same contexts, and often appear close together in text.
It’s hard to imagine one word being the opposite of another word by hundreds of criteria, or even by, say, five criteria. But you might ask for the interaction (say, the sum or dot product) with some other vector to be opposite. Say “expensive” and “cheap” are opposite when applied to prices.
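Purely as an illustration of that idea, one could compare how “expensive” and “cheap” project onto some price-related direction. Using the vector for “price” as that direction below is an assumption made for the sake of the sketch, not something the embedding itself singles out.

    import numpy as np

    # Hypothetical illustration: treat the (normalized) vector for "price"
    # as a stand-in for a price direction and compare the two projections.
    price = word_vectors["price"]
    price_dir = price / np.linalg.norm(price)

    print(np.dot(word_vectors["expensive"], price_dir))
    print(np.dot(word_vectors["cheap"], price_dir))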
My understanding is that people expect this to measure similarity/difference in “meaning”, and it doesn’t. It measures the degree to which two words are “used in the same context” among all their occurrences in their training corpus.
An example that really drives the difference home would be “stormy petrel” words – words that are only used in one specific phrase. Think of the word “shrift”. It will be very close to the word “short”, because “short shrift” is the only way it will appear.
There is an online game named “Semantle” that is based on these word vector similarities. I’ve found it very irritating to play precisely because the closeness does not correspond to similar meaning.
Those are good points. I’m intrigued by your idea of hoping that “cheap” and “expensive” will become opposites when we restrict ourselves to some particular conceptual space.
The obvious space that guarantees that that will be true is the line from one of those vectors to the other. (Parameterized as (“cheap” + “expensive”)/2 + t*(“cheap” – “expensive”), or similarly?)
Is there anything interesting we can say about words that exist in the space close to that line?
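Here is a rough sketch of one way to poke at that question, assuming the same glove-twitter-200 vectors as above and taking Euclidean distance to the line as the meaning of “close”; the probe words are arbitrary choices for illustration.

    import numpy as np
    import gensim.downloader as api

    word_vectors = api.load("glove-twitter-200")

    cheap = word_vectors["cheap"]
    expensive = word_vectors["expensive"]
    midpoint = (cheap + expensive) / 2
    direction = expensive - cheap
    direction = direction / np.linalg.norm(direction)

    def distance_to_line(word):
        # Distance from a word's vector to the line through "cheap" and
        # "expensive": remove the component along the line, keep the rest.
        v = word_vectors[word] - midpoint
        residual = v - np.dot(v, direction) * direction
        return np.linalg.norm(residual)

    # Arbitrary probe words; any that are missing from the vocabulary
    # would raise a KeyError.
    for word in ["pricey", "affordable", "free", "pumpkin"]:
        print(word, distance_to_line(word))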
And it seems like, from the way you describe it, words can be related (with “high” cosine similarity) or unrelated (with ~zero cosine similarity), but probably not antirelated. Do words with negative cosine similarity exist in the model?
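One way to check is sketched below, assuming gensim 4.x, where the loaded vectors expose the vocabulary as index_to_key. It only scans part of the vocabulary against one word, so finding no negatives this way wouldn’t prove they don’t exist.

    import numpy as np
    import gensim.downloader as api

    word_vectors = api.load("glove-twitter-200")

    def cosine(v, w):
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    target = word_vectors["expensive"]
    negatives = []
    for word in word_vectors.index_to_key[:20000]:  # sample of the vocabulary
        sim = cosine(target, word_vectors[word])
        if sim < 0:
            negatives.append((word, sim))

    # Most negative similarities found in the sample, if any.
    print(sorted(negatives, key=lambda pair: pair[1])[:10])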