Swish, mish, and serf are neural net activation functions. The names are fun to say, but more importantly the functions have been shown to improve neural network performance by mitigating the “dying ReLU problem.” This happens when a large number of nodes get stuck outputting zero during training and no longer contribute to the learning process. A node’s output may be “trying” to go negative, but the ReLU activation function won’t allow it. The swish family of activation functions allows outputs to dip below zero.
Softplus can also be used as an activation function, but our interest in softplus here is as part of the definition of mish and serf. Similarly, we’re interested in the sigmoid function here because it’s used in the definition of swish.
Softplus, softmax, softsign
Numerous functions in deep learning are named “soft” this or that. The general idea is to “soften” a non-differentiable function by approximating it with a smooth function.
The plus function is defined by

x+ = max(x, 0).
In machine learning the plus function is called ReLU (rectified linear unit), but x+ is conventional mathematical notation.
The softplus function approximates x+ by

softplus(x) = log(1 + exp(x)).
We can add a parameter k to the softplus function to control how soft it is:

softplus(x; k) = log(1 + exp(kx)) / k.
As the parameter k increases, softplus becomes “harder.” It more closely approximates x+, but at the cost of the second derivative at 0 getting larger.
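To make this concrete, here is a small sketch in Python (not from the original post; the name softplus_k, the test point, and the values of k are my choices) showing the parameterized softplus converging to x+ as k grows.

import numpy as np

def plus(x):
    # The plus function x+ = max(x, 0), i.e. ReLU.
    return np.maximum(x, 0)

def softplus_k(x, k=1.0):
    # Softplus with sharpness parameter k: log(1 + exp(k x)) / k.
    # Note: this naive form can overflow for large k*x; see the numerical section below.
    return np.log1p(np.exp(k * x)) / k

x = 0.5
for k in [1, 2, 4, 8, 16]:
    print(k, softplus_k(x, k))   # approaches plus(x) = 0.5 as k grows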
We won’t need other “soft” functions for the rest of the post, but I’ll mention a couple more while we’re here. Softmax is a smooth approximation to the max function,

softmax(x, y) = log(exp(x) + exp(y)),
and softsign is a smooth approximation to the sign function,

softsign(x) = x / (1 + |x|),
which is differentiable at the origin, despite the absolute value in the denominator. We could add a sharpness parameter k to the softmax and softsign functions like we did above.
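For what it’s worth, here is a sketch of those two as well (the names softmax2 and softsign are mine; note that this “softmax” is the soft maximum of two numbers, not the normalized exponential that usually goes by that name in machine learning).

import numpy as np

def softmax2(x, y):
    # Smooth approximation to max(x, y): log(exp(x) + exp(y)),
    # computed with logaddexp to avoid overflow.
    return np.logaddexp(x, y)

def softsign(x):
    # Smooth approximation to sign(x): x / (1 + |x|).
    return x / (1 + np.abs(x))

print(softmax2(3.0, 4.0))   # about 4.313, close to max(3, 4)
print(softsign(10.0))       # about 0.909, close to sign(10) = 1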
Sigmoid
The sigmoid function, a.k.a. the logistic curve, is defined by

σ(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x)).
The second and third expressions are clearly equal. I prefer to think in terms of the former, but the latter is better for numerical calculation because it won’t overflow for large x.
The sigmoid function is the derivative of the softplus function. We could define a sigmoid function with a sharpness parameter k by taking the derivative of the softplus function with sharpness parameter k.
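Here is a quick numerical check of that derivative relationship (a sketch; the test point and step size are arbitrary choices of mine): a centered finite difference of softplus should agree with the sigmoid.

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x, h = 0.7, 1e-5
finite_difference = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(finite_difference, sigmoid(x))   # both about 0.668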
Swish
The Swish function was defined in [1] as

swish(x) = x σ(x) = x / (1 + exp(-x)).
Like softplus, the Swish function is asymptotically 0 as x → −∞ and asymptotically equal to x as x → ∞. That is, for large |x|, swish(x) is approximately x+. However, unlike softplus, swish(x) is not monotone. It is negative for negative x, with a minimum value of about −0.278 attained near x = −1.278 [2]. The fact that the activation function can dip into negative territory is what addresses the “dying ReLU problem.”
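Here’s a quick check of that minimum with SciPy (a sketch, not part of the original post; the bracketing interval is my choice).

import numpy as np
from scipy.optimize import minimize_scalar

def swish(x):
    return x / (1 + np.exp(-x))   # x * sigmoid(x)

result = minimize_scalar(swish, bounds=(-5, 0), method="bounded")
print(result.x, result.fun)   # about -1.278 and -0.278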
Mish
The Mish function was defined in [3] as

mish(x) = x tanh(softplus(x)) = x tanh(log(1 + exp(x))).
The mish function has a lot in common with the swish function, but performs better, at least on the examples the author presents.
The mish function is said to be part of the swish function family, as is the serf function below.
Serf
The serf function [4] is another variation on the swish function with similar advantages. It is defined by

serf(x) = x erf(softplus(x)) = x erf(log(1 + exp(x))).
The name “serf” comes from “log-Softplus ERror activation Function” but I’d rather pronounce it “zerf” from “x erf” at the beginning of the definition.
The serf function is hard to distinguish visually from the mish function; I left it out of the plot above to keep the plot from being cluttered. Despite the serf and mish functions being close together, the authors of [4] say the former outperforms the latter on some problems.
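To see how close the three functions are numerically, here is a sketch evaluating swish, mish, and serf side by side with NumPy and SciPy (the sample points are mine).

import numpy as np
from scipy.special import erf

def softplus(x):
    return np.log1p(np.exp(x))

def swish(x):
    return x / (1 + np.exp(-x))

def mish(x):
    return x * np.tanh(softplus(x))

def serf(x):
    return x * erf(softplus(x))

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(swish(xs))
print(mish(xs))   # mish and serf track each other closely
print(serf(xs))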
Numerical evaluation
The most direct implementations of the softplus function can overflow, especially when implemented in low-precision floating point such as 16-bit floats. This means that the mish and serf functions could also overflow.
For large x, exp(x) could overflow, and so computing softplus as
log(1 + exp(x))
could overflow while the exact value would be well within the range of representable numbers. A simple way to fix this would be to have the softplus function return x when x is so large that exp(x) would overflow.
As mentioned above, the expression for the sigmoid function with exp(-x) in the denominator is better for numerical work than the version with exp(x) in the numerator.
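Here is one way to put these ideas into code (a sketch in NumPy, not necessarily how the author would implement it): logaddexp effectively returns x when exp(x) would overflow, and the sigmoid switches between the two algebraically equal forms so that the exponent is never large and positive.

import numpy as np

def softplus(x):
    # log(1 + exp(x)) = log(exp(0) + exp(x)), which logaddexp evaluates
    # without overflow: for large x it returns approximately x.
    return np.logaddexp(0.0, x)

def sigmoid(x):
    # Expects a NumPy array. Uses 1/(1 + exp(-x)) where x >= 0 and
    # exp(x)/(1 + exp(x)) where x < 0, so exp never sees a large positive argument.
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1 / (1 + np.exp(-x[pos]))
    out[~pos] = np.exp(x[~pos]) / (1 + np.exp(x[~pos]))
    return out

x = np.array([-1000.0, 0.0, 1000.0])
print(softplus(x))   # [0.  0.693...  1000.]
print(sigmoid(x))    # [0.  0.5  1.]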
[1] Swish: A Self-Gated Activation Function. Prajit Ramachandran, Barret Zoph, Quoc V. Le. arXiv:1710.05941v1
[2] The exact form of the minimum location and value in terms of the Lambert W function is discussed in the next post.
[3] Mish: A Self Regularized Non-Monotonic Activation Function. Diganta Misra. arXiv:1908.08681v3
[4] SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function. Sayan Nag, Mayukh Bhattacharyya. arXiv:2108.09598v3