{"id":246993,"date":"2026-04-30T09:51:08","date_gmt":"2026-04-30T14:51:08","guid":{"rendered":"https:\/\/www.johndcook.com\/blog\/?p=246993"},"modified":"2026-04-30T13:39:40","modified_gmt":"2026-04-30T18:39:40","slug":"derivative-of-relu","status":"publish","type":"post","link":"https:\/\/www.johndcook.com\/blog\/2026\/04\/30\/derivative-of-relu\/","title":{"rendered":"Three ways to differentiate ReLU"},"content":{"rendered":"<p>When a function is not differentiable in the classical sense there are multiple ways to compute a generalized derivative. This post will look at three generalizations of the classical derivative, each applied to the ReLU (rectified linear unit) function. The ReLU function is a commonly used <a href=\"https:\/\/www.johndcook.com\/blog\/2023\/07\/01\/activation-functions\/\">activation function<\/a> for neural networks. It&#8217;s also called the ramp function for obvious reasons.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter\" src=\"https:\/\/www.johndcook.com\/ReLU_plot.png\" \/><\/p>\n<p>The function is simply\u00a0<em>r<\/em>(<em>x<\/em>) = max(0,\u00a0<em>x<\/em>).<\/p>\n<h2>Pointwise derivative<\/h2>\n<p>The <strong>pointwise<\/strong> derivative would be 0 for\u00a0<em>x<\/em> &lt; 0, 1 for <em>x<\/em> &gt; 0, and undefined at\u00a0<em>x<\/em> = 0. So except at 0, the pointwise derivative of the ramp function is the <a href=\"https:\/\/www.johndcook.com\/blog\/2010\/01\/06\/heaviside\/\">Heaviside<\/a> function.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" style=\"background-color: white;\" src=\"https:\/\/www.johndcook.com\/heaviside3.svg\" alt=\"H(x) = \\left\\{ \\begin{array}{ll} 1 &amp; \\mbox{if } x \\geq 0 \\\\ 0 &amp; \\mbox{if } x &lt; 0 \\end{array} \\right.\" width=\"173\" height=\"48\" \/><br \/>\nIn a real analysis course, you&#8217;d simply say\u00a0<em>r<\/em>\u2032(<em>x<\/em>) =<em>H<\/em>(<em>x<\/em>) because functions are only defined up to equivalent modulo sets of measure zero, i.e. the definition at\u00a0<em>x<\/em> = 0 doesn&#8217;t matter.<\/p>\n<h2>Distributional derivative<\/h2>\n<p>In\u00a0<strong>distribution theory<\/strong> you&#8217;d identify the function\u00a0<em>r<\/em>(<em>x<\/em>) with the distribution whose action on a test function \u03c6 is<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" style=\"background-color: white;\" src=\"https:\/\/www.johndcook.com\/ramp_action.svg\" alt=\"\\langle r, \\varphi \\rangle = \\int_{-\\infty}^\\infty r(x)\\, \\varphi(x) \\, dx\" width=\"207\" height=\"45\" \/><\/p>\n<p>Then the derivative of <em>r<\/em> would be the distribution <em>r<\/em>\u2032 satisfying<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" style=\"background-color: white;\" src=\"https:\/\/www.johndcook.com\/ramp_derivative_action.svg\" alt=\"\\langle r^{\\prime}, \\varphi\\rangle = -\\langle r, \\varphi^{\\prime} \\rangle\" width=\"132\" height=\"19\" \/><\/p>\n<p>for all smooth functions \u03c6 with compact support. You can prove using integration by parts that the above equals the integral of \u03c6 from 0 to \u221e, which is the same as the action of <em>H<\/em>(<em>x<\/em>) on \u03c6.<\/p>\n<p>In this case the distributional derivative of\u00a0<em>r<\/em> is the same as the pointwise derivative of\u00a0<em>r<\/em> interpreted as a distribution. This does not happen in general. 
In this case the distributional derivative of *r* is the same as the pointwise derivative of *r* interpreted as a distribution. This does not happen in general. For example, the pointwise derivative of *H* is zero but the distributional derivative of *H* is δ, the Dirac delta distribution.

For more on distributional derivatives, see [How to differentiate a non-differentiable function](https://www.johndcook.com/blog/2009/10/25/how-to-differentiate-a-non-differentiable-function/).

## Subgradient

The subgradient of a function *f* at a point *x*, written ∂*f*(*x*), is the set of slopes of tangent lines to the graph of *f* at *x*. If *f* is differentiable at *x*, then there is only one slope, namely *f*′(*x*), and we typically say the subgradient of *f* at *x* is simply *f*′(*x*), when strictly speaking we should say it is the one-element set {*f*′(*x*)}.

A line tangent to the graph of the ReLU function at a negative value of *x* has slope 0, and a tangent line at a positive *x* has slope 1. But because there's a sharp corner at *x* = 0, a tangent at this point could have any slope between 0 and 1.

$$\partial f(x) = \begin{cases} 1 & \text{if } x > 0 \\ [0, 1] & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{cases}$$

My dissertation was full of subgradients of convex functions. This made me uneasy because subgradients are not real-valued functions; they're set-valued functions. Most of the time you can blithely ignore this distinction, but there's always a nagging suspicion that it's going to bite you unexpectedly.
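If it helps to see the set-valued view concretely, here is a minimal Python sketch, my illustration rather than anything from the post, representing ∂*r*(*x*) as a closed interval. The function names and the interval representation are just illustrative choices.

```python
def relu(x: float) -> float:
    """The ramp function r(x) = max(0, x)."""
    return max(0.0, x)

def relu_subgradient(x: float) -> tuple[float, float]:
    """Subgradient of ReLU at x, as a closed interval (lo, hi).

    Away from the corner the interval collapses to a single slope;
    at x = 0 every slope in [0, 1] gives a tangent line.
    """
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (0.0, 0.0)
    return (0.0, 1.0)

print(relu_subgradient(-2.0))  # (0.0, 0.0)
print(relu_subgradient(0.0))   # (0.0, 1.0)
print(relu_subgradient(3.0))   # (1.0, 1.0)
```

Software that trains neural networks needs a single number for the gradient, so in practice a framework picks one element of the subgradient at 0, commonly 0 itself.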