What is the Sigmoid Function?
A sigmoid function is a mathematical function with a characteristic S-shaped curve. There are a number of common sigmoid functions, such as the logistic function, the hyperbolic tangent, and the arctangent. In machine learning, the term sigmoid function is normally used to refer specifically to the logistic function, also called the logistic sigmoid function.
All sigmoid functions share the property of mapping the entire real line into a bounded interval, such as (0, 1) or (−1, 1). One common use of a sigmoid function is therefore to convert an arbitrary real value into one that can be interpreted as a probability.
One of the most widely used sigmoid functions is the logistic function, which maps any real value to the range (0, 1). Note the characteristic S-shape which gave sigmoid functions their name (from the Greek letter sigma).
Sigmoid functions have become popular in deep learning because they can be used as an activation function in an artificial neural network. They were inspired by the activation potential in biological neural networks.
Sigmoid functions are also useful for many machine learning applications where a real number needs to be converted to a probability. A sigmoid function placed as the last layer of a machine learning model can serve to convert the model's output into a probability score, which can be easier to work with and interpret.
Sigmoid functions are an important part of a logistic regression model. Logistic regression is a modification of linear regression for two-class classification, and converts one or more real-valued inputs into a probability, such as the probability that a customer will purchase a product. The final stage of a logistic regression model is often set to the logistic function, which allows the model to output a probability.
Sigmoid Function Formula
All sigmoid functions are monotonic and have a bell-shaped first derivative. There are several sigmoid functions and some of the best-known are presented below.
Three of the commonest sigmoid functions: the logistic function, the hyperbolic tangent, and the arctangent. All share the same basic S shape.
Logistic Sigmoid Function Formula
One of the commonest sigmoid functions is the logistic sigmoid function. This is often referred to as the Sigmoid Function in the field of machine learning. The logistic sigmoid function is defined as follows:
σ(x) = 1 / (1 + e^(−x))
The logistic function takes any real-valued input, and outputs a value between zero and one.
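As a concrete illustration, the logistic sigmoid can be written in a few lines of Python. This is only a sketch (the function name logistic_sigmoid is our own choice), using the algebraically equivalent form e^x / (1 + e^x) for negative inputs to avoid overflow:

```python
import math

def logistic_sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For very negative x, exp(-x) would overflow, so use the equivalent
    # form e^x / (1 + e^x), which is numerically stable in that regime.
    z = math.exp(x)
    return z / (1.0 + z)

print(logistic_sigmoid(-1000.0))  # ~0.0
print(logistic_sigmoid(0.0))      # 0.5
print(logistic_sigmoid(1000.0))   # ~1.0
```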
Hyperbolic Tangent Function Formula
Another common sigmoid function is the hyperbolic tangent. This maps any real-valued input to the range between -1 and 1.
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Arctangent Function Formula
A third alternative sigmoid function is the arctangent, which is the inverse of the tangent function.
f(x) = arctan(x) = tan⁻¹(x)
The arctangent function maps any real-valued input to the range −π/2 to π/2.
Graphing both the tangent curve, a well-known trigonometric function, and the arctangent, its inverse, makes this relationship clear.
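A few lines of Python using the standard math module illustrate both the output ranges and the inverse relationship between the tangent and the arctangent:

```python
import math

# tanh maps the real line into (-1, 1); atan (arctangent) maps it into (-pi/2, pi/2).
for x in [-100.0, -1.0, 0.0, 1.0, 100.0]:
    print(f"x = {x:7.1f}   tanh(x) = {math.tanh(x): .4f}   arctan(x) = {math.atan(x): .4f}")

# The arctangent is the inverse of the tangent on (-pi/2, pi/2):
theta = 1.2
print(math.atan(math.tan(theta)))  # 1.2 (up to floating-point error)
```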
Example Calculation of Logistic Sigmoid Function
Taking the logistic sigmoid function, we can evaluate the value of the function at several key points to understand the function's form.
At x = 0, the logistic sigmoid function evaluates to σ(0) = 1 / (1 + e^0) = 1/2 = 0.5.
This is useful for the interpretation of the sigmoid as a probability in a logistic regression model, because it shows that a zero input results in an output of 0.5, indicating equal probabilities of both classes.
At x = 1, we find a slightly larger value: σ(1) = 1 / (1 + e^(−1)) ≈ 0.731,
and by x = 5, the value of the sigmoid function becomes very close to 1: σ(5) = 1 / (1 + e^(−5)) ≈ 0.9933.
In fact, in the limit as x tends towards infinity, the logistic sigmoid function converges to 1, and as x tends towards negative infinity it converges to 0, yet its derivative never actually reaches zero. These are very useful properties of the sigmoid function: it tends towards a limit but always has a nonzero gradient.
Example Calculation of Hyperbolic Tangent Function
Similarly, we can calculate the value of the tanh function at these key points. Rather than being centered around 0.5, the tanh function is centered at 0, so at x = 0 it evaluates to exactly tanh(0) = 0.
At x = 1, the tanh function has increased much more rapidly than the logistic function: tanh(1) ≈ 0.762.
And by x = 5, the tanh function has converged much more closely to 1: tanh(5) ≈ 0.99991, within about four decimal places of its limit.
In fact, the hyperbolic tangent converges towards its limit much more rapidly than the logistic sigmoid function.
Example Calculation of the Arctangent Function
We can evaluate the arctangent function at the same points to see how it converges: arctan(0) = 0, arctan(1) = π/4 ≈ 0.785, and arctan(5) ≈ 1.373.
Note that in contrast to the other two sigmoid functions shown above, the arctangent converges to π/2 rather than 1. Furthermore, the arctangent converges much more slowly: at x = 5 it is still far from its limiting value, and only at quite large inputs, such as x = 5000, does the arctangent get very close to π/2 (arctan(5000) ≈ 1.5706, compared with π/2 ≈ 1.5708).
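These key-point values and limits can be reproduced with a short Python snippet (a standalone sketch; the helper function name is our own):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Reproduce the key-point evaluations quoted above.
for x in [0, 1, 5, 5000]:
    print(f"x = {x:5d}  logistic = {logistic(x):.6f}  tanh = {math.tanh(x):.6f}  arctan = {math.atan(x):.6f}")

print(f"pi/2 = {math.pi / 2:.6f}")  # the limit the arctangent converges towards
```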
Summary of three sigmoid functions
We can compare the key properties of the three sigmoid functions shown above in a table:
Sigmoid function | Logistic function | tanh | arctan |
Value in the limit x → −∞ | 0 | −1 | −π/2 |
Value at x = 0 | 0.5 | 0 | 0 |
Value in the limit x → ∞ | 1 | 1 | π/2 |
Speed of convergence | Fast | Very fast | Very slow |
Sigmoid Function vs. ReLU
In modern artificial neural networks, it is common to see the rectifier, also known as the rectified linear unit or ReLU, used as the activation function in place of the sigmoid function. The ReLU is defined as:
ReLU(x) = max(0, x)
Graph of the ReLU function
The ReLU function has several advantages over a sigmoid function in a neural network. The main one is that the ReLU is very fast to calculate. In addition, an activation potential in a biological neural network does not continue to change for negative inputs, so the ReLU seems closer to the biological reality if the goal is to mimic biological systems.
In addition, for positive x the ReLU function has a constant gradient of 1, whereas a sigmoid function has a gradient that rapidly converges towards 0. This property makes neural networks with sigmoid activation functions slow to train. This phenomenon is known as the vanishing gradient problem. The choice of ReLU as an activation function alleviates this problem because the gradient of the ReLU is always 1 for positive x and so the learning process will not be slowed down by the gradient becoming small.
However, the zero gradient for negative x can pose a similar problem, often called the dying ReLU problem. It is possible to compensate for this by adding a small linear term in x, giving the ReLU function a nonzero slope at all points; this variant is known as the leaky ReLU.
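Both variants are one-line functions; the sketch below is illustrative Python, with the leaky slope alpha = 0.01 chosen arbitrarily:

```python
def relu(x: float) -> float:
    """Rectified linear unit: 0 for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: a small linear term (slope alpha) keeps the gradient nonzero for x < 0."""
    return x if x >= 0 else alpha * x

print(relu(-2.0), relu(3.0))              # 0.0 3.0
print(leaky_relu(-2.0), leaky_relu(3.0))  # -0.02 3.0
```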
Applications of Sigmoid Function
Logistic sigmoid function in logistic regression
A key area of machine learning where the sigmoid function is essential is a logistic regression model. A logistic regression model is used to estimate the probability of a binary event, such as dead vs alive, sick vs well, fraudulent vs honest transaction, etc. It outputs a probability value between 0 and 1.
In logistic regression, a logistic sigmoid function is fit to a set of data where the independent variable(s) can take any real value, and the dependent variable is either 0 or 1.
For example, let us imagine a dataset of tumor measurements and diagnoses. Our aim is to predict the probability of a tumor spreading, given its size in centimeters.
Some measurements of tumor dimensions and outcomes
Plotting the entire dataset, we see a general trend that the larger the tumor, the more likely it is to have spread, although there is a clear overlap of the two classes in the range 2.5 cm to 3.5 cm:
A plot of tumor outcomes versus tumor dimensions
Using logistic regression, we can model the tumor status y (0 or 1) as a function of tumor size x using the logistic sigmoid formula: P(y = 1 | x) = 1 / (1 + e^(−(mx + b)))
where we need to find the optimal values m and b, which allow us to shift and stretch the sigmoid curve to match the data.
In this case, fitting the sigmoid curve gives us the following values:
We can put these values back into the sigmoid formula and plot the curve:
This means that, for example, given a tumor of size 3cm, our logistic regression model would predict the probability of this tumor spreading as:
Intuitively, this makes sense. In the original data, we can see that the tumors around 3cm are more or less evenly distributed between both classes.
Let us consider a tumor of size 6 cm. All tumors in the original dataset of size 4 cm or greater had spread, so we would expect that our model would return a high likelihood of the tumor spreading:
The model has returned a probability very close to 1, indicating the near certainty that y = 1.
This shows how sigmoid functions, and the logistic function in particular, are extremely powerful for probability modeling.
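As a sketch of how such a fit might be carried out in practice, here is a small example using scikit-learn's LogisticRegression. The tumor sizes and labels below are invented purely for illustration and are not the dataset described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical tumor sizes (cm) and spread labels (0 = not spread, 1 = spread),
# invented here for illustration only.
sizes = np.array([1.0, 1.5, 2.0, 2.5, 2.8, 3.0, 3.2, 3.5, 4.0, 4.5, 5.0, 6.0]).reshape(-1, 1)
spread = np.array([0,   0,   0,   0,   1,   0,   1,   1,   1,   1,   1,   1])

model = LogisticRegression()
model.fit(sizes, spread)

# The fitted slope m and intercept b shift and stretch the sigmoid to match the data.
m, b = model.coef_[0][0], model.intercept_[0]
print(f"fitted slope m = {m:.2f}, intercept b = {b:.2f}")

# Probability that a 3 cm and a 6 cm tumor have spread, according to the fitted sigmoid.
print(model.predict_proba([[3.0], [6.0]])[:, 1])
```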
Why is the logistic function used in logistic regression, and not another sigmoid function?
The logistic function is used in logistic regression, rather than one of the other sigmoid variants, not just because it conveniently returns values between 0 and 1. Logistic regression can be derived from the assumption that the data in each of the two classes is normally distributed.
Let us imagine that non-spreading tumors and spreading tumors each follow a normal distribution. The non-spreading tumors are normally distributed with mean 1.84 cm and standard deviation 1 cm, and the spreading tumors are normally distributed with mean 4.3 cm, also with standard deviation 1 cm. We can plot the probability density functions of both of these normal distributions.
At each point we can then calculate the probability that a tumor of that size has spread, which is the probability density of the spreading tumors divided by the sum of both probability densities (non-spreading + spreading): P(spread | x) = p_spread(x) / (p_non-spread(x) + p_spread(x))
Plotting this ratio as a function of x, we can see that the result is the original logistic sigmoid curve.
The logistic function is chosen for logistic regression, then, because we are assuming that we are modeling two classes which are both normally distributed, and the logistic function arises naturally from the ratio of normal probability density functions.
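This can be verified numerically. The sketch below uses the means 1.84 cm and 4.3 cm and unit standard deviations given above, and assumes equal numbers of tumors in each class (equal priors); the function names are our own:

```python
import math

def normal_pdf(x, mean, std=1.0):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior_spread(x, mean0=1.84, mean1=4.3):
    """P(spread | size x), assuming equal class priors and unit-variance normals."""
    p0, p1 = normal_pdf(x, mean0), normal_pdf(x, mean1)
    return p1 / (p0 + p1)

# For unit-variance normals the ratio simplifies algebraically to a logistic sigmoid
# with m = mean1 - mean0 and b = (mean0**2 - mean1**2) / 2.
def logistic_form(x, mean0=1.84, mean1=4.3):
    m = mean1 - mean0
    b = (mean0 ** 2 - mean1 ** 2) / 2
    return 1.0 / (1.0 + math.exp(-(m * x + b)))

for x in [1.0, 3.0, 5.0]:
    print(f"x = {x}: ratio = {posterior_spread(x):.6f}, sigmoid = {logistic_form(x):.6f}")
```

The two columns printed by the loop agree, showing that the ratio of normal densities is exactly a logistic sigmoid in x.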
Sigmoid function as activation function in artificial neural networks
An artificial neural network consists of several layers of functions, layered on top of each other:
A feedforward neural network with two hidden layers
Each layer typically contains some weights and biases and functions like a small linear regression. Another crucial part of each layer is the activation function.
The first hidden layer of a feedforward neural network can be written as h = g(Wx + b), with weights denoted by W, biases by b, and activation function g.
However, if every layer in the neural network were to contain only weights and biases, but no activation function, the entire network would be equivalent to a single linear combination of weights and biases. In other words, the formula for the neural network could be factorized and simplified down to a simple linear regression model. Such a model would be able to pick up very simple linear dependencies but unable to perform the impressive tasks that neural networks are renowned for, such as image and voice recognition.
Activation functions were introduced between layers in neural networks in order to introduce a non-linearity. Originally sigmoid functions such as the logistic function, arctangent, and hyperbolic tangent were used, and today ReLU and its variants are very popular. All activation functions serve the same purpose: to introduce a non-linearity into the network. Sigmoid functions were chosen as some of the first activation functions thanks to their perceived similarity with the activation potential in biological neural networks.
Thanks to the use of a sigmoid function at various points within a multi-layer neural network, neural networks can be built to have successive layers pick up on ever more sophisticated features of an input example.
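To make the role of the activation function concrete, here is a minimal forward pass for a two-hidden-layer network in NumPy; the weights are random placeholders rather than a trained model, and a sigmoid is applied after every linear step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# A tiny feedforward network with two hidden layers; weights are random,
# purely for illustration (no training is performed here).
x = np.array([0.5, -1.2, 3.0])                 # input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # first hidden layer: h1 = g(W1 x + b1)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)  # second hidden layer
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)  # output layer

h1 = sigmoid(W1 @ x + b1)
h2 = sigmoid(W2 @ h1 + b2)
y = sigmoid(W3 @ h2 + b3)   # a final sigmoid turns the output into a value in (0, 1)

print(h1, h2, y)
```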
Sigmoid Function History
In 1798, the English cleric and economist Thomas Robert Malthus anonymously published An Essay on the Principle of Population, asserting that the population was increasing in a geometric progression (doubling every 25 years) while food supplies were increasing arithmetically, and that the growing gap between the two would cause widespread famine.
In the late 1830s, the Belgian mathematician Pierre François Verhulst was experimenting with different ways of modeling population growth, and wanted to account for the fact that a population's growth is ultimately self-limiting, and does not increase exponentially forever. Verhulst chose the logistic function as a logical adjustment to the simple exponential model, in order to model the slowing down of a population's growth which occurs when a population begins to exhaust its resources.
Over the next century, biologists and other scientists began to use the sigmoid function as a standard tool for modeling population growth, from bacterial colonies to human civilizations.
In 1943, Warren McCulloch and Walter Pitts developed an artificial neural network model using a hard cutoff as an activation function, where a neuron outputs 1 or 0 depending on whether its input is above or below a threshold.
In 1972, the biologists Hugh Wilson and Jack Cowan at the University of Chicago were attempting to model biological neurons computationally and published the Wilson–Cowan model, where a neuron sends a signal to another neuron if it receives a signal greater than an activation potential. Wilson and Cowan chose the logistic sigmoid function to model the activation of a neuron as a function of a stimulus.
From the 1970s and 1980s onwards, a number of researchers began to use sigmoid functions in formulations of artificial neural networks, taking inspiration from biological neural networks. In 1998, Yann LeCun chose the hyperbolic tangent as an activation function in his groundbreaking convolutional neural network LeNet, which was the first to be able to recognize handwritten digits to a practical level of accuracy.
In recent years, artificial neural networks have moved away from sigmoid functions in favor of the ReLU function, since all the variants of the sigmoid function are computationally intensive to calculate, and the ReLU provides the necessary nonlinearity to take advantage of the depth of the network, while also being very fast to compute.
References
Malthus, An Essay on the Principle of Population (1798)
Verhulst, Notice sur la loi que la population suit dans son accroissement (1838)
McCulloch and Pitts, A logical calculus of the ideas immanent in nervous activity (1943)
Wilson and Cowan, Excitatory and inhibitory interactions in localized populations of model neurons (1972)