Naoki Yokoyama


## Activation Functions

### Why Do We Need Activation Functions?

In general, a standard neuron learns a weight and a bias, which are used to transform its input into its output.
But a neural net built entirely from such neurons, no matter how many layers deep, collapses into a single linear function of its input, so stacking layers gains nothing.
An activation function decides whether to "activate" each node, effectively turning it on or off.
With an adequate net architecture, this introduction of nonlinearity will allow our net to go from a simple linear function to a universal function approximator!
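To see concretely why purely linear layers collapse, here is a minimal NumPy sketch (the weights and shapes are made up for illustration) showing that two stacked linear layers are equivalent to one:

```python
import numpy as np

# Hypothetical two-layer "network" with no activation functions.
# Layer 1: y = W1 @ x + b1, Layer 2: z = W2 @ y + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layer(x):
    return W2 @ (W1 @ x + b1) + b2

# The composition collapses into a single linear map:
# z = (W2 @ W1) @ x + (W2 @ b1 + b2)
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=3)
assert np.allclose(two_layer(x), W @ x + b)
```

No matter how many such layers we stack, the same collapse happens, which is why a nonlinearity between layers is essential.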
### What Does an Activation Function Look Like?

Simplest activation: output 1 if the input passes a certain learned threshold, and 0 otherwise.
Problem: this makes the weights very difficult to learn.
Why? Backpropagation relies on gradients to update weights, and this activation's gradient is zero everywhere it is defined (and undefined at the threshold itself), so no learning signal flows backward.
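A quick numerical sketch of the problem (the threshold and test points are arbitrary): probing a hard-threshold activation with finite differences returns a gradient of zero everywhere, so backprop has nothing to work with:

```python
import numpy as np

def step(x, threshold=0.0):
    """Binary threshold activation: 1 if input exceeds threshold, else 0."""
    return np.where(x > threshold, 1.0, 0.0)

def numerical_grad(f, x, eps=1e-4):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

xs = np.array([-2.0, -0.5, 0.5, 2.0])
grads = numerical_grad(step, xs)
# Every gradient is exactly zero: no signal to adjust weights with.
```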
### So we need activation functions that provide *useful* gradients...

Sigmoids are popular because they are both nonlinear and differentiable. They squash the input into a value between 0 and 1, similar to a boolean function.
However, the output is always positive, which may not be desirable.
Tanh activation is similar to sigmoid, but is centered about 0 (range is [-1, 1]), solving the problem of positive-only outputs.
But both activations suffer from the vanishing gradient problem: towards the extremes of their inputs, the curves flatten out, so the gradient vanishes and the weights stop updating.
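The flattening is easy to verify numerically. Below is a small sketch using the standard analytic derivatives, sigmoid'(x) = s(x)(1 - s(x)) and tanh'(x) = 1 - tanh²(x): at a large input both derivatives are vanishingly small.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    """Derivative of sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def dtanh(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

# Near zero the gradients are healthy: dsigmoid(0) = 0.25, dtanh(0) = 1.0.
# At the extremities they all but vanish:
print(dsigmoid(10.0))  # ~4.5e-5
print(dtanh(10.0))     # ~8.2e-9
```

Multiply a few of these tiny derivatives together across layers (as the chain rule does) and the gradient reaching early layers is effectively zero.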
### The ReLU

The Rectified Linear Unit (ReLU) is nonlinear and differentiable everywhere except at zero.
The ReLU function is just max{0, x}: it passes the input through if it is positive, and outputs 0 otherwise.
This ensures that the gradient exists for values larger than 0 (see image).
However, if the input is lower than 0, it may again result in a dead neuron that no longer learns.
Leaky ReLU fixes this by giving the function a small nonzero slope below 0, so the neuron has a chance to revive over time if backpropagation wills it to do so.
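Both variants can be sketched in a few lines of NumPy (the leak slope `alpha=0.01` is a common but arbitrary choice):

```python
import numpy as np

def relu(x):
    """max(0, x): passes positive inputs through, zeroes out the rest."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but keeps a small slope (alpha) for negative inputs."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -1.0, 0.0, 2.0])
# relu(x)       -> [0., 0., 0., 2.]
# leaky_relu(x) -> [-0.03, -0.01, 0., 2.]
```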
### Softmax

Returns a vector that sums to 1.
Useful for outputting probabilities over multiple classes, e.g., classifying an image as one of several types of animal.
However, some models, such as YOLOv3, instead use independent logistic classifiers (sigmoids) to produce a probability for each class.
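A minimal softmax sketch (subtracting the max before exponentiating is a standard numerical-stability trick, not part of the definition; the scores are made up):

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalize so the outputs sum to 1."""
    shifted = logits - np.max(logits)  # stability: avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # e.g. raw scores for cat / dog / bird
probs = softmax(scores)
# probs sums to 1, and the largest score gets the largest probability.
```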


*Figure: ReLU and Leaky ReLU*