Naoki Yokoyama

### Why Do We Need Activation Functions?

• In general, a standard neuron learns a weight and a bias, which it uses to transform its input into its output.
• But a neural net made up entirely of such neurons collapses into a single linear function, no matter how many layers it has.
• An activation function is responsible for deciding to “activate” certain nodes; turning them on or off.
• With an adequate net architecture, this introduction of nonlinearity will allow our net to go from a simple linear function to a universal function approximator!
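A minimal sketch of the collapse described above, using scalar "layers" for clarity: composing two linear maps with no activation in between yields another linear map, so depth buys nothing without nonlinearity.

```python
# Two stacked linear "neurons" with no activation collapse into one
# linear function: w2*(w1*x + b1) + b2 == (w2*w1)*x + (w2*b1 + b2).
def layer(w, b):
    return lambda x: w * x + b

f1 = layer(2.0, 1.0)    # first layer:  2x + 1
f2 = layer(3.0, -4.0)   # second layer: 3x - 4

stacked = lambda x: f2(f1(x))                   # compose the two layers
collapsed = layer(3.0 * 2.0, 3.0 * 1.0 - 4.0)   # single equivalent layer: 6x - 1

for x in [-2.0, 0.0, 5.0]:
    assert stacked(x) == collapsed(x)
```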
### What Does an Activation Function Look Like?

• Simplest activation: output 1 if the input passes a certain learned threshold, and 0 otherwise.
• Problem: makes it difficult to learn weights.
• Why? Backpropagation relies heavily on gradients to learn weights, which are non-existent for this activation function.
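A quick illustration of why the hard threshold is hopeless for learning: its finite-difference gradient is zero everywhere except exactly at the threshold, so backpropagation receives no signal.

```python
def step(x, threshold=0.0):
    # Hard-threshold activation: fires (1) only past the threshold.
    return 1.0 if x > threshold else 0.0

# The finite-difference "gradient" is 0 almost everywhere, so
# backpropagation has nothing to adjust the weights with.
eps = 1e-6
x = 0.5
grad = (step(x + eps) - step(x - eps)) / (2 * eps)
print(grad)  # 0.0 away from the threshold
```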
### So We Need Activation Functions That Provide Useful Gradients...

• Sigmoids are popular because they are both nonlinear and differentiable. They squash the input into a value between 0 and 1, similar to a boolean function.
• However, the output is always positive, which may not be desirable.
• Tanh activation is similar to sigmoid, but is centered about 0 (range is [-1, 1]), solving the problem of positive-only outputs.
• But both activations suffer from the vanishing gradient problem: toward the extremities of their input range, the curves flatten out, so the gradient shrinks toward zero and the weights stop being updated.
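The flattening can be checked numerically. A small sketch using the closed-form derivatives of sigmoid and tanh: both peak at x = 0 and decay rapidly as |x| grows.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2  # peaks at 1.0 when x = 0

for x in [0.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x), tanh_grad(x))
# Both gradients shrink toward 0 as |x| grows -- the vanishing
# gradient problem at the flat extremities of the two curves.
```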
### ReLU and Leaky ReLU

• The Rectified Linear Unit (ReLU) is both nonlinear and differentiable.
• The ReLU function is simply max(0, x): it passes the input through if it is positive, and outputs 0 otherwise.
• This ensures that the gradient exists for values larger than 0 (see image).
• However, if the input is below 0, the gradient is again 0, which can produce a "dead" neuron that no longer learns.
• Leaky ReLU fixes this by adding a small slope below 0, giving the neuron a chance to revive over time if backpropagation wills it to do so.
• ### Softmax

• Returns a vector that sums to 1.
• Useful for outputting probabilities over multiple classes, e.g., classifying an image as one type of animal.
• However, some models, such as YOLOv3, instead use independent logistic classifiers (sigmoids) to produce a separate probability for each class, which allows labels that are not mutually exclusive.
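Softmax exponentiates each logit and normalizes by the total. A minimal sketch; subtracting the maximum logit before exponentiating is the standard trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability (exp can
    # overflow on large inputs); the output is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # each entry lies in (0, 1)
print(sum(probs))   # the entries sum to 1, usable as class probabilities
```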