
#### CS 229 - Machine Learning

# Deep Learning cheatsheet

*By Afshine Amidi and Shervine Amidi*

## Neural Networks

Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.

**Architecture** ― The vocabulary around neural network architectures is described in the figure below:

By noting $i$ the $i^{th}$ layer of the network and $j$ the $j^{th}$ hidden unit of the layer, we have:

$$z_j^{[i]}={w_j^{[i]}}^Tx+b_j^{[i]}$$

where we note $w$, $b$, $z$ the weight, bias and output respectively.

**Activation function** ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:

| Sigmoid | Tanh | ReLU | Leaky ReLU |
|:---:|:---:|:---:|:---:|
| $g(z)=\displaystyle\frac{1}{1+e^{-z}}$ | $g(z)=\displaystyle\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$ | $g(z)=\textrm{max}(0,z)$ | $g(z)=\textrm{max}(\epsilon z,z)$ with $\epsilon\ll1$ |
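These four activations can be sketched in NumPy as follows (function names are ours, and $\epsilon=0.01$ is one common default for Leaky ReLU):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # g(z) = (e^z - e^{-z}) / (e^z + e^{-z})
    return np.tanh(z)

def relu(z):
    # g(z) = max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, eps=0.01):
    # g(z) = max(eps * z, z), with eps << 1
    return np.maximum(eps * z, z)
```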

**Cross-entropy loss** ― In the context of neural networks, the cross-entropy loss $L(z,y)$ is commonly used and is defined as follows:

$$L(z,y)=-\left[y\log(z)+(1-y)\log(1-z)\right]$$

**Learning rate** ― The learning rate, often noted $\alpha$ or sometimes $\eta$, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is Adam, which adapts the learning rate during training.

**Backpropagation** ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight $w$ is computed using the chain rule and is of the following form:

$$\frac{\partial L(z,y)}{\partial w}=\frac{\partial L(z,y)}{\partial a}\times\frac{\partial a}{\partial z}\times\frac{\partial z}{\partial w}$$

where $a=g(z)$ denotes the activation of the unit. As a result, the weight is updated as follows:

$$w\longleftarrow w-\alpha\frac{\partial L(z,y)}{\partial w}$$

**Updating weights** ― In a neural network, weights are updated as follows:

- __Step 1__: Take a batch of training data.

- __Step 2__: Perform forward propagation to obtain the corresponding loss.

- __Step 3__: Backpropagate the loss to get the gradients.

- __Step 4__: Use the gradients to update the weights of the network.
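The four steps above can be sketched in NumPy for a single sigmoid neuron trained with cross-entropy loss and gradient descent (the toy data and hyperparameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))            # toy inputs
y = (X[:, 0] > 0).astype(float)         # toy binary labels
w, b, alpha = np.zeros(3), 0.0, 0.1     # weights, bias, learning rate

for _ in range(200):
    # Step 1: take a batch of training data (here, the whole toy set)
    # Step 2: forward propagation to obtain the corresponding loss
    z = X @ w + b
    a = 1.0 / (1.0 + np.exp(-z))        # sigmoid activation
    loss = -np.mean(y * np.log(a + 1e-12) + (1 - y) * np.log(1 - a + 1e-12))
    # Step 3: backpropagate the loss to get the gradients
    # (for sigmoid + cross-entropy, dL/dz simplifies to a - y)
    dz = (a - y) / len(y)
    dw, db = X.T @ dz, dz.sum()
    # Step 4: use the gradients to update the weights of the network
    w -= alpha * dw
    b -= alpha * db
```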

**Dropout** ― Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability $p$ or kept with probability $1-p.$
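One common implementation is "inverted" dropout, sketched below in NumPy; the rescaling by $1/(1-p)$ at train time is an implementation choice not stated above, and keeps the expected activation unchanged so that no scaling is needed at test time:

```python
import numpy as np

def dropout(a, p, rng, train=True):
    """Inverted dropout: drop each unit with probability p at train time,
    scale the survivors by 1/(1-p); do nothing at test time."""
    if not train:
        return a                        # test time: keep all units as-is
    mask = rng.random(a.shape) >= p     # keep each unit with probability 1-p
    return a * mask / (1.0 - p)
```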

## Convolutional Neural Networks

**Convolutional layer requirement** ― By noting $W$ the input volume size, $F$ the size of the convolutional layer neurons, $P$ the amount of zero padding and $S$ the stride, the number of neurons $N$ that fit in a given volume is such that:

$$N=\frac{W-F+2P}{S}+1$$
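As a quick sanity check of this requirement (the helper name below is ours), one can compute the output size and verify that it is an integer:

```python
def conv_output_size(W, F, P, S):
    # N = (W - F + 2P) / S + 1; must be an integer for a valid setting
    n = (W - F + 2 * P) / S + 1
    assert n.is_integer(), "hyperparameters do not tile the input volume"
    return int(n)
```

For example, a $32$-wide input with $F=5$, $P=2$, $S=1$ keeps its size of $32$.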

**Batch normalization** ― It is a step of hyperparameters $\gamma, \beta$ that normalizes the batch $\{x_i\}$. By noting $\mu_B, \sigma_B^2$ the mean and variance of the batch that we want to correct, it is done as follows:

$$x_i\longleftarrow\gamma\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}+\beta$$
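A minimal NumPy sketch of this normalization at training time (the running statistics used at test time are omitted here):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature of the batch to zero mean / unit variance,
    # then rescale and shift with the learnable parameters gamma, beta.
    mu = x.mean(axis=0)        # mu_B
    var = x.var(axis=0)        # sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```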

## Recurrent Neural Networks

**Types of gates** ― Here are the different types of gates that we encounter in a typical recurrent neural network:

| Input gate | Forget gate | Gate | Output gate |
|:---:|:---:|:---:|:---:|
| Write to cell or not? | Erase a cell or not? | How much to write to cell? | How much to reveal cell? |

**LSTM** ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.

## Reinforcement Learning and Control

The goal of reinforcement learning is for an agent to learn how to evolve in an environment.

### Definitions

**Markov decision processes** ― A Markov decision process (MDP) is a 5-tuple $(\mathcal{S},\mathcal{A},\{P_{sa}\},\gamma,R)$ where:

- $\mathcal{S}$ is the set of states

- $\mathcal{A}$ is the set of actions

- $\{P_{sa}\}$ are the state transition probabilities for $s\in\mathcal{S}$ and $a\in\mathcal{A}$

- $\gamma\in[0,1[$ is the discount factor

- $R:\mathcal{S}\times\mathcal{A}\longrightarrow\mathbb{R}$ or $R:\mathcal{S}\longrightarrow\mathbb{R}$ is the reward function that the algorithm wants to maximize

**Policy** ― A policy $\pi$ is a function $\pi:\mathcal{S}\longrightarrow\mathcal{A}$ that maps states to actions.

*Remark: we say that we execute a given policy $\pi$ if given a state $s$ we take the action $a=\pi(s)$.*

**Value function** ― For a given policy $\pi$ and a given state $s$, we define the value function $V^{\pi}$ as follows:

$$V^{\pi}(s)=E\left[R(s_0)+\gamma R(s_1)+\gamma^2R(s_2)+...|s_0=s,\pi\right]$$

**Bellman equation** ― The optimal Bellman equation characterizes the value function $V^{\pi^*}$ of the optimal policy $\pi^*$:

$$V^{\pi^*}(s)=R(s)+\max_{a\in\mathcal{A}}\gamma\sum_{s'\in\mathcal{S}}P_{sa}(s')V^{\pi^*}(s')$$

*Remark: we note that the optimal policy $\pi^*$ for a given state $s$ is such that:*

$$\pi^*(s)=\underset{a\in\mathcal{A}}{\textrm{argmax}}\sum_{s'\in\mathcal{S}}P_{sa}(s')V^*(s')$$

**Value iteration algorithm** ― The value iteration algorithm is in two steps:

1) We initialize the value:

$$V_0(s)=0$$

2) We iterate the value based on the values before:

$$V_{i+1}(s)=R(s)+\max_{a\in\mathcal{A}}\left[\gamma\sum_{s'\in\mathcal{S}}P_{sa}(s')V_i(s')\right]$$
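The two steps of value iteration can be sketched on a toy MDP; the transition probabilities, rewards and discount factor below are made up for illustration:

```python
import numpy as np

# Toy MDP: 2 states, 2 actions (all numbers hypothetical).
# P[s, a, s'] = transition probability, R[s] = reward, gamma = discount.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
gamma = 0.9

# 1) initialize the value
V = np.zeros(2)
# 2) iterate using the previous values:
#    V(s) <- R(s) + max_a [ gamma * sum_{s'} P_sa(s') V(s') ]
for _ in range(500):
    V = R + gamma * np.max(P @ V, axis=1)
```

Since the update is a $\gamma$-contraction, the iterates converge to the optimal value function regardless of the initialization.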

**Maximum likelihood estimate** ― The maximum likelihood estimates for the state transition probabilities are as follows:

$$P_{sa}(s')=\frac{\#\textrm{ times took action }a\textrm{ in state }s\textrm{ and got to }s'}{\#\textrm{ times took action }a\textrm{ in state }s}$$

**Q-learning** ― $Q$-learning is a model-free estimation of $Q$, which is done as follows:

$$Q(s,a)\leftarrow Q(s,a)+\alpha\left[R(s,a,s')+\gamma\max_{a'}Q(s',a')-Q(s,a)\right]$$
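A minimal tabular sketch of this update, assuming a toy two-state MDP with made-up transition probabilities and an $\epsilon$-greedy exploration scheme (all numbers below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])   # P[s, a, s'] (toy values)
R = np.array([0.0, 1.0])                   # reward observed on arrival in s'
gamma, alpha, eps = 0.9, 0.1, 0.2
Q = np.zeros((2, 2))

s = 0
for _ in range(50_000):
    # epsilon-greedy action selection
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next = rng.choice(2, p=P[s, a])      # sample the environment transition
    # Q(s,a) <- Q(s,a) + alpha * [R(s,a,s') + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (R[s_next] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```

Note that only sampled transitions are used: unlike value iteration, no model $P_{sa}$ is required by the update itself.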