Study back propagation and implement gradient descent.

Implement dropout.

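Cross-entropy is an alternative to the quadratic cost function that gives faster learning. With sigmoid outputs, its gradient does not contain the σ'(z) factor, which is what slows learning down when a neuron saturates.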
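
A minimal sketch of the difference, assuming a single sigmoid neuron with weighted input `z` and target `y`. The function names are illustrative and not from any library. It compares the gradient of each cost with respect to `z` when the neuron is saturated and badly wrong.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quadratic_grad(z, y):
    """dC/dz for C = (a - y)^2 / 2 with a = sigmoid(z)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)   # contains sigmoid'(z) = a * (1 - a)

def cross_entropy_grad(z, y):
    """dC/dz for C = -[y ln a + (1 - y) ln(1 - a)] with a = sigmoid(z)."""
    return sigmoid(z) - y            # the sigmoid'(z) factor cancels out

# A saturated neuron that is badly wrong: target is 0, but z is large and positive.
z, y = 5.0, 0.0
print("quadratic gradient:    ", quadratic_grad(z, y))
print("cross-entropy gradient:", cross_entropy_grad(z, y))
```

The cross-entropy gradient is proportional to the error itself, so the worse the mistake, the bigger the update.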

Softmax is a different activation function for the output layer, an alternative to sigmoid. Its outputs are all positive and always sum to 1, so they can be thought of as a probability distribution. In a sigmoid layer, the output activations won't generally sum to 1.
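
A small sketch of that contrast in plain Python. The helper names are made up for illustration, and subtracting the maximum is a common numerical-stability trick that is not part of these notes.

```python
import math

def softmax(zs):
    """Softmax over a list of weighted inputs."""
    m = max(zs)                              # shift for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_layer(zs):
    """Independent sigmoid applied to each weighted input."""
    return [1.0 / (1.0 + math.exp(-z)) for z in zs]

zs = [1.0, 2.0, 3.0]
print(sum(softmax(zs)))        # sums to 1 (up to rounding)
print(sum(sigmoid_layer(zs)))  # no such guarantee
```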

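Two good combinations of output layer and cost in a NN are: softmax + log-likelihood cost, and sigmoid + cross-entropy cost. (Sigmoid + quadratic cost is the combination that suffers from slow learning.)

Usually softmax + log-likelihood is a good fit for multi-class classification problems.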

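validation_data vs. test_data

- validation_data is for tuning hyperparameters such as the learning rate.
- test_data is for the final evaluation only. Tuning hyperparameters against it would overfit them to the test set.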
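
A minimal sketch of such a three-way split. `split_data` and its parameters are hypothetical names chosen for this example, and the split sizes are arbitrary.

```python
import random

def split_data(examples, n_validation, n_test, seed=0):
    """Shuffle once, then carve out disjoint test, validation and training sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    test_data = shuffled[:n_test]
    validation_data = shuffled[n_test:n_test + n_validation]
    training_data = shuffled[n_test + n_validation:]
    return training_data, validation_data, test_data

training_data, validation_data, test_data = split_data(range(10), 2, 2)
print(len(training_data), len(validation_data), len(test_data))
```

The three sets never share an example, so a score on `test_data` is not influenced by the hyperparameter search.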

**Avoiding overfitting**

The best way to avoid overfitting is to have a larger training set.

Regularization is another way to reduce overfitting, since it pushes the network towards smaller weights. With small weights, small changes in the input yield small changes in the output. If the weights are large, a small change in the input may result in a large change in the output. So regularization helps the model avoid fitting the noise in the training data.

L2 Regularization - add a term proportional to the sum of weight^2 to the cost

L1 Regularization - add a term proportional to the sum of |weight| to the cost
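
A rough sketch of both penalties and the effect they have on a single gradient descent step. The names `lmbda` (regularization strength) and `eta` (learning rate) are assumptions for this example, and any scaling by training set size is left out for simplicity.

```python
def l2_penalty(weights, lmbda):
    """Extra cost term for L2: (lmbda / 2) * sum of w^2."""
    return 0.5 * lmbda * sum(w * w for w in weights)

def l1_penalty(weights, lmbda):
    """Extra cost term for L1: lmbda * sum of |w|."""
    return lmbda * sum(abs(w) for w in weights)

def l2_step(w, grad, eta, lmbda):
    """Gradient step with L2: shrinks w by an amount proportional to w."""
    return w - eta * (grad + lmbda * w)

def l1_step(w, grad, eta, lmbda):
    """Gradient step with L1: shrinks w by a constant amount towards 0."""
    sign = (w > 0) - (w < 0)
    return w - eta * (grad + lmbda * sign)

weights = [1.0, -2.0]
print(l2_penalty(weights, 0.1), l1_penalty(weights, 0.1))
```

With a zero data gradient, both steps move the weight towards zero, which is the "pushes towards smaller weights" effect described above.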

You could also train multiple neural networks and take a vote on their results.
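
A sketch of the voting idea, assuming each trained network is available as a callable that returns a class label. The lambdas below are stand-ins for real networks.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(models, x):
    """Ask every model for a label and return the majority answer."""
    return majority_vote([model(x) for model in models])

# Hypothetical stand-ins for three trained networks.
models = [lambda x: 3, lambda x: 8, lambda x: 3]
print(ensemble_predict(models, x=None))
```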

Similarly, there is **Dropout**, in which you temporarily remove a random half of the hidden neurons at a time during training, which helps you adjust the weights in an average way, much like averaging over many different networks.

Expand the data set - for images add rotations/scaling/elastic distortions; for speech vary the speed up/down or add noise.
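
The dropout step can be sketched as a random mask over a layer's activations. This uses the "inverted dropout" variant, where the kept activations are scaled up during training so nothing needs rescaling at test time. That variant is an assumption made here, not something these notes specify.

```python
import random

def dropout(activations, p_drop=0.5, rng=random):
    """Zero each activation with probability p_drop; scale survivors by 1 / (1 - p_drop)."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
print(dropout([1.0] * 8, p_drop=0.5, rng=rng))
```

A fresh mask is drawn on every call, so each training pass sees a different thinned network.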

**Weight Initialization**

Explore Gaussian.
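
As a starting point for that exploration, here is a sketch of one common Gaussian scheme: mean 0 and standard deviation 1/sqrt(n_in), where n_in is the number of inputs to the neuron. The scaling choice is an assumption for this example (the notes only say "Gaussian"), and `gaussian_weights` is a made-up helper name.

```python
import math
import random

def gaussian_weights(n_in, n_out, rng=random):
    """n_out x n_in weight matrix drawn from N(0, 1/n_in)."""
    std = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

weights = gaussian_weights(n_in=100, n_out=30, rng=random.Random(0))
print(len(weights), len(weights[0]))
```

Scaling by the number of inputs keeps the weighted sum into each neuron at a moderate size, so sigmoid neurons are less likely to start out saturated.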

**Vanishing Gradient**

In a multi-layer NN, the gradients reaching the initial layers can explode or vanish, so those layers learn far too quickly or far too slowly compared to the later ones.
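
A rough illustration, assuming a chain of single sigmoid neurons (one per layer). The gradient reaching the first layer picks up one factor of `w * sigmoid'(z)` per layer, and `sigmoid'(z)` is at most 0.25. The function names are illustrative.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backprop_factor(weights, zs):
    """Product of w * sigmoid'(z) over the layers the gradient passes through."""
    factor = 1.0
    for w, z in zip(weights, zs):
        factor *= w * sigmoid_prime(z)
    return factor

layers = 10
print(backprop_factor([1.0] * layers, [0.0] * layers))  # shrinks towards 0: vanishing
print(backprop_factor([8.0] * layers, [0.0] * layers))  # grows quickly: exploding
```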

**Convolutional networks**

- **Local receptive fields**, stride length
- **Shared weights and biases** - all neurons in a hidden layer have the same weights and bias, so all of them detect the same feature at different locations. This protects the network against translational changes: an image shifted slightly to the right or left is still an image of the same thing.
- The map from the input layer to the hidden layer is called a feature map. The shared weights and bias constitute a kernel/filter.
- One input layer can be mapped to multiple hidden layers. That enables detection of multiple features.
- Later layers could be pooling layers - map 2x2 inputs to one neuron/pixel.
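
The points above can be sketched in plain Python: one shared kernel and bias slid over the image with a given stride produce a feature map, and a 2x2 max-pooling layer then condenses it. The activation function is omitted for brevity, max-pooling is just one possible pooling choice, and all names are illustrative.

```python
def feature_map(image, kernel, bias, stride=1):
    """Apply one shared k x k kernel and bias to every local receptive field."""
    k = len(kernel)
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            total = bias
            for di in range(k):
                for dj in range(k):
                    total += kernel[di][dj] * image[i * stride + di][j * stride + dj]
            row.append(total)
        out.append(row)
    return out

def max_pool_2x2(fmap):
    """Map each non-overlapping 2x2 block of the feature map to its maximum."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 1],
          [1, 1]]
fmap = feature_map(image, kernel, bias=0)
print(fmap)
print(max_pool_2x2(fmap))
```

Detecting several features means running `feature_map` with several different kernels over the same input.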

**Recurrent Neural Networks (RNNs)**

- The output of a neuron may be determined by its own earlier values as well as the current input, so behaviour is time based. This might fit speech and natural language problems.
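
A minimal sketch of that feedback, assuming a single recurrent neuron with a tanh activation. The weights `w_x`, `w_h` and bias `b` are arbitrary illustrative values.

```python
import math

def run_rnn(inputs, w_x=1.0, w_h=1.0, b=0.0):
    """Return the hidden state after each time step of a one-neuron RNN."""
    h = 0.0
    states = []
    for x in inputs:
        # The new state depends on the current input and on the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# The input is non-zero only at the first step, yet later states still carry it.
print(run_rnn([1.0, 0.0, 0.0]))
```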

**Deep Belief Networks (DBNs)**

- Generative - can not only recognize digits, but produce them as well.
- Able to do unsupervised learning too.
- Restricted Boltzmann machines are a key component of DBNs.

**What's going on with NNs**

- Playing video games
- NLP