The number of neurons in the first layer equals the number of input variables (features), the number of neurons in each middle (hidden) layer is a hyperparameter that you choose and tune, and the number of neurons in the last layer equals the number of outputs required from the network. We also use a loss function to calculate the deviation between the actual values and the output of our network; this deviation is what drives the learning of the network's weights.
Once you are familiar with the basic terminology of neural networks mentioned above, it is not difficult to visualize the training process. We start by building a neural network, setting the number of neurons in each layer according to the rules in the previous paragraph. The number of hidden layers is also a hyperparameter, just like the number of neurons in a hidden layer; we will discuss this later.
Once we have built our neural network, the next step is to initialize the weights and the biases. We keep the weight initialization random so that the neurons learn different things; if all the weights in a layer started with the same value, its neurons would receive identical updates. The biases do not need this randomness and can be initialized to zero. The weights and biases are also called learnable parameters because we learn them over the course of training.
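The initialization step can be sketched as follows. This is a minimal illustration using only Python's standard library; the function names, the uniform range of plus or minus 0.1, and the layer sizes are assumptions made for the example, not values prescribed by the text.

```python
import random

def init_layer(n_inputs, n_neurons, rng, scale=0.1):
    """Return (weights, biases) for one fully connected layer."""
    # One row of weights per neuron, one weight per incoming input.
    # Small random values so that neurons start out different from each other.
    weights = [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    # Biases can start at zero; the random weights already break the symmetry.
    biases = [0.0] * n_neurons
    return weights, biases

def init_network(layer_sizes, seed=0):
    """layer_sizes lists the neuron count per layer: [inputs, hidden..., outputs]."""
    rng = random.Random(seed)
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# Hypothetical sizes: 3 input features, one hidden layer of 4 neurons, 1 output.
network = init_network([3, 4, 1])
```

The input layer has no parameters of its own, so a network described by three sizes ends up with two sets of weights and biases.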
Now, to learn these parameters, we first need to generate some output so that we can compare it with the actual values and learn from the difference. To calculate the output of the network we need the output of each neuron, which is calculated by multiplying each input to the neuron by the weight assigned to that input, summing the products, adding the bias term to the sum, and then applying a non-linearity.
This function can be written as f(x1, x2, ..., xn) = ReLU(w1*x1 + w2*x2 + ... + wn*xn + b), where ReLU is the non-linear function, defined as ReLU(z) = max(0, z), w1 to wn are the weights, x1 to xn are the inputs, and b is the bias term. After calculating this result, we pass it on to each neuron in the next layer.
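As a small sketch of the neuron equation above, the following computes the output of one neuron and of a whole layer. The helper names and the numbers are made up for illustration.

```python
def relu(z):
    """ReLU non-linearity: pass positive values through, clamp the rest to zero."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias):
    """f(x1..xn) = ReLU(w1*x1 + ... + wn*xn + b) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)

def layer_output(inputs, layer_weights, layer_biases):
    """Every neuron in the layer sees the same inputs but has its own weights and bias."""
    return [neuron_output(inputs, w, b)
            for w, b in zip(layer_weights, layer_biases)]

# Made-up example: two inputs feeding a layer of two neurons.
x = [1.0, 2.0]
hidden = layer_output(x, [[0.5, -0.25], [-1.0, -1.0]], [0.1, 0.5])
# First neuron:  z = 0.5 - 0.5 + 0.1 = 0.1, which ReLU leaves unchanged.
# Second neuron: z = -1.0 - 2.0 + 0.5 = -2.5, which ReLU clamps to 0.
```

The list returned by `layer_output` is what gets passed as the inputs to the next layer.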
This process as a whole is known as a forward pass. We use it to predict results for given inputs to a neural network. The forward pass is sufficient when predicting results, but for training we need to go further and use a loss function to calculate the deviation between our predictions and the actual data. There is a variety of loss functions to pick from, and we can also devise our own depending on the nature of the problem we are dealing with.
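The text does not commit to a particular loss function. As one common choice for regression-style problems, mean squared error can be sketched like this (the function name and sample values are illustrative assumptions):

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and actual values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# Made-up predictions from a forward pass and the corresponding actual values.
loss = mean_squared_error([2.0, 0.0, 3.0], [1.0, 0.0, 5.0])
# Squared errors are 1, 0 and 4, so the loss is their mean, 5/3.
```

A perfect prediction gives a loss of zero, and larger deviations are punished more heavily because the errors are squared.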
The actual learning starts right after the loss is calculated. Our goal is to minimize this loss, and the approach we use for this is called backpropagation. In backpropagation we take the partial derivatives of the loss with respect to each weight and each bias. We then update the weights and biases in the direction opposite to the derivatives (a step known as gradient descent), because we want to decrease the loss.
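To make the partial derivatives concrete, here is a sketch for a single ReLU neuron with a squared-error loss, using the chain rule. The squared-error loss, the function names, and the numbers are assumptions chosen for the example; a full network applies the same chain rule layer by layer. A finite-difference helper is included only to sanity-check the analytic result.

```python
def neuron_loss(weights, bias, inputs, target):
    """Squared error of one ReLU neuron: (ReLU(w.x + b) - target)^2."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    a = max(0.0, z)
    return (a - target) ** 2

def neuron_gradients(weights, bias, inputs, target):
    """Partial derivatives of the loss with respect to each weight and the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    a = max(0.0, z)
    dloss_da = 2.0 * (a - target)        # derivative of the squared error
    da_dz = 1.0 if z > 0 else 0.0        # derivative of ReLU
    dloss_dz = dloss_da * da_dz          # chain rule
    grad_w = [dloss_dz * x for x in inputs]  # dz/dw_i = x_i
    grad_b = dloss_dz                        # dz/db = 1
    return grad_w, grad_b

def numeric_weight_gradient(weights, bias, inputs, target, i, eps=1e-6):
    """Central finite difference for weight i, used only as a check."""
    up = list(weights)
    down = list(weights)
    up[i] += eps
    down[i] -= eps
    return (neuron_loss(up, bias, inputs, target)
            - neuron_loss(down, bias, inputs, target)) / (2 * eps)

# Made-up neuron: output is 0.1 while the target is 1.0.
grad_w, grad_b = neuron_gradients([0.5, -0.25], 0.1, [1.0, 2.0], 1.0)
```

Both weight gradients come out negative here, which says that increasing the weights would reduce the loss; moving against the gradient does exactly that.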
We usually use this formula to update the weights: Wnew = Wold - alpha*(partial derivative of loss with respect to Wold). All other variables are self-explanatory except for alpha, which is known as the learning rate. The learning rate defines the size of the step we take when updating the weights. We do not want alpha to be too large, because then the updates overshoot and keep oscillating around the values of the weights that give the best result, or miss them altogether.
This does not mean that we should always keep alpha very small, because that can slow down learning considerably: each step becomes too small to produce any significant improvement.
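The effect of the learning rate can be seen on a toy problem. The sketch below applies the update rule Wnew = Wold - alpha * gradient to a hypothetical one-parameter loss, (w - 3)^2, whose minimum is at w = 3. The loss, the starting point, and the three alpha values are assumptions picked for illustration.

```python
def toy_gradient(w):
    """Derivative of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, alpha, steps):
    """Repeatedly apply Wnew = Wold - alpha * dLoss/dWold and record each value."""
    w = w_start
    history = [w]
    for _ in range(steps):
        w = w - alpha * toy_gradient(w)
        history.append(w)
    return history

reasonable = gradient_descent(0.0, alpha=0.1, steps=20)    # steadily approaches 3
too_small = gradient_descent(0.0, alpha=0.001, steps=20)   # barely moves from 0
too_large = gradient_descent(0.0, alpha=1.1, steps=20)     # overshoots further each step

for name, hist in [("reasonable", reasonable),
                   ("too small", too_small),
                   ("too large", too_large)]:
    print(name, "final distance from optimum:", abs(hist[-1] - 3.0))
```

With the large alpha, each update jumps to the other side of the minimum and lands further away than before, which is the oscillation described above; with the tiny alpha, twenty steps are not nearly enough to get close.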
Advanced Practices for Training Neural Networks
So far, the process described is just the bare minimum required to train a neural network. Training a neural network for good results requires certain procedures to be followed. Besides following these steps, you also need sound knowledge of the subject to make certain decisions. These decisions can include increasing or decreasing the learning rate based on the behavior of the learning curve, deciding the architecture of the network, and deciding whether the network is overfitting or underfitting during training. Neural networks and AI are also a new area in the blockchain industry and could be represented in decentralized systems.
Splitting of Dataset for Validation and Testing
To start with, it is a very good idea to split our dataset into three splits: a training set, a validation set, and a test set. These splits help us keep track of our model's learning; a good model is one that performs relatively well on all three. The training set is used to make the model learn by providing expected outputs. The validation set is used to monitor how the model performs on data it is not trained on, compared with the training set, which helps in fine-tuning the hyperparameters. The test set is used to measure accuracy, or any other criterion, to see how the model performs on unseen data.
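A three-way split can be sketched as below. The 70/15/15 proportions, the fixed seed, and the function name are assumptions for the example; the text does not prescribe specific split sizes.

```python
import random

def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the samples and cut them into training, validation and test sets."""
    shuffled = list(samples)
    # Shuffle first so that each split is a random sample of the whole dataset.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set

# Stand-in dataset: 100 sample indices.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Every sample ends up in exactly one split, so the model is never evaluated on data it was trained on.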
Underfitting and Overfitting
Once we start the training, it is helpful to plot the training loss and the validation loss after each iteration. From these curves we can decide whether the model is underfitting, overfitting, or a good fit for the data. If the training loss is not decreasing after two to three iterations, or is increasing, and all other calculations such as the partial derivatives are correct, then the model is underfitting the training data.
This means that the network is not complex enough to map the relationship and probably needs more neurons and layers. Another problem that a neural network can suffer from is overfitting, which shows up when the validation loss is noticeably higher than the training loss, typically because the validation loss stops improving or rises while the training loss keeps decreasing. This usually occurs when we make the neural network overly complex, so that it becomes specific to the training data and does not generalize to the overall data. Such a model will perform poorly in real-life settings.
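The reading of the loss curves described above can be turned into a rough heuristic. This is only a sketch: the function name and both thresholds (a 5% minimum improvement in training loss and a 1.5x gap between validation and training loss) are arbitrary assumptions, and in practice the curves are usually judged by eye.

```python
def diagnose_fit(train_losses, val_losses, min_improvement=0.05, gap_ratio=1.5):
    """Classify a training run from its per-iteration loss histories."""
    # Underfitting: the training loss has not meaningfully decreased.
    train_improved = train_losses[-1] < train_losses[0] * (1.0 - min_improvement)
    if not train_improved:
        return "underfitting"
    # Overfitting: validation loss ends up well above the training loss.
    if val_losses[-1] > gap_ratio * train_losses[-1]:
        return "overfitting"
    return "good fit"

# Made-up loss histories for three hypothetical training runs.
print(diagnose_fit([1.0, 0.99, 0.98], [1.0, 1.0, 0.99]))  # training loss barely moves
print(diagnose_fit([1.0, 0.4, 0.1], [1.0, 0.6, 0.8]))     # validation loss rises again
print(diagnose_fit([1.0, 0.5, 0.3], [1.0, 0.55, 0.35]))   # both fall together
```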
Remedies for Overfitting
There are many ways to address the overfitting problem. These include, but are not limited to, regularization, reducing the complexity of the network, and augmenting data. Regularization is the process of penalizing the weights for being high in magnitude. It is carried out by adding a term to the loss function that grows with the magnitude of the weights.
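One common form of this penalty is L2 regularization, which adds the sum of the squared weights, scaled by a strength factor, to the loss. The sketch below assumes that form; an L1 penalty would use absolute values instead. The name `lam` and its value are illustrative assumptions.

```python
def l2_penalty(weights, lam):
    """Penalty term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam=0.01):
    """Total loss = loss on the data + penalty for large weights."""
    return data_loss + l2_penalty(weights, lam)

# Same data loss, different weight magnitudes (made-up numbers).
small_weights_loss = regularized_loss(0.5, [0.1, -0.2])
large_weights_loss = regularized_loss(0.5, [3.0, -4.0])
print(small_weights_loss, large_weights_loss)
```

Two sets of weights that fit the data equally well no longer score the same: the set with larger weights gets the higher total loss.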
As we train our neural network, we get a higher loss for higher values of the weights, so the model keeps the weights small in magnitude to avoid the higher loss. Dropout is a technique used during the training of a neural network to avoid dependency on particular neurons in the network. It is done by choosing a drop probability and, in each training iteration, leaving out a random set of neurons, each dropped with that probability.
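Dropout on one layer's activations can be sketched as follows. The rescaling of the surviving activations by 1 / (1 - drop_prob), often called inverted dropout, is an addition not mentioned in the text; it keeps the expected activation the same so that nothing needs to change at prediction time. The function name and parameters are assumptions.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Zero out each activation with probability drop_prob during training."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    # At prediction time every neuron is kept.
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Assumption (inverted dropout): scale survivors so the expected value is unchanged.
    return [a / keep_prob if rng.random() >= drop_prob else 0.0
            for a in activations]

rng = random.Random(0)
layer_activations = [1.0] * 1000
dropped = dropout(layer_activations, drop_prob=0.5, rng=rng)
print("neurons left out this iteration:", dropped.count(0.0))
```

A fresh random mask is drawn on every call, so a different subset of neurons is left out in each training iteration.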