Top AI & Deep Learning Interview Questions and Answers
Table of Contents
- Can L2 Regularization be Related to Weight Decay? If yes, describe.
- Define Internal Covariate Shift and explain how to resolve it
- Explain how to Handle Mismatches Between the Input Dimension and the Filter Dimensions
- Explain Exploding and Vanishing Gradients
- How to fix Exploding Gradients?
- How to remediate Vanishing gradients?
- Describe ways to detect vanishing gradient problems
- What do you mean by non-trainable parameters in a neural network? Give examples
- What are the different types of gradient descent? Explain
-
Can L2 Regularization be Related to Weight Decay? If yes, describe.
Let us assume our cost function is C(w) and our L2 penalty is c||w||^2. With a learning rate of 1 (for simplicity), each gradient descent iteration follows the pattern:
w = w - grad(C)(w) - 2cw = (1 - 2c)w - grad(C)(w)
If you observe, the weight is multiplied by a factor (1 - 2c) < 1 at every step. Hence, the L2 penalty can be thought of as weight decay.
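A minimal NumPy sketch of this equivalence is given below. The toy cost function, learning rate eta, and penalty coefficient c are arbitrary placeholders added for illustration; with an explicit learning rate the decay factor becomes (1 - 2*eta*c), and for plain gradient descent the two update rules produce identical weights.

```python
import numpy as np

# Hypothetical toy setup: learning rate and L2 coefficient chosen arbitrarily
rng = np.random.default_rng(0)
w = rng.normal(size=5)
eta, c = 0.1, 0.01

def grad_C(w):
    # Gradient of a simple quadratic cost (a stand-in for any differentiable cost)
    return 2.0 * w

# Variant 1: L2 regularization folded into the gradient
w_l2 = w - eta * (grad_C(w) + 2 * c * w)

# Variant 2: explicit weight decay, followed by a plain gradient step
w_decay = (1 - 2 * eta * c) * w - eta * grad_C(w)

print(np.allclose(w_l2, w_decay))  # True: the two updates coincide for plain SGD
```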
-
Define Internal Covariate Shift and explain how to resolve it
It is widely known that training deep learning networks with multiple (sometimes tens of) layers is very difficult because they are highly sensitive to the initial weights and configuration.
One possible reason for this difficulty is that, as the weights are updated, the distribution of the inputs to each layer can change after every mini-batch.
As a result, the learning algorithm is forever chasing a moving target. This change in the distribution of a layer's inputs during training is what is known as internal covariate shift.
To remediate this problem, batch normalization can be used. This technique standardizes the inputs to a layer over each mini-batch before passing them on to the next layer.
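A minimal Keras sketch is shown below; the layer widths, input dimension, and loss are placeholders added here for illustration, and the BatchNormalization layers are the point of the example.

```python
import tensorflow as tf

# A small fully connected network with batch normalization after each hidden layer.
# Layer widths and the input dimension (20) are arbitrary placeholders.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),  # standardizes activations per mini-batch
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```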
-
Explain how to Handle Mismatches Between the Input Dimension and the Filter Dimensions
Since the input size and the filter size do not match, we need to use a technique called padding. This is the method of adding 0s around the border of the input matrix so that the input and filter dimensions become compatible.
Image dimension = (n, n) = 3 X 3
Filter Dimension = (f,f) = 5 X 5
Padding = 1 (add 1 pixel all around the edges with value 0)
The output dimension will become (n+2p-f+1) X (n+2p-f+1) = (3+2-5+1) X (3+2-5+1) = 1 X 1
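The arithmetic above can be checked with a short NumPy sketch; the input values and filter values are made up purely for illustration.

```python
import numpy as np

# 3 x 3 input and 5 x 5 filter, values chosen arbitrarily for illustration
image = np.arange(9, dtype=float).reshape(3, 3)
kernel = np.ones((5, 5))

# Pad the input with one row/column of zeros on every side: 3 x 3 -> 5 x 5
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# A valid convolution of a 5 x 5 padded input with a 5 x 5 filter yields one value
n, f, p = 3, 5, 1
out_size = n + 2 * p - f + 1          # 3 + 2 - 5 + 1 = 1
output = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        output[i, j] = np.sum(padded[i:i + f, j:j + f] * kernel)

print(output.shape)  # (1, 1)
```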
-
Explain Exploding and Vanishing Gradients
The gradient descent algorithm works by taking small steps that reduce the error towards a local/global minimum.
Based on the gradients computed at each of these steps, the weights and biases in the network are updated.
Exploding gradients - this occurs when the gradients become too large, causing the weights and biases to overflow or become NaN values.
Vanishing gradients - this scenario is the exact opposite of the exploding gradient problem. Here, the steps become so small that the updates to the weights and biases are negligible. In this case, the network never reaches the minimum value.
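A quick numerical illustration (added here, not part of the original answer) of why depth causes both problems: repeatedly multiplying a back-propagated gradient by per-layer factors greater than 1 makes it blow up, while factors smaller than 1 drive it towards zero. The factors 1.5 and 0.5 below are arbitrary.

```python
import numpy as np

depth = 50
grad = 1.0

# Per-layer scaling factors chosen arbitrarily to illustrate the two regimes
print(grad * np.prod(np.full(depth, 1.5)))   # ~6.4e8   -> exploding gradient
print(grad * np.prod(np.full(depth, 0.5)))   # ~8.9e-16 -> vanishing gradient
```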
-
How to fix Exploding Gradients?
There are multiple approaches to fix the exploding gradients problem:
- Reduce the complexity - one factor could be that there are too many layers in the network. Reducing the number of hidden layers is one possible approach.
- Apply gradient clipping - set a threshold value and automatically clip gradients if their norm exceeds it (see the sketch after this list).
- Weight regularization - apply penalties to the weights in the network. Typically, either L1 or L2 regularization can be used effectively to eliminate the problem of exploding gradients.
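As an example of gradient clipping, a minimal Keras sketch is shown below; the thresholds of 1.0 and 0.5 and the choice of optimizer are arbitrary illustrations, not recommendations from the original answer.

```python
import tensorflow as tf

# Clip gradients so that each gradient tensor's norm does not exceed 1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Alternatively, clip each gradient element to the range [-0.5, 0.5]
optimizer_by_value = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)
```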
-
How to remediate Vanishing gradients?
The problem of vanishing gradients occurs in feed-forward networks (FFNs) when the error that is back-propagated from the final layer through the hidden layers towards the input layer becomes exponentially small, so that updating the weights has no significant effect. The easiest way to fix this is to use ReLU (Rectified Linear Unit) as the activation function.
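The sketch below (an illustration added here, under the simplifying assumption that pre-activations sit near zero) compares the product of layer-wise derivatives for sigmoid and ReLU: the sigmoid derivative is at most 0.25, so the product shrinks exponentially with depth, while the ReLU derivative is 1 for positive inputs and passes the error through unchanged.

```python
import numpy as np

depth = 30

# Sigmoid derivative is at most 0.25; assume each layer contributes roughly
# that factor to the back-propagated error.
sigmoid_path = np.prod(np.full(depth, 0.25))

# ReLU derivative is 1 for positive inputs, so active units do not shrink the error.
relu_path = np.prod(np.full(depth, 1.0))

print(f"sigmoid: {sigmoid_path:.3e}")  # ~8.7e-19, effectively vanished
print(f"relu:    {relu_path:.3e}")     # 1.0
```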
-
Describe ways to detect vanishing gradient problems
Vanishing gradients occur in FFNs (feed-forward networks) when the back-propagated error becomes too small to force any significant updates in the weights.
- Using ReLU may prevent vanishing gradients from happening but it is not a guarantee.
- Another possible way is to review the average size of the gradients during training using TensorBoard. The Keras API provides a TensorBoard callback that can be used to capture and log per-layer statistics, and these statistics can be analyzed in the TensorBoard interface to determine whether the network is suffering from a vanishing gradient problem (see the sketch after this list).
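In recent TensorFlow versions the built-in TensorBoard callback does not log gradients directly, so one possible way to collect the per-layer gradient statistics described above is to compute them manually with tf.GradientTape, as sketched below; the model, data, and sigmoid activations are placeholders added for illustration.

```python
import numpy as np
import tensorflow as tf

# Placeholder model and data purely for illustration
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="sigmoid"),
    tf.keras.layers.Dense(64, activation="sigmoid"),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
x = np.random.rand(32, 20).astype("float32")
y = np.random.rand(32, 1).astype("float32")

# Compute the gradients of the loss for one batch
with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
grads = tape.gradient(loss, model.trainable_variables)

# Report the mean absolute gradient per weight tensor; values that shrink towards
# zero for the earliest layers are a symptom of vanishing gradients.
for i, grad in enumerate(grads):
    print(f"variable {i}: mean |grad| = {float(tf.reduce_mean(tf.abs(grad))):.3e}")
```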
-
What do you mean by non-trainable parameters in a neural network? Give examples
Consider an artificial neural network (ANN) with a multi-layer perceptron (MLP) model. The architecture is 128-500-500-2: input size = 128, number of hidden layers = 2, neurons per hidden layer = 500, output layer = 2 (two-class classification).
In such a scenario, the non-trainable parameters could be
- The # of hidden layers
- The # of neurons per hidden layers
This is because the values of the non-trainable parameters cannot be optimized with the training data. When the model goes through back-propagation and updates the weights, this does not affect the number of hidden layers or the number of neurons in each layer, which remain fixed for those epochs or model runs.
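A minimal Keras sketch of this 128-500-500-2 architecture is given below (an illustration added here; the ReLU activations are an arbitrary choice). The layer count and widths are fixed before training begins, while the weights and biases inside the Dense layers are what back-propagation updates.

```python
import tensorflow as tf

# 128-500-500-2 MLP: the number of layers and the neurons per layer are fixed
# choices made before training; the Dense weights and biases are the parameters
# that back-propagation optimizes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128,)),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

model.summary()  # prints layer shapes and parameter counts
```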
-
What are the different types of gradient descent? Explain
Stochastic Gradient Descent - in this variant, the gradients are calculated and the weights updated for one training sample at a time.
Batch Gradient Descent - in this variant, the gradients are calculated and the weights are updated for the entire dataset in a single shot.
Mini-Batch Gradient Descent - in this variant, the gradients are calculated over batches of a fixed batch size (a hyperparameter that can be tuned later for optimal performance). This variant is usually the best method to ensure proper utilization of computational resources. A sketch of all three variants follows below.
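The NumPy sketch below expresses the three variants as one generic update loop; the linear model, data, and learning rate are placeholders added for illustration. Setting batch_size to 1 gives stochastic gradient descent, setting it to the full dataset size gives batch gradient descent, and anything in between gives mini-batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # placeholder features
y = X @ np.array([1.0, -2.0, 0.5])     # placeholder targets for a linear model
w0 = np.zeros(3)
lr = 0.1

def gradient_descent(w, batch_size, epochs=10):
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on the batch
            w = w - lr * grad
    return w

w_sgd   = gradient_descent(w0, batch_size=1)       # stochastic gradient descent
w_batch = gradient_descent(w0, batch_size=len(X))  # batch gradient descent
w_mini  = gradient_descent(w0, batch_size=16)      # mini-batch gradient descent
print(w_mini)
```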