30 Data Science Q & A for Freshers
1. What is Data Science?
Data Science is the field of study that extracts insights from vast amounts of data using scientific methods, algorithms, and processes. It helps you discover hidden patterns in raw data. The term Data Science emerged from the evolution of mathematical statistics, data analysis, and big data.
2. What is the Difference between Data Science and Machine Learning?
Data Science is a combination of algorithms, tools, and machine learning techniques that helps you find hidden patterns in raw data. Machine learning, in contrast, is a branch of computer science that deals with programming systems to learn automatically and improve with experience.
3. Name three types of biases that can occur during sampling.
In the sampling process, there are three types of biases, which are:
- Selection bias
- Undercoverage bias
- Survivorship bias
4. Discuss the Decision Tree algorithm.
A decision tree is a popular supervised machine learning algorithm, used mainly for regression and classification. It breaks a dataset down into smaller and smaller subsets, and it can handle both categorical and numerical data.
5. Discuss Artificial Neural Networks.
Artificial neural networks (ANNs) are a special set of algorithms that have revolutionized machine learning. They adapt to changing input, so the network generates the best possible result without the output criteria having to be redesigned.
6. What is Back Propagation?
Back-propagation is the essence of neural net training. It is the method of tuning the weights of a neural net based on the error rate obtained in the previous epoch. Proper tuning of the weights reduces error rates and makes the model more reliable by improving its generalization.
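As a rough illustration of the weight tuning described above, here is a minimal sketch of back-propagation on the smallest possible network: one input, one hidden neuron, and one output neuron. The starting weights, learning rate, squared-error loss, and single training example are all made-up assumptions for the sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w1, w2, x, target, lr):
    # Forward pass: input -> hidden neuron -> output neuron.
    hidden = sigmoid(w1 * x)
    output = sigmoid(w2 * hidden)
    loss = 0.5 * (output - target) ** 2

    # Backward pass: apply the chain rule from the output back to the input.
    delta_output = (output - target) * output * (1 - output)
    grad_w2 = delta_output * hidden
    delta_hidden = delta_output * w2 * hidden * (1 - hidden)
    grad_w1 = delta_hidden * x

    # Move each weight a small step against its gradient.
    return w1 - lr * grad_w1, w2 - lr * grad_w2, loss

w1, w2 = 0.5, 0.5  # hypothetical starting weights
losses = []
for epoch in range(100):
    w1, w2, loss = train_step(w1, w2, x=1.0, target=1.0, lr=0.5)
    losses.append(loss)

print(f"loss went from {losses[0]:.4f} to {losses[-1]:.4f}")
```

Each epoch uses the error from the forward pass to adjust the weights, so the loss shrinks over time.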
7. What is a Random Forest?
Random forest is an ensemble machine learning method that combines many decision trees and can perform both regression and classification tasks. It is also relatively robust to missing values and outliers.
8. What is selection bias, and why is it important?
Selection Bias occurs when there is no specific randomization achieved while picking individuals or groups or data to be analyzed. It suggests that the given sample does not exactly represent the population which was intended to be analyzed.
9. What Is Logistic Regression?
Logistic regression is a form of predictive analysis. It is used to find the relationships that exist between a dependent binary variable and one or more independent variables by employing a logistic regression equation.
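As a rough illustration, the logistic regression equation passes a linear combination of the independent variables through the sigmoid function to get a probability for the binary outcome. The coefficients and the "hours studied" framing below are made-up values for the sketch, not fitted from data.

```python
import math

def sigmoid(z):
    """Map any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, slope):
    """P(outcome = 1 | x) under a one-variable logistic model."""
    return sigmoid(intercept + slope * x)

# Hypothetical coefficients: chance of passing an exam given hours studied.
INTERCEPT, SLOPE = -4.0, 1.0

for hours in (2, 4, 6):
    p = predict_probability(hours, INTERCEPT, SLOPE)
    print(f"{hours} hours -> P(pass) = {p:.3f}")
```

In practice the intercept and slope are estimated from labeled data; here they are fixed only to show the shape of the equation.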
10. What Is a Decision Tree?
Decision trees are a tool used to classify data and determine the possibility of defined outcomes in a system. The base of the tree is known as the root node. The root node branches out into decision nodes based on the various decisions that can be made at each stage. Decision nodes flow into leaf nodes, which represent the consequence of each decision.
11. What Is Pruning in a Decision Tree Algorithm?
Pruning a decision tree is the process of eliminating non-critical subtrees so that the model does not overfit the data under consideration. In pre-pruning, the tree is pruned as it is being constructed, following criteria like the Gini Index or information gain metrics. Post-pruning entails pruning a tree from the bottom up after it has been constructed.
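A pre-pruning check can be sketched as follows: compute the Gini impurity before and after a candidate split, and only keep the split if the improvement is large enough. The `min_gain` threshold and the toy labels are assumptions chosen for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(parent, left, right):
    """Drop in impurity achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

def should_split(parent, left, right, min_gain=0.05):
    """Pre-pruning rule (hypothetical threshold): skip low-value splits."""
    return split_gain(parent, left, right) >= min_gain

parent = ["yes", "yes", "no", "no"]
print(should_split(parent, ["yes", "yes"], ["no", "no"]))   # True
print(should_split(parent, ["yes", "no"], ["yes", "no"]))   # False
```

The first split separates the classes perfectly, so it is kept. The second leaves both children as mixed as the parent, so the branch is pruned before it is built.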
12. How Do You Treat Outlier Values?
Outliers are often filtered out during data analysis if they don’t fit certain criteria. You can set up a filter in the data analysis tool you’re using to automatically eliminate outliers. However, there are instances where outliers can reveal insights about low-percentage possibilities. In that case, analysts might group outliers and study them separately.
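One possible filter is the interquartile range (IQR) rule, sketched below. The answer above does not name a specific criterion, so the 1.5 × IQR cutoff and the sample readings are assumptions. The function returns the outliers separately instead of discarding them, so they can still be studied on their own.

```python
from statistics import quantiles

def split_outliers(values, k=1.5):
    """Separate values into inliers and outliers using the IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # first and third quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

readings = [10, 12, 11, 13, 12, 11, 10, 95]  # made-up data
inliers, outliers = split_outliers(readings)
print(outliers)  # [95]
```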
13. Explain Normal Distribution.
A normal distribution is a probability distribution where the values are symmetric on either side of the mean of the data. Its curve is bell-shaped, so values closer to the mean are more common than values that are further away from it.
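These properties can be checked with `NormalDist` from Python's standard `statistics` module. The sketch below uses the standard normal (mean 0, standard deviation 1) as an example.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

# Symmetry: points equally far from the mean have equal density.
density_left = standard.pdf(-1.0)
density_right = standard.pdf(1.0)

# Values near the mean are more likely than values far from it.
density_at_mean = standard.pdf(0.0)

# Share of values that fall within one standard deviation of the mean.
share_within_one_sd = standard.cdf(1.0) - standard.cdf(-1.0)
print(f"{share_within_one_sd:.3f}")  # 0.683
```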
14. What Is Deep Learning?
Deep learning is a subset of machine learning concerned with supervised, unsupervised, and semi-supervised learning based on artificial neural networks.
15. What Is an RNN (Recurrent Neural Network)?
A recurrent neural network is a kind of artificial neural network in which the connections between nodes follow a temporal sequence, so the output of one step feeds into the next. This gives RNNs an internal memory of earlier inputs, and they are often used for speech recognition applications.
16. How will you explain linear regression to a non-tech person?
Linear regression is a statistical technique for measuring the linear relationship between two variables. By a linear relationship, we mean that a change in one variable goes along with a proportional change in the other: in the simplest case, when one increases the other increases, and when one decreases the other decreases as well. Based on this relationship, we build a model that predicts the outcome for a new value of one variable.
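A minimal sketch of this idea fits a straight line to a handful of points with ordinary least squares and then uses the line to predict. The data points are made up so that the answer is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # predict y for a new x = 5
print(slope, intercept, prediction)  # 2.0 1.0 11.0
```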
17. How will you handle missing values in data?
There are several ways to handle missing values in the given data:
- Dropping the missing values
- Deleting the whole observation (not always recommended)
- Replacing the value with the mean, median, or mode of the variable
- Predicting the value with regression
- Finding an appropriate value with clustering
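The simpler options in the list above, dropping and replacing with the mean, median, or mode, can be sketched in plain Python. Missing entries are represented as `None` here, and the `ages` list is made-up data.

```python
from statistics import mean, median, mode

def drop_missing(values):
    """Keep only the observed (non-missing) values."""
    return [v for v in values if v is not None]

def impute(values, strategy="mean"):
    """Replace each missing value with the mean, median, or mode of the rest."""
    fill_functions = {"mean": mean, "median": median, "mode": mode}
    fill = fill_functions[strategy](drop_missing(values))
    return [fill if v is None else v for v in values]

ages = [22, None, 25, 22, None, 31]

print(drop_missing(ages))          # [22, 25, 22, 31]
print(impute(ages, "median"))      # [22, 23.5, 25, 22, 23.5, 31]
print(impute(ages, "mode"))        # [22, 22, 25, 22, 22, 31]
```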
18. How are KNN and K-means clustering different?
KNN is a supervised learning algorithm, so we need labeled data to train it. K-means is an unsupervised learning algorithm that looks for patterns that are intrinsic to the data. The K in KNN is the number of nearest data points considered, whereas the K in K-means specifies the number of centroids.
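The contrast can be sketched side by side on one-dimensional data: KNN needs labels and answers "which class is this new point?", while K-means gets no labels and answers "where are the group centers?". The data, the starting centroids, and the fixed iteration count are assumptions for illustration.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Supervised: majority label among the k labeled points nearest to query."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans_1d(values, centroids, iterations=10):
    """Unsupervised: move each centroid to the mean of the points closest to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

labeled = [(1.0, "small"), (1.5, "small"), (2.0, "small"),
           (8.0, "large"), (9.0, "large"), (9.5, "large")]
predicted = knn_predict(labeled, query=2.5, k=3)   # K = 3 nearest neighbours

unlabeled = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
centers = kmeans_1d(unlabeled, centroids=[0.0, 5.0])  # K = 2 centroids
print(predicted, centers)
```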
19. Explain ROC curve.
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds. The True Positive Rate is calculated as TPR = TP / (TP + FN), and the False Positive Rate as FPR = FP / (FP + TN), where TP = true positive, TN = true negative, FP = false positive, FN = false negative.
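The two rates can be computed directly from labels and predicted scores. The sketch below does this at one threshold, and sweeping the threshold gives the points of the ROC curve. The labels, scores, and thresholds are made-up values.

```python
def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

tpr, fpr = tpr_fpr(y_true, scores, threshold=0.5)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # TPR = 0.667, FPR = 0.333

# Each threshold gives one (FPR, TPR) point on the ROC curve.
roc_points = [tpr_fpr(y_true, scores, t)[::-1] for t in (1.0, 0.7, 0.5, 0.35, 0.0)]
```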
20. How is AUC different from ROC?
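ROC is a curve that plots the True Positive Rate against the False Positive Rate, whereas AUC (Area Under the Curve) is a single number that summarizes that curve: 1.0 means the model ranks every positive above every negative, and 0.5 is no better than random guessing. AUC should not be confused with the precision-recall curve, which plots Precision = TP / (TP + FP) against Recall = TP / (TP + FN).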
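One way to compute the area under a ROC curve without drawing it uses an equivalent interpretation: the area equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below counts those pairs directly, with made-up labels and scores.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc = roc_auc(y_true, scores)
print(f"{auc:.3f}")  # 0.889
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration but not for large datasets.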
21. Why don’t gradient descent methods always converge to the same point?
This is because, in some cases, they settle at a local minimum or another local optimum, so the methods don’t always reach the global minimum. Where they end up also depends on the data, the learning rate (step size), and the starting point of the descent.
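The effect of the starting point can be sketched with a function that has two minima. The function f(x) = (x² − 1)² + 0.3x, the learning rate, and the step count are assumptions chosen to make the behaviour easy to see.

```python
def f(x):
    """A curve with two valleys: a deeper one near x = -1, a shallower one near x = 1."""
    return (x ** 2 - 1) ** 2 + 0.3 * x

def gradient(x):
    """Derivative of f."""
    return 4 * x * (x ** 2 - 1) + 0.3

def descend(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

left = descend(start=-2.0)    # ends in the valley near x = -1
right = descend(start=2.0)    # ends in the valley near x = +1
print(f"from -2: x = {left:.3f}, f = {f(left):.3f}")
print(f"from +2: x = {right:.3f}, f = {f(right):.3f}")
```

Both runs stop where the gradient is zero, but only the run started on the left finds the lower of the two minima.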
22. Explain A/B testing.
A/B testing is hypothesis testing for a randomized experiment with two variants, A and B. It is often used to optimize web pages: a small change is made to a page, and the changed version is shown to a sample of users. By comparing their reaction with the reaction of the rest of the audience to the original page, we can test statistically whether the change made a difference.
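One common way to run the statistical comparison is a two-proportion z-test on conversion rates, sketched below. The answer above does not name a specific test, so this choice is an assumption, and the visitor and conversion counts are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate if A and B were identical
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 1000 users saw each version of the page.
z, p_value = two_proportion_z_test(conv_a=200, n_a=1000, conv_b=250, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference between the two pages is unlikely to be due to chance alone.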
23. What is the Box-Cox transformation?
The Box-Cox transformation is used to transform the response variable so that the data meets the assumptions required by a model. With this technique, we can transform non-normal dependent variables into an approximately normal shape, which lets us apply a broader range of statistical tests.
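The transformation itself is a short formula controlled by a parameter λ: (y^λ − 1) / λ when λ is not zero, and ln(y) when λ is zero. It is defined only for positive values. The sketch below applies it with a hand-picked λ; in practice λ is usually estimated from the data.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a single positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

skewed = [1.0, 4.0, 9.0, 100.0]                 # made-up right-skewed data
transformed = [box_cox(y, 0.5) for y in skewed]  # lam = 0.5 is a hand-picked example
log_of_e = box_cox(math.e, 0)                    # lam = 0 is the log transform
print(transformed)  # [0.0, 2.0, 4.0, 18.0]
```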
24. What is meant by ‘curse of dimensionality’? How can we solve it?
While analyzing a dataset, the number of variables (columns) can become very large. As dimensions are added, the data become sparse, distances between points become less informative, and the amount of data needed for a model to generalize grows rapidly. These problems are collectively called the ‘curse of dimensionality’. For example, a dataset may have a thousand features when only a handful of them are significant.
One way to address it is dimensionality reduction, using algorithms like PCA (Principal Component Analysis), so that only the significant information is kept.
25. What is the difference between recall and precision?
Recall is the fraction of actual positive instances that the model correctly identified: Recall = TP / (TP + FN). Precision is the fraction of instances predicted as positive that are actually positive: Precision = TP / (TP + FP). Recall tells you how many of the real positives were found, while precision tells you how trustworthy a positive prediction is.
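Both metrics come straight from confusion-matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN). The counts below are made up for illustration.

```python
def precision(tp, fp):
    """Of everything predicted positive, how much really was positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything really positive, how much did the model find?"""
    return tp / (tp + fn)

# Hypothetical classifier: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4

model_precision = precision(tp, fp)
model_recall = recall(tp, fn)
print(f"precision = {model_precision:.3f}, recall = {model_recall:.3f}")
# precision = 0.800, recall = 0.667
```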
26. What is correlation and covariance in statistics?
Correlation measures the strength and direction of the relationship between two variables on a scale from -1 to 1. If one variable tends to increase as the other increases, the correlation is positive. If one tends to decrease as the other increases, the correlation is negative. Covariance is the measure of how much two random variables vary together; correlation is covariance rescaled by the standard deviations of the two variables.
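A short sketch of the two measures, using the sample (n − 1) form of covariance and Pearson's correlation, which divides the covariance by both standard deviations. The data is made up.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: how much xs and ys vary together."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]      # rises with xs
ys_down = [10, 8, 6, 4, 2]    # falls as xs rises

print(covariance(xs, ys_up))  # 5.0
print(correlation(xs, ys_up), correlation(xs, ys_down))
```

Covariance depends on the units of the variables, while correlation does not, which is why correlation is easier to compare across datasets.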
27. What is ‘Naive’ in a Naive Bayes?
A Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. Basically, it’s “naive” because this independence assumption may or may not turn out to be correct for real data.
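Because of the independence assumption, the score for a class is just the class prior multiplied by the probability of each feature given that class. The sketch below shows this with a spam example in which all probabilities are made-up numbers.

```python
from math import prod

def class_score(prior, feature_likelihoods):
    """Unnormalized P(class | features): prior times each P(feature | class)."""
    return prior * prod(feature_likelihoods)

# Hypothetical probabilities for an email containing the words "free" and "win".
spam_score = class_score(prior=0.4, feature_likelihoods=[0.6, 0.5])
ham_score = class_score(prior=0.6, feature_likelihoods=[0.1, 0.05])

# Normalize so the two class probabilities sum to 1.
posterior_spam = spam_score / (spam_score + ham_score)
print(f"P(spam | free, win) = {posterior_spam:.3f}")  # 0.976
```

Multiplying the per-feature probabilities is only valid if the features are independent given the class, which is exactly the naive assumption.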
28. How can you select k for k-means?
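Two common methods to find the optimal value of k in k-means are:
- Elbow method
- Silhouette score method
The elbow method plots the within-cluster sum of squares against k and looks for the point where the curve bends. The silhouette score measures how well each point fits its own cluster compared with neighboring clusters, and the k with the highest score is preferred.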
29. How is Memory Managed in Python?
Memory Management in Python involves a private heap containing all Python objects and data structures. The management of this private heap is ensured internally by the Python memory manager.
30. What is recall?
Recall gives the rate of true positives with respect to the sum of true positives and false negatives. It is also known as the true positive rate.
Bonus:
31. What are lambda functions?
A lambda function is a small anonymous function. A lambda function can take any number of arguments, but can only have one expression.
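A few small examples of lambda functions in Python are shown below. The variable names are arbitrary.

```python
# A lambda with two arguments and a single expression.
add = lambda a, b: a + b

# A lambda with one argument.
square = lambda x: x * x

# Lambdas are often passed inline as a sort key or to map().
words = ["pandas", "numpy", "scikit-learn"]
by_length = sorted(words, key=lambda w: len(w))
squares = list(map(lambda x: x * x, [1, 2, 3]))

print(add(2, 3))     # 5
print(square(4))     # 16
print(by_length)     # ['numpy', 'pandas', 'scikit-learn']
print(squares)       # [1, 4, 9]
```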