Machine Learning

The document discusses key concepts in machine learning, including when to use classification versus regression, how ROC curves work, and the role of support vectors in SVMs. It also covers encoding techniques like One-hot and Label Encoding, ensemble methods like bagging and boosting, and the importance of cross-validation. Additionally, it explains outlier detection methods, the significance of the P-value, and the assumptions of linear regression.


1. When should you use classification over regression?

Classification and regression are two main types of supervised learning in machine
learning, used for different types of prediction problems based on the nature of the target
variable.
1. Nature of the Target Variable: Classification is used when the target variable is
categorical or discrete. The goal is to predict the class or category to which a data point
belongs. For example, determining whether an email is spam or not, or classifying
images into labels like “cat,” “dog,” or “bird.”

Regression is used when the target variable is continuous or numeric, and the goal is
to predict a real-valued number, such as predicting house prices or temperatures.

2. Output Type: Classification outputs discrete labels. For instance, a classifier might
output labels such as “positive” or “negative,” or multiple classes like digits 0–9.
Regression outputs a continuous numerical value, such as 120,000 (house price) or 23.5
(temperature).
3. Problem Examples: Use classification for problems like medical diagnosis (disease vs.
no disease), sentiment analysis (positive/negative), or fraud detection (fraudulent/not
fraudulent). Use regression for predicting quantities like stock prices, sales forecasting,
or predicting fuel consumption.
4. Evaluation Metrics: Classification models are evaluated using metrics such as
accuracy, precision, recall, F1-score, and ROC-AUC. Regression models use metrics
like Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared.

You should use classification when your problem requires assigning data points to distinct
categories, and use regression when you want to predict a continuous numeric value. Choosing
the correct approach depends primarily on the nature of the output you need from your model.
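
As an illustration, here is a minimal sketch (assuming scikit-learn and synthetic data, neither of which appears in the text above) that trains a classifier on a discrete target and a regressor on a continuous one, evaluated with the metrics listed in point 4:

```python
# Minimal sketch: classification vs. regression (scikit-learn and synthetic data assumed).
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Classification: discrete target (0 or 1), evaluated with accuracy.
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("accuracy:", accuracy_score(yc_te, clf.predict(Xc_te)))

# Regression: continuous target, evaluated with mean squared error.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("MSE:", mean_squared_error(yr_te, reg.predict(Xr_te)))
```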

2. Explain how a ROC curve works.


A Receiver Operating Characteristic (ROC) curve is a graphical representation that visualizes
the performance of a binary classification model across different thresholds. It plots the True
Positive Rate (TPR) against the False Positive Rate (FPR). The curve essentially shows how
well a model can distinguish between positive and negative classes at varying levels of
confidence.
How it Works:
o Model Predictions: The classification model generates predictions, which are
typically scores or probabilities indicating the likelihood of a positive outcome.
o Adjusting Thresholds: By varying the classification threshold, the model's
predictions are altered.
o Calculating TPR and FPR: For each threshold, the TPR and FPR are calculated
based on the model's predictions and the actual labels.
o Plotting the Curve: The (FPR, TPR) pairs are plotted, creating the ROC curve.

The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR).
Classifiers that give curves closer to the top-left corner indicate a better performance. As a baseline, a
random classifier is expected to give points lying along the diagonal (FPR = TPR). The closer the
curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

Note that the ROC does not depend on the class distribution. This makes it useful for evaluating
classifiers predicting rare events such as diseases or disasters. In contrast, evaluating performance
using accuracy (TP +TN)/(TP + TN + FN + FP) would favor classifiers that always predict a negative
outcome for rare events.
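
A minimal sketch of the procedure described above, assuming scikit-learn and a synthetic imbalanced dataset (illustrative choices, not part of the original text): the model outputs probabilities, and the threshold is swept to produce the (FPR, TPR) pairs.

```python
# Minimal sketch: computing the (FPR, TPR) pairs behind a ROC curve (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Imbalanced data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) pair per threshold
print("AUC:", roc_auc_score(y_te, scores))      # area under the plotted curve
```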

3. What are Support Vectors in SVMs?


Support Vector Machine (SVM)
Support Vector Machine (SVM) is a popular supervised machine learning algorithm mainly
used for classification problems. It works by finding the best boundary (called a hyperplane)
that separates data points of different classes. The goal of SVM is to maximize the margin,
which is the distance between the hyperplane and the closest data points from each class.
These closest points are called support vectors, and they are important because the position
of the hyperplane depends on them.
SVM is very effective when the data is clearly separated by a margin. However, many real-world
datasets are not linearly separable, which means you cannot draw a straight line to
separate the classes. To solve this, SVM uses something called kernels.
Kernels in SVM
A kernel function helps SVM by transforming the data into a higher-dimensional space
where it is easier to find a separating hyperplane. Instead of working with the original data
directly, kernels allow SVM to learn complex boundaries.
Some common kernels used in SVM are:
• Linear Kernel:
This is the simplest kernel. It is used when data can be separated by a straight line or
hyperplane in the original space. It does not change the data.
• Polynomial Kernel:
This kernel transforms the data into a polynomial feature space. It can create curved
boundaries. The degree of the polynomial decides how complex the curve is.
• Radial Basis Function (RBF) Kernel (also called Gaussian Kernel):
This kernel maps data into an infinite-dimensional space. It works well when the data is not
linearly separable and has complex shapes or clusters.
• Sigmoid Kernel:
This kernel behaves like the activation function used in neural networks. It can model some
non-linear problems but is less common than RBF or polynomial kernels.
• SVM finds the best boundary with the largest margin between classes.
• Support vectors are the key points closest to the boundary.
• Kernels help SVM handle non-linear data by transforming it into higher dimensions.
• Common kernels are Linear, Polynomial, RBF, and Sigmoid.
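
A minimal sketch, assuming scikit-learn and a synthetic non-linear dataset (both illustrative choices, not from the original text), that fits the four kernels listed above and inspects the fitted support vectors:

```python
# Minimal sketch: SVM kernels and support vectors (scikit-learn and synthetic data assumed).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A dataset that is not linearly separable in its original space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel, degree=3, gamma="scale").fit(X, y)
    # support_vectors_ holds the points closest to the boundary that define the margin.
    print(kernel,
          "support vectors:", model.support_vectors_.shape[0],
          "training accuracy:", round(model.score(X, y), 3))
```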

10. Explain One-Hot Encoding and Label Encoding. How do they affect the dimensionality of the given dataset?

In machine learning and data preprocessing, categorical data must be converted into numerical form to
be used effectively by models. Two common encoding techniques are Label Encoding and One-Hot
Encoding.

Label Encoding assigns a unique integer to each category of a feature. For example, a "Color" feature
with categories "Red," "Blue," and "Green" can be encoded as 0, 1, and 2 respectively. This method
replaces the original categorical column with a single integer column, so the dimensionality of the
dataset remains unchanged. Label Encoding is simple and memory-efficient but introduces an implicit
ordinal relationship among categories that may not exist. This can mislead algorithms, particularly
linear models, into assuming a hierarchy between categories.

On the other hand, One-Hot Encoding converts each category into a separate binary column. For the
same "Color" example, this results in three new columns: Is_Red, Is_Blue, and Is_Green, each
containing 1 if the row corresponds to that category and 0 otherwise. This increases the dimensionality
of the dataset, as one categorical column with k unique values becomes k binary columns. One-Hot
Encoding preserves the nominal nature of categories and does not impose any order, making it suitable
for models sensitive to numerical relationships or those assuming feature independence. However, it
can lead to high-dimensional and sparse datasets.

Label Encoding is dimensionally efficient but can misrepresent category relationships, while One-Hot
Encoding preserves category independence at the cost of increased dimensionality. The choice depends
on the nature of the data and the machine learning algorithm used. Proper encoding improves model
accuracy and interpretability.
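
A minimal sketch, assuming pandas and scikit-learn (illustrative choices), using the "Color" example above: Label Encoding keeps a single column, while One-Hot Encoding expands it into k = 3 indicator columns.

```python
# Minimal sketch: Label Encoding vs. One-Hot Encoding (pandas and scikit-learn assumed).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# Label Encoding: one categorical column becomes one integer column (dimensionality unchanged).
df["Color_label"] = LabelEncoder().fit_transform(df["Color"])

# One-Hot Encoding: one column with k = 3 categories becomes 3 binary indicator columns.
onehot = pd.get_dummies(df["Color"], prefix="Color")

print(df)       # original column plus its integer encoding
print(onehot)   # columns: Color_Blue, Color_Green, Color_Red
```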

11. What are bagging and boosting in Machine Learning?

Bagging and Boosting are ensemble learning techniques used to improve model performance.

• Bagging (Bootstrap Aggregating): It trains multiple models independently on different
random samples of the data (with replacement) and combines their predictions by voting or
averaging. This reduces variance and helps prevent overfitting. Example: Random Forest.

• Boosting: It trains models sequentially, where each new model focuses on correcting errors
made by previous models by giving more weight to difficult cases. This reduces bias and
improves accuracy. Example: AdaBoost, Gradient Boosting.
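
A minimal sketch, assuming scikit-learn and synthetic data (illustrative, not from the original text), comparing a bagging ensemble with an AdaBoost boosting ensemble:

```python
# Minimal sketch: bagging vs. boosting (scikit-learn and synthetic data assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: models trained independently on bootstrap samples; predictions combined by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: models trained sequentially, each reweighting the examples earlier models got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```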

12. What is Cross-validation in Machine Learning?

Cross-validation is a statistical technique used in machine learning to evaluate and improve the
performance of a model. It involves splitting the dataset into multiple subsets, training the model on
some subsets, and testing it on the remaining ones. This helps in assessing how well the model
generalizes to unseen data.

Purpose of Cross-Validation:

• To check the model's performance on independent data.

• To avoid overfitting or underfitting.

• To make the most use of limited data by reusing it for both training and testing.

Types of Cross-Validation:
1. Hold-out Method:

o Split the data into two sets: training and testing.

o Simple but may not give consistent results if data is limited.

2. k-Fold Cross-Validation (Most common):

o The data is divided into k equal parts.

o The model is trained on k-1 parts and tested on the remaining one.

o The process is repeated k times, each time with a different test part.

o Final result = average performance from all folds.

3. Stratified k-Fold Cross-Validation

o Similar to k-Fold but maintains the class distribution in each fold.


o Useful for imbalanced classification problems.

4. Leave-One-Out Cross-Validation (LOOCV):

o Each data point is used once as a test set, while the rest are training data.

o Very thorough but computationally expensive.

5. Time Series Cross-Validation:

o For time-dependent data.

o Ensures that training data always comes before test data chronologically.

Advantages:

• Provides a more accurate estimate of model performance.

• Reduces the chances of bias from one random train-test split.

• Helps in model selection and tuning hyperparameters.

Disadvantages:

• Can be computationally intensive, especially for large datasets.

• Not suitable for all types of data (e.g., time series needs special handling).

Cross-validation is an essential tool in the machine learning workflow. It helps ensure that the model
is not just good at learning from the training data but also performs well on new, unseen data,
making it a reliable method for model evaluation.
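
A minimal sketch of k-fold and stratified k-fold cross-validation, assuming scikit-learn and a synthetic imbalanced dataset (both illustrative, not from the original text):

```python
# Minimal sketch: 5-fold and stratified 5-fold cross-validation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

# Plain k-fold: train on 4 folds, test on the held-out fold, repeat 5 times.
kfold_scores = cross_val_score(model, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified k-fold: same idea, but each fold keeps the original class proportions.
strat_scores = cross_val_score(model, X, y,
                               cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("k-fold mean accuracy:    ", kfold_scores.mean())
print("stratified mean accuracy:", strat_scores.mean())
```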

19. What is a True Positive Rate and a False Positive Rate?
1. True Positive Rate (TPR)
Also known as Sensitivity or Recall, it measures the proportion of actual positives that are correctly
identified by the model: TPR = TP / (TP + FN).

2. False Positive Rate (FPR)
It measures the proportion of actual negatives that are incorrectly identified as positives by the
model: FPR = FP / (FP + TN).
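
A minimal sketch, assuming scikit-learn and made-up labels (illustrative only), computing both rates from a confusion matrix:

```python
# Minimal sketch: TPR and FPR from a binary confusion matrix (scikit-learn assumed).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (made up for illustration)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions (made up for illustration)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # sensitivity / recall
fpr = fp / (fp + tn)   # 1 - specificity
print("TPR:", tpr, "FPR:", fpr)
```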

20. What do you mean by a Bag of Words (BOW)?

• Bag of Words is a fundamental technique used in Natural Language Processing (NLP) to
represent text data in a numerical form so that machine learning algorithms can process it.
• In BOW, a text (like a sentence or document) is represented as a "bag" of its words, meaning
the order or grammar of words is completely ignored.
• It creates a vocabulary list of all unique words from the entire text dataset.
• Each text sample is then converted into a vector that records the frequency (count) of each
vocabulary word in that sample.
• For example, if the vocabulary contains the words ["cat", "dog", "love"], and a sentence is "I
love my dog," the BOW vector might look like [0,1,1], showing 0 occurrences of "cat," 1 of
"dog," and 1 of "love."
• This vector representation enables machine learning models to analyse text data numerically.
• Although simple and easy to implement, BOW does not capture the context or word order,
which can be a limitation.
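
A minimal sketch, assuming scikit-learn's CountVectorizer and a few made-up sentences (illustrative choices), that builds the vocabulary and the count vectors described above:

```python
# Minimal sketch: Bag of Words with word counts (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love my dog", "cat and dog", "I love my cat"]   # made-up documents

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)          # sparse matrix of word counts

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(bow.toarray())                          # one count vector per document; word order is ignored
```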

21. What type of node is considered Pure in a decision tree?


Pure Node in a Decision Tree:

✓ A pure node contains only samples from a single class (all data points belong to the same
category).
✓ It represents perfect classification at that node, so no further splitting is needed.
✓ Purity is measured using metrics like Gini index and Entropy, where a pure node has a value of
zero (no impurity).
✓ When a node is pure, it becomes a leaf node and is assigned the class label of its samples.
✓ Pure nodes help improve tree accuracy but very deep trees with many pure nodes may cause
overfitting.
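
A minimal sketch (plain Python, made-up labels) of the Gini impurity mentioned above; a pure node scores exactly zero:

```python
# Minimal sketch: Gini impurity as a purity measure (a pure node scores 0).
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes"]))        # 0.0 -> pure node, becomes a leaf
print(gini(["yes", "no", "yes", "no"]))   # 0.5 -> maximally impure for two classes
```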

29. Explain some cases where k-Means clustering fails to give good results.

k-Means is an unsupervised learning algorithm used to partition data into k distinct clusters based on
similarity. It works by minimizing the distance between data points and their assigned cluster
centroids.

Failures:

➢ Non-spherical clusters: k-Means assumes clusters are spherical, so it performs poorly on
irregular shapes.
➢ Uneven sizes/densities: It misclassifies data when clusters vary in size or density.
➢ Outliers & poor initialization: Sensitive to outliers and initial centroid placement, which can
lead to incorrect clustering.
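
A minimal sketch of the non-spherical failure case, assuming scikit-learn and its synthetic "two moons" dataset (illustrative choices): the crescent-shaped clusters are typically split incorrectly, giving a low agreement score with the true grouping.

```python
# Minimal sketch: k-Means on non-spherical (crescent-shaped) clusters (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A low score shows the spherical-cluster assumption breaking down on crescent shapes.
print("adjusted Rand index:", adjusted_rand_score(y_true, labels))
```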

30. What are the assumptions of linear regression?

a) Linearity: There is a linear relationship between the independent and dependent variables.
b) Homoscedasticity: The variance of errors is constant across all levels of the independent
variable(s).
c) Independence of errors: The residuals are independent of one another (no autocorrelation).
d) Normality of errors: The residuals are approximately normally distributed.
e) No multicollinearity: The independent variables are not highly correlated with one another.

37. 'People who bought this also bought…' recommendations seen on Amazon are a result of which
algorithm?

Collaborative Filtering Algorithm:

This algorithm suggests items based on the preferences and behaviours of many users. For example, if
users who bought item A also bought item B, then item B will be recommended to others who buy
item A.
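
A minimal sketch of item-based collaborative filtering, using a tiny made-up user-item purchase matrix and scikit-learn's cosine similarity (both illustrative assumptions): items bought by the same users come out as highly similar, which is the basis for the recommendation.

```python
# Minimal sketch: item-based collaborative filtering via cosine similarity
# (NumPy, scikit-learn, and a made-up purchase matrix assumed).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items A, B, C (1 = bought).
purchases = np.array([
    [1, 1, 0],   # user 1 bought A and B
    [1, 1, 0],   # user 2 bought A and B
    [0, 0, 1],   # user 3 bought C
])

item_similarity = cosine_similarity(purchases.T)   # compare item columns to each other
print(item_similarity)  # A and B are highly similar, so B is recommended to buyers of A
```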

28. What is the difference between a Cost Function and Gradient Descent?


Definition:
• Cost Function: A function that measures the error or difference between predicted and actual values.
• Gradient Descent: An optimization algorithm that minimizes the cost function by updating model parameters.

Purpose:
• Cost Function: Quantify how well the model is performing.
• Gradient Descent: Iteratively adjust parameters to reduce the error.

Output:
• Cost Function: A scalar value representing the model’s error.
• Gradient Descent: Updated model parameters after each iteration.

Role in Training:
• Cost Function: Provides a metric to optimize.
• Gradient Descent: The method used to minimize the metric.

Example:
• Cost Function: Mean Squared Error (MSE), Cross-Entropy Loss.
• Gradient Descent: Step-by-step parameter update using gradients.

Process:
• Cost Function: Calculate error between predictions and actual targets.
• Gradient Descent: Calculate gradient of cost function and update weights accordingly.

Goal:
• Cost Function: To evaluate model performance.
• Gradient Descent: To find parameters that minimize the cost function.
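
A minimal sketch (NumPy assumed; the data and learning rate are made up) showing the two roles side by side: the cost function scores the current parameters, and gradient descent repeatedly updates them to reduce that score.

```python
# Minimal sketch: gradient descent minimizing an MSE cost function for y = w*x + b (NumPy assumed).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])          # underlying relationship: y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05                   # initial parameters and learning rate

def cost(w, b):
    """Cost function: mean squared error between predictions and targets."""
    return np.mean((w * x + b - y) ** 2)

for _ in range(2000):
    error = w * x + b - y
    w -= lr * 2 * np.mean(error * x)        # gradient descent step for w
    b -= lr * 2 * np.mean(error)            # gradient descent step for b

print(round(w, 3), round(b, 3), round(cost(w, b), 6))   # approaches w = 2, b = 1, cost near 0
```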

38. How can you select k for k-means?

To select the best number of clusters k:

• Use the Elbow Method: Run k-means with different k values and plot the total within-cluster
sum of squares (inertia) against k. The point where the decrease in inertia slows down sharply
(forming an “elbow”) suggests the optimal k.

• Use the Silhouette Score: Calculate the average silhouette coefficient for different k values.
The k with the highest silhouette score indicates well-separated clusters.
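
A minimal sketch, assuming scikit-learn and synthetic blob data with 4 true clusters (illustrative choices), printing the quantities behind both methods:

```python
# Minimal sketch: elbow (inertia) and silhouette scores for different k (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ = within-cluster sum of squares (look for the "elbow" as k grows);
    # the silhouette score peaks at the k that gives the best-separated clusters.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```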

39. What is a P-value?

• The P-value is the probability of observing data as extreme as (or more extreme than) the
sample results, assuming the null hypothesis is true.
• It helps measure the strength of evidence against the null hypothesis in a statistical test.
• A low P-value (usually < 0.05) suggests that the observed results are unlikely due to chance,
indicating statistical significance.
• A high P-value means there is insufficient evidence to reject the null hypothesis.

46. Explain two different ways to detect outliers.


Two Ways to Detect Outliers

1 Z-Score Method:
Calculate the Z-score for each value; points with Z-scores greater than 3 or less than -3 are
outliers because they lie far from the mean.

2 Interquartile Range (IQR) Method:
Calculate IQR as Q3 − Q1; points outside the range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] are outliers.
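
A minimal sketch of both methods, assuming NumPy and a made-up sample with one injected extreme value:

```python
# Minimal sketch: Z-score and IQR outlier detection (NumPy and made-up data assumed).
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 200), [120.0]])   # 120 is an injected outlier

# 1. Z-score method: flag points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print("z-score outliers:", data[np.abs(z) > 3])

# 2. IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])   # the injected value (and possibly a few extreme normal points)
```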

47. What is SVM? Can you name some kernels used in SVM?

Support Vectors in SVMs:

• Support vectors are the data points closest to the decision boundary (hyperplane).

• They define the maximum margin — the widest gap between classes that the SVM tries to
achieve.

• The margin is the distance between two parallel lines on either side of the hyperplane, each
touching the nearest data points.

• Support vectors directly influence the position and orientation of the hyperplane.

• If a support vector is moved or removed, the hyperplane changes; other points have no effect.

• Mathematically, support vectors have non-zero Lagrange multipliers in the SVM
optimization problem.

• The model depends only on support vectors, making it robust and less sensitive to outliers
away from the margin.

• Using only support vectors reduces computational complexity and increases efficiency.

• Visually, support vectors lie on the margin boundaries parallel to the hyperplane.

• In summary, support vectors are the key points that “support” and define the optimal
separating hyperplane in SVM classification.

48. Explain TF-IDF vectorization.

For a detailed explanation, visit this page: https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/

TF-IDF (Term Frequency-Inverse Document Frequency)


TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in natural
language processing and information retrieval to evaluate the importance of a word in a document
relative to a collection of documents (corpus).

TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency (TF): Measures how often a word appears in a document. A higher frequency
suggests greater importance. If a term appears frequently in a document, it is likely relevant to the
document’s content. Formula: TF(t, d) = (number of times term t appears in document d) / (total
number of terms in document d).

Inverse Document Frequency (IDF): Reduces the weight of common words across multiple
documents while increasing the weight of rare words. If a term appears in fewer documents, it is more
likely to be meaningful and specific. Formula: IDF(t) = log(N / df(t)), where N is the total number of
documents and df(t) is the number of documents containing term t.

TF-IDF Calculation

The TF-IDF score is calculated by multiplying these two statistics: TF-IDF(t, d) = TF(t, d) × IDF(t).

Vectorization

When transforming text to vectors using TF-IDF, each document is represented as a vector. Each
dimension of the vector corresponds to a separate term from the corpus vocabulary.

The value in each dimension is the TF-IDF score of the term in the document. To handle the vast size
of the vocabulary and the sparsity of individual document vectors, implementations usually use sparse
matrix representations.
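
A minimal sketch, assuming scikit-learn's TfidfVectorizer and a few made-up documents (note that scikit-learn applies a smoothed variant of the IDF formula above):

```python
# Minimal sketch: TF-IDF vectorization of a small corpus (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())     # vocabulary terms = vector dimensions
print(tfidf.toarray().round(2))               # TF-IDF weight of each term in each document
```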
