Machine Learning
Classification and regression are two main types of supervised learning in machine
learning, used for different types of prediction problems based on the nature of the target
variable.
1. Nature of the Target Variable: Classification is used when the target variable is
categorical or discrete. The goal is to predict the class or category to which a data point
belongs. For example, determining whether an email is spam or not, or classifying
images into labels like “cat,” “dog,” or “bird.”
Regression is used when the target variable is continuous or numeric, and the goal is
to predict a real-valued number, such as predicting house prices or temperatures.
2. Output Type: Classification outputs discrete labels. For instance, a classifier might
output labels such as “positive” or “negative,” or multiple classes like digits 0–9.
Regression outputs a continuous numerical value, such as 120,000 (house price) or 23.5
(temperature).
3. Problem Examples: Use classification for problems like medical diagnosis (disease vs.
no disease), sentiment analysis (positive/negative), or fraud detection (fraudulent/not
fraudulent). Use regression for predicting quantities like stock prices, sales forecasting,
or predicting fuel consumption.
4. Evaluation Metrics: Classification models are evaluated using metrics such as
accuracy, precision, recall, F1-score, and ROC-AUC. Regression models use metrics
like Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared.
You should use classification when your problem requires assigning data points to distinct
categories, and use regression when you want to predict a continuous numeric value. Choosing
the correct approach depends primarily on the nature of the output you need from your model.
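As a quick illustration, the sketch below (assuming scikit-learn and synthetic data) fits a classifier and a regressor side by side; the only fundamental difference is the type of target the models predict.

# Minimal sketch: the same kind of feature matrix can feed either task;
# what differs is the target -- discrete class labels vs. continuous values.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: y holds discrete class labels (0 or 1).
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict(Xc[:3]))   # discrete class labels, e.g. [0 1 0]

# Regression: y holds continuous numeric values.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:3]))   # real-valued predictions, e.g. prices or temperatures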
The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR).
Classifiers whose curves lie closer to the top-left corner perform better. As a baseline, a
random classifier is expected to give points lying along the diagonal (FPR = TPR). The closer the
curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
Note that the ROC does not depend on the class distribution. This makes it useful for evaluating
classifiers predicting rare events such as diseases or disasters. In contrast, evaluating performance
using accuracy (TP + TN)/(TP + TN + FN + FP) would favor classifiers that always predict a negative
outcome for rare events.
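A minimal sketch of how the curve and the area under it can be computed, assuming scikit-learn and a synthetic imbalanced dataset:

# Sketch: FPR/TPR pairs and the area under the ROC curve for a probabilistic classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points along the ROC curve
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}")  # 0.5 ~ random diagonal, values near 1.0 ~ top-left corner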
10. Explain One-Hot Encoding and Label Encoding. How do they affect the dimensionality of the given
dataset?
In machine learning and data preprocessing, categorical data must be converted into numerical form to
be used effectively by models. Two common encoding techniques are Label Encoding and One-Hot
Encoding.
Label Encoding assigns a unique integer to each category of a feature. For example, a "Color" feature
with categories "Red," "Blue," and "Green" can be encoded as 0, 1, and 2 respectively. This method
replaces the original categorical column with a single integer column, so the dimensionality of the
dataset remains unchanged. Label Encoding is simple and memory-efficient but introduces an implicit
ordinal relationship among categories that may not exist. This can mislead algorithms, particularly
linear models, into assuming a hierarchy between categories.
On the other hand, One-Hot Encoding converts each category into a separate binary column. For the
same "Color" example, this results in three new columns: Is_Red, Is_Blue, and Is_Green, each
containing 1 if the row corresponds to that category and 0 otherwise. This increases the dimensionality
of the dataset, as one categorical column with k unique values becomes k binary columns. One-Hot
Encoding preserves the nominal nature of categories and does not impose any order, making it suitable
for models sensitive to numerical relationships or those assuming feature independence. However, it
can lead to high-dimensional and sparse datasets.
Label Encoding is dimensionally efficient but can misrepresent category relationships, while One-Hot
Encoding preserves category independence at the cost of increased dimensionality. The choice depends
on the nature of the data and the machine learning algorithm used. Proper encoding improves model
accuracy and interpretability.
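A small sketch of both encodings and their effect on shape, assuming pandas and scikit-learn; the "Color" column mirrors the example above:

# Sketch: Label Encoding keeps one column; One-Hot Encoding expands k categories into k columns.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# Label Encoding: a single integer column, dimensionality unchanged.
df["Color_label"] = LabelEncoder().fit_transform(df["Color"])
print(df["Color_label"].tolist())        # [2, 0, 1, 0]

# One-Hot Encoding: one categorical column with k = 3 values becomes 3 binary columns.
one_hot = pd.get_dummies(df["Color"], prefix="Is")
print(list(one_hot.columns))             # ['Is_Blue', 'Is_Green', 'Is_Red']
print(one_hot.shape)                     # (4, 3)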
Bagging and Boosting are ensemble learning techniques used to improve model performance.
• Bagging (Bootstrap Aggregating): It trains multiple models independently and in parallel, each on a
random bootstrap sample of the training data, and combines their predictions by voting or averaging.
This reduces variance and helps prevent overfitting. Example: Random Forest.
• Boosting: It trains models sequentially, where each new model focuses on correcting errors
made by previous models by giving more weight to difficult cases. This reduces bias and
improves accuracy. Example: AdaBoost, Gradient Boosting.
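A brief sketch comparing the two families on synthetic data, assuming scikit-learn's standard implementations (BaggingClassifier and AdaBoostClassifier):

# Sketch: bagging vs. boosting on a toy classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: parallel models on bootstrap samples (reduces variance).
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: sequential models, each reweighting the examples earlier models got wrong (reduces bias).
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())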
Cross-validation is a statistical technique used in machine learning to evaluate and improve the
performance of a model. It involves splitting the dataset into multiple subsets, training the model on
some subsets, and testing it on the remaining ones. This helps in assessing how well the model
generalizes to unseen data.
Purpose of Cross-Validation:
• To make the most use of limited data by reusing it for both training and testing.
Types of Cross-Validation:
1. Hold-out Method:
o The dataset is split once into a training set and a test set, with no repetition.
2. k-Fold Cross-Validation:
o The model is trained on k-1 parts and tested on the remaining one.
o The process is repeated k times, each time with a different test part.
o Each data point is used once as a test set, while the rest are training data.
3. Time Series Split:
o Ensures that training data always comes before test data chronologically.
Advantages:
• Provides a more reliable estimate of how the model generalizes than a single train/test split.
Disadvantages:
• Not suitable for all types of data (e.g., time series needs special handling).
Cross-validation is an essential tool in the machine learning workflow. It helps ensure that the model
is not just good at learning from the training data but also performs well on new, unseen data,
making it a reliable method for model evaluation.
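A minimal k-fold example, assuming scikit-learn and its built-in iris dataset:

# Sketch: 5-fold cross-validation; each fold serves exactly once as the test set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of generalization performance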
19. What is a true positive rate and a false positive rate?
1. True Positive Rate (TPR)
Also known as Sensitivity or Recall, it measures the proportion of actual positives that are correctly
identified by the model: TPR = TP / (TP + FN).
2. False Positive Rate (FPR)
It measures the proportion of actual negatives that are incorrectly classified as positive by the model:
FPR = FP / (FP + TN), which equals 1 – specificity.
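Both rates can be read off a confusion matrix; a small sketch with made-up labels and predictions, assuming scikit-learn:

# Sketch: deriving TPR and FPR from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # sensitivity / recall
fpr = fp / (fp + tn)   # 1 - specificity
print(tpr, fpr)        # 0.75 0.25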
✓ A pure node contains only samples from a single class (all data points belong to the same
category).
✓ It represents perfect classification at that node, so no further splitting is needed.
✓ Purity is measured using metrics like Gini index and Entropy, where a pure node has a value of
zero (no impurity).
✓ When a node is pure, it becomes a leaf node and is assigned the class label of its samples.
✓ Pure nodes help improve tree accuracy but very deep trees with many pure nodes may cause
overfitting.
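A short sketch of the two purity measures mentioned above; the helper functions are illustrative and take the class proportions of a node:

# Sketch: Gini impurity and entropy for a node, given its class proportions.
import numpy as np

def gini(p):
    # Gini impurity: 1 - sum(p_i^2). Zero for a pure node.
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy: -sum(p_i * log2(p_i)). Zero for a pure node.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # avoid log(0)
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # 0.0 0.0 -> pure node
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5 1.0 -> maximally impure (binary case)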
29. Explain some cases where k-Means clustering fails to give good results?
k-Means is an unsupervised learning algorithm used to partition data into k distinct clusters based on
similarity. It works by minimizing the distance between data points and their assigned cluster
centroids.
Failures:
• Non-spherical or non-convex clusters: k-Means assumes roughly spherical, similarly sized clusters,
so it splits elongated or irregularly shaped groups incorrectly (illustrated below).
• Clusters of very different sizes or densities: centroids drift toward large, dense groups and small
clusters get absorbed.
• Outliers: because each centroid is a mean, a few extreme points can pull it far from the true cluster
centre.
• Wrong choice of k: the number of clusters must be specified in advance, and a poor choice forces
unnatural partitions.
• Poor initialization: bad starting centroids can trap the algorithm in a poor local minimum (mitigated
by multiple restarts or k-means++).
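A small demonstration of the first failure case, assuming scikit-learn and its make_moons generator for non-spherical clusters:

# Sketch: k-means on two interleaving half-moons.
# Because it relies on distance to centroids, k-means splits the moons incorrectly.
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true grouping is poor (1.0 would be a perfect match).
print(adjusted_rand_score(y_true, labels))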
Assumptions of Linear Regression:
a) Linearity: There is a linear relationship between the independent and dependent variables.
b) Homoscedasticity: The variance of errors is constant across all levels of the independent
variable(s).
37. 'People who bought this also bought…' recommendations seen on Amazon are a result of which
algorithm?
They are produced by Collaborative Filtering (specifically item-to-item collaborative filtering). This
algorithm suggests items based on the preferences and behaviours of many users. For example, if
users who bought item A also bought item B, then item B will be recommended to others who buy
item A.
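A toy sketch of the item-to-item idea on a made-up purchase matrix (the users and items are invented for illustration), assuming numpy and scikit-learn's cosine_similarity:

# Sketch: item-to-item collaborative filtering on a tiny user-item purchase matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items A, B, C, D (1 = bought).
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
])
items = ["A", "B", "C", "D"]

# Similarity between items, based on which users bought them together.
item_sim = cosine_similarity(purchases.T)

# "People who bought A also bought ...": rank the other items by similarity to A.
a = items.index("A")
ranked = sorted((s, items[j]) for j, s in enumerate(item_sim[a]) if j != a)
print([name for s, name in reversed(ranked)])   # most similar items first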
Cost Function vs. Gradient Descent:
• Purpose: The cost function quantifies how well the model is performing; gradient descent iteratively
adjusts parameters to reduce the error.
• Output: The cost function produces a scalar value representing the model’s error; gradient descent
produces updated model parameters after each iteration.
• Role in Training: The cost function provides a metric to optimize; gradient descent is the method used
to minimize that metric.
• Process: The cost function calculates the error between predictions and actual targets; gradient
descent calculates the gradient of the cost function and updates the weights accordingly.
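A minimal sketch tying the two together for simple linear regression, assuming numpy and a toy dataset generated to follow y ≈ 3x + 2:

# Sketch: MSE cost function and a gradient-descent loop for simple linear regression.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)

w, b = 0.0, 0.0          # model parameters
lr = 0.01                # learning rate

def cost(w, b):
    # Cost function: a single scalar measuring the model's error (MSE).
    return np.mean((w * x + b - y) ** 2)

for _ in range(1000):
    # Gradient descent: compute gradients of the cost and update the parameters.
    error = w * x + b - y
    w -= lr * 2 * np.mean(error * x)
    b -= lr * 2 * np.mean(error)

print(round(w, 2), round(b, 2), round(cost(w, b), 3))   # w close to 3, b close to 2, small cost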
• Use the Elbow Method: Run k-means with different k values and plot the total within-cluster
sum of squares (inertia) against k. The point where the decrease in inertia slows down sharply
(forming an “elbow”) suggests the optimal k.
• Use the Silhouette Score: Calculate the average silhouette coefficient for different k values.
The k with the highest silhouette score indicates well-separated clusters.
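A compact sketch computing both quantities for several candidate k values, assuming scikit-learn and synthetic blobs with four true clusters:

# Sketch: inertia (for the elbow method) and silhouette scores for several k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
# Inertia keeps decreasing with k, but the elbow and the silhouette peak typically agree on k = 4 here.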
39. What is a P-value?
• The P-value is the probability of observing data as extreme as (or more extreme than) the
sample results, assuming the null hypothesis is true.
• It helps measure the strength of evidence against the null hypothesis in a statistical test.
• A low P-value (usually < 0.05) suggests that the observed results are unlikely due to chance,
indicating statistical significance.
• A high P-value means there is insufficient evidence to reject the null hypothesis.
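A minimal example of where a P-value comes from in practice, assuming SciPy and two made-up samples compared with a two-sample t-test:

# Sketch: p-value from a two-sample t-test (null hypothesis: the two groups have equal means).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=53, scale=5, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)   # if p < 0.05, reject the null hypothesis of equal means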
1. Z-Score Method:
Calculate the Z-score for each value; points with Z-scores greater than 3 or less than -3 are
outliers because they lie far from the mean.
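A short sketch of the Z-score rule on synthetic data with two injected extreme values, assuming numpy:

# Sketch: flagging values whose z-score exceeds 3 in absolute value.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=50, scale=5, size=200), [95.0, 2.0]])  # two injected outliers

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]
print(outliers)   # the injected extreme values are flagged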
• Support vectors are the data points closest to the decision boundary (hyperplane).
• They define the maximum margin — the widest gap between classes that the SVM tries to
achieve.
• The margin is the distance between two parallel lines on either side of the hyperplane, each
touching the nearest data points.
• Support vectors directly influence the position and orientation of the hyperplane.
• If a support vector is moved or removed, the hyperplane changes; other points have no effect.
• The model depends only on support vectors, making it robust and less sensitive to outliers
away from the margin.
• Using only support vectors reduces computational complexity and increases efficiency.
• Visually, support vectors lie on the margin boundaries parallel to the hyperplane.
• In summary, support vectors are the key points that “support” and define the optimal
separating hyperplane in SVM classification.
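A quick sketch showing that a fitted SVM exposes exactly these points, assuming scikit-learn and synthetic two-class data:

# Sketch: fitting a linear SVM and inspecting its support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
svm = SVC(kernel="linear", C=1.0).fit(X, y)

print(svm.support_vectors_.shape)   # only a few points define the hyperplane
print(svm.n_support_)               # number of support vectors per class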
TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency (TF): Measures how often a word appears in a document. A higher frequency
suggests greater importance. If a term appears frequently in a document, it is likely relevant to the
document’s content. Formula: TF(t, d) = (number of times term t appears in document d) / (total
number of terms in document d).
Inverse Document Frequency (IDF): Reduces the weight of common words across multiple
documents while increasing the weight of rare words. If a term appears in fewer documents, it is more
likely to be meaningful and specific. Formula: IDF(t) = log(N / df(t)), where N is the total number of
documents and df(t) is the number of documents containing term t.
TF-IDF Calculation
The two components are multiplied: TF-IDF(t, d) = TF(t, d) × IDF(t). A term receives a high score
when it is frequent within a given document but rare across the corpus.
Vectorization
When transforming text to vectors using TF-IDF, each document is represented as a vector. Each
dimension of the vector corresponds to a separate term from the corpus vocabulary.
The value in each dimension is the TF-IDF score of the term in the document. To handle the vast size
of the vocabulary and the sparsity of individual document vectors, implementations usually use sparse
matrix representations.
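A minimal sketch of TF-IDF vectorization on a toy corpus, assuming a recent scikit-learn (its TfidfVectorizer uses a smoothed variant of the IDF formula above) and a few illustrative sentences:

# Sketch: turning a tiny corpus into sparse TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)     # sparse matrix: documents x vocabulary terms

print(vectorizer.get_feature_names_out())    # one dimension per term in the vocabulary
print(tfidf.shape)                           # (3 documents, vocabulary size)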