Machine learning

The document outlines several lab experiments focused on implementing machine learning algorithms, including FIND-S, Candidate-Elimination, ID3, Backpropagation for neural networks, and Naïve Bayes classifiers. Each experiment provides a structured approach with steps for data handling, algorithm implementation, and demonstration using sample CSV files. The document also includes conceptual Python and Java code snippets for practical application of the algorithms.


LAB EXPERIMENT : 1

Question: Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set of training data samples. Read the training data from a .CSV file.

Answer:

Implementation Steps:

* Read CSV Data: Use Python’s pandas library to read the training data from
a specified .csv file. The CSV should have columns representing features and
the last column representing the class label (e.g., ‘Yes’ for positive, ‘No’ for
negative).

* Initialize Hypothesis: Create a variable to hold the most specific hypothesis. Initially, this will be a list of None values, one for each feature.

* Iterate Through Positive Examples: Loop through each row (training example) in the DataFrame. If the class label for the current example is positive:

* If the hypothesis is still in its initial None state, set the hypothesis to the
feature values of this first positive example.

* If the hypothesis has been set, compare each feature value of the current
positive example with the corresponding value in the current hypothesis:

* If they are the same, keep the value in the hypothesis.

* If they are different, generalize the hypothesis at that position to ‘?’ (representing any possible value).

* Return Hypothesis: After processing all positive examples, the final hypothesis will be the most specific hypothesis consistent with them.

Demonstration:

 Create a Sample CSV (finds_data.csv):

Color,Shape,Size,Class
Red,Circle,Small,Yes
Red,Square,Large,Yes
Red,Circle,Large,Yes
Blue,Circle,Small,No
 Python Code:

import pandas as pd

def find_s_algorithm(file_path):
    df = pd.read_csv(file_path)
    features = df.iloc[:, :-1].values.tolist()
    labels = df.iloc[:, -1].values.tolist()
    num_attributes = len(features[0])
    hypothesis = [None] * num_attributes
    for i, example in enumerate(features):
        if labels[i] == 'Yes':
            if hypothesis[0] is None:
                # First positive example: copy it as the initial specific hypothesis
                hypothesis = list(example)
            else:
                # Generalize positions where the example disagrees with the hypothesis
                for j in range(num_attributes):
                    if hypothesis[j] != example[j]:
                        hypothesis[j] = '?'
    return hypothesis

# Demonstrate
file = 'finds_data.csv'
specific_hypothesis = find_s_algorithm(file)
print(f"Most specific hypothesis found by FIND-S: {specific_hypothesis}")
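For the sample finds_data.csv above, the three positive examples agree only on Color, so the expected output is: Most specific hypothesis found by FIND-S: ['Red', '?', '?'].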

LAB EXPERIMENT : 2
Question: For a given set of training data examples stored in a .CSV
file, implement and demonstrate the Candidate-Elimination
algorithm to output a description of the set of all hypotheses
consistent with the training examples.

Answer:

Implementation Steps:

* Read CSV Data: Use pandas to read the training data.

* Initialize Boundaries:

* Initialize the specific boundary S with the most specific hypothesis: [[‘∅’, ‘∅’, ...]] (where ‘∅’ means no value is acceptable for that attribute).

* Initialize the general boundary G with the most general hypothesis: [[‘?’,
‘?’, ...]].

* Iterate Through Examples: For each training example:

* Positive Example:

* Remove hypotheses in G inconsistent with the example.

* For each hypothesis s in S inconsistent with the example, find its minimal generalizations h consistent with the example. Add these h to a new S_next. Update S with the minimal hypotheses in S_next (remove more general ones).

* Negative Example:

* Remove hypotheses in S inconsistent with the example.

* For each hypothesis g in G inconsistent with the example, find its minimal specializations h consistent with the example. Add these h to a new G_next. Update G with the minimal hypotheses in G_next (remove more specific ones).

* Output Version Space: The version space is defined by the hypotheses between S and G. Output the final S and G sets.

Demonstration:

 Create a Sample CSV (candidate_elimination_data.csv):

Sky,AirTemp,Humidity,Wind,Water,Forecast,EnjoySport
Sunny,Warm,Normal,Strong,Warm,Same,Yes
Sunny,Warm,High,Strong,Warm,Same,Yes
Rainy,Cold,High,Strong,Warm,Change,No
Sunny,Warm,High,Weak,Warm,Same,Yes

 Python Code (Conceptual – requires detailed helper functions):

import pandas as pd

def is_consistent(hypothesis, example):
    # ... (implementation to check if hypothesis matches example)
    pass

def generalize_hypothesis(hypothesis, example):
    # ... (implementation to find minimal generalizations)
    pass

def specialize_hypothesis(hypothesis, example):
    # ... (implementation to find minimal specializations)
    pass

def candidate_elimination(file_path):
    df = pd.read_csv(file_path)
    features = df.iloc[:, :-1].values.tolist()
    labels = df.iloc[:, -1].values.tolist()
    num_attributes = len(features[0])
    S = [['∅'] * num_attributes]
    G = [['?'] * num_attributes]
    for i, example in enumerate(features):
        label = labels[i]
        if label == 'Yes':
            S_next = []
            for s in S:
                if not is_consistent(s, example):
                    generalizations = generalize_hypothesis(s, example)
                    for h in generalizations:
                        # ... (add h to S_next if it is minimal and consistent)
                        pass
                else:
                    S_next.append(s)
            # ... (update S, remove more general hypotheses)
            # ... (update G, remove inconsistent hypotheses)
        else:  # label == 'No'
            G_next = []
            for g in G:
                if is_consistent(g, example):
                    specializations = specialize_hypothesis(g, example)
                    for h in specializations:
                        # ... (add h to G_next if it is minimal and consistent)
                        pass
                else:
                    G_next.append(g)
            # ... (update G, remove more specific hypotheses)
            # ... (update S, remove inconsistent hypotheses)
    return S, G
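One way the placeholder helpers could be filled in for purely categorical attributes is sketched below. The names mirror the stubs above, "consistent" here means "the hypothesis covers the example", and specialize_hypothesis takes an extra domains argument (the possible values of each attribute) that the conceptual signature does not include:

def is_consistent(hypothesis, example):
    # A hypothesis covers an example if every position is '?' or equals the example's value
    return all(h == '?' or h == e for h, e in zip(hypothesis, example))

def generalize_hypothesis(hypothesis, example):
    # Minimal generalization so a positive example is covered:
    # replace '∅' with the example's value, and mismatching values with '?'
    new_h = list(hypothesis)
    for j, value in enumerate(example):
        if new_h[j] == '∅':
            new_h[j] = value
        elif new_h[j] != value:
            new_h[j] = '?'
    return [new_h]

def specialize_hypothesis(hypothesis, example, domains):
    # Minimal specializations so a negative example is no longer covered:
    # for each '?' position, substitute every domain value that differs from the example
    specializations = []
    for j, value in enumerate(example):
        if hypothesis[j] == '?':
            for other in domains[j]:
                if other != value:
                    h = list(hypothesis)
                    h[j] = other
                    specializations.append(h)
    return specializations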

LAB EXPERIMENT : 3
Question: Write a program to demonstrate the working of the
decision tree based ID3 algorithm. Use an appropriate data set for
building the decision tree and apply this knowledge to classify a
new sample.

Answer:

Implementation Steps:

* Choose Dataset: Select a dataset suitable for classification (e.g., a simple play tennis dataset).

* Implement Entropy and Information Gain: Create functions to calculate the entropy of a dataset and the information gain of an attribute.

* ID3 Algorithm (Recursive):

* Create a function that takes the data and available attributes.

* Base Cases: If all examples have the same class or no attributes left,
return a leaf node.

* Recursive Step:

* Select the attribute with the highest information gain as the splitting
attribute.

* Create a node for this attribute.

* For each value of the splitting attribute, create a branch and recursively
call the ID3 function on the subset of data corresponding to that value
(excluding the splitting attribute).

* Build Tree: Call the ID3 function on the entire dataset.

* Classify New Sample: Traverse the tree based on the attribute values of
the new sample to reach a leaf node, which gives the classification.

Demonstration:

 Create a Sample CSV (id3_data.csv):

Outlook,Temperature,Humidity,Windy,PlayTennis
Sunny,Hot,High,False,No
Sunny,Hot,High,True,No
Overcast,Hot,High,False,Yes
Rainy,Mild,High,False,Yes
Rainy,Cool,Normal,False,Yes
Rainy,Cool,Normal,True,No
Overcast,Cool,Normal,True,Yes
Sunny,Mild,High,False,No
Sunny,Cool,Normal,False,Yes
Rainy,Mild,Normal,False,Yes
Sunny,Mild,Normal,True,Yes
Overcast,Mild,High,True,Yes
Overcast,Hot,Normal,False,Yes
Rainy,Mild,High,True,No

 Python Code (Conceptual):

import pandas as pd
import math

def entropy(data):
    # ... (calculate entropy)
    pass

def information_gain(data, attribute):
    # ... (calculate information gain)
    pass

def id3(data, target_attribute, attributes):
    # ... (recursive ID3 algorithm)
    pass

def classify(tree, sample):
    # ... (classify a new sample using the tree)
    pass

# Demonstrate
file = 'id3_data.csv'
df = pd.read_csv(file)
target = 'PlayTennis'
attributes = [col for col in df.columns if col != target]
tree = id3(df, target, attributes)
print("Decision Tree:", tree)

new_sample = {'Outlook': 'Sunny', 'Temperature': 'Mild', 'Humidity': 'High', 'Windy': 'False'}
prediction = classify(tree, new_sample)
print(f"Prediction for {new_sample}: {prediction}")
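The entropy and information_gain placeholders can be filled in directly from their definitions. A minimal sketch for a pandas DataFrame is shown below; unlike the stubs above, it passes the target column name explicitly:

import math
import pandas as pd

def entropy(data, target='PlayTennis'):
    # H(S) = -sum over classes of p * log2(p)
    counts = data[target].value_counts()
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts)

def information_gain(data, attribute, target='PlayTennis'):
    # Gain(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v)
    total = len(data)
    remainder = sum(
        (len(subset) / total) * entropy(subset, target)
        for _, subset in data.groupby(attribute)
    )
    return entropy(data, target) - remainder

For the sample data above, Outlook yields the highest information gain, so it becomes the root of the tree.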

LAB EXPERIMENT : 4
Question: Build an Artificial Neural Network by implementing the
Backpropagation algorithm and test the same using appropriate
data sets.

Answer:

Implementation Steps:

* Define Network Architecture: Specify the number of input, hidden, and output layers and neurons.

* Initialize Weights and Biases: Randomly initialize weights and biases.

* Forward Propagation: Implement the feedforward process to calculate the output of the network for a given input. Use an activation function (e.g., sigmoid).

* Calculate Error: Compute the error between the network’s output and the
target output using a loss function (e.g., Mean Squared Error).

* Backward Propagation: Calculate the gradients of the loss function with respect to the weights and biases using the chain rule.

* Update Weights and Biases: Adjust the weights and biases in the direction
that reduces the error, using a learning rate.

* Training: Iterate through the training data multiple times (epochs).

* Testing: Evaluate the trained network on a separate test dataset.

Demonstration:

* Choose a Dataset: Use a simple dataset like XOR or a small classification dataset.

* Python Code (Conceptual using NumPy):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.1):
        # ... (initialize weights and biases)
        pass

    def forward(self, inputs):
        # ... (forward propagation)
        pass

    def backward(self, inputs, targets, outputs):
        # ... (backward propagation to calculate gradients)
        pass

    def update_weights(self):
        # ... (update weights and biases)
        pass

    def train(self, training_inputs, training_outputs, epochs):
        for epoch in range(epochs):
            # ... (forward, backward, update for each training example)
            pass

    def predict(self, inputs):
        # ... (forward propagation for prediction)
        pass

# Demonstrate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
nn = NeuralNetwork(input_size=2, hidden_size=2, output_size=1, learning_rate=0.5)
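Because the class above is only a skeleton, here is one possible complete version for the XOR demonstration: a minimal sketch using plain NumPy with a single sigmoid hidden layer and mean squared error. The seed, epoch count, and learning rate are illustrative choices, and with only two hidden units the network can occasionally get stuck in a local minimum (re-running or enlarging hidden_size usually fixes this):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(activated):
    # derivative of sigmoid expressed in terms of its output
    return activated * (1 - activated)

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.5):
        rng = np.random.default_rng(42)
        self.lr = learning_rate
        self.w1 = rng.uniform(-1, 1, (input_size, hidden_size))
        self.b1 = np.zeros((1, hidden_size))
        self.w2 = rng.uniform(-1, 1, (hidden_size, output_size))
        self.b2 = np.zeros((1, output_size))

    def forward(self, X):
        self.hidden = sigmoid(X @ self.w1 + self.b1)
        self.output = sigmoid(self.hidden @ self.w2 + self.b2)
        return self.output

    def train(self, X, y, epochs=10000):
        for _ in range(epochs):
            out = self.forward(X)
            # backward pass: gradients of the squared error through each layer
            output_delta = (out - y) * sigmoid_derivative(out)
            hidden_delta = (output_delta @ self.w2.T) * sigmoid_derivative(self.hidden)
            # gradient descent update of weights and biases
            self.w2 -= self.lr * self.hidden.T @ output_delta
            self.b2 -= self.lr * output_delta.sum(axis=0, keepdims=True)
            self.w1 -= self.lr * X.T @ hidden_delta
            self.b1 -= self.lr * hidden_delta.sum(axis=0, keepdims=True)

    def predict(self, X):
        return self.forward(X)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
nn = NeuralNetwork(input_size=2, hidden_size=2, output_size=1, learning_rate=0.5)
nn.train(X, y, epochs=10000)
print(np.round(nn.predict(X), 3))  # values close to [[0], [1], [1], [0]] are expected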
LAB EXPERIMENT : 5

Question: Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.

Answer:

Implementation Steps:

* Read CSV Data: Use pandas to load training and testing data.

* Separate Features and Class: Divide data into features and target variable.

* Calculate Class Probabilities: For each class, calculate its prior probability.

* Calculate Likelihoods: For each feature, calculate the conditional probability of each feature value given each class (using frequencies for categorical features, or an assumed distribution for continuous ones). Apply smoothing (e.g., Laplace) for categorical features.

* Classification: For a new sample, calculate the posterior probability for each class using Bayes’ theorem (assuming feature independence). Predict the class with the highest posterior probability.

* Compute Accuracy: Compare the predicted classes with the actual classes
in the test set.

Demonstration:

 Create Sample CSVs (train_nb.csv, test_nb.csv):

# train_nb.csv
Color,Shape,Class
Red,Circle,Positive
Blue,Square,Negative
Red,Square,Positive
Blue,Circle,Negative

# test_nb.csv
Color,Shape,Class
Red,Square,Positive
Blue,Circle,Negative

 Python Code (Conceptual using pandas):

import pandas as pd

def naive_bayes_train(train_df):
    # ... (calculate class probabilities and likelihoods)
    pass

def naive_bayes_predict(model, test_sample):
    # ... (calculate posterior probabilities and predict class)
    pass

def accuracy(predictions, actual):
    # ... (calculate accuracy)
    pass

# Demonstrate
train_file = 'train_nb.csv'
test_file = 'test_nb.csv'
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)
model = naive_bayes_train(train_df)

predictions = []
for index, row in test_df.iterrows():
    prediction = naive_bayes_predict(model, row[:-1].to_dict())
    predictions.append(prediction)

actual_classes = test_df['Class'].tolist()
acc = accuracy(predictions, actual_classes)
print("Accuracy:", acc)
print("Predictions:", predictions)
print("Actual Classes:", actual_classes)
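The three placeholders could be filled in as follows for purely categorical features. This is a minimal sketch with Laplace smoothing; the model is stored as plain dictionaries, and the structure and names are illustrative rather than fixed by the exercise:

import math

def naive_bayes_train(train_df, target='Class'):
    model = {'priors': {}, 'likelihoods': {},
             'features': [c for c in train_df.columns if c != target]}
    n = len(train_df)
    for cls, group in train_df.groupby(target):
        model['priors'][cls] = len(group) / n
        model['likelihoods'][cls] = {}
        for feat in model['features']:
            values = train_df[feat].unique()
            counts = group[feat].value_counts()
            # Laplace smoothing: add 1 to every count
            model['likelihoods'][cls][feat] = {
                v: (counts.get(v, 0) + 1) / (len(group) + len(values)) for v in values
            }
    return model

def naive_bayes_predict(model, test_sample):
    best_class, best_score = None, float('-inf')
    for cls, prior in model['priors'].items():
        # work in log space to avoid numeric underflow
        score = math.log(prior)
        for feat in model['features']:
            score += math.log(model['likelihoods'][cls][feat].get(test_sample[feat], 1e-9))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

def accuracy(predictions, actual):
    correct = sum(1 for p, a in zip(predictions, actual) if p == a)
    return correct / len(actual)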

LAB EXPERIMENT : 6
Question: Assuming a set of documents that need to be classified,
use the naïve Bayesian Classifier model to perform this task. Built-in
Java classes/API can be used to write the program. Calculate the
accuracy, precision, and recall for your data set.

Answer:

Implementation Steps (Java using Apache OpenNLP):

* Prepare Data: Organize documents into categories.

* Train Model:

* Use OpenNLP’s DocumentCategorizerTrainer to train a Naïve Bayes model from the labeled documents. This involves tokenizing the text and calculating probabilities of words given categories.

* Classify New Documents:

* Use the trained DocumentCategorizerME to predict the category of new, unseen documents.

* Evaluate:

* Use a test set of labeled documents.

* Calculate accuracy (correctly classified / total).

* Calculate precision (true positives / (true positives + false positives)) for each category.

* Calculate recall (true positives / (true positives + false negatives)) for each category.

Demonstration (Conceptual Java):

import opennlp.tools.doccat.*;
import opennlp.tools.tokenize.*;
import opennlp.tools.util.*;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DocumentClassifier {

    public static void main(String[] args) throws IOException {
        // Prepare training data (e.g., text files in directories named after categories)
        ObjectStream<DocumentSample> sampleStream = getDocumentSamples("path/to/training/data");

        // Train the Naive Bayes model (conceptual: the exact train(...) overload and the way
        // the Naive Bayes algorithm is selected depend on your OpenNLP version)
        DoccatModel model = DocumentCategorizerME.train("en",
                sampleStream, new NaiveBayesTrainer(), new TrainingParameters());

        // ... (classify test documents with new DocumentCategorizerME(model), then compute
        //      accuracy, precision, and recall as described above)
    }
}
LAB EXPERIMENT : 7

Question: Write a program to construct a Bayesian network considering medical data. Use this model to demonstrate the diagnosis of heart patients using standard Heart Disease Data Set. You can use Java/Python ML library classes/API.

Answer:

Here’s a Python program using the pgmpy library to construct a Bayesian network for heart disease diagnosis. We’ll use a simplified model for demonstration. For a real-world application, you’d need a more comprehensive network structure and a larger, more representative dataset.

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination
from pgmpy.estimators import MaximumLikelihoodEstimator
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the Heart Disease Dataset
try:
    data = pd.read_csv('heart.csv')  # Assuming you have a 'heart.csv' file
except FileNotFoundError:
    print("Error: 'heart.csv' not found. Please make sure the file is in the correct directory.")
    exit()

# Preprocess the data (simplified for this example)
# Convert categorical features to numerical using Label Encoding
categorical_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
for col in categorical_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])

# Discretize continuous features (for simplicity in this example)
# In a real application, more sophisticated discretization might be needed
data['Age_Group'] = pd.cut(data['Age'], bins=[0, 40, 55, 100],
                           labels=['Young', 'Middle', 'Old'])
data['RestingBP_Group'] = pd.cut(data['RestingBP'], bins=[0, 120, 140, 200],
                                 labels=['Normal', 'Elevated', 'High'])
data['Cholesterol_Group'] = pd.cut(data['Cholesterol'], bins=[0, 200, 240, 600],
                                   labels=['Normal', 'Borderline', 'High'])
data['MaxHR_Group'] = pd.cut(data['MaxHR'], bins=[0, 100, 150, 220],
                             labels=['Low', 'Average', 'High'])
data['Oldpeak_Group'] = pd.cut(data['Oldpeak'], bins=[-1, 0, 2, 7],
                               labels=['Low', 'Medium', 'High'])

# Drop the original continuous features
data_discrete = data.drop(columns=['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak'])

# Define the structure of the Bayesian Network (simplified)
# This structure reflects potential dependencies between variables
model = BayesianNetwork([('Age_Group', 'HeartDisease'),
                         ('Sex', 'HeartDisease'),
                         ('ChestPainType', 'HeartDisease'),
                         ('RestingBP_Group', 'HeartDisease'),
                         ('Cholesterol_Group', 'HeartDisease'),
                         ('FastingBS', 'HeartDisease'),
                         ('RestingECG', 'HeartDisease'),
                         ('MaxHR_Group', 'HeartDisease'),
                         ('ExerciseAngina', 'HeartDisease'),
                         ('Oldpeak_Group', 'HeartDisease'),
                         ('ST_Slope', 'HeartDisease')])

# Estimate the CPDs (Conditional Probability Distributions) from the data
# (fit() expects the estimator class itself, not an instance)
model.fit(data_discrete, estimator=MaximumLikelihoodEstimator)

# Verify the model structure and CPDs
print("Bayesian Network Edges:", model.edges())
for cpd in model.get_cpds():
    print("\nCPD for", cpd.variable)
    print(cpd)

# Perform inference for diagnosis
infer = VariableElimination(model)

# Example diagnosis: Predicting the probability of heart disease given some observations
print("\n--- Diagnosis Example ---")
q = infer.query(variables=['HeartDisease'],
                evidence={'Age_Group': 'Old',
                          'Sex': 1,               # Assuming 1 represents Male after Label Encoding
                          'ChestPainType': 0,     # Assuming 0 represents typical angina
                          'RestingBP_Group': 'Elevated',
                          'Cholesterol_Group': 'High',
                          'FastingBS': 1,
                          'RestingECG': 0,
                          'MaxHR_Group': 'Average',
                          'ExerciseAngina': 1,
                          'Oldpeak_Group': 'Medium',
                          'ST_Slope': 2})         # Assuming 2 represents Down-sloping
print(q)

Explanation:

* Import Libraries: We import necessary libraries from pgmpy for Bayesian Networks, pandas for data handling, and sklearn for preprocessing.

* Load Data: The code assumes you have a CSV file named heart.csv
containing the heart disease dataset. You’ll need to replace this with the
actual path to your dataset.

* Preprocess Data:

* Label Encoding: Categorical features are converted into numerical representations using LabelEncoder. Bayesian networks in pgmpy often work best with discrete variables.

* Discretization: Continuous features like ‘Age’, ‘RestingBP’, etc., are discretized into bins to create categorical variables. The binning strategy here is simple and for demonstration purposes. In a real application, you’d likely use more informed binning or consider using algorithms that can handle continuous variables directly within the Bayesian network framework (though pgmpy’s discrete inference is more mature).

* The original continuous columns are dropped after discretization.


* Define Network Structure: We define the structure of the Bayesian network
by specifying the edges (dependencies) between the variables. This
structure is based on domain knowledge and potential relationships between
risk factors and heart disease. Note: This is a simplified structure. A more
accurate model would likely involve more complex relationships.

* Estimate CPDs: We use the MaximumLikelihoodEstimator to learn the Conditional Probability Distributions (CPDs) for each node in the network based on the provided data. The CPDs quantify the probability of each state of a node given the states of its parent nodes.

* Verify Model: We print the edges of the network and the learned CPDs to
understand the model.

* Perform Inference: We use VariableElimination, an inference algorithm for Bayesian networks, to query the probability of a specific variable (in this case, ‘HeartDisease’) given some evidence (observed values for other variables). The query method returns a probability distribution over the possible states of the queried variable.

To run this code:

 Install Libraries:

pip install pandas scikit-learn pgmpy

* Download Dataset: Obtain a standard Heart Disease Dataset (e.g., from the UCI Machine Learning Repository or Kaggle) and save it as heart.csv in the same directory as your Python script, or update the file path in the code.

* Run the Script: Execute the Python script.

The output will show the structure of the learned Bayesian network and the
probability of heart disease given the specified evidence.
LAB EXPERIMENT : 8

Question: Apply the EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set for clustering using the k-Means algorithm. Compare the results of these two algorithms and comment on the quality of clustering. You can use Java/Python ML library classes/API.

Answer:

Here’s a Python program using scikit-learn to apply the EM algorithm (Gaussian Mixture Model) and the k-Means algorithm to cluster data from a CSV file. We’ll then compare the results.

import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Load the data from the CSV file
try:
    data = pd.read_csv('clustering_data.csv')  # Replace with your CSV file name
except FileNotFoundError:
    print("Error: 'clustering_data.csv' not found.")
    exit()

# Handle potential non-numeric columns by selecting only numeric ones
numeric_data = data.select_dtypes(include=['number'])
if numeric_data.empty:
    print("Error: No numeric columns found in the CSV file for clustering.")
    exit()

# Standardize the data (important for both algorithms)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_data)

# Convert scaled data back to DataFrame for easier handling if needed
scaled_df = pd.DataFrame(scaled_data, columns=numeric_data.columns)

# --- EM Algorithm (Gaussian Mixture Model) ---
n_clusters_em = 3  # You might need to determine the optimal number of clusters
gmm = GaussianMixture(n_components=n_clusters_em, random_state=42)
gmm_labels = gmm.fit_predict(scaled_data)

# Evaluate EM clustering
silhouette_em = silhouette_score(scaled_data, gmm_labels)
print("\n--- EM Clustering ---")
print(f"Silhouette Score (EM): {silhouette_em:.3f}")

# Add EM cluster labels to the DataFrame
scaled_df['EM_Cluster'] = gmm_labels

# --- k-Means Algorithm ---
n_clusters_kmeans = 3  # Using the same number of clusters for comparison
kmeans = KMeans(n_clusters=n_clusters_kmeans, random_state=42, n_init=10)  # n_init added for stability
kmeans_labels = kmeans.fit_predict(scaled_data)

# Evaluate k-Means clustering
silhouette_kmeans = silhouette_score(scaled_data, kmeans_labels)
print("\n--- k-Means Clustering ---")
print(f"Silhouette Score (k-Means): {silhouette_kmeans:.3f}")

# Add k-Means cluster labels to the DataFrame
scaled_df['KMeans_Cluster'] = kmeans_labels

# --- Compare Results (Qualitative - Scatter Plot for 2D data) ---
if scaled_df.shape[1] >= 4:  # Check if at least 2 original features exist
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    scatter = plt.scatter(scaled_df.iloc[:, 0], scaled_df.iloc[:, 1],
                          c=scaled_df['EM_Cluster'], cmap='viridis')
    plt.title(f'EM Clustering (Silhouette: {silhouette_em:.3f})')
    plt.xlabel(scaled_df.columns[0])
    plt.ylabel(scaled_df.columns[1])
    plt.colorbar(scatter, label='Cluster')

    plt.subplot(1, 2, 2)
    scatter = plt.scatter(scaled_df.iloc[:, 0], scaled_df.iloc[:, 1],
                          c=scaled_df['KMeans_Cluster'], cmap='viridis')
    plt.title(f'k-Means Clustering (Silhouette: {silhouette_kmeans:.3f})')
    plt.xlabel(scaled_df.columns[0])
    plt.ylabel(scaled_df.columns[1])
    plt.colorbar(scatter, label='Cluster')

    plt.tight_layout()
    plt.show()
else:
    print("\nNote: Cannot generate scatter plot as the data has less than 2 features.")

# --- Comment on Quality of Clustering ---
print("\n--- Comparison and Comments ---")
print("Silhouette Score provides a measure of how similar an object is to its "
      "own cluster compared to other clusters.")
print("Higher Silhouette Score (closer to +1) indicates better-defined clusters.")
print(f"EM Silhouette Score: {silhouette_em:.3f}")
print(f"k-Means Silhouette Score: {silhouette_kmeans:.3f}")

print("\nComments on Clustering Quality:")
if silhouette_em > silhouette_kmeans:
    print("EM algorithm achieved a higher silhouette score, suggesting potentially "
          "better-defined clusters compared to k-Means for this data.")
elif silhouette_kmeans > silhouette_em:
    print("k-Means algorithm achieved a higher silhouette score, suggesting potentially "
          "better-defined clusters compared to EM for this data.")
else:
    print("Both EM and k-Means algorithms resulted in similar silhouette scores.")

print("\nFurther considerations for comparison:")
print("- The underlying distribution of the data: EM assumes Gaussian distributions "
      "for clusters, while k-Means assumes spherical and equally sized clusters.")
print("- The number of clusters chosen: The quality of both algorithms is highly "
      "dependent on the chosen 'n_clusters'. Techniques like the elbow method or "
      "silhouette analysis can help determine an appropriate number.")
print("- Initialization: k-Means is sensitive to the initial placement of centroids "
      "(addressed here by setting n_init). EM also has initialization sensitivities.")
print("- Cluster shapes and sizes: EM can handle clusters with different shapes and "
      "sizes due to the covariance matrices it learns, while k-Means struggles with "
      "non-spherical or differently sized clusters.")

Explanation:

* Import Libraries: We import necessary libraries from scikit-learn for clustering algorithms, preprocessing, and evaluation, pandas for data handling, and matplotlib for visualization.

* Load Data: The code assumes your data is in a CSV file named
clustering_data.csv. Replace this with the actual file name.

* Handle Non-Numeric Data: The code selects only numeric columns for
clustering as both EM and k-Means typically operate on numerical data.

* Standardize Data: Feature scaling (using StandardScaler) is crucial for both algorithms to prevent features with larger ranges from dominating the clustering process.

* EM Algorithm (Gaussian Mixture Model):

* We initialize a GaussianMixture model with a specified number of components (n_components). This is the number of clusters we want to find. You might need to experiment with different values or use techniques to determine the optimal number of clusters.

* fit_predict() learns the Gaussian mixture model from the data and assigns
each data point to a cluster.
* We evaluate the clustering using the Silhouette Score, which measures
how well each data point fits into its assigned cluster compared to other
clusters. A higher score (closer to +1) indicates better clustering.

* k-Means Algorithm:

* We initialize a KMeans model with the same number of clusters for a fair
comparison. n_init is set to improve the stability of the algorithm by running
it multiple times with different initial centroid seeds.

* fit_predict() performs the k-Means clustering and assigns cluster labels.

* We again calculate the Silhouette Score for k-Means.

* Compare Results:

* Qualitative Comparison (Scatter Plot): If the data has at least two features, we generate a scatter plot to visualize the clusters found by both algorithms. This helps in visually assessing the quality and differences in the clustering.

* Quantitative Comparison (Silhouette Score): We compare the Silhouette Scores obtained by both algorithms.

* Comments: We provide comments on the quality of clustering based on the Silhouette Scores and discuss the underlying assumptions and strengths/weaknesses of each algorithm. A short helper for checking the choice of cluster count is sketched below.
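Since both algorithms above use a fixed cluster count of 3, one simple way to check whether that choice is reasonable is to scan a range of cluster counts and compare silhouette scores. A minimal sketch, reusing scaled_data from the program above (the range of k values is an arbitrary choice):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

for k in range(2, 7):
    km_labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(scaled_data)
    gm_labels = GaussianMixture(n_components=k, random_state=42).fit_predict(scaled_data)
    print(f"k={k}: k-Means silhouette={silhouette_score(scaled_data, km_labels):.3f}, "
          f"EM silhouette={silhouette_score(scaled_data, gm_labels):.3f}")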

To run this code:

* Install Libraries:

pip install pandas scikit-learn matplotlib

* Create Data File: Create a CSV file named clustering_data.csv (or whatever
name you use in the code) with the data you want to cluster. Ensure it has
numeric columns.

* Run the Script: Execute the Python script.

The output will show the Silhouette Scores for both EM and k-Means
clustering, a scatter plot (if the data has at least two features), and
comments comparing the results and the algorithms.

Remember that the "better" clustering algorithm depends heavily on the underlying structure and distribution of your data. EM is generally better at handling clusters with different shapes and sizes (as it models each cluster with a Gaussian distribution that has its own covariance matrix), while k-Means works well for clusters that are roughly spherical and of similar size.
