Assignment 1 Set A Q.
Create a ‘User’ dataset having 5 columns, namely: User ID, Gender,
Age, Estimated Salary and Purchased. Build a logistic regression
model that can predict, from the given parameters, whether a person
will buy a car or not.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Create the User dataset
data = {'User ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Gender': ['M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
        'Age': [28, 45, 23, 31, 37, 22, 33, 42, 29, 25],
        'Estimated Salary': [35000, 55000, 18000, 65000, 75000, 20000, 84000, 92000, 32000, 58000],
        'Purchased': [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]}
df = pd.DataFrame(data)
# Convert the Gender column to numeric format (1 = M, 0 = F)
df['Gender'] = pd.get_dummies(df['Gender'], drop_first=True).iloc[:, 0].astype(int)
# Split the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    df[['Gender', 'Age', 'Estimated Salary']],
    df['Purchased'],
    test_size=0.3,
    random_state=0)
# Build the logistic regression model
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = lr.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
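Once the model is fitted, it can also score a new user. The short sketch below is an optional addition; the gender, age and salary values are made up for illustration and are not part of the assignment data.
# Hypothetical new user: male (Gender=1), age 40, estimated salary 60,000
new_user = pd.DataFrame({'Gender': [1], 'Age': [40], 'Estimated Salary': [60000]})
print('Predicted Purchased label:', lr.predict(new_user)[0])
print('Probability of purchase:', lr.predict_proba(new_user)[0][1])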
Assignment 1 Set B Q.1
Build a simple linear regression model for Fish Species Weight
Prediction.
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Load the dataset
df = pd.read_csv('Fish.csv')
# Preprocess the data
df = df.drop(['Species'], axis=1)  # Drop the Species column as it is categorical
df = df.dropna()  # Drop any rows with missing values
# Split the dataset into training and testing sets
X = df.drop(['Weight'], axis=1) # Features
y = df['Weight'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
# Train the model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Make predictions on the test set
y_pred = regressor.predict(X_test)
# Evaluate the model using the coefficient of determination (R^2 score)
r2 = r2_score(y_test, y_pred)
print("R^2 score:", r2)
# With several features, the fit cannot be drawn as a single regression line,
# so plot the predicted weights against the actual weights instead
plt.scatter(y_test, y_pred, color='gray')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linewidth=2)
plt.xlabel('Actual Weight')
plt.ylabel('Predicted Weight')
plt.show()
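Because every remaining numeric column of Fish.csv is used as a feature, it can also help to print the fitted coefficients; this is a small optional addition to the assignment code.
# Inspect the fitted model: one coefficient per feature column, plus the intercept
for feature, coef in zip(X.columns, regressor.coef_):
    print(feature, ':', coef)
print('Intercept:', regressor.intercept_)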
Assignment 1 Set B Q.2
Use the iris dataset. Write a Python program to view some basic
statistical details like percentile, mean, std etc. of the species
'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'. Apply logistic
regression on the dataset to identify the different species (setosa,
versicolor, virginica) of Iris flowers given just 4 features: sepal and
petal lengths and widths. Find the accuracy of the model.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
# Convert to pandas dataframe
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
# Print basic statistical details of different species
print('Iris-setosa statistics:')
print(df[df['target'] == 0].describe())
print('Iris-versicolor statistics:')
print(df[df['target'] == 1].describe())
print('Iris-virginica statistics:')
print(df[df['target'] == 2].describe())
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], df['target'], test_size=0.2, random_state=42)
# Fit logistic regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
# Predict on test set
y_pred = lr_model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
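Beyond the single accuracy number, a confusion matrix and per-class report show which species get confused with each other. This is an optional extension, not part of the original listing.
from sklearn.metrics import confusion_matrix, classification_report
# Rows are the actual species, columns the predicted species (0=setosa, 1=versicolor, 2=virginica)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))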
Assignment 2 Set B Q.1
Download the Market Basket dataset. Write a Python program to
read the dataset and display its information. Preprocess the data
(drop null values etc.). Convert the categorical values into numeric
format. Apply the apriori algorithm on the above dataset to
generate the frequent itemsets and association rules.
# Importing required libraries
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# Reading the dataset
data = pd.read_csv('market_basket.csv')
# Displaying the dataset information
print('Dataset Information:')
print(data.info())
# Preprocessing the data
data.dropna(inplace=True)
transactions = []
for i in range(len(data)):
    transactions.append([str(data.values[i, j]) for j in range(data.shape[1])])
# Converting categorical values into numeric format
te = TransactionEncoder()
te_ary = te.fit_transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
# Applying Apriori algorithm
frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
# Displaying the frequent itemsets and association rules
print('\nFrequent Itemsets:')
print(frequent_itemsets)
print('\nAssociation Rules:')
print(rules)
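The full rules table can be large, so it is often filtered before reading it. The snippet below is an optional addition; the 0.5 confidence threshold is an illustrative choice rather than a value given in the assignment.
# Keep only the stronger rules and sort them by lift
strong_rules = rules[rules['confidence'] >= 0.5].sort_values('lift', ascending=False)
print(strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])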
Assignment 2 Set B Q.2
Download the groceries dataset. Write a Python program to read
the dataset and display its information. Preprocess the data (drop
null values etc.). Convert the categorical values into numeric format.
Apply the apriori algorithm on the above dataset to generate the
frequent itemsets and association rules.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# Load the dataset into a pandas dataframe
df = pd.read_csv('groceries.csv')
# Display information about the dataset
df.info()
# Preprocess the data by dropping any null values and converting
# categorical values into numeric (one-hot) format
df.dropna(inplace=True)
# TransactionEncoder expects a list of transactions, so convert each row to a list of items
transactions = df.astype(str).values.tolist()
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
# Apply the Apriori algorithm on the preprocessed dataset
frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True)
# Use a new variable name so the association_rules function is not shadowed
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
# Display the frequent itemsets and association rules in a readable format
print(frequent_itemsets)
print(rules)
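As a quick sanity check on the output, the most frequent itemsets can be listed by support; this is a small optional extra step.
# Show the ten itemsets with the highest support
print(frequent_itemsets.sort_values('support', ascending=False).head(10))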
Assignment 2 Set A Q.2
Create your own transactions dataset and apply the above process
on your dataset.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
transactions = [
['apple', 'banana', 'orange', 'grape'],
['apple', 'banana', 'grape'],
['apple', 'orange'],
['banana', 'orange', 'grape'],
['apple', 'banana', 'orange', 'kiwi'],
['orange', 'kiwi'],
['apple', 'banana', 'kiwi'],
['orange', 'grape', 'kiwi'],
['apple', 'orange', 'grape', 'kiwi'],
['apple', 'banana', 'orange', 'grape', 'kiwi']
]
# convert transactions to one-hot encoded format
te = TransactionEncoder()
one_hot = te.fit_transform(transactions)
# convert one-hot encoded format to dataframe
df = pd.DataFrame(one_hot, columns=te.columns_)
# generate frequent itemsets using Apriori algorithm
freq_itemsets = apriori(df, min_support=0.3, use_colnames=True)
# generate association rules
rules = association_rules(freq_itemsets, metric='confidence', min_threshold=0.7)
# print frequent itemsets and association rules
print("Frequent Itemsets:")
print(freq_itemsets)
print("\nAssociation Rules:")
print(rules)
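The rules DataFrame can also be queried for a single item of interest; a short optional example using 'apple', one of the items in the transactions above:
# Look only at rules whose antecedent contains 'apple'
apple_rules = rules[rules['antecedents'].apply(lambda items: 'apple' in items)]
print(apple_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])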
Assignment 3 Set A Q.1
Consider any text paragraph. Preprocess the text to remove any
special characters and digits. Generate the summary using the
extractive summarization process.
import re
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')
# sample text paragraph
text = ("This is a sample text paragraph. It contains some special characters like % "
        "and digits like 123. The paragraph needs to be summarized using extractive "
        "summarization process.")
# tokenize sentences first, so that sentence boundaries are not lost during cleaning
sentences = sent_tokenize(text)
# preprocess each sentence by removing special characters and digits
clean_sentences = [re.sub('[^a-zA-Z]', ' ', sentence) for sentence in sentences]
# compute sentence scores using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(clean_sentences)
scores = np.asarray(X.sum(axis=1)).ravel()
# select the top N sentences based on their scores
N = 2
idx = scores.argsort()[::-1][:N]
summary = [sentences[i] for i in sorted(idx)]
# print summary
print("Summary:")
for sentence in summary:
    print(sentence)
Program 8
Consider the text paragraph: "So, keep working. Keep striving. Never
give up. Fall down seven times, get up eight. Ease is a greater
threat to progress than hardship. Ease is a greater threat to
progress than hardship. So, keep moving, keep growing, keep
learning. See you at work." Preprocess the text to remove any
special characters and digits. Generate the summary using the
extractive summarization process.
import re
import nltk
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from wordcloud import WordCloud
nltk.download('stopwords')
nltk.download('punkt')
text = ("So, keep working. Keep striving. Never give up. Fall down seven times, get up eight. "
        "Ease is a greater threat to progress than hardship. Ease is a greater threat to "
        "progress than hardship. So, keep moving, keep growing, keep learning. See you at work.")
# Remove special characters and digits
processed_text = re.sub('[^A-Za-z]+', ' ', text)
print(processed_text)
# Tokenize sentences from the original text so the sentence boundaries are preserved
sentences = sent_tokenize(text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_sentences = []
for sentence in sentences:
    words = sentence.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    filtered_sentence = ' '.join(filtered_words)
    filtered_sentences.append(filtered_sentence)
# Calculate word frequency distribution and plot frequencies
words = processed_text.split()
fdist = FreqDist(words)
fdist.plot()
wordcloud = WordCloud(width=800, height=800, background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(processed_text)
# plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
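The problem statement also asks for an extractive summary, which the listing above stops short of producing. A minimal sketch of that final step is given below: it scores each sentence by the frequencies of its non-stopword words and keeps the top two sentences (the choice of two is illustrative).
# Score each sentence by summing the frequencies of its non-stopword words
sentence_scores = []
for original, filtered in zip(sentences, filtered_sentences):
    score = sum(fdist[word] for word in filtered.split())
    sentence_scores.append((score, original))
# Keep the two highest-scoring sentences as the extractive summary
summary = [sentence for _, sentence in sorted(sentence_scores, reverse=True)[:2]]
print("Summary:")
for sentence in summary:
    print(sentence)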
Assignment 3 Set A Q.2
Consider any text paragraph. Remove the stopwords. Tokenize the
paragraph to extract words and sentences. Calculate the word
frequency distribution and plot the frequencies. Plot the wordcloud
of the text.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# download the required corpora if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')
# sample text paragraph
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed
do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur."
# tokenize the text into words
words = word_tokenize(text)
# remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.casefold() not in stop_words]
# tokenize the text into sentences
sentences = sent_tokenize(text)
# calculate the frequency distribution of the words
fdist = FreqDist(filtered_words)
# plot the frequency distribution of the words
fdist.plot()
# create a wordcloud of the most frequent words
wordcloud = WordCloud(width=800, height=800, background_color='white',
                      min_font_size=10).generate(' '.join(filtered_words))
# plot the wordcloud
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
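For a quick textual view of the same distribution, the most common words can also be printed; this is a small optional addition.
# Print the ten most frequent non-stopword words with their counts
for word, count in fdist.most_common(10):
    print(word, count)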
Program 10
Download the movie_review.csv dataset from Kaggle by using the following link:
https://www.kaggle.com/nltkdata/movie-review/version/3?select=movie_review.csv
Perform sentiment analysis on the above dataset and create a wordcloud.
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# download the required corpora if not already downloaded
nltk.download('stopwords')
nltk.download('vader_lexicon')
# read the dataset
df = pd.read_csv('movie_review.csv')
# instantiate the SentimentIntensityAnalyzer object
sia = SentimentIntensityAnalyzer()
# apply sentiment analysis to each review and store the compound score in a new column
df['sentiment'] = df['review'].apply(lambda x: sia.polarity_scores(x)['compound'])
# print the number of positive, negative, and neutral reviews
print('Positive Reviews:', len(df[df['sentiment'] > 0]))
print('Negative Reviews:', len(df[df['sentiment'] < 0]))
print('Neutral Reviews:', len(df[df['sentiment'] == 0]))
# create a wordcloud of the most frequent words in the dataset
wordcloud = WordCloud(width=800, height=800, background_color='white',
                      stopwords=nltk.corpus.stopwords.words('english'),
                      min_font_size=10).generate(' '.join(df['review']))
# plot the wordcloud
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
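A quick aggregate view of the compound scores can also be added; this is optional and only uses the sentiment column created above.
# Summarize the distribution of compound sentiment scores
print('Average compound score:', df['sentiment'].mean())
print(df['sentiment'].describe())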
Program 11
Consider the text paragraph: """Hello all, Welcome to Python
Programming Academy. Python Programming Academy is a nice
platform to learn new programming skills. It is difficult to get
enrolled in this Academy.""" Remove the stopwords.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# download the required corpora if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')
# set of english stopwords
stop_words = set(stopwords.words('english'))
# input text paragraph
text = "Hello all, Welcome to Python Programming Academy.
Python Programming Academy is a nice platform to learn new
programming skills. It is difficult to get enrolled in this Academy."
# tokenize the text paragraph into individual words
words = word_tokenize(text)
# remove the stopwords from the list of words
words_without_stopwords = [word for word in words if word.lower() not in stop_words]
# join the words back into a string
text_without_stopwords = ' '.join(words_without_stopwords)
# print the text paragraph without stopwords
print(text_without_stopwords)
Program 12
Build a simple linear regression model for User Data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
user_data = pd.read_csv('user_data.csv')
X = user_data[['age']] # independent variable
y = user_data['income'] # dependent variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=0)
simple_lr = LinearRegression()
simple_lr.fit(X_train, y_train)
y_pred = simple_lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Mean squared error:', mse)
# Predict income for some new ages
new_data = pd.DataFrame({'age': [35, 45, 55]})
print('Predicted incomes:', simple_lr.predict(new_data))
Assignment 3 Set B Q.3
Consider the following dataset:
https://www.kaggle.com/datasets/datasnaek/youtubenew?select=INvideos.csv
Write a Python script for the following: i. Read the dataset and perform data
cleaning operations on it. ii. Find the total views, total likes, total dislikes
and comment count.
import pandas as pd
# Load the dataset
data = pd.read_csv('stats.csv')
# Drop any rows with missing values
data.dropna(inplace=True)
# Find the total views, likes, dislikes and comment count
total_views = data['views'].sum()
total_likes = data['likes'].sum()
total_dislikes = data['dislikes'].sum()
total_comments = data['comment_count'].sum()
# Print the results
print('Total views:', total_views)
print('Total likes:', total_likes)
print('Total dislikes:', total_dislikes)
print('Total comments:', total_comments)
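The same columns can also be aggregated per channel. The snippet below is an optional addition and assumes the dataset exposes a channel_title column (as the Kaggle trending-videos files do); treat that column name as an assumption.
# Total views per channel, highest first ('channel_title' is assumed to exist in the CSV)
views_by_channel = data.groupby('channel_title')['views'].sum().sort_values(ascending=False)
print(views_by_channel.head(10))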
Assignment 3 Set B Q.2
Consider the following dataset:
https://www.kaggle.com/datasets/seungguini/youtube-commentsfor-covid19-relatedvideos?select=covid_2021_1.csv
Write a Python script for the following: i. Read the dataset and perform data
cleaning operations on it. ii. Tokenize the comments in words. iii. Perform
sentiment analysis and find the percentage of positive, negative and neutral
comments.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# download the required resources if not already downloaded
nltk.download('punkt')
nltk.download('vader_lexicon')
# Load the dataset
data = pd.read_csv('covid.csv')
# Drop any rows with missing values
data.dropna(inplace=True)
# Tokenize the comments in words
data['tokens'] = data['commentText'].apply(word_tokenize)
# Perform sentiment analysis
sid = SentimentIntensityAnalyzer()
data['sentiment'] = data['commentText'].apply(lambda x: sid.polarity_scores(x)['compound'])
# Categorize the comments into positive, negative and neutral based on sentiment score
data['sentiment_category'] = pd.cut(data['sentiment'], bins=3,
                                    labels=['negative', 'neutral', 'positive'])
# Calculate the percentage of comments in each sentiment category
sentiment_counts = data['sentiment_category'].value_counts(normalize=True) * 100
print('Percentage of positive comments:', sentiment_counts['positive'])
print('Percentage of negative comments:', sentiment_counts['negative'])
print('Percentage of neutral comments:', sentiment_counts['neutral'])
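pd.cut with three equal-width bins simply splits the range of observed compound scores into thirds, which is a rough heuristic. A common VADER convention instead labels scores of +0.05 and above as positive and -0.05 and below as negative; a hedged alternative sketch:
# Alternative categorization using the usual VADER thresholds of +/- 0.05
def categorize(score):
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    return 'neutral'

data['sentiment_category'] = data['sentiment'].apply(categorize)
print(data['sentiment_category'].value_counts(normalize=True) * 100)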
Program 15
Build a simple linear regression model for Car Dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the dataset
data = pd.read_csv('cars.csv')
# Split the dataset into features and target variable
X = data['mileage'].values.reshape(-1, 1)
y = data['price'].values.reshape(-1, 1)
# Split the dataset into training and testing sets with a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
# Create a linear regression object
linreg = LinearRegression()
# Fit the training data to the model
linreg.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = linreg.predict(X_test)
# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
# Print the mean squared error
print('Mean squared error:', mse)
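With the model fitted, a price can be estimated for a new mileage value; the 50,000 figure below is purely illustrative.
# Predict the price of a car with a hypothetical mileage of 50,000
new_mileage = [[50000]]
predicted_price = linreg.predict(new_mileage)
print('Predicted price:', predicted_price[0][0])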
Program 16
Build a logistic regression model for Student Score Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the dataset
data = pd.read_csv('student_scores.csv')
# Split the dataset into features and target variable
X = data.drop(['Pass/Fail'], axis=1)
y = data['Pass/Fail']
# Split the dataset into training and testing sets with a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
# Create a logistic regression object
logreg = LogisticRegression()
# Fit the training data to the model
logreg.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = logreg.predict(X_test)
# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
# Print the accuracy score
print('Accuracy:', accuracy)
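Besides the hard pass/fail prediction, the model can also report how confident it is for each student in the test set; a small optional addition.
# Class probabilities for every student in the test set (column order follows logreg.classes_)
print(logreg.classes_)
print(logreg.predict_proba(X_test))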
Program 17
Create the dataset: transactions = [['eggs', 'milk', 'bread'], ['eggs',
'apple'], ['milk', 'bread'], ['apple', 'milk'], ['milk', 'apple', 'bread']].
Convert the categorical values into numeric format. Apply the
apriori algorithm on the above dataset to generate the frequent
itemsets and association rules.
from sklearn.preprocessing import LabelEncoder
from apyori import apriori
transactions = [['eggs', 'milk', 'bread'],
['eggs', 'apple'],
['milk', 'bread'],
['apple', 'milk'],
['milk', 'apple', 'bread']]
# Create a LabelEncoder object
le = LabelEncoder()
# Loop through each transaction and encode the categorical values,
# collecting the numeric versions in a separate list
encoded_transactions = []
for transaction in transactions:
    le.fit(transaction)
    encoded_transactions.append(list(le.transform(transaction)))
print(encoded_transactions)
# Apply the Apriori algorithm with a minimum support of 0.5
results = list(apriori(transactions, min_support=0.5))
# Print the frequent itemsets and association rules
for item in results:
    print(item)
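apyori returns RelationRecord objects, which are easier to read once unpacked; the short optional sketch below prints each itemset with its support and each derived rule with its confidence and lift.
# Unpack each RelationRecord into its itemset, support, and rule statistics
for record in results:
    print('Itemset:', list(record.items), 'support:', round(record.support, 2))
    for stat in record.ordered_statistics:
        if stat.items_base:
            print('  Rule:', list(stat.items_base), '->', list(stat.items_add),
                  'confidence:', round(stat.confidence, 2), 'lift:', round(stat.lift, 2))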