R23 2-1 Python Lab 4 J5
UNIT – IV
Sample Experiments:
1) Write a program to sort words in a file and put them in another file. The output file
should have only lower-case words, so any upper-case words from the source must be
converted to lower case.
Program :
# Function to sort words in a file and write to another file
def sort_words_in_file(input_file, output_file):
    try:
        # Open the input file and read the content
        with open(input_file, 'r') as infile:
            # Read the whole file, split into words, and convert to lowercase
            words = infile.read().split()
            words = [word.lower() for word in words]
        # Sort the words alphabetically
        words.sort()
        # Write the sorted words to the output file, one word per line
        with open(output_file, 'w') as outfile:
            for word in words:
                outfile.write(word + '\n')
    except FileNotFoundError:
        print(f"The file {input_file} does not exist.")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
input_file = 'source.txt'  # Input file containing words
output_file = 'sorted_words.txt'  # Output file to store sorted words
sort_words_in_file(input_file, output_file)
Explanation:
1. Reading the File: The program reads all the words from the input_file, splits them into
a list of words, and converts them to lowercase.
2. Sorting: The list of words is sorted alphabetically.
3. Writing to Output File: The sorted words are written to the output_file, one word per
line.
4. Error Handling: The program checks for file not found errors and handles other
potential exceptions.
You can modify the file paths (input_file, output_file) to match your setup.
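2) Python program to print each line of a file in reverse order?
Program:
# Function to print each line of a file in reverse order
def print_reverse_lines(file_name):
    try:
        with open(file_name, 'r') as file:
            for line in file.readlines():
                # Remove the trailing newline, then reverse the text with slicing
                print(line.rstrip()[::-1])
    except FileNotFoundError:
        print(f"The file {file_name} does not exist.")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
file_name = 'source.txt'  # Input file name
print_reverse_lines(file_name)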
Explanation:
1. Reading the File: The program reads the file line by line using readlines().
2. Reversing the Line: For each line, it removes the newline character (rstrip()) and
reverses the string using slicing ([::-1]).
3. Printing: The reversed line is printed out.
4. Error Handling: It handles the case where the file may not exist and catches any
other errors.
To use this program, set the file_name variable to the path of your input file.
3) Python program to compute the number of characters, words and lines in a file?
Program:
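# Function to compute characters, words, and lines in a file
def count_file_details(file_name):
    try:
        # Initialize counters for characters, words, and lines
        num_characters = 0
        num_words = 0
        num_lines = 0
        # Read the file line by line
        with open(file_name, 'r') as file:
            for line in file:
                num_characters += len(line)     # Includes spaces and newline characters
                num_words += len(line.split())  # split() separates on whitespace
                num_lines += 1
        # Display the results
        print(f"Number of characters: {num_characters}")
        print(f"Number of words: {num_words}")
        print(f"Number of lines: {num_lines}")
    except FileNotFoundError:
        print(f"The file {file_name} does not exist.")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
file_name = 'source.txt'  # Input file name
count_file_details(file_name)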
Explanation:
1. Reading the File: The file is read line by line using a for loop.
2. Character Counting: The length of each line (including spaces and newline
characters) is added to num_characters.
3. Word Counting: Each line is split into words using split(), and the number of words is
updated.
4. Line Counting: For each line in the file, num_lines is incremented.
5. Error Handling: The code handles cases where the file may not exist.
To run the program, set file_name to the path of the file you want to analyze.
4) Write a program to create, display, append, insert and reverse the order of the items
in the array?
Program:
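from array import array

def create_array(elements):
    """Create an integer array (type code 'i') from a list of values."""
    return array('i', elements)

# Example usage
if __name__ == "__main__":
    # Create an array with initial values
    elements = [10, 20, 30, 40, 50]
    arr = create_array(elements)
    print("Array:", arr.tolist())  # Display the items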
The array module stores items of a single type, such as integers here, and provides
append(), insert(), and reverse() for the remaining operations. To work with other data
types, change the array's type code.
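The append, insert, and reverse operations named in the question can be sketched as follows. This is a minimal illustration that assumes the same starting values as the example above; the inserted values 60 and 25 are arbitrary.

```python
from array import array

# Start from the same illustrative values used in the example
arr = array('i', [10, 20, 30, 40, 50])

arr.append(60)       # Add 60 at the end
arr.insert(2, 25)    # Insert 25 before index 2
print("After append and insert:", arr.tolist())
# After append and insert: [10, 20, 25, 30, 40, 50, 60]

arr.reverse()        # Reverse the items in place
print("Reversed:", arr.tolist())
# Reversed: [60, 50, 40, 30, 25, 20, 10]
```

All three methods modify the array in place and return None, so their results are read from arr itself.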
5) Write a python program to create a class that represents a shape. Include methods to
calculate its area and perimeter. Implement subclasses for different shapes like circle, triangle
and square?
Program:
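import math

class Shape:
    def area(self):
        pass

    def perimeter(self):
        pass

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

    def perimeter(self):
        return 2 * math.pi * self.radius

class Triangle(Shape):
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

    def area(self):
        # Heron's formula: s is the semi-perimeter
        s = (self.a + self.b + self.c) / 2
        return math.sqrt(s * (s - self.a) * (s - self.b) * (s - self.c))

    def perimeter(self):
        return self.a + self.b + self.c

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2

    def perimeter(self):
        return 4 * self.side

# Example usage
if __name__ == "__main__":
    circle = Circle(5)
    triangle = Triangle(3, 4, 5)
    square = Square(4)
    print(f"Circle: Area = {circle.area():.2f}, Perimeter = {circle.perimeter():.2f}")
    print(f"Triangle: Area = {triangle.area():.2f}, Perimeter = {triangle.perimeter():.2f}")
    print(f"Square: Area = {square.area():.2f}, Perimeter = {square.perimeter():.2f}")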
Explanation:
1. Shape (Base Class): This is the base class that defines the structure for other shapes.
It has two methods, area() and perimeter(), which are placeholders and do nothing in
the base class.
2. Circle (Subclass): Inherits from Shape. It overrides the area() and perimeter()
methods to calculate the area and perimeter of a circle, given its radius.
3. Triangle (Subclass): Inherits from Shape. It implements area() using Heron’s
formula, which calculates the area based on the lengths of the three sides, and
perimeter() calculates the sum of the sides.
4. Square (Subclass): Inherits from Shape. It implements area() and perimeter() for a
square given its side length.
Output:
Circle: Area = 78.54, Perimeter = 31.42
Triangle: Area = 6.00, Perimeter = 12.00
Square: Area = 16.00, Perimeter = 16.00
You can create more subclasses for other shapes by extending the Shape class and
implementing their specific formulas for area and perimeter.
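As an illustration of adding another shape, here is a minimal sketch of a Rectangle subclass. Rectangle and its length and width parameters are hypothetical additions, not part of the program above; the Shape base class is repeated only so the sketch runs on its own.

```python
class Shape:
    """Base class, repeated here so the sketch is self-contained."""
    def area(self):
        pass

    def perimeter(self):
        pass

class Rectangle(Shape):
    """Hypothetical extra subclass; not part of the original program."""
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        return self.length * self.width

    def perimeter(self):
        return 2 * (self.length + self.width)

rectangle = Rectangle(4, 6)
print(f"Rectangle: Area = {rectangle.area():.2f}, Perimeter = {rectangle.perimeter():.2f}")
# Rectangle: Area = 24.00, Perimeter = 20.00
```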
UNIT – V
INTRODUCTION TO DATA SCIENCE
Functional Programming
Functional programming is a paradigm in software development that treats computation as
the evaluation of mathematical functions and avoids changing-state or mutable data. In the
context of data science, functional programming offers several advantages, such as cleaner
code, fewer bugs, and easier parallelization, making it particularly suited for handling large-
scale data analysis.
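A minimal sketch of this style, using only the standard library. The values and the helper names is_large and square are illustrative assumptions. Each step returns a new value instead of modifying the input.

```python
from functools import reduce

# Illustrative input; a tuple cannot be changed in place
readings = (3, 8, 12, 5, 20)

def is_large(value):
    """Pure function: the result depends only on the argument."""
    return value > 5

def square(value):
    return value * value

large = tuple(filter(is_large, readings))           # (8, 12, 20)
squares = tuple(map(square, large))                 # (64, 144, 400)
total = reduce(lambda acc, x: acc + x, squares, 0)  # 608

print(large, squares, total)
```

Because no step changes shared state, each one can be tested on its own, and the map step could be distributed across workers without changing the result.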
Toolz and CyToolz: These Python libraries provide many functional programming tools,
such as curry, compose, and partial, which help in chaining and composing functions.
from toolz import curry

@curry
def add(x, y):
    return x + y

add_five = add(5)  # Supplying one argument returns a function that waits for the second
print(add_five(10))  # Output: 15
from pyspark import SparkContext
# Distribute a Python list across the cluster as an RDD
sc = SparkContext()
data = sc.parallelize([1, 2, 3, 4, 5])
import xml.etree.ElementTree as ET
NumPy in Python
NumPy is the fundamental package for scientific computing in Python. It adds support for
large, multi-dimensional arrays and matrices, along with a large collection of mathematical
functions to operate on these arrays.
Creating a NumPy array:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Element-wise addition
c = a + b
print(c) # Output: [5 7 9]
Array reshaping:
arr = np.array([[1, 2, 3], [4, 5, 6]])
# Reshape into 3x2
reshaped = arr.reshape(3, 2)
print(reshaped)
Matrix multiplication:
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
# Matrix multiplication
result = np.dot(a, b)
print(result)
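For reference, a short sketch that works through the same two matrices. It also shows the @ operator, which gives the same result as np.dot for 2D arrays, and contrasts it with element-wise multiplication.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Matrix product: each entry is a row of a times a column of b
product = a @ b
print(product)
# [[19 22]
#  [43 50]]

# Element-wise product, for comparison
elementwise = a * b
print(elementwise)
# [[ 5 12]
#  [21 32]]
```

For example, the top-left entry is 1*5 + 2*7 = 19.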
Pandas in Python
Pandas is a powerful data manipulation library in Python that provides data structures
like DataFrames and Series for working with structured data.
Creating a DataFrame:
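import pandas as pd
# Illustrative sample data (assumed values; three rows to match the examples that follow)
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 32, 40]}
df = pd.DataFrame(data)
print(df)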
Filtering DataFrame:
# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
Basic operations:
# Adding a new column
df['Salary'] = [50000, 60000, 70000]
JSON and XML are essential formats for data interchange in data science. Python
offers robust support for both using the json and xml modules.
NumPy is a key library for numerical operations, making it invaluable in data
analysis and machine learning.
Pandas is the go-to library for data manipulation and analysis, providing flexible
data structures for handling diverse types of data.
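A minimal sketch of reading both interchange formats with the standard library. The record contents and the XML layout are hypothetical examples, not data from this manual.

```python
import json
import xml.etree.ElementTree as ET

# JSON: convert a Python dictionary to text and back (hypothetical record)
record = {"name": "Asha", "scores": [82, 91]}
json_text = json.dumps(record)
restored = json.loads(json_text)
print(restored["scores"])  # [82, 91]

# XML: parse a small document and read its elements (hypothetical layout)
xml_text = "<student><name>Asha</name><score>82</score><score>91</score></student>"
root = ET.fromstring(xml_text)
name = root.find("name").text
scores = [int(node.text) for node in root.findall("score")]
print(name, scores)  # Asha [82, 91]
```

JSON maps directly onto Python dictionaries and lists, while XML text values arrive as strings and need explicit conversion, as with int() here.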
Visual Aids for Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in data analysis that allows data scientists
and analysts to summarize the main characteristics of the data, often using visual methods.
Visualizations help uncover patterns, trends, relationships, and anomalies in the data. This
guide provides an overview of technical requirements, various types of visualizations, and
considerations for choosing the best chart for EDA using the Seaborn library in Python.
Technical Requirements
To perform EDA with visualizations using Seaborn, ensure you have the following
installed:
1. Python: The programming language used for the data analysis.
2. Pandas: For data manipulation and analysis.
3. Matplotlib: A plotting library for creating static, animated, and interactive
visualizations.
4. Seaborn: A statistical data visualization library based on Matplotlib that provides a
high-level interface for drawing attractive graphics.
You can install these libraries using pip:
pip install pandas matplotlib seaborn
Visualizations in EDA
1. Line Chart
Description: A line chart is used to display data points over time, showcasing trends in a time
series dataset.
When to Use: Ideal for visualizing continuous data, especially time series data.
Example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data
data = {'Year': [2017, 2018, 2019, 2020, 2021],
'Sales': [200, 250, 300, 350, 400]}
df = pd.DataFrame(data)
# Line chart
sns.lineplot(x='Year', y='Sales', data=df)
plt.title('Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
2. Bar Chart
Description: A bar chart represents categorical data with rectangular bars, where the length
of each bar is proportional to the value it represents.
When to Use: Suitable for comparing different categories.
Example:
# Sample data
data = {'Product': ['A', 'B', 'C', 'D'],
'Sales': [150, 200, 250, 300]}
df = pd.DataFrame(data)
# Bar chart
sns.barplot(x='Product', y='Sales', data=df)
plt.title('Sales by Product')
plt.xlabel('Product')
plt.ylabel('Sales')
plt.show()
3. Scatter Plot
Description: A scatter plot displays values for typically two variables for a set of data,
showing the relationship between them.
When to Use: Useful for identifying relationships or correlations between two numerical
variables.
Example:
# Sample data
data = {'Height': [5.1, 5.5, 6.0, 5.8, 5.7],
'Weight': [100, 150, 180, 175, 160]}
df = pd.DataFrame(data)
# Scatter plot
sns.scatterplot(x='Height', y='Weight', data=df)
plt.title('Height vs. Weight')
plt.xlabel('Height (ft)')
plt.ylabel('Weight (lbs)')
plt.show()
4. Polar Chart
Description: A polar chart represents data in a circular format, where each point is
determined by an angle and a radius.
When to Use: Ideal for visualizing data with a cyclical nature, such as wind direction or
periodic functions.
Example:
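import numpy as np
# Sample data
categories = ['A', 'B', 'C', 'D']
values = [4, 2, 5, 3]
# Polar chart: one angle per category, with the first point repeated to close the outline
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
values += values[:1]
angles += angles[:1]
ax = plt.subplot(111, polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
plt.title('Polar Chart')
plt.show()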
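5. Histogram
Description: A histogram groups the values of a single quantitative variable into bins and
shows how many values fall into each bin.
When to Use: Useful for displaying the distribution of a single quantitative variable.
Example:
# Sample data (illustrative values)
data = {'Values': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7, 8]}
df = pd.DataFrame(data)
# Histogram
sns.histplot(df['Values'], bins=5, kde=True)
plt.title('Value Distribution')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()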
Choosing the Best Chart
Selecting the appropriate chart depends on the data type and the analysis objective:
1. Line Chart: Use for continuous data, especially time series.
2. Bar Chart: Use for comparing categories or discrete data.
3. Scatter Plot: Use for examining relationships between two quantitative variables.
4. Polar Chart: Use for cyclical data or displaying relationships in a circular format.
5. Histogram: Use for displaying the distribution of a single quantitative variable.
Visualization plays a critical role in EDA, enabling better understanding and interpretation of
data. Tools like Seaborn provide a powerful way to create meaningful visualizations with just
a few lines of code. By using the appropriate type of visualization, analysts can effectively
communicate their findings and insights derived from the data.
Sample Experiments:
1) Python program to check whether a JSON string contains complex object or not?
Program:
import json
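def is_complex_object(obj):
    """Check if the object is a complex JSON object."""
    # Check if the object is a dictionary or a list
    return isinstance(obj, (dict, list))

def check_json_complexity(json_string):
    """Check if a JSON string contains complex objects."""
    try:
        # Parse the JSON string
        parsed_json = json.loads(json_string)
        # A top-level list counts as a complex object
        if isinstance(parsed_json, list):
            return True
        # A dictionary is complex only if one of its values is a dictionary or a list
        if isinstance(parsed_json, dict):
            return any(is_complex_object(value) for value in parsed_json.values())
        return False
    except json.JSONDecodeError:
        print("Invalid JSON string.")
        return False

# Example usage
json_string_1 = '{"name": "John", "age": 30, "city": "New York"}'  # Simple object
json_string_2 = '{"name": "John", "age": 30, "address": {"street": "123 Main St", "city": "New York"}}'  # Complex object
json_string_3 = '[1, 2, 3, 4]'  # List (complex object)
json_string_4 = '"Just a string"'  # Simple string
for json_string in (json_string_1, json_string_2, json_string_3, json_string_4):
    print(check_json_complexity(json_string))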
Output
False
True
True
False
2D Array:
[[1 2 3]
[4 5 6]]
3D Array:
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
Array of Zeros:
[[0 0 0]
[0 0 0]]
Array of Ones:
[[1 1 1]
[1 1 1]]
Empty Array:
[[0. 0. 0.]
[0. 0. 0.]]
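This output demonstrates different kinds of NumPy arrays: multi-dimensional arrays built
with the array() function, and arrays of zeros, ones, and empty values. If the empty array
is created with np.empty(), its contents are uninitialized and may differ from run to run.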
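Arrays like the ones printed above could be produced along the following lines. This is a sketch rather than the original listing. The variable names are assumptions, and dtype=int is assumed for the zeros and ones arrays because the output shows them without decimal points.

```python
import numpy as np

# Nested lists give the dimensions: 2 rows x 3 columns, then 2 x 2 x 2
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# Arrays filled with a constant; dtype=int is an assumption to match the output
zeros = np.zeros((2, 3), dtype=int)
ones = np.ones((2, 3), dtype=int)

# np.empty allocates memory without initialising it, so the values are arbitrary
empty = np.empty((2, 3))

print("2D Array:")
print(array_2d)
print("3D Array:")
print(array_3d)
print("Array of Zeros:")
print(zeros)
print("Array of Ones:")
print(ones)
print("Empty Array:")
print(empty)
```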
3) Python program to demonstrate use of ndim, shape, size, dtype?
Program:
import numpy as np
3D Array:
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
Number of dimensions (ndim): 3
Shape: (2, 2, 2)
Size: 8
Data type (dtype): int64
Array of Zeros:
[[0. 0. 0.]
[0. 0. 0.]]
Number of dimensions (ndim): 2
Shape: (2, 3)
Size: 6
Data type (dtype): float64
This output demonstrates the different properties of the NumPy arrays that can be accessed
using ndim, shape, size, and dtype.
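A minimal sketch that prints these four attributes for the two arrays shown in the output. The helper name describe is an assumption. The integer dtype depends on the platform and NumPy version, so it may print as int32 rather than int64.

```python
import numpy as np

def describe(name, arr):
    """Print an array followed by its ndim, shape, size, and dtype."""
    print(f"{name}:")
    print(arr)
    print("Number of dimensions (ndim):", arr.ndim)
    print("Shape:", arr.shape)
    print("Size:", arr.size)
    print("Data type (dtype):", arr.dtype)

array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
zeros = np.zeros((2, 3))  # np.zeros defaults to float64

describe("3D Array", array_3d)
describe("Array of Zeros", zeros)
```

Here size is the product of the entries in shape: 2 * 2 * 2 = 8 for the 3D array and 2 * 3 = 6 for the array of zeros.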
4) Python program to demonstrate basic slicing, integer, and boolean indexing?
Program:
import numpy as np

# Create a 1D array and a 3x3 2D array
array_1d = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("1D Array:")
print(array_1d)

# Basic slicing
print("\nBasic Slicing:")
print("Elements from index 2 to 5:", array_1d[2:6])  # Stop index 6 is excluded

# Integer indexing
print("\nInteger Indexing:")
indices = [0, 2, 4]
print("Elements at indices 0, 2, 4:", array_1d[indices])  # Fetch elements at the listed indices
print("\n2D Array:")
print(array_2d)

# Boolean indexing
print("\nBoolean Indexing:")
boolean_mask = (array_2d > 5)  # True where an element is greater than 5
print("Original Array:")
print(array_2d)
print("Boolean Mask:")
print(boolean_mask)
print("Elements greater than 5:")
print(array_2d[boolean_mask])  # Fetch the elements that satisfy the condition
Explanation
1. Import NumPy: The program starts by importing the NumPy library as np.
2. Creating Arrays:
o A 1D array is created with integer values.
o A 2D array is created as a 3x3 matrix.
3. Basic Slicing:
o For the 1D array, it demonstrates slicing by selecting elements from index 2 to
5 (inclusive of 2 and exclusive of 6).
o For the 2D array, it slices the first two rows and all columns.
4. Integer Indexing:
o It retrieves specific elements from the 1D array using an array of indices.
o In the 2D array, it uses integer indexing to fetch elements at specific row and
column indices.
5. Boolean Indexing:
o A boolean mask is created by checking which elements in the 2D array are
greater than 5.
o It then uses this boolean mask to retrieve only those elements from the array
that satisfy the condition.
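The 2D cases described in steps 3 and 4 can be sketched as follows, assuming the 3x3 matrix holding the values 1 to 9. The particular row and column indices are illustrative choices.

```python
import numpy as np

array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Basic slicing: first two rows, all columns
first_two_rows = array_2d[:2, :]
print(first_two_rows)
# [[1 2 3]
#  [4 5 6]]

# Integer indexing: row and column lists are paired element by element,
# selecting positions (0, 2), (1, 1), and (2, 0)
rows = [0, 1, 2]
cols = [2, 1, 0]
picked = array_2d[rows, cols]
print(picked)
# [3 5 7]
```

Basic slicing returns a view of the original data, whereas integer and boolean indexing return copies.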
Output
1D Array:
[10 20 30 40 50 60 70 80 90]
Basic Slicing:
Elements from index 2 to 5: [30 40 50 60]
Integer Indexing:
Elements at indices 0, 2, 4: [10 30 50]
2D Array:
[[1 2 3]
[4 5 6]
[7 8 9]]
import pandas as pd
# Create a dictionary with five keys and each key has a list of ten values
data_dict = {
'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'B': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
'C': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
'D': [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
'E': [41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
}
# Convert the dictionary into a Pandas DataFrame
df = pd.DataFrame(data_dict)
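Once the DataFrame exists, it can be explored with head() and the usual selection operations. The sketch below rebuilds the same values with range() so that it runs on its own; the particular selections are illustrative assumptions.

```python
import pandas as pd

# Same values as the dictionary above, built with range() for brevity
data_dict = {
    'A': list(range(1, 11)),
    'B': list(range(11, 21)),
    'C': list(range(21, 31)),
    'D': list(range(31, 41)),
    'E': list(range(41, 51)),
}
df = pd.DataFrame(data_dict)

first_rows = df.head()            # First five rows by default
column_a = df['A']                # Single column (a Series)
subset = df[['A', 'C']]           # Several columns (a DataFrame)
row_by_label = df.loc[2]          # Row with index label 2
rows_by_position = df.iloc[0:3]   # First three rows by position
filtered = df[df['B'] > 15]       # Rows where column B exceeds 15

print(first_rows)
print(filtered)
```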
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Sample dataset
data = {
'Year': [2015, 2016, 2017, 2018, 2019, 2020],
'Sales': [150, 200, 250, 300, 350, 400],
'Profit': [30, 50, 70, 90, 120, 150],
'Expenses': [120, 150, 180, 210, 230, 250]
}
# Create a DataFrame
df = pd.DataFrame(data)
# a) Line Chart
plt.figure(figsize=(10, 6))
plt.plot(df['Year'], df['Sales'], marker='o', label='Sales', color='blue')
plt.plot(df['Year'], df['Profit'], marker='o', label='Profit', color='green')
plt.title('Sales and Profit Over Years')
plt.xlabel('Year')
plt.ylabel('Amount')
plt.legend()
plt.grid()
plt.show()
# b) Bar Chart
plt.figure(figsize=(10, 6))
plt.bar(df['Year'], df['Sales'], color='blue', alpha=0.6, label='Sales')
plt.bar(df['Year'], df['Expenses'], color='red', alpha=0.6, label='Expenses')
plt.title('Sales and Expenses Over Years')
plt.xlabel('Year')
plt.ylabel('Amount')
plt.legend()
plt.show()
# c) Scatter Plot
plt.figure(figsize=(10, 6))
plt.scatter(df['Sales'], df['Profit'], color='purple', s=100)
plt.title('Sales vs Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.grid()
plt.show()
# d) Bubble Plot
plt.figure(figsize=(10, 6))
plt.scatter(df['Sales'], df['Profit'], s=df['Expenses'], color='orange', alpha=0.5,
edgecolor='black')
plt.title('Bubble Plot of Sales vs Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.grid()
plt.show()
Explanation
1. Import Libraries: The program imports the necessary libraries: pandas,
matplotlib.pyplot, and seaborn.
2. Create Sample Dataset: A sample dataset is created using a dictionary, which
includes 'Year', 'Sales', 'Profit', and 'Expenses'. This dataset is then converted into a
Pandas DataFrame.
3. Line Chart:
o A line chart is created to show the trend of sales and profit over the years.
o plt.plot() is used to plot the data, and markers are added to indicate data points.
4. Bar Chart:
o A bar chart is created to compare sales and expenses over the years.
o Two bar plots are drawn on the same axes using plt.bar().
5. Scatter Plot:
o A scatter plot is created to visualize the relationship between sales and profit.
o The size of each point is fixed, and the points are colored in purple.
6. Bubble Plot:
o A bubble plot is created to show the relationship between sales and profit,
where the size of each bubble represents the expenses.
o The s parameter in plt.scatter() controls the size of the bubbles.
Output
When you run the program, it will generate four different plots:
1. Line Chart: Displays sales and profit trends over the years.
2. Bar Chart: Compares sales and expenses for each year.
3. Scatter Plot: Shows the relationship between sales and profit.
4. Bubble Plot: Displays the relationship between sales and profit, with bubble sizes
representing expenses.
Note:
To run this code, you will need to have pandas, matplotlib, and seaborn installed in your
Python environment. You can install them using pip if they are not already installed:
pip install pandas matplotlib seaborn
7) Generate Scatter Plot using seaborn library for iris dataset?
Program:
import seaborn as sns
import matplotlib.pyplot as plt
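A minimal sketch of such a scatter plot, assuming the column names used by seaborn's bundled iris table (sepal_length, sepal_width, species). The function name plot_iris_scatter is hypothetical, and the three-row sample only lets the function be tried without downloading the dataset.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_iris_scatter(iris):
    """Plot sepal length against sepal width, coloured by species."""
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.scatterplot(data=iris, x="sepal_length", y="sepal_width",
                    hue="species", ax=ax)
    ax.set_title("Iris: Sepal Length vs Sepal Width")
    ax.set_xlabel("Sepal length (cm)")
    ax.set_ylabel("Sepal width (cm)")
    return ax

# Three rows in the same layout as the iris table, one per species
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "species": ["setosa", "versicolor", "virginica"],
})
ax = plot_iris_scatter(sample)
```

With network access, plot_iris_scatter(sns.load_dataset("iris")) followed by plt.show() draws the full dataset; load_dataset fetches the table the first time it is called.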
import pandas as pd
import matplotlib.pyplot as plt
# Sample dataset
data = {
'Category': ['A', 'B', 'C', 'D'],
'2020': [10, 20, 30, 40],
'2021': [15, 25, 35, 45],
'2022': [20, 30, 25, 50]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Set the category as the index
df.set_index('Category', inplace=True)
# Plotting
plt.figure(figsize=(12, 8))
# a) Area Plot
ax1 = plt.subplot(2, 2, 1)
df.plot.area(alpha=0.5, ax=ax1)  # Pass ax so pandas draws on this subplot, not a new figure
plt.title('Area Plot')
plt.ylabel('Values')
plt.xlabel('Categories')
plt.grid()
# b) Stacked Plot
ax2 = plt.subplot(2, 2, 2)
df.plot(kind='bar', stacked=True, ax=ax2)
plt.title('Stacked Bar Plot')
plt.ylabel('Values')
plt.xlabel('Categories')
plt.grid()
# c) Pie Chart
ax3 = plt.subplot(2, 2, 3)
df['2022'].plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', ax=ax3)
plt.title('Pie Chart for Year 2022')
plt.ylabel('') # Hide the y-label
# d) Table Chart
plt.subplot(2, 2, 4)
plt.axis('tight')
plt.axis('off')
table = plt.table(cellText=df.values, colLabels=df.columns, cellLoc = 'center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 1.2)
plt.title('Table Chart')
plt.tight_layout()
plt.show()
Explanation
1. Import Libraries: The code imports the necessary libraries: pandas for data
manipulation and matplotlib.pyplot for plotting.
2. Sample Dataset: A sample dataset is created with categories (A, B, C, D) and values
for the years 2020, 2021, and 2022.
3. Create a DataFrame: A pandas DataFrame is created from the sample dataset, and
the 'Category' column is set as the index.
4. Plotting:
o Figure Size: The figure size is set to (12, 8) for better visibility.
o Area Plot:
A subplot is created for the area plot, where the plot.area() method is
used to create an area plot for the DataFrame.
The plot is labeled and gridded.
o Stacked Plot:
Another subplot is created for the stacked bar plot, where
plot(kind='bar', stacked=True) is used to generate a stacked bar chart.
The plot is labeled and gridded.
o Pie Chart:
A subplot is created for the pie chart using the values for the year 2022.
The plot.pie() method is used, and percentages are displayed on the
chart with autopct='%1.1f%%'.
o Table Chart:
A subplot is created for the table chart using plt.table() to display the
DataFrame values in a tabular format. The axis is turned off for better
appearance.
5. Display Plots: The plt.tight_layout() method is called to adjust the spacing between
plots, and plt.show() is used to display the plots.
Output
When you run the above program, it will display four visualizations in a single window:
An area plot representing the values over the years.
A stacked bar plot to show how values accumulate in each category.
A pie chart illustrating the distribution of values for the year 2022.
A table chart showing the values in a tabular format.
Note:
To run this code, make sure you have pandas and matplotlib installed in your Python
environment. You can install them using pip if you haven't done so:
pip install pandas matplotlib