Lab Manual
Third Year Semester VI
Subject: Data Management and
Visualization
EVEN SEMESTER
Index
Sr. No.  Contents
1.  List of Experiments
2.  Experiment Plan and Course Outcomes
3.  Mapping of Course Outcomes – Program Outcomes and Program Specific Outcomes
4.  Study and Evaluation Scheme
5.  Experiment No. 1
6.  Experiment No. 2
7.  Experiment No. 3
8.  Experiment No. 4
9.  Experiment No. 5
10. Experiment No. 6
11. Experiment No. 7
12. Experiment No. 8
13. Experiment No. 9
14. Experiment No. 10
List of Experiments
Course Objective, Course Outcome &
Experiment Plan
Course Objective:
Course Outcomes:
CO1: Demonstrate proficiency in Python and integrate SQL seamlessly for practical applications.
CO2: Comprehend the evolution and architecture of data warehousing, demonstrating proficiency in data staging, ETL design, and data modeling.
CO3: Gain a comprehensive understanding of big data, the Hadoop ecosystem, and the fundamentals of Spark.
CO4: Adept at using Power BI for comprehensive data management for versatile data analysis and sharing.
CO5: Demonstrate proficiency in configuring Snowflake as a Power BI data source.
CO6: Gain the foundational understanding of cloud computing, cloud models and analytics tools for business intelligence.
Experiment Plan:
Module No. | Week No. | Experiment Name | Course Outcome | Weightage
1. | W1 | Analyze and Implement SQL queries on the Snowflake platform | CO1 | 05
2. | W2 | Demonstrate connectivity between Python and the Snowflake environment by executing SQL queries. | CO1 | 05
9. | W9 | Case study on Cloud Services: Azure, AWS, GCP | CO6 | 10
CO-PO & PSO Mapping
Mapping of Course outcomes with Program Outcomes:
Mapping of Course outcomes with Program Specific Outcomes:
Course Outcomes and their contribution to Program Specific Outcomes (PSO1, PSO2, PSO3):
CO1: Demonstrate proficiency in Python and integrate SQL seamlessly for practical applications. (PSO1: 2, PSO2: 2, PSO3: 2)
CO2: Comprehend the evolution and architecture of data warehousing, demonstrating proficiency in data staging, ETL design, and data modeling. (PSO1: 2, PSO2: 2, PSO3: 2)
CO3: Gain a comprehensive understanding of big data, the Hadoop ecosystem, and the fundamentals of Spark. (PSO1: 2, PSO2: 2, PSO3: 2)
CO4: Adept at using Power BI for comprehensive data management for versatile data analysis and sharing. (PSO1: 2, PSO2: 2, PSO3: 2)
CO5: Demonstrate proficiency in configuring Snowflake as a Power BI data source. (PSO1: 2, PSO2: 2, PSO3: 2)
CO6: Gain the foundational understanding of cloud computing, cloud models and analytics tools for business intelligence. (PSO1: 2, PSO2: 2, PSO3: 2)
Study and Evaluation Scheme
Course Code: CELDLO6025
Course Name: Data Management and Visualization
Teaching Scheme: Theory 02, Practical --, Tutorial --
Credits Assigned: Theory 02, Practical --, Tutorial --, Total 02
Term Work:
The term work marks are based on the weekly experimental performance of the students, oral performance, and regularity in the lab.
Students are expected to come prepared for the lab ahead of time by referring to the manual, and to perform the experiment under guidance and discussion. The experiment write-up is corrected in the following week, along with an oral examination.
At the end of the semester, there will be an oral evaluation based on the theory and laboratory work.
Data Management and Visualization Lab
Experiment No.: 1
Analyze and Implement SQL queries on the Snowflake platform.
Experiment No.1
1. Aim: Analyze and Implement SQL queries on the Snowflake platform
3. Outcomes: Demonstrate proficiency in Python and integrate SQL seamlessly for practical
applications.
5. Theory:
Snowflake supports most of the standard functions defined in SQL:1999, as well as parts of the SQL:2003 analytic extensions.
Scalar Functions
A scalar function is a function that returns one value per invocation; in most cases, you
can think of this as returning one value per row. This contrasts with Aggregate Functions,
which return one value per group of rows.
Conversion Functions
These functions convert expressions from one data type to another data type.
Note: Students have to perform the above-mentioned functions as SQL queries using any inbuilt dataset from Snowflake.
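As a hedged illustration (assuming the SNOWFLAKE_SAMPLE_DATA share with the TPCH_SF1 schema is available in the account, and with placeholder connection values), the sketch below exercises one scalar, one conversion, and one aggregate function through the Snowflake Python connector; the same SQL can be typed directly into a Snowflake worksheet.

import snowflake.connector

# Sketch only: account, user, and password are placeholders to be replaced.
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="COMPUTE_WH",
    database="SNOWFLAKE_SAMPLE_DATA",
    schema="TPCH_SF1",
)
cur = conn.cursor()

# Scalar functions: one value per input row.
cur.execute("SELECT C_NAME, UPPER(C_NAME), ROUND(C_ACCTBAL, 0) FROM CUSTOMER LIMIT 5")
print(cur.fetchall())

# Conversion function: convert a date to a formatted string.
cur.execute("SELECT TO_VARCHAR(O_ORDERDATE, 'YYYY-MM') FROM ORDERS LIMIT 5")
print(cur.fetchall())

# Aggregate functions: one value per group of rows.
cur.execute("SELECT O_ORDERSTATUS, COUNT(*), AVG(O_TOTALPRICE) FROM ORDERS GROUP BY O_ORDERSTATUS")
print(cur.fetchall())

cur.close()
conn.close()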
6. Conclusion: Thus, students are able to learn how to extract information from a given dataset as per user needs.
7. Viva Questions :
i. Differentiate between numeric and aggregate functions.
ii. Differentiate between COUNT & COUNT_IF, MAX & MAX_BY, and MIN & MIN_BY functions.
8. References
i. https://docs.snowflake.com/en/sql-reference-functions
Data Management and Visualization Lab
Experiment No.: 2
Demonstrate connectivity between Python and Snowflake
Environment by executing SQL queries.
Experiment No.2
1. Aim: Demonstrate connectivity between Python and the Snowflake environment by
executing SQL queries.
3. Outcomes: Demonstrate proficiency in Python and integrate SQL seamlessly for practical
applications.
5. Theory:
Snowpark API
The Snowpark library provides an intuitive API for querying and processing data at scale in Snowflake. Using the library for any of the three supported languages, you can build applications that process data in Snowflake without moving the data to the system where your application code runs, and process it at scale as part of the elastic and serverless Snowflake engine.
Snowflake currently provides Snowpark libraries for three languages: Java, Python, and
Scala.
Attributes
analytics
columns: Returns all column names as a list.
dtypes
na: Returns a DataFrameNaFunctions object that provides functions for handling missing values in the DataFrame.
queries: Returns a dict that contains a list of queries that will be executed to evaluate this DataFrame (key "queries") and a list of post-execution actions, e.g. queries to clean up temporary objects (key "post_actions").
schema: The definition of the columns in this DataFrame (the "relational schema" for the DataFrame).
session: Returns a snowflake.snowpark.Session object that provides access to the session the current DataFrame is relying on.
stat
write: Returns a new DataFrameWriter object that you can use to write the data in the DataFrame to a Snowflake database or a stage location.
is_cached: Whether the DataFrame is cached.
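A short sketch of inspecting a few of these attributes (the connection parameters are placeholders; inside a Snowflake Python worksheet a ready-made session object is supplied instead, and EMPLOYEE is the example table used later in this experiment):

from snowflake.snowpark import Session

# Sketch only: connection parameters are placeholders.
session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "COMPUTE_WH",
    "database": "<database>",
    "schema": "<schema>",
}).create()

df = session.sql("SELECT * FROM EMPLOYEE")
print(df.columns)   # all column names as a list
print(df.schema)    # the relational schema of the DataFrame
print(df.queries)   # the SQL that will run when the DataFrame is evaluated
df.show()           # evaluates the DataFrame and prints a sample of rows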
Step 1: In the Snowflake web interface, enable the Anaconda Python packages under Admin > Billing & Terms (activate the Anaconda Python package terms).
Step 2: Select Python Worksheet. A handler function is auto-generated in the worksheet; inside it, SQL queries can be executed through the session object, for example:
sql_query = "SELECT * FROM EMPLOYEE"
dataframe = session.sql(sql_query)
dataframe.show()
Note: Students have to perform at least 5-10 aggregate queries and store the resultant values into a newly created table.
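A minimal sketch of one such aggregate query, continuing with the same session and EMPLOYEE table (the DEPARTMENT and SALARY columns and the DEPT_SALARY_SUMMARY table name are illustrative; substitute the columns of your own dataset):

# Aggregate over illustrative columns and persist the result as a new table.
agg_df = session.sql(
    "SELECT DEPARTMENT, COUNT(*) AS EMP_COUNT, AVG(SALARY) AS AVG_SALARY "
    "FROM EMPLOYEE GROUP BY DEPARTMENT"
)
agg_df.show()
agg_df.write.mode("overwrite").save_as_table("DEPT_SALARY_SUMMARY")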
6. Conclusion: Thus, students are able to learn how to communicate between Python and the Snowflake database using the Snowpark API.
8. References:
https://docs.snowflake.com/en/developer-guide/snowpark/python/calling-functions
Data Management and Visualization Lab
Experiment No.: 3
Design of Dimensional data modelling using Power BI
1. Aim: Design Dimensional data modelling using Power BI
3. Outcomes: Students will be able to visualize dimensional data models using real-time data
4. Hardware / Software Required: Power BI
5. Theory:
Dimensional data models are primarily used in data warehouses and data marts that support
business intelligence applications. They consist of fact tables that contain data about transactions
or other events and dimension tables that list attributes of the entities in the fact tables. For
example, a fact table could detail product purchases by customers, while connected dimension
tables hold data about the products and customers. Notable types of dimensional models are star
schemas, which connect a fact table to different dimension tables, and snowflake schemas, which
include multiple levels of dimension tables.
Steps in building a dimensional data model
• Choose the business processes that you want to use to analyse the subject area to be modelled.
• Determine the granularity of the fact tables.
• Identify dimensions and hierarchies for each fact table.
• Identify measures for the fact tables.
• Determine the attributes for each dimension table.
Star schema
Star schema is a modeling approach widely adopted by relational data warehouses. It requires
modelers to classify their model tables as either dimension or fact.
Dimension tables describe business entities—the things you model. Entities can include products,
people, places, and concepts including time itself. The most consistent table you'll find in a star
schema is a date dimension table. A dimension table contains a key column (or columns) that acts
as a unique identifier, and descriptive columns.
Fact tables store observations or events, and can be sales orders, stock balances, exchange rates,
temperatures, etc. A fact table contains dimension key columns that relate to dimension tables,
and numeric measure columns. The dimension key columns determine the dimensionality of a fact
table, while the dimension key values determine the granularity of a fact table. For example,
consider a fact table designed to store sale targets that has two dimension key columns Date and
Product Key. It's easy to understand that the table has two dimensions. The granularity, however,
can't be determined without considering the dimension key values. In this example, consider that the values stored in the Date column are the first day of each month. In this case, the granularity is at month-product level.
Generally, dimension tables contain a relatively small number of rows. Fact tables, on the other
hand, can contain a very large number of rows and continue to grow over time.
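To make the fact/dimension split concrete before building it visually in Power BI, here is a small illustrative sketch in Python using pandas (the table and column names are invented for the example):

import pandas as pd

# Dimension table: one row per product; ProductKey is the unique identifier.
dim_product = pd.DataFrame({
    "ProductKey": [1, 2],
    "ProductName": ["Bike", "Helmet"],
    "Category": ["Bikes", "Accessories"],
})

# Fact table: dimension key columns (Date, ProductKey) plus a numeric measure.
# Dates are the first day of each month, so the grain is month-product.
fact_sales_target = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-02-01"]),
    "ProductKey": [1, 2, 1],
    "TargetAmount": [5000, 800, 5200],
})

# One-to-many relationship: each fact row matches exactly one dimension row.
report = fact_sales_target.merge(dim_product, on="ProductKey", how="left")
print(report)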
1. In Power BI Desktop, at the left, click the Model view icon.
2. If you do not see all seven tables, scroll horizontally to the right, and then drag and arrange the tables more closely together so they can all be seen at the same time.
In Model view, it's possible to view each table and relationships (connectors between tables).
3. In Power BI, on the Modeling ribbon tab, from inside the Relationships group, click Manage Relationships.
4. In the Manage Relationships window, notice that no relationships are yet defined.
7. In the second dropdown list (beneath the Product table grid), select the Sales table.
8. Notice the ProductKey columns in each table have been selected.
The columns were automatically selected because they share the same name.
9. In the Cardinality dropdown list, notice that One To Many is selected.
The cardinality was automatically detected, because Power BI understands that
the ProductKey column from the Product table contains unique values. One-to-many
relationships are the most common cardinality.
10. Active relationships will propagate filters. It's possible to mark a relationship as inactive so filters don't propagate. Inactive relationships can exist when there are multiple relationship paths between tables, in which case model calculations can use special functions to activate them. You'll work with an inactive relationship later.
12. In the Manage Relationships window, notice that the new relationship is listed, and then click Close.
6. Conclusion: Thus, students will be able to design dimensional data modelling using Power BI.
7. Viva Questions:
i. What is dimensional modelling?
ii. Explain star schema with an example.
8. References: https://learn.microsoft.com/en-us/power-bi/guidance/star-schema
Data Management and Visualization Lab
Experiment No.: 4
Install and implement Word Count program on Hadoop using
Cloudera platform
Experiment No.: 4
1. Aim: Install and implement Word Count program on Hadoop using Cloudera platform
2. Objectives: To execute the WordCount application and copy the results from WordCount out of
HDFS (Hadoop Distributed File system)
3. Outcomes: Students will be able to learn the distributed file system environment and execute applications in HDFS.
5. Theory: To install VirtualBox and the Cloudera Virtual Machine (VM) image, follow the respective installation guides. After the successful installation of the Cloudera Virtual Machine (VM) image, follow the steps below to execute the WordCount application.
Part-1
1. Open a terminal shell. Start the Cloudera VM in VirtualBox, if not already running, and open a terminal
shell. Detailed instructions for these steps can be found in the previous Readings.
2. See example MapReduce programs. Hadoop comes with several example MapReduce applications. You
can see a list of them by running hadoop jar /usr/jars/hadoop-examples.jar. We are interested in running
WordCount.
The output says that WordCount takes the name of one or more input files and the name of the output
directory. Note that these files are in HDFS, not the local file system.
3. Verify input file exists. In the previous Reading, we downloaded the complete works of Shakespeare and
copied them into HDFS. Let's make sure this file is still in HDFS so we can run WordCount on it. Run
hadoop fs -ls
4. See WordCount command line arguments. We can learn how to run WordCount by examining its
command-line arguments. Run hadoop jar /usr/jars/hadoop-examples.jar wordcount.
5. Run WordCount. Run WordCount for words.txt: hadoop jar /usr/jars/hadoop-examples.jar wordcount
words.txt out
As WordCount executes, Hadoop prints the progress in terms of Map and Reduce. When WordCount is complete, both will say 100%.
6. See WordCount output directory. Once WordCount is finished, let's verify the output was created. First, let's see that the output directory, out, was created in HDFS by running hadoop fs -ls
We can see there are now two items in HDFS: words.txt is the text file that we previously created, and out
is the directory created by WordCount.
7. Look inside output directory. The directory created by WordCount contains several files. Look inside the directory by running hadoop fs -ls out
The file part-r-00000 contains the results from WordCount. The file _SUCCESS means WordCount
executed successfully.
8. Copy WordCount results to local file system. Copy part-r-00000 to the local file system by running hadoop fs -copyToLocal out/part-r-00000 local.txt
9. View the WordCount results. View the contents of the results: more local.txt
Each line of the results file shows the number of occurrences for a word in the input file. For example,
Accuse appears four times in the input, but Accusing appears only once.
How do I figure out how to run Hadoop MapReduce programs?
Hadoop comes with several MapReduce applications. In the Cloudera VM, these applications are in
/usr/jars/hadoop-examples.jar. You can see a list of all the applications by running hadoop jar
/usr/jars/hadoop-examples.jar.
Each of these MapReduce applications can be run in the terminal. To see how to run a specific application,
append the application name to the end of the command line. For example, to see how to run wordcount, run
hadoop jar /usr/jars/hadoop-examples.jar wordcount.
The <in> and <out> denote the names of the input and output, respectively. The square brackets around the
second <in> mean that the second input is optional, and the ... means that more than one input can be used.
This usage says that wordcount is run with one or more inputs and one output, the input(s) are specified first,
and the output is specified last.
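The bundled wordcount is a Java MapReduce job. Purely as an illustration of the same map and reduce logic, a Hadoop Streaming pair could be written in Python; the file names mapper.py and reducer.py and the streaming jar path below are assumptions, since the exact path varies by distribution.

# mapper.py: emit "word<TAB>1" for every word read from standard input (map phase).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py: sum the counts per word (reduce phase); Hadoop Streaming
# delivers the mapper output sorted by key, so equal words arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

# Example submission (jar path is an assumption; adjust it for your distribution):
# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
#   -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" \
#   -input words.txt -output out_streaming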
Part 2: Copy your data into the Hadoop Distributed File System (HDFS). Follow the instructions below.
1. Open a browser. Open the browser by clicking on the browser icon on the top left of the screen.
2. Download the Shakespeare text. We are going to download a text file to copy into HDFS. Enter the
following link in the
browser: http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
3. Once the page is loaded, click on the Open menu button.
6. Open a terminal shell by clicking on the terminal icon (the square black box) on the top left of the screen.
7. Run cd Downloads to change to the Downloads directory.
9. Copy file to HDFS. Run hadoop fs -copyFromLocal words.txt to copy the text file to HDFS.
10. Verify file was copied to HDFS. Run hadoop fs -ls to verify the file was copied to HDFS.
11. Copy a file within HDFS. You can make a copy of a file in HDFS. Run hadoop fs -cp words.txt
words2.txt to make a copy of words.txt called words2.txt
13. Copy a file from HDFS. We can also copy a file from HDFS to the local file system.
Run hadoop fs -copyToLocal words2.txt . to copy words2.txt to the local directory.
14. Run ls to verify that words2.txt was copied to the local directory.
15. Delete a file in HDFS. Let's delete words2.txt in HDFS. Run hadoop fs -rm words2.txt
6. Conclusion: We successfully interacted with Hadoop via its command-line interface, gaining
insights into its functionalities. We efficiently transferred files between the local file system and
HDFS, demonstrating proficiency in data management within distributed environments.
7. Viva Questions:
i. How do you initiate interactions with Hadoop using its command-line application?
ii. Can you explain the process of copying files into and out of the Hadoop Distributed File
System (HDFS) via the command line?
iii. What are the advantages of using the command-line interface for interacting with
Hadoop compared to graphical user interfaces?
iv. How does transferring files between the local file system and HDFS contribute to
efficient data management in distributed environments?
8. References:
1. Cloudera Documentation: https://docs.cloudera.com/documentation/enterprise/latest.html
2. MapR Documentation: https://mapr.com/docs/
3. Towards Data Science: https://towardsdatascience.com/tagged/hadoop
Data Management and Visualization Lab
Experiment No.: 5
Case Study on Implementation of Word Count program using
Spark Platform
Experiment No.: 5
1. Aim: Case Study on Implementation of Word Count program using Spark Platform
2. Objectives:
- To learn the implementation of a Word Count program using Apache Spark, demonstrating
its distributed computing capabilities for processing large datasets efficiently.
- Evaluate the performance and scalability of the Spark-based Word Count program, identifying challenges and possible solutions in its implementation.
3. Outcomes:
- The outcome of the case study showcases the successful implementation of the Word Count
program using Spark, highlighting its efficiency in distributed data processing.
- Insights gained from performance evaluation provide valuable understanding for future
development and optimization of Spark-based applications.
5. Theory: Spark is a unified analytics engine for large-scale data processing, including built-in modules for SQL, streaming, machine learning, and graph processing. Apache Spark is an open-source cluster computing framework. Its primary purpose is to handle real-time generated data. Spark was built on top of Hadoop MapReduce. It was optimized to run in memory, whereas alternative approaches like Hadoop's MapReduce write data to and from computer hard drives. As a result, Spark processes data much more quickly than the alternatives.
The following case study finds the frequency of each word in a particular file. Here, we use the Scala language to perform Spark operations.
Steps to execute Spark word count example: In this example, we find and display the number of
occurrences of each word.
1. Create a text file in your local machine and write some text into it.
$ nano sparkdata.txt
2. Check the text written in the sparkdata.txt file.
$ cat sparkdata.txt
5. Now, run the command below to open Spark in Scala mode.
$ spark-shell
7. Here, pass any file name that contains the data. Now, we can read the generated result by using
the following command.
scala> data.collect;
8. Here, we split the existing data in the form of individual words by using the following command.
scala> splitdata.collect;
11. Now, we can read the generated result by using the following command.
scala> mapdata.collect;
12. Now, perform the reduce operation.
13. Now, we can read the generated result by using the following command.
scala> reducedata.collect;
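The Scala commands that create data, splitdata, mapdata, and reducedata appear in the screenshots omitted above. For reference, the same word-count pipeline can be sketched in Python with PySpark (assuming PySpark is installed and sparkdata.txt is in the current directory):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

data = sc.textFile("sparkdata.txt")                      # read the file as an RDD of lines
splitdata = data.flatMap(lambda line: line.split(" "))   # split each line into words
mapdata = splitdata.map(lambda word: (word, 1))          # map each word to (word, 1)
reducedata = mapdata.reduceByKey(lambda a, b: a + b)     # sum the counts per word
print(reducedata.collect())

spark.stop()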
6. Conclusion:
This case study demonstrates the effectiveness of Apache Spark for implementing the Word
Count program, showcasing its scalability and performance in distributed computing. The
insights gained pave the way for further utilization and optimization of Spark for diverse big
data processing tasks.
7. Viva Questions:
8. References:
https://cloudxlab.com/assessment/displayslide/458/apache-spark-streaming-wordcount-hands-on
https://www.digitalocean.com/community/tutorials/apache-spark-example-word-count-program-java
Data Management and Visualization Lab
Experiment No.: 6
Perform Data Extraction and Transformation tasks using Power BI
Experiment No.: 6
1. Aim: To perform Data Extraction and Transformation tasks using Power BI
2. Objectives:
- To extract relevant data from various sources and integrate it into Power BI for analysis.
- To cleanse, transform, and enrich the extracted data to ensure accuracy and suitability for
reporting in Power BI.
3. Outcomes:
- To improve decision-making through deeper insights into operations, customer behavior,
and market trends.
- To enhance reporting and visualization with intuitive dashboards, enabling quick
understanding and informed actions.
5. Theory:
• Data Cleansing: Cleaning and validating the extracted data to ensure data accuracy and reliability.
• Data Integration: Combining data from multiple sources into a cohesive dataset, aligning
data structures and formats for seamless analysis.
• Data Enrichment: Enhancing the extracted data by adding calculated columns, derived
metrics, or additional contextual information to provide deeper insights.
Loading Data from various sources
Power BI supports a large range of data sources. You can click Get data and it shows you all the available data connections. It allows you to connect to different flat files, SQL databases, the Azure cloud, or even web platforms such as Facebook, Google Analytics, and Salesforce objects. It also includes an ODBC connection to connect to other ODBC data sources that are not listed.
✓ Flat Files
✓ SQL Database
✓ OData Feed
✓ Blank Query
✓ Azure Cloud platform
✓ Online Services
✓ Other data sources such as Hadoop, Exchange, or Active Directory
To get data in Power BI Desktop, you need to click the Get data option on the main screen. It shows you the most common data sources first. Then, click the More option to see a full list of available data sources.
Transformations on Data in Power BI
We rarely get data that can be used directly in reports. Instead, we have to clean that data to meet our business standards. Power BI Desktop includes a query editor to perform all the needed operations. To get to Power Query Editor, select Transform data from the Home tab of Power BI Desktop.
The ribbon in Power Query Editor consists of several tabs: Home, Transform, Add Column, View, Tools, and Help.
Transform Tab: The Transform tab provides access to common data transformation tasks, such
as:
✓ Adding or removing columns
✓ Changing data types
✓ Splitting columns
✓ Other data-driven tasks
✓ UnPivot Table
• Please select the column for which you want to change the data type. Next, click on the left corner of the column header (currently it shows the ABC text icon). Clicking in that position opens a drop-down list of supported data types. Please select the data type that suits your data. Here, we are selecting Whole Number.
Note: Students should apply any five of the above-listed transformations on the dataset using Power Query Editor in Power BI.
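For comparison only, a few analogous transformations expressed in Python with pandas (the column names are invented for this sketch; in the lab the steps are performed interactively in Power Query Editor, not in code):

import pandas as pd

# Illustrative source data with invented columns.
df = pd.DataFrame({
    "EmployeeID": ["101", "102"],
    "FullName": ["Asha Rao", "Ravi Kumar"],
    "MonthlySalary": [50000, 42000],
    "Remarks": ["", ""],
})

df = df.drop(columns=["Remarks"])                      # remove a column
df["EmployeeID"] = df["EmployeeID"].astype(int)        # change data type to a whole number
df[["FirstName", "LastName"]] = df["FullName"].str.split(" ", n=1, expand=True)  # split a column
df["AnnualSalary"] = df["MonthlySalary"] * 12          # add a calculated column
print(df)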
6. Conclusion: Using Power BI for data extraction and transformation yields tangible benefits by
facilitating streamlined data analysis and reporting. By leveraging its intuitive interface and
robust features, organizations can harness the full potential of their data assets, driving informed
decision-making and fostering business growth in today's competitive landscape.
7. Viva Questions:
• Can you explain the process of data extraction in Power BI and how it differs from
traditional methods?
• What are the key considerations when selecting data sources for extraction in Power BI?
• How do you ensure data quality and integrity during the extraction and transformation
process in Power BI?
• Can you discuss any challenges you encountered while performing data extraction and
transformation tasks in Power BI and how you addressed them?
8. References: https://learn.microsoft.com/en-us/power-bi/fundamentals/desktop-what-is-desktop
Data Management and Visualization Lab
Experiment No.: 7
Perform Data Visualization tasks and create dashboards using Power BI
Experiment No.: 7
1. Aim: To Perform Data Visualization tasks and create dashboards using Power BI
2. Objectives:
- To Create visually appealing and informative data visualizations that accurately represent
key metrics and trends extracted from various data sources using Power BI.
- To Develop interactive dashboards in Power BI that consolidate multiple visualizations and
provide a comprehensive overview of organizational performance, enabling users to gain
insights and make data-driven decisions efficiently.
3. Outcomes:
• To gain a deeper understanding of complex datasets, enabling users to identify patterns, correlations, and outliers more easily.
• By creating intuitive dashboards in Power BI, organizations streamline data analysis processes, reducing the time and effort required to access and interpret key information.
5. Theory:
Visualizations allow data to be represented in different ways, leading to insights into data
relationships that may not be easily seen. Power BI allows users to create and adjust visualizations
based on their own needs as they look at data. Users will be able to look at data from different
perspectives and find insights into data relationships that help them make better informed
decisions.
Steps of Data Sourcing to Creation of Reports and Dashboards:
The whole process of data sourcing to the creation of reports and dashboards consists of four basic
steps.
1. Data Sourcing in Power BI: Power BI offers a versatile range of data sources, including cloud-
based online services and local files. While there's a 1 GB limit on importing data from online
services, Power BI supports various sources such as Excel, Text/CSV, XML, JSON, Oracle
Database, and Azure SQL Database.
2. Data Transformation in Power BI: Before visualizing the data, a crucial step involves cleaning
and pre-processing. This includes eliminating missing values and irrelevant data from rows
and columns. Adhering to specific rules, datasets are transformed and loaded into the
warehouse for further analysis.
3. Report Development in Power BI: Once data is cleaned and transformed, reports are crafted
based on specific requirements. These reports are essentially data visualizations that
incorporate different filters and constraints. The visual representations can take the form of
graphs, pie charts, and other graphical elements.
4. Dashboard Creation in Power BI: Power BI dashboards are built by pinning independent
elements from live reports. This process occurs after publishing the report to the Power BI
service. The saved reports retain their filter settings, allowing users to create dynamic
dashboards with real-time data insights.
4. The Filters pane, where you can filter data visualizations.
5. The Visualizations pane, where you can add, change, or customize visualizations, and
apply drillthrough.
6. The Format pane, where you design the report and visualizations.
7. The Fields pane, which shows the available fields in your queries. You can drag these
fields onto the canvas, the Filters pane, or the Visualizations pane to create or modify
visualizations.
Visualizations
1. The Fields option in the Visualization pane lets you drag data fields to Legend and other field wells in the pane.
2. The Format option lets you apply formatting and other controls to visualizations.
3. The icons show the type of visualization created. You can change the type of a selected visualization by selecting a different icon, or create a new visualization by selecting an icon with no existing visualization selected.
Power BI offers the functionality to visually represent our data or a subset of it so that it can be used to draw inferences or gain a deeper understanding of the data. These visuals can be bar graphs, pie charts, etc. Following are some examples of basic visual options provided in Power BI:
Card – It is used to represent a single value, such as Total Sales.
Stacked bar/column chart – It combines a line chart (which joins points representing some values with a line) and a bar/column chart (which represents a value against a category and other optional fields).
Waterfall chart – It represents a continuously changing value, where an increase or decrease in value may be represented by differently colored bars.
Pie chart – It represents the fractional value of each category of a particular field.
Map – It is used to represent different information on a map.
KPI – It represents the continuous progress made towards a target.
Slicer – A slicer has options representing different categories of a field. Selecting a category shows only the information specific to that category in other visuals.
Table – A table represents data in tabular form, i.e., rows and columns.
Note: Students should select any dataset and create meaningful reports by exploring the various visual tools available in Power BI.
7. Viva Questions:
• How does Power BI utilize principles of data visualization theory to enhance the
effectiveness of its visualizations?
• Can you discuss how interactive features in Power BI dashboards improve user engagement
and facilitate data exploration?
• What steps did you take to ensure that your Power BI visualizations adhere to ethical
considerations, such as accuracy and transparency in data representation?
8. References: https://learn.microsoft.com/en-us/power-bi/fundamentals/desktop-what-is-desktop
Data Management and Visualization Lab
Experiment No.: 8
Integrate Snowflake with Power BI and provide insights
Experiment No.: 8
1. Aim: To Integrate Snowflake with Power BI and provide insights
2. Objectives:
• Establish a seamless integration between Snowflake and Power BI to enable efficient data transfer and
create visually appealing dashboards.
• Utilize Power BI's analytical capabilities to extract meaningful insights from Snowflake data,
including identifying trends and correlations.
• Evaluate the performance of the Snowflake-Power BI integration in terms of data processing speed,
query efficiency, and overall system responsiveness.
3. Outcomes:
• To connect Snowflake data to Power BI, import relevant datasets, and create interactive
visualizations such as charts, graphs, and maps within Power BI dashboards.
• Extract actionable insights from Snowflake data using Power BI's analytical capabilities,
helping in informed decision-making.
5. Theory:
Snowflake is a cloud-based data warehousing platform that offers scalability, flexibility, and
performance for storing and analyzing large volumes of data. It uses a unique architecture that
separates storage and compute resources, allowing users to scale each independently based on
their needs. Snowflake supports various data types, including structured, semi-structured, and
unstructured data, making it suitable for diverse data analytics tasks.
Integrating Snowflake with Power BI, a leading business intelligence tool by Microsoft, provides
organizations with powerful capabilities for data visualization, reporting, and analysis. Power BI
allows users to connect directly to Snowflake data warehouses, enabling real-time or near-real-
time access to updated data for decision-making.
Steps to connect Snowflake data warehouse from Power Query Desktop
To make the connection to a Snowflake computing warehouse, take the following steps:
1. Select Get Data from the Home ribbon in Power BI Desktop, select Database from the
categories on the left, select Snowflake, and then select Connect.
2. In the Snowflake window that appears, enter the name of your Snowflake server in Server and
the name of your Snowflake computing warehouse in Warehouse.
3. Optionally, enter values in any advanced options that you want to use to modify the connection
query, such as a text value to use as a Role name or a command timeout.
4. Select OK.
5. To sign in to your Snowflake computing warehouse, enter your username and password, and then
select Connect.
6. In Navigator, select one or multiple elements to import and use in Power BI Desktop. Then select
either Load to load the table in Power BI Desktop, or Transform Data to open the Power Query
Editor where you can filter and refine the set of data you want to use, and then load that refined
set of data into Power BI Desktop.
7. Select Import to import data directly into Power BI, or select DirectQuery, then select OK.
(A reference screenshot showing where to find the DB and Warehouse names was provided here.)
Note: Students should select any dataset from Snowflake, integrate it into Power BI, and create a meaningful dashboard.
6. Conclusion: Snowflake and Power BI work together seamlessly to give businesses easy
access to data and powerful tools for analysis, helping them make better decisions and
succeed.
7. Viva Questions:
• What are the benefits of using Power BI's drag-and-drop interface for creating visualizations
compared to traditional methods?
• How does Snowflake's separation of storage and compute resources contribute to cost-
effectiveness and flexibility for businesses?
• Can you explain a scenario where the integration of Snowflake with Power BI resulted in
significant insights or improvements for a specific business process?
8. References: https://learn.microsoft.com/en-us/power-query/connectors/snowflake
Data Management and Visualization Lab
Experiment No.: 09
Case study on Cloud Services: Azure, AWS, GCP
Experiment No.: 09
1. Aim: Case study on Cloud Services: Azure, AWS, GCP
2. Objectives:
- Compare the performance metrics of Azure, AWS, and GCP under various workloads to
determine which platform offers better speed, scalability, and reliability.
- Conduct a comprehensive cost analysis of utilizing Azure, AWS, and GCP services.
3. Outcomes:
- By creating intuitive dashboards in Power BI, organizations streamline data analysis processes, reducing the time and effort required to access and interpret key information.
- Analyze the cost-effectiveness of each cloud platform based on the identified performance
metrics
4. Hardware / Software Required: NA
5. Theory: In today's digital era, organizations increasingly rely on cloud computing services to host
applications, store data, and streamline operations. Three major players dominate the cloud market:
Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP). Each offers
a vast array of services tailored to diverse business needs. This case study aims to compare these
cloud platforms comprehensively, focusing on performance, cost, and suitability for various
workloads.
1. Azure (Microsoft Azure): Azure is a comprehensive cloud computing platform offered by
Microsoft. It provides a wide range of services, including computing, storage, networking,
databases, analytics, and AI.
Key Features:
• Azure Blob Storage: Object storage service for storing and accessing large
amounts of unstructured data.
• Azure App Service: Platform as a Service (PaaS) offering for building, deploying,
and managing web and mobile applications.
• Azure AI and Machine Learning: Tools and services for building and deploying AI
and machine learning models.
Use Cases: Azure is popular among enterprises for hosting applications, websites,
databases, and for running data analytics and AI workloads. It caters to a wide range of
industries, including healthcare, finance, manufacturing, and government sectors.
2. AWS (Amazon Web Services): AWS is a leading cloud computing platform provided by
Amazon. It offers a broad set of global compute, storage, database, analytics, and machine
learning services, as well as Internet of Things (IoT) and security solutions.
Key Features:
• Amazon S3: Scalable object storage service for storing and retrieving data.
• Amazon RDS: Managed relational database service for deploying and scaling
databases.
• Amazon SageMaker: Fully managed service for building, training, and deploying
machine learning models.
Use Cases: AWS is widely used by startups, enterprises, and government organizations
for various purposes, including website hosting, mobile app development, data storage
and analytics, IoT applications, and machine learning.
3. GCP (Google Cloud Platform): GCP is a suite of cloud computing services provided by Google. It offers infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) solutions, along with data storage, analytics, machine learning, and networking services.
Key Features:
• Google Cloud Storage: Scalable object storage with high availability and global
edge-caching.
• Google Cloud SQL: Fully managed relational database service for MySQL,
PostgreSQL, and SQL Server.
Use Cases: GCP is favored by organizations looking for scalable and cost-effective cloud
solutions. It is particularly popular among companies in the technology, gaming, media, and
retail industries. GCP is known for its strong capabilities in data analytics, machine learning,
and containerization.
Azure | AWS | GCP
- Azure Cognitive Services | AWS AI/ML services | AI Hub (pre-trained models)
- Azure Databricks (data analytics) | Amazon Redshift (data warehousing) | BigQuery (data analytics)
- Azure Synapse Analytics (big data) | Amazon EMR (Hadoop, Spark) | Dataflow (stream and batch processing)
Additional Services:
- Azure DevOps (continuous integration) | AWS CloudFormation | Cloud Build
- Azure Active Directory | AWS Identity and Access Management (IAM) | Cloud IAM (identity and access management)
- Azure Monitor (monitoring) | Amazon CloudWatch (monitoring) | Stackdriver (monitoring)
- Azure Security Center | AWS Inspector (security) | Cloud Security Scanner (security)
- Azure Backup | AWS Backup (data protection) | Cloud Armor (DDoS protection)
5. Comparison of pricing for Azure, AWS, and GCP across some common services:
Service | Azure | AWS | GCP
Virtual Machines (VMs) | Pay-as-you-go pricing, reserved instances | Pay-as-you-go pricing, reserved instances | Pay-as-you-go pricing, sustained use discounts
Storage (e.g., S3, Blob) | Pay-as-you-go pricing, storage tiers | Pay-as-you-go pricing, storage classes | Pay-as-you-go pricing, multi-regional storage
Database Services | Pay-as-you-go pricing, reserved capacity | Pay-as-you-go pricing, reserved instances | Pay-as-you-go pricing, committed use discounts
Networking | Pay-as-you-go pricing, VPN Gateway | Pay-as-you-go pricing, Direct Connect | Pay-as-you-go pricing, Cloud Interconnect
AI/ML Services | Pay-as-you-go pricing, usage-based | Pay-as-you-go pricing, usage-based | Pay-as-you-go pricing, preemptible VMs
6. Conclusion: The case study reveals Azure's hybrid integration, AWS's vast service portfolio,
and GCP's innovative pricing, aiding organizations in cloud platform selection based on
individual needs and preferences.
7. Viva Questions:
• How do the hybrid cloud integration capabilities of Azure differentiate it from AWS and
GCP?
• How do the pricing models of Azure, AWS, and GCP impact the cost-effectiveness of
using each platform?
• What are some key factors organizations should consider when selecting a cloud
platform, and how do Azure, AWS, and GCP address these considerations differently?
8. References:
• https://www.digitalocean.com/resources/article/comparing-aws-azure-gcp#use-cases-for-aws-vs-azure-vs-gcp
• https://www.veritis.com/blog/aws-vs-azure-vs-gcp-the-cloud-platform-of-your-choice/
Data Management and Visualization Lab
Experiment No.: 10
Perform Data Visualization tasks and create dashboards using Tableau
Experiment No.: 10
1. Aim: To perform Data Visualization tasks and create dashboards using Tableau
2. Objectives:
- To create visually appealing and informative data visualizations that accurately represent key metrics and trends extracted from various data sources using Tableau.
- To develop interactive dashboards in Tableau that consolidate multiple visualizations and provide a comprehensive overview of organizational performance, enabling users to gain insights and make data-driven decisions efficiently.
3. Outcomes:
- To gain a deeper understanding of complex datasets, enabling users to identify patterns, correlations, and outliers more easily.
- By creating intuitive dashboards in Tableau, organizations streamline data analysis processes, reducing the time and effort required to access and interpret key information.
5. Theory:
Tableau is an excellent data visualization and business intelligence tool used for reporting and analyzing vast volumes of data. The company behind it was founded in 2003 and was acquired by Salesforce in June 2019. Tableau helps users create different charts, graphs, maps, dashboards, and stories for visualizing and analyzing data, to help in making business decisions.
- Tableau supports powerful data discovery and exploration that enables users to answer
important questions in seconds
- No prior programming knowledge is needed; users without relevant experience can start
immediately with creating visualizations using Tableau
- It can connect to several data sources that other BI tools do not support. Tableau enables
users to create reports by joining and blending different datasets
- Tableau Server supports a centralized location to manage all published data sources within
an organization
Working with Tableau Desktop
1. Connect to your Data
The start page gives you several options to choose from:
1.1 Tableau icon. Click the Tableau logo icon in the upper left corner of any page to toggle between the start page and the authoring workspace.
1.2 Connect pane. Under Connect, you can:
- Connect to data that is stored in a file, such as Microsoft Excel, PDF, Spatial files, and
more.
- Connect to data that is stored on Tableau Server, Microsoft SQL Server, Google
Analytics, or another server.
- Connect to a data source that you’ve connected to before.
Tableau supports the ability to connect to a wide variety of data stored in a wide variety of
places. The Connect pane lists the most common places that you might want to connect to, or
click the More links to see more options.
1.3. Under Accelerators, see a selection of Accelerators and the sample workbooks that come
with Tableau Desktop. Prior to 2023.2, these were only sample workbooks.
1.4. Under Open, you can open workbooks that you've already created.
1.5. Under Discover, find additional resources like video tutorials, forums, or the “Viz of the
week” to get ideas about what you can build.
In the Connect pane, under Saved Data Sources, click Sample - Superstore to connect to the
sample data set.
After you select Sample - Superstore, your screen will look something like this:
The Sample - Superstore data set comes with Tableau. It contains information about products,
sales, profits, and so on that you can use to identify key areas for improvement within this
fictitious company.
Creating Reports
1. From the Data pane, drag Order Date to the Columns shelf.
Note: When you drag Order Date to the Columns shelf, Tableau creates a column for each
year in your data set. Under each column is an Abc indicator. This indicates that you can
drag text or numerical data here, like what you might see in an Excel spreadsheet. If you
were to drag Sales to this area, Tableau creates a crosstab (like a spreadsheet) and displays
the sales totals for each year.
2. From the Data pane, drag Sales to the Rows shelf.
Tableau generates the following chart with sales rolled up as a sum (aggregated). You can
see total aggregated sales for each year by order date.
When you first create a view that includes time (in this case Order Date), Tableau
automatically generates a line chart.
Refine your View
1. From the Data pane, drag Category to the Columns shelf and place it to the right of
YEAR(Order Date).
Your view updates to a bar chart. By adding a second discrete dimension to the view you can
categorize your data into discrete chunks instead of looking at your data continuously over time.
This creates a bar chart and shows you overall sales for each product category by year.
2. Double-click or drag Sub-Category to the Columns shelf.
Note: You can drag and drop or double-click a field to add it to your view, but be careful.
Tableau makes assumptions about where to add that data, and it might not be placed where
you expect. You can always click Undo to remove the field, or drag it off the area where
Tableau placed it to start over.
Sub-Category is another discrete field. It creates another header at the bottom of the view, and
shows a bar for each sub-category (68 marks) broken down by category and year.
Next, add two simple filters to your worksheet to make it easier to look at product sales by sub-category for a specific year.
1. In the Data pane, right-click Order Date and select Show Filter.
2. Repeat the step above for the Sub-Category field.
The filters are added to the right side of your view in the order that you selected them. Filters are
card types and can be moved around on the canvas by clicking on the filter and dragging it to
another location in the view. As you drag the filter, a line appears that shows you where you can
drop the filter to move it.
Note: The Get Started tutorial uses the default position of the filter cards.
More on Filtering in the Learning Library (in the top menu).
Note: Tableau automatically added a color legend and assigned a diverging color palette because
your data includes both negative and positive values.
Visualization in Tableau
In Tableau, data visualization is a central aspect of its functionality. Tableau provides a user-friendly interface and a wide range of tools to create interactive and insightful visualizations from various data sources.
Visualization Types:
• Charts and Graphs: Tableau offers a variety of visualization types including bar charts, line
charts, scatter plots, heat maps, pie charts, histograms, treemaps, geographic maps, and more.
• Dashboard Layouts: Users can create interactive dashboards by combining multiple
visualizations and designing custom layouts to present insights effectively.
Interactivity and Drill-Down:
• Filters and Parameters: Tableau provides filters and parameters to control data displayed in
visualizations dynamically, allowing users to drill down into specific details and explore data
interactively.
• Actions: Users can create interactive actions between visualizations, enabling cross-filtering,
highlighting, and linking behaviors to enhance user experience.
Calculations and Expressions:
• Calculated Fields: Users can create calculated fields using formulas, functions, and logical
expressions to perform custom calculations within Tableau.
• Table Calculations: Tableau offers table calculations for computing values across rows,
columns, or specific dimensions within visualizations.
Mapping and Geospatial Analysis:
• Geographic Mapping: Tableau supports geographic mapping to visualize data on maps,
including custom geocoding, layers, and spatial data integration.
• Spatial Analysis: Users can perform spatial analysis tasks such as distance calculations,
clustering, and territory mapping for geospatial insights.
Advanced Analytics:
• Predictive Modeling: Tableau integrates with statistical tools and algorithms for predictive
modeling, forecasting, trend analysis, and what-if scenarios.
• R and Python Integration: Users can leverage R and Python scripts within Tableau for advanced
analytics, statistical calculations, and custom visualizations.
6. Conclusion: Tableau offers robust data visualization capabilities, enabling users to create
interactive and insightful visualizations from diverse data sources, facilitating data-driven decision-
making and enhancing communication of key insights.
7. Viva Questions:
• Can you explain the process of creating a dashboard in Tableau from multiple data sources?
• How does Tableau handle data connections and data preparation tasks?
• Discuss Tableau's capabilities for connecting to various data sources, data blending
techniques, and data cleansing features.
• Describe the role of calculated fields and expressions in Tableau.
• Provide examples of how calculated fields are used to perform advanced calculations, create
custom metrics, and enhance visualizations.
8. References:
https://help.tableau.com/current/guides/get-started-tutorial/en-us/get-started-tutorial-home.htm