
Department of Computer Engineering

Lab Manual
Third Year Semester VI
Subject: Data Management and Visualization

EVEN SEMESTER

Index
Sr. No.  Contents
1. List of Experiments
2. Experiment Plan and Course Outcomes
3. Mapping of Course Outcomes with Program Outcomes and Program Specific Outcomes
4. Study and Evaluation Scheme
5. Experiment No. 1
6. Experiment No. 2
7. Experiment No. 3
8. Experiment No. 4
9. Experiment No. 5
10. Experiment No. 6
11. Experiment No. 7
12. Experiment No. 8
13. Experiment No. 9
14. Experiment No. 10

List of Experiments

Sr. No.  Experiment Name
1. Analyze and Implement SQL queries on the Snowflake platform
2. Demonstrate connectivity between Python and the Snowflake environment by executing SQL queries
3. Design Dimensional data modelling using Power BI
4. Install and implement Word Count program on Hadoop using Cloudera platform
5. Install and implement Word Count program using Spark platform
6. Perform Data Extraction and Transformation tasks using Power BI
7. Perform Data Visualization tasks and create dashboards using Power BI
8. Integrate Snowflake with Power BI and provide insights
9. Case study on Cloud Services: Azure, AWS, GCP
10. Perform Data Visualization tasks and create dashboards using Tableau
Course Objective, Course Outcome &
Experiment Plan
Course Objective:

1 To achieve proficiency in programming fundamentals and data management.

2 To provide understanding of data warehousing and big data.

3 To visualize data by integrating data from various data sources.

4 To provide understanding of the core cloud data services.

Course Outcomes:
CO1: Demonstrate proficiency in Python and integrate SQL seamlessly for practical applications.
CO2: Comprehend the evolution and architecture of data warehousing, demonstrating proficiency in data staging, ETL design, and data modeling.
CO3: Gain a comprehensive understanding of big data, the Hadoop ecosystem, and the fundamentals of Spark.
CO4: Become adept at using Power BI for comprehensive data management for versatile data analysis and sharing.
CO5: Demonstrate proficiency in configuring Snowflake as a Power BI data source.
CO6: Gain a foundational understanding of cloud computing, cloud models and analytics tools for business intelligence.

Experiment Plan:
1. Week 1: Analyze and Implement SQL queries on the Snowflake platform (CO1, Weightage 05)
2. Week 2: Demonstrate connectivity between Python and the Snowflake environment by executing SQL queries (CO1, Weightage 05)
3. Week 3: Design of dimensional data modelling using Power BI (CO2, Weightage 10)
4. Week 4: Install and implement Word Count program on Hadoop using Cloudera platform (CO3, Weightage 05)
5. Week 5: Case Study on Implementation of Word Count program using Spark Platform (CO3, Weightage 05)
6. Week 6: Perform Data Extraction and Transformation tasks using Power BI (CO4, Weightage 05)
7. Week 7: Perform Data Visualization tasks and create dashboards using Power BI (CO4, Weightage 05)
8. Week 8: Integrate Snowflake with Power BI and provide insights (CO5, Weightage 05)
9. Week 9: Case study on Cloud Services: Azure, AWS, GCP (CO6, Weightage 10)
10. Week 10: Content Beyond Syllabus: Perform Data Visualization tasks and create dashboards using Tableau

CO-PO & PSO Mapping
Mapping of Course outcomes with Program Outcomes:

Subject Weight: Practical 100%

Course Outcomes and their contribution to Program Outcomes (PO1 to PO12):
CO1: Demonstrate proficiency in Python and integrate SQL seamlessly for practical applications. (1 1 1 2 2 1 1 1 1)
CO2: Comprehend the evolution and architecture of data warehousing, demonstrating proficiency in data staging, ETL design, and data modeling. (1 1 1 1 2 1 1 1 1)
CO3: Gain a comprehensive understanding of big data, the Hadoop ecosystem, and the fundamentals of Spark. (1 2 1 1 1 1 1 1 1)
CO4: Adept at using Power BI for comprehensive data management for versatile data analysis and sharing. (1 2 1 2 1 1 1 1)
CO5: Demonstrate proficiency in configuring Snowflake as a Power BI data source. (1 1 2 2 1 1 1 1)
CO6: Gain the foundational understanding of cloud computing, cloud models and analytics tools for business intelligence. (1 1 1 2 2 1 1 1 1)

Mapping of Course outcomes with Program Specific Outcomes:

Course Outcomes and their contribution to Program Specific Outcomes (PSO1, PSO2, PSO3):
CO1: Demonstrate proficiency in Python and integrate SQL seamlessly for practical applications. (PSO1: 2, PSO2: 2, PSO3: 2)
CO2: Comprehend the evolution and architecture of data warehousing, demonstrating proficiency in data staging, ETL design, and data modeling. (PSO1: 2, PSO2: 2, PSO3: 2)
CO3: Gain a comprehensive understanding of big data, the Hadoop ecosystem, and the fundamentals of Spark. (PSO1: 2, PSO2: 2, PSO3: 2)
CO4: Adept at using Power BI for comprehensive data management for versatile data analysis and sharing. (PSO1: 2, PSO2: 2, PSO3: 2)
CO5: Demonstrate proficiency in configuring Snowflake as a Power BI data source. (PSO1: 2, PSO2: 2, PSO3: 2)
CO6: Gain the foundational understanding of cloud computing, cloud models and analytics tools for business intelligence. (PSO1: 2, PSO2: 2, PSO3: 2)

Study and Evaluation Scheme
Course Code: CELDLO6025
Course Name: Data Management and Visualization

Teaching Scheme: Theory --, Practical 02, Tutorial --
Credits Assigned: Theory --, Practical 02, Tutorial --, Total 02

Examination Scheme: Term Work 25, Oral 25, Total 50

Term Work:

The Term Work marks are based on the weekly experimental performance of the students, oral performance,
and regularity in the lab.

Students are expected to prepare for the lab ahead of time by referring to the manual and to perform the
experiment under guidance and discussion. In the following week, the experiment write-up is corrected along
with an oral examination.

End Semester Examination:

At the end of the semester, there will be an oral evaluation based on the theory and laboratory work.

Data Management and Visualization Lab

Experiment No.: 1
Analyze and Implement SQL queries on the Snowflake platform.

Experiment No.1
1. Aim: Analyze and Implement SQL queries on the Snowflake platform

2. Objectives: To achieve proficiency in programming fundamentals and data management.

3. Outcomes: Demonstrate proficiency in Python and integrate SQL seamlessly for practical
applications.

4. Hardware / Software Required: Snowflake

5. Theory:

Snowflake supports most of the standard functions defined in SQL:1999, as well as parts
of the SQL:2003 analytic extensions.

Scalar Functions
A scalar function is a function that returns one value per invocation; in most cases, you
can think of this as returning one value per row. This contrasts with Aggregate Functions,
which return one value per group of rows.

Categories, with a few functions and their syntax, and their descriptions:

Conversion Functions: Convert expressions from one data type to another data type.
CAST( <source_expr> AS <target_data_type> )

Date & Time Functions: Manipulate dates, times, and timestamps.
DATE_PART( <date_or_time_part> , <date_or_time_expr> )
EXTRACT( <date_or_time_part> FROM <date_or_time_expr> )
HOUR( <time_or_timestamp_expr> )
MINUTE( <time_or_timestamp_expr> )
SECOND( <time_or_timestamp_expr> )
LAST_DAY( <date_or_time_expr> [ , <date_part> ] )
MONTHNAME( <date_or_timestamp_expr> )
NEXT_DAY( <date_or_time_expr> , <dow_string> )
PREVIOUS_DAY( <date_or_time_expr> , <dow> )
MONTHS_BETWEEN( <date_expr1> , <date_expr2> )

Numeric Functions: Perform rounding, truncation, exponent, root, logarithmic, and trigonometric operations on numeric values.
ABS( <num_expr> )
CEIL( <input_expr> [ , <scale_expr> ] )
FLOOR( <input_expr> [ , <scale_expr> ] )
MOD( <expr1> , <expr2> )
ROUND( <input_expr> [ , <scale_expr> [ , <rounding_mode> ] ] )
SIGN( <expr> )
TRUNC( <input_expr> [ , <scale_expr> ] )

String & Binary Functions: Manipulate and transform string input.
<subject> REGEXP <pattern>
REGEXP_LIKE( <subject> , <pattern> [ , <parameters> ] )

Aggregate Functions: Take multiple rows/values as input and return a single value.
AVG( [ DISTINCT ] <expr1> )
COUNT( [ DISTINCT ] <expr1> [ , <expr2> ... ] )
COUNT_IF( <condition> )
MAX( <expr> )
MAX_BY( <col_to_return> , <col_containing_maximum> [ , <maximum_number_of_values_to_return> ] )
MEDIAN( <expr> )
MIN( <expr> )
MIN_BY( <col_to_return> , <col_containing_minimum> [ , <maximum_number_of_values_to_return> ] )
SUM( [ DISTINCT ] <expr1> )
Note: Students have to perform the above-mentioned functions as SQL queries using any inbuilt dataset from Snowflake.
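For illustration, a minimal Snowpark Python worksheet sketch is given below; it assumes the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 sample database is available in the account (the table and column names follow the TPC-H sample schema and are assumptions). The same statements can equally be run directly in a SQL worksheet.

# A minimal sketch, assuming the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 sample database is available.
import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    # Scalar functions: conversion, date & time, and numeric examples.
    session.sql(
        "SELECT CAST(O_TOTALPRICE AS INTEGER) AS price_int, "
        "MONTHNAME(O_ORDERDATE) AS order_month, "
        "ROUND(O_TOTALPRICE, 1) AS price_rounded "
        "FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS LIMIT 10"
    ).show()

    # Aggregate functions: one value per group of rows.
    dataframe = session.sql(
        "SELECT O_ORDERSTATUS, COUNT(*) AS order_count, "
        "AVG(O_TOTALPRICE) AS avg_price, MAX(O_TOTALPRICE) AS max_price "
        "FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS GROUP BY O_ORDERSTATUS"
    )
    dataframe.show()
    return dataframe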

6. Conclusion: Thus, students are able to learn how to extract information from a given dataset
as per user needs.

7. Viva Questions :
i. Differentiate between Numeric and Aggregate functions.
ii. Differentiate between COUNT & COUNT_IF, MAX & MAX_BY, and MIN & MIN_BY functions.

8. References

i. https://docs.snowflake.com/en/sql-reference-functions

Data Management and Visualization Lab
Experiment No.: 2
Demonstrate connectivity between Python and Snowflake
Environment by executing SQL queries.

Experiment No.2
1. Aim: Demonstrate connectivity between Python and the Snowflake environment by
executing SQL queries.

2. Objectives: To achieve proficiency in programming fundamentals and data management.

3. Outcomes: Demonstrate proficiency in Python and integrate SQL seamlessly for practical
applications.

4. Hardware / Software Required: Snowflake

5. Theory:

Snowpark API

The Snowpark library provides an intuitive API for querying and processing data at scale
in Snowflake. Using a library for any of the three supported languages, you can build applications that
process data in Snowflake without moving the data to the system where your application code
runs, and process it at scale as part of the elastic and serverless Snowflake engine.

Snowflake currently provides Snowpark libraries for three languages: Java, Python, and
Scala.

Attributes of a Snowpark DataFrame:

• analytics
• columns: Returns all column names as a list.
• dtypes
• na: Returns a DataFrameNaFunctions object that provides functions for handling missing values in the DataFrame.
• queries: Returns a dict that contains a list of queries that will be executed to evaluate this DataFrame with the key "queries", and a list of post-execution actions (e.g., queries to clean up temporary objects) with the key "post_actions".
• schema: The definition of the columns in this DataFrame (the "relational schema" for the DataFrame).
• session: Returns a snowflake.snowpark.Session object that provides access to the session the current DataFrame is relying on.
• stat
• write: Returns a new DataFrameWriter object that you can use to write the data in the DataFrame to a Snowflake database or a stage location.
• is_cached: Whether the dataframe is cached.

Stepwise explanation to connect to the Snowflake database using a Python worksheet.


Step 1: First, we have to activate the Anaconda Python packages from Snowflake. Follow these
steps to activate the Python packages:

• Admin
• Billing & Terms
• Activate anaconda python package

Step 2: Select a Python worksheet. The following code is auto-generated in the Python
worksheet.

# The Snowpark package is required for Python Worksheets.
# You can add more packages by selecting them using the Packages control and then importing them.

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col

def main(session: snowpark.Session):
    # Your code goes here, inside the "main" handler.
    tableName = 'information_schema.packages'
    dataframe = session.table(tableName).filter(col("language") == 'python')

    # Print a sample of the dataframe to standard output.
    dataframe.show()

    # Return value will appear in the Results tab.
    return dataframe

Step 3: Select dataset and warehouse for further queries.

Step 4: Run the worksheet.

Sample code of database connectivity:

# The Snowpark package is required for Python Worksheets.
# You can add more packages by selecting them using the Packages control and then importing them.

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col

def main(session: snowpark.Session):
    sql_insert = "INSERT INTO EMPLOYEE values (1007, 'Tushar G')"
    dataframe = session.sql(sql_insert)

    # Print a sample of the dataframe to standard output.
    dataframe.show()

    sql_query = "SELECT * FROM EMPLOYEE"
    dataframe = session.sql(sql_query)
    dataframe.show()

    # Return value will appear in the Results tab.
    return dataframe

Note: Students have to perform at least 5-10 aggregate queries and store the resultant values into a newly created table.
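As a hint for that task, here is a minimal sketch; the EMPLOYEE table with DEPT and SALARY columns is a hypothetical example. It runs an aggregate query and stores the result in a new table using the DataFrame write attribute described above.

# A minimal sketch; DEPT and SALARY are hypothetical columns of an assumed EMPLOYEE table.
import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    agg_sql = (
        "SELECT DEPT, COUNT(*) AS emp_count, "
        "AVG(SALARY) AS avg_salary, MAX(SALARY) AS max_salary "
        "FROM EMPLOYEE GROUP BY DEPT"
    )
    result = session.sql(agg_sql)

    # Persist the aggregated result as a new table in the current database/schema.
    result.write.mode("overwrite").save_as_table("EMPLOYEE_DEPT_SUMMARY")

    result.show()
    return result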

6. Conclusion: Thus, students are able to learn how to communicate between Python and the
Snowflake database using the Snowpark API.

7. Viva Questions:

i. Discuss the Snowpark API.
ii. Explain the steps of database connectivity using Snowpark in a Python worksheet.
iii. What are the packages required to activate Python connectivity in Snowflake?

8. References

https://docs.snowflake.com/en/developer-guide/snowpark/python/calling-functions

Data Management and Visualization Lab

Experiment No.: 3
Design Dimensional data modelling using Power BI

1. Aim: Design Dimensional data modelling using Power BI

2. Objectives: To achieve data modelling using the visualization tool Power BI

3. Outcomes: Students will be able to apply dimensional data modelling to real-time data
4. Hardware / Software Required: Power BI

5. Theory: What is Dimensional data modelling?

Dimensional data models are primarily used in data warehouses and data marts that support
business intelligence applications. They consist of fact tables that contain data about transactions
or other events and dimension tables that list attributes of the entities in the fact tables. For
example, a fact table could detail product purchases by customers, while connected dimension
tables hold data about the products and customers. Notable types of dimensional models are star
schemas, which connect a fact table to different dimension tables, and snowflake schemas, which
include multiple levels of dimension tables.
Steps in building a dimensional data model

• Choose the business processes that you want to use to analyse the subject area to be modelled.
• Determine the granularity of the fact tables.
• Identify dimensions and hierarchies for each fact table.
• Identify measures for the fact tables.
• Determine the attributes for each dimension table.

Star schema

Star schema is a modeling approach widely adopted by relational data warehouses. It requires
modelers to classify their model tables as either dimension or fact.

Dimension tables describe business entities—the things you model. Entities can include products,
people, places, and concepts including time itself. The most consistent table you'll find in a star
schema is a date dimension table. A dimension table contains a key column (or columns) that acts
as a unique identifier, and descriptive columns.

Fact tables store observations or events, and can be sales orders, stock balances, exchange rates,
temperatures, etc. A fact table contains dimension key columns that relate to dimension tables,
and numeric measure columns. The dimension key columns determine the dimensionality of a fact
table, while the dimension key values determine the granularity of a fact table. For example,
consider a fact table designed to store sale targets that has two dimension key columns: Date and
Product Key. It's easy to understand that the table has two dimensions. The granularity, however,
can't be determined without considering the dimension key values. In this example, consider that
the values stored in the Date column are the first day of each month. In this case, the granularity
is at month-product level.

Generally, dimension tables contain a relatively small number of rows. Fact tables, on the other
hand, can contain a very large number of rows and continue to grow over time.
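To make the grain idea concrete before moving to the Power BI steps, the following sketch (illustrative only; all table and column names are assumptions, and it is not part of the Power BI procedure) builds a tiny star schema in Python with pandas: a fact table keyed by Date and ProductKey at month-product granularity, related to a Product dimension the way Power BI relates a fact table to its dimensions.

# Illustrative sketch of a star schema at month-product granularity (assumed data).
import pandas as pd

# Dimension table: one row per product, with a unique key column.
dim_product = pd.DataFrame({
    "ProductKey": [1, 2],
    "ProductName": ["Bike", "Helmet"],
    "Category": ["Bikes", "Accessories"],
})

# Fact table: dimension key columns (Date, ProductKey) plus a numeric measure.
# Dates are the first day of each month, so the grain is month-product.
fact_sales_target = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-02-01"]),
    "ProductKey": [1, 2, 1],
    "TargetAmount": [10000, 1500, 12000],
})

# A one-to-many relationship from dimension to fact, like Power BI's model view.
report = fact_sales_target.merge(dim_product, on="ProductKey", how="left")
print(report[["Date", "ProductName", "Category", "TargetAmount"]])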

1. In Power BI Desktop, at the left, click the Model view icon.

2. If you do not see all seven tables, scroll horizontally to the right, and then drag and arrange
the tables more closely together so they can all be seen at the same time.

In Model view, it’s possible to view each table and relationships (connectors between tables).

3. In Power BI, on the Modeling ribbon tab, from inside the Relationships group, click
Manage Relationships.

4. In the Manage Relationships window, notice that no relationships are yet defined.

5. To create a relationship, click New.

6. In the Create Relationship window, in the first dropdown list, select the Product table.

7. In the second dropdown list (beneath the Product table grid), select the Sales table.
8. Notice the ProductKey columns in each table have been selected.
The columns were automatically selected because they share the same name.
9. In the Cardinality dropdown list, notice that One To Many is selected.
The cardinality was automatically detected, because Power BI understands that
the ProductKey column from the Product table contains unique values. One-to-many
relationships are the most common cardinality.
10. Active relationships will propagate filters. It's possible to mark a relationship as inactive so filters
don't propagate. Inactive relationships can exist when there are multiple relationship paths
between tables, in which case model calculations can use special functions to activate them.
You'll work with an inactive relationship later.
11. Click OK.

12. In the Manage Relationships window, notice that the new relationship is listed, and then
click Close.

6. Conclusion: Thus, students will be able to design dimensional data modelling using Power BI.

7. Viva Questions:
i. What is dimensional modelling?
ii. Explain Star Schema with an example.

8. References: https://learn.microsoft.com/en-us/power-bi/guidance/star-schema

Data Management and Visualization Lab

Experiment No.: 4
Install and implement Word Count program on Hadoop using
Cloudera platform

Experiment No.: 4
1. Aim: Install and implement Word Count program on Hadoop using Cloudera platform

2. Objectives: To execute the WordCount application and copy the results from WordCount out of
HDFS (Hadoop Distributed File system)

3. Outcomes: Students will be able to learn the distributed file system environment and execute
applications in HDFS.

4. Hardware / Software Required: Oracle Virtual Box, Cloudera

5. Theory: To install VirtualBox and Cloudera Virtual Machine (VM) Image follow the following links.

1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install
VirtualBox for your computer. The course uses VirtualBox 5.1.X, so we recommend
clicking VirtualBox 5.1 builds on that page and downloading the older package for ease of following
instructions and screenshots. However, it shouldn't be too different if you choose to use or upgrade to
VirtualBox 5.2.X. For Windows, select the link "VirtualBox 5.1.X for Windows hosts x86/amd64"
where 'X' is the latest version.

2. Download the Cloudera VM. Download the Cloudera VM from
https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip.
The VM is over 4 GB, so it will take some time to download.

3. After the successful installations of Cloudera Virtual Machine (VM) Image following steps to be
followed to execute the WordCount application.

• Part 1- Execute the WordCount application.

• Part-2 Copy the results from WordCount out of HDFS.

Part-1

1.Open a terminal shell. Start the Cloudera VM in VirtualBox, if not already running, and open a terminal
shell. Detailed instructions for these steps can be found in the previous Readings.

2. See example MapReduce programs. Hadoop comes with several example MapReduce applications. You
can see a list of them by running hadoop jar /usr/jars/hadoop-examples.jar. We are interested in running
WordCount.

The output says that WordCount takes the name of one or more input files and the name of the output
directory. Note that these files are in HDFS, not the local file system.

3. Verify input file exists. In the previous Reading, we downloaded the complete works of Shakespeare and
copied them into HDFS. Let's make sure this file is still in HDFS so we can run WordCount on it. Run
hadoop fs -ls

4. See WordCount command line arguments. We can learn how to run WordCount by examining its
command-line arguments. Run hadoop jar /usr/jars/hadoop-examples.jar wordcount.

5. Run WordCount. Run WordCount for words.txt: hadoop jar /usr/jars/hadoop-examples.jar wordcount
words.txt out

As WordCount executes, Hadoop prints the progress in terms of Map and Reduce. When WordCount
is complete, both will say 100%.

6. See WordCount output directory. Once WordCount is finished, let's verify the output was created. First,
let's see that the output directory, out, was created in HDFS by running hadoop fs -ls

We can see there are now two items in HDFS: words.txt is the text file that we previously created, and out
is the directory created by WordCount.

7. Look inside output directory. The directory created by WordCount contains several files. Look inside the
directory by running hadoop fs -ls out

The file part-r-00000 contains the results from WordCount. The file _SUCCESS means WordCount
executed successfully.

8. Copy WordCount results to local file system. Copy part-r-00000 to the local file system by running hadoop
fs -copyToLocal out/part-r-00000 local.txt

9. View the WordCount results. View the contents of the results: more local.txt

Each line of the results file shows the number of occurrences for a word in the input file. For example,
Accuse appears four times in the input, but Accusing appears only once.

How do I figure out how to run Hadoop MapReduce programs

Hadoop comes with several MapReduce applications. In the Cloudera VM, these applications are in
/usr/jars/hadoop-examples.jar. You can see a list of all the applications by running hadoop jar
/usr/jars/hadoop-examples.jar.

Each of these MapReduce applications can be run in the terminal. To see how to run a specific application,
append the application name to the end of the command line. For example, to see how to run wordcount, run
hadoop jar /usr/jars/hadoop-examples.jar wordcount.

The output tells you how to run wordcount:

Usage: wordcount <in> [<in>...] <out>

The <in> and <out> denote the names of the input and output, respectively. The square brackets around the
second <in> mean that the second input is optional, and the ... means that more than one input can be used.

This usage says that wordcount is run with one or more inputs and one output, the input(s) are specified first,
and the output is specified last.

Part -2-Copy your data into the Hadoop Distributed File System (HDFS). Follow the instructions.

1. Open a browser. Open the browser by click on the browser icon on the top left of the screen.

2. Download the Shakespeare. We are going to download a text file to copy into HDFS. Enter the
following link in the
browser: http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt

3. Once the page is loaded, click on the Open menu button.

4. Click on Save Page

5. Change the output to words.txt and click Save.

6. Open a terminal shell. Open a terminal shell by clicking on the square black box on the top left
of the screen.

7. Run cd Downloads to change to the Downloads directory.

8. Run ls to see that words.txt was saved.

9. Copy file to HDFS. Run hadoop fs -copyFromLocal words.txt to copy the text file to HDFS.

10. Verify file was copied to HDFS. Run hadoop fs -ls to verify the file was copied to HDFS.

11. Copy a file within HDFS. You can make a copy of a file in HDFS. Run hadoop fs -cp words.txt
words2.txt to make a copy of words.txt called words2.txt

12. We can see the new file by running hadoop fs -ls

13. Copy a file from HDFS. We can also copy a file from HDFS to the local file system.
Run hadoop fs -copyToLocal words2.txt . to copy words2.txt to the local directory.

14. Let's run ls to see that words2.txt was copied to the local directory.

15. Delete a file in HDFS. Let's delete words2.txt in HDFS. Run hadoop fs -rm words2.txt

16. Run hadoop fs -ls to see that the file is gone.

6. Conclusion: We successfully interacted with Hadoop via its command-line interface, gaining
insights into its functionalities. We efficiently transferred files between the local file system and
HDFS, demonstrating proficiency in data management within distributed environments.

7. Viva Questions:
i. How do you initiate interactions with Hadoop using its command-line application?
ii. Can you explain the process of copying files into and out of the Hadoop Distributed File
System (HDFS) via the command line?
iii. What are the advantages of using the command-line interface for interacting with
Hadoop compared to graphical user interfaces?
iv. How does transferring files between the local file system and HDFS contribute to
efficient data management in distributed environments?

8. References:
1. Cloudera Documentation: https://docs.cloudera.com/documentation/enterprise/latest.html
2. MapR Documentation: https://mapr.com/docs/
3. Towards Data Science: https://towardsdatascience.com/tagged/hadoop
Data Management and Visualization Lab

Experiment No.: 5
Case Study on Implementation of Word Count program using
Spark Platform

Experiment No.: 5
1. Aim: Case Study on Implementation of Word Count program using Spark Platform
2. Objectives:
- To learn the implementation of a Word Count program using Apache Spark, demonstrating
its distributed computing capabilities for processing large datasets efficiently.
- Evaluate the performance and scalability of the Spark-based Word Count program,
identifying solutions in its implementation.
3. Outcomes:
- The outcome of the case study showcases the successful implementation of the Word Count
program using Spark, highlighting its efficiency in distributed data processing.
- Insights gained from performance evaluation provide valuable understanding for future
development and optimization of Spark-based applications.

4. Hardware / Software Required: Apache Spark, HDFS

5. Theory: Spark is a unified analytics engine for large-scale data processing, including built-in
modules for SQL, streaming, machine learning, and graph processing. Apache Spark is an open-
source cluster computing framework. Its primary purpose is to handle real-time generated
data. Spark was built on top of Hadoop MapReduce. It is optimized to run in memory,
whereas alternative approaches like Hadoop's MapReduce write data to and from computer hard
drives. As a result, Spark processes data much quicker than the alternatives.

Following is a case study to find the frequency of each word in a particular file.
Here, we use the Scala language to perform Spark operations.

Steps to execute Spark word count example: In this example, we find and display the number of
occurrences of each word.

1. Create a text file in your local machine and write some text into it.

$ nano sparkdata.txt

2. Check the text written in the sparkdata.txt file.

$ cat sparkdata.txt

3. Create a directory in HDFS where the text file will be kept.

$ hdfs dfs -mkdir /spark


4. Upload the sparkdata.txt file on HDFS in the specific directory.

$ hdfs dfs -put /home/codegyani/sparkdata.txt /spark

5. Now, run the following command to open Spark in Scala mode.

$ spark-shell

6. Let's create an RDD by using the following command.

scala> val data=sc.textFile("sparkdata.txt")

7. Here, pass any file name that contains the data. Now, we can read the generated result by using
the following command.

scala> data.collect;

8. Here, we split the existing data in the form of individual words by using the following command.

scala> val splitdata = data.flatMap(line => line.split(" "));


9. Now, we can read the generated result by using the following command.

scala> splitdata.collect;

10. Now, perform the map operation.

scala> val mapdata = splitdata.map(word => (word,1));

Here, we are assigning a value 1 to each word.

11. Now, we can read the generated result by using the following command.

scala> mapdata.collect;

12. Now, perform the reduce operation

scala> val reducedata = mapdata.reduceByKey(_+_);

Here, we are summarizing the generated data.

13. Now, we can read the generated result by using the following command.

scala> reducedata.collect;

Here, we got the desired output.
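For comparison, the same word count can be written in Python with PySpark. The sketch below is a minimal equivalent of the Scala steps above and assumes the input file was uploaded to HDFS at /spark/sparkdata.txt; it is not part of the prescribed Scala procedure.

# A minimal PySpark equivalent of the Scala word count above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

data = sc.textFile("/spark/sparkdata.txt")               # create an RDD from the file
splitdata = data.flatMap(lambda line: line.split(" "))   # split each line into words
mapdata = splitdata.map(lambda word: (word, 1))          # assign a value 1 to each word
reducedata = mapdata.reduceByKey(lambda a, b: a + b)     # sum the counts per word

for word, count in reducedata.collect():
    print(word, count)

spark.stop()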

6. Conclusion:
This case study demonstrates the effectiveness of Apache Spark for implementing the Word
Count program, showcasing its scalability and performance in distributed computing. The
insights gained pave the way for further utilization and optimization of Spark for diverse big
data processing tasks.

7. Viva Questions:

1. Explain the role of Resilient Distributed Datasets (RDDs) in Apache Spark?


2. What are some key advantages of using Apache Spark over traditional MapReduce for
processing large-scale data?
3. How does Apache Spark ensure fault tolerance and data reliability in distributed
computing environments?

8. References:
https://cloudxlab.com/assessment/displayslide/458/apache-spark-streaming-wordcount-hands-on
https://www.digitalocean.com/community/tutorials/apache-spark-example-word-count-program-java

Data Management and Visualization Lab

Experiment No.: 6
Perform Data Extraction and Transformation tasks using Power BI

Experiment No.: 6
1. Aim: To perform Data Extraction and Transformation tasks using Power BI
2. Objectives:
- To extract relevant data from various sources and integrate it into Power BI for analysis.
- To cleanse, transform, and enrich the extracted data to ensure accuracy and suitability for
reporting in Power BI.
3. Outcomes:
- To improve decision-making through deeper insights into operations, customer behavior,
and market trends.
- To enhance reporting and visualization with intuitive dashboards, enabling quick
understanding and informed actions.

4. Hardware / Software Required: Microsoft Power BI

5. Theory:

Power BI is a powerful business analytics tool developed by Microsoft, designed to empower
organizations to visualize and analyze their data effectively. It enables users to connect to various
data sources, from databases to cloud services, and transform raw data into insightful
visualizations and interactive reports. With its intuitive interface and robust features, Power BI
allows users to explore data, uncover trends, and make data-driven decisions. Its flexibility,
scalability, and integration with other Microsoft products make it a popular choice for businesses
of all sizes seeking to harness the full potential of their data. The process of data extraction and
transformation in Power BI follows a structured methodology aimed at harnessing the full
potential of available data for analysis and reporting purposes.
Data Extraction:
• Source Identification: Identifying relevant data sources such as databases, spreadsheets, or
online services based on the requirements of the analysis.
• Data Connection: Establishing connections to the identified sources using Power BI's built-
in connectors or custom connectors to extract raw data.
• Data Loading: Loading the extracted data into Power BI's data model, where it can be
manipulated and analyzed further.
Data Transformation:
• Data Cleansing: Identifying and rectifying any inconsistencies, errors, or missing values in

the extracted data to ensure data accuracy and reliability.
• Data Integration: Combining data from multiple sources into a cohesive dataset, aligning
data structures and formats for seamless analysis.
• Data Enrichment: Enhancing the extracted data by adding calculated columns, derived
metrics, or additional contextual information to provide deeper insights.
Loading Data from various sources
Power BI supports a large range of data sources. You can click Get Data, and it shows you all the
available data connections. It allows you to connect to different flat files, SQL databases, and the Azure
cloud, or even web platforms such as Facebook, Google Analytics, and Salesforce objects. It also
includes an ODBC connection to connect to other ODBC data sources, which are not listed.
✓ Flat Files
✓ SQL Database
✓ OData Feed
✓ Blank Query
✓ Azure Cloud platform
✓ Online Services
✓ Other data sources such as Hadoop, Exchange, or Active Directory
To get data in Power BI desktop, you need to click the Get data option in the main screen. It shows
you the most common data sources first. Then, click the More option to see a full list of available
data sources.

Transformations on Databases in PowerBI
We don’t get the data that we can directly use in the reports in real time. Instead, we have to clean
that data to meet our business standards. We have a query editor within the Desktop to perform all
the needed operations. To get to Power Query Editor, select Transform data from the Home tab of
Power BI Desktop.

The ribbon in Power Query Editor consists of several tabs: Home, Transform, Add Column, View,
Tools, and Help.

Transform Tab: The Transform tab provides access to common data transformation tasks, such
as:
✓ Adding or removing columns
✓ Changing data types
✓ Splitting columns
✓ Other data-driven tasks

We can perform following transformations in PowerBI.


✓ Change the Data type of a Column
✓ Combine Multiple Tables
✓ Enter data or Copy and Paste from Clipboard
✓ Format Dates
✓ Groups
✓ Hierarchies
✓ Joins
✓ Pivot Table
✓ Query Groups
✓ Reorder or Remove Columns
✓ Rename Column Names
✓ Rename Table Names
✓ Split Columns

✓ UnPivot Table

Example: PowerBI Transformation: Change the Data type of a Column


• When you load a table from any data source, Power BI automatically detects the data type of a
column. However, there may be some situations where Power BI might get them wrong.
• For example, it may consider amounts, values, or even dates as the text. In these situations, you
can use Power BI change data types of a column option.
• Step 1: Click Edit Queries option. Clicking Edit Queries option opens a new window
called Power BI Power Query Editor.

• Please select the column for which you want to change the data type. Next, click on the left
corner of the column header (currently it represents ABC text). Clicking in that position opens a
drop-down list of supported data types. Please select the data type that suits your data. Here, we
are selecting the Whole number.

Note: Students should apply any five of the above-listed transformations on the dataset using the Power
Query Editor in Power BI.
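For intuition only, the same kinds of transformations can be expressed in Python with pandas; the sketch below (illustrative, with assumed column names; it is not a Power BI feature) mirrors changing a column's data type, splitting a column, renaming a column, and grouping rows.

# Illustrative pandas analogue of common Power Query transformations (assumed data).
import pandas as pd

df = pd.DataFrame({
    "Amount": ["120", "85", "230"],                # amounts loaded as text
    "FullName": ["Asha Rao", "Ravi Kumar", "Meena Shah"],
    "Region": ["West", "West", "North"],
})

# Change the data type of a column (text -> whole number).
df["Amount"] = df["Amount"].astype(int)

# Split a column into two columns.
df[["FirstName", "LastName"]] = df["FullName"].str.split(" ", n=1, expand=True)

# Rename a column.
df = df.rename(columns={"Region": "SalesRegion"})

# Group rows and aggregate, similar to Power Query's Group By.
summary = df.groupby("SalesRegion", as_index=False)["Amount"].sum()
print(summary)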

6. Conclusion: Using Power BI for data extraction and transformation yields tangible benefits by
facilitating streamlined data analysis and reporting. By leveraging its intuitive interface and
robust features, organizations can harness the full potential of their data assets, driving informed
decision-making and fostering business growth in today's competitive landscape.

7. Viva Questions:
• Can you explain the process of data extraction in Power BI and how it differs from
traditional methods?
• What are the key considerations when selecting data sources for extraction in Power BI?
• How do you ensure data quality and integrity during the extraction and transformation
process in Power BI?
• Can you discuss any challenges you encountered while performing data extraction and
transformation tasks in Power BI and how you addressed them?

8. References: https://learn.microsoft.com/en-us/power-bi/fundamentals/desktop-what-is-desktop
Data Management and Visualization Lab

Experiment No.: 7
Perform Data Visualization tasks and create dashboards using Power BI

Experiment No.: 7
1. Aim: To Perform Data Visualization tasks and create dashboards using Power BI
2. Objectives:
- To Create visually appealing and informative data visualizations that accurately represent
key metrics and trends extracted from various data sources using Power BI.
- To Develop interactive dashboards in Power BI that consolidate multiple visualizations and
provide a comprehensive overview of organizational performance, enabling users to gain
insights and make data-driven decisions efficiently.
3. Outcomes:
• To gain a deeper understanding of complex datasets, enabling users to identify patterns,
correlations, and outliers more easily.
• By creating intuitive dashboards in Power BI, organizations streamline data analysis
processes, reducing the time and effort required to access and interpret key information.

4. Hardware / Software Required: Microsoft Power BI

5. Theory:

Visualizations allow data to be represented in different ways, leading to insights into data
relationships that may not be easily seen. Power BI allows users to create and adjust visualizations
based on their own needs as they look at data. Users will be able to look at data from different
perspectives and find insights into data relationships that help them make better informed
decisions.
Steps of Data Sourcing to Creation of Reports and Dashboards:
The whole process of data sourcing to the creation of reports and dashboards consists of four basic
steps.
1. Data Sourcing in Power BI: Power BI offers a versatile range of data sources, including cloud-
based online services and local files. While there's a 1 GB limit on importing data from online
services, Power BI supports various sources such as Excel, Text/CSV, XML, JSON, Oracle
Database, and Azure SQL Database.
2. Data Transformation in Power BI: Before visualizing the data, a crucial step involves cleaning
and pre-processing. This includes eliminating missing values and irrelevant data from rows
and columns. Adhering to specific rules, datasets are transformed and loaded into the
warehouse for further analysis.
3. Report Development in Power BI: Once data is cleaned and transformed, reports are crafted
based on specific requirements. These reports are essentially data visualizations that
incorporate different filters and constraints. The visual representations can take the form of
graphs, pie charts, and other graphical elements.
4. Dashboard Creation in Power BI: Power BI dashboards are built by pinning independent
elements from live reports. This process occurs after publishing the report to the Power BI
service. The saved reports retain their filter settings, allowing users to create dynamic
dashboards with real-time data insights.

Create Reports and Dashboards


In Power BI Desktop Report view, you can build visualizations and reports. The Report view has
seven main areas:
1. The ribbon at the top, which displays common tasks associated with reports and
visualizations.
2. The canvas area in the middle, where you create and arrange visualizations.
3. The pages tab area at the bottom, which lets you select or add report pages.

4. The Filters pane, where you can filter data visualizations.
5. The Visualizations pane, where you can add, change, or customize visualizations, and
apply drillthrough.
6. The Format pane, where you design the report and visualizations.
7. The Fields pane, which shows the available fields in your queries. You can drag these
fields onto the canvas, the Filters pane, or the Visualizations pane to create or modify
visualizations.
Visualizations
1. The Fields option in the Visualization pane lets you
drag data fields to Legend and other field wells in the
pane.
2. The Format option lets you apply formatting and
other controls to visualizations.
3. The icons show the type of visualization created.
You can change the type of a selected visualization by
selecting a different icon, or create a new visualization
by selecting an icon with no existing visualization
selected.
Power BI offers the functionality to visually represent
our data or a subset of it so that it can be used to draw
inferences or gain a deeper understanding of the data.
These visuals can be bar graphs, pie charts, etc.
Following are some examples of basic visual options provided in Power BI:
Card – It is used to represent a single value, such as Total Sales.
Stacked bar/column chart – Combines a line chart (which joins points representing some values
with a line) and a bar/column chart (which represents a value against a category and other optional
fields).
Waterfall chart – It represents a continuously changing value, where an increase or decrease in
value may be represented by differently colored bars.
Pie chart – It represents the fractional value of each category of a particular field.
Map – It is used to represent different information on a map.
KPI – It represents the continuous progress made towards a target.
Slicer – A slicer has options representing different categories of a field. Selecting that category
shows only the information specific to that category in other visuals.
Table – A table represents data in tabular form, i.e. rows and columns.

Note: Students should select any dataset and create meaningful reports by exploring the various
visual tools available in Power BI.

6. Conclusion: Data visualization helps to create clear and compelling visual representations of
data, enhancing comprehension and aiding decision-making. By adhering to principles of
perception, cognition, and ethical practice, practitioners can create visualizations that effectively
communicate insights and drive actionable outcomes.

7. Viva Questions:
• How does Power BI utilize principles of data visualization theory to enhance the
effectiveness of its visualizations?
• Can you discuss how interactive features in Power BI dashboards improve user engagement
and facilitate data exploration?
• What steps did you take to ensure that your Power BI visualizations adhere to ethical
considerations, such as accuracy and transparency in data representation?

8. References: https://learn.microsoft.com/en-us/power-bi/fundamentals/desktop-what-is-desktop

Data Management and Visualization Lab

Experiment No.: 8
Integrate Snowflake with Power BI and provide insights

Experiment No.: 8
1. Aim: To Integrate Snowflake with Power BI and provide insights
2. Objectives:
• Establish a seamless integration between Snowflake and Power BI to enable efficient data transfer and
create visually appealing dashboards.
• Utilize Power BI's analytical capabilities to extract meaningful insights from Snowflake data,
including identifying trends and correlations.
• Evaluate the performance of the Snowflake-Power BI integration in terms of data processing speed,
query efficiency, and overall system responsiveness.
3. Outcomes:
• To connect Snowflake data to Power BI, import relevant datasets, and create interactive
visualizations such as charts, graphs, and maps within Power BI dashboards.
• Extract actionable insights from Snowflake data using Power BI's analytical capabilities,
helping in informed decision-making.

4. Hardware / Software Required: Snowflake and Microsoft Power BI

5. Theory:

Snowflake is a cloud-based data warehousing platform that offers scalability, flexibility, and
performance for storing and analyzing large volumes of data. It uses a unique architecture that
separates storage and compute resources, allowing users to scale each independently based on
their needs. Snowflake supports various data types, including structured, semi-structured, and
unstructured data, making it suitable for diverse data analytics tasks.

Integrating Snowflake with Power BI, a leading business intelligence tool by Microsoft, provides
organizations with powerful capabilities for data visualization, reporting, and analysis. Power BI
allows users to connect directly to Snowflake data warehouses, enabling real-time or near-real-
time access to updated data for decision-making.
Steps to connect Snowflake data warehouse from Power Query Desktop
To make the connection to a Snowflake computing warehouse, take the following steps:
1. Select Get Data from the Home ribbon in Power BI Desktop, select Database from the
categories on the left, select Snowflake, and then select Connect.

2. In the Snowflake window that appears, enter the name of your Snowflake server in Server and
the name of your Snowflake computing warehouse in Warehouse.

3. Optionally, enter values in any advanced options that you want to use to modify the connection
query, such as a text value to use as a Role name or a command timeout.
4. Select OK.
5. To sign in to your Snowflake computing warehouse, enter your username and password, and then
select Connect.

6. In Navigator, select one or multiple elements to import and use in Power BI Desktop. Then select
either Load to load the table in Power BI Desktop, or Transform Data to open the Power Query
Editor where you can filter and refine the set of data you want to use, and then load that refined
set of data into Power BI Desktop.

7. Select Import to import data directly into Power BI, or select DirectQuery, then select OK.

Steps to get the Server name from Snowflake:


Step-1: Go to Snowflake and make a note of your server, database, and warehouse name.
For Server name:
In the Snowflake web UI, navigate to the Accounts tab and copy the URL from there.

For DB and Warehouse name
The screenshot provided below is for your reference.
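Optionally, before configuring the Power BI connector, the noted server, warehouse, and database names can be sanity-checked with a short Python script using the snowflake-connector-python package (this is separate from Power BI, and all values in the sketch are placeholder assumptions).

# Optional sketch to confirm Snowflake connection details before using them in Power BI.
# All values below are placeholder assumptions; replace them with your own account details.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.ap-south-1",   # taken from the server URL noted on the Accounts tab
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    warehouse="COMPUTE_WH",
    database="YOUR_DB",
)
cur = conn.cursor()
cur.execute("SELECT CURRENT_ACCOUNT(), CURRENT_WAREHOUSE(), CURRENT_DATABASE()")
print(cur.fetchone())
cur.close()
conn.close()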

Note: Students should select any dataset from Snowflake, integrate it into Power BI, and create a
meaningful dashboard.

6. Conclusion: Snowflake and Power BI work together seamlessly to give businesses easy
access to data and powerful tools for analysis, helping them make better decisions and
succeed.

7. Viva Questions:
• What are the benefits of using Power BI's drag-and-drop interface for creating visualizations
compared to traditional methods?
• How does Snowflake's separation of storage and compute resources contribute to cost-
effectiveness and flexibility for businesses?
• Can you explain a scenario where the integration of Snowflake with Power BI resulted in
significant insights or improvements for a specific business process?

8. References: https://learn.microsoft.com/en-us/power-query/connectors/snowflake

Data Management and Visualization Lab

Experiment No.: 09
Case study on Cloud Services: Azure, AWS, GCP

Experiment No.: 09
1. Aim: Case study on Cloud Services: Azure, AWS, GCP
2. Objectives:
- Compare the performance metrics of Azure, AWS, and GCP under various workloads to
determine which platform offers better speed, scalability, and reliability.
- Conduct a comprehensive cost analysis of utilizing Azure, AWS, and GCP services.
3. Outcomes:
- To identify the strengths and suitability of Azure, AWS, and GCP for various workloads based
on speed, scalability, and reliability.
- Analyze the cost-effectiveness of each cloud platform based on the identified performance
metrics
4. Hardware / Software Required: NA

5. Theory: In today's digital era, organizations increasingly rely on cloud computing services to host
applications, store data, and streamline operations. Three major players dominate the cloud market:
Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP). Each offers
a vast array of services tailored to diverse business needs. This case study aims to compare these
cloud platforms comprehensively, focusing on performance, cost, and suitability for various
workloads.
1. Azure (Microsoft Azure): Azure is a comprehensive cloud computing platform offered by
Microsoft. It provides a wide range of services, including computing, storage, networking,
databases, analytics, and AI.

Key Features:

• Azure Virtual Machines: On-demand scalable computing resources.

• Azure Blob Storage: Object storage service for storing and accessing large
amounts of unstructured data.

• Azure App Service: Platform as a Service (PaaS) offering for building, deploying,
and managing web and mobile applications.

• Azure SQL Database: Fully managed relational database service.

• Azure AI and Machine Learning: Tools and services for building and deploying AI
and machine learning models.

Use Cases: Azure is popular among enterprises for hosting applications, websites,
databases, and for running data analytics and AI workloads. It caters to a wide range of
industries, including healthcare, finance, manufacturing, and government sectors.

2. AWS (Amazon Web Services): AWS is a leading cloud computing platform provided by
Amazon. It offers a broad set of global compute, storage, database, analytics, and machine
learning services, as well as Internet of Things (IoT) and security solutions.

Key Features:

• Amazon EC2: Scalable virtual servers in the cloud.

• Amazon S3: Scalable object storage service for storing and retrieving data.

• Amazon RDS: Managed relational database service for deploying and scaling
databases.

• Amazon Lambda: Serverless computing service for running code without provisioning or managing servers.

• Amazon SageMaker: Fully managed service for building, training, and deploying
machine learning models.

Use Cases: AWS is widely used by startups, enterprises, and government organizations
for various purposes, including website hosting, mobile app development, data storage
and analytics, IoT applications, and machine learning.

3. GCP (Google Cloud Platform): GCP is a suite of cloud computing services provided by
Google. It offers infrastructure as a service (IaaS), platform as a service (PaaS), and software
as a service (SaaS) solution, along with data storage, analytics, machine learning, and
networking services.

Key Features:

• Google Compute Engine: Virtual machines on Google's infrastructure.

• Google Cloud Storage: Scalable object storage with high availability and global
edge-caching.

• Google Cloud SQL: Fully managed relational database service for MySQL,
PostgreSQL, and SQL Server.

• Google Kubernetes Engine (GKE): Managed Kubernetes service for container orchestration.

• Google AI and Machine Learning: Suite of AI and machine learning tools, including TensorFlow and AutoML.

Use Cases: GCP is favored by organizations looking for scalable and cost-effective cloud
solutions. It is particularly popular among companies in the technology, gaming, media, and
retail industries. GCP is known for its strong capabilities in data analytics, machine learning,
and containerization.

4. Comparative analysis of Azure, AWS, and GCP:


Computing Services:
• Azure: Virtual Machines (VMs), Azure Kubernetes Service (AKS), Azure Functions (serverless)
• AWS: Elastic Compute Cloud (EC2), Elastic Kubernetes Service (EKS), AWS Lambda (serverless)
• GCP: Compute Engine (VMs), Google Kubernetes Engine (GKE), Cloud Functions (serverless)

Storage Services:
• Azure: Azure Blob Storage, Azure Files (file storage), Azure Disk Storage (block storage)
• AWS: Amazon Simple Storage Service (S3), Amazon Elastic File System (EFS), Amazon Elastic Block Store (EBS)
• GCP: Google Cloud Storage, Cloud Filestore (file storage), Persistent Disk (block storage)

Database Services:
• Azure: Azure SQL Database, Azure Cosmos DB (NoSQL), Azure Database for MySQL/PostgreSQL
• AWS: Amazon Relational Database Service (RDS), Amazon DynamoDB (NoSQL), Amazon Aurora (MySQL/PostgreSQL)
• GCP: Cloud SQL (managed databases), Bigtable (NoSQL), Firestore (NoSQL)

Networking Services:
• Azure: Azure Virtual Network, Azure Load Balancer, Azure Application Gateway, Azure VPN Gateway
• AWS: Amazon Virtual Private Cloud (VPC), Elastic Load Balancing (ELB), AWS Direct Connect (dedicated network), AWS Transit Gateway (hub-and-spoke)
• GCP: Virtual Private Cloud (VPC), Cloud Load Balancing, Cloud Interconnect (dedicated network), Cloud VPN (virtual private network)

AI/ML Services:
• Azure: Azure Machine Learning, Azure Cognitive Services, Azure Databricks (data analytics), Azure Synapse Analytics (big data)
• AWS: Amazon SageMaker, AWS AI/ML services, Amazon Redshift (data warehousing), Amazon EMR (Hadoop, Spark)
• GCP: Google AI Platform, AI Hub (pre-trained models), BigQuery (data analytics), Dataflow (stream and batch processing)

Additional Services:
• Azure: Azure DevOps, Azure Active Directory, Azure Monitor (monitoring), Azure Security Center, Azure Backup
• AWS: AWS CloudFormation, AWS Identity and Access Management (IAM), Amazon CloudWatch (monitoring), AWS Inspector (security), AWS Backup (data protection)
• GCP: Cloud Build (continuous integration), Cloud IAM (identity and access management), Stackdriver (monitoring), Cloud Security Scanner (security), Cloud Armor (DDoS protection)

5. Comparison of pricing for Azure, AWS, and GCP across some common services:
Virtual Machines (VMs):
• Azure: Pay-as-you-go pricing, reserved instances
• AWS: Pay-as-you-go pricing, reserved instances
• GCP: Pay-as-you-go pricing, sustained use discounts

Storage (e.g., S3, Blob):
• Azure: Pay-as-you-go pricing, storage tiers
• AWS: Pay-as-you-go pricing, storage classes
• GCP: Pay-as-you-go pricing, multi-regional storage

Database Services:
• Azure: Pay-as-you-go pricing, reserved capacity
• AWS: Pay-as-you-go pricing, reserved instances
• GCP: Pay-as-you-go pricing, committed use discounts

Networking:
• Azure: Pay-as-you-go pricing, VPN Gateway
• AWS: Pay-as-you-go pricing, Direct Connect
• GCP: Pay-as-you-go pricing, Cloud Interconnect

AI/ML Services:
• Azure: Pay-as-you-go pricing, usage-based
• AWS: Pay-as-you-go pricing, usage-based
• GCP: Pay-as-you-go pricing, preemptible VMs

6. Conclusion: The case study reveals Azure's hybrid integration, AWS's vast service portfolio,
and GCP's innovative pricing, aiding organizations in cloud platform selection based on
individual needs and preferences.

7. Viva Questions:
• How do the hybrid cloud integration capabilities of Azure differentiate it from AWS and
GCP?
• How do the pricing models of Azure, AWS, and GCP impact the cost-effectiveness of
using each platform?
• What are some key factors organizations should consider when selecting a cloud
platform, and how do Azure, AWS, and GCP address these considerations differently?

8. References:
• https://www.digitalocean.com/resources/article/comparing-aws-azure-gcp#use-cases-for-aws-vs-azure-vs-gcp
• https://www.veritis.com/blog/aws-vs-azure-vs-gcp-the-cloud-platform-of-your-choice/

Data Management and Visualization Lab

Experiment No.: 10
Perform Data Visualization tasks and create dashboards using Tableau

Experiment No.: 10
1. Aim: To Perform Data Visualization tasks and create dashboards using Tableau
2. Objectives:
- To create visually appealing and informative data visualizations that accurately represent key
metrics and trends extracted from various data sources using Tableau.
- To develop interactive dashboards in Tableau that consolidate multiple visualizations and
provide a comprehensive overview of organizational performance, enabling users to gain
insights and make data-driven decisions efficiently.
3. Outcomes:
- To gain a deeper understanding of complex datasets, enabling users to identify patterns,
correlations, and outliers more easily.
- By creating intuitive dashboards in Tableau, organizations streamline data analysis
processes, reducing the time and effort required to access and interpret key information.

4. Hardware / Software Required: Tableau

5. Theory:

Tableau is an excellent data visualization and business intelligence tool used for reporting and
analyzing vast volumes of data. It is an American company that started in 2003—in June 2019,
Salesforce acquired Tableau. It helps users create different charts, graphs, maps, dashboards, and
stories for visualizing and analyzing data, to help in making business decisions.
- Tableau supports powerful data discovery and exploration that enables users to answer
important questions in seconds
- No prior programming knowledge is needed; users without relevant experience can start
immediately with creating visualizations using Tableau
- It can connect to several data sources that other BI tools do not support. Tableau enables
users to create reports by joining and blending different datasets
- Tableau Server supports a centralized location to manage all published data sources within
an organization

Working with Tableau Desktop
1. Connect to your Data
The start page gives you several options to choose from:
1.1 Tableau icon. Click the Tableau logo icon in the upper left corner of any page to toggle between the start page and the authoring workspace.
1.2 Connect pane. Under Connect, you can:
- Connect to data that is stored in a file, such as Microsoft Excel, PDF, Spatial files, and
more.
- Connect to data that is stored on Tableau Server, Microsoft SQL Server, Google
Analytics, or another server.
- Connect to a data source that you’ve connected to before.

Tableau can connect to data stored in a wide variety of places. The Connect pane lists the most common places that you might want to connect to, or you can click the More links to see additional options.
1.3. Under Accelerators, see a selection of Accelerators and the sample workbooks that come
with Tableau Desktop. Prior to 2023.2, these were only sample workbooks.
1.4. Under Open, you can open workbooks that you've already created.
1.5. Under Discover, find additional resources like video tutorials, forums, or the “Viz of the
week” to get ideas about what you can build.
In the Connect pane, under Saved Data Sources, click Sample - Superstore to connect to the
sample data set.

After you select Sample - Superstore, your screen will look something like this:

The Sample - Superstore data set comes with Tableau. It contains information about products,
sales, profits, and so on that you can use to identify key areas for improvement within this
fictitious company.
2. Creating Reports
1. From the Data pane, drag Order Date to the Columns shelf.
Note: When you drag Order Date to the Columns shelf, Tableau creates a column for each year in your data set. Under each column is an Abc indicator. This indicates that you can drag text or numerical data here, like what you might see in an Excel spreadsheet. If you were to drag Sales to this area, Tableau would create a crosstab (like a spreadsheet) and display the sales totals for each year.
2. From the Data pane, drag Sales to the Rows shelf.
Tableau generates the following chart with sales rolled up as a sum (aggregated). You can
see total aggregated sales for each year by order date.

When you first create a view that includes time (in this case Order Date), Tableau
automatically generates a line chart.
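Behind this view, Tableau wraps the date in a date-part function and aggregates the measure. As a minimal sketch (the field names follow the Sample - Superstore data set; the calculated-field names are hypothetical), the same behaviour could be made explicit with two calculated fields:

    // Hypothetical field "Order Year": the discrete year Tableau derives for the Columns shelf
    YEAR([Order Date])

    // Hypothetical field "Total Sales": the aggregation Tableau applies to Sales on the Rows shelf
    SUM([Sales])

Dragging these explicit fields to the Columns and Rows shelves should reproduce the same line chart as the automatic behaviour described above.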
3. Refine your View
1. From the Data pane, drag Category to the Columns shelf and place it to the right of
YEAR(Order Date).
Your view updates to a bar chart. By adding a second discrete dimension, you categorize your data into discrete chunks instead of looking at it continuously over time, showing overall sales for each product category by year.

2. Double-click or drag Sub-Category to the Columns shelf.

Note: You can drag and drop or double-click a field to add it to your view, but be careful.
Tableau makes assumptions about where to add that data, and it might not be placed where
you expect. You can always click Undo to remove the field, or drag it off the area where
Tableau placed it to start over.
Sub-Category is another discrete field. It creates another header at the bottom of the view, and
shows a bar for each sub-category (68 marks) broken down by category and year.

4. Add filters to your view


You can use filters to include or exclude values in your view. In this example, you decide to add two simple filters to your worksheet to make it easier to look at product sales by sub-category for a specific year.
1. In the Data pane, right-click Order Date and select Show Filter.
2. Repeat the step above for the Sub-Category field.
The filters are added to the right side of your view in the order that you selected them. Filters are a type of card and can be moved around on the canvas by clicking and dragging them to another location in the view. As you drag a filter, a line appears that shows you where you can drop it.
Note: The Get Started tutorial uses the default position of the filter cards.
More on filtering is available in Tableau's online Learning Library.
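A filter condition can also be expressed as a calculated field instead of a quick-filter card. A minimal sketch, assuming the Sample - Superstore field names and an arbitrary two-year window (the field name "Recent Orders" is hypothetical):

    // Hypothetical boolean field "Recent Orders": keeps only the two most recent order years in the data
    YEAR([Order Date]) >= YEAR({ MAX([Order Date]) }) - 1

Dropping such a field on the Filters shelf and keeping True has roughly the same effect as selecting the latest years in the Order Date quick filter.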

5. Add color to your view


Adding filters helps you sort through all of this data, but at this point every bar is the same shade of blue. It's time to do something about that.
Currently, you are looking at sales totals for your various products. You can see that some products have consistently low sales and might be good candidates for reduced sales effort. But what does overall profitability look like for your different products?
Drag Profit to color to see what happens.
From the Data pane, drag Profit to Color on the Marks card.
By dragging profit to color, you now see that you have negative profit in Tables, Bookcases, and
even Machines. Another insight is revealed!

Note: Tableau automatically added a color legend and assigned a diverging color palette because
your data includes both negative and positive values.
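If a simple two-color flag is preferred over the continuous diverging palette, a calculated field can be dragged to Color instead of the raw Profit measure. A minimal sketch (the field name "Profitability Flag" is hypothetical):

    // Hypothetical field "Profitability Flag": labels each mark by the sign of its aggregated profit
    IF SUM([Profit]) < 0 THEN "Unprofitable" ELSE "Profitable" END

Because the expression is aggregated, Tableau evaluates it once per mark, that is, per sub-category, category, and year combination in this view.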

Visualization in Tableau
In Tableau, data visualization is a central aspect of its functionality. Tableau provides a user-friendly interface and a wide range of tools to create interactive and insightful visualizations from various data sources.
Visualization Types:
• Charts and Graphs: Tableau offers a variety of visualization types including bar charts, line
charts, scatter plots, heat maps, pie charts, histograms, treemaps, geographic maps, and more.
• Dashboard Layouts: Users can create interactive dashboards by combining multiple
visualizations and designing custom layouts to present insights effectively.
Interactivity and Drill-Down:
• Filters and Parameters: Tableau provides filters and parameters to control data displayed in
visualizations dynamically, allowing users to drill down into specific details and explore data
interactively.
• Actions: Users can create interactive actions between visualizations, enabling cross-filtering,
highlighting, and linking behaviors to enhance user experience.
Calculations and Expressions:
• Calculated Fields: Users can create calculated fields using formulas, functions, and logical expressions to perform custom calculations within Tableau (a short sketch follows this list).
• Table Calculations: Tableau offers table calculations for computing values across rows, columns, or specific dimensions within visualizations.
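As a small illustration of both ideas, the sketches below use Sample - Superstore field names; the field name "Profit Ratio" and the running-total calculation are illustrative examples, not part of the tutorial itself:

    // Hypothetical calculated field "Profit Ratio": aggregate profit divided by aggregate sales
    SUM([Profit]) / SUM([Sales])

    // Hypothetical table calculation: a running total of sales across the dimensions in the view
    RUNNING_SUM(SUM([Sales]))

Table calculations such as RUNNING_SUM operate on the aggregated results already in the view, and their direction (across rows, columns, or panes) can be changed from the field's context menu.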
Mapping and Geospatial Analysis:
• Geographic Mapping: Tableau supports geographic mapping to visualize data on maps,
including custom geocoding, layers, and spatial data integration.
• Spatial Analysis: Users can perform spatial analysis tasks such as distance calculations,
clustering, and territory mapping for geospatial insights.
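As a sketch of the functions behind these features, the expressions below assume a hypothetical data source with [Latitude] and [Longitude] fields and an arbitrary reference point; MAKEPOINT and DISTANCE exist in recent Tableau releases, but exact availability depends on the version:

    // Hypothetical field "Store Location": builds a spatial point from latitude/longitude columns
    MAKEPOINT([Latitude], [Longitude])

    // Hypothetical field "Distance to Reference (km)": distance between a store and a fixed point
    DISTANCE([Store Location], MAKEPOINT(19.0760, 72.8777), 'km')

The spatial field can be used as a map layer, while the numeric distance can be used for coloring or filtering marks.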
Advanced Analytics:
• Predictive Modeling: Tableau integrates with statistical tools and algorithms for predictive
modeling, forecasting, trend analysis, and what-if scenarios.
• R and Python Integration: Users can leverage R and Python scripts within Tableau for advanced
analytics, statistical calculations, and custom visualizations.
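For the Python integration mentioned above, Tableau's SCRIPT_REAL function sends aggregated data to an external TabPy (analytics extension) service. A minimal sketch, assuming a TabPy connection has already been configured in Tableau Desktop and that NumPy is installed on the TabPy server:

    // Correlation between sales and profit, computed by an external Python (TabPy) service
    SCRIPT_REAL("import numpy as np; return float(np.corrcoef(_arg1, _arg2)[0, 1])",
        SUM([Sales]), SUM([Profit]))

Because SCRIPT_REAL is evaluated as a table calculation, the result depends on how the view is partitioned, and the field only computes while the analytics extension connection is active.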

6. Conclusion: Tableau offers robust data visualization capabilities, enabling users to create interactive and insightful visualizations from diverse data sources, facilitating data-driven decision-making and enhancing communication of key insights.
7. Viva Questions:
• Can you explain the process of creating a dashboard in Tableau from multiple data sources?
• How does Tableau handle data connections and data preparation tasks?
• Discuss Tableau's capabilities for connecting to various data sources, data blending
techniques, and data cleansing features.
• Describe the role of calculated fields and expressions in Tableau.
• Provide examples of how calculated fields are used to perform advanced calculations, create
custom metrics, and enhance visualizations.
8. References:
• https://help.tableau.com/current/guides/get-started-tutorial/en-us/get-started-tutorial-home.htm

