This pipeline covers Data Loading, Data Analysis & Visualization: loading data from MySQL into the Hadoop Distributed File System (HDFS), analysing it with Spark, storing the results in Hive tables, and finally importing them into Tableau to build dashboards that surface meaningful insights.
The idea of using pictures and graphs to understand data has been around for many years. As data volumes grow day by day, it becomes a challenge to visualize that data and deliver useful results in less time. Data visualization comes to the rescue: it conveys concepts in a universal way and makes it easy to experiment with different scenarios through slight adjustments. Data visualization is the process of presenting information in a graphical or pictorial format, which helps decision makers analyze data more easily.
- Data visualization does not just make data look better; it also provides insight into complex data sets by communicating their key aspects in more meaningful ways.
- Helps in identifying areas that need attention or improvement.
- Clarifies which factors influence customer behavior
- Helps to understand which fields to place where
- Helps to predict scenarios and more
The goal is to find the profit gained from every customer in each region, broken down by country and state.
The dataset consists of CSV files containing many columns of orders data.
Best suited technologies:
- MySQL
- Sqoop
- Apache Spark
- Apache Hive
- Tableau (Visualisation)
** WebConsole Login Credentials **
Username: support1161
Password : **********
** MYSQL Login Credentials **
HostName: ip-172-31-20-247
Username: sqoopuser
Password: ***********
- Use the command below to connect to the MySQL server:
mysql -h ip-172-31-20-247 -u sqoopuser -p
- Create a table in MySQL using the command below:
create table pipeline(Row_ID int not null, Order_ID int not null, Order_Date date, Ship_Date date, Ship_Mode varchar(20) default 'new', Customer_ID varchar(20) default 'new', Customer_Name varchar(20) default 'new', Segment varchar(20) default 'new', City varchar(20) default 'new', State varchar(20) default 'new', Country varchar(20) default 'new', Postal_Code varchar(20) default 'new', Market varchar(20) default 'new', Region varchar(20) default 'new', Product_ID varchar(20) default 'new', Category varchar(20) default 'new', Sub_Category varchar(20) default 'new', Product_Name varchar(20) default 'new', Sales varchar(20) default 'new', Quantity varchar(20) default 'new', Discount varchar(20) default 'new', Profit varchar(20) default 'new');
- Load the orders data into the MySQL table using the command below:
LOAD DATA LOCAL INFILE "orders.csv" INTO TABLE pipeline COLUMNS TERMINATED BY ',' LINES TERMINATED BY '\n' (Row_ID, Order_ID, @var1, @var2, Ship_Mode, Customer_ID, Customer_Name, Segment, City, State, Country, Postal_Code, Market, Region, Product_ID, Category, Sub_Category, Product_Name, Sales, Quantity, Discount, Profit) SET Order_Date = STR_TO_DATE(@var1, '%m/%d/%Y'), Ship_Date = STR_TO_DATE(@var2, '%m/%d/%Y');
// The STR_TO_DATE calls above parse the date strings so the table stores proper dates instead of null values
Command to store the MySQL table into HDFS
There are two cases here:
Case 1: Store the data in HDFS and then read the part-m file that Sqoop creates by default
Case 2: Convert the data to Parquet format while storing it in HDFS
Use the Sqoop command below to store the MySQL data into HDFS:
sqoop import --connect "jdbc:mysql://ip-172-31-20-247:3306/sqoopex" --table pipeline --username sqoopuser -P --target-dir /user/support1161/Divyansh -m 1
** SPARK CODE **
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
// spark-shell already provides sc and spark; HiveContext gives access to the Hive metastore
val hiveContext: SQLContext = new HiveContext(sc)
// Sqoop's text output has no header line, so header is set to false
// (columns get default names _c0, _c1, ... unless renamed)
val df = spark.read.option("header", "false").option("inferSchema", "true").csv("/user/support1161/Divyansh/part-m-00000")
df.show()
// Register the DataFrame as a temp view and persist it as a Hive table
df.createOrReplaceTempView("orders")
hiveContext.sql("use default")
hiveContext.sql("create table djjj as select * from orders")
hiveContext.sql("select * from djjj").show(2)
Use the Sqoop command below to store the MySQL data into HDFS in Parquet format:
sqoop import --connect "jdbc:mysql://ip-172-31-20-247:3306/sqoopex" --table pipeline --username sqoopuser -P --target-dir /user/support1161/Divya --as-parquetfile -m 1
** SPARK CODE **
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
// spark-shell already provides sc, so there is no need to build a new SparkConf/SparkContext;
// in a standalone application you would create them as shown below
// val conf = new SparkConf().setMaster("local").setAppName("HiveContext")
// val sc = new SparkContext(conf)
val hiveContext: SQLContext = new HiveContext(sc)
// Point Spark at the Hive metastore (open Ambari and find the Hive metastore host there)
hiveContext.setConf("hive.metastore.uris", "thrift://ip-172-31-20-247.ec2.internal:9083")
hiveContext.tables("default").show
hiveContext.sql("use default")
// Read the whole Sqoop output directory; do not point at an individual Parquet file
val orders = hiveContext.read.parquet("/user/support1161/Divya")
orders.write.format("orc").mode(SaveMode.Append).saveAsTable("dee")
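A quick sanity check can be run from the same shell; a small sketch, assuming the table name dee from the step above:
// Confirm the schema and a few rows of the ORC-backed Hive table
hiveContext.table("dee").printSchema()
hiveContext.sql("select * from dee").show(5)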
// To connect Spark directly to MySQL, a MySQL JDBC connector is needed. Download it from: https://mvnrepository.com/artifact/mysql/mysql-connector-java/5.1.36
After downloading it, use the command below to start the Spark shell with the connector on the classpath:
spark-shell --driver-class-path /home/support1161/mysql-connector-java-5.1.36.jar --jars /home/support1161/mysql-connector-java-5.1.36.jar
Once inside the Spark shell, run the code below:
** SPARK CODE **
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
// spark-shell already provides sc; create a SparkConf/SparkContext only in a standalone application
// val conf = new SparkConf().setMaster("local").setAppName("HiveContext")
// val sc = new SparkContext(conf)
val hiveContext: SQLContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.uris", "thrift://ip-172-31-20-247.ec2.internal:9083")
// JDBC connection properties for MySQL
val prop = new java.util.Properties
prop.put("user", "sqoopuser")
prop.put("password", "****") // replace with your password
prop.put("driver", "com.mysql.jdbc.Driver") // Spark's JDBC reader expects the property key "driver"
val uri = "jdbc:mysql://ip-172-31-20-247:3306/sqoopex"
val table = "pipeline"
// If any of these values come back null, re-run the property assignments before reading
// Read the MySQL table over JDBC and persist it as an ORC-backed Hive table
val orders = hiveContext.read.jdbc(uri, table, prop)
orders.createOrReplaceTempView("orders")
hiveContext.sql("select * from orders").write.format("orc").mode(SaveMode.Append).saveAsTable("dd")
// The tables now exist in Hive; the next task is visualisation with Tableau.
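Before moving to Tableau, the profit question from the objective can already be answered in the shell. A minimal sketch, assuming the Hive table created above is named dd and that Profit was loaded as a string (per the CREATE TABLE statement), so it is cast before aggregating:
// Total profit per customer for each region, broken down by country and state
hiveContext.sql("SELECT Region, Country, State, Customer_Name, SUM(CAST(Profit AS DOUBLE)) AS Total_Profit FROM dd GROUP BY Region, Country, State, Customer_Name ORDER BY Total_Profit DESC").show(20)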
Download Tableau from the link below: https://www.tableau.com/products/trial
Then, to connect Tableau with Hive, download the Hive ODBC connector from the link below: https://hortonworks.com/downloads/#addons
On that page, go to the HDP add-ons section and download & install the Hortonworks ODBC Driver for Apache Hive (v2.1.16).
Finally, connect the Hive server with Tableau and build the visualisations as below: