HCIA-Big Data
Lab Guide for Big Data
Engineers
ISSUE:3.0
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of
their respective holders.
Notice
The purchased products, services and features are stipulated by the contract made
between Huawei and the customer. All or part of the products, services and features
described in this document may not be within the purchase scope or the usage scope.
Unless otherwise specified in the contract, all statements, information, and
recommendations in this document are provided "AS IS" without warranties,
guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has
been made in the preparation of this document to ensure accuracy of the contents, but
all statements, information, and recommendations in this document do not constitute
a warranty of any kind, express or implied.
1.1 Introduction
This document uses HUAWEI CLOUD MapReduce Service (MRS) as the exercise
environment to guide trainees through related tasks and help them understand how to
use big data components of MRS.
1.3 Precautions
A HUAWEI CLOUD account and real-name authentication are required.
It is recommended that each trainee use a separate exercise environment so that trainees do not affect each other.
1.4 References
To obtain the MapReduce help documents, visit https://support.huaweicloud.com/intl/en-
us/mrs/index.html.
Figure 1-1
2 HDFS Practice
2.1 Background
HDFS is the basis of big data components. Hive data, MapReduce and Spark computing
data, and regions of HBase are all stored in HDFS. On the HDFS shell client, you can
perform various operations, such as uploading, downloading, and deleting data, and
managing file systems. Learning HDFS will help you better understand and master big
data knowledge.
2.2 Objectives
Understand common HDFS operations.
Understand HDFS management operations.
2.3 Tasks
2.3.1 Task 1: Understanding Common HDFS Commands
Run the following command to set environment variables before running commands to
operate the MRS components:
source /opt/client/bigdata_env
Figure 2-1
Check how to use the ls command.
Figure 2-2
Step 2 Run the ls command.
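For example, a command of the following form lists the HDFS root directory (the path is an assumption):
hdfs dfs -ls /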
Figure 2-3
Step 3 Run the mkdir command.
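For example, the directory used in later steps can be created as follows (the -p option creates parent directories if needed):
hdfs dfs -mkdir -p /user/stu01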
Figure 2-4
Step 4 Run the put command.
This command is used to upload a file in the Linux system to a specified HDFS directory.
Before executing this command, run the following command to edit a file in a local Linux
host:
vi stu01.txt
Figure 2-5
Press i to enter the editing mode, enter the content, and press Esc. Then, press Shift and
a colon (:), and enter wq to save the settings and exit. The following is a file content
example:
Figure 2-6
Run the hdfs dfs -put stu01.txt /user/stu01/ command to upload the file.
Figure 2-7
Run the ls command to check whether the stu01.txt file has been uploaded to the
/user/stu01 directory.
Figure 2-8
Step 6 Run the text command.
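This command outputs an HDFS file in text format. For example, assuming the file uploaded above:
hdfs dfs -text /user/stu01/stu01.txt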
Figure 2-9
Step 7 Run the moveFromLocal command.
This command is used to cut and paste data from the local PC to HDFS.
Run the vi command to create a data file, for example, stu01_2.txt, on a local Linux host.
The following figure shows the content of the file.
Figure 2-10
Run the following command:
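The command shown in the figure is presumably of the following form:
hdfs dfs -moveFromLocal stu01_2.txt /user/stu01/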
Figure 2-11
The stu01_2.txt file has been uploaded to the /user/stu01 directory on the HDFS. Run the
ls command to check the Linux local host. The stu01_2.txt file does not exist, indicating
that the file is cut and pasted to the destination HDFS directory. Comparatively, after the
put command is executed, the local file is only copied to the HDFS and still exists on the
Linux host.
Figure 2-12
Step 8 Run the appendToFile command.
This command is used to append the content of a local file to a file in the HDFS. Run the following command to append the content of the local stu01_3.txt file to the stu01_2.txt file in the HDFS:
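A command of the following form could be used (assuming stu01_3.txt has been created locally with the vi command):
hdfs dfs -appendToFile stu01_3.txt /user/stu01/stu01_2.txt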
Run the cat command to view the result. The following information is displayed:
Figure 2-13
Step 9 Run the cp command.
This command is used to copy a file from one HDFS path to another HDFS path.
Run the vi command to edit the stu01_4.txt file on a local Linux host, and run the put
command to upload the file to the HDFS root directory, as shown in the following figure:
Figure 2-14
Run the hdfs dfs -cp /stu01_4.txt /user/stu01/ command.
Figure 2-15
The stu01_4.txt file exists in the /user/stu01 directory and the root directory.
Figure 2-16
Run the hdfs dfs -mv /stu01_5.txt /user/stu01/ command.
Figure 2-17
The stu01_5.txt file exists in the /user/stu01 folder, but it has been removed from the
root directory.
Similar to the get command, the copyToLocal command is used to download files from the HDFS to a local host.
Delete the stu01_5.txt file from the Linux host.
Figure 2-18
Run the hdfs dfs -copyToLocal /user/stu01/stu01_5.txt . command.
Figure 2-19
The stu01_5.txt file exists on the Linux host.
Note the period (.) at the end of the command, which indicates the current local directory. To save the file to another location, specify that path instead of the period.
Figure 2-20
Run the hdfs dfs -getmerge /user/stu01/* ./merge.txt command.
Figure 2-21
The merge.txt file is generated in the current directory. The content in the file is a
combination of the files in /user/stu01/.
Figure 2-22
The stu01_5.txt file does not exist in /user/stu01/.
This command is used to collect information on the available space of a file system.
Run the hdfs dfs -df -h / command.
Figure 2-23
Step 15 Run the du command.
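This command displays the size of files and directories. For example:
hdfs dfs -du -h /user/stu01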
Figure 2-24
Step 16 Run the count command.
This command is used to collect information on the number of file nodes in a specified
directory.
Run the hdfs dfs -count -v /user/stu01 command.
Figure 2-25
Figure 2-26
Note that deleted data is moved to the HDFS trash directory and retained there for seven days by default.
Run the mv command to move the file from the trash directory back to the /user/stu01/ directory.
Figure 2-27
2.4 Summary
This exercise mainly describes common operations on HDFS. After completing this
exercise, you will be able to perform common HDFS operations.
3.1 Background
Hive is a data warehouse tool and plays an important role in data mining, data
aggregation, and statistics analysis. In telecom services, Hive can be used to collect
statistics on users' data usage and phone bills, and mine user consumption models to
help carriers better design packages.
3.2 Objectives
Understand common Hive operations.
Learn how to run HQL on Hue.
3.3 Tasks
3.3.1 Task 1: Creating Hive Tables
3.3.1.1 Viewing Statements for Creating Tables
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
Figure 3-1
Statement for creating database tables (If multiple users share the same environment, it
is recommended that the name of each table contain the first letters of the user's last
and first names to differentiate tables.)
create table cx_stu01(name string,gender string ,age int) row format delimited fields terminated by ','
stored as textfile;
Figure 3-2
The show tables command is used to display all tables.
create external table cx_stu02(name string,gender string ,age int) row format delimited fields
terminated by ',' stored as textfile ;
Figure 3-3
Figure 3-4
Run the following put command to upload data to the /user/stu01/ directory of the
HDFS:
Figure 3-5
Run the beeline command to go to Hive and run the following command to load the data into the external table:
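A load statement of the following form could be used (the data file name is an assumption):
load data inpath '/user/stu01/stu02_data.txt' into table cx_stu02;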
Figure 3-6
Figure 3-7
Figure 3-8
Step 2 Run the following Where statement:
Figure 3-9
Step 3 Run the following Order statement:
Figure 3-10
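For reference, the Where and Order By statements in the two steps above could take forms like the following (the filter column and values are assumptions):
select * from cx_stu02 where gender = 'female';
select * from cx_stu02 order by age;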
Figure 3-11
Step 2 Upload data to the HDFS.
Figure 3-12
Step 3 Create a table and import data to the table.
Run the beeline command to go to Hive and enter the following table creation
statement:
create external table cx_table_stu03(id int,name string ,subject string,score float) row format
delimited fields terminated by ',' stored as textfile ;
Figure 3-13
Step 4 Perform the sum operation.
To calculate the total score of each student, run the following statement:
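A statement of the following form could be used (a sketch based on the cx_table_stu03 schema):
select name, sum(score) as totalscore from cx_table_stu03 group by name;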
Figure 3-14
To calculate the total score of each student and filter out students whose total score is
greater than 230, run the following statement:
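For example:
select name, sum(score) as totalscore from cx_table_stu03 group by name having sum(score) > 230;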
Figure 3-15
Step 5 Perform the max operation.
To view the highest score of each course, run the following statement:
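For example:
select subject, max(score) from cx_table_stu03 group by subject;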
Figure 3-16
Step 6 Perform the count operation.
To calculate the number of trainees taking the exam of each course, run the following
statement:
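For example:
select subject, count(name) from cx_table_stu03 group by subject;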
Figure 3-17
Figure 3-18
Statements for creating cx_table_department:
create table cx_table_department(dept_id int,
dept_name string)
row format delimited fields terminated by ','
stored as textfile ;
Figure 3-19
Statements for creating cx_table_salary:
Figure 3-20
The data in the three tables is as follows:
cx_table_employee (employee table):
1,zhangsas,1
2,lisi,2
3,wangwu,3
4,tom,1
5,lily,2
6,amy,3
7,lilei,1
8,hanmeimei,2
9,poly,3
cx_table_department (department table):
1,Technical
2,sales
3,HR
4,marketing
cx_table_salary (salary table):
1,1,20000
2,2,16000
3,3,20000
4,1,50000
5,2,18900
6,3,12098
7,1,21900
When an INNER JOIN is performed on multiple tables, only the data that matches the on condition in all tables is displayed. For example, the following SQL statement joins the employee table and the department table. The on condition is dept_id, so only rows with the same dept_id are matched and displayed.
Run the following statement:
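A statement of the following form could be used (column names are inferred from the table descriptions above):
select e.username, d.dept_name
from cx_table_employee e join cx_table_department d
on e.dept_id = d.dept_id;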
Figure 3-21
You can join two or more tables. Run the following statement to query the employee
names, departments, and salaries:
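For example (a sketch; the salary table columns userid and salarys follow the subquery example later in this section):
select e.username, d.dept_name, s.salarys
from cx_table_employee e
join cx_table_department d on e.dept_id = d.dept_id
join cx_table_salary s on e.user_id = s.userid;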
Figure 3-22
Generally, a MapReduce job is generated for a join. If more than two tables are joined,
Hive associates the tables from left to right. For the preceding SQL statement, a
MapReduce job is started to connect the employee and department tables, and then the
second MapReduce job is started to connect the output of the first MapReduce job to the
salary table. This is contrary to the standard SQL, which performs the join operation from
right to left. Therefore, in Hive SQL, small tables are written on the left to improve the
execution efficiency.
Hive supports the /*+STREAMTABLE()*/ hint to specify which table is the large (streamed) table. For example, in the following SQL statement, dept is specified as the large table. If the /*+STREAMTABLE()*/ hint is not used, Hive considers the rightmost table to be the large table.
Run the following statement:
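A statement of the following form could be used (the alias dept for the department table is an assumption):
select /*+ STREAMTABLE(dept) */ e.username, dept.dept_name
from cx_table_employee e join cx_table_department dept on e.dept_id = dept.dept_id;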
Figure 3-23
Generally, the number of MapReduce jobs to be started is the same as the number of
tables to be joined. However, if the join keys of the on condition are the same, only one
MapReduce job is started.
LEFT OUTER JOIN, same as the standard SQL statement, uses the left table as a baseline.
If the right table matches the on condition, the data is displayed. Otherwise, NULL is
displayed.
Run the following statement:
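For example, with the employee table on the left and the salary table on the right, as described below:
select e.user_id, e.username, s.salarys
from cx_table_employee e left outer join cx_table_salary s
on e.user_id = s.userid;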
Figure 3-24
As shown in the preceding figure, all records in the employee table on the left are
displayed, and the data that meets the on condition in the salary table on the right is
displayed. The data that does not meet the on condition is displayed as NULL.
RIGHT OUTER JOIN is the opposite of LEFT OUTER JOIN. It uses the table on the right as the baseline. If the table on the left matches the on condition, the data is displayed. Otherwise, NULL is displayed.
Hive is a component for processing big data. It is often used to process hundreds of GB or
even TB-level data. Therefore, you are advised to use the where condition to filter out
data that does not meet the condition when compiling SQL statements. However, for
LEFT and RIGHT OUTER JOINs, the where condition is executed after the on condition is
executed. Therefore, to optimize the Hive SQL execution efficiency, use subqueries in
scenarios where OUTER JOINs are required and use the where condition to filter out data
that does not meet the conditions in the subqueries.
Run the following statement:
select e1.user_id,e1.username,s.salarys from (select e.* from cx_table_employee e where e.user_id < 8)
e1 left outer join cx_table_salary s on e1.user_id = s.userid;
Figure 3-25
In the preceding SQL statement, the data whose user_id is greater than or equal to 8 is
filtered out in the subquery.
FULL OUTER JOIN returns all the data that meets the where condition in the table. The
data that does not meet the where condition is replaced with NULL.
Run the following statement:
Figure 3-26
The results of FULL OUTER JOIN and LEFT OUTER JOIN are the same.
LEFT SEMI JOIN returns only the rows of the left table that have a match in the right table.
Run the following statement:
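For example:
select e.user_id, e.username
from cx_table_employee e left semi join cx_table_salary s
on e.user_id = s.userid;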
Figure 3-27
LEFT SEMI JOIN is an optimized form of INNER JOIN. When a row of the left table finds a match in the right table, Hive stops scanning for further matches, so LEFT SEMI JOIN is more efficient than INNER JOIN. However, only fields of the left table can be referenced in the select and where clauses of a LEFT SEMI JOIN. Hive does not support RIGHT SEMI JOIN.
A Cartesian product join pairs every row of the left table with every row of the right table.
Run the following statement:
Figure 3-28
The execution result of the preceding SQL statement is the number of records in the
employee table multiplied by the number of records in the salary table.
Map-side JOIN is a Hive SQL optimization. Because Hive converts SQL statements into MapReduce jobs, Hive's map-side JOIN corresponds to the map-side join in Hadoop MapReduce: small tables are loaded into memory to speed up Hive SQL execution. You can use either of the following methods to enable map-side JOIN in Hive SQL.
The first method is to use /*+ MAPJOIN*/:
Run the following statement:
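For example (the small table specified in the hint is an assumption):
select /*+ MAPJOIN(dept) */ e.username, dept.dept_name
from cx_table_employee e join cx_table_department dept on e.dept_id = dept.dept_id;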
Figure 3-29
The second method is to set hive.auto.convert.join to true.
On the Services page, click Hue. On the displayed page, click Hue (Active). The Hue page
is displayed.
Figure 3-30
Figure 3-31
Click the Query Editor and select Hive.
Figure 3-32
Figure 3-33
Step 2 Compile HQL.
select *,row_number() over(order by totalscore desc) rank from (select name,sum(score) totalscore
from cx_table_stu03 group by name) a;
Figure 3-34
Figure 3-35
Step 3 Query data.
Figure 3-36
Figure 3-37
Step 4 View the result.
Figure 3-38
Figure 3-39
3.4 Summary
This exercise describes the add, delete, modify, and query operations of the Hive data
warehouse and introduces multiple join methods to help trainees understand the join
types and differences. This exercise aims to help trainees better understand and use Hive.
4.1 Background
The HBase database is an important big data component and is the most commonly used
NoSQL database in the industry. Banks can store new customer information in HBase and
update or delete out-of-date data in HBase.
4.2 Objectives
Understand common HBase operations, region operations, and filter usage.
4.3 Tasks
4.3.1 Task 1: Performing Common HBase Operations
Run the source /opt/client/bigdata_env command to set environment variables.
Run the hbase shell command to access the HBase shell client.
Figure 4-1
Figure 4-2
list: displays all tables.
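The table used in the following steps is presumably created with a statement of the following form (one column family, cf1):
create 'cx_table_stu01','cf1'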
put 'cx_table_stu01','20200001','cf1:name','tom'
put 'cx_table_stu01','20200001','cf1:gender','male'
put 'cx_table_stu01','20200001', 'cf1:age','20'
put 'cx_table_stu01','20200002', 'cf1:name','hanmeimei'
put 'cx_table_stu01','20200002', 'cf1:gender','female'
put 'cx_table_stu01','20200002', 'cf1:age','19'
Figure 4-3
scan 'cx_table_stu01',{COLUMNS=>'cf1'} #Queries only the data in the cf1 column family.
scan 'cx_table_stu01',{COLUMNS=>'cf1:name'} #Queries only the name information in the cf1
column family.
Figure 4-4
get 'cx_table_stu01','20200001'
get 'cx_table_stu01','20200001','cf1:name'
Figure 4-5
scan 'cx_table_stu01',{STARTROW=>'20200001','LIMIT'=>2,STOPROW=>'20200002'}
scan 'cx_table_stu01',{STARTROW=>'20200001','LIMIT'=>2,COLUMNS=>'cf1:name'}
Figure 4-6
Note: In addition to the COLUMNS modifier, the scan command supports LIMIT (limits the number of rows in the query results), STARTROW (the start row key; the system locates the region based on this key and scans forward from it), STOPROW (the end row key), TIMERANGE (a timestamp range), VERSIONS (the number of versions), and FILTER (filters rows based on conditions).
put 'cx_table_stu01','20200001','cf1:name','ZhangSan'
put 'cx_table_stu01','20200001','cf1:name','LiSi'
put 'cx_table_stu01','20200001','cf1:name','WangWu'
Figure 4-7
Specify multiple versions to be queried.
get 'cx_table_stu01','20200001',{COLUMNS=>'cf1',VERSIONS=>5}
Figure 4-8
The version is specified during the search, but the last record is still displayed. Although
VERSIONS is added, only one record is returned after the get operation. This is because
the default value of VERSIONS is 1 during table creation.
Run the desc 'cx_table_stu01' statement to view the table attributes.
Figure 4-9
To view data of multiple versions, run the following statement to change the value of
VERSIONS of the table or specify the value when creating the table:
alter 'cx_table_stu01',{NAME=>'cf1','VERSIONS'=>5}
put 'cx_table_stu01','20200001','cf1:name','ZhangSan'
put 'cx_table_stu01','20200001','cf1:name','LiSi'
put 'cx_table_stu01','20200001','cf1:name','WangWu'
Figure 4-10
Figure 4-11
Run the deleteall 'cx_table_stu01','20200002' command to delete a row of data.
Figure 4-12
Figure 4-13
Figure 4-14
Region name format: [table],[region start key],[region id]
Log in to the HBase WebUI and check the table partitions.
Log in to MRS Manager, choose Services > HBase.
Figure 4-15
Click HMaster (Active). The HMaster WebUI is displayed.
Figure 4-16
Click cx_table_stu02 on the User Tables tab page. The Tables Regions page is displayed.
Figure 4-17
The cx_table_stu02 table has four partitions.
Figure 4-18
4.3.2.2 Viewing the Start Key and End Key of Specified Regions
Run the create 'cx_table_stu03', 'cf3', SPLITS => ['10000', '20000', '30000'] command to
create a table.
Check the table partitions.
Figure 4-19
scan 'cx_table_stu01',{FILTER=>"ValueFilter(=,'binary:20')"}
scan 'cx_table_stu01',{FILTER=>"ValueFilter(=,'binary:tom')"}
scan 'cx_table_stu01',FILTER=>"ColumnPrefixFilter('gender')"
scan 'cx_table_stu01',{FILTER=>"ColumnPrefixFilter('name') AND ValueFilter(=,'binary:hanmeimei')"}
Figure 4-20
4.4 Summary
This exercise describes how to create and delete HBase tables and add, delete, modify,
and query data, how to pre-split regions, and how to use filters to query data. After
completing this exercise, you will be able to know how to use HBase.
5.1 Background
This section mainly introduces how to use MR to count words.
5.2 Objectives
Understand the principles of MapReduce programming.
5.3 Tasks
5.3.1 Task 1: MapReduce Shell Practice
Step 1 Log in to an ECS.
Figure 5-1
Step 2 Edit a data file on the local Linux host.
Figure 5-2
Step 3 Upload the file to the HDFS.
Figure 5-3
Step 4 Run the following command to execute the JAR file program:
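A command of the following general form could be used to run the built-in WordCount example (the JAR path, input path, and output path are assumptions):
hadoop jar /opt/client/HDFS/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/stu01/wordcount_input /output01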
Figure 5-4
Note: This JAR package is a sample JAR package built into the Hadoop framework. The default field separator is the Tab character. The output01 folder must not exist in advance; the program creates it automatically.
The result file is saved in the output01 folder. The system automatically generates a part-
r-00000 file.
Figure 5-5
The file statistics are complete.
package com.huawei.bigdata.mapreduce.examples;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
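// The class declaration, the mapper/reducer classes, and the first half of the driver are omitted in
// this excerpt. A minimal sketch of the driver code that typically precedes the lines below (class and
// mapper names are assumptions based on the comments in the excerpt):
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        // Specify where data is read from.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Specify where the custom mapper comes from.
        job.setMapperClass(MyMapper.class);
        // Specify the type of <k2,v2> output by mapper.
        job.setMapOutputKeyClass(Text.class);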
job.setMapOutputValueClass(LongWritable.class);
// Specify where the custom reducer comes from.
job.setReducerClass(MyReducer.class);
// Specify the type of <k3,v3> output by reducer.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
// Specify where data is written.
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// true indicates that information such as running progress is sent to users in time.
job.waitForCompletion(true);
}
}
package com.huawei.bigdata.mapreduce.examples;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
Figure 5-6
Step 2 Understand the scenario.
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
Figure 5-7
Step 4 Understand the development approaches.
Collect statistics on female netizens whose TP is more than 2 hours on the weekend.
To achieve the objective, the process is as follows:
1. Read the original file data.
2. Filter data about the TP of the female netizens.
3. Summarize the total TP of each female.
4. Filter information about female netizens whose TP for online shopping is more than
two hours.
Parse sample code. The class in the sample project is FemaleInfoCollector.java.
Collect statistics on female netizens whose TP for online shopping is more than 2 hours
on the weekend.
To achieve the objective, the process is as follows:
1. Filter the TP of female netizens in original files using the CollectionMapper class
inherited from the Mapper abstract class.
2. Summarize the TP of each female netizen, and output information about female
netizens whose TP is more than 2 hours using the CollectionReducer class inherited
from the Reducer abstract class.
3. Use the main method to create a MapReduce job and submit the MapReduce job to
the Hadoop cluster.
Open the cmd window, go to the directory where the project is located, and run the mvn
package command to package the project.
Figure 5-8
Run the mvn package command to generate a JAR package and obtain it from the target
directory in the project directory, for example, mapreduce-examples-mrs-2.0.jar.
Figure 5-9
Step 6 Use WinSCP to log in to an ECS.
Figure 5-10
Step 7 Use PuTTY to log in to the ECS and run the MR program.
Figure 5-11
Run the mapreduce program.
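A command of the following form could be used (the JAR location and argument order are assumptions; the input and output directories match those mentioned in this exercise):
yarn jar /root/mapreduce-examples-mrs-2.0.jar com.huawei.bigdata.mapreduce.examples.FemaleInfoCollector /user/stu01/input /output2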
Figure 5-12
Step 8 View the result. The MR output result is stored in the /output2 directory. A result
file part-r-00000 is generated. Run the cat command to view the result.
Figure 5-13
There are two persons whose TP exceeds 2 hours.
5.4 Summary
This exercise describes the MapReduce programming process in shell and Java modes and
explains the source code to help trainees quickly get started with MapReduce.
6.1 Background
Spark is implemented in Scala and uses Scala as its application programming language. Unlike Hadoop, Spark is tightly integrated with Scala, which makes operating Resilient Distributed Datasets (RDDs) as easy as operating on local collection objects. This exercise describes how to use Scala to operate Spark RDDs and Spark SQL.
6.2 Objectives
Understand Spark programming by exercising Spark RDD and Spark SQL.
6.3 Tasks
6.3.1 Task 1: Spark RDD Programming
This exercise introduces Spark RDD programming to help you understand the working
principles and core mechanism of Spark Core.
The process is as follows:
Understand how to create an RDD.
Understand common RDD operators.
Understand how to use Scala project code to complete RDD operations.
Spark uses the textFile() method to load data from a file system to create an RDD.
This method takes the URI of the file as a parameter, which can be the address of the
local file system, the address of the HDFS, the address of Amazon S3, or more.
Connect to the cluster, start PuTTY or another connection software, load environment
variables, and enter spark-shell.
source /opt/client/bigdata_env
spark-shell
Figure 6-1
1. Load data from a Linux local file system.
Figure 6-2
2. Load data from the HDFS. If the file does not exist, put it again.
You can use either of the statements but the second one is recommended.
Figure 6-3
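For example, commands of the following forms could be used (the file paths are assumptions; hacluster is the default HDFS name service of the MRS cluster):
val localRdd = sc.textFile("file:///root/stu01.txt")
val hdfsRdd1 = sc.textFile("hdfs://hacluster/user/stu01/stu01.txt")
val hdfsRdd2 = sc.textFile("/user/stu01/stu01.txt")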
Step 2 Create an RDD using a parallel set (array).
You can call the parallelize method of SparkContext to create an RDD on an existing set
(array) in Driver.
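For example (the element values are assumptions):
val arrayRdd = sc.parallelize(Array(1, 2, 3, 4, 5))
arrayRdd.collect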
The original guide lists common RDD transformations and actions in tables that are not fully reproduced here. Among the recoverable entries: sortBy(func, [ascending], [numTasks]) is similar to sortByKey but more flexible, and repartition reshuffles the data to create either more or fewer partitions and balances it across them. Common actions are listed in a similar table.
Figure 6-4
Step 2 Use flatMap.
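For example (the input values are assumptions):
val rdd = sc.parallelize(List("hello world", "hello spark"))
rdd.flatMap(_.split(" ")).collect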
Figure 6-5
Step 3 Use intersection and union.
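The RDDs used below are presumably defined in the figure; a sketch of definitions consistent with the commands that follow (element values are assumptions):
val rdd1 = sc.parallelize(List(1, 2, 3, 4))
val rdd2 = sc.parallelize(List(3, 4, 5, 6))
val rdd3 = rdd1.union(rdd2)
val rdd4 = rdd1.intersection(rdd2)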
rdd3.distinct.collect
rdd4.collect
Figure 6-6
Figure 6-7
Step 4 Use join and groupByKey.
val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("jim", 2)))
//cogroup
val rdd3 = rdd1.cogroup(rdd2)
//Pay attention to the difference between cogroup and groupByKey.
rdd3.collect
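//Steps 5 and 6 (map and reduce) are omitted here. The reduce example below presumably redefines
//rdd1 as a numeric RDD, for example (values are assumptions):
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))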
//Reduce aggregation.
val rdd2 = rdd1.reduce(_ + _)
rdd2
Figure 6-8
Step 7 Use reduceByKey and sortByKey.
val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2), ("shuke", 1)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 3), ("shuke", 2), ("kitty", 5)))
val rdd3 = rdd1.union(rdd2)
//Aggregate by key.
val rdd4 = rdd3.reduceByKey(_ + _)
rdd4.collect
//Sort by value in descending order.
val rdd5 = rdd4.map(t => (t._2, t._1)).sortByKey(false).map(t => (t._2, t._1))
rdd5.collect
Figure 6-9
Figure 6-10
Step 8 Understand the lazy mechanism.
The lazy mechanism means that the entire transformation process only records the track
of the transformation and does not trigger real calculation. Only when an operation is
performed, real calculation is triggered from the beginning to the end.
A simple statement is provided to explain the lazy mechanism of Spark. The data.txt file
does not exist, but the first two statements are executed successfully. An error occurs
only when the third action statement is executed.
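A sketch of such a sequence (the file name follows the description above):
val lines = sc.textFile("data.txt")      //no error is reported even though data.txt does not exist
val lineLengths = lines.map(_.length)    //still no error: this is only a transformation
lineLengths.reduce(_ + _)                //action: computation starts and the missing file is reported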
Figure 6-11
Step 9 Perform persistence operations.
The following is an example of calculating the same RDD for multiple times:
After a persistence statement is added to the preceding example, the execution process is as follows:
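The RDD used below is presumably created and cached as follows (a sketch; the element values match the output shown):
val list = List("Hadoop", "Spark", "Hive")
val rdd = sc.parallelize(list)
rdd.cache()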
//Persist(MEMORY_ONLY) is called. However, when the statement is executed, the RDD is not cached
because the RDD has not been calculated and generated.
scala> println(rdd.count())
//The first action triggers a real start-to-end calculation. In this case, the preceding rdd.cache() is
executed and the RDD is stored in the cache.
3
scala> println(rdd.collect().mkString(","))
//The second action does not need to trigger a start-to-end calculation. Only the RDD in the cache
needs to be reused.
Hadoop,Spark,Hive
As in the MapReduce exercise, this exercise requires you to calculate the TP.
Develop a Spark application to perform the following operations on logs about the TP of
netizens for online shopping on a weekend:
1. Collect statistics on female netizens whose TP for online shopping is more than 2
hours on the weekend.
2. The first column in the log file records names, the second column records gender,
and the third column records the TP in the unit of minute. Three columns are
separated by comma (,).
log1.txt: logs collected on Saturday.
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
Upload the two Internet access log files to the /user/stu01/input directory of the HDFS. If the log files already exist, you do not need to upload them.
Based on the MRS 2.0 sample project imported during environment installation, start the
FemaleInfoCollection project, which is the Spark Core project. The folder in the MRS 2.0
sample project package is SparkJavaExample.
Figure 6-12
Step 4 Package the project.
Open the cmd window, go to the directory where the project is located, and run the mvn
package command to package the project.
Figure 6-13
Step 5 Use WinSCP to log in to an ECS.
Figure 6-14
Step 6 Use PuTTY to log in to the ECS and run the Spark program.
Figure 6-15
Execute the Spark program.
/opt/client/Spark/spark/bin/spark-submit --class
com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode client
/root/FemaleInfoCollection-mrs-2.0.jar /user/stu01/input
Figure 6-16
The total TP of the two persons exceeds 2 hours.
Figure 6-17
Step 1 Edit a data file.
Create the cx_person.txt file on the local Linux host. The file contains three columns: id,
name, and age. The three columns are separated by space. The content of the
cx_person.txt file is as follows:
1 zhangsan 20
2 lisi 29
3 wangwu 25
4 zhaoliu 30
5 tianqi 35
6 kobe 40
Figure 6-18
Run the spark-shell command to go to Spark. Then run the following command to read
data and separate data in each row using column separators:
Figure 6-19
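A command of the following form could be used (the local path is an assumption):
val lineRDD = sc.textFile("file:///root/cx_person.txt").map(_.split(" "))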
Step 3 Define a case class.
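For example (the case class name is an assumption):
case class Person(id: Int, name: String, age: Int)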
Figure 6-20
Step 4 Associate an RDD with the case class.
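For example, building on the definitions above:
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))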
Figure 6-21
Step 5 Transform the RDD into DataFrame.
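For example:
val personDF = personRDD.toDF()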
Figure 6-22
Step 6 View information about DataFrame.
personDF.show
Figure 6-23
personDF.printSchema
Figure 6-24
Step 7 Use the domain-specific language (DSL).
personDF.select(personDF.col("name")).show
Figure 6-25
Check another format of the name field.
personDF.select("name").show
Figure 6-26
Step 8 Check the data of the name and age fields.
personDF.select(col("name"), col("age")).show
Figure 6-27
Step 9 Query all names and ages and increase the value of age by 1.
Figure 6-28
You can also perform the following operation:
Step 10 Use the filter method to filter the records where age is greater than or equal to 25.
Figure 6-29
Step 11 Count the number of people who are older than 30.
personDF.filter(col("age")>30).count()
Figure 6-30
Step 12 Group people by age and collect statistics on the number of people of the same
age.
personDF.groupBy("age").count().show
Figure 6-31
Step 13 Use SQL.
personDF.registerTempTable("cx_t_person")
Run the following command to display the schema information of the table:
Figure 6-32
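A command of the following form could be used (assuming the Spark 2.x spark-shell, where spark is the SparkSession):
spark.sql("desc cx_t_person").show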
Step 14 Query the two oldest people.
Figure 6-33
Step 15 Query information about people older than 30.
Figure 6-34
Figure 6-35
Step 2 Create a dataset using a file.
Figure 6-36
Step 3 Create a dataset using the toDS method.
Figure 6-37
Step 4 Create a dataset using DataFrame and as[Type].
Perform transformation based on the DataFrame of personDF in task 1. Note that the
person object fields in Person2 and personDF must be the same.
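A sketch of this transformation (the Person2 field types are assumptions matching personDF):
case class Person2(id: Int, name: String, age: Int)
val personDS = personDF.as[Person2]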
Figure 6-38
Step 5 Collect statistics on the number of people older than 30 in the dataset.
Figure 6-39
6.4 Summary
This exercise introduces RDD-based Spark Core programming and DataFrame- and
DataSet-based Spark SQL programming, and enables trainees to understand the basic
operations of Spark programming.
7.1 Background
Flink is a unified computing framework that supports both batch processing and stream
processing. It provides a stream data processing engine that supports data distribution
and parallel computing.
Flink provides high-concurrency pipeline data processing, millisecond-level latency, and
high reliability, making it extremely suitable for low-latency data processing.
7.2 Objectives
Use Flink's asynchronous checkpoint mechanism and a real-time hot-selling product statistics case to understand the core ideas of Flink and how to use Flink to solve problems.
7.3 Tasks
7.3.1 Task 1: Importing a Flink Sample Project
Step 1 Download Flink sample code.
Visit https://support.huaweicloud.com/en-us/devg-
mrs/mrs_06_0002.html#mrs_06_0002__section336726849219.
Click the sample project of HUAWEI CLOUD MRS 1.8 for download.
Figure 7-1
Step 2 Import the sample project.
For details about how to import the MRS sample project, navigate to the Appendix to
refer to the instructions on how to import an MRS sample project in Eclipse. After the
import, the project automatically downloads related dependency packages.
Figure 7-2
The preceding figure shows the project code structure.
7.3.2.3 Procedure
Step 1 Write the snapshot data code.
The snapshot data is used to store the number of data pieces recorded by operators
during snapshot creation.
Create a class named UDFState in the com.huawei.flink.example.common package of the
sample project. The code is as follows:
import java.io.Serializable;
// This class is part of the snapshot and is used to save UDFState.
public class UDFState implements Serializable {
private long count;
// Initialize UDFState.
public UDFState() {
count = 0L;
}
// Set UDFState.
public void setState(long count) {
this.count = count;
}
// Obtain UDFState.
public long getState() {
return this.count;
}
}
The code snippet of a source operator pauses 1 second every time after sending 10,000
pieces of data. When a snapshot is created, the code saves the total number of sent data
pieces in UDFState. When the snapshot is used for restoration, the number of sent data
pieces saved in UDFState is read and assigned to the count variable.
Create the SimpleSourceWithCheckPoint class in the common package. The code is as
follows:
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
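// The class declaration and fields are omitted in this excerpt. A minimal sketch consistent with the
// methods below (the initial value of alphabet is an assumption):
public class SimpleSourceWithCheckPoint implements
        SourceFunction<Tuple4<Long, String, String, Integer>>, ListCheckpointed<UDFState> {
    private long count = 0L;
    private boolean isRunning = true;
    private String alphabet = "abcdefghijklmnopqrstuvwxyz";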
@Override
public List<UDFState> snapshotState(long l, long l1) throws Exception
{
UDFState udfState = new UDFState();
List<UDFState> udfStateList = new ArrayList<UDFState>();
udfState.setState(count);
udfStateList.add(udfState);
return udfStateList;
}
@Override
public void restoreState(List<UDFState> list) throws Exception
{
UDFState udfState = list.get(0);
count = udfState.getState();
}
@Override
public void run(SourceContext<Tuple4<Long, String, String, Integer>> sourceContext) throws
Exception
{
Random random = new Random();
while (isRunning) {
for (int i = 0; i < 10000; i++) {
sourceContext.collect(Tuple4.of(random.nextLong(), "hello" + count, alphabet, 1));
count ++;
}
Thread.sleep(1000);
}
}
@Override
public void cancel()
{
isRunning = false;
}
}
This code snippet is about a window operator and is used to calculate the number of tuples in the window.
Create the WindowStatisticWithChk class in the common package. The code is as follows:
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.List;
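// The class declaration and the running-total field are omitted in this excerpt. A minimal sketch
// consistent with the apply() signature below:
public class WindowStatisticWithChk implements
        WindowFunction<Tuple4<Long, String, String, Integer>, Long, Tuple, TimeWindow>,
        ListCheckpointed<UDFState> {
    private long total = 0L;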
@Override
public List<UDFState> snapshotState(long l, long l1) throws Exception
{
    // The body of this method is omitted in the source excerpt; it mirrors the source operator above,
    // saving the running total into UDFState (reconstruction, not verbatim sample code).
    UDFState udfState = new UDFState();
    List<UDFState> udfStateList = new ArrayList<UDFState>();
    udfState.setState(total);
    udfStateList.add(udfState);
    return udfStateList;
}
@Override
public void restoreState(List<UDFState> list) throws Exception
{
UDFState udfState = list.get(0);
total = udfState.getState();
}
@Override
public void apply(Tuple tuple, TimeWindow timeWindow, Iterable<Tuple4<Long, String, String,
Integer>> iterable, Collector<Long> collector) throws Exception
{
long count = 0L;
for (Tuple4<Long, String, String, Integer> tuple4 : iterable) {
count ++;
}
total += count;
collector.collect(total);
}
}
The code is about the definition of StreamGraph and is used to implement services. The
processing time is used as the time for triggering the window.
Create the FlinkProcessingTimeAPIChkMain class in the common package. The code is as
follows:
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
public class FlinkProcessingTimeAPIChkMain {
public static void main(String[] args) throws Exception
{
String chkPath = ParameterTool.fromArgs(args).get("chkPath",
"hdfs://hacluster/flink/checkpoints/");
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
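// The checkpoint configuration and the source/keyBy calls are omitted in this excerpt. A sketch of
// what typically goes here (the checkpoint interval and the key position are assumptions):
env.setStateBackend((StateBackend) new FsStateBackend(chkPath));
env.enableCheckpointing(6000, CheckpointingMode.EXACTLY_ONCE);
env.addSource(new SimpleSourceWithCheckPoint())
    .keyBy(0)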
.window(SlidingProcessingTimeWindows.of(Time.seconds(4), Time.seconds(1)))
.apply(new WindowStatisticWithChk())
.print();
env.execute();
}
}
Open the cmd window, go to the directory where the project is located, and run the mvn
package command to package the project.
Figure 7-3
Run the mvn package command to generate a JAR package and obtain it from the target directory in the project directory.
Figure 7-4
Step 6 Use WinSCP to log in to an ECS.
Figure 7-5
Step 7 Start the Flink cluster.
Use PuTTY to log in to the ECS and run the source /opt/client/bigdata_env command.
Figure 7-6
Start the Flink cluster before running the Flink applications on Linux. Run the yarn
session command on the Flink client to start the Flink cluster. The following is a
command example:
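A command of the following form could be used (the resource values are assumptions):
/opt/client/Flink/flink/bin/yarn-session.sh -n 2 -jm 1024 -tm 1024 &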
Note: yarn-session starts a running Flink cluster on Yarn. Once the session is successfully
created, you can use the bin/flink tool to submit tasks to the cluster. The system uses the
conf/flink-conf.yaml configuration file by default.
The yarn-session command accepts both mandatory and optional parameters; see the Flink client help for the full list.
Figure 7-7
The IP address of the JobManager web page is an intranet IP address. Log in to MRS
Manager, find the server, and bind an EIP to it. For details about how to bind an EIP, see
the related operations in the MRS documents. Use the EIP to replace the intranet IP
address and access the server. For example, if the bound IP address is 119.3.4.47, the
access address is http://119.3.4.47:42552.
Figure 7-8
Step 8 Run the JAR package of Flink.
Parameter description: class is followed by the full path of the main program entry class
and then the JAR package of the program. chkPath is the path for storing the checkpoint
file. In cluster mode, Flink stores the checkpoint file in HDFS.
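A command of the following form could be used (the JAR name and location are assumptions; the class and checkpoint path match the sample code above):
/opt/client/Flink/flink/bin/flink run -c com.huawei.flink.example.common.FlinkProcessingTimeAPIChkMain /root/flink-examples.jar --chkPath hdfs://hacluster/flink/checkpoints/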
The run parameter can be used to compile and run a program.
Usage: run [OPTIONS] <jar-file> <arguments>
Run parameters:
-c,--class <classname>: If the entry class is not specified in the JAR package, this parameter is
used to specify the entry class.
-m,--jobmanager <host:port>: specifies the address of the JobManager (active node) to be
connected. This parameter can be used to specify a JobManager that is different from that in the
configuration file.
-p,--parallelism <parallelism>: specifies the degree of parallelism of a program. The default
value in the configuration file can be overwritten.
Execution result:
Figure 7-9
On the Flink management panel, one more running Flink job is displayed.
Figure 7-10
Step 9 View the output.
On the Task Manager page of the Flink management panel, click Stdout to view the
output.
Figure 7-11
Step 10 View checkpoints.
Start PuTTY and run the HDFS command to view the /flink/checkpoints directory.
Figure 7-12
Step 11 Kill a Flink job.
Run the /opt/client/Flink/flink/bin/flink list command to view the Flink task list.
Figure 7-13
Specify the obtained job ID and run the following command to kill the job:
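For example (replace <job_id> with the ID obtained above):
/opt/client/Flink/flink/bin/flink cancel <job_id>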
Figure 7-14
7.3.3.1 Tasks
1. How is data processed based on EventTime, and how is a Watermark specified?
2. How are Flink's flexible Window APIs used?
3. When and how is State used?
4. How is ProcessFunction used to implement the TopN function?
7.3.3.3 Procedure
Step 1 Prepare a Flink project.
Figure 7-15
Step 2 Prepare data.
Table 7-1 describes the columns of the data file; the table content is not reproduced here.
Create the resources folder in the src/main directory of the project and save the data file
to the folder.
Figure 7-16
Figure 7-17
The preceding figure shows the project directory.
Create the HotGoods class in the goods package. The code is as follows:
package com.huawei.flink.example.goods;
import java.io.File;
import java.net.URL;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.io.PojoCsvInputFormat;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.typeutils.PojoTypeInfo;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
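// The setup that precedes the "env" pipeline below (inside the main() method of HotGoods) is omitted
// in this excerpt. A sketch of what typically goes there (the CSV file name and the POJO field order
// are assumptions):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
URL fileUrl = HotGoods.class.getClassLoader().getResource("UserBehavior.csv");
Path filePath = Path.fromLocalFile(new File(fileUrl.toURI()));
PojoTypeInfo<UserBehavior> pojoType = (PojoTypeInfo<UserBehavior>) TypeExtractor.createTypeInfo(UserBehavior.class);
String[] fieldOrder = new String[]{"userId", "itemId", "categoryId", "behavior", "timestamp"};
PojoCsvInputFormat<UserBehavior> csvInput = new PojoCsvInputFormat<>(filePath, pojoType, fieldOrder);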
env
// To create data source and obtain DataStream of the UserBehavior type
.createInput(csvInput, pojoType)
// To extract time and generate watermark
.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UserBehavior>() {
@Override
public long extractAscendingTimestamp(UserBehavior userBehavior) {
// Convert the unit of the source data from seconds to millisecond
return userBehavior.timestamp * 1000;
}
})
// To filter the click data
.filter(new FilterFunction<UserBehavior>() {
@Override
public boolean filter(UserBehavior userBehavior) throws Exception {
// To filter the click data
return userBehavior.behavior.equals("pv");
}
})
.keyBy("itemId")
.timeWindow(Time.minutes(60), Time.minutes(5))
.aggregate(new CountAgg(), new WindowResultFunction())
.keyBy("windowEnd")
.process(new TopNHotItems(3))
.print();
/** To obtain the top N hot items in a window. key indicates the timestamp of the window. The
output is a character string of TopN. */
public static class TopNHotItems extends KeyedProcessFunction<Tuple, ItemViewCount, String> {
// To save the states of the stored items and click count, and calculate TopN when all data in a
window is collected
private ListState<ItemViewCount> itemState;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
ListStateDescriptor<ItemViewCount> itemsStateDesc = new ListStateDescriptor<>(
"itemState-state",
ItemViewCount.class);
itemState = getRuntimeContext().getListState(itemsStateDesc);
}
@Override
public void processElement(
ItemViewCount input,
Context context,
Collector<String> collector) throws Exception {
    // The method body is omitted in the source excerpt; typically each record is buffered in the
    // state and an event-time timer is registered for windowEnd + 1 (reconstruction, not verbatim):
    itemState.add(input);
    context.timerService().registerEventTimeTimer(input.windowEnd + 1);
}
@Override
public void onTimer(
        long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
    // The TopN calculation is omitted in the source excerpt: the ItemViewCount records are read
    // from itemState, sorted by view count in descending order, and the top N entries are
    // formatted into the result variable used below.
    // To control the output frequency and simulate the real-time scrolling result
Thread.sleep(1000);
out.collect(result.toString());
}
}
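// Output the aggregation result of a window as an ItemViewCount. The class declaration is omitted
// in this excerpt; a sketch consistent with the apply() signature below:
public static class WindowResultFunction implements WindowFunction<Long, ItemViewCount, Tuple, TimeWindow> {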
@Override
public void apply(
Tuple key, // Primary key of the window, that is, itemId
TimeWindow window, // Window
Iterable<Long> aggregateResult, // Result of the aggregate function, that is, count value
Collector<ItemViewCount> collector // Output type: ItemViewCount
) throws Exception {
Long itemId = ((Tuple1<Long>) key).f0;
Long count = aggregateResult.iterator().next();
collector.collect(ItemViewCount.of(itemId, window.getEnd(), count));
}
}
/** COUNT of aggregate function implementation. The value increases by 1 each time a record is
generated. */
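// The class declaration is omitted in this excerpt; a sketch consistent with the methods below:
public static class CountAgg implements AggregateFunction<UserBehavior, Long, Long> {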
@Override
public Long createAccumulator() {
return 0L;
}
@Override
public Long add(UserBehavior userBehavior, Long acc) {
return acc + 1;
}
@Override
public Long getResult(Long acc) {
return acc;
}
@Override
public Long merge(Long acc1, Long acc2) {
return acc1 + acc2;
}
}
Flink can run on a single server or even a single Java virtual machine (VM). This
mechanism enables users to test or debug Flink programs locally. Now, run the Flink
program locally. You can also refer to task 2 to run the Flink program in the cluster.
In Eclipse, right-click the class, choose Run As > Java Application from the shortcut menu, and run the main function. The hot-selling offering IDs at each time point are displayed.
Figure 7-18
7.4 Summary
This exercise describes two cases of implementing the asynchronous CheckPoint
mechanism and real-time hot-selling offerings, and helps trainees learn multiple core
concepts and API usage of Flink, including how to use EventTime, Watermark, State,
Window API, and TopN. It is expected that this exercise can deepen your understanding
of Flink and help you resolve real-world problems.
8.1 Background
The Kafka message subscription system plays an important role in big data services,
especially in real-time services. A typical example is Taobao's You May Like service, which uses Kafka to store page clickstream data; after streaming analysis, the analysis results are pushed to users.
8.2 Objectives
Understand how to use Kafka shell producers and consumers to generate and
consume data in real time.
8.3 Tasks
8.3.1 Task 1: Producing and Consuming Kafka Messages on the
Shell Side
Step 1 Log in to Kafka.
Use PuTTY to log in to a server and run the source command to set environment
variables.
source /opt/client/bigdata_env
Figure 8-1
Run the cd /opt/client/Kafka/kafka/ command to go to the Kafka directory.
Figure 8-2
Step 2 Create a Kafka topic.
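A command of the following form could be used (the partition and replication values are assumptions; obtain the ZooKeeper address as described in the Appendix):
bin/kafka-topics.sh --create --topic cx_topic2 --partitions 3 --replication-factor 1 --zookeeper <ZooKeeper_IP>:2181/kafka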
Figure 8-3
Note: For details about how to obtain the ZooKeeper IP address, see the related content
in the Appendix.
Figure 8-4
Step 4 Create a console consumer.
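A command of the following form could be used (obtain the broker address as described in the Appendix):
bin/kafka-console-consumer.sh --topic cx_topic2 --bootstrap-server <Kafka_broker_IP>:9092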
Figure 8-5
Note: The IP address of bootstrap-server is the IP address of the Kafka broker. You can
obtain the IP address by referring to the related content in the Appendix.
After this command is executed, the cx_topic2 data is consumed. Do not perform other
operations in this window or close the window.
Log in to PuTTY again, run the source command to obtain the environment variables, and
go to the Kafka directory.
Figure 8-6
Run the following command to create a producer:
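For example:
bin/kafka-console-producer.sh --topic cx_topic2 --broker-list <Kafka_broker_IP>:9092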
Figure 8-7
Note: The IP address of broker-list is the broker address of Kafka. For details about how
to obtain the IP address, see the related content in the Appendix.
Switch to the shell of the consumer. The console data output is displayed.
Figure 8-8
You can continue to enter data on the producer for testing.
Figure 8-9
Note that the topic partition is set to 3. Different value settings will lead to different
effects.
For example, to delete the topic, run the bin/kafka-topics.sh --delete --topic cx_*** --
zookeeper 192.168.0.151:2181/kafka command.
Open three PuTTY windows, set environment variables, go to the Kafka directory, and run
the following command to start three consumers:
Note that you can add --consumer-property group.id=group1 to specify consumer group group1.
Figure 8-10
Switch to the three consumer windows. It is found that each window consumes two
messages evenly.
Figure 8-11
Figure 8-12
Figure 8-13
The three consumers evenly consume the six messages, with each consumer processing two messages; no message is lost or consumed twice within the group.
Open another PuTTY window, set environment variables, go to the Kafka directory, and
run the following command to start the fourth consumer:
Figure 8-14
Step 5 Configure four consumers.
Figure 8-15
Switch to the four consumer windows:
Figure 8-16
Figure 8-17
Figure 8-18
Figure 8-19
As shown in the preceding figure, one consumer has no corresponding partition and therefore cannot obtain messages. When creating a topic, you can therefore create enough partitions so that each consumer is assigned at least one partition and no consumer is wasted. To add partitions to an existing topic, run the kafka-topics.sh command with the --alter option; the kafka-reassign-partitions.sh command can then be used to redistribute partitions across brokers.
Disable two consumers by pressing Ctrl+C and retain the remaining two consumers.
Enter the following six messages in sequence in the producer window:
Figure 8-20
Check the status of the two consumer windows.
Figure 8-21
Figure 8-22
When two consumers are used, the distribution is again uneven: the messages are split into four and two rather than three and three. The reason is that one consumer is assigned two partitions while the other is assigned one, and messages are consumed by partition.
Check the partitions of the cx_topic3 topic.
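For example:
bin/kafka-topics.sh --describe --topic cx_topic3 --zookeeper <ZooKeeper_IP>:2181/kafka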
Figure 8-23
View the details about the consumer group.
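For example (assuming the consumer group created above is group1):
bin/kafka-consumer-groups.sh --describe --group group1 --bootstrap-server <Kafka_broker_IP>:9092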
Figure 8-24
As the figure shows, the consumer_id values of partition0 and partition1 are both
consumer-1-404604b8-be64-4a1d-9a15-42b7bf5f475f.
Sometimes, the data consumption sequences of different partitions are different. This is
because Kafka messages are stored by partition, and only messages in the same partition
are pulled in sequence.
8.4 Summary
This exercise describes how to produce and consume data in real time on the shell side and enables trainees to have a deeper understanding of Kafka. Multiple consumers in the same consumer group act together as one logical consumer, with partitions distributed among them, which improves consumption efficiency.
9.1 Background
Flume is an important data collection tool among the big data components and is often used to collect data from various data sources for other components to analyze. In the log analysis service, server logs are collected to analyze whether servers are running properly. In real-time services, data is usually collected into Kafka for analysis and processing by real-time components such as Streaming and Spark. Flume is therefore widely used in big data services.
9.2 Objectives
Configure and use Flume to collect data.
9.3 Tasks
9.3.1 Task 1: Installing the Flume Client
Step 1 Open the Flume service page.
Access the MRS Manager cluster management page and choose Services > Flume.
Figure 9-1
Step 2 Click Download Client.
Figure 9-2
Figure 9-3
After the download is complete, a dialog box is displayed, indicating the server (Master
node) to which the file is downloaded. The path is /tmp/MRS-client.
Figure 9-4
Use MobaXterm to log in to the ECS of the preceding step and go to the /tmp/MRS-client
directory.
Figure 9-5
Run the following command and decompress the package to obtain the verification file
and client configuration packages:
Figure 9-6
Step 4 Verify the file package.
Figure 9-7
Step 6 Install the Flume environment variables.
Run the following command to install the client running environment to the new
directory /opt/Flumeenv.
The directory is generated automatically during installation.
sh /tmp/MRS-client/MRS_Flume_ClientConfig/install.sh /opt/Flumeenv
Check the command output. If the following information is displayed, the client running
environment has been successfully installed:
Figure 9-8
Step 7 Configure the environment variables.
cd /tmp/MRS-client/MRS_Flume_ClientConfig/Flume
tar -xvf FusionInsight-Flume-1.6.0.tar.gz
Figure 9-9
Step 9 Install the Flume client.
sh /tmp/MRS-client/MRS_Flume_ClientConfig/Flume/install.sh -d /opt/FlumeClient
Figure 9-10
Run the following commands to copy the HDFS configuration file to the conf directory of
Flume:
cp /opt/client/HDFS/hadoop/etc/hadoop/hdfs-site.xml /opt/FlumeClient/fusioninsight-flume-
1.6.0/conf/
cp /opt/client/HDFS/hadoop/etc/hadoop/core-site.xml /opt/FlumeClient/fusioninsight-flume-
1.6.0/conf/
cd /opt/FlumeClient/fusioninsight-flume-1.6.0
sh bin/flume-manage.sh restart
Figure 9-11
Visit https://support.huawei.com/enterprise/en/doc/EDOC1000113257.
After the decompression, start the Flume configuration planning tool. If macros are
disabled, enable them. Otherwise, the tool does not work.
Figure 9-12
Step 3 Configure parameters.
In the Flume Name row of the first sheet, select client. Then, switch to the Flume
Configuration row of the second sheet.
Figure 9-13
Step 4 Configure the source.
Figure 9-14
Figure 9-15
Figure 9-16
Figure 9-17
Step 5 Configure a channel.
Click Add Channel. Set ChannelName to c1, which corresponds to that in Source, set type
to memory, and retain the default values for other parameters.
Figure 9-18
Step 6 Configure a sink.
Figure 9-19
Figure 9-20
Note: The MRS cluster is in non-security mode. Therefore, you do not need to configure
Kerberos in the sink.
Figure 9-21
Figure 9-22
Step 8 Create the /tmp/flume_spooldir directory in Linux.
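For example:
mkdir -p /tmp/flume_spooldir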
Figure 9-23
Step 9 Upload the Flume configuration file.
Upload the generated properties.properties file to the following directory on the Flume client:
/opt/FlumeClient/fusioninsight-flume-1.6.0/conf/
Figure 9-24
Go to the /tmp/flume_spooldir directory, run the vi command to create the test1.txt file,
and enter any content.
Figure 9-25
Step 11 View the result.
Figure 9-26
The data is successfully collected and uploaded to the HDFS. You can continue to create
a data file for testing.
On the tool description page of the configuration tool, change server to client.
Figure 9-27
If the configuration is intended for a Flume server deployed in the cluster, set this parameter to server; if it is intended for a separately installed Flume client, set it to client.
In the Flume configuration planning tool, change type of sink to kafka and set the value
of kafka.bootstrap.servers.
Figure 9-28
kafka.topic: cx_topic1
kafka.bootstrap.servers: 192.168.0.152:9092. If there are multiple Kafka instances in the
cluster, you need to configure all of them. If the Kafka is installed in the cluster and
configuration has been synchronized, you do not need to configure this parameter.
kafka.security.protocol: PLAINTEXT. The cluster used in this exercise is a non-security
cluster.
After the configuration is complete, click Generate a configuration file.
Figure 9-29
Note: You can obtain the IP address of the ZooKeeper by referring to the related content
in the Appendix.
Upload the newly generated configuration file to the following directory on the Flume client:
/opt/FlumeClient/fusioninsight-flume-1.6.0/conf/
Figure 9-30
Note: The Flume client automatically loads the properties.properties file.
Figure 9-31
Note: The IP address of bootstrap-server corresponds to the IP address of the Kafka
instance. You can obtain the IP address by referring to the related content in the
Appendix.
After this command is executed, the cx_topic1 data is consumed. Do not perform other
operations in this window or close it.
On PuTTY, open a shell connection (do not close the consumer window that is just
started) and go to the /tmp/flume_spooldir directory.
Run the vi command to edit the testkafka.txt file, enter any content, save the file, and
exit.
Figure 9-32
Step 7 View the result.
Switch back to the shell window of the consumer. The data output is displayed.
Figure 9-33
9.4 Summary
This exercise mainly describes how to collect data using Flume SpoolDir and Avro
sources. This exercise aims to help trainees better understand Flume by learning common
offline and real-time data collection methods.
10.1 Background
Big data services often involve data migration, especially data migration between
relational databases and big data components. Loader is often used to migrate data
between MySQL and HDFS/HBase. The graphical operations of Loader make data
migration easier.
10.2 Objectives
Use Loader to migrate data in service scenarios.
10.3 Tasks
10.3.1 Task 1: Preparing MySQL Data
Step 1 Apply for the MySQL service.
Figure 10-1
Click Buy Now and configure the database instance information as follows:
Billing Mode: Pay-per-use
Region: CN East-Shanghai2 (the same region as MRS)
DB Instance Name: Enter a custom name. In this exercise, rds_loader is used as an
example. The instance name must start with a letter and contain 4 to 64 characters. Only
letters, digits, hyphens (-), and underscores (_) are allowed.
DB Engine: MySQL
DB Engine Version: 5.6
DB Instance Type: Single
AZ: default value
Time Zone: default value
Instance Class: 1 vCPU | 2 GB
Storage Type: Ultra-high I/O
Storage Space: 40 GB
Disk Encryption: Disable
VPC: default value (the same network as MRS)
Security Group: default value (the same security group as MRS)
Administrator: root
Administrator Password: set the password as required.
Parameter Template: default value
Tag: not specified
Quantity: 1
Figure 10-2
Step 2 Log in to MySQL.
After the RDS for MySQL DB instance is created, click Log In and enter username root
and password to log in to the MySQL DB instance.
Figure 10-3
The MySQL data service management page is displayed.
Figure 10-4
Step 3 Create a database.
Click Create Database, enter a database name, for example, rdsdb, set Character Set to
utf8, and click OK.
Figure 10-5
Step 4 Create a table.
In the list on the left, choose rdsdb. On the displayed page, click Create Table, name the
table, and change the character set, as shown in the following figure:
Figure 10-6
Click Next, click Add, and set the fields as follows:
Figure 10-7
Set id to the primary key and click Next. Do not set the index and foreign key. Then click
Create Now.
Figure 10-8
Click Execute.
Click the SQL operation button in the upper part, select SQL Window, select the rdsdb
database on the left, and enter the following statement in the SQL window:
Figure 10-9
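The exact statement is shown in Figure 10-9. A representative statement, assuming the fields created in Figure 10-7 are id, name, gender, and age (the same fields used for the Hive table later in this exercise), is:
insert into cx_student values (1,'Tom','male',20),(2,'Jerry','male',21),(3,'Lily','female',19);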
Click Execute SQL to insert data.
Figure 10-10
Step 2 Upload the JAR file.
Start WinSCP, connect to the master node, and upload the MySQL JAR package to the
/opt/Bigdata/MRS_2.1.0/1_18_Sqoop/install/FusionInsight-Sqoop-1.99.7/server/jdbc
directory.
Figure 10-11
Note: If the MRS cluster is highly available, upload the package to each master node. In
this exercise, the HA function is not enabled for the MRS cluster. You only need to upload
the package to one master node.
Figure 10-12
Note: If the MRS cluster is highly available, you need to modify this attribute on each
master node. In this exercise, the HA function is not enabled for the MRS cluster. You
only need to modify the attribute on one master node.
Modify the jdbc.properties file in the folder mentioned in the previous step: change the
value of the MySQL key to the name of the uploaded JDBC driver package, mysql-
connector-java-5.1.21.jar. If the value is already mysql-connector-java-5.1.21.jar, you do
not need to change it.
Figure 10-13
Note: If the MRS cluster is in HA mode, you need to change the value of this parameter
on each master node. In this exercise, the HA function is not enabled for the MRS cluster.
You only need to change the value of this parameter on one master node.
Log in to the MRS management page. On the Services tab page, click Loader.
Figure 10-14
Choose More > Restart Service.
Figure 10-15
Enter the verification password and click OK. In the Restart Service dialog box, select
Restart all upper-layer services.
Figure 10-16
Wait for the service to restart.
Figure 10-17
Log in to MRS Manager and choose Services > Service Hue > Service Status. Click Hue
(Active) to access the Hue page.
Figure 10-18
Step 2 Access Sqoop.
In Huawei products, Loader corresponds to the open-source framework Sqoop and
appears as Sqoop on the Hue page. Click Sqoop in the Data Browsers drop-down list. The
Sqoop page is displayed.
Figure 10-19
Figure 10-20
Step 3 Create a MySQL link.
In the upper right corner, choose Manage links > New link.
Figure 10-21
Name: cx_mysql_conn
Connector: generic-jdbc-connector
Database type: MySQL
Host: Enter the private IP address of the MySQL instance, as shown in the following
figure:
Figure 10-22
Port: 3306
Database: rdsdb
Username: root
Password: the password of user root set when you apply for the MySQL service
Figure 10-23
Figure 10-24
After the configuration is complete, click Test. If the testing succeeds, click Save. The
MySQL link is created.
Figure 10-25
Step 5 Create a Hive link.
Figure 10-26
Step 6 Create an HDFS link.
Figure 10-27
The MySQL table and data have been prepared in Task 1. The data is as follows:
Figure 10-28
Step 2 Log in to Hue and create a job.
On the Sqoop page of Hue, click Create Job and set the parameters as follows:
Figure 10-29
Click Next.
Set Schema name to rdsdb, Table name to cx_student, and Partition column to id, as
shown in the following figure:
Figure 10-30
Click Next.
Set Output directory to the /user/stu01/output2 directory. Retain the default values for
other parameters, as shown in the following figure:
Figure 10-31
Note: If output2 does not exist, the system automatically creates one.
Figure 10-32
The task is successfully run.
Figure 10-33
Step 6 View the result.
Use PuTTY to log in to the master node and go to the HDFS directory to view data files.
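For example, the following commands list and print the exported files (the output directory was set to /user/stu01/output2 in the job):
hdfs dfs -ls /user/stu01/output2
hdfs dfs -cat /user/stu01/output2/*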
Figure 10-34
Use PuTTY to log in to a master node, go to Hive, and run the following statement to
create a table:
create table cx_loader_stu01(id int, name string, gender string, age int) row format delimited fields
terminated by ',' stored as textfile;
Figure 10-35
Step 3 Log in to Hue and create a job.
Figure 10-36
Click Next.
Set Schema name to rdsdb, Table name to cx_student, and Partition column to id, as
shown in the following figure:
Figure 10-37
Click Next.
Retain the default database name default and set Table to cx_loader_stu01, as shown in
the following figure:
Figure 10-38
Step 6 Configure the field mapping.
Figure 10-39
Click Next.
Figure 10-40
The task is successfully run.
Figure 10-41
Step 8 View the result.
Use PuTTY to log in to the master node, run the beeline command to go to Hive, and run
the select statement to view the result.
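For example (the table created earlier in this task is cx_loader_stu01):
beeline
select * from cx_loader_stu01;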
Figure 10-42
Use PuTTY to log in to the master node, run hbase shell to go to the HBase window, and
run the following statement to create a table:
create 'cx_table_stu02','cf1'
Figure 10-43
Step 2 Create a data file and upload it to the HDFS.
Edit data file cx_stu_info2.txt on the Linux PC. The file content is as follows:
Figure 10-44
Run the following command to upload the file to the HDFS:
Figure 10-45
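The exact command appears in Figure 10-45. A typical form, assuming the file is in the current directory and /user/stu01 is used as the HDFS target directory (use the same path that you later set as the input path of the Loader job), is:
hdfs dfs -put cx_stu_info2.txt /user/stu01/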
Step 3 Log in to the Hue page and create a job.
Figure 10-46
Click Next.
The input path is the path of the HDFS file to be imported. The configuration is as
follows:
Figure 10-47
Click Next.
Set Table name to cx_stu_info2 and Method to PUTLIST, as shown in the following figure:
Figure 10-48
Click Next.
Figure 10-49
Select Row Key in the first row, name the destination fields (each destination field name
becomes the column qualifier in HBase), and click Next.
Figure 10-50
The task is successfully run.
Figure 10-51
Step 8 View the result.
Log in to HBase and run the scan command to view the table data.
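For example, assuming the table created at the beginning of this task is the target table of the job:
scan 'cx_table_stu02'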
Figure 10-52
10.4 Summary
This exercise describes how to use Loader in multiple service scenarios. Trainees can
perform data migration operations in actual services after completing this exercise. Note
that tables must be created before table data is migrated among MySQL, HBase, and
Hive. Otherwise, an error may occur and the exercise may fail.
11.1 Background
In big data services, multiple components work together to form a service system. The
following two exercises involve these components.
The first is a typical data analysis exercise: Loader periodically migrates MySQL data to
Hive. Because Hive data is stored in HDFS, Loader is then used to import the data from
HDFS to HBase. HBase is used to query the data in real time, and the big data processing
capability of Hive is used to analyze the results.
The second uses Flume to collect incremental data, upload the data to HDFS, and then
query and analyze the data with Hive.
11.2 Objectives
Use big data components to convert and query data in real time.
11.3 Tasks
Data is imported from the MySQL database to Hive, and then imported from Hive to
HBase for data analysis.
Go to the MySQL instance page and click Log In. You can reuse the MySQL instance
purchased in section 10.3. If no MySQL instance is available, purchase one.
Figure 11-1
Step 2 Create the cx_socker table and set timestr as the primary key.
Create the rdsdb database (if it does not already exist) and create a table.
Figure 11-2
Create a field.
Figure 11-3
Click Create Now and execute the script.
Figure 11-4
Click Create Task. By default, no bucket is available. Click Create OBS bucket.
Figure 11-5
Click OK to return to the import page. Select data file sp500.csv and upload it. The
configuration is as follows:
Figure 11-6
Click Create Import Task and wait for the task to be executed.
Figure 11-7
Figure 11-8
Step 4 View data in cx_socker.
Figure 11-9
Detailed data is shown as follows:
Figure 11-10
Log in to Hive and run the following command to create the cx_hive_socker table:
Figure 11-11
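The full statement is shown in Figure 11-11. The following is only a minimal sketch containing the two columns this guide refers to elsewhere (timestr and endprice); the actual table must include all columns of the sp500.csv file in the same order:
create table cx_hive_socker(timestr string, endprice string) row format delimited fields terminated by ',' stored as textfile;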
Step 2 Log in to Hue and create a job in Sqoop.
Figure 11-12
Step 3 Set the MySQL information for the job.
Figure 11-13
Step 4 Configure the Hive information for the job.
Figure 11-14
Step 5 Configure the field mapping.
Figure 11-15
Click Next.
Figure 11-16
The task is successfully run.
Figure 11-17
Step 7 View data in Hive.
Figure 11-18
Figure 11-19
Step 2 Run the following command to insert data:
Figure 11-20
Step 3 Run the following command to obtain the total number of stocks that rose:
Figure 11-21
Access the hbase shell and run the following statement to create a table:
create 'cx_hbase_socker','cf1'
Figure 11-22
Step 2 Create a Loader job.
Figure 11-23
Step 3 Configure the source path.
The input path is the path of the HDFS file to be imported. In this case, it points to the
data of the Hive table cx_hive_socker, which is stored in the Hive data warehouse
directory in HDFS, as shown in the following figure:
Figure 11-24
To configure the HDFS address of the Loader job, you need to query the specific path. In
this example, the HDFS path is as follows:
/user/hive/warehouse/cx_hive_socker/60f9ba12-3a37-472f-baaf-1b092c82740f
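One way to find this path is to list the table's warehouse directory in HDFS, for example:
hdfs dfs -ls -R /user/hive/warehouse/cx_hive_socker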
Figure 11-25
Click Next.
Figure 11-26
Click Next.
Figure 11-27
Click Next.
Figure 11-28
The task is successfully run.
Figure 11-29
Step 7 View the result.
Log in to HBase and run the scan command to view the table data.
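For example:
scan 'cx_hbase_socker'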
Figure 11-30
get 'cx_hbase_socker','2009-09-15'
Figure 11-31
Step 2 Query the number of records in a specified period.
scan 'cx_hbase_socker',{COLUMNS=>'cf1:endprice',STARTROW=>'2009-08-15',STOPROW=>'2009-09-
15'}
Figure 11-32
Step 3 Query all columns whose values are greater than a specified value. Values are
compared as character strings.
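The exact statement is shown in Figure 11-33. One way to express such a query in the HBase shell is a ValueFilter with the binary comparator, which compares values byte by byte, that is, as character strings; the threshold 1000 below is only a placeholder:
scan 'cx_hbase_socker', {FILTER => "ValueFilter(>, 'binary:1000')"}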
Figure 11-33
Step 4 Query all columns whose qualifiers end with endprice and whose values are greater
than 999. Values are compared as character strings.
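The exact statement is shown in Figure 11-34. One way to write it, assuming the endprice qualifier in column family cf1, is to combine a column filter with a ValueFilter:
scan 'cx_hbase_socker', {FILTER => "ColumnPrefixFilter('endprice') AND ValueFilter(>, 'binary:999')"}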
Figure 11-34
11.4 Summary
These exercises integrate the applications of each component, helping trainees better
understand and use big data components.
Visit https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-
2133151.html, select Accept License Agreement, and download the JDK of the Windows
x64 version. If the operating system is 32-bit, select the x86 version.
Figure 12-1
Step 2 Double-click the downloaded .exe file and click Next.
Figure 12-2
Step 3 Select the installation path. You can use the default address.
Figure 12-3
Step 4 On the Change in License Terms page, click OK.
Figure 12-4
Step 5 Retain the default address and click Next.
Figure 12-5
Wait for the installation to complete.
Figure 12-6
After the installation is complete, click Close.
Figure 12-7
Step 6 Configure JDK environment variables.
Choose My Computer > Properties > Advanced system settings > Environment Variables.
Figure 12-8
Click New in the System variables area. Set Variable name to JAVA_HOME (all uppercase
letters) and Variable value to the JDK installation path.
Figure 12-9
Find Path in the system variables and edit the variable.
Figure 12-10
Add a semicolon (;) at the end of the variable value, and then add %JAVA_HOME%\bin.
Figure 12-11
Step 7 Check whether the JDK is installed successfully.
Choose Start > Run, enter cmd, and press Enter. In the displayed dialog box, enter java -
version.
Figure 12-12
If the Java version information is displayed, the installation is successful.
Figure 12-13
Add MAVEN_HOME (or M2_HOME) to the system environment variables and set its value
to the Maven installation directory, for example, D:\apache-maven-3.5.0.
Figure 12-14
Step 2 Verify the Maven installation.
Press Win+R to open the Run window, enter cmd, and run the mvn -version command to
check the version.
Figure 12-15
Figure 12-16
Step 2 Select a version to download.
Select 64-bit to download. If the computer is 32-bit, click Download Packages and select
32-bit to download.
Figure 12-17
Step 3 Decompress the downloaded package and go to the folder. The following figure
shows the Eclipse startup program.
Figure 12-18
Double-click the program to start it. If you are opening the tool for the first time, you
need to configure a workspace. You can use the default location on drive C or select
another location. Then click OK.
Figure 12-19
Step 4 Choose Help > Eclipse Marketplace, search for Maven, and click Install to install the
Maven plug-in in Eclipse.
Figure 12-20
Step 5 Choose Window > Preferences > Maven > User Settings and set the user settings to
the settings file in the Maven installation directory.
Figure 12-21
Step 6 Download the MRS2.0 sample code.
The address for downloading the sample project of MRS on HUAWEI CLOUD is
https://github.com/huaweicloud/huaweicloud-mrs-example/tree/mrs-2.0.
Figure 12-22
Download the ZIP package and decompress it.
Figure 12-23
Start Eclipse, choose File > New > Java Working Set, enter the name, for example,
MRS2.0Demo, and click Finish.
Figure 12-24
Step 2 Import a sample project to Eclipse.
Decompress the package and start Eclipse. Then choose File > Import.
Figure 12-25
Click Browse, select the huaweicloud-mrs-example-mrs-2.0 sample project folder in the
decompressed package, select Add project to working set, select the MRS2.0Demo
created in the previous step, and click Finish.
Figure 12-26
Wait until the Maven dependency package is loaded.
Figure 12-27
If an error is reported, ignore it and click OK.
Figure 12-28
Switch to the pom.xml page and add the following code. Place the repositories element
after the dependencies element.
<repositories>
<repository>
<id>huaweicloudsdk</id>
<url>https://mirrors.huaweicloud.com/repository/maven/huaweicloudsdk/</url>
<releases><enabled>true</enabled></releases>
<snapshots><enabled>true</enabled></snapshots>
</repository>
<repository>
<id>central</id>
<name>Maven Central</name>
<url>https://repo1.maven.org/maven2/</url>
</repository>
</repositories>
Figure 12-29
Figure 12-30
After saving the file, keep the network connected and wait for Eclipse to download the
required JAR packages. Maven downloads them from the Huawei mirror repository.
If the pom reports an error stating "Missing artifact jdk.tools:jdk.tools:jar:1.8", add the
following information to the pom.xml file:
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.8</version>
<scope>system</scope>
<systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
The following figure shows the content added to the pom.xml file:
Figure 12-31
Step 4 Modify the pom file.
Add the marked code to the pom file so that the JAR package of the Gauss database is
not introduced to the project.
Figure 12-32
<exclusion>
<groupId>com.huawei.gaussc10</groupId>
<artifactId>gauss</artifactId>
</exclusion>
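For reference, an exclusion only takes effect inside the exclusions element of the dependency that pulls in the Gauss JAR package; the figure shows the exact location in the sample project. In the sketch below, the outer groupId and artifactId are placeholders for that dependency, not values from this guide:
<dependency>
<groupId>placeholder.groupId</groupId>
<artifactId>placeholder-artifactId</artifactId>
<exclusions>
<exclusion>
<groupId>com.huawei.gaussc10</groupId>
<artifactId>gauss</artifactId>
</exclusion>
</exclusions>
</dependency>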
Right-click the project name and choose Build Path > Configure Build Path from the
shortcut menu.
Select the existing JRE System Library entry, click Remove, click Add Library, select JRE
System Library, and click Next.
Figure 12-33
Select JDK1.8 from the Alternate JRE drop-down list and click Finish.
Figure 12-34
Select Java Compiler, set Compiler compliance level to 1.8, and click OK.
Figure 12-35
Select Yes.
Figure 12-36
The project architecture is as follows:
Figure 12-37
Click the cluster name in the cluster list and click Nodes.
Figure 12-38
Step 2 Log in to the server where the streaming core is located.
Figure 12-39
Select EIP and click View EIP to purchase an EIP (select Pay-per-use billing). If you already
purchased sufficient EIPs when creating the cluster, click Bind EIP instead. After the
purchase is complete, the Elastic Cloud Server page is displayed.
Figure 12-40
Step 3 Bind an IP address.
Figure 12-41
Select an IP address and click OK.
Figure 12-42
Refresh the page. You can see that the EIP is bound successfully.
Figure 12-43
Figure 12-44
Step 2 Log in as user admin.
Figure 12-45
Step 3 Check the status of the ZooKeeper service.
Choose Services > Service ZooKeeper > Instance. The business IP address of ZooKeeper is
displayed.
Figure 12-46
Figure 12-47
Step 2 Log in as user admin.
Figure 12-48
Step 3 Check the status of the Kafka service.
Choose Services > Service Kafka > Instance. The business IP address of the Kafka Broker is
displayed.
Figure 12-49
hdfs fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path>: start directory to be checked
-move: to move the damaged file to /lost+found
-delete: to delete the damaged file
-openforwrite: to show the file that is being written
-files: to show all checked files
-blocks: to show the block report
-locations: to show the location of each block
-racks: to show the network topology of the DataNode
By default, fsck ignores files that are being written, and you can use the -openforwrite
option to report such files.
Run the hdfs fsck /1001/hive.log -racks command to view the topology information of
/1001/hive.log.
Figure 12-50
The detailed information about each block in the file is displayed, including the rack
information of the DataNode.
In the Flink exercise, the yarn-session.sh script is used to start a Flink cluster. This is a
Yarn application. Run the following command to view the Yarn application:
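A typical form of this command is:
yarn application -list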
Figure 12-51
Run the following command to kill the Yarn application. For example, to kill the Flink
cluster application, run the -list command to obtain the application ID, and then run the
kill command.
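For example (the application ID below is a placeholder; use the ID returned by the -list command):
yarn application -kill application_1234567890123_0001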
Figure 12-52