BDC Final Record
1. HDFS (Storage)
A. Hadoop Storage File System: Your first objective is to create a directory
structure in HDFS using HDFS commands. Create local files using Linux
commands, then move the files into an HDFS directory and vice versa.
I. Write a command to create the directory structure in HDFS.
II. Write a Command to move file from local unix/linux machine to HDFS.
B. Viewing Data Contents, Files and Directories. Try to perform this simple
step:
Write HDFS command to see the contents of files in HDFS.
C. Getting File Data from HDFS to the Local Disk
I. Write an HDFS command to copy a file from HDFS to the local file system. To
process any data, first move the data into HDFS. All files stored in HDFS can
be accessed using HDFS commands.
Ans) A)
I) [cloudera@quickstart ~]$ hadoop fs -mkdir Directory
The mkdir command prints nothing on success; the listing below comes from
running hadoop fs -ls Directory after a file was moved in:
O/P:-
Found 1 items
-rw-r--r-- 1 cloudera cloudera 14 2023-05-02 01:36 Directory/myfile.txt
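The commands for II, B, and C are not preserved in the record; typical forms
(file names illustrative) would be:
II) [cloudera@quickstart ~]$ hadoop fs -put myfile.txt Directory/
(or hadoop fs -moveFromLocal myfile.txt Directory/ to move rather than copy)
B) [cloudera@quickstart ~]$ hadoop fs -cat Directory/myfile.txt
C) [cloudera@quickstart ~]$ hadoop fs -get Directory/myfile.txt /home/cloudera/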
2. Data Processing Tool – MapReduce (WordCount)
4. Click on Finish.
5. Now, under wordCount -> src, create a new class WordCount.java.
From the Apache Hadoop documentation, go to the MapReduce tutorial and copy
the Java code.
Java code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
  // Mapper: emit (word, 1) for every token in the input
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reducer: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
6. Save the Java code and make sure it compiles.
7. Right-click on the WordCount project -> Export -> select JAR file -> click
Next.
8. Name the JAR file and select the location where it is to be exported ->
OK -> Finish.
9. Go to the location where the JAR file was exported and verify it.
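The run step is not shown in the record; with the input file already in HDFS,
the exported JAR would typically be executed as follows (paths illustrative):
[cloudera@quickstart ~]$ hadoop jar WordCount.jar WordCount input output
[cloudera@quickstart ~]$ hadoop fs -cat output/part-r-00000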
3. Data Processing Tool – Hive (NoSQL query-based language). The Hive
command-line tool allows you to submit jobs via bash scripts.
Identifying properties of a data set:
We have a table 'user_data' that contains the following fields:
data_date: string
user_id: string
properties: string
The properties field is formatted as a series of attribute=value pairs.
Ex: Age=21; state=CA; gender=M;
Ans)
I. Create the table in HIVE using hive nosql based query.
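The CREATE TABLE statement itself is not preserved in the record; a statement
consistent with the dummy data below (',' between fields, '#' between pairs,
'@' between key and value) and with the explode() query used later would be:
hive> CREATE TABLE user_data1 (
    >   data_date STRING,
    >   user_id STRING,
    >   properties MAP<STRING,STRING>
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY '#'
    > MAP KEYS TERMINATED BY '@';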
OK
Time taken: 0.091 seconds
II. Fill the table with sample data.
Create a text file with dummy data and load it using the LOAD DATA command
shown after the data.
Dummy Data:
22-04-2023,1,Age@21#State@CA#Gender@M
23-04-2023,2,Age@21#State@NY#Gender@F
24-04-2023,3,Age@22#State@OH#Gender@F
25-04-2023,4,Age@22#State@OH#Gender@M
26-04-2023,5,Age@23#State@TX#Gender@F
27-04-2023,6,Age@23#State@TN#Gender@M
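Assuming the dummy data is saved as /home/cloudera/user_data.txt (file name
illustrative), it is loaded with:
hive> LOAD DATA LOCAL INPATH '/home/cloudera/user_data.txt'
    > INTO TABLE user_data1;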
Maximum values:
hive> select t.my_key, max(t.my_key_value) as max_value
    > from (
    >   select explode(properties) as (my_key, my_key_value)
    >   from user_data1
    > ) t
    > group by t.my_key;
...
...
OK
Age 23
Gender M
State TX
Time taken: 22.254 seconds, Fetched: 3 row(s)
Unique Count:
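The query for this part is missing from the record; by analogy with the
maximum-values query above, a per-key distinct count could be written as
(a sketch, not the original answer):
hive> select t.my_key, count(distinct t.my_key_value) as unique_count
    > from (
    >   select explode(properties) as (my_key, my_key_value)
    >   from user_data1
    > ) t
    > group by t.my_key;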
4. Data Processing Tool – Pig
The log files are located in the prepare folder. Load them into HDFS at the
data/pig/simple_logs folder and use them as the input.
Important: to load delimiter-separated files, pass the delimiter to
PigStorage(), e.g. PigStorage('\t') for tab-delimited input.
Lab Instructions:
Create a program to:
i. Calculate the total views per hour per day.
ii. Calculate the total views per day.
iii. Calculate the total counts of each hour across all days.
iv. Write a word-count script that takes a text file as input.
"site2",2,30,"2023-04-28"
i, ii & iii)
Create a pig script file with following commands and execute it.
-- Load the log data from HDFS
logs = LOAD '/home/cloudera/sample_logs.txt' USING PigStorage(',') AS
(site:chararray, hour:int, views:int, date:chararray);
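Only the LOAD statement survives in the record; grouping and aggregation steps
consistent with the outputs below might look like this (relation names
illustrative):
-- (i) total views per hour per day
by_day_hour = GROUP logs BY (date, hour);
views_day_hour = FOREACH by_day_hour GENERATE group.date, group.hour, SUM(logs.views);
-- (ii) total views per day
by_day = GROUP logs BY date;
views_day = FOREACH by_day GENERATE group AS date, SUM(logs.views);
-- (iii) total counts of each hour across all days
by_hour = GROUP logs BY hour;
views_hour = FOREACH by_hour GENERATE group AS hour, SUM(logs.views);
-- store each result under a separate output folder
STORE views_day_hour INTO '/home/cloudera/out_i' USING PigStorage(',');
STORE views_day INTO '/home/cloudera/out_ii' USING PigStorage(',');
STORE views_hour INTO '/home/cloudera/out_iii' USING PigStorage(',');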
O/P:- The results are stored at the given path, i.e., under /home/cloudera.
Each output is produced as a folder containing a part file; the part file
holds the expected answer.
1) "2023-04-27",1,15
"2023-04-27",2,35
"2023-04-28",1,15
"2023-04-28",2,55
2) 1,30
2,90
3) "2023-04-27",50
"2023-04-28",70
4) Create a pig script file with the following commands and execute it.
Give some text file's path as input instead of /home/cloudera/myfile.txt.
data = LOAD '/home/cloudera/myfile.txt' AS (line:chararray);
words = FOREACH data GENERATE FLATTEN(TOKENIZE(line)) AS word;
word_counts = GROUP words BY word;
word_count_totals = FOREACH word_counts GENERATE group AS word, COUNT(words) AS count;
STORE word_count_totals INTO '/home/cloudera/wordcount_output';
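The script can be saved as, say, wordcount.pig (name illustrative) and
executed with:
[cloudera@quickstart ~]$ pig wordcount.pig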
5. SQOOP
I. Create a table in Hive using the Hive query language.
hive> show databases;
O/P:-
OK
aks
aksra
ara
default
O/P:-
OK
hive> show tables;
O/P:-
OK
my_table
user_data
user_data1
Only fragments of the CREATE TABLE statement and of its job summary survive
in the record:
> id INT,
> )
O/P:-
OK
O/P:- (truncated)
... rawDataSize=98]
Stage-Stage-1: Map: 1 Cumulative CPU: 1.08 sec HDFS Read: 4115 HDFS ...
OK
II. Import the SQL table data into Hive using the Sqoop tool.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table categories \
--hive-import \
--hive-table demo3
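The import can be verified from the Hive shell (a sketch, not part of the
original record):
hive> show tables;
hive> select * from demo3 limit 5;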
III. Export Hive table data to the local machine and into SQL.
2) Create a MySQL table (cats) with the same schema as the Hive table.
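The export command itself is not preserved; a typical Sqoop export (the
export-dir path and delimiter are assumptions based on Hive defaults) would
be:
$ sqoop export \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table cats \
--export-dir /user/hive/warehouse/demo3 \
--input-fields-terminated-by '\001'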
6. Time Series Forecasting – ARIMA
Time series: prediction of future values, assuming all other factors are
constant.
o ARIMA model
o Advanced models
o Autoregression (AR)
o Moving Average (MA)
ARIMA is the most common model used for time series forecasting in industry.
It has 3 components:
1. Autoregression (AR): the current value is regressed on its own lagged
(past) values.
o p is the order of AR.
o PACF gives p (see the notes on ACF and PACF at the end of this section).
2. Moving Average (MA): models the irregular component. White noise is just
the error, and the error is the difference between the actual value and the
predicted value; we take this error into consideration too, so past errors
are captured in MA.
o q is the order of MA.
o ACF gives q.
3. Integrated (I): the series is differenced to make it stationary, and the
model is fitted after differencing; d is the number of differences taken.
Differencing
In [22]: df['Sales First Difference'] = df['Sales'] - df['Sales'].shift(1)
In [212]: df['Sales'].shift(1)
Out[212]: Month
1964-01-01 NaN
1964-02-01 2815.0
1964-03-01 2672.0
1964-04-01 2755.0
1964-05-01 2721.0
1964-06-01 2946.0
1964-07-01 3036.0
1964-08-01 2282.0
1964-09-01 2212.0
1964-10-01 2922.0
1964-11-01 4301.0
1964-12-01 5764.0
1965-01-01 7312.0
1965-02-01 2541.0
1965-03-01 2475.0
1965-04-01 3031.0
1965-05-01 3266.0
1965-06-01 3776.0
1965-07-01 3230.0
1965-08-01 3028.0
1965-09-01 1759.0
1965-10-01 3595.0
1965-11-01 4474.0
1965-12-01 6838.0
1966-01-01 8357.0
1966-02-01 3113.0
1966-03-01 3006.0
1966-04-01 4047.0
1966-05-01 3523.0
1966-06-01 3937.0
...
1970-04-01 3370.0
1970-05-01 3740.0
1970-06-01 2927.0
1970-07-01 3986.0
1970-08-01 4217.0
1970-09-01 1738.0
1970-10-01 5221.0
1970-11-01 6424.0
1970-12-01 9842.0
• Identification of an AR model is often best done with the PACF.
▪ For an AR model, the theoretical PACF “shuts off” past the order of the model. The
phrase “shuts off” means that in theory the partial autocorrelations are equal to 0
beyond that point. Put another way, the number of non-zero partial autocorrelations
gives the order of the AR model. By the “order of the model” we mean the most
extreme lag of x that is used as a predictor.
• Identification of an MA model is often best done with the ACF rather than the PACF.
▪ For an MA model, the theoretical PACF does not shut off, but instead tapers toward 0
in some manner. A clearer pattern for an MA model is in the ACF. The ACF will have
non-zero autocorrelations only at lags involved in the model.
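A minimal sketch (assuming the same monthly sales DataFrame df used above; the
In[] numbers and the stationarity check are illustrative, not from the record)
of how the differenced series is tested and how the ACF/PACF described above
are plotted to read off q and p:
In [23]: from statsmodels.tsa.stattools import adfuller
In [24]: result = adfuller(df['Sales First Difference'].dropna())
In [25]: print('ADF statistic:', result[0], 'p-value:', result[1])
         # a small p-value (< 0.05) suggests the differenced series is stationary
In [26]: from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
In [27]: plot_acf(df['Sales First Difference'].dropna())   # lag where ACF cuts off -> q
In [28]: plot_pacf(df['Sales First Difference'].dropna())  # lag where PACF cuts off -> p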