BDC Final Record

Big Data Computing


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

(CSE-DS, CYS, AI&DS)

EXPERIMENT NO: 1 DATE:

1. HDFS (Storage)
A. Hadoop Storage File System: Your first objective is to create a directory
structure in HDFS using HDFS commands. Create local files using Linux
commands and move the files to an HDFS directory, and vice versa.
I. Write a command to create the directory structure in HDFS.
II. Write a command to move a file from the local Unix/Linux machine to HDFS.
B. Viewing Data Contents, Files and Directories. Try to perform these simple
steps:
Write an HDFS command to see the contents of files in HDFS.
C. Getting File Data from HDFS to the Local Disk
I. Write an HDFS command to copy a file from HDFS to the local file system. To
process any data, first move the data into HDFS. All files stored in HDFS can be
accessed using HDFS commands.

Ans) A)
I) [cloudera@quickstart ~]$ hadoop fs -mkdir Directory

II) [cloudera@quickstart ~]$ hadoop fs -put /home/cloudera/Desktop/myfile.txt Directory

[cloudera@quickstart ~]$ hadoop fs -ls Directory

O/P:-

Found 1 items
-rw-r--r--   1 cloudera cloudera   14 2023-05-02 01:36 Directory/myfile.txt


B) [cloudera@quickstart ~]$ hadoop fs -cat Directory/myfile.txt


Hello, World!

C) [cloudera@quickstart ~]$ hadoop fs -get Directory/myfile.txt /home/cloudera/Desktop/myfile1.txt
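
The same transfers can also be done with -copyFromLocal (equivalent to -put) and -copyToLocal (equivalent to -get). A small optional cross-check, where myfile2.txt is just an illustrative name:

[cloudera@quickstart ~]$ hadoop fs -copyToLocal Directory/myfile.txt /home/cloudera/Desktop/myfile2.txt
[cloudera@quickstart ~]$ ls /home/cloudera/Desktop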


EXPERIMENT NO: 2 DATE:

A. Develop a MapReduce example program in a MapReduce environment to find
out the number of occurrences of each word in a text file.
1. Create a new Java project and name it WordCount, then click Next.
2. Then go to Libraries, click Add External JARs, navigate to
File System -> usr -> lib -> hadoop,
select all the JAR files, and click OK.


3. Click Add External JARs again, navigate to
File System -> usr -> lib -> hadoop -> client,
select all the files, and click OK.


4. Click Finish.
5. Now click on WordCount -> src and create a new class WordCount.java.
From the Apache Hadoop documentation, go to the MapReduce tutorial and copy the Java code.
Java code:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{


    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
6. Save the Java code and run it.
7. Right-click on the WordCount folder -> Export -> select JAR file -> click Next.

8. Name the JAR file, select the location to which it should be exported, then click
OK -> Finish.


9. Go to the location where the JAR file was exported and verify that it exists.

10. Now open a terminal and perform the following operations.

Check for the wordcount.jar file, then run the job as shown below.
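
A minimal sketch of the terminal steps, assuming the JAR was exported as wordcount.jar in the home directory, the main class is WordCount, and myfile.txt is the input text file; the HDFS paths /in and /out are illustrative:

[cloudera@quickstart ~]$ ls wordcount.jar
[cloudera@quickstart ~]$ hadoop fs -mkdir /in
[cloudera@quickstart ~]$ hadoop fs -put /home/cloudera/Desktop/myfile.txt /in
[cloudera@quickstart ~]$ hadoop jar wordcount.jar WordCount /in /out
[cloudera@quickstart ~]$ hadoop fs -cat /out/part-r-00000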


Check the word counts written to /out/part-r-00000.

The final output lists each word in the input file with its count.


EXPERIMENT NO: 3 DATE:

3. Data Processing Tool – Hive (NoSQL query-based language). The Hive command
line tool allows you to submit jobs via bash scripts.
Identifying properties of a data set:
We have a table 'user data' that contains the following fields:
data_date: string
user_id: string
properties: string
The properties field is formatted as a series of attribute=value pairs.
Ex: Age=21; state=CA; gender=M;
Ans)
I. Create the table in Hive using a HiveQL query.

hive> CREATE TABLE IF NOT EXISTS user_data1 (data_date string,
    > user_id string, properties Map<string,string>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> COLLECTION ITEMS TERMINATED BY '#'
> MAP KEYS TERMINATED BY '@'
> STORED AS TEXTFILE;

OK
Time taken: 0.091 seconds

II. Fill the table with sample data.

Create a text file with dummy data and load it using the following command.
Dummy Data:

22-04-2023,1,Age@21#State@CA#Gender@M


23-04-2023,2,Age@21#State@NY#Gender@F
24-04-2023,3,Age@22#State@OH#Gender@F
25-04-2023,4,Age@22#State@OH#Gender@M
26-04-2023,5,Age@23#State@TX#Gender@F
27-04-2023,6,Age@23#State@TN#Gender@M

hive> LOAD DATA INPATH 'Directory/user_database.txt' overwrite into table user_data1;

Loading data to table default.user_data1


chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/user_data1/user_database.txt': User does not belong to supergroup
Table default.user_data1 stats: [numFiles=1, numRows=0, totalSize=234, rawDataSize=0]
OK
Time taken: 0.699 seconds

hive> select * from user_data1;


OK
22-04-2023  1  {"Age":"21","State":"CA","Gender":"M","":null}
23-04-2023  2  {"Age":"21","State":"NY","Gender":"F","":null}
24-04-2023  3  {"Age":"22","State":"OH","Gender":"F","":null}
25-04-2023  4  {"Age":"22","State":"OH","Gender":"M","":null}
26-04-2023  5  {"Age":"23","State":"TX","Gender":"F","":null}
27-04-2023  6  {"Age":"23","State":"TN","Gender":"M","":null}
Time taken: 0.394 seconds, Fetched: 6 row(s)


III. Write a program that produces a list of properties with the minimum value
(min_value), the largest value (max_value) and the number of unique values.
Before you start, execute the prepare step to load the data into HDFS.
IV.
Minimum values:
hive> select t.my_key, min(t.my_key_value) as min_value
> from (
> select explode(properties) as (my_key,my_key_value)
> from user_data1
>)t
> group by t.my_key;
...
...
OK
Age 21
Gender F
State CA
Time taken: 26.099 seconds, Fetched: 3 row(s)

Maximum values:
hive> select t.my_key, max(t.my_key_value) as max_value
> from (
> select explode(properties) as (my_key,my_key_value)
> from user_data1
>)t
> group by t.my_key;
...
...

OK
Age 23
Gender M
State TX
Time taken: 22.254 seconds, Fetched: 3 row(s)

Unique Count:

hive> select t.my_key, count(distinct t.my_key_value) as unique_count


> from (
> select explode(properties) as (my_key,my_key_value)
> from user_data1
>)t
> group by t.my_key;
...
...
OK
Age 3
Gender 2
State 5
Time taken: 19.956 seconds, Fetched: 3 row(s)

V. Generate a count per state.

hive> select my_value, count(*) as state_count


> from (
> select explode(properties) as (my_state, my_value)
> from user_data1
>)t
> where t.my_state = 'State'
> group by t.my_value;
...
...
OK
CA 1
NY 1
OH 2
TN 1
TX 1
Time taken: 21.772 seconds, Fetched: 5 row(s)
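
The same pattern gives a count per gender (a minimal variation of the query above, against the same user_data1 table; with the dummy data it should return F 3 and M 3):

hive> select t.my_value, count(*) as gender_count
    > from (
    > select explode(properties) as (my_key, my_value)
    > from user_data1
    > ) t
    > where t.my_key = 'Gender'
    > group by t.my_value;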

EXPERIMENT NO: 4 DATE:

4. Data Processing Tool – Pig (Pig Latin based scripting language)

The Pig command line tool, like Hive, allows you to submit jobs via bash scripts.
A) Simple Logs
We have a set of log files and need to create a job that runs every hour
and performs some calculations.
The log files are delimited by a 'tab' character and have the following
fields:
a) site
b) hour_of_day
c) page_views
d) data_date

The log files are located in the prepare folder. Load them into HDFS at the
data/pig/simple_logs folder and use them as the input.
Important: In order to load tab-delimited files, use PigStorage('\u0001').

Lab Instructions:
Create a program to:
i. Calculate the total views per hour per day.
ii. Calculate the total views per day.
iii. Calculate the total counts of each hour across all days.
iv. Write a word count script by passing a text file as input.

Ans) Create a text file with dummy data.


Dummy data:
"site1",1,10,"2023-04-27"
"site2",1,5,"2023-04-27"
"site1",2,15,"2023-04-27"
"site2",2,20,"2023-04-27"
"site1",1,5,"2023-04-28"
"site2",1,10,"2023-04-28"
"site1",2,25,"2023-04-28"


"site2",2,30,"2023-04-28"

i, ii & iii)
Create a Pig script file with the following commands and execute it.
-- Load the log data from HDFS
logs = LOAD '/home/cloudera/sample_logs.txt' USING PigStorage(',') AS
(site:chararray, hour:int, views:int, date:chararray);

-- Calculate total views per hour per day


hourly_views = GROUP logs BY (date, hour);
hourly_views_sum = FOREACH hourly_views GENERATE group.date AS
date, group.hour AS hour, SUM(logs.views) AS views;

-- Calculate total views per day


daily_views = GROUP logs BY date;
daily_views_sum = FOREACH daily_views GENERATE group AS date,
SUM(logs.views) AS views;

-- Calculate the total counts of each hour across all days


hour_counts = GROUP logs BY hour;
hour_counts_sum = FOREACH hour_counts GENERATE group AS hour,
SUM(logs.views) AS views;
-- Store the results in HDFS
STORE hourly_views_sum INTO '/home/cloudera/hourly_views' USING
PigStorage(',');

STORE daily_views_sum INTO '/home/cloudera/daily_views' USING


PigStorage(',');
STORE hour_counts_sum INTO '/home/cloudera/hour_counts' USING
PigStorage(',');

O/P:- The results are stored at the given paths under /home/cloudera (hourly_views,
daily_views, hour_counts). Each output is produced as a folder containing a part file
with the expected answer.

1) "2023-04-27",1,15
"2023-04-27",2,35
"2023-04-28",1,15

"2023-04-28",2,55

2) hour_counts:
1,30
2,90

3)"2023-04-27",50
"2023-04-28",70

4) Create a Pig script file with the following commands and execute it.
Give some text file's path as input in place of /home/cloudera/myfile.txt.
data = LOAD '/home/cloudera/myfile.txt' AS (line:chararray);
words = FOREACH data GENERATE FLATTEN(TOKENIZE(line)) AS
word;
word_counts = GROUP words BY word;
word_count_totals = FOREACH word_counts GENERATE group AS word,
COUNT(words) AS count;
STORE word_count_totals INTO '/home/cloudera';

O/P:- It is stored at the given path, i.e., /home/cloudera. The output is produced as a
folder containing a text file. The text file has the expected answer.
Hello 1
World! 1
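
Optionally, the counts can be ordered before dumping or storing them (a small sketch that reuses the word_count_totals relation from the script above):

sorted_counts = ORDER word_count_totals BY count DESC;
DUMP sorted_counts;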


EXPERIMENT NO: 5 DATE:

5. SQOOP
I. Create a table in Hive using the Hive query language.

[cloudera@quickstart ~]$ hive

hive> show databases;

O/P:-

OK

aks

aksra

ara

default

hive> use default;

O/P:-

OK

Time taken: 0.026 seconds

hive> show tables;

O/P:-

OK

my_table

user_data

user_data1

hive> CREATE TABLE my_table (
    > id INT,
    > name STRING,
    > age INT
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;

O/P:-

OK


hive> INSERT INTO my_table VALUES
    > (1, 'Alice', 25),
    > (2, 'Bob', 30),
    > (3, 'Charlie', 35),
    > (4, 'Dave', 40),
    > (5, 'Eve', 45),
    > (6, 'Frank', 50),
    > (7, 'Grace', 55),
    > (8, 'Henry', 60),
    > (9, 'Isabel', 65),
    > (10, 'John', 70);

O/P:-

Loading data to table default.my_table

Table default.my_table stats: [numFiles=1, numRows=10, totalSize=108, rawDataSize=98]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.08 sec HDFS Read: 4115 HDFS Write: 180 SUCCESS


Total MapReduce CPU Time Spent: 1 seconds 80 msec

OK

II. Import the SQL table data into Hive using the Sqoop tool.

To activate MySQL : $ mysql -u root -pcloudera

MySQL Database name : retail_db

MySQL Table name : categories

$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table categories \
--hive-import \
--hive-table demo3
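
Assuming the import finishes successfully, the imported table can be checked from the Hive shell (demo3 is the Hive table name used above):

hive> show tables;
hive> select * from demo3 limit 5;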


III. Export Hive table data to the local machine and into SQL.

1) Create a Hive table with data (categories).

a) Use DESCRIBE FORMATTED table_name; to check the table location and field
delimiters, e.g. '\0001', ',' etc.

2) Create a MySQL table (cats) with the same schema as the Hive table.

3) Run the following command in Cloudera and check the result in MySQL.

$ sqoop export --connect jdbc:mysql://localhost/mysql -m 1 --table cats \
--export-dir /user/hive/warehouse/categories \
--input-fields-terminated-by '\0001' \
--username root --password cloudera;
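
If the export completes, the exported rows can be verified from the MySQL shell (a quick sanity check against the cats table created in step 2; the database name follows the connect string above):

mysql> use mysql;
mysql> select * from cats limit 5;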


EXPERIMENT NO: 6 DATE:

Exploring ARIMA model

Time series: prediction of future values, assuming all other factors are constant.

• Different methods of doing time series analysis and forecasting:

o ARIMA model.

o Seasonal ARIMA (SARIMA). Most used.

o Holt-Winters Exponential Smoothing. Easiest and an effective model.

o Advanced models.

1. Important concepts and terminology in time series analysis:

o Stationarity.

▪ A stationary time series is one whose properties (i.e. mean, variance,
autocorrelation) do not depend on time.

o Autoregression AR.

o Moving Average MA

o Integration & Difference

o ACF and PACF Plots

o Time series components:

▪ Trend: long term smooth movement, upward or downward


▪ Seasonal: periodic fluctuation with a period of less than 1 year, most commonly
found in industry.

▪ Cyclical: periodic fluctuation with a period of more than 1 year.

▪ Irregularity: random movement.

ARIMA is the most common model used for time series forecasting. It has 3
components:

1. Autoregression AR.

2. Moving Average MA

3. Integrated

1. Autoregression AR.

o Future values of Y depend on previous lagged values of Y.

o Regression of yt on yt-1, yt-2.

o p = order of AR: the number of previous lagged values that the current value of Y
depends on. If p = 2, then yt depends on yt-1 and yt-2.

o p is obtained from the PACF plot (interpretation of the PACF is discussed below).

2. Moving Average MA.

o Future values of Y depend on previous lagged values of white noise, i.e. the
irregular component. White noise is just the error; the error is the difference between
the actual value and the predicted value, so we also take the error into consideration
to predict the future value.

o Autocorrelation between the errors.

o The trend, seasonal and cyclical components of the time series are captured in AR,
whereas the irregular component is captured in MA.

o q is the order of MA.

o ACF gives q.

3. Integrated

o Integrated refers to the number of times we difference the data; we then have to
integrate it back to get the original series.

o We difference the data to remove trend and seasonality and make the series
stationary, since only after making a series stationary can we apply AR and MA.
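
As an illustration of checking stationarity before differencing, here is a minimal sketch using the Augmented Dickey-Fuller test from statsmodels. It assumes the monthly sales data used in the notebook below is available as a pandas Series df['Sales']; the file name sales.csv is only illustrative.

import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Assumption: the monthly sales data is in a CSV with columns Month and Sales (file name is illustrative)
df = pd.read_csv('sales.csv', parse_dates=['Month'], index_col='Month')

result = adfuller(df['Sales'].dropna())
print('ADF statistic:', result[0])
print('p-value:', result[1])  # a p-value above 0.05 suggests the series is not yet stationary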


Differencing
In [22]: df['Sales First Difference'] = df['Sales'] - df['Sales'].shift(1)
In [212]: df['Sales'].shift(1)

Out[212]: Month
1964-01-01 NaN
1964-02-01 2815.0
1964-03-01 2672.0
1964-04-01 2755.0
1964-05-01 2721.0
1964-06-01 2946.0
1964-07-01 3036.0
1964-08-01 2282.0
1964-09-01 2212.0
1964-10-01 2922.0
1964-11-01 4301.0
1964-12-01 5764.0
1965-01-01 7312.0
1965-02-01 2541.0
1965-03-01 2475.0
1965-04-01 3031.0
1965-05-01 3266.0
1965-06-01 3776.0
1965-07-01 3230.0
1965-08-01 3028.0
1965-09-01 1759.0
1965-10-01 3595.0
1965-11-01 4474.0
1965-12-01 6838.0
1966-01-01 8357.0
1966-02-01 3113.0
1966-03-01 3006.0
1966-04-01 4047.0
1966-05-01 3523.0
1966-06-01 3937.0
...
1970-04-01 3370.0
1970-05-01 3740.0
1970-06-01 2927.0
1970-07-01 3986.0
1970-08-01 4217.0
1970-09-01 1738.0
1970-10-01 5221.0
1970-11-01 6424.0
1970-12-01 9842.0


▪ For an AR model, the theoretical PACF “shuts off” past the order of the model. The
phrase “shuts off” means that in theory the partial autocorrelations are equal to 0
beyond that point. Put another way, the number of non-zero partial autocorrelations
gives the order of the AR model. By the “order of the model” we mean the most
extreme lag of x that is used as a predictor.

• Identification of an MA model is often best done with the ACF rather than the PACF.
▪ For an MA model, the theoretical PACF does not shut off, but instead tapers toward 0
in some manner. A clearer pattern for an MA model is in the ACF. The ACF will have
non-zero autocorrelations only at lags involved in the model.

p, d, q: p = order of AR (number of lags), d = order of differencing, q = order of MA.
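
Once p, d and q have been read off the PACF and ACF plots, the model can be fitted with statsmodels. A minimal sketch, assuming the same df['Sales'] series as above and an illustrative order of (1, 1, 1):

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# Assumption: same illustrative CSV as in the stationarity sketch above
df = pd.read_csv('sales.csv', parse_dates=['Month'], index_col='Month')

# PACF suggests p, ACF suggests q
plot_pacf(df['Sales'].dropna())
plot_acf(df['Sales'].dropna())
plt.show()

# Fit ARIMA with an illustrative order (p, d, q) = (1, 1, 1); in practice choose the order from the plots
model = ARIMA(df['Sales'], order=(1, 1, 1))
fitted = model.fit()
print(fitted.summary())

# Forecast the next 12 months
print(fitted.forecast(steps=12))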
