2.2. Project: Macro-Econometrics

This document summarizes a student's M.Sc. econometrics part 2 project on predicting Thailand's wholesale price index from 1960 to 1990. The student analyzes the time series data, finds it is non-stationary, takes the first difference to make it stationary, then fits an ARIMA model. Specifically, the student 1) checks for stationarity using the ADF test, 2) differences the data once to achieve stationarity, and 3) identifies candidate ARIMA(p,d,q) models from ACF and PACF plots, selecting ARIMA(3,1,1).


POSTGRADUATE PROGRAM DIRECTORATE

College of Computing and Informatics


Department Of Statistics
Program: M.SC. In Econometrics
Part 2 projects of Macro-Econometrics
Instructor: Dr. Salie Ayalew, PhD in Statistics (specialization in Econometrics)

Prepared By: Kedir Mohammed; ID: 0632/11

December 2020

Haramaya University, Ethiopia


Part 2 projects

1.1. Problem Description


 Check whether it is possible to fit a stationary model and obtain an appropriate number of
parameters
 Fit an appropriate model, which may be useful for forecasting
 The problem is to predict Thailand's quarterly wholesale price index. We will use the
Thailand wholesale price index dataset for this project. This dataset describes Thailand's
wholesale price index over time. There are 125 quarterly observations (1960q–1990q).
Below is a sample of the first few rows of the dataset.
> head(dataa)
t Twpi
1 1960q1 30.7
2 1960q2 30.8
3 1960q3 30.7
4 1960q4 30.7
5 1961q1 30.8
6 1961q2 30.5

1.2. Data Preparation and Analysis


This data is obtained from the World Bank (International Monetary Fund, International
Financial Statistics and data files) and consists of the quarterly reported Thailand
wholesale price index (1960q–1990q).

Let’s begin the data analysis by looking into the summary statistics; we will get a
quick idea of the data distribution.

> summary(dataa)
t Twpi
1960q1 : 1 Min. : 30.50
1960q2 : 1 1st Qu.: 32.58
1960q3 : 1 Median : 56.60
1960q4 : 1 Mean : 62.77
1961q1 : 1 3rd Qu.: 96.88
1961q2 : 1 Max. :116.20
(Other):118
We can see the number of observations matches our expectations. The mean is
about 62.77, which we can consider the level of this series. Other statistics, such as
the quartiles and the range, suggest a large spread in the data.
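As a quick cross-check, the same summary statistics can be computed with Python's standard library. This is only a sketch using the six sample values shown in head(dataa); the real analysis uses all 125 observations:

```python
import statistics as st

# First six Twpi values from head(dataa); the full series has 125 observations.
twpi = [30.7, 30.8, 30.7, 30.7, 30.8, 30.5]

print("n      :", len(twpi))
print("mean   :", round(st.mean(twpi), 2))
print("median :", round(st.median(twpi), 2))
print("stdev  :", round(st.stdev(twpi), 3))
```

On the full dataset these would reproduce the mean of 62.77 and the quartiles reported by summary(dataa).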
[Figure: Time Series Plot of Twpi (y-axis: Twpi, 20–120; x-axis: observation index, 1–120)]

Here, the line plot suggests that there is an increasing trend of Thailand wholesale
price index over time. This insight gives us a hint that data may not be stationary
and we can explore differencing with one level to make it stationary before
modeling.

1.3. Set up an Evaluation Framework


 Before proceeding to model building, we must develop an evaluation
framework to assess the data and evaluate different models. The first step is
defining a validation dataset. This is historical data, so we cannot collect
updated data from the future to validate this model. Therefore, we will use the
last 12 quarters of the series as the validation dataset. We will split this time
series into two subsets, training and validation; throughout this project we will
use the training dataset, named 'dataset', to build and test different models. The
selected models will be validated on the 'validation' dataset. The training set has
125 observations and the validation set has 12 observations.
The baseline prediction for time series forecasting is known as the naive
forecast. We will use walk-forward validation, which can be considered the
time-series analogue of k-fold cross-validation.
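The walk-forward scheme with a naive (last-value) baseline can be sketched in a few lines of pure Python. The function name walk_forward_naive is illustrative, not from the project code:

```python
def walk_forward_naive(series, n_valid):
    """Walk-forward validation of the naive (persistence) forecast.

    The last n_valid points are held out; at each step we forecast the next
    value as the last observed value, then append the actual value to the
    history before forecasting the following step.
    """
    history = list(series[:-n_valid])
    sq_errors = []
    for actual in series[-n_valid:]:
        forecast = history[-1]              # naive forecast: last observed value
        sq_errors.append((actual - forecast) ** 2)
        history.append(actual)              # roll the window forward
    return (sum(sq_errors) / n_valid) ** 0.5  # RMSE over the validation set
```

For example, walk_forward_naive([1, 2, 3, 4, 5], 2) scores the naive forecasts 3 and 4 against the actuals 4 and 5, giving an RMSE of 1.0.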

1.4. Stationary Check: Augmented Dickey-Fuller test


 The ADF test is comparable with the simple DF test, but it is augmented by
adding lagged values of the first difference of the dependent variable as
additional regressors, which are required to account for the possible occurrence of
autocorrelation. Consider the AR(p) model written in test-regression form:

Δy_t = α + βt + γy_{t−1} + δ_1Δy_{t−1} + … + δ_{p−1}Δy_{t−p+1} + ε_t

 The test statistic is t = γ̂ / SE(γ̂), where γ̂ is the OLS estimate of γ. It
has to be compared against the 95% critical value of
the appropriate DF distribution, which depends on the inclusion of the linear
trend and the lag structure. We then use the t-statistic on the coefficient γ to test
whether we need to difference the data to make it stationary, or whether we need to
include a time trend in the regression model to correct for a deterministic
trend. The null hypothesis of the test is H0: γ = 0, i.e., there exists a
unit root problem.
 We will confirm our hypothesis using this Augmented Dickey-Fuller test. This
is a statistical test; it uses an autoregressive model and optimizes an information
criterion across multiple different lag values.
 The null hypothesis of the test is that the time series is not stationary.
 As we have a strong intuition that the time series is not stationary, let's create a
new series of differenced values and check this transformed series for
stationarity.
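To make the mechanics concrete, here is a pure-Python sketch of the simple (non-augmented) Dickey-Fuller regression Δy_t = α + γy_{t−1} + ε_t; the test statistic is the t-ratio on γ̂. This is a simplification of the ADF used in the project (no lagged differences, no trend term), and the function name is illustrative:

```python
def df_tstat(y):
    """t-statistic on gamma in the regression dy_t = alpha + gamma*y_{t-1} + e_t.

    Strongly negative values are evidence against a unit root; values near
    zero indicate non-stationarity.
    """
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]   # first differences
    x = y[:-1]                                         # lagged level y_{t-1}
    n = len(dy)
    mx, my = sum(x) / n, sum(dy) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    gamma = sum((xi - mx) * (di - my) for xi, di in zip(x, dy)) / sxx
    alpha = my - gamma * mx
    resid = [di - alpha - gamma * xi for xi, di in zip(x, dy)]
    s2 = sum(e * e for e in resid) / (n - 2)           # residual variance
    return gamma / (s2 / sxx) ** 0.5
```

The t-ratio must be compared to Dickey-Fuller critical values (such as those printed below), not to ordinary t-tables.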
Log-transformed data
 As we observe a great difference between the minimum and the maximum, we
plot the log-transformed data over time to increase the possibility of mean
stationarity. From the plot, we can observe that the log-transformed data behave
much better than the untransformed data; notice that the range of
data values is much smaller. Although the log-transformed plot looks denser and
better-shaped than before, it is still not enough to ensure mean
stationarity.
[Figure: Time Series Plot of ln_Twpi (y-axis: ln_Twpi, 3.50–4.75; x-axis: observation index, 1–120)]


ADF Statistic = -0.0397
P-value = 0.9596
Critical values:
1%: -3.4851
5%: -2.8854
10%: -2.5795

 The results show that the test statistic value -0.0397 is larger (less negative) than
the critical value at 1% of -3.4851. This means we cannot reject the null
hypothesis at the 1% significance level.
 The test statistic value -0.0397 is also larger than the critical
value at 5% of -2.8854, so we cannot reject the null hypothesis
at the 5% significance level.
 Likewise, -0.0397 is larger than the critical
value at 10% of -2.5795, so we cannot reject the null hypothesis
at the 10% significance level.

 Failing to reject the null hypothesis at every level (the p-value of 0.9596
confirms this) means that the time series is non-stationary, so it must be
differenced before modeling.

 It is therefore appropriate to use a differenced dataset as the input for our ARIMA model.
Since the series becomes stationary after one difference, the parameter d can be set to 1.
 The remaining parameters are p and q. We can identify these
parameters using the Autocorrelation Function (ACF) and Partial Autocorrelation
Function (PACF).

1.5. ARIMA Models


 ARIMA stands for Autoregressive Integrated Moving Average.
 This model is the combination of autoregression, a moving average model
and differencing. In this context, integration is the opposite of differencing.
 Differencing is useful for removing the trend in a time series and making it
stationary.
 It simply involves subtracting the value at time t-1 from the value at time t. Realize
that you will therefore lose the first data point in a time series if you apply differencing
once.
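The differencing operation described above can be sketched in a few lines of Python (an illustrative helper, not from the project code):

```python
def difference(series, d=1):
    """Apply d rounds of first differencing; each round loses one leading point."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out
```

For example, difference([30.7, 30.8, 30.7]) yields approximately [0.1, -0.1], one observation shorter than the input.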
 Mathematically, the ARIMA (p,d,q) now requires three parameters:

 p: the order of the autoregressive process


 d: the degree of differencing (number of times it was differenced)
 q: the order of the moving average process

And the equation is expressed as:

y'_t = c + φ_1 y'_{t−1} + … + φ_p y'_{t−p} + θ_1 ε_{t−1} + … + θ_q ε_{t−q} + ε_t

where y'_t is the d-times differenced series, the φ_i are the AR coefficients, the θ_j are the MA coefficients, and ε_t is white noise.

Just like with ARMA models, the ACF and PACF cannot, in general, be used to identify
reliable values for p and q when both are non-zero.

 However, in the presence of an ARIMA(p,d,0) process:

 the ACF is exponentially decaying or sinusoidal


 the PACF has a significant spike at lag p but none after

 Similarly, in the presence of an ARIMA(0,d,q) process:

 the PACF is exponentially decaying or sinusoidal


 the ACF has a significant spike at lag q but none after
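The sample ACF used in these identification rules can be computed directly. This is a plain-Python sketch; statistical packages such as Minitab or R's acf() do the same and add significance bands:

```python
def sample_acf(series, nlags):
    """Sample autocorrelations r_0..r_nlags (r_0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series)   # lag-0 sum of squares
    return [
        sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k)) / c0
        for k in range(nlags + 1)
    ]
```

A common rule of thumb for the 5% significance limits drawn on the correlograms is ±1.96/√n.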

 Let's walk through an example of modelling with ARIMA to get some hands-on
experience and better understand some modelling concepts.
i) Stationarity Checking and Differencing
 To test whether our data are stationary, we first plot the series and then its
correlograms. The plots below indicate that the series is not stationary.

[Figure: Time Series Plot of ln_Twpi (y-axis: ln_Twpi, 3.50–4.75; x-axis: observation index, 1–120)]

[Figure: Autocorrelation Function for ln_Twpi, lags 1–30, with 5% significance limits]

[Figure: Partial Autocorrelation Function for ln_Twpi, lags 1–30, with 5% significance limits]
 The series is clearly non-stationary if this is the pattern of the ACF
and PACF: the ACF dies out only very slowly.
 The PACF spikes after lag 1 are not statistically significant.

Data Transformation
We would like to detrend our data.

Differenced Log Transformation

We can see there is still an obvious trend after the log transformation. A cubic trend also
fits the original data; as the plot of difTwpi below shows, the differenced log
transformation removes the trend and changes the distribution of the data considerably.

[Figure: Time Series Plot of difTwpi (y-axis: difTwpi, -0.02–0.07; x-axis: observation index, 1–120)]

ii) Model Identification

The appropriate model is identified by determining suitable lags for the AR and MA
processes: the ACF and PACF correlograms of the differenced log-transformed series
(difTwpi), shown below, are used to identify the order of the model.
[Figure: Autocorrelation Function for difTwpi, lags 1–30, with 5% significance limits]

[Figure: Partial Autocorrelation Function for difTwpi, lags 1–30, with 5% significance limits]
 From the ACF plot above, the autocorrelation exceeds the dashed significance limits
for the first few lags, so an ARIMA model may be a good fit for our data.
 The ACF is statistically significant at lags 1, 2, 3, 4, 5 and 6.
 For the PACF, lags 1, 2 and 4 are statistically significant (lag 25 also appears
significant, likely spuriously).
 To identify the order of the ARIMA model, we consider the ACF, the PACF and the
associated correlograms.
 The tentative models are: ARIMA(1,1,1), ARIMA(2,1,1), ARIMA(1,1,2),
ARIMA(3,1,3), ARIMA(3,1,1), ARIMA(2,1,2), ARIMA(3,1,2).
 The ACF is significant up to lag 6, suggesting a maximum moving-average order q
of 6; the PACF is significant up to lag 4, suggesting a maximum autoregressive order p of 4.
 Now, we have all the required parameters for the ARIMA model.

1. Autoregressive parameter (p): up to 4 (from the PACF)

2. Integrated (d): 1 — taking the original observations as the input, we have seen
our series become stationary after one level of differencing
3. Moving-average parameter (q): up to 6 (from the ACF)

iii) Parameter Estimation


 The candidate models are: ARIMA(1,1,1), ARIMA(2,1,1), ARIMA(1,1,2),
ARIMA(4,1,2), ARIMA(3,1,1), ARIMA(2,1,2), ARIMA(3,1,2).

The most appropriate model, selected from the candidates above on the basis of the
most significant coefficients and the lowest residual variance, is the ARIMA(3,1,1)
model.
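Another hedged way to compare the candidates is an information criterion computed from each fit's residual sum of squares. The project selected its model on coefficient significance and residual variance, not AIC, so this is only a supplementary sketch using the standard least-squares AIC approximation:

```python
import math

def aic_from_sse(sse, n, k):
    """Gaussian least-squares AIC approximation: n*ln(SSE/n) + 2k.

    n = number of observations used in estimation,
    k = number of estimated parameters (AR terms + MA terms [+ constant]).
    """
    return n * math.log(sse / n) + 2 * k

# ARIMA(3,1,1) fit reported below: SSE = 0.0142958, n = 122 after differencing,
# k = 4 (three AR coefficients and one MA coefficient, no constant).
print(round(aic_from_sse(0.0142958, 122, 4), 1))
```

Repeating this for each candidate's SSE and parameter count ranks the models; smaller AIC is better.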

ARIMA(3,1,1) model

ARIMA Model: DIFF_ln_Twi

Estimates at each iteration

Iteration SSE Parameters
0 0.0199520 0.100 0.100 0.100 0.100
1 0.0174550 0.113 0.004 0.004 0.250
2 0.0168836 -0.037 -0.033 -0.020 0.144
3 0.0166461 -0.187 -0.068 -0.037 0.011
4 0.0163055 -0.337 -0.112 -0.061 -0.113
5 0.0158470 -0.487 -0.170 -0.096 -0.227
6 0.0153017 -0.637 -0.247 -0.145 -0.327
7 0.0147564 -0.787 -0.347 -0.210 -0.415
8 0.0143697 -0.937 -0.474 -0.292 -0.490
9 0.0142966 -1.021 -0.560 -0.345 -0.531
10 0.0142960 -1.032 -0.569 -0.349 -0.542
11 0.0142959 -1.036 -0.571 -0.349 -0.546
12 0.0142959 -1.038 -0.571 -0.349 -0.548
13 0.0142959 -1.038 -0.572 -0.349 -0.548
14 0.0142959 -1.039 -0.572 -0.349 -0.549

Relative change in each estimate less than 0.0010

Final Estimates of Parameters

Type Coef SE Coef T P
AR 1 -1.0387 0.2263 -4.59 0.000
AR 2 -0.5716 0.1485 -3.85 0.000
AR 3 -0.3491 0.0887 -3.94 0.000
MA 1 -0.5485 0.2342 -2.34 0.021

Differencing: 1 regular difference
Number of observations: Original series 123, after differencing 122
Residuals: SS = 0.0142958 (backforecasts excluded)
MS = 0.0001212 DF = 118

Modified Box-Pierce (Ljung-Box) Chi-Square statistic

Lag 12 24 36 48
Chi-Square 7.7 12.2 25.9 37.3
DF 8 20 32 44
P-Value 0.458 0.909 0.768 0.753

 The model with the most significant coefficients and the lowest residual variance is ARIMA(3,1,1).


iv) Diagnostic Checking
 Once the most appropriate model is chosen, check the residual correlogram to see if there
is any information yet to be captured by the model; a flat correlogram of the
residuals is most ideal. Check the residual correlogram again after any re-fitting.
[Figure: ACF of Residuals for DIFF_ln_Twi, lags 1–30, with 5% significance limits]

[Figure: PACF of Residuals for DIFF_ln_Twi, lags 1–30, with 5% significance limits]

 But avoid "over-fitting" an ARIMA model to a data series.


 The forecast is based on the final ARIMA model
[Figure: Residuals versus observation order (response is dfwpi)]

v) Forecasting
The forecasts below are generated from the fitted ARIMA(3,1,1) model reported in the parameter estimation output above.

Forecasts from period 124

95% Limits
Period Forecast Lower Upper Actual
125 0.0212354 -0.0003424 0.0428132
126 0.0164002 -0.0078205 0.0406210
127 0.0221369 -0.0039365 0.0482103
128 0.0218962 -0.0058769 0.0496692
129 0.0205551 -0.0106009 0.0517111
130 0.0200829 -0.0126114 0.0527773
131 0.0214239 -0.0133042 0.0561520
132 0.0207691 -0.0155575 0.0570957
133 0.0208476 -0.0174702 0.0591653
134 0.0206722 -0.0190771 0.0604216
135 0.0210381 -0.0204100 0.0624862
136 0.0207309 -0.0221307 0.0635926

1.6. Residual Analysis


 A final check is to analyze the residual errors of the model. Ideally, the distribution
of the residuals should follow a Gaussian distribution with zero mean. We can
calculate residuals by subtracting the predicted values from the actual values, as below,
and then use descriptive statistics to summarize them.
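The residual computation and summary can be sketched as follows (illustrative names; the project used Minitab's descriptive statistics):

```python
def residual_summary(actual, predicted):
    """Residuals = actual - predicted, with mean and sample standard deviation."""
    resid = [a - p for a, p in zip(actual, predicted)]
    n = len(resid)
    mean = sum(resid) / n
    stdev = (sum((e - mean) ** 2 for e in resid) / (n - 1)) ** 0.5
    return {"n": n, "mean": mean, "stdev": stdev}
```

Applied to the fitted values from the ARIMA(3,1,1) model, this reproduces the N, Mean and StDev columns of the output below.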

Descriptive Statistics: RESI9

Variable N N* Mean SE Mean StDev Minimum Q1 Median Q3 Maximum


RESI9 123 1 0.00000 0.00128 0.01423 -0.03172 -0.00892 -0.00277 0.00463 0.05894

 We can see there is a very small bias in the model; ideally, the mean should be
exactly zero. We will use this mean value (0.00000 to five decimal places) to correct
the bias in our prediction by adding it to each forecast.
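The bias correction itself is just a constant shift of every forecast by the mean training residual (a sketch; with a mean of 0.00000 to five decimals, the shift is negligible here):

```python
def bias_correct(forecasts, residual_mean):
    """Add the mean training residual to each forecast to offset systematic bias."""
    return [f + residual_mean for f in forecasts]
```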
 Residuals from ARIMA(3,1,1): first, we plot the residuals to check whether
they fit a mean-stationary model with mean 0. The plot suggests
mean stationarity with mean around 0. Minitab's ARIMA output provides
t-statistics and p-values for the coefficients. If non-constant variance is a concern,
look at a plot of residuals versus fits and/or a time series plot of the
residuals.
[Figure: Residual Plots for DIFF_ln_Twi — normal probability plot, residuals versus fits, histogram of residuals, residuals versus observation order]


[Figure: PACF of Residuals for RESI9, lags 1–30, with 5% significance limits]
Look at the ACF of the residuals. For a good model, all autocorrelations of the residual
series should be non-significant.
 Look at the Box-Pierce (Ljung-Box) test for possible residual autocorrelation at
various lags.
Modified Box-Pierce (Ljung-Box) Chi-Square statistic

Lag 12 24 36 48
Chi-Square 7.7 12.2 25.9 37.3
DF 8 20 32 44
P-Value 0.458 0.909 0.768 0.753
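The Ljung-Box statistic reported above can be reproduced from the residual autocorrelations r_1..r_m (a sketch; the chi-square p-value lookup is left to a statistics package):

```python
def ljung_box_q(resid_acf, n):
    """Ljung-Box Q = n(n+2) * sum_{k=1..m} r_k^2 / (n-k).

    resid_acf holds the residual autocorrelations r_1..r_m (lag 0 excluded);
    Q is compared to a chi-square with m minus the number of fitted parameters
    degrees of freedom (hence DF = 12 - 4 = 8 at lag 12 above).
    """
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(resid_acf, start=1))
```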

Look at the significance of the coefficients. Minitab provides p-values, so you may
simply compare them to the standard 0.05 cutoff.
Final Estimates of Parameters

Type Coef SE Coef T P


AR 1 -1.0387 0.2263 -4.59 0.000
AR 2 -0.5716 0.1485 -3.85 0.000
AR 3 -0.3491 0.0887 -3.94 0.000

MA 1 -0.5485 0.2342 -2.34 0.021

1.7. Bias-Corrected Model


 As a last improvement to the model, we will produce a bias-adjusted forecast.

Descriptive Statistics: RESI9

Variable N N* Mean SE Mean StDev Minimum Q1 Median Q3 Maximum
RESI9 123 1 0.00000 0.00128 0.01423 -0.03172 -0.00892 -0.00277 0.00463 0.05894

Bias corrected output

 We can see the errors have reduced slightly and the mean has shifted towards zero.
The graphs also suggest a Gaussian distribution.
[Figure: Histogram of residuals (response is DIFF_ln_Twi)]

[Figure: Normal probability plot of residuals (response is DIFF_ln_Twi)]

 In our example we had a very small bias, so this bias correction may not prove
to be a significant improvement; but in real-life scenarios it is an
important technique to explore at the end, in case any bias exists.
 So, our model has passed all the criteria. We can save this model for later use.
1.8. Model Validation
 To evaluate the overall prediction ability of the ARIMA(3,1,1)
model, we compare it with the original log-transformed data in the
same plot. For model diagnostics, we need to check the residuals of the
model first. The ideal plot is similar to one generated by Gaussian
white noise, which means that the residuals are independent and
identically distributed from a normal distribution. Moreover, the residuals
should fit a mean-stationary model with mean 0.
 We can observe the actual and forecasted values for the validation dataset.
These values are also plotted on a line plot, which shows a promising result for
our model.
Trend Analysis Plot for DIFF_ln_Twi

Trend Analysis for DIFF_ln_Twi

* NOTE * Zero values of Yt exist; MAPE calculated only for non-zero Yt.

Data DIFF_ln_Twi
Length 124
NMissing 1

Fitted Trend Equation

Yt = 0.00719 + 0.000058×t

Accuracy Measures

MAPE 174.566
MAD 0.010
MSD 0.000

Forecasts

Period Forecast
125 0.0143958
126 0.0144534
127 0.0145111
128 0.0145687
129 0.0146264
130 0.0146840
131 0.0147417
132 0.0147993
133 0.0148570
134 0.0149146
135 0.0149723
136 0.0150299
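The fitted trend equation above can be evaluated directly; small discrepancies with the printed forecasts come from the coefficients being rounded in the output (illustrative helper):

```python
def trend_forecast(t):
    """Linear trend from the output above: Yt = 0.00719 + 0.000058*t."""
    return 0.00719 + 0.000058 * t

# Period 125 gives about 0.0144, close to the printed 0.0143958
# (the printed coefficients are rounded).
print(round(trend_forecast(125), 4))
```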

[Figure: Trend Analysis Plot for DIFF_ln_Twi — linear trend model Yt = 0.00719 + 0.000058×t, showing actual values, fits, and forecasts; accuracy measures MAPE 174.566, MAD 0.010, MSD 0.000]

 The trend plot shows the original data, the fitted trend line, and the forecasts. The
output also displays the fitted trend equation and three measures of accuracy that
help assess the fitted values: MAPE, MAD, and MSD. The Thailand
wholesale price index shows a general upward trend, though with an evident seasonal
component. The trend model appears to fit the overall trend well, but the seasonal
pattern is not well captured. To fit these data better, you could also apply
decomposition to the stored residuals and add the decomposition fits and forecasts
to the trend analysis.
 The three measures are not very informative by themselves, but you can use them to
compare the fits obtained with different methods. For all three measures, smaller
values generally indicate a better-fitting model.
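The three accuracy measures follow Minitab's definitions and can be sketched as below. Note that MAPE skips zero actual values, matching the NOTE in the output above:

```python
def accuracy_measures(actual, fitted):
    """Minitab-style accuracy measures: MAPE (%), MAD, MSD."""
    errors = [a - f for a, f in zip(actual, fitted)]
    nonzero = [(a, e) for a, e in zip(actual, errors) if a != 0]
    mape = 100 * sum(abs(e / a) for a, e in nonzero) / len(nonzero)  # percent error
    mad = sum(abs(e) for e in errors) / len(errors)                  # mean abs deviation
    msd = sum(e * e for e in errors) / len(errors)                   # mean squared deviation
    return mape, mad, msd
```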
