Spark batch job template with UnitTest.
This projects serves to provide a template for Spark batch applications where you have a main class which executes a business logic (Moving average in this case) and a test class which extends QueryTest with SharedSQLContext. We use checkAnswer method to check the output with the desired output.
Sample task info:
- StockPrices:(src/test/resources/input/stock_prices) Columns: stockId,timeStamp,stockPrice
- StockInfo: (src/test/resources/input/stock_info) Columns: stockId,stockName,stockCategory
Task 1: Compute the moving average for every stockID with a predefined moving window size.
Window Size : 3
Columns: stockId,timeStamp,stockPrice,moving_average
Ordered by: stockId,timeStamp
Task 2: Join with Projection - Join moving average calculated in Task1 with the stockInfo.
Window Size : 3
Output Data
Columns: stockId,timeStamp,stockPrice,moving_average,stockName,stockCategory
Task 3: Get the projected moving average data for a given stockId.
stockId :103
Columns: stockId,timeStamp,stockPrice,moving_average,stockName,stockCategory
Output Data
Columns: stockId,timeStamp,stockPrice,moving_average,stockName,stockCategory
Extensions:
Task 4:
Handle Nulls
File : src/test/resources/input/stock_prices_dirty/0.csv
contains null values in stockPrice column.
Filter all the non null rows
Columns: stockId,timeStamp,stockPrice
Other Extensions
Task 1 : Count of each files Load all the above files as spark dataframes and print the count of each of files.
Task 2 : Handle Nulls
File : test/resources/input/stock_prices_dirty/0.csv
contains null values in amount column. Replace those null values with mean of column in spark dataframe.
Task 3 : Add a Sequential Row ID
For a joined dataframe, add an id column which contains sequence numbers from 1 to number of rows
Task 4 : Sales with 5% Discount
Add a column discount_amount to joined dataframe, which holds 5% discounted amount.
WHAT IS MOVING AVERAGE
Moving Average is a lagging indicator, which means that it do not predict new trends but confirm trends once they have been established. A moving average is commonly used with time series data to smooth out short-term fluctuations and highlight longer-term trends or cycles. To compute the moving average you would need to input a set of values and a window size, and compute the moving average of the input. If the input values are x1, x2, x3, ..., xn and the window value is k the moving average is computed using the following formula:
yk = [x(k-k+1) + x(k-k+2) + .... + xk]/k
(All the windows with less than k elements gave output as 0)