Skip to content
Harish Butani edited this page May 28, 2012 · 3 revisions

Table Functions.

Time Series Analysis

Forecast

Signature:

Forecast(GroovyExpr DriverNames, // as list of Strings
         GroovyExpr BaseDrivers, // as a list of numeric values
         GroovyExpr measureExpr,
         // as a map of  String -> [numeric vals]
         // specifies the named points to forecast at and
         // the driver values at these points
         GroovyExpr driverForecast
        )

Returns the forecasted Msr for each point specified as a Table of Name, Measure.

Feed

Signature:

Feed(String openingColumn,
     String inColumn,
     String outColumn,
     String closingColumn
    )

Feed calculates the closing balance and “feeds” it to the opening balance of the next time period. Assumes input to function is partitioned by something other than Time, and data within each partitioned by a Time period: Qtr, Month etc. Opening data for the 1st period in each partition is provided. For e.g.

Jan Feb Mar April
Opening 5000
In 1000 2000 5000 2000
Out 3000 2000 7000 2000
Closing

Feed calculates the Opening(except the first) & Closing values for each period.

NPath

Signature: NPath(String Pattern, GroovyExpr Symbols, GroovyExpr Results)

Returns rows that meet a specified pattern. Use Symbols to specify a list of expressions to match. Pattern is used to specify a Path. The results list can contain expressions based on the input columns and also the matched Path.

Pattern
pattern for the Path. Path is ‘dot’ separated list of symbols. Each element is treated as a groovy expression. Elements that end in ‘*’ or ‘+’ are interpreted with the usual meaning of zero or more, one or more respectively. For e.g. “LATE.EARLY*.(ONTIME || EARLY)” implies a sequence of flights where the first occurence was LATE, followed by zero or more EARLY flights, followed by a ONTIME or EARLY flight.
Symbols
specify a map from names to expresssions. For e.g. <[LATE: “arrival\_delay > 0”, EARLY: “arrival\_delay < 0” , ONTIME : “arrival\_delay == 0”]> The names for symbols don’t need to be quoted as long as they are valid groovy names. When in doubt add quotes, for eg: <LATE: “arrival\_delay > 0 and arrival\_delay < 60”, “LATE*2”: “arrival\_delay >= 60” >
Results
specified as a list. Each entry can be just a string, or a list of 3 elems: [expr, type, name]. If an element is just a string, it is interpreted as a reference to a column in the input to this function or as a Symbol. When specified as a list the first element is interepreted as a groovy expression; the second is interpreted as a typename, and the third is the expression’s alias. For eg <[“weight”, [“2*weight”, “double”, ‘doubleWeight”]>. The expressions are evaluated in the context where all the input columns are available, plus the attributes “path”, “count”, “first”, and “last” are available. Path is a collection of nodes that represents the matching Path, count, first, last are convenience fns about the Path. Each node in the path exposes all the attributes of the corresponding input row.

Examples

Late Flights

list incidents where a Flight(to NY) has been more than 15 minutes late 5 or more times in a row.

	  from npath(<select origin_city_name, year, month, day_of_month, arr_delay, fl_num
			from flightsdata
			where dest_city_name = 'New York' and dep_time != ''>
	  		partition by fl_num
	  		order by year, month, day_of_month,
	  	'LATE.LATE.LATE.LATE.LATE+',
		<[LATE : "arr_delay \\> 15"]>,
		<["origin_city_name", "fl_num", "year", "month", "day_of_month",
				["(path.sum() { it.arr_delay})/((double)count)", "double", "avgDelay"],
				["count", "int", "numOfDelays"]
		]>)
	  select origin_city_name, fl_num, year, month, day_of_month, avgDelay, numOfDelays
	  into path='/tmp/wout'
	  serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
	  with serdeproperties('field.delim'=',')
	  format 'org.apache.hadoop.mapred.TextOutputFormat'

Hierarchical navigation

It is possible to treat the rows within a parition as a Table representation of a Level Hierarchy and to provide Navigation functions on the hierarchy. See Hierarchy Evaluation for details.

Allocations

Allocate

Signature:

Allocate( // represents the  amount to spread
          GroovyExpr allocateAmt, 
          // specifies the expr to use to evaluate
          // allocations at each point.
          GroovyExpr ratioExpr,
         )

Returns the allocated amount at each point. Assumes input to function is partitioned based on how amount needs to be spread. For eg a Budget over States in a Country based on Sales figures.

Depreciation

Special allocation algorithm, user can specify the ‘Sum Of Years’ or ‘Straight-Line’ method of depreciation. Can be used for Yearly & Quarterly depreiciation.

Signature:

Depreciate( // represents the  amount to depreciate
            GroovyExprbaseAmt, 
            String yearColumn,
            String qtrColumn,
            // 'sum of years' or 'straight line'
            String depreciationMethod 
           )

Returns the depriciation Amount for each year and quarter. Assumes input to function contains rows for each Year and Qtr, data is partitioned by Year, and data is sorted by Year and Qtr.

Misc. Functions

Data densification for Reporting

  • using partition right join
  • to a table function, provide a set of dimension lists
  • output original facts plus any missing facts.
  • for e.g. output sales for all weeks, even there are no sales for some products in certain weeks

Iterative Algorithms

Frequent Itemsets involve counting how often multiple events occur together. for e.g. how often someone has purchased milk & cereal. Input to itemsets is a collection of items, an itemset.(like the products a given customer has purchased). This is also a iterative algorithm where a n-item frequent itemset can be efficientlycomputed from a table of n-1 frquent itemsets. See Market Basket Analysis for a writeup on how we plan to provide this functionality.

Graph Algorithms like Shortest Path, Transitive closure can be exposed as Table Functions in SQL, and are also highly parallelizable. But these functions are iterative in that they run to a Fixed Point. This requires us to add an additional mechanics into the Engine to orchestrate computations to a fixed point. Stay tuned for support for these algorithms.

Clone this wiki locally