Copyright	(c) Nikos Karagiannidis 2018
License	BSD3
Maintainer	[email protected]
Stability	stable
Portability	POSIX
Safe Haskell	None
Language	Haskell2010

Etl.Internal.Core

Contents

Basic Data Types
Execution of an ETL Mapping
Functions for "Building" an ETL Mapping
Various ETL Operations

Description

This is an internal module (i.e., not to be imported directly) that implements the core ETL functionality that is exposed via the Julius EDSL for ETL/ELT found in the Etl.Julius module)

Synopsis

data RColMapping
- = ColMapEmpty
- | RMap1x1 {
  - srcCol :: ColumnName
  - removeSrcCol :: YesNo
  - trgCol :: ColumnName
  - transform1x1 :: RDataType -> RDataType
  - srcRTupleFilter :: RPredicate
  }
- | RMapNx1 {
  - srcColGrp :: [ColumnName]
  - removeSrcCol :: YesNo
  - trgCol :: ColumnName
  - transformNx1 :: [RDataType] -> RDataType
  - srcRTupleFilter :: RPredicate
  }
- | RMap1xN {
  - srcCol :: ColumnName
  - removeSrcCol :: YesNo
  - trgColGrp :: [ColumnName]
  - transform1xN :: RDataType -> [RDataType]
  - srcRTupleFilter :: RPredicate
  }
- | RMapNxM {
  - srcColGrp :: [ColumnName]
  - removeSrcCol :: YesNo
  - trgColGrp :: [ColumnName]
  - transformNxM :: [RDataType] -> [RDataType]
  - srcRTupleFilter :: RPredicate
  }
type ColXForm = [RDataType] -> [RDataType]
createColMapping :: [ColumnName] -> [ColumnName] -> ColXForm -> YesNo -> RPredicate -> RColMapping
data ETLOperation
- = ETLrOp {
  - rop :: ROperation
  }
- | ETLcOp {
  - cmap :: RColMapping
  }
data ETLMapping
- = ETLMapEmpty
- | ETLMapLD {
  - etlOp :: ETLOperation
  - tabL :: ETLMapping
  - tabR :: RTable
  }
- | ETLMapRD {
  - etlOp :: ETLOperation
  - tabLrd :: RTable
  - tabRrd :: ETLMapping
  }
- | ETLMapBal {
  - etlOp :: ETLOperation
  - tabLbal :: ETLMapping
  - tabRbal :: ETLMapping
  }
data YesNo
- = Yes
- | No
runCM :: RColMapping -> RTable -> RTable
etlOpU :: ETLOperation -> RTable -> RTable
etlOpB :: ETLOperation -> RTable -> RTable -> RTable
etl :: ETLMapping -> RTable
etlRes :: ETLMapping -> RTabResult
rtabToETLMapping :: RTable -> ETLMapping
createLeafETLMapLD :: ETLOperation -> RTable -> ETLMapping
createLeafBinETLMapLD :: ETLOperation -> RTable -> RTable -> ETLMapping
connectETLMapLD :: ETLOperation -> RTable -> ETLMapping -> ETLMapping

Basic Data Types

data RColMapping Source #

This is the basic data type to define the column-to-column mapping from a source RTable to a target RTable. Essentially, an RColMapping represents the column-level transformations of an RTuple that will yield a target RTuple.

A mapping is simply a triple of the form ( Source-Column(s), Target-Column(s), Transformation, RTuple-Filter), where we define the source columns over which a transformation (i.e. a function) will be applied in order to yield the target columns. Also, an RPredicate (i.e. a filter) might be applied on the source RTuple. Remember that an RTuple is essentially a mapping between a key (the Column Name) and a value (the RDataType value). So the various RColMapping data constructors below simply describe the possible modifications of an RTuple orginating from its own columns.

So, we can have the following mapping types: a) single-source column to single-target column mapping (1 to 1), the source column will be removed or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) b) multiple-source columns to single-target column mapping (N to 1), The N columns will be merged to the single target column based on the transformation. The N columns will be removed from the RTuple or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) c) single-source column to multiple-target columns mapping (1 to M) the source column will be "expanded" to M target columns based ont he transformation. the source column will be removed or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) d) multiple-source column to multiple target columns mapping (N to M) The N source columns will be mapped to M target columns based on the transformation. The N columns will be removed from the RTuple or not based on the removeSrcCol flag (dublicate column names are not allow in an RTuple)

Some examples of mapping are the following:

     (Start_Date, No, StartDate, t -> True)  --  copy the source value to target and dont remove the source column, so the target RTuple will have both columns Start_Date and StartDate
                                          --  with the exactly the same value)

     ([Amount, Discount], Yes, FinalAmount, ([a, d] -> a * d) ) -- FinalAmount is a derived column based on a function applied to the two source columns. 
                                                                       --  In the final RTuple we remove the two source columns.

An RColMapping can be applied with the runCM (runColMapping) operator

Constructors

ColMapEmpty
RMap1x1	single-source column to single-target column mapping (1 to 1).
Fields srcCol :: ColumnName removeSrcCol :: YesNo trgCol :: ColumnName transform1x1 :: RDataType -> RDataType srcRTupleFilter :: RPredicate
RMapNx1	multiple-source columns to single-target column mapping (N to 1)
Fields srcColGrp :: [ColumnName] removeSrcCol :: YesNo trgCol :: ColumnName transformNx1 :: [RDataType] -> RDataType srcRTupleFilter :: RPredicate
RMap1xN	single-source column to multiple-target columns mapping (1 to N)
Fields srcCol :: ColumnName removeSrcCol :: YesNo trgColGrp :: [ColumnName] transform1xN :: RDataType -> [RDataType] srcRTupleFilter :: RPredicate
RMapNxM	multiple-source column to multiple target columns mapping (N to M)
Fields srcColGrp :: [ColumnName] removeSrcCol :: YesNo trgColGrp :: [ColumnName] transformNxM :: [RDataType] -> [RDataType] srcRTupleFilter :: RPredicate

type ColXForm = [RDataType] -> [RDataType] Source #

A Column Transformation function data type. It is used in order to define an arbitrary column-level transformation (i.e., from a list of N input Column-Values we produce a list of M derived (output) Column-Values). A Column value is represented with the RDataType.

createColMapping Source #

Arguments

:: [ColumnName]	List of source column names
-> [ColumnName]	List of target column names
-> ColXForm	Column Transformation function
-> YesNo	Remove source column option
-> RPredicate	Filtering predicate
-> RColMapping	Output Column Mapping

Constructs an RColMapping. This is the suggested method for creating a column mapping and not by calling the data constructors directly.

data ETLOperation Source #

An ETL operation applied to an RTable can be either an ROperation (a relational agebra operation like join, filter etc.) defined in RTable.Core module, or an RColMapping applied to an RTable

Constructors

ETLrOp
Fields rop :: ROperation
ETLcOp
Fields cmap :: RColMapping

data ETLMapping Source #

ETLmapping : it is the equivalent of a mapping in an ETL tool and consists of a series of ETLOperations that are applied, one-by-one, to some initial input RTable, but if binary ETLOperations are included in the ETLMapping, then there will be more than one input RTables that the ETLOperations of the ETLMapping will be applied to. When we apply (i.e., run) an ETLOperation of the ETLMapping we get a new RTable, which is then inputed to the next ETLOperation, until we finally run all ETLOperations. The purpose of the execution of an ETLMapping is to produce a single new RTable as the result of the execution of all the ETLOperations of the ETLMapping. In terms of database operations an ETLMapping is the equivalent of an CREATE AS SELECT (CTAS) operation in an RDBMS. This means that anything that can be done in the SELECT part (i.e., column projection, row filtering, grouping and join operations, etc.) in order to produce a new table, can be included in an ETLMapping.

An ETLMapping is executed with the etl (runETLmapping) operator

Implementation: An ETLMapping is implemented as a binary tree where the node represents the ETLOperation to be executed and the left branch is another ETLMapping, while the right branch is an RTable (that might be empty in the case of a Unary ETLOperation). Execution proceeds from bottom-left to top-right. This is similar in concept to a left-deep join tree. In a Left-Deep ETLOperation tree the "pipe" of ETLOperations comes from the left branches always. The leaf node is always an ETLMapping with an ETLMapEmpty in the left branch and an RTable in the right branch (the initial RTable inputed to the ETLMapping). In this way, the result of the execution of each ETLOperation (which is an RTable) is passed on to the next ETLOperation. Here is an example:

    A Left-Deep ETLOperation Tree

                             final RTable result
                                   / 
                                etlOp3 
                             /        
                          etlOp2     rtab2
                         /       
A leaf-node -->    etlOp1    emptyRTab
                   /       
             ETLMapEmpty   rtab1

You see that always on the left branch we have an ETLMapping data type (i.e., a left-deep ETLOperation tree). So how do we implement the following case?

                    final RTable result
                            / 
 A leaf-node -->         etlOp1 
                         /       
                        rtab1   rtab2

The answer is that we "model" the left RTable (rtab1 in our example) as an ETLMapping of the form:

 ETLMapLD { etlOp = ETLcOp{cmap = ColMapEmpty}, tabL = ETLMapEmpty, tabR = rtab1 }

So we embed the rtab1 in a ETLMapping, which is a leaf (i.e., it has an empty prevMap), the rtab1 is in the right branch (tabR) and the ETLOperation is the EmptyColMapping, which returns its input RTable when executed. We can use function rtabToETLMapping for this job. So it becomes A leaf-node --> etlOp1 / rtabToETLMapping rtab1 rtab2

In this manner, a leaf-node can also be implemented like this:

                             final RTable result
                                   / 
                                etlOp3 
                             /        
                          etlOp2     rtab2
                         /       
A leaf-node -->    etlOp1    emptyRTab
                   /     
  rtabToETLMapping rtab1  emptyRTable

Constructors

ETLMapEmpty	an empty node
ETLMapLD	a Left-Deep node
Fields etlOp :: ETLOperation the ETLOperation to be executed tabL :: ETLMapping the left-branch corresponding to the previous ETLOperation, which is input to this one. tabR :: RTable the right branch corresponds to another RTable (for binary ETL operations). If this is a Unary ETLOperation then this field must be an empty RTable.
ETLMapRD	a Right-Deep node
Fields etlOp :: ETLOperation the ETLOperation to be executed tabLrd :: RTable the left-branch corresponds to another RTable (for binary ETL operations). If this is a Unary ETLOperation then this field must be an empty RTable. tabRrd :: ETLMapping the right branch corresponding to the previous ETLOperation, which is input to this one.
ETLMapBal	a Balanced node
Fields etlOp :: ETLOperation the ETLOperation to be executed tabLbal :: ETLMapping the left-branch corresponding to the previous ETLOperation, which is input to this one. If this is a Unary ETLOperation then this field might be an empty ETLMapping. tabRbal :: ETLMapping the right branch corresponding corresponding to the previous ETLOperation, which is input to this one. -- If this is a Unary ETLOperation then this field might be an empty ETLMapping.

Instances

Instances details

Eq ETLMapping Source #
Instance details Defined in Etl.Internal.Core Methods (==) :: ETLMapping -> ETLMapping -> Bool # (/=) :: ETLMapping -> ETLMapping -> Bool #

data YesNo Source #

Constructors

Yes
No

Instances

Instances details

Eq YesNo Source #
Instance details Defined in Etl.Internal.Core Methods (==) :: YesNo -> YesNo -> Bool # (/=) :: YesNo -> YesNo -> Bool #
Show YesNo Source #
Instance details Defined in Etl.Internal.Core Methods showsPrec :: Int -> YesNo -> ShowS # show :: YesNo -> String # showList :: [YesNo] -> ShowS #

Execution of an ETL Mapping

runCM :: RColMapping -> RTable -> RTable Source #

runCM operator executes an RColMapping If a target-column has the same name with a source-column and a DontRemoveSrc (i.e., removeSrcCol == No) has been specified, then the (target-column, target-value) key-value pair, overwrites the corresponding (source-column, source-value) key-value pair

etlOpU :: ETLOperation -> RTable -> RTable Source #

executes a Unary ETL Operation

etlOpB :: ETLOperation -> RTable -> RTable -> RTable Source #

executes a Binary ETL Operation

etl :: ETLMapping -> RTable Source #

This operator executes an ETLMapping

etlRes Source #

Arguments

:: ETLMapping	input ETLMapping
-> RTabResult	output RTabResult

This operator executes an ETLMapping and returns the RTabResult Writer Monad that embedds apart from the resulting RTable, also the number of RTuples returned

Functions for "Building" an ETL Mapping

rtabToETLMapping :: RTable -> ETLMapping Source #

Model an RTable as an ETLMapping which when executed will return the input RTable

createLeafETLMapLD Source #

Arguments

:: ETLOperation	ETL operation of this ETL mapping
-> RTable	input RTable
-> ETLMapping	output ETLMapping

Creates a left-deep leaf ETL Mapping, of the following form:

    A Left-Deep ETLOperation Tree

                             final RTable result
                                   / 
                                etlOp3 
                             /        
                          etlOp2     rtab2
                         /       
A leaf-node -->    etlOp1    emptyRTab
                   /       
             ETLMapEmpty   rtab1

createLeafBinETLMapLD Source #

Arguments

:: ETLOperation	ETL operation of this ETL mapping
-> RTable	input RTable1
-> RTable	input RTable2
-> ETLMapping	output ETLMapping

creates a Binary operation leaf node of the form:

A leaf-node -->    etlOp1    
                   /     
  rtabToETLMapping rtab1  rtab2

connectETLMapLD Source #

Arguments

:: ETLOperation	ETL operation of this ETL Mapping
-> RTable	Right RTable (right branch) (if this is a Unary ETL mapping this should be an emptyRTable)
-> ETLMapping	Previous ETL mapping (left branch)
-> ETLMapping	New ETL Mapping, which has added at the end the new node

Connects an ETL Mapping to a left-deep ETL Mapping tree, of the form

    A Left-Deep ETLOperation Tree

                             final RTable result
                                   / 
                                etlOp3 
                             /        
                          etlOp2     rtab2
                         /       
A leaf-node -->    etlOp1    emptyRTab
                   /       
             ETLMapEmpty   rtab1

Example:

  -- connect a Unary ETL mapping (etlOp2)

                          etlOp2    
                         /       
                      etlOp1    emptyRTab
       
  => connectETLMapLD etlOp2 emptyRTable prevMap

  -- connect a Binary ETL Mapping (etlOp3)

                                etlOp3 
                             /        
                          etlOp2     rtab2

  => connectETLMapLD etlOp3 rtab2 prevMap

Note that the right branch (RTable) appears first in the list of input arguments of this function and the left branch (ETLMapping) appears second. This is strange, and one could thought that it is a mistake (i.e., the left branch should appear first and the right branch second) since we are reading from left to right. However this was a deliberate choice, so that we leave the left branch (which is the connection point with the previous ETLMapping) as the last argument, and thus we can partially apply the argumenets and get a new function with input parameter only the previous mapping. This is very helpfull in function composition

Basic Data Types

Instances

Instances

Execution of an ETL Mapping

Functions for "Building" an ETL Mapping

Various ETL Operations