Copyright | (c) Nikos Karagiannidis 2018 |
---|---|
License | BSD3 |
Maintainer | [email protected] |
Stability | stable |
Portability | POSIX |
Safe Haskell | None |
Language | Haskell2010 |
Etl.Internal.Core
Description
This is an internal module (i.e., not to be imported directly) that implements the core ETL functionality exposed via the Julius EDSL for ETL/ELT, found in the Etl.Julius module.
Synopsis
- data RColMapping
- = ColMapEmpty
- | RMap1x1 { }
- | RMapNx1 {
- srcColGrp :: [ColumnName]
- removeSrcCol :: YesNo
- trgCol :: ColumnName
- transformNx1 :: [RDataType] -> RDataType
- srcRTupleFilter :: RPredicate
- | RMap1xN {
- srcCol :: ColumnName
- removeSrcCol :: YesNo
- trgColGrp :: [ColumnName]
- transform1xN :: RDataType -> [RDataType]
- srcRTupleFilter :: RPredicate
- | RMapNxM {
- srcColGrp :: [ColumnName]
- removeSrcCol :: YesNo
- trgColGrp :: [ColumnName]
- transformNxM :: [RDataType] -> [RDataType]
- srcRTupleFilter :: RPredicate
- type ColXForm = [RDataType] -> [RDataType]
- createColMapping :: [ColumnName] -> [ColumnName] -> ColXForm -> YesNo -> RPredicate -> RColMapping
- data ETLOperation
- = ETLrOp {
- rop :: ROperation
- | ETLcOp {
- cmap :: RColMapping
- data ETLMapping
- = ETLMapEmpty
- | ETLMapLD {
- etlOp :: ETLOperation
- tabL :: ETLMapping
- tabR :: RTable
- | ETLMapRD {
- etlOp :: ETLOperation
- tabLrd :: RTable
- tabRrd :: ETLMapping
- | ETLMapBal { }
- data YesNo
- runCM :: RColMapping -> RTable -> RTable
- etlOpU :: ETLOperation -> RTable -> RTable
- etlOpB :: ETLOperation -> RTable -> RTable -> RTable
- etl :: ETLMapping -> RTable
- etlRes :: ETLMapping -> RTabResult
- rtabToETLMapping :: RTable -> ETLMapping
- createLeafETLMapLD :: ETLOperation -> RTable -> ETLMapping
- createLeafBinETLMapLD :: ETLOperation -> RTable -> RTable -> ETLMapping
- connectETLMapLD :: ETLOperation -> RTable -> ETLMapping -> ETLMapping
Basic Data Types
data RColMapping
This is the basic data type to define the column-to-column mapping from a source RTable to a target RTable. Essentially, an RColMapping represents the column-level transformations of an RTuple that will yield a target RTuple.
A mapping is simply a tuple of the form (Source-Column(s), Target-Column(s), Transformation, RTuple-Filter), where we define the source columns over which a transformation (i.e., a function) will be applied in order to yield the target columns. Also, an RPredicate (i.e., a filter) might be applied on the source RTuple.
Remember that an RTuple is essentially a mapping between a key (the Column Name) and a value (the RDataType value). So the various RColMapping data constructors below simply describe the possible modifications of an RTuple originating from its own columns.
So, we can have the following mapping types:
a) single-source column to single-target column mapping (1 to 1). The source column will be removed or not based on the removeSrcCol flag (duplicate column names are not allowed in an RTuple).
b) multiple-source columns to single-target column mapping (N to 1). The N columns will be merged into the single target column based on the transformation. The N columns will be removed from the RTuple or not based on the removeSrcCol flag (duplicate column names are not allowed in an RTuple).
c) single-source column to multiple-target columns mapping (1 to M). The source column will be "expanded" to M target columns based on the transformation. The source column will be removed or not based on the removeSrcCol flag (duplicate column names are not allowed in an RTuple).
d) multiple-source columns to multiple-target columns mapping (N to M). The N source columns will be mapped to M target columns based on the transformation. The N columns will be removed from the RTuple or not based on the removeSrcCol flag (duplicate column names are not allowed in an RTuple).
Some examples of mapping are the following:
    (Start_Date, No, StartDate, \t -> True)
        -- copy the source value to the target and don't remove the source column, so the target
        -- RTuple will have both columns Start_Date and StartDate with exactly the same value
    ([Amount, Discount], Yes, FinalAmount, (\[a, d] -> a * d))
        -- FinalAmount is a derived column based on a function applied to the two source columns.
        -- In the final RTuple we remove the two source columns.
An RColMapping can be applied with the runCM (runColMapping) operator.
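The N-to-1 case can be illustrated with a minimal, self-contained sketch. It does not use the library: `Col`, `Value`, `Tuple`, `Table`, `runNx1` and `finalAmount` are hypothetical stand-ins, and the choices to drop tuples rejected by the filter and to leave tuples with a missing source column unchanged are assumptions, not documented behaviour of runCM.

```haskell
-- Hypothetical stand-ins for ColumnName, RDataType, RTuple, RTable and YesNo.
type Col   = String
type Value = Double
type Tuple = [(Col, Value)]   -- a tuple as a column-name/value association list
type Table = [Tuple]
data YesNo = Yes | No deriving (Eq, Show)

-- Apply an N-to-1 mapping to every tuple that passes the filter.
-- Assumption: tuples rejected by the filter are dropped from the result.
runNx1 :: [Col] -> YesNo -> Col -> ([Value] -> Value) -> (Tuple -> Bool) -> Table -> Table
runNx1 srcs remove trg f keep = map step . filter keep
  where
    step t = case mapM (`lookup` t) srcs of
      Nothing   -> t   -- a source column is missing: leave the tuple as is (assumption)
      Just vals ->
        let isSrc c = remove == Yes && c `elem` srcs
            -- drop removed source columns and any old target column,
            -- so that column names stay unique in the tuple
            rest    = [ kv | kv@(c, _) <- t, c /= trg, not (isSrc c) ]
        in rest ++ [(trg, f vals)]

-- The ([Amount, Discount], Yes, FinalAmount, ...) example from the text.
finalAmount :: Table -> Table
finalAmount = runNx1 ["Amount", "Discount"] Yes "FinalAmount" product (const True)
```

Applied to a one-row table with columns Id, Amount and Discount, `finalAmount` keeps Id, removes the two source columns and appends FinalAmount.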
Constructors
ColMapEmpty | |
RMap1x1 | single-source column to single-target column mapping (1 to 1). |
Fields
| |
RMapNx1 | multiple-source columns to single-target column mapping (N to 1) |
Fields
| |
RMap1xN | single-source column to multiple-target columns mapping (1 to N) |
Fields
| |
RMapNxM | multiple-source columns to multiple-target columns mapping (N to M) |
Fields
|
type ColXForm = [RDataType] -> [RDataType]
A Column Transformation function data type.
It is used in order to define an arbitrary column-level transformation (i.e., from a list of N input Column-Values we produce a list of M derived (output) Column-Values).
A Column value is represented with the RDataType.
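As an illustration of the "N values in, M values out" shape, here is a small sketch that uses plain Ints instead of RDataType. The names `Value`, `XForm`, `minutesToHM` and `total` are hypothetical, and returning an empty list on an unexpected number of inputs is an assumption made only to keep the functions total.

```haskell
-- Hypothetical stand-in for RDataType: plain Ints.
type Value = Int
type XForm = [Value] -> [Value]   -- same shape as ColXForm

-- 1 input value -> 2 output values: split a minute count into hours and minutes.
minutesToHM :: XForm
minutesToHM [m] = [m `div` 60, m `mod` 60]
minutesToHM _   = []              -- unexpected arity: produce nothing (assumption)

-- 2 input values -> 1 output value: quantity times unit price.
total :: XForm
total [qty, price] = [qty * price]
total _            = []
```

The order of values in the input and output lists is meant to follow the order of the source and target column-name lists.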
createColMapping
Arguments
:: [ColumnName] | List of source column names |
-> [ColumnName] | List of target column names |
-> ColXForm | Column Transformation function |
-> YesNo | Remove source column option |
-> RPredicate | Filtering predicate |
-> RColMapping | Output Column Mapping |
Constructs an RColMapping. This is the recommended way to create a column mapping, rather than calling the data constructors directly.
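One plausible way for such a smart constructor to choose among the four mapping shapes is to look at the lengths of the two column lists. The sketch below shows only that dispatch. `Shape` and `shapeOf` are hypothetical names, and whether the library decides in exactly this way is an assumption.

```haskell
type Col = String

-- Hypothetical tag for the mapping shapes described above.
data Shape = Empty | OneToOne | NToOne | OneToN | NToM
  deriving (Eq, Show)

-- Pick a shape from the number of source and target columns.
-- Clause order matters: the more specific patterns come first.
shapeOf :: [Col] -> [Col] -> Shape
shapeOf []  _   = Empty      -- no source columns: nothing to map (assumption)
shapeOf _   []  = Empty      -- no target columns: nothing to map (assumption)
shapeOf [_] [_] = OneToOne
shapeOf _   [_] = NToOne
shapeOf [_] _   = OneToN
shapeOf _   _   = NToM
```

Hiding this choice behind a single function is what lets callers pass one general list-to-list transformation regardless of the shape.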
data ETLOperation
An ETL operation applied to an RTable can be either an ROperation (a relational algebra operation like join, filter, etc.) defined in the RTable.Core module, or an RColMapping applied to an RTable.
Constructors
ETLrOp | |
Fields
| |
ETLcOp | |
Fields
|
data ETLMapping
ETLMapping: the equivalent of a mapping in an ETL tool. It consists of a series of ETLOperations that are applied, one by one, to some initial input RTable. If binary ETLOperations are included in the ETLMapping, there will be more than one input RTable that the ETLOperations are applied to. When we apply (i.e., run) an ETLOperation of the ETLMapping we get a new RTable, which is then fed to the next ETLOperation, until all ETLOperations have run. The purpose of executing an ETLMapping is to produce a single new RTable as the result of all its ETLOperations. In terms of database operations, an ETLMapping is the equivalent of a CREATE TABLE AS SELECT (CTAS) operation in an RDBMS. This means that anything that can be done in the SELECT part (i.e., column projection, row filtering, grouping and join operations, etc.) in order to produce a new table can be included in an ETLMapping.
An ETLMapping is executed with the etl (runETLmapping) operator.
Implementation: An ETLMapping is implemented as a binary tree where the node represents the ETLOperation to be executed and the left branch is another ETLMapping, while the right branch is an RTable (that might be empty in the case of a Unary ETLOperation). Execution proceeds from bottom-left to top-right. This is similar in concept to a left-deep join tree. In a Left-Deep ETLOperation tree the "pipe" of ETLOperations always comes from the left branches. The leaf node is always an ETLMapping with an ETLMapEmpty in the left branch and an RTable in the right branch (the initial RTable passed to the ETLMapping). In this way, the result of the execution of each ETLOperation (which is an RTable) is passed on to the next ETLOperation. Here is an example:
    A Left-Deep ETLOperation Tree

                         final RTable result
                                /
                             etlOp3
                            /      \
                        etlOp2      rtab2
                       /      \
    A leaf-node --> etlOp1     emptyRTab
                   /      \
          ETLMapEmpty      rtab1
You see that always on the left branch we have an ETLMapping data type (i.e., a left-deep ETLOperation tree). So how do we implement the following case?
                    final RTable result
                           /
    A leaf-node --> etlOp1
                   /      \
               rtab1       rtab2
The answer is that we "model" the left RTable (rtab1 in our example) as an ETLMapping of the form:
ETLMapLD { etlOp = ETLcOp{cmap = ColMapEmpty}, tabL = ETLMapEmpty, tabR = rtab1 }
So we embed rtab1 in an ETLMapping that is a leaf (i.e., its previous mapping, tabL, is empty): rtab1 is in the right branch (tabR) and the ETLOperation is the empty column mapping (ColMapEmpty), which returns its input RTable when executed. We can use the function rtabToETLMapping for this job. So it becomes:
    A leaf-node -->         etlOp1
                           /      \
     rtabToETLMapping rtab1        rtab2
In this manner, a leaf-node can also be implemented like this:
                                 final RTable result
                                        /
                                     etlOp3
                                    /      \
                                etlOp2      rtab2
                               /      \
    A leaf-node -->        etlOp1      emptyRTab
                          /      \
    rtabToETLMapping rtab1        emptyRTable
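The bottom-left to top-right evaluation of such a tree can be sketched with a toy model. Nothing below is the library's code: `Table`, `Op`, `Mapping`, `run`, `fromTable` and `example` are hypothetical, operations are tagged explicitly as unary or binary, and giving a binary leaf an empty left input is an assumption.

```haskell
-- Hypothetical stand-in for RTable: a list of rows, each a list of Ints.
type Table = [[Int]]

-- A unary operation ignores the right branch; a binary one uses both inputs.
data Op
  = Unary  (Table -> Table)
  | Binary (Table -> Table -> Table)

-- Left-deep tree: the left branch is the previous mapping, the right branch a table.
data Mapping
  = MapEmpty
  | MapLD Op Mapping Table

-- Evaluate the left subtree first, then apply the operation at this node.
run :: Mapping -> Table
run MapEmpty = []
run (MapLD op MapEmpty right) =        -- leaf: the input table sits on the right
  case op of
    Unary f  -> f right
    Binary g -> g [] right             -- assumption: empty left input
run (MapLD op left right) =
  case op of
    Unary f  -> f (run left)
    Binary g -> g (run left) right

-- Counterpart of rtabToETLMapping: a leaf that returns its table unchanged.
fromTable :: Table -> Mapping
fromTable = MapLD (Unary id) MapEmpty

-- etlOp1 filters rtab1, etlOp2 doubles every value, etlOp3 appends rtab2.
example :: Mapping
example =
  MapLD (Binary (++))
    (MapLD (Unary (map (map (* 2))))
      (MapLD (Unary (filter ((> 1) . sum))) MapEmpty [[1], [2], [3]])
      [])
    [[10]]
```

A binary leaf over two tables is then written as `MapLD (Binary (++)) (fromTable t1) t2`, which mirrors the rtabToETLMapping construction shown above.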
Constructors
ETLMapEmpty | an empty node |
ETLMapLD | a Left-Deep node |
Fields
| |
ETLMapRD | a Right-Deep node |
Fields
| |
ETLMapBal | a Balanced node |
Fields
|
Instances
Eq ETLMapping | |
Defined in Etl.Internal.Core |
Execution of an ETL Mapping
runCM :: RColMapping -> RTable -> RTable
The runCM operator executes an RColMapping. If a target column has the same name as a source column and DontRemoveSrc (i.e., removeSrcCol == No) has been specified, then the (target-column, target-value) key-value pair overwrites the corresponding (source-column, source-value) key-value pair.
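The overwrite rule can be sketched for the 1-to-1 case on a single tuple. This is a toy model, not the library's implementation: `Tuple`, `YesNo` and `map1x1` are hypothetical, and leaving the tuple unchanged when the source column is missing is an assumption.

```haskell
-- Hypothetical stand-ins for RTuple and YesNo.
type Tuple = [(String, Int)]
data YesNo = Yes | No deriving (Eq, Show)

-- 1-to-1 mapping on one tuple. The new target pair replaces any existing pair
-- with the same column name, so a tuple never holds the same column twice.
map1x1 :: String -> YesNo -> String -> (Int -> Int) -> Tuple -> Tuple
map1x1 src remove trg f t =
  case lookup src t of
    Nothing -> t                       -- source column missing (assumption: no change)
    Just v  ->
      let kept = [ kv | kv@(c, _) <- t, c /= trg, remove == No || c /= src ]
      in kept ++ [(trg, f v)]
```

With source and target both named Amount and removeSrcCol set to No, the result contains a single Amount column holding the transformed value.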
etl :: ETLMapping -> RTable
This operator executes an ETLMapping.
etlRes
Arguments
:: ETLMapping | input ETLMapping |
-> RTabResult | output RTabResult |
This operator executes an ETLMapping and returns the RTabResult Writer Monad, which embeds, apart from the resulting RTable, the number of RTuples returned.
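The idea of returning a table together with its row count can be sketched with a plain pair, which has the same shape as a Writer whose log is a running sum. This is only an illustration: `Result`, `withCount`, `tableOf` and `rowsReturned` are hypothetical names, and the real RTabResult type and its accessors may differ.

```haskell
import Data.Monoid (Sum (..))

-- Hypothetical stand-in for RTable.
type Table = [[Int]]

-- Writer-like result: the first component is the "log" (a row count),
-- the second is the table itself.
type Result = (Sum Int, Table)

-- Pair a table with the number of rows it contains.
withCount :: Table -> Result
withCount t = (Sum (length t), t)

-- Extract the table.
tableOf :: Result -> Table
tableOf = snd

-- Extract the number of rows returned.
rowsReturned :: Result -> Int
rowsReturned = getSum . fst
```

A caller that only needs the table would use `tableOf`, much as etl returns a bare RTable while etlRes also carries the count.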
Functions for "Building" an ETL Mapping
rtabToETLMapping :: RTable -> ETLMapping
Models an RTable as an ETLMapping which, when executed, will return the input RTable.
createLeafETLMapLD
Arguments
:: ETLOperation | ETL operation of this ETL mapping |
-> RTable | input RTable |
-> ETLMapping | output ETLMapping |
Creates a left-deep leaf ETL Mapping, of the following form:
    A Left-Deep ETLOperation Tree

                         final RTable result
                                /
                             etlOp3
                            /      \
                        etlOp2      rtab2
                       /      \
    A leaf-node --> etlOp1     emptyRTab
                   /      \
          ETLMapEmpty      rtab1
createLeafBinETLMapLD
Arguments
:: ETLOperation | ETL operation of this ETL mapping |
-> RTable | input RTable1 |
-> RTable | input RTable2 |
-> ETLMapping | output ETLMapping |
Creates a binary-operation leaf node of the form:
    A leaf-node -->         etlOp1
                           /      \
     rtabToETLMapping rtab1        rtab2
connectETLMapLD
Arguments
:: ETLOperation | ETL operation of this ETL Mapping |
-> RTable | Right RTable (right branch) (if this is a Unary ETL mapping this should be an emptyRTable) |
-> ETLMapping | Previous ETL mapping (left branch) |
-> ETLMapping | New ETL Mapping, which has added at the end the new node |
Connects an ETL Mapping to a left-deep ETL Mapping tree, of the form:
    A Left-Deep ETLOperation Tree

                         final RTable result
                                /
                             etlOp3
                            /      \
                        etlOp2      rtab2
                       /      \
    A leaf-node --> etlOp1     emptyRTab
                   /      \
          ETLMapEmpty      rtab1
Example:
    -- connect a Unary ETL mapping (etlOp2)

                etlOp2
               /      \
           etlOp1      emptyRTab

    => connectETLMapLD etlOp2 emptyRTable prevMap

    -- connect a Binary ETL Mapping (etlOp3)

                etlOp3
               /      \
           etlOp2      rtab2

    => connectETLMapLD etlOp3 rtab2 prevMap
Note that the right branch (RTable) appears first in the list of input arguments of this function and the left branch (ETLMapping) appears second. This looks strange, and one might think that it is a mistake (i.e., that the left branch should appear first and the right branch second), since we read from left to right. However, this was a deliberate choice: the left branch (which is the connection point with the previous ETLMapping) is the last argument, so we can partially apply the other arguments and get a new function whose only input parameter is the previous mapping. This is very helpful in function composition.
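The benefit of keeping the previous mapping as the last argument can be sketched with a toy model. The types and the names `connect`, `pipeline` and `rightTables` are hypothetical stand-ins, not the library's definitions. Only the argument order mirrors connectETLMapLD.

```haskell
-- Hypothetical stand-ins for RTable, ETLOperation and ETLMapping.
type Table = [[Int]]

data Op
  = Unary  (Table -> Table)
  | Binary (Table -> Table -> Table)

data Mapping
  = MapEmpty
  | MapLD Op Mapping Table   -- operation, previous mapping (left), table (right)

-- Same argument order as described above: operation, right table, previous mapping.
connect :: Op -> Table -> Mapping -> Mapping
connect op right prev = MapLD op prev right

-- Each partially applied step has type Mapping -> Mapping, so steps compose
-- with (.). The rightmost step is attached first, i.e. lowest in the tree.
pipeline :: Mapping -> Mapping
pipeline = connect (Binary (++)) [[10]]
         . connect (Unary (map (map (* 2)))) []

-- Helper for inspection: the right-branch tables from the top node down to the leaf.
rightTables :: Mapping -> [Table]
rightTables MapEmpty             = []
rightTables (MapLD _ prev right) = right : rightTables prev
```

Had the previous mapping been the second argument, each step would need an explicit lambda or `flip` before it could take part in such a composition.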