Ab Initio - V1.4

The document describes various components that can be used in dataflow graphs, such as Sort, Join, Replicate, and Filter by Expression. It also discusses parallel-processing techniques: component-level parallelism, pipeline parallelism, and data parallelism. Additionally, it covers partition and departition components, multifile systems, sandboxes, and deploying graphs.


Sample Components

 Sort
 Dedup
 Join
 Replicate
 Rollup
 Filter by Expression
 Merge
 Lookup
 Reformat, etc.
Creating Graph – Sort Component
 Sort: The Sort component reorders data. It has two parameters: key and max-core. (A sketch of their roles follows below.)
 Key: The key parameter names the fields on which to sort and describes the collation order.
 Max-core: The max-core parameter limits how much memory the Sort component may use; when the limit is reached, the component dumps sorted data from memory to disk.
[Screenshot: specifying the Key for the Sort]
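As an illustration only, and not Ab Initio's actual implementation, the following Python sketch mimics what key and max-core govern: records accumulate in memory, are sorted by the key, and are cut into sorted runs whenever the batch exceeds a limit; the runs are then merged back into one ordered flow. All names here are invented for the example.

    import heapq

    def external_sort(records, key, max_in_memory=3):
        """Sort records by `key`, cutting a new sorted run whenever the
        in-memory batch exceeds `max_in_memory` (a stand-in for the
        max-core limit). A real implementation would write each run to
        disk; the runs are kept as lists here for clarity."""
        runs, batch = [], []
        for rec in records:
            batch.append(rec)
            if len(batch) >= max_in_memory:
                runs.append(sorted(batch, key=key))  # "dump" a sorted run
                batch = []
        if batch:
            runs.append(sorted(batch, key=key))
        return list(heapq.merge(*runs, key=key))  # merge runs in key order

    rows = [{"id": 5}, {"id": 2}, {"id": 9}, {"id": 1}, {"id": 7}]
    print(external_sort(rows, key=lambda r: r["id"]))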
Creating Graph – Dedup Component
 The Dedup component removes duplicate records from each group of records that share a key.
 The Dedup criterion is one of unique-only, First, or Last: First keeps the first record of each key group, Last keeps the last, and unique-only keeps only those groups containing a single record. (See the sketch below.)
[Screenshot: selecting the Dedup criteria]
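A minimal Python sketch of the three criteria, assuming the input is already sorted on the key (as Dedup expects):

    from itertools import groupby

    def dedup(records, key, keep="first"):
        """Apply a Dedup criterion ("first", "last", or "unique-only")
        to records already sorted on `key`."""
        out = []
        for _, grp in groupby(records, key=key):
            grp = list(grp)
            if keep == "first":
                out.append(grp[0])
            elif keep == "last":
                out.append(grp[-1])
            elif keep == "unique-only" and len(grp) == 1:
                out.append(grp[0])
        return out

    rows = [{"k": 1}, {"k": 1}, {"k": 2}, {"k": 3}, {"k": 3}]
    print(dedup(rows, key=lambda r: r["k"], keep="unique-only"))  # [{'k': 2}]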
Creating Graph – Replicate Component
 Replicate combines the data records from the inputs into one flow and writes a copy of that flow to each of its output ports.
 Use Replicate to support component parallelism. (A one-line analogue appears below.)
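As a loose analogue (not an Ab Initio API), Python's itertools.tee gives each downstream consumer its own independent copy of one stream:

    from itertools import tee

    # One input flow, copied once per downstream "output port".
    source = iter([10, 20, 30])
    for_rollup, for_filter = tee(source, 2)

    print(list(for_rollup))  # [10, 20, 30]
    print(list(for_filter))  # [10, 20, 30]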
Creating Graph – Join Component
 Join matches records from its input flows on a key and combines them into output records. (A hash-join sketch follows below.)
 Specify the key for the Join.
 Specify the type of Join.
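For illustration, a simple hash join in Python (field names are hypothetical), the classic way to match two inputs on a key:

    from collections import defaultdict

    def inner_join(left, right, key):
        """Hash join: index one input by key, then probe with the other."""
        index = defaultdict(list)
        for r in right:
            index[key(r)].append(r)
        return [{**l, **r} for l in left for r in index[key(l)]]

    customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
    orders = [{"id": 1, "amt": 10}, {"id": 1, "amt": 7}]
    print(inner_join(orders, customers, key=lambda r: r["id"]))
    # [{'id': 1, 'amt': 10, 'name': 'Ann'}, {'id': 1, 'amt': 7, 'name': 'Ann'}]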
Database Configuration (.dbc)
 A file with a .dbc extension provides the GDE with the information it needs to connect to a database. A configuration file contains the following information:
– The name and version number of the database to which you want to connect.
– The name of the computer on which the database instance or server to which you want to connect runs, or on which the database remote-access software is installed.
– The name of the database instance, server, or provider to which you want to connect.
 You generate a configuration file by using the Properties dialog box for one of the Database components.
Creating Parallel Applications
 Types of Parallel Processing
– Component-level parallelism: an application with multiple components running simultaneously on separate data uses component parallelism.
– Pipeline parallelism: an application with multiple components running simultaneously on the same data uses pipeline parallelism. (The sketch below imitates this.)
– Data parallelism: an application whose data is divided into segments that are operated on simultaneously uses data parallelism.
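A single-process imitation of pipeline parallelism using Python generators: each stage begins consuming records before the upstream stage has finished producing them, much as connected components do in a graph. The stage names are invented for the example.

    def read(records):          # stage 1: produce records
        for rec in records:
            yield rec

    def scale(flow):            # stage 2: transform records as they arrive
        for rec in flow:
            yield rec * 10

    def keep_positive(flow):    # stage 3: filter records as they arrive
        for rec in flow:
            if rec > 0:
                yield rec

    pipeline = keep_positive(scale(read([-1, 2, 3])))
    print(list(pipeline))  # [20, 30]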
Partition Components
 Partition by Expression: dividing data according to a DML expression.
 Partition by Key: grouping data by a key.
 Partition with Load Balance: dynamic load balancing.
 Partition by Percentage: distributing data so that the output is proportional to fractions of 100.
 Partition by Range: dividing data evenly among nodes, based on a key and a set of partitioning ranges.
 Partition by Round-robin: distributing data evenly, in blocksize chunks, across the output partitions. (Two of these schemes are sketched below.)
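Minimal Python sketches of two of these schemes (the hashing and chunking details are illustrative, not Ab Initio's):

    def partition_by_key(records, key, n):
        """Send every record with the same key to the same partition."""
        parts = [[] for _ in range(n)]
        for rec in records:
            parts[hash(key(rec)) % n].append(rec)
        return parts

    def partition_round_robin(records, n, blocksize=1):
        """Deal records across n partitions in blocksize chunks."""
        parts = [[] for _ in range(n)]
        for i, rec in enumerate(records):
            parts[(i // blocksize) % n].append(rec)
        return parts

    print(partition_by_key(range(10), key=lambda r: r % 3, n=3))
    print(partition_round_robin(range(10), n=3, blocksize=2))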
Departition Components
 Concatenate: produces a single output flow that contains all the records from the first input partition, then all the records from the second input partition, and so on.
 Gather: collects inputs from multiple partitions in an arbitrary order and produces a single output flow; it does not maintain sort order.
 Interleave: collects records from many sources in round-robin fashion.
 Merge: collects inputs from multiple sorted partitions and maintains the sort order. (Three of these are sketched below.)
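Python one-liners for three of the four (Gather's arbitrary, arrival-order behavior requires real concurrency, so it is omitted):

    import heapq
    from itertools import chain

    parts = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]  # three sorted partitions

    # Concatenate: partition 1's records, then partition 2's, and so on.
    print(list(chain(*parts)))        # [1, 4, 7, 2, 5, 8, 3, 6, 9]

    # Interleave: one record from each partition per round-robin turn.
    print([r for grp in zip(*parts) for r in grp])  # [1, 2, 3, 4, 5, ...]

    # Merge: combine sorted partitions while preserving the sort order.
    print(list(heapq.merge(*parts)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]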
Multifile systems
 A multifile system is a specially created set of directories, possibly on different machines, that have an identical substructure.
 Each directory is a partition of the multifile system. When a multifile is placed in a multifile system, its partitions are files within each of the partitions of the multifile system.
 A multifile system gives better performance than a flat file system because it can divide your data among multiple disks or CPUs.
 Typically (an SMP machine is the exception), a multifile system is created with the control partition on one node and the data partitions on other nodes, to distribute the work and improve performance. To do this, use full Internet URLs that specify file and directory names and locations on the remote machines. (The layout is sketched below.)
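A loose sketch of the "identical substructure" idea, with entirely hypothetical paths (a real multifile system is created with Ab Initio's own tools, not like this):

    import os

    # Hypothetical layout: one control partition plus three data partitions.
    # In a real multifile system these directories live on different nodes.
    control = "/u/mfs/control"
    data_parts = ["/u/node1/mfs_p0", "/u/node2/mfs_p1", "/u/node3/mfs_p2"]

    # Every partition gets the same substructure, so a multifile such as
    # in_data/customer_info.dat has one partition file in each directory.
    for root in [control] + data_parts:
        os.makedirs(os.path.join(root, "in_data"), exist_ok=True)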
Multifile
[Diagram slide]
SANDBOX
 A sandbox is a collection of graphs and related files that are stored in a single directory tree and treated as a group for purposes of version control, navigation, and migration.
 A sandbox can be a file-system copy of a datastore project.
 In a graph, instead of specifying the entire path for a file location, we specify only a sandbox parameter variable, for example $AI_IN_DATA/customer_info.dat, where $AI_IN_DATA contains the entire path, defined with reference to the sandbox's $AI_HOME variable.
 The actual in_data directory is $AI_HOME/in_data in the sandbox. (Resolution is sketched below.)
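A small sketch of how such a parameter resolves (the paths are hypothetical):

    import os

    # Hypothetical sandbox parameters: $AI_IN_DATA is defined relative to
    # $AI_HOME, so relocating the sandbox means changing one variable.
    params = {"AI_HOME": "/u/dev/my_sandbox"}
    params["AI_IN_DATA"] = os.path.join(params["AI_HOME"], "in_data")

    # $AI_IN_DATA/customer_info.dat then resolves to the full path:
    print(os.path.join(params["AI_IN_DATA"], "customer_info.dat"))
    # /u/dev/my_sandbox/in_data/customer_info.dat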
SANDBOX
 The sandbox provides an excellent mechanism for maintaining uniqueness while moving from the development environment to the production environment, by means of switch parameters.
 We can define parameters in the sandbox that can be used across all the graphs pertaining to that sandbox.
 The topmost variable, $PROJECT_DIR, contains the path of the home directory.
Deploying
 Every graph, after validation and testing, has to be deployed as a .ksh file into the run directory on UNIX.
 This .ksh file is an executable file that is the backbone of the entire automation/wrapper process.
 The wrapper automation consists of the .run and .env files, the dependency list, the job list, etc.
 For a detailed description of the wrapper and the different directories and files, please refer to the documentation on the wrapper / the UNIX presentation.
