
Friday, November 18, 2011

Optimizing the Bottlenecks

PERFORMANCE TUNING OF SOURCES
  1. If the source is a flat file, ensure that the flat file is local to the Informatica server. If the source is a relational table, avoid using synonyms or aliases.
  2. If the source is a flat file, reduce the number of bytes Informatica reads per line by decreasing the Line Sequential Buffer Length setting in the session properties (the default is 1024 bytes per line).
  3. If possible, add a filter condition to the query in the source qualifier so that records are filtered out as early as possible in the process.
  4. If the source qualifier query uses ORDER BY or GROUP BY, create an index on the source table and order by the indexed field of the source table.
PERFORMANCE TUNING OF TARGETS
If the target is a flat file, ensure that the flat file is local to the Informatica server. If the target is a relational table, avoid using synonyms or aliases.
  1. Use bulk load whenever possible.
  2. Increase the commit interval.
  3. Drop constraints and indexes of the table before loading.
PERFORMANCE TUNING OF MAPPINGS
A mapping channels the flow of data from source to target through all the transformations in between; it is the skeleton of the Informatica loading process.
  1. Avoid executing heavy SQL queries from mapplets or mappings.
  2. When queries are needed, make sure they are optimized.
  3. Reduce the number of transformations in the mapping. Use active transformations such as Rank, Joiner, Filter, and Aggregator as sparingly as possible.
  4. Remove all unnecessary links between transformations in the mapping.
  5. If a single mapping contains many targets, then dividing them into separate mappings can improve performance.
  6. If we need to use a single source more than once in a mapping, keep only one source and source qualifier in the mapping and create separate data flows from it into the different (or same) targets as required.
  7. If a session joins many source tables in one source qualifier, optimizing that query will improve performance.
  8. The SQL query that Informatica generates may contain an ORDER BY clause. Remove the ORDER BY clause if it is not needed, or at least reduce the number of columns in the list. For better performance, order by the indexed field of the table.
  9. Combine mappings that use the same set of source data.
  10. Within a mapping, fields that carry the same information should be given the same data type and length throughout the mapping; otherwise time is spent on field conversions.
  11. Instead of doing complex calculations in the query, use an Expression transformation and do the calculation in the mapping.
  12. If data passes through multiple staging areas, removing unnecessary staging areas will improve performance.
  13. Stored procedures reduce performance. Try to keep the stored procedures simple in the mappings.
  14. Unnecessary data type conversions should be avoided since the data type conversions impact performance.
  15. Transformation errors result in performance degradation. Try running the mapping after removing all transformations. If it is taking significantly less time than with the transformations, then we have to fine-tune the transformation.
  16. Keep database interactions to a minimum.

Wednesday, November 16, 2011

PARTITIONING

  • A pipeline consists of a source qualifier and all the transformations and targets that receive data from that source qualifier.
  • When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel.
A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage.
PARTITIONING ATTRIBUTES
1. Partition points
  • By default, the Integration Service sets partition points at various transformations in the pipeline.
  • Partition points mark thread boundaries and divide the pipeline into stages.
  • A stage is a section of a pipeline between any two partition points.

2. Number of Partitions
  • We can define up to 64 partitions at any partition point in a pipeline.
  • When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline.
  • Increasing the number of partitions or partition points increases the number of threads (see the arithmetic sketch after this list).
  • The number of partitions we create equals the number of connections to the source or target. For one partition, one database connection will be used.
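As a rough illustration of how partitions, stages, threads, and connections relate, here is a small Python sketch with hypothetical numbers (four partitions, three stages); it is plain arithmetic based on the rules above, not Informatica code.

```python
# Hypothetical session: 4 partitions at each partition point, 3 pipeline stages
# (reader, transformation, writer). Each stage runs one thread per partition,
# and each partition opens its own connection to the source and to the target.
partitions = 4
stages = 3

threads_total = partitions * stages   # 12 threads for this session
source_connections = partitions       # 4 connections to the source
target_connections = partitions       # 4 connections to the target

print(threads_total, source_connections, target_connections)  # 12 4 4
```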
3. Partition types
  • The Integration Service creates a default partition type at each partition point.
  • If we have the Partitioning option, we can change the partition type. This option is purchased separately.
  • The partition type controls how the Integration Service distributes data among partitions at partition points.
PARTITIONING TYPES
1. Round Robin Partition Type
  • In round-robin partitioning, the Integration Service distributes rows of data evenly to all partitions.
  • Each partition processes approximately the same number of rows.
  • Use round-robin partitioning when we need to distribute rows evenly and do not need to group data among partitions.
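The idea can be shown with a short Python sketch (an illustration of round-robin distribution in general, not the Integration Service's internals): rows are dealt to partitions in turn, so each partition ends up with roughly the same number of rows.

```python
from itertools import cycle

def round_robin(rows, num_partitions):
    """Deal rows out to partitions in turn, ignoring their content."""
    partitions = [[] for _ in range(num_partitions)]
    for row, target in zip(rows, cycle(range(num_partitions))):
        partitions[target].append(row)
    return partitions

# 10 rows over 3 partitions -> sizes 4, 3, 3 (approximately equal)
print([len(p) for p in round_robin(range(10), 3)])
```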
2. Pass-Through Partition Type
  • In pass-through partitioning, the Integration Service processes data without redistributing rows among partitions.
  • All rows in a single partition stay in that partition after crossing a pass-through partition point.
  • Use pass-through partitioning when we want to increase data throughput, but we do not want to increase the number of partitions.
3. Database Partitioning Partition Type
  • Use database partitioning for Oracle and IBM DB2 sources and IBM DB2 targets only.
  • Use any number of pipeline partitions and any number of database partitions.
  • We can improve performance when the number of pipeline partitions equals the number of database partitions.
Database Partitioning with One Source
When we use database partitioning with a source qualifier with one source, the Integration Service generates an SQL query for each database partition and distributes the data from the database partitions evenly among the session partitions.
For example, when a session has three partitions and the database has five partitions, the first and second session partitions each receive data from two database partitions (four database partitions in total), and the third session partition receives data from the remaining database partition. The sketch after this subsection illustrates the distribution.
Partitioning a Source Qualifier with Multiple Source Tables
The Integration Service creates SQL queries for database partitions based on the number of partitions in the database table with the most partitions.
If the session has three partitions and the database table has two partitions, one of the session partitions receives no data.
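A simple way to picture both examples (a sketch of the idea, not the Integration Service's actual algorithm) is to deal the database partitions out to the session partitions in turn:

```python
def assign_db_partitions(num_db_parts, num_session_parts):
    """Deal database partitions out to session partitions in turn."""
    assignment = [[] for _ in range(num_session_parts)]
    for db_part in range(num_db_parts):
        assignment[db_part % num_session_parts].append(db_part)
    return assignment

# 5 database partitions, 3 session partitions -> [[0, 3], [1, 4], [2]]
# (two, two, and one database partition per session partition)
print(assign_db_partitions(5, 3))

# 2 database partitions, 3 session partitions -> [[0], [1], []]
# (one session partition receives no data)
print(assign_db_partitions(2, 3))
```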
4. Hash Auto-Keys Partition Type
  • The Integration Service uses all grouped or sorted ports as a compound partition key.
  • Use hash auto-keys partitioning at or before Rank, Sorter, Joiner, and unsorted Aggregator transformations to ensure that rows are grouped properly before they enter these transformations.
5. Hash User-Keys Partition Type
  • The Integration Service uses a hash function to group rows of data among partitions.
  • We define the number of ports to generate the partition key.
  • We choose the ports that define the partition key.
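The grouping behaviour behind both hash partition types can be sketched in Python (illustrative only; the Integration Service's hash function is internal). With hash auto-keys the key ports are the grouped or sorted ports; with hash user-keys we choose them ourselves. The port names below are hypothetical.

```python
def hash_partition(row, key_ports, num_partitions):
    """Pick a partition from a compound key built from the chosen ports."""
    key = tuple(row[port] for port in key_ports)
    return hash(key) % num_partitions

row = {"CUSTOMER_ID": 1042, "REGION": "WEST", "AMOUNT": 99.5}
# Every row with the same CUSTOMER_ID and REGION lands in the same partition,
# so a downstream Aggregator or Rank sees complete groups.
print(hash_partition(row, ["CUSTOMER_ID", "REGION"], 4))
```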
6. Key range Partition Type
  • We specify one or more ports to form a compound partition key.
  • The Integration Service passes data to each partition depending on the ranges we specify for each port.
  • Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.
  • Example: customers 1-100 in one partition, 101-200 in another, and so on. We define the range for each partition.
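Continuing the customer example, a small Python sketch (hypothetical ranges on a CUSTOMER_ID port, not actual Informatica configuration) shows how a key value maps to a partition:

```python
# Hypothetical ranges: 1-100, 101-200, 201 and above.
ranges = [(1, 100), (101, 200), (201, float("inf"))]

def key_range_partition(customer_id, ranges):
    """Return the index of the partition whose range contains the key."""
    for i, (low, high) in enumerate(ranges):
        if low <= customer_id <= high:
            return i
    return None  # key falls outside every defined range

print(key_range_partition(42, ranges))   # -> 0
print(key_range_partition(150, ranges))  # -> 1
print(key_range_partition(999, ranges))  # -> 2
```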
