Wednesday, 25 June 2014

My 100th post

Boom, it's my 100th post. I just want to share my happiness at this moment. I started this blog two years ago with no idea where to start or where to end. Later I realised this will be a never-ending journey for me.

I began to love this blog. It has become a peaceful place for me to learn.

Sharing these things with you has done so much for me. I hope whoever reads this blog learns from it and feels encouraged by what I write here.

'I hope it brings some good to the world'.


Application programming interface 1

What is an API?

An application-programming interface (API) is a set of programming instructions and standards for accessing a Web-based software application or Web tool. A software company releases its API to the public so that other software developers can design products that are powered by its service.
For example, Amazon.com released its API so that Web site developers could more easily access Amazon's product information. Using the Amazon API, a third party Web site can post direct links to Amazon products with updated prices and an option to "buy now."
An API is a software-to-software interface, not a user interface. With APIs, applications talk to each other without any user knowledge or intervention. When you buy movie tickets online and enter your credit card information, the movie ticket Web site uses an API to send your credit card information to a remote application that verifies whether your information is correct. Once payment is confirmed, the remote application sends a response back to the movie ticket Web site saying it's OK to issue the tickets.
As a user, you only see one interface -- the movie ticket Web site -- but behind the scenes, many applications are working together using APIs. This type of integration is called seamless, since the user never notices when software functions are handed from one application to another [source: TConsult, Inc.]
An API resembles Software as a Service (SaaS), since software developers don't have to start from scratch every time they write a program. Instead of building one core application that tries to do everything -- e-mail, billing, tracking, etcetera -- the same application can contract out certain responsibilities to remote software that does it better.
Let's take Web conferencing as an example. Web conferencing is SaaS, since it can be accessed on demand using nothing but a Web site. With a conferencing API, that same on-demand service can be integrated into another Web-based software application, like an instant messaging program or a Web calendar.
The user can schedule a Web conference in his Web calendar program and then click a link within the same program to launch the conference. The calendar program doesn't host or run the conference itself. It uses a conferencing API to communicate behind the scenes with the remote Web conferencing service and seamlessly delivers that functionality to the user.
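To make the software-to-software idea concrete, here is a minimal Python sketch of calling a Web API over HTTP. The endpoint URL, parameters, and response fields are placeholders invented for illustration, not any real vendor's API.

```python
import requests

# Hypothetical product-lookup endpoint -- the URL, parameters and response
# fields are placeholders, not a real vendor API.
API_URL = "https://api.example.com/v1/products"

def get_product(product_id, api_key):
    """Call the remote service and return its parsed JSON response."""
    response = requests.get(
        API_URL,
        params={"id": product_id},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    response.raise_for_status()   # surface HTTP errors instead of hiding them
    return response.json()        # e.g. {"id": "...", "price": ..., "buy_url": "..."}

# print(get_product("B000123", "my-secret-key")["price"])
```

The calling program never shows this exchange to the user; it simply consumes the returned data, which is exactly the seamless integration described above.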

Monday, 23 June 2014

Mapplets

A mapplet is a reusable object that you create in the Mapplet Designer. It contains a set of transformations and lets you reuse that transformation logic in multiple mappings.

For example, suppose you need to add the same logic to more than one mapping, such as concatenating First_Name & Last_Name and Address_Street_1 & Address_Street_2, and this logic is needed in three mappings: Employee_Table, Student_Table, and Patient_Table.

In this scenario, instead of repeating the logic, we can create a mapplet that concatenates (merges) the columns and use an instance of that mapplet in the three mappings mentioned above.

When you use a mapplet in a mapping, you use an instance of the mapplet. Like a reusable transformation, any change made to the mapplet is inherited by all instances of the mapplet.

Mapplets help simplify mappings in the following ways:

  • Include source definitions. Use multiple source definitions and source qualifiers to provide source data for a mapping.
  • Accept data from sources in a mapping. If you want the mapplet to receive data from the mapping, use an Input transformation to receive source data.
  • Include multiple transformations. A mapplet can contain as many transformations as you need.
  • Pass data to multiple transformations. You can create a mapplet to feed data to multiple transformations. Each Output transformation in a mapplet represents one output group in a mapplet.
  • Contain unused ports. You do not have to connect all mapplet input and output ports in a mapping.

To use a mapplet in a mapping, first we need to configure it for input and output. In addition to transformation logic that you configure, a mapplet has the following components:


  1. Mapplet input
  2. Mapplet output
  3. Mapplet ports



Mapplet Input

Mapplet input can originate from a source definition and/or from an Input transformation in the mapplet. You can create multiple pipelines in a mapplet. Use multiple source definitions and source qualifiers or Input transformations. You can also use a combination of source definitions and Input transformations.

Mapplet Output

Use an Output transformation in a mapplet to pass data through the mapplet into a mapping. A mapplet must contain at least one Output transformation with at least one connected port in the mapplet. Each connected port in an Output transformation displays as a mapplet output port in a mapping. Each Output transformation in a mapplet displays as an output group in a mapping. An output group can pass data to multiple pipelines in a mapping.

Use the following rules and guidelines when you add transformations to a mapplet:
  • If you use a Sequence Generator transformation, you must use a reusable Sequence Generator transformation.
  • If you use a Stored Procedure transformation, you must configure the Stored Procedure Type to be Normal.
  • You cannot include PowerMart 3.5-style LOOKUP functions in a mapplet.
  • You cannot include the following objects in a mapplet:
      - Normalizer transformations
      - COBOL sources
      - XML Source Qualifier transformations
      - XML sources
      - Target definitions
      - Other mapplets

Although you can use reusable transformations and shortcuts in a mapplet, it is better to use a copy of the transformation to protect the validity of the mapplet. Reusable transformations and shortcuts inherit changes made to their original transformations, and this might invalidate the mapplet and the mappings that use it.

Note:


Use the following rules and guidelines when you edit a mapplet that is used by mappings:
  • Do not delete a port from the mapplet. The Designer deletes mapplet ports in the mapping when you delete links to an Input or Output transformation or when you delete ports connected to an Input or Output transformation.
  • Do not change the datatype, precision, or scale of a mapplet port. The datatype, precision, and scale of a mapplet port is defined by the transformation port to which it is connected in the mapplet. Therefore, if you edit a mapplet to change the datatype, precision, or scale of a port connected to a port in an Input or Output transformation, you change the mapplet port.
  • Do not change the mapplet type. If you remove all active transformations from an active mapplet, the mapplet becomes passive. If you add an active transformation to a passive mapplet, the mapplet becomes active. 

Application programming interface

In computer programming, an application programming interface (API) specifies how some software components should interact with each other.

In addition to accessing databases or computer hardware, such as hard disk drives or video cards, an API can be used to ease the work of programming graphical user interface components. In practice, many times an API comes in the form of a library that includes specifications for routines, data structures, object classes, and variables. In some other cases, notably for SOAP and REST services, an API comes as just a specification of remote calls exposed to the API consumers.

Another definition:

API, an abbreviation of application program interface, is a set of routines, protocols, and tools for building software applications. An API specifies how software components should interact and is used when programming graphical user interface (GUI) components. A good API makes it easier to develop a program by providing all the building blocks. A programmer then puts the blocks together.

Popular API Examples

Programmable Web, a site that tracks more than 9,000 APIs, lists Google Maps, Twitter, YouTube, Flickr and Amazon Product Advertising as the most popular APIs.

1. Google Maps API: The Google Maps API lets developers embed Google Maps on webpages using a JavaScript or Flash interface. The Google Maps API is designed to work on mobile devices and desktop browsers.

Example API Use : Charting Your Course Using the Google Maps API.

2. YouTube APIs: Google's YouTube APIs let developers integrate YouTube videos and functionality into websites or applications. YouTube APIs include the YouTube Analytics API, YouTube Data API, YouTube Live Streaming API, YouTube Player APIs and others.

3. Flickr API: The Flickr API is used by developers to access the Flickr photo-sharing community's data. The Flickr API consists of a set of callable methods and some API endpoints.

4. Twitter APIs: Twitter offers two APIs. The REST API allows developers to access core Twitter data and the Search API provides methods for developers to interact with Twitter Search and trends data.
Example API Use : Using the Twitter API to Create an Early Alert System.

5. Amazon Product Advertising API: Amazon's Product Advertising API gives developers access to Amazon's product selection and discovery functionality so they can advertise Amazon products and monetize a website.

Friday, 20 June 2014

What is a Surrogate key?

A surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.
Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. The surrogate key can be generated with an Informatica sequence generator, an Oracle sequence, or SQL Server identity values.

It is useful because the natural primary key (e.g., Customer Number in the Customer table) can change, and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.
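As a rough illustration of the idea (not how Informatica or a database actually generates keys), here is a minimal Python sketch that assigns an internal AIRPORT_ID to each AIRPORT_NAME; the airport names are made up:

```python
import itertools

# Illustrative only: a real system would use a database sequence, an identity
# column, or an Informatica Sequence Generator instead of an in-memory counter.
_next_id = itertools.count(start=1)
_airport_ids = {}          # natural key (AIRPORT_NAME) -> surrogate key (AIRPORT_ID)

def surrogate_key(airport_name):
    """Return a stable AIRPORT_ID for a given AIRPORT_NAME, creating one if needed."""
    if airport_name not in _airport_ids:
        _airport_ids[airport_name] = next(_next_id)
    return _airport_ids[airport_name]

print(surrogate_key("Heathrow"))   # 1
print(surrogate_key("Changi"))     # 2
print(surrogate_key("Heathrow"))   # 1 -- the same row keeps the same surrogate key
```

The client only ever sees AIRPORT_NAME; the numeric AIRPORT_ID stays internal to the system.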

Tuesday, 17 June 2014

Router Transformation

  • The Router Transformation is an Active and Connected Transformation.

  • A Router transformation is similar to a Filter transformation because both transformations allow you to use a condition to test data. A Filter transformation tests data for one condition and drops the rows of data that do not meet the condition. However, a Router transformation tests data for one or more conditions and gives you the option to route rows of data that do not meet any of the conditions to a default output group.
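A minimal Python sketch of the difference (the conditions and group names are invented for illustration): the filter keeps or drops rows against one condition, while the router tests every row against several conditions and sends non-matching rows to a default group.

```python
# Filter-like behaviour: rows that fail the single condition are dropped.
def filter_rows(rows, condition):
    return [row for row in rows if condition(row)]

# Router-like behaviour: each row is tested against every group condition and
# copied to each matching group; rows that match nothing go to a default group.
def route_rows(rows, conditions):
    groups = {name: [] for name in conditions}
    groups["DEFAULT"] = []
    for row in rows:
        matched = False
        for name, condition in conditions.items():
            if condition(row):
                groups[name].append(row)
                matched = True
        if not matched:
            groups["DEFAULT"].append(row)
    return groups

rows = [{"dept": 10}, {"dept": 20}, {"dept": 99}]
print(filter_rows(rows, lambda r: r["dept"] == 10))              # only dept 10 survives
print(route_rows(rows, {"DEPT_10": lambda r: r["dept"] == 10,
                        "DEPT_20": lambda r: r["dept"] == 20}))  # dept 99 lands in DEFAULT
```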

Sequence Generator Transformation


  • Sequence Generator is a passive and connected transformation.
  • The Sequence Generator transformation generates numeric values. Use the Sequence Generator to create unique primary key values, replace missing primary keys, or cycle through a sequential range of numbers.
  • It contains two output ports that you can connect to one or more transformations. The Integration Service generates a block of sequence numbers each time a block of rows enters a connected transformation. If you connect CURRVAL, the Integration Service processes one row in each block. When NEXTVAL is connected to the input port of another transformation, the Integration Service generates a sequence of numbers. When CURRVAL is connected to the input port of another transformation, the Integration Service generates the NEXTVAL value plus the Increment By value.

Common Uses for Sequence Generator
You can complete the following tasks with a Sequence Generator transformation:
  • Create keys.
  • Replace missing values.
  • Cycle through a sequential range of numbers.
Creating Keys
You can create primary or foreign key values with the Sequence Generator transformation by connecting the NEXTVAL port to a target or downstream transformation. You can use a range of values from 1 to 9,223,372,036,854,775,807 with the smallest interval of 1.

When you create primary or foreign keys, use the Cycle option to prevent the Integration Service from creating duplicate primary keys. You might do this by selecting the Truncate Target Table option in the session properties or by creating composite keys.

To create a composite key, you can configure the Integration Service to cycle through a smaller set of values. For example, if you have three stores generating order numbers, you might have a Sequence Generator cycling through values from 1 to 3, incrementing by 1. When you pass the following set of foreign keys, the generated values then create unique composite keys:

COMPOSITE_KEY  ORDER_NO
1              12345
2              12345
3              12345
1              12346
2              12346
3              12346

Sequence Generator Ports
The Sequence Generator transformation has two output ports: NEXTVAL and CURRVAL. You cannot edit or delete these ports. Likewise, you cannot add ports to the transformation.

NEXTVAL
Connect NEXTVAL to multiple transformations to generate unique values for each row in each transformation. Use the NEXTVAL port to generate sequence numbers by connecting it to a downstream transformation or target. You connect the NEXTVAL port to generate the sequence based on the Current Value and Increment By properties. If the Sequence Generator is not configured to cycle through the sequence, the NEXTVAL port generates sequence numbers up to the configured End Value.

Note: When you run a partitioned session on a grid, the Sequence Generator transformation skips values depending on the number of rows in each partition.

CURRVAL
CURRVAL is NEXTVAL plus the Increment By value. You typically only connect the CURRVAL port when the NEXTVAL port is already connected to a downstream transformation. When a row enters a transformation connected to the CURRVAL port, the Integration Service passes the last-created NEXTVAL value plus the Increment By value.

Note: When you run a partitioned session on a grid, the Sequence Generator transformation might skip values depending on the number of rows in each partition.

Sequence Generator Transformation Properties
The Sequence Generator transformation is unique among all transformations because you cannot add, edit, or delete the default ports, NEXTVAL and CURRVAL.

Start Value and Cycle
Use Cycle to generate a repeating sequence, such as numbers 1 through 12 to correspond to the months in a year.

To cycle the Integration Service through a sequence:
  1. Enter the lowest value in the sequence that you want the Integration Service to use for the Start Value.
  2. Enter the highest value to be used for End Value.
  3. Select Cycle.
As it cycles, when the Integration Service reaches the configured end value for the sequence, it wraps around and starts the cycle again, beginning with the configured Start Value.
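A small Python generator can mimic this cycling behaviour; the 1-to-12 month example follows the text above, and the function is just a sketch of the idea, not Informatica's implementation:

```python
def sequence(start_value, end_value, increment_by=1, cycle=True):
    """Yield sequence numbers, wrapping back to start_value after end_value when cycling."""
    current = start_value
    while True:
        yield current
        current += increment_by
        if current > end_value:
            if not cycle:
                return
            current = start_value      # wrap around, as the Cycle option does

months = sequence(start_value=1, end_value=12)
print([next(months) for _ in range(14)])   # 1 .. 12, then 1, 2 again
```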


NOTE:

Start Value: The start value of the generated sequence that the Integration Service uses if you select Cycle. The default is 0; the maximum value is 9,223,372,036,854,775,806.
Increment By: The difference between two consecutive values from the NEXTVAL port. The default is 1; the maximum value is 2,147,483,647.
End Value: The maximum value the Integration Service generates; the maximum is 9,223,372,036,854,775,807.
Current Value: The value the Integration Service uses as the first value in the sequence; it must be greater than or equal to the start value and less than the end value.
Cycle: If enabled, the Integration Service cycles through the sequence range.
Number of Cached Values: The number of sequential values the Integration Service caches at a time.
Reset: If enabled, the Integration Service generates values based on the original current value for each session. Otherwise, the Integration Service updates the current value to reflect the last-generated value for the session plus one, and then uses the updated current value as the basis for the next session run. Reset is disabled for reusable Sequence Generator transformations.

SQL Transformation

  • SQL Transformation is an Active/Passive and Connected transformation.

  • The SQL transformation processes SQL queries midstream in a pipeline. You can insert, delete, update, and retrieve rows from a database. You can pass the database connection information to the SQL transformation as input data at run time. The transformation processes external SQL scripts or SQL queries that you create in an SQL editor. The SQL transformation processes the query and returns rows and database errors.

When you create an SQL transformation, you configure the following options:
  • Mode. The SQL transformation runs in one of the following modes:
1.    Script mode. The SQL transformation runs ANSI SQL scripts that are externally located. You pass a script name to the transformation with each input row. The SQL transformation outputs one row for each input row.
2.    Query mode. The SQL transformation executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters. You can output multiple rows when the query has a SELECT statement (see the sketch after this list).

  • Passive or active transformation. The SQL transformation is an active transformation by default. You can configure it as a passive transformation when you create the transformation.
  • Database type. The type of database the SQL transformation connects to.
  • Connection type. Pass database connection information to the SQL transformation or use a connection object.
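As a rough analogue of query mode (not the SQL transformation itself), the sketch below runs one parameterised query per input row and can emit several output rows per input row; the table and column names are invented for the example:

```python
import sqlite3

# Build a small in-memory table to query against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "Acme"), (2, "Acme"), (3, "Globex")])

# One query per input row; a SELECT can return multiple output rows per input row.
input_rows = [{"customer": "Acme"}, {"customer": "Globex"}]
for row in input_rows:
    cursor = conn.execute("SELECT order_id FROM orders WHERE customer = ?",
                          (row["customer"],))
    for (order_id,) in cursor:
        print(row["customer"], order_id)
```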

XML Parser Transformation

  • XML Parser is an Active and Connected transformation.

  • The XML Parser transformation is used to extract XML data inside a pipeline. The XML Parser transformation lets you extract XML data from messaging systems, such as TIBCO or MQ Series, and from other sources, such as files or databases. The XML Parser transformation functionality is similar to the XML source functionality, except that it parses the XML in the pipeline.

For example, you might want to extract XML data from a flat file (where the data is in XML format) and pass the data to relational targets.
The XML Parser transformation reads XML data from a single input port and writes data to one or more output ports.

XML Generator Transformation

  • XML Generator is an Active and Connected transformation.

  • The XML Generator transformation is used to create XML inside a pipeline. The XML Generator transformation lets you read data from messaging systems, such as TIBCO and MQ Series, or from other sources, such as files or databases. The XML Generator transformation functionality is similar to the XML target functionality, except that it generates the XML in the pipeline.

Thursday, 12 June 2014

Caches in Informatica

Static Cache

A static cache is the same as a cached lookup: once the cache is created, the Integration Service always queries the cache instead of the lookup table.

With a static cache, when the lookup condition is true the lookup returns the value from the lookup table; otherwise it returns NULL or the default value. The important thing about a static cache is that you cannot insert into or update the cache.

Dynamic Cache

In Dynamic Cache we can insert or update rows in the cache when we pass the rows. The Integration Service dynamically inserts or updates data in the lookup cache and passes the data to the target. The dynamic cache is synchronized with the target.

Shared Cache

When we use a shared cache, the Informatica server creates the cache memory once for multiple Lookup transformations in the mapping; after the cache is built for the first lookup, the same cache is reused by the other Lookup transformations.

We can share the lookup cache between multiple transformations. An unnamed cache is shared between transformations in the same mapping, and a named cache between transformations in the same or different mappings.

Persistent cache

With a persistent cache, the Informatica server saves the lookup cache files when it processes the Lookup transformation and reuses them the next time the session runs. The Integration Service saves or deletes the lookup cache files after a successful session run based on whether the lookup cache is marked as persistent or not.

To make a lookup cache a persistent cache, you need to make the following changes:

Lookup cache persistent : Needs to be checked
Cache File Name Prefix : Enter the Named Persistent cache file name
Re-cache from lookup source : Needs to be checked

Wednesday, 11 June 2014

MD5 Function in INFORMATICA

MD5 (Message Digest Function) is a hash function in Informatica which is used to evaluate data integrity. The MD5 function uses Message-Digest Algorithm 5 (MD5) and calculates the check-sum of the input value.  MD5 is a one-way cryptography hash function with a 128-bit hash value.

MD5 returns a 32 character string of hexadecimal digits 0-9 & a-f and returns NULL if the input is a null value.

Example:

When you wish to write only changed data to a database, use MD5 to generate checksum values for the rows of data that you read from a source.

When you run a session, compare the previously generated checksum values against the new checksum values.  Then, write the rows with an updated checksum value to the target.  You can conclude that an updated checksum value would indicate that the data has changed.

Change data capture (CDC) can be done in many ways.  There are methodologies such as Timestamp, Versioning, Status indicators, Triggers and Transaction logs and Checksum.  The advantage of using MD5 function is to reduce overall extract-transform-load (ETL) run-time and the cache memory usage, by caching only the required fields which are of utmost importance.


When To Use MD5 Function in Informatica?

Use MD5 when you want to update only the changed records (any column change) in the target. Instead of passing all existing records to the target for update (whether changed or unchanged), it is always recommended to compare the records first.

Column-to-column comparison is a little painful if your column count is high. Instead, concatenate all the columns and apply the MD5 function to the source and target records for the same key, then compare the two MD5 outputs. The changed records can be identified, and only those records need to be updated in the target. This calculation is fast, and the output of the MD5 function can be used as a fingerprint to differentiate records.
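The sketch below shows the idea in Python using hashlib; inside Informatica you would call the MD5() function in an Expression transformation instead. The column names and delimiter are placeholders:

```python
import hashlib

def row_checksum(row, columns, delimiter="|"):
    """MD5 of the concatenated (stringified) columns -- a 32-character hex fingerprint."""
    concatenated = delimiter.join(str(row.get(col, "")) for col in columns)
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

columns = ["PRODUCT_NUM", "PRODUCT_DESC", "ADDRESS"]     # placeholder column list

source_row = {"PRODUCT_NUM": 101, "PRODUCT_DESC": "Widget", "ADDRESS": "12 Main St"}
target_row = {"PRODUCT_NUM": 101, "PRODUCT_DESC": "Widget", "ADDRESS": "14 Main St"}

if row_checksum(source_row, columns) != row_checksum(target_row, columns):
    print("Checksums differ -> flag the row as an update")
else:
    print("Checksums match -> skip the row")
```

Using a delimiter between columns avoids false matches such as 'AB' + 'C' colliding with 'A' + 'BC'.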


MD5 helps improve performance compared to lookups only when there are more than 10 comparison columns.

Limitations

If we have non-string columns (SMALLINT, INT, NUMBER, etc.), we need to convert them into characters using the TO_CHAR function, because the MD5 function only accepts character strings. If you don't use TO_CHAR, the output port using MD5 may become invalid. Sometimes the port does validate, but the warning generated says that non-string data type columns (e.g., SMALLINT) are converted to string, and this implicit conversion has a significant performance impact.

Normal Approach

To identify records for updates and inserts, we use a Lookup transformation. The cache built by the lookup depends on two factors:

  • The number of columns in the comparison condition
  • The amount of data in the lookup table

When there is no primary key column to identify the changes, there are two options:

  • Compare all the columns in the lookup, or
  • Compare the data using PowerExchange change data capture.

Beware that this can degrade performance. In this scenario, using the MD5 function is our best option.

Scenario

Consider a scenario where the incoming product records consist of PRODUCT_NUM, PRODUCT_DESC, and address fields, with no primary key associated with them. In such a scenario, it is imperative that an immutable unique identifier be assigned to these records on the fly. This on-the-fly unique identifier can also be used in future loads as a key to identify whether an incoming record is a potential update or an insert.

The MD5 function generates a unique 32-character-wide hexadecimal string for a given input string. In this example, the source table includes details such as PRODUCT_NUM, PRODUCT_DESC, and address fields. The MD5 function in an Expression transformation is used to assign a 32-character-wide key to each of these records and load them into the target file.


[Image: SK Puroshottam - MD5 Function inside Informatica]


Conclusion

By using MD5 values, we can identify whether the data has changed or not without degrading performance, and the data is handled in the most efficient way possible. The MD5 value is always recommended for scenarios with many comparison columns and no primary key column in the lookup table. There is a limitation, however: the input to the MD5 function needs to be a string data type, and it returns a 32-character hexadecimal string.

Tuesday, 10 June 2014

Update Strategy Transformation

Update Strategy transformation is an active and connected transformation. It is used to insert, update, and delete records in the target table. It can also reject records so that they never reach the target table.


The design of the target table decides how to handle the changes to existing rows. In Informatica, you can set the update strategy at two different levels:
  • Session level: Configuring at session level instructs the Integration Service to either treat all rows in the same way (insert, update, or delete) or to use the instructions coded in the session's mapping to flag rows for different database operations.
  • Mapping level: Use the Update Strategy transformation to flag rows for insert, update, delete, or reject.

Flagging Rows in Mapping with Update Strategy:

You have to flag each row for inserting, updating, deleting or rejecting. The constants and their numeric equivalents for each database operation are listed below.
  • DD_INSERT: Numeric value is 0. Used for flagging the row as Insert.
  • DD_UPDATE: Numeric value is 1. Used for flagging the row as Update.
  • DD_DELETE: Numeric value is 2. Used for flagging the row as Delete.
  • DD_REJECT: Numeric value is 3. Used for flagging the row as Reject.

The integration service treats any other numeric value as an insert.


For example, the following expression flags rows in department 10 for update and all other rows for insert:

IIF(department_id=10, DD_UPDATE, DD_INSERT)
Important Note:

Update strategy works only when we have a primary key on the target table. If there is no primary key available on the target table, then you have to specify a primary key in the target definition in the mapping for update strategy transformation to work.

Sunday, 8 June 2014

Metadata

Metadata is data that describes other data.

Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, you keep the  information about the logical data structures, the information about the files and addresses, the information about the indexes, and so on. The data dictionary contains data about the data in the database. Similarly, the metadata component is the data about the data in the data warehouse.

Types of Metadata

Metadata in a data warehouse falls into three major categories:

  • Operational Metadata
  • Extraction and Transformation Metadata
  • End-User Metadata

Build a data warehouse


Before deciding to build a data warehouse for your organization, you need to ask the following basic and fundamental questions and address the relevant issues:
  • Top-down or bottom-up approach?
  • Enterprise-wide or departmental?
  • Which first—data warehouse or data mart?
  • Build pilot or go with a full-fledged implementation?
  • Dependent or independent data marts?

Top-Down Approach

The advantages of this approach are:
  1. A truly corporate effort, an enterprise view of data
  2. Inherently architected—not a union of disparate data marts
  3.  Single, central storage of data about the content
  4. Centralized rules and control
  5. May see quick results if implemented with iterations
The disadvantages are:
  1. Takes longer to build even with an iterative method
  2. High exposure/risk to failure
  3. Needs high level of cross-functional skills
  4. High outlay without proof of concept
Bottom-Up Approach

The advantages of this approach are:
  1. Faster and easier implementation of manageable pieces
  2. Favorable return on investment and proof of concept
  3. Less risk of failure
  4. Inherently incremental; can schedule important data marts first
  5. Allows project team to learn and grow
The disadvantages are:
  1. Each data mart has its own narrow view of data
  2. Permeates redundant data in every data mart
  3. Perpetuates inconsistent and irreconcilable data
  4. Proliferates unmanageable interfaces

Data Warehouses and Data Marts

Data Warehouse:
  • Holds multiple subject areas
  • Holds very detailed information
  • Works to integrate all data sources
  • Does not necessarily use a dimensional model but feeds dimensional models.
Data Mart
  • Often holds only one subject area - for example, Finance or Sales
  • May hold more summarised data (although many hold full detail)
  • Concentrates on integrating information from a given subject area or set of source systems
  • Is built focused on a dimensional model using a star schema.



Summary



  • Data marts and data warehouses are tools that help management come up with relevant information about the organization at any point in time
  • While a data mart is limited to use by a single department, a data warehouse applies to the entire organization
  • Data marts are easy to design and use, while a data warehouse is complex and difficult to manage
  • A data warehouse is more useful as it can provide information from any department.

Data Warehousing Fundamentals

Data warehouse is an information delivery system.

In this system, you integrate and transform enterprise data into information suitable for strategic decision making. You take all the historic data from the various operational systems,combine this internal data with any relevant data from outside sources, and pull them together. You resolve any conflicts in the  way data resides in different systems and transform the integrated data content into a format suitable for providing information to the various classes of users. Finally, you implement the information delivery methods.

Bill Inmon, considered to be the father of data warehousing, provides the following definition:

“A Data Warehouse is a subject oriented, integrated, nonvolatile, and time variant collection of data in support of management’s decisions.”

Sean Kelly, another leading data warehousing practitioner, defines the data warehouse in the following way. The data in the data warehouse is:
  • Separate
  • Available
  • Integrated
  • Time stamped
  • Subject oriented
  • Non-volatile
  • Accessible

Integrated Data

Here are some of the items that would need standardization:

  • Naming conventions
  • Codes
  • Data attributes
  • Measurements



Time-Variant Data

  • Allows for analysis of the past
  • Relates information to the present
  • Enables forecasts for the future

Friday, 6 June 2014

Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a data warehouse there is a need to track changes in dimension attributes in order to report historical data.

There are many approaches to dealing with SCDs. The most popular are:

Type 0 - The passive method
Type 1 - Overwriting the old value
Type 2 - Creating a new additional record
Type 3 - Adding a new column
Type 4 - Using historical table
Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)

Type 0 - The passive method. In this method no special action is performed upon dimensional changes. Some dimension data can remain the same as it was when first inserted; other data may be overwritten.

Type 1 - Overwriting the old value. In this method no history of dimension changes is kept in the database. The old dimension value is simply overwritten by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g., removing special characters or correcting spelling errors).

Before the change: 
Customer_ID  Customer_Name  Customer_Type
1            Cust_1         Corporate


After the change: 
Customer_ID  Customer_Name  Customer_Type
1            Cust_1         Retail

Type 2 - Creating a new additional record. In this methodology all history of dimension changes is kept in the database. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain the natural key (or other durable identifier) as an attribute. 'Effective date' and 'current indicator' columns are also used in this method. There can be only one record with the current indicator set to 'Y'. For the 'effective date' columns, i.e. start_date and end_date, the end_date of the current record is usually set to the value 9999-12-31. Introducing changes to the dimensional model in type 2 can be a very expensive database operation, so it is not recommended for dimensions where a new attribute could be added in the future. A small sketch of applying a Type 2 change in code follows the example tables below.

Before the change: 
Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date    Current_Flag
1            Cust_1         Corporate      22-07-2010  31-12-9999  Y


After the change: 
Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date    Current_Flag
1            Cust_1         Corporate      22-07-2010  17-05-2012  N
2            Cust_1         Retail         18-05-2012  31-12-9999  Y
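A minimal Python sketch of applying a Type 2 change in memory (in a real load this would be done by the mapping or SQL; for simplicity the old row's end date is set to the change date itself rather than the previous day as in the table above):

```python
dimension = [
    {"Customer_ID": 1, "Customer_Name": "Cust_1", "Customer_Type": "Corporate",
     "Start_Date": "22-07-2010", "End_Date": "31-12-9999", "Current_Flag": "Y"},
]

def apply_type2_change(rows, customer_name, new_type, change_date):
    """Expire the current row for the customer and append a new current row."""
    for row in rows:
        if row["Customer_Name"] == customer_name and row["Current_Flag"] == "Y":
            row["End_Date"] = change_date          # close the old version
            row["Current_Flag"] = "N"
    rows.append({
        "Customer_ID": max(r["Customer_ID"] for r in rows) + 1,   # new surrogate key
        "Customer_Name": customer_name,
        "Customer_Type": new_type,
        "Start_Date": change_date,
        "End_Date": "31-12-9999",
        "Current_Flag": "Y",
    })

apply_type2_change(dimension, "Cust_1", "Retail", "18-05-2012")
for row in dimension:
    print(row)
```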

Type 3 - Adding a new column. In this type usually only the current and previous values of the dimension attribute are kept in the database. The new value is loaded into the 'current/new' column and the old one into the 'old/previous' column. Generally speaking, the history is limited to the number of columns created for storing historical data. This is the least commonly needed technique.

Before the change: 
Customer_ID  Customer_Name  Current_Type  Previous_Type
1            Cust_1         Corporate     Corporate


After the change: 
Customer_ID  Customer_Name  Current_Type  Previous_Type
1            Cust_1         Retail        Corporate
Type 4 - Using a historical table. In this method a separate historical table is used to track all of the dimension's attribute changes for each dimension. The 'main' dimension table keeps only the current data, e.g. customer and customer_history tables.

Current table:
Customer_ID  Customer_Name  Customer_Type
1            Cust_1         Corporate


Historical table: 
Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date
1            Cust_1         Retail         01-01-2010  21-07-2010
1            Cust_1         Other          22-07-2010  17-05-2012
1            Cust_1         Corporate      18-05-2012  31-12-9999
Type 6 - Combine approaches of types 1, 2, 3 (1+2+3=6). In this type the dimension table has additional columns such as:
current_type - keeps the current value of the attribute. All history records for a given item have the same current value.
historical_type - keeps the historical value of the attribute. All history records for a given item can have different values.
start_date - keeps the start of the 'effective date' range for the attribute's history.
end_date - keeps the end of the 'effective date' range for the attribute's history.
current_flag - keeps information about the most recent record.

In this method, to capture an attribute change we add a new record as in type 2. The current_type information is overwritten with the new value as in type 1. We store the history in the historical_type column as in type 3.

Customer_ID  Customer_Name  Current_Type  Historical_Type  Start_Date  End_Date    Current_Flag
1            Cust_1         Corporate     Retail           01-01-2010  21-07-2010  N
2            Cust_1         Corporate     Other            22-07-2010  17-05-2012  N
3            Cust_1         Corporate     Corporate        18-05-2012  31-12-9999  Y

Thursday, 5 June 2014

Expression Transformation

Expression transformation is a passive and connected transformation. It can be used to calculate values in a single row before writing to the target, for example to calculate the discount on each product, to concatenate first and last names, or to convert a date to a string field.

Difference between Joiner and Union Transformation

Joiner:


  • Using a Joiner we can remove duplicate rows
  • A Joiner can perform Normal, Master Outer, Detail Outer, and Full Outer joins
  • In a Joiner we have one input group and one output group
  • A join is implemented by using the Joiner transformation in Informatica
  • The Joiner transformation combines data records horizontally based on a join condition


Union:


  • Union will not remove duplicate rows
  • Union is equivalent to UNION ALL in SQL
  • In Union we have multiple input groups and one output group
  • Union is implemented by using the Union transformation in Informatica
  • The Union transformation combines data records vertically from multiple sources
  • Union also supports heterogeneous (different) sources

Union Transformation


  • Union transformation is a connected and active transformation.
  • Union transformation is used to merge data from multiple pipelines into a single pipeline.
  • Merges similar data coming from heterogeneous and homogeneous sources.
  • Multiple input groups but only one output group (Opposite to Router)

Lookup Transformation

A Lookup is a Passive, Connected or Unconnected Transformation used to look up data in a relational table, view, synonym or flat file. The integration service queries the lookup table to retrieve a value based on the input source value and the lookup condition.

A connected lookup receives source data, performs a lookup and returns data to the pipeline,

while an unconnected lookup is not connected to a source or target and is called by a transformation in the pipeline through a :LKP expression, which in turn returns only one column value to the calling transformation.

A lookup can be cached or uncached. If we cache the lookup, we can further choose a static, dynamic or persistent cache, and a named or unnamed cache.

Source Qualifier Transformation

Source Qualifier transformation is an active and connected transformation. When adding a relational or a flat file source definition to a mapping, it is a must to connect it to a Source Qualifier transformation. The Source Qualifier performs various tasks such as overriding the default SQL query, filtering records, and joining data from two or more tables.

Joiner Transformation

Joiner transformation is an active and connected transformation. It can be used to join two sources coming from two different locations or from the same location, for example to join a flat file and a relational source, to join two flat files, or to join a relational source and an XML source.

In order to join two sources, there must be at least one matching port. While joining two sources it is a must to specify one source as master and the other as detail.

The joiner transformation supports the following types of joins:


  • Normal 
  • Master outer
  • Detail outer 
  • Full outer


Normal join discards all the rows of data from the master and detail source that do not match, based on the condition.

Master outer join discards all the unmatched rows from the master source and keeps all the rows from the detail source and the matching rows from the master source.

Detail outer join keeps all the rows of data from the master source and the matching rows from the detail source. It discards the unmatched rows from the detail source.

Full outer join keeps all rows of data from both the master and detail sources.
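For readers who think in terms of DataFrame joins, here is a rough mapping of the four joiner types onto pandas merges, with the detail source as the left table; the tables and keys are invented for the example:

```python
import pandas as pd

master = pd.DataFrame({"DEPT_ID": [10, 20], "DEPT_NAME": ["Sales", "HR"]})
detail = pd.DataFrame({"EMP_ID": [1, 2, 3], "DEPT_ID": [10, 10, 30]})

normal       = detail.merge(master, on="DEPT_ID", how="inner")  # only matching rows
master_outer = detail.merge(master, on="DEPT_ID", how="left")   # all detail rows + matching master rows
detail_outer = detail.merge(master, on="DEPT_ID", how="right")  # all master rows + matching detail rows
full_outer   = detail.merge(master, on="DEPT_ID", how="outer")  # everything from both sides

print(master_outer)   # the employee in DEPT_ID 30 survives with a missing DEPT_NAME
```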

Data Sharing

  1. Create Share: CREATE SHARE my_share;
  2. Grant privileges to share: GRANT USAGE ON DATABASE my_db TO SHARE my_share; GRANT USAGE ...