Friday, 28 March 2014

Informatica Software Architecture

Informatica's ETL product, known as Informatica PowerCenter, consists of three main components.

1. Informatica PowerCenter Client Tools: 
These are the development tools installed at the developer end. These tools enable a developer to:

  • Define the transformation process, known as a mapping (Designer)
  • Define run-time properties for a mapping, known as sessions (Workflow Manager)
  • Monitor the execution of sessions (Workflow Monitor)
  • Manage the repository, useful for administrators (Repository Manager)
  • Report on metadata (Metadata Reporter)

2. Informatica PowerCenter Repository:
The repository is the heart of the Informatica tools. The repository is a data inventory where all the data related to mappings, sources, targets, etc., is kept. This is the place where all the metadata for your application is stored. All the client tools and the Informatica Server fetch data from the repository. An Informatica client and server without a repository is like a PC without memory or a hard disk: it has the ability to process data but no data to process. The repository can be treated as the backend of Informatica.

3. Informatica PowerCenter Server: 
The server is where all execution takes place. The server makes physical connections to sources and targets, fetches data, applies the transformations defined in the mapping, and loads the data into the target system.

Informatica PowerCenter Components

Informatica PowerCenter is not just a tool but an end-to-end data processing and data integration environment. It enables organizations to collect, centrally process, and redistribute data. It can be used simply to integrate two different systems, such as SAP and MQ Series, or to load data warehouses or Operational Data Stores (ODS).

Informatica PowerCenter now also includes many add-on tools to report on the data being processed, the business rules applied, and the quality of data before and after processing.
To facilitate this, PowerCenter is divided into different components:
  • PowerCenter Domain: As Informatica puts it, “The PowerCenter domain is the primary unit for management and administration within PowerCenter”. Doesn’t make much sense? Right… so here is a simpler version. The PowerCenter domain is the collection of all the servers required to support PowerCenter functionality. Each domain has a gateway host (called the domain server). Whenever you want to use PowerCenter services, you send a request to the domain server; based on the request type, it redirects your request to one of the PowerCenter services.
  • PowerCenter Repository: The repository is nothing but a relational database which stores all the metadata created in PowerCenter. Whenever you develop a mapping, session, or workflow, execute them, or do anything meaningful (literally), entries are made in the repository.
  • Integration Service: The Integration Service does all the real work. It extracts data from sources, processes it as per the business logic, and loads data to targets.
  • Repository Service: The Repository Service is the one that understands the content of the repository; it fetches data from the repository and sends it back to the requesting components (mostly the client tools and the Integration Service).
  • PowerCenter Client Tools: The PowerCenter Client consists of multiple tools. They are used to manage users, define sources and targets, build mappings and mapplets with the transformation logic, and create workflows to run the mapping logic. The PowerCenter Client connects to the repository through the Repository Service to fetch details, and it connects to the Integration Service to start workflows. So essentially the client tools are used to code and give instructions to the PowerCenter servers.
  • PowerCenter Administration Console: This is simply a web-based administration tool you can use to administer the PowerCenter installation.
There are some more not-so-essential-to-know components discussed below:
  • Web Services Hub: Web Services Hub exposes PowerCenter functionality to external clients through web services.
  • SAP BW Service: The SAP BW Service extracts data from and loads data to SAP BW.
  • Data Analyzer: Data Analyzer is like a reporting layer to perform analytics on data warehouse or ODS data.
  • Metadata Manager: Metadata Manager is a metadata management tool that you can use to browse and analyze metadata from disparate metadata repositories. It shows how the data is acquired, what business rules are applied and where data is populated in readable reports.
  • PowerCenter Repository Reports: PowerCenter Repository Reports are a set of prepackaged Data Analyzer reports and dashboards to help you analyze and manage PowerCenter metadata.

Thursday, 27 March 2014

Why do we need ETL Tools?

Business Case: Consider a company with over 100 years of history and a presence in almost every industry. Over these years the company’s management style has changed from bookkeeping to SAP. This was not a single-day transition. Along the way, from bookkeeping to SAP, the company used a wide array of technologies: hardware ranging from mainframes to PCs, data storage ranging from flat files to relational databases, and programming languages ranging from COBOL to Java. This transformation resulted in different businesses, or to be precise different sub-businesses within a business, running different applications on different hardware and different architectures. Technologies were introduced as and when they were invented, and as and when they were required.

This directly resulted in a scenario where the HR department of the company runs on Oracle Applications, Finance runs SAP, part of the process chain is supported by mainframes, some data is stored in Oracle, some on mainframes, some in VSAM files, and the list goes on. If one day the company requires a consolidated report of assets, there are two ways.

The first is completely manual: generate different reports from the different systems and integrate them by hand.
The second: fetch all the data from the different systems and applications, build a Data Warehouse, and generate reports as per the requirement.
Obviously the second approach is going to be the best.

Now, fetching the data from different systems, making it coherent, and loading it into a Data Warehouse requires some kind of extraction, cleansing, integration, and load. ETL stands for Extraction, Transformation & Load.

ETL tools provide the facility to extract data from different non-coherent systems, cleanse it, merge it, and load it into target systems.
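
As a minimal sketch of what an ETL step boils down to (the stg_sales, stg_fx_rates, and dw_sales tables here are hypothetical), the extract-cleanse-load idea can be expressed in plain SQL:

-- Extract from staging, transform, and load into the warehouse table.
-- Real ETL tools add scheduling, error handling, and lineage on top of this.
INSERT INTO dw_sales (sale_date, product_code, amount_usd)
SELECT CAST(s.sale_ts AS DATE),        -- transform: timestamp to date
       UPPER(TRIM(s.product_code)),    -- transform: cleanse product codes
       s.amount * fx.rate_to_usd       -- transform: currency conversion
FROM   stg_sales s
JOIN   stg_fx_rates fx ON fx.currency = s.currency;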

Tuesday, 25 March 2014

XQuery Select nodes and FLWOR expression

Look at the following path expression:

doc("books.xml")/bookstore/book[price>30]/title

The expression above selects all the title elements under the book elements that are under the bookstore element and that have a price element with a value higher than 30.

FLWOR expression

FLWOR is an acronym for "For, Let, Where, Order by, Return". Consider the following FLWOR expression (it is the same as the complete example further below):

for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title


  • The for clause selects all book elements under the bookstore element into a variable called $x.
  • The where clause selects only book elements with a price element with a value greater than 30.
  • The order by clause defines the sort order; here the result is sorted by the title element.
  • The return clause specifies what should be returned. Here it returns the title elements.
Example FOR clause

The for clause binds a variable to each item returned by the in expression. The for clause results in iteration. There can be multiple for clauses in the same FLWOR expression.

To loop a specific number of times in a for clause, you may use the to keyword:

for $x in (1 to 5)
return <test>{$x}</test>

Result:

<test>1</test>
<test>2</test>
<test>3</test>
<test>4</test>
<test>5</test>

The at keyword can be used to count the iteration:

for $x at $i in doc("books.xml")/bookstore/book/title
return <book>{$i}. {data($x)}</book>

Result:

<book>1. Everyday Italian</book>
<book>2. Harry Potter</book>
<book>3. XQuery Kick Start</book>
<book>4. Learning XML</book>

More than one in expression is also allowed in the for clause. Use a comma to separate each in expression:

for $x in (10,20), $y in (100,200)
return <test>x={$x} and y={$y}</test>

Result:

<test>x=10 and y=100</test>
<test>x=10 and y=200</test>
<test>x=20 and y=100</test>
<test>x=20 and y=200</test>

Example LET Clause

The let clause allows variable assignments and it avoids repeating the same expression many times. The let clause does not result in iteration.

let $x := (1 to 5)
return <test>{$x}</test>

Result:

<test>1 2 3 4 5</test>

Example WHERE Clause

The where clause is used to specify one or more criteria for the result:

where $x/price>30 and $x/price<100

Example ORDER BY Clause

The order by clause is used to specify the sort order of the result. Here we want to order the result by category and title:

for $x in doc("books.xml")/bookstore/book
order by $x/@category, $x/title
return $x/title

Result:

<title lang="en">Harry Potter</title>
<title lang="en">Everyday Italian</title>
<title lang="en">Learning XML</title>
<title lang="en">XQuery Kick Start</title>

Example RETURN Clause

The return clause specifies what is to be returned:

for $x in doc("books.xml")/bookstore/book
return $x/title

Result:

<title lang="en">Everyday Italian</title>
<title lang="en">Harry Potter</title>
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>

Example FLWOR

for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title

The clauses of a FLWOR expression:
  • for - (optional) binds a variable to each item returned by the in expression
  • let - (optional) allows variable assignments
  • where - (optional) specifies one or more criteria
  • order by - (optional) specifies the sort order of the result
  • return - specifies what to return in the result

XQuery Conditional Expressions and Comparisons

"If-Then-Else" expressions,

example:

for $x in doc("books.xml")/bookstore/book
return if ($x/@category="CHILDREN")
then <child>{data($x/title)}</child>
else <adult>{data($x/title)}</adult>

Comparisons

In XQuery there are two ways of comparing values.

1. General comparisons: =, !=, <, <=, >, >=

2. Value comparisons: eq, ne, lt, le, gt, ge

The difference between the two comparison methods is shown below.

The following expression returns true if any q attributes have a value greater than 10:

$bookstore//book/@q > 10

The following expression returns true if there is only one q attribute returned by the expression, and its value is greater than 10. If more than one q is returned, an error occurs:


$bookstore//book/@q gt 10


XQuery

  • XQuery is to XML what SQL is to database tables.
  • XQuery is designed to query XML data - not just XML files, but anything that can appear as XML, including databases.
  • XQuery is built on XPath expressions
  • XQuery is supported by all major databases
  • XQuery is a W3C Recommendation
  • XQuery is compatible with several W3C standards, such as XML, Namespaces, XSLT, XPath, and XML Schema.
What You Should Already Know

Before you continue you should have a basic understanding of the following:

  • HTML / XHTML
  • XML / XML Namespaces
  • XPath
Basic Syntax Rules

  • XQuery is case-sensitive
  • XQuery elements, attributes, and variables must be valid XML names
  • An XQuery string value can be in single or double quotes
  • An XQuery variable is defined with a $ followed by a name, e.g. $bookstore
  • XQuery comments are delimited by (: and :), e.g. (: XQuery Comment :)


Monday, 3 March 2014

DATA WAREHOUSE DIMENSIONAL MODELLING (TYPES OF SCHEMAS)

There are four types of schemas available in data warehouse design. Of these, the star schema is the most widely used, followed by the snowflake schema. We will look at each of these schemas in detail.

Star Schema:

A star schema is one in which a central fact table is surrounded by denormalized dimension tables. A star schema can be simple or complex. A simple star schema consists of one fact table, whereas a complex star schema has more than one fact table.
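
A minimal sketch of a star schema in SQL (all table and column names are illustrative, and a dim_date table is assumed to exist):

-- Denormalized dimension: category and brand are kept inline
CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50),
    brand        VARCHAR(50)
);

-- Central fact table referencing the surrounding dimensions
CREATE TABLE fact_sales (
    date_key     INT,                                      -- points to dim_date
    product_key  INT REFERENCES dim_product (product_key),
    sales_amount DECIMAL(12,2),
    quantity     INT
);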


Snowflake Schema:

A snowflake schema is an extension of the star schema in which some dimension tables are normalized into additional related tables. Snowflake schemas are useful when there are low-cardinality attributes in the dimensions.
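
Continuing the illustrative schema above, snowflaking moves the low-cardinality category attribute out of the product dimension into its own table:

-- Category normalized out of the product dimension
CREATE TABLE dim_category (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE dim_product_snowflaked (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INT REFERENCES dim_category (category_key)
);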


Galaxy Schema:

A galaxy schema contains many fact tables with some common dimensions (conformed dimensions). This schema is a combination of many data marts.



Fact Constellation Schema:

The dimensions in this schema are segregated into independent dimensions based on the levels of hierarchy. For example, if geography has five levels of hierarchy, such as territory, region, country, state, and city, a constellation schema would have five dimensions instead of one.

TYPES OF FACTS IN DATA WAREHOUSE

A fact table is the one which consists of the measurements, metrics, or facts of a business process. These measurable facts are used to assess business value and to forecast future business. The different types of facts are explained in detail below.

Additive:

Additive facts are facts that can be summed up across all of the dimensions in the fact table. A sales amount fact is a good example of an additive fact.

Semi-Additive: 

Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not for the others.
Eg: A daily balances fact can be summed up across the customer dimension but not across the time dimension.
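
A short SQL sketch (hypothetical fact_daily_balance table): summing balances across customers for a single day is meaningful, while summing the same column across days is not; a common alternative for the time dimension is the period-end balance.

-- Valid: total balance across all customers on one day
SELECT balance_date, SUM(balance) AS total_balance
FROM   fact_daily_balance
WHERE  balance_date = DATE '2014-03-27'
GROUP BY balance_date;

-- Summing balances across dates would double-count money, so take
-- the balance on the last day of the period instead:
SELECT customer_key, balance
FROM   fact_daily_balance
WHERE  balance_date = DATE '2014-03-31';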

Non-Additive: 

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts that contain percentages or ratios.

Factless Fact Table:

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "Factless Fact tables".

Eg: A fact table which has only a product key and a date key is a factless fact table. There are no measures in this table, but you can still count the number of products sold over a period of time.
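
A small sketch (hypothetical factless_sales and dim_date tables): with no measure columns at all, the row count itself is the fact.

-- Each row records only that a product was sold on a date
SELECT d.calendar_month,
       COUNT(*) AS products_sold
FROM   factless_sales f
JOIN   dim_date d ON d.date_key = f.date_key
GROUP BY d.calendar_month;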

Fact tables that contain aggregated facts are often called summary tables.

TYPES OF DIMENSIONS IN DATA WAREHOUSE

A dimension table consists of attributes that describe the facts. Dimensions store the textual descriptions of the business. Without the dimensions, we cannot measure the facts. The different types of dimension tables are explained in detail below.

Conformed Dimension:

Conformed dimensions mean the exact same thing with every possible fact table to which they are joined. 

Eg: The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.

Junk Dimension:

A junk dimension is a collection of random transactional codes, flags, and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store these junk attributes.

Eg: Assume that we have a gender dimension and a marital status dimension. In the fact table, we would need to maintain two keys referring to these dimensions. Instead, create a junk dimension which has all the combinations of gender and marital status (cross join the gender and marital status tables to create the junk table). Now we can maintain only one key in the fact table.
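
A sketch of building such a junk dimension in SQL (table names are illustrative; CREATE TABLE AS syntax varies slightly by database):

-- One row per combination of the low-cardinality attributes
CREATE TABLE dim_junk AS
SELECT ROW_NUMBER() OVER (ORDER BY g.gender, m.marital_status) AS junk_key,
       g.gender,
       m.marital_status
FROM   dim_gender g
CROSS JOIN dim_marital_status m;

-- The fact table then carries a single junk_key instead of two keys.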

Degenerate Dimension:

A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.

Eg: A transactional code in a fact table.

Role-playing dimension:

Dimensions which are often used for multiple purposes within the same database are called role-playing dimensions. For example, a date dimension can be used for "date of sale", as well as "date of delivery" or "date of hire".
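
In SQL, a role-playing dimension is usually just one physical table joined under different aliases (a sketch with hypothetical fact_orders and dim_date tables):

SELECT sd.calendar_date AS sale_date,
       dd.calendar_date AS delivery_date,
       f.order_id
FROM   fact_orders f
JOIN   dim_date sd ON sd.date_key = f.sale_date_key       -- role: date of sale
JOIN   dim_date dd ON dd.date_key = f.delivery_date_key;  -- role: date of delivery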

DATA WAREHOUSE DESIGN APPROACHES

Data warehouse design is one of the key techniques in building a data warehouse. Choosing the right design can save project time and cost. Basically, two data warehouse design approaches are popular.

Bottom-Up Design:

In the bottom-up design approach, the data marts are created first to provide reporting capability. A data mart addresses a single business area, such as sales or finance. These data marts are then integrated to build a complete data warehouse. The integration of data marts is implemented using the data warehouse bus architecture. In the bus architecture, a dimension is shared between facts in two or more data marts; such dimensions are called conformed dimensions. The conformed dimensions are integrated across the data marts, and the data warehouse is then built.

Advantages of bottom-up design are:
  • This model contains consistent data marts and these data marts can be delivered quickly.
  • As the data marts are created first, reports can be generated quickly.
  • The data warehouse can be extended easily to accommodate new business units. It is just creating new data marts and then integrating with other data marts.

Disadvantages of bottom-up design are:
  • The positions of the data warehouse and the data marts are reversed in the bottom-up design approach.

Top-Down Design:

In the top-down design approach, the data warehouse is built first. The data marts are then created from the data warehouse.

Advantages of top-down design are:
  • Provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse.
  • This approach is robust against business changes. Creating a new data mart from the data warehouse is very easy.

Disadvantages of top-down design are:
  • This methodology is inflexible to changing departmental needs during the implementation phase.
  • It represents a very large project and the cost of implementing the project is significant.

LOGICAL AND PHYSICAL DESIGN OF DATA WAREHOUSE

Logical design:

Logical design deals with the logical relationships between objects. The entity-relationship (ER) modeling technique can be used for the logical design of a data warehouse. ER modeling involves identifying the entities (important objects), their attributes (properties of the objects), and the relationships among them.

An entity is a chunk of information; it maps to a table in the database. An attribute is a part of an entity; it maps to a column in the database.

A unique identifier distinguishes one instance of an entity from another and helps keep the data consistent.

Physical design:

Physical design deals with the most effective way of storing and retrieving the data. In the physical design, the logical design is converted into a description of the physical database structures.

Physical design involves creation of the database objects like tables, columns, indexes, primary keys, foreign keys, views, sequences etc. 

EXTRACTION METHODS IN DATA WAREHOUSE

The extraction method used in a data warehouse depends on the source system, performance, and business requirements. There are two types of extraction: logical and physical. We will look at the logical and physical extraction methods in detail.

Logical extraction

There are two types of logical extraction methods:

Full Extraction: Full extraction is used when the data needs to be extracted and loaded for the first time.  In full extraction, the data from the source is extracted completely. This extraction reflects the current data available in the source system.

Incremental Extraction: In incremental extraction, the changes in the source data are tracked since the last successful extraction, and only these changes are extracted and loaded. Changes can be detected from source data that carries a last-changed timestamp. Alternatively, a change table can be created in the source system to keep track of changes in the source data.

One more method of getting the incremental changes is to extract the complete source data and then take the difference (a minus operation) between the current extraction and the last extraction. This approach can cause performance issues.
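
Both approaches can be sketched in SQL (hypothetical src_orders table and extract tables; the literal timestamp stands in for the last successful run time):

-- Timestamp-based incremental extraction: rows changed since the last run
SELECT *
FROM   src_orders
WHERE  last_modified_ts > TIMESTAMP '2014-03-27 00:00:00';

-- Difference-based capture: compare the current full extract with the previous one
SELECT * FROM extract_current
EXCEPT                         -- called MINUS in some databases, e.g. Oracle
SELECT * FROM extract_previous;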

Physical extraction

The data can be extracted physically by two methods:

Online Extraction:  In online extraction the data is extracted directly from the source system. The extraction process connects to the source system and extracts the source data.

Offline Extraction: The data from the source system is dumped outside of the source system into a flat file. This flat file is used to extract the data. The flat file can be created by a routine process daily. 

DATA MINING

Data mining is the process of finding patterns in large data sets and analyzing data from different perspectives. It allows business users to analyze data from different angles and summarize the relationships identified. Data mining can be useful in increasing revenue and cutting costs.

Example:

In a supermarket, the people who buy a toothbrush on Sundays also buy toothpaste. This information can be used to increase revenue by providing an offer on the toothbrush and toothpaste together, thereby selling more of both products on Sundays.

Data mining process:

Data mining analyzes relationships and patterns in the stored data based on user queries. Data mining involves tasks such as the following:
  • Association: finding relationships between variables. For example, in a retail store we can determine which products are frequently bought together and use this information to market those products (see the SQL sketch after this list).
  • Clustering: identifying logical relationships among data items and grouping them. For example, in a retail store, toothpaste and toothbrushes can be logically grouped.
  • Classification: applying a known pattern to new data.
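
As a minimal sketch of the association task in SQL (hypothetical sales_items table with one row per transaction and product), this counts how often two products appear in the same basket:

-- Product pairs bought together, most frequent first
SELECT a.product_id AS product_a,
       b.product_id AS product_b,
       COUNT(*)     AS times_together
FROM   sales_items a
JOIN   sales_items b
       ON  a.transaction_id = b.transaction_id
       AND a.product_id < b.product_id     -- avoid self-pairs and duplicates
GROUP BY a.product_id, b.product_id
ORDER BY times_together DESC;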

DATA MART

A data mart is a subset of the data warehouse that concentrates on a specific business unit. A data mart may or may not be derived from a data warehouse, and it is aimed at meeting an immediate requirement.

Data marts may or may not depend on other data marts in an organization. If the data marts have conformed dimensions and facts, then those data marts will be related to each other.

Benefits of a data mart:
  • Frequently needed data can be accessed very easily.
  • Performance improvement.
  • Data marts can be created easily.
  • Lower cost to implement a data mart than a full data warehouse.

DATA WAREHOUSE



A data warehouse is a relational database that is designed for query and business analysis rather than for transaction processing. It contains historical data derived from transaction data. This historical data is used by business analysts to understand the business in detail.

A data warehouse should have the following characteristics:

Subject oriented: A data warehouse helps in analyzing data on a particular subject. For example, to know about a company's sales, a data warehouse needs to be built on sales data. Using this data warehouse, we can find, for instance, last year's sales. This ability to define a data warehouse by subject (sales) makes it subject oriented.

Integrated: Bringing data from different sources and putting it into a consistent format. This includes resolving units of measure, naming conflicts, etc.

Non volatile: Once data enters the data warehouse, it should not be updated.

Time variant: To analyze the business, analysts need large amounts of data. So, the data warehouse should contain historical data. 
