Tuesday, 12 May 2015

Transformations, Steps, and Hops

A transformation is a network of logical tasks called steps. Transformations are essentially data flows. In the example below, the database developer has created a transformation that reads a flat file, filters it, sorts it, and loads it to a relational database table. Suppose the database developer detects an error condition and instead of sending the data to a Dummy step, (which does nothing), the data is logged back to a table. The transformation is, in essence, a directed graph of a logical set of data transformation configurations. Transformation file names have a .ktr extension.

The two main components associated with transformations are steps and hops:

Steps are the building blocks of a transformation, for example a text file input or a table output. There are over 140 steps available in Pentaho Data Integration and they are grouped according to function; for example, input, output, scripting, and so on. Each step in a transformation is designed to perform a specific task, such as reading data from a flat file, filtering rows, and logging to a database as shown in the example above. Steps can be configured to perform the tasks you require.

Hops are data pathways that connect steps together and allow schema metadata to pass from one step to another. In the image above, it seems like there is a sequential execution occurring; however, that is not true. Hops determine the flow of data through the steps not necessarily the sequence in which they run. When you run a transformation, each step starts up in its own thread and pushes and passes data.

Note: All steps are started and run in parallel so the initialization sequence is not predictable. That is why you cannot, for example, set a variable in a first step and attempt to use that variable in a subsequent step.

You can connect steps together, edit steps, and open the step contextual menu by clicking

to edit a step. Click the down arrow to open the contextual menu. For information about connecting steps with hop, see More About Hops.

A step can have many connections — some join two steps together, some only serve as an input or output for a step. The data stream flows through steps to the various steps in a transformation. Hops are represented in Spoon as arrows. Hops allow data to be passed from step to step, and also determine the direction and flow of data through the steps. If a step sends outputs to more than one step, the data can either be copied to each step or distributed among them.

Spoon - Editor

Component Name	Name	Function
	Toolbar	Single-click access to common actions such as create a new file, opening existing documents, save and save as.
	Perspectives Toolbar	Switch between the different perspectives. Data Integration — Create ETL transformations and jobs Instaview — Use pre-made templates to create visualizations from PDI transformations Visualize — Test reporting and OLAP metadata models created in the Model perspective using the Report Design Wizard and Analyzer clients Model Editor — Design reporting and OLAP metadata models which can be tested right from within the Visualization perspective or published to the Pentaho BA Server Schedule — Manage scheduled ETL activities on the Data Integration Server
	Sub-toolbar	Provides buttons for quick access to common actions specific to the transformation or job such as Run, Preview, and Debug.
	Design and View Tabs	The Design tab of the Explore pane provides an organized list of transformation steps or job entries used to build transformations and jobs. Transformations are created by simply dragging transformation steps from the Designtab onto the canvas and connecting them with hops to describe the flow of data. The View tab of the Explore pane shows information for each job or transformation. This includes information such as available database connections and which steps and hops are used. In the image, the Design tab is selected.
	Canvas	Main design area for building transformations and jobs describing the ETL activities you want to perform


Icon	Description
	Create a new job or transformation
	Open transformation/job from file if you are not connected to a repository or from the repository if you are connected to one
	Explore the repository
	Save the transformation/job to a file or to the repository
	Save the transformation/job under a different name or file name (Save as)
	Run transformation/job; runs the current transformation from XML file or repository
	Pause transformation
	Stop transformation
	Preview transformation: runs the current transformation from memory. You can preview the rows that are produced by selected steps.
	Run the transformation in debug mode; allows you to troubleshoot execution errors
	Replay the processing of a transformation
	Verify transformation
	Run an impact analysis on the database
	Generate the SQL that is needed to run the loaded transformation.
	Launch the database explorer allowing you to preview data, run SQL queries, generate DDL and more
	Hide execution results pane
	Lock transformation

Pentaho Data Integration Architecture

Spoon is the design interface for building ETL jobs and transformations. Spoon provides a drag-and-drop interface that allows you to graphically describe what you want to take place in your transformations. Transformations can then be executed locally within Spoon, on a dedicated Data Integration Server, or a cluster of servers.

The Data Integration Server is a dedicated ETL server whose primary functions are:

Execution	Executes ETL jobs and transformations using the Pentaho Data Integration engine
Security	Allows you to manage users and roles (default security) or integrate security to your existing security provider such as LDAP or Active Directory
Content Management	Provides a centralized repository that allows you to manage your ETL jobs and transformations. This includes full revision history on content and features such as sharing and locking for collaborative development environments.
Scheduling	Provides the services allowing you to schedule and monitor activities on the Data Integration Server from within the Spoon design environment.

Pentaho Data Integration is composed of the following primary components:

Spoon. Introduced earlier, Spoon is a desktop application that uses a graphical interface and editor for transformations and jobs. Spoon provides a way for you to create complex ETL jobs without having to read or write code. When you think of Pentaho Data Integration as a product, Spoon is what comes to mind because, as a database developer, this is the application on which you will spend most of your time. Any time you author, edit, run or debug a transformation or job, you will be using Spoon.
Pan. A standalone command line process that can be used to execute transformations and jobs you created in Spoon. The data transformation engine Pan reads data from and writes data to various data sources. Pan also allows you to manipulate data.
Kitchen. A standalone command line process that can be used to execute jobs. The program that executes the jobs designed in the Spoon graphical interface, either in XML or in a database repository. Jobs are usually scheduled to run in batch mode at regular intervals.
Carte. Carte is a lightweight Web container that allows you to set up a dedicated, remote ETL server. This provides similar remote execution capabilities as the Data Integration Server, but does not provide scheduling, security integration, and a content management system.

What's with all the Culinary Terms?

If you are new to Pentaho, you may sometimes see or hear Pentaho Data Integration referred to as, "Kettle." To avoid confusion, all you must know is that Pentaho Data Integration began as an open source project called. "Kettle." The term, K.E.T.T.L.E is a recursive that stands for KettleExtraction Transformation Transport Load Environment. When Pentaho acquired Kettle, the name was changed to Pentaho Data Integration. Other PDI components such as Spoon, Pan, and Kitchen, have names that were originally meant to support a "restaurant" metaphor of ETL offerings.

Pentaho - Introduction

Introduction

Pentaho Data Integration (PDI) is a flexible tool that allows you to collect data from disparate sources such as databases, files, and applications, and turn the data into a unified format that is accessible and relevant to end users. PDI provides the Extraction, Transformation, and Loading (ETL) engine that facilitates the process of capturing the right data, cleansing the data, and storing the data using a uniform and consistent format.

PDI provides support for slowly changing dimensions, and surrogate key for data warehousing, allows data migration between databases and application, is flexible enough to load giant datasets, and can take full advantage of cloud, clustered, and massively parallel processing environments. You can cleanse your data using transformation steps that range from very simple to very complex.

Friday, 8 May 2015

Pentaho Data Integration (Kettle)

Migrating data between applications or databases
Exporting data from databases to flat files
Loading data massively into databases
Data cleansing
Integrating applications

PDI is easy to use. Every process is created with a graphical tool where you specify what to do without writing code to indicate how to do it; because of this, you could say that PDI is metadata oriented.

PDI can be used as a standalone application, or it can be used as part of the larger Pentaho Suite. As an ETL tool, it is the most popular open source tool available. PDI supports a vast array of input and output formats, including text files, data sheets, and commercial and free database engines. Moreover, the transformation capabilities of PDI allow you to manipulate data with very few limitations.

Difference Between Algorithm and Pseudocode

Algorithm vs Pseudocode

An algorithm is simply a solution to a problem. An algorithm presents the solution to a problem as a well defined set of steps or instructions. Pseudo-code is a general way of describing an algorithm. Pseudo-code does not use the syntax of a specific programming language, therefore cannot be executed on a computer. But it closely resembles the structure of a programming language and contains roughly the same level of detail.

Algorithm

An algorithm gives a solution to a particular problem as a well defined set of steps. A recipe in a cookbook is a good example of an algorithm. When a computer is used for solving a particular problem, the steps to the solution should be communicated to the computer. This makes the study of algorithms a very important part in computer science. An algorithm is executed in a computer by combining lot of elementary operations such as additions and subtractions to perform more complex mathematical operations. But translating the idea of the algorithm in to computer code is not straight forward. Specially, converting an algorithm in to a low level language such as assembly language could be very tedious than using a high level language such as C or Java. When designing an algorithm, it is important to do an analysis on the resources (such as time and storage) required by the algorithm. Notations such as big O notation are used for performing time and storage analysis on algorithms. Algorithms can be expressed using natural languages, pseudocode, flowcharts, etc.

Pseudocode

Pseudocode is one of the methods that could be used to represent an algorithm. It is not written in a specific syntax that is used by a programming language and therefore cannot be executed in a computer. There are lots of formats used for writing pseudocodes and most of them borrow some of the structures from popular programming languages such as C, Lisp, FORTRAN, etc. Also, natural language is used when presenting details that are not important. Most of the algorithms are presented using pseudocode since they can be read and understood using programmers who are familiar with different programming languages. Some languages such as Pascal have syntax that is very similar to pseudocode making the transformation from pseudocode to the corresponding program code easier. Pseudocode allows to include control structures such as WHILE, IF-THEN-ELSE, REPEAT-UNTIL, FOR, and CASE, which are present in many high level languages.

What is the difference between Algorithm and Pseudocode?

An algorithm is a well defined sequence of steps that provides a solution for a given problem, while a pseudocode is one of the methods that can be used to represent an algorithm. While algorithms can be written in natural language, pseudocode is written in a format that is closely related to high level programming language structures. But pseudocode does not use specific programming language syntax and therefore could be understood by programmers who are familiar with different programming languages. Additionally, transforming an algorithm presented in pseudocode to programming code could be much easier than converting an algorithm written in natural language.

Subramanian G