Data Warehousing

Data Warehousing – Overview

The term "Data Warehouse" was very initial coined simply by Bill Inmon in 1990. According to Inmon, a data battleehouse is a subject oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to conpartr informed decisions in an body organization.

An operational database undergoes regular alters on a daily basis on account of the transactions thead wear conpartr place. Suppose a business executive wants to analyze previous feedback on any kind of kind of data such as a item, a supplier, or any kind of kind of consumer data, then the executive will have no data available to analyze because the previous data has been updated because of to transactions.

A data battleehouses provides us generalized and constrongated data in multidimensional see. Alengthy with generalized and constrongated see of data, a data battleehouses also provides us Online Analytical Processing (OLAP) tools. These tools help us in interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, courseification, preddish coloured-coloureddish colourediction can be integrated with OLAP operations to enhance the interactive mining of knowintroducadvantage at multiple level of abstraction. Thead wear's why data battleehouse has now become an iminterfaceant platform for data analysis and online analytical processing.

Understanding a Data Warehouse

  • A data warehouse is a database, which is kept separate from the organization's operational database.

  • There is no frequent updating done in a data warehouse.

  • It possesses consolidated historical data, which helps the organization to analyze its business.

  • A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.

  • Data warehouse systems help in the integration of a diversity of application systems.

  • A data warehouse system helps in consolidated historical data analysis.

Why a Data Warehouse is Separated from Operational Databases

A data warehouse is kept separate from operational databases due to the following reasons:

  • An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.

  • Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

  • An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.

  • An operational database maintains current data. On the other hand, a data warehouse maintains historical data.

Data Warehouse Features

The key features of a data warehouse are discussed below:

  • Subject Oriented – A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations; rather it focuses on modelling and analysis of data for decision making.

  • Integrated – A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

  • Time Variant – The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.

  • Non-volatile – Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Note: A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications

As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as a sole part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

  • Financial services
  • Banking services
  • Consumer goods
  • Retail sectors
  • Controlled manufacturing

Types of Data Warehouse

Information processing, analytical processing, and data mining are the three kinds of data warehouse applications that are discussed below:

  • Information Processing – A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

  • Analytical Processing – A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.

  • Data Mining – Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.

The following table highlights the differences between a data warehouse (OLAP) and an operational database (OLTP):

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | These are highly flexible. | It provides high performance.

Data Warehousing – Concepts

What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Using Data Warehouse Information

There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives to use the warehouse quickly and effectively. They can gather data, analyze it, and take decisions based on the information present in the warehouse. The information gathered in a warehouse can be used in any of the following domains:

  • Tuning Production Strategies – The product strategies can be well tuned by repositioning the products and managing the product portfolios by comparing the sales quarterly or yearly.

  • Customer Analysis – Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.

  • Operations Analysis – Data warehousing also helps in customer relationship management and making environment corrections. The information also allows us to analyze business operations.

Integrating Heterogeneous Databases

To integrate heterogeneous databases, we have two approaches:

  • Query-driven Approach
  • Update-driven Approach

Query-Driven Approach

This is the traditional approach to integrate heterogeneous databases. This approach was used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of Query-Driven Approach

  • When a query is issued to a client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites involved.

  • Now these queries are mapped and sent to the local query processor.

  • The results from heterogeneous sites are integrated into a global answer set.

Disadvantages

  • The query-driven approach needs complex integration and filtering processes.

  • This approach is very inefficient.

  • It is very expensive for frequent queries.

  • This approach is also very expensive for queries that require aggregations.

Update-Driven Approach

This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages:

  • This approach provides high performance.

  • The data is copied, processed, integrated, annotated, summarized, and restructured in the semantic data store in advance.

  • Query processing does not require an interface to process data at local sources.

Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities:

  • Data Extraction – Involves gathering data from multiple heterogeneous sources.

  • Data Cleaning – Involves finding and correcting the errors in data.

  • Data Transformation – Involves converting the data from legacy format to warehouse format.

  • Data Loading – Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

  • Refreshing – Involves updating from data sources to the warehouse.

Note: Data cleaning and data transformation are important steps in improving the quality of data and of data mining results.

Data Warehousing – Terminologies

In this chapter, we will discuss some of the most commonly used terms in data warehousing.

Metadata

Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents in the book. In other words, we can say that metadata is the summarized data that leads us to detailed data.

In terms of a data warehouse, we can define metadata as follows:

  • Metadata is the road-map to a data warehouse.

  • Metadata in a data warehouse defines the warehouse objects.

  • Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.

Metadata Repository

The metadata repository is an integral part of a data warehouse system. It contains the following metadata:

  • Business metadata – It contains the data ownership information, business definitions, and changing policies.

  • Operational metadata – It includes the currency of data and data lineage. Currency of data refers to the data being active, archived, or purged. Lineage of data means the history of the data migrated and the transformations applied on it.

  • Data for mapping from the operational environment to the data warehouse – This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.

  • The algorithms for summarization – It includes dimension algorithms, data on granularity, aggregation, summarizing, etc. A small illustrative sketch of such a repository entry follows this list.
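
The following is a minimal sketch, purely for illustration, of how such repository entries might be represented in Python; the field names and values are assumptions and not part of any standard metadata API.

import pandas as pd  # not required here, only the standard library is used

# A minimal, dictionary-based sketch of metadata repository entries
# (all field names are illustrative assumptions).
metadata_repository = {
    "sales_fact": {
        # business metadata: ownership and definitions
        "owner": "Sales Operations",
        "business_definition": "One row per product sold per branch per day",
        # operational metadata: currency and lineage
        "currency": "active",  # active | archived | purged
        "lineage": ["extracted from EPOS system", "cleaned", "aggregated daily"],
        # mapping metadata: source and refresh rules
        "source_database": "epos_oltp",
        "refresh_rule": "load incrementally every night at 02:00",
        # summarization metadata
        "granularity": "day",
        "aggregations": ["monthly_sales_by_branch"],
    }
}

def describe(table_name):
    """Print the stored metadata for a warehouse object."""
    for key, value in metadata_repository[table_name].items():
        print(f"{key}: {value}")

describe("sales_fact")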

Data Cube

A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves the records.

Illustration of Data Cube

Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch, and location. These dimensions allow it to keep track of monthly sales and of the branch at which the items were sold. There is a table associated with each dimension. This table is known as a dimension table. For example, the "item" dimension table may have attributes such as item_name, item_type, and item_brand.

The following table represents the 2-D view of sales data for a company with respect to the time, item, and location dimensions.

[Figure: 2-D view of sales data]

But here in this 2-D table, we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions according to the type of items sold. If we want to view the sales data with one more dimension, say the location dimension, then the 3-D view would be helpful. The 3-D view of the sales data with respect to time, item, and location is shown in the table below:

[Figure: 3-D view of sales data]

The above 3-D table can be represented as a 3-D data cube as shown in the following figure:

[Figure: 3-D data cube]
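
To make the idea concrete, here is a small illustrative Python sketch (using pandas) that builds the kind of multidimensional view described above; the sales rows, column names, and figures are made-up assumptions, not data from the text.

import pandas as pd

# Toy sales records: one row per (time, item, location) combination (made-up figures).
sales = pd.DataFrame({
    "quarter":    ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":       ["Mobile", "Modem", "Mobile", "Modem", "Mobile", "Modem"],
    "location":   ["New Delhi", "New Delhi", "New Delhi", "New Delhi", "Mumbai", "Mumbai"],
    "units_sold": [605, 825, 680, 952, 310, 440],
})

# 2-D view: time vs. item for a single location (New Delhi).
view_2d = sales[sales["location"] == "New Delhi"].pivot_table(
    index="quarter", columns="item", values="units_sold", aggfunc="sum")
print(view_2d)

# Adding the location dimension gives the 3-D view of the same data.
view_3d = sales.pivot_table(
    index=["location", "quarter"], columns="item",
    values="units_sold", aggfunc="sum")
print(view_3d)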

Data Mart

Data marts contain a subset of organization-wide data that is important to specific groups of people in an organization. In other words, a data mart contains only those data that are specific to a particular group. For example, the marketing data mart may contain only data related to products, customers, and sales. Data marts are confined to subjects.

Points to Remember About Data Marts

  • Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

  • The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.

  • The life cycle of data marts may be complex in the long run, if their planning and design are not organization-wide.

  • Data marts are small in size.

  • Data marts are customized by department.

  • The source of a data mart is a departmentally structured data warehouse.

  • Data marts are flexible.

The following figure shows a graphical representation of data marts.

[Figure: Data mart]

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.

Data Warehousing – Delivery Process

A data warehouse is never static; it evolves as the business expands. As the business evolves, its requirements keep changing and therefore a data warehouse must be designed to cope with these changes. Hence a data warehouse system needs to be flexible.

Ideally, there should be a delivery process to deliver a data warehouse. However, data warehouse projects normally suffer from various issues that make it difficult to complete tasks and deliverables in the strict and ordered fashion demanded by the waterfall method. Most of the time, the requirements are not understood completely. The architectures, designs, and build components can be completed only after gathering and studying all the requirements.

Delivery Method

The delivery method is a variant of the joint application development approach adopted for the delivery of a data warehouse. We have staged the data warehouse delivery process to minimize risks. The approach that we will discuss here does not reduce the overall delivery time-scales, but it ensures that the business benefits are delivered incrementally through the development process.

Note: The delivery process is broken into phases to reduce the project and delivery risk.

The following diagram explains the stages in the delivery process:

[Figure: Stages in the delivery process]

IT Strategy

Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is required to procure and retain funding for the project.

Business Case

The goal of the business case is to estimate the business benefits that should be derived from using a data warehouse. These benefits may not be quantifiable, but the projected benefits should be clearly stated. If a data warehouse does not have a clear business case, then the business tends to suffer from credibility issues at some stage during the delivery process. Therefore in data warehouse projects, we need to understand the business case for investment.

Education and Prototyping

Organizations experiment with the concept of data analysis and educate themselves on the value of having a data warehouse before settling for a solution. This is addressed by prototyping. It helps in understanding the feasibility and benefits of a data warehouse. Prototyping activity on a small scale can promote the educational process as long as:

  • The prototype addresses a defined technical objective.

  • The prototype can be thrown away after the feasibility concept has been shown.

  • The activity addresses a small subset of the eventual data content of the data warehouse.

  • The activity timescale is non-critical.

The following points are to be kept in mind to produce an early release and deliver business benefits:

  • Identify the architecture that is capable of evolving.

  • Focus on business requirements and technical blueprint phases.

  • Limit the scope of the first build phase to the minimum that delivers business benefits.

  • Understand the short-term and medium-term requirements of the data warehouse.

Business Requirements

To provide quality deliverables, we need to make sure the overall requirements are understood. If we understand the business requirements for both the short term and the medium term, then we can design a solution to fulfil the short-term requirements. The short-term solution can then be grown into a complete solution.

The following aspects are determined in this stage:

  • The business rules to be applied on the data.

  • The logical model for information within the data warehouse.

  • The query profiles for the immediate requirement.

  • The source systems that provide this data.

Technical Blueprint

This phase should deliver an overall architecture satisfying the long-term requirements. This phase also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint should identify the following:

  • The overall system architecture.
  • The data retention policy.
  • The backup and recovery strategy.
  • The server and data mart architecture.
  • The capacity plan for hardware and infrastructure.
  • The components of the database design.

Building the Version

In this stage, the first production deliverable is produced. This production deliverable is the smallest component of a data warehouse. This smallest component adds business benefit.

History Load

This is the phase where the remainder of the required history is loaded into the data warehouse. In this phase, we do not add new entities, but additional physical tables would probably be created to store the increased data volumes.

Let us take an example. Suppose the build version phase has delivered a retail sales analysis data warehouse with 2 months' worth of history. This information will allow the user to analyze only the recent trends and address the short-term issues. The user in this case cannot identify annual and seasonal trends. To help him do so, the last 2 years' sales history can be loaded from the archive. Now the 40 GB of data is extended to 400 GB.

Note: The backup and recovery procedures may become complex, therefore it is recommended to perform this activity within a separate phase.

Ad hoc Query

In this phase, we configure an ad hoc query tool that is used to operate the data warehouse. These tools can generate the database queries.

Note: It is recommended not to use these access tools when the database is being substantially modified.

Automation

In this phase, the operational management processes are fully automated. These would include:

  • Transforming the data into a form suitable for analysis.

  • Monitoring query profiles and determining appropriate aggregations to maintain system performance.

  • Extracting and loading data from various source systems.

  • Generating aggregations from predefined definitions within the data warehouse.

  • Backing up, restoring, and archiving the data.

Extending Scope

In this phase, the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways:

  • By loading additional data into the data warehouse.

  • By introducing new data marts using the existing information.

Note: This phase should be performed separately, since it involves substantial effort and complexity.

Requirements Evolution

From the perspective of the delivery process, the requirements are always changeable. They are not static. The delivery process must support this and allow these changes to be reflected within the system.

This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries.

The architecture is designed to change and grow to match the business needs. The process operates as a pseudo-application development process, where the new requirements are continually fed into the development activities and the partial deliverables are produced. These partial deliverables are fed back to the users and then reworked, ensuring that the overall system is continually updated to meet the business needs.

Data Warehousing – System Processes

We have a fixed number of operations to be applied on operational databases, and we have well-defined techniques such as using normalized data and keeping tables small. These techniques are suitable for delivering a solution. But in the case of decision-support systems, we do not know what query and operation will need to be executed in the future. Therefore techniques applied on operational databases are not suitable for data warehouses.

In this chapter, we will discuss how to build data warehousing solutions on top of open-system technologies like Unix and relational databases.

Process Flow in Data Warehouse

There are four major processes that contribute to a data warehouse:

  • Extract and load the data.
  • Clean and transform the data.
  • Backup and archive the data.
  • Manage queries and direct them to the appropriate data sources.

[Figure: Process flow in a data warehouse]

Extract and Load Process

Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse.

Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.

Controlling the Process

Controlling the process involves determining when to start data extraction and running the consistency checks on data. The controlling process ensures that the tools, the logic modules, and the programs are executed in the correct sequence and at the correct time.

When to Initiate Extract

Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions.

Loading the Data

After extracting the data, it is loaded into a temporary data store where it is cleaned up and made consistent.

Note: Consistency checks are executed only when all the data sources have been loaded into the temporary data store.

Clean and Transform Process

Once the data is extracted and loaded into the temporary data store, it is time to perform cleaning and transforming. Here is the list of steps involved in cleaning and transforming:

  • Clean and transform the loaded data into a structure
  • Partition the data
  • Aggregation

Clean and Transform the Loaded Data into a Structure

Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data consistent:

  • within itself.
  • with other data within the same data source.
  • with the data in other source systems.
  • with the existing data present in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data increases the query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support performance requirements and control the ongoing operational costs. A small sketch of such a cleaning and structuring step follows.
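
As an illustration only, the following Python sketch (using pandas) shows the kind of cleaning and restructuring step described above; the column names, formats, and rules are assumptions made up for this example.

import pandas as pd

# Raw rows as they might arrive from a source system (illustrative only).
raw = pd.DataFrame({
    "txn_date": ["2023-01-05", "2023-01-06", "2023-01-06"],
    "item":     [" mobile", "Modem ", "MOBILE"],
    "units":    ["3", "1", "2"],
})

def clean_and_transform(df):
    """Make the data internally consistent and give it a warehouse-friendly structure."""
    out = df.copy()
    out["txn_date"] = pd.to_datetime(out["txn_date"])          # one consistent date type
    out["item"] = out["item"].str.strip().str.title()          # one naming convention
    out["units"] = out["units"].astype(int)                    # numeric type for aggregation
    return out

print(clean_and_transform(raw))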

Partition the Data

It will optimize the hardware performance and simplify the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

Aggregation

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data. A brief sketch of pre-computing such an aggregation follows.
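
Purely as an illustration (the table and column names are assumed), this Python sketch pre-computes a monthly aggregation once, so that common queries can read the small summary table instead of scanning the detailed data.

import pandas as pd

# Detailed fact rows (illustrative).
detailed = pd.DataFrame({
    "month":  ["2023-01", "2023-01", "2023-02", "2023-02"],
    "branch": ["B1", "B2", "B1", "B2"],
    "units":  [120, 95, 140, 101],
})

# Pre-computed aggregation, stored alongside the detailed data.
monthly_units = detailed.groupby("month", as_index=False)["units"].sum()

# A common query ("units sold per month") now reads the summary, not the detail.
print(monthly_units)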

Backup and Archive the Data

In order to recover the data in the event of data loss, software failure, or hardware failure, it is essential to keep regular backups. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years with the latest 6 months' data being kept online. In such a scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.

Query Management Process

This process performs the following functions:

  • It manages the queries.

  • It helps speed up the execution time of queries.

  • It directs the queries to their most effective data sources.

  • It ensures that all the system sources are used in the most effective way.

  • It monitors actual query profiles.

The information generated in this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into the data warehouse.

Data Warehousing – Architecture

In this chapter, we will discuss the business analysis framework for data warehouse design and the architecture of a data warehouse.

Business Analysis Framework

The business analyst gets the information from the data warehouse to measure the performance and make critical adjustments in order to win over other business holders in the market. Having a data warehouse offers the following advantages:

  • Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.

  • A data warehouse provides us a consistent view of customers and products; hence, it helps us manage customer relationships.

  • A data warehouse also helps in bringing down the costs by tracking trends and patterns over a long period in a consistent and reliable manner.

To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows:

  • The top-down view – This view allows the selection of the relevant information needed for a data warehouse.

  • The data source view – This view presents the information being captured, stored, and managed by the operational system.

  • The data warehouse view – This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.

  • The business query view – It is the view of the data from the viewpoint of the end user.

Three-Tier Data Warehouse Architecture

Generally a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.

  • Bottom Tier – The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use the back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.

  • Middle Tier – In the middle tier, we have the OLAP server, which can be implemented in either of the following ways.

    • By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps the operations on multidimensional data to standard relational operations.

    • By the Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations.

  • Top Tier – This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools, and data mining tools.

The following diagram depicts the three-tier architecture of a data warehouse:

[Figure: Three-tier data warehouse architecture]

Data Warehouse Models

From the perspective of data warehouse architecture, we have the following data warehouse models:

  • Virtual Warehouse
  • Data mart
  • Enterprise Warehouse

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.

Data Mart

A data mart contains a subset of organization-wide data. This subset of data is important to specific groups of an organization.

In other words, we can claim that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to products, customers, and sales. Data marts are confined to subjects.

Points to remember about data marts:

  • Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

  • The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.

  • The life cycle of a data mart may be complex in the long run, if its planning and design are not organization-wide.

  • Data marts are small in size.

  • Data marts are customized by department.

  • The source of a data mart is a departmentally structured data warehouse.

  • Data marts are flexible.

Enterprise Warehouse

  • An enterprise warehouse collects all the information and the subjects spanning an entire organization.

  • It provides us enterprise-wide data integration.

  • The data is integrated from operational systems and external information providers.

  • This information can vary from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

Load Manager

This component performs the operations required for the extract and load process.

The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.

Load Manager Architecture

The load manager performs the following functions:

  • Extract the data from the source system.

  • Fast load the extracted data into a temporary data store.

  • Perform simple transformations into a structure similar to the one in the data warehouse.

[Figure: Load manager architecture]

Extract Data from Source

The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

Fast Load

  • In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

  • The transformations affect the speed of data processing.

  • It is more effective to load the data into a relational database prior to applying transformations and checks.

  • Gateway technology proves to be not suitable, since gateways tend not to perform well when large data volumes are involved.

Simple Transformations

While loading, it may be required to perform simple transformations. After these have been completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks (a small sketch follows this list):

  • Strip out all the columns that are not required within the warehouse.
  • Convert all the values to the required data types.
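
The following is a minimal Python sketch of these two checks; the incoming record layout and column names are assumptions made up for illustration.

# Illustrative EPOS transaction records as they arrive from the source.
incoming = [
    {"txn_id": "1001", "item": "Mobile", "units": "2", "till_operator": "op7"},
    {"txn_id": "1002", "item": "Modem",  "units": "1", "till_operator": "op3"},
]

REQUIRED_COLUMNS = ("txn_id", "item", "units")   # columns the warehouse keeps

def simple_transform(record):
    """Strip unneeded columns and convert values to the required data types."""
    kept = {col: record[col] for col in REQUIRED_COLUMNS}
    kept["txn_id"] = int(kept["txn_id"])
    kept["units"] = int(kept["units"])
    return kept

for row in incoming:
    print(simple_transform(row))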

Warehouse Manager

A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts.

The size and complexity of warehouse managers varies between specific solutions.

Warehouse Manager Architecture

A warehouse manager includes the following:

  • The controlling process
  • Stored procedures or C with SQL
  • Backup/recovery tool
  • SQL scripts

[Figure: Warehouse manager architecture]

Operations Performed by the Warehouse Manager

  • A warehouse manager analyzes the data to perform consistency and referential integrity checks.

  • It creates indexes, business views, and partition views against the base data.

  • It generates new aggregations and updates existing aggregations. It also generates normalizations.

  • It transforms and merges the source data into the published data warehouse.

  • It backs up the data in the data warehouse.

  • It archives the data that has reached the end of its captured life.

Note: A warehouse manager also analyzes query profiles to determine whether the indexes and aggregations are appropriate.

Query Manager

  • The query manager is responsible for directing the queries to the suitable tables.

  • By directing the queries to the appropriate tables, the speed of querying and response generation can be increased.

  • The query manager is responsible for scheduling the execution of the queries posed by the user.

Query Manager Architecture

The following diagram shows the architecture of a query manager. It includes the following:

  • Query redirection via C tool or RDBMS
  • Stored procedures
  • Query management tool
  • Query scheduling via C tool or RDBMS
  • Query scheduling via third-party software

[Figure: Query manager architecture]

Detailed Information

Detailed information is not kept online; rather it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data.

The following diagram shows a pictorial impression of where detailed information is stored and how it is used.

[Figure: Detailed information]

Note: If detailed information is held offline to minimize disk storage, we need to make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.

Summary Information

Summary information is a part of the data warehouse that stores predefined aggregations. These aggregations are generated by the warehouse manager. Summary information must be treated as transient. It changes on-the-go in order to respond to changing query profiles.

Points to remember about summary information:

  • Summary information speeds up the performance of common queries.

  • It increases the operational cost.

  • It needs to be updated whenever new data is loaded into the data warehouse.

  • It may not have been backed up, since it can be generated fresh from the detailed information.

Data Warehousing – OLAP

An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to information. This chapter covers the types of OLAP, the operations on OLAP, and the differences between OLAP, statistical databases, and OLTP.

Types of OLAP Servers

We have four types of OLAP servers:

  • Relational OLAP (ROLAP)
  • Multidimensional OLAP (MOLAP)
  • Hybrid OLAP (HOLAP)
  • Specialized SQL Servers

Relational OLAP

ROLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.

ROLAP includes the following:

  • Implementation of aggregation navigation logic.
  • Optimization for each DBMS back end.
  • Additional tools and services.

Multidimensional OLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP (HOLAP)

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed information. The aggregations are stored separately in the MOLAP store.

Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations

Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.

Here is the list of OLAP operations:

  • Roll-up
  • Drill-down
  • Slice and dice
  • Pivot (rotate)

Roll-up

Roll-up performs aggregation on a data cube in either of the following ways:

  • By climbing up a concept hierarchy for a dimension
  • By dimension reduction

The following diagram illustrates how roll-up works.

[Figure: Roll-up operation]

  • Roll-up is performed by climbing up the concept hierarchy for the dimension location.

  • Initially the concept hierarchy was "street < city < province < country".

  • On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

  • The data is grouped into countries rather than cities.

  • When roll-up is performed, one or more dimensions from the data cube are removed. A small sketch of this operation follows.
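
As an illustration only (the cities, countries, and figures below are made up), this Python sketch shows a roll-up along the location hierarchy from city to country using pandas.

import pandas as pd

# City-level sales (illustrative data).
sales = pd.DataFrame({
    "city":    ["Toronto", "Vancouver", "New York", "Chicago"],
    "country": ["Canada",  "Canada",    "USA",      "USA"],
    "units":   [605, 825, 1087, 440],
})

# Roll-up: climb the location hierarchy from city to country and aggregate.
rolled_up = sales.groupby("country", as_index=False)["units"].sum()
print(rolled_up)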

Drill-down

Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:

  • By stepping down a concept hierarchy for a dimension
  • By introducing a new dimension

The following diagram illustrates how drill-down works:

[Figure: Drill-down operation]

  • Drill-down is performed by stepping down the concept hierarchy for the dimension time.

  • Initially the concept hierarchy was "day < month < quarter < year".

  • On drilling down, the time dimension is descended from the level of quarter to the level of month.

  • When drill-down is performed, one or more dimensions from the data cube are added.

  • It navigates the data from less detailed data to highly detailed data.

Slice

The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows how slice works.

[Figure: Slice operation]

  • Here slice is performed for the dimension "time" using the criterion time = "Q1".

  • It forms a new sub-cube by selecting one or more dimensions.

Dice

Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation.

[Figure: Dice operation]

The dice operation on the cube based on the following selection criteria involves three dimensions (a small sketch follows this list):

  • (location = "Toronto" or "Vancouver")
  • (time = "Q1" or "Q2")
  • (item = "Mobile" or "Modem")

Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram that shows the pivot operation.

[Figure: Pivot operation]

In this example, the item and location axes of the 2-D slice are rotated.
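
As a small illustrative sketch (the slice values are made up), the pivot of a 2-D slice corresponds to swapping its axes, which in pandas is a simple transpose.

import pandas as pd

# A 2-D slice: item (rows) vs. location (columns), with made-up figures.
slice_2d = pd.DataFrame(
    {"Toronto": [605, 825], "Vancouver": [1087, 440]},
    index=["Mobile", "Modem"],
)

# Pivot (rotation): swap the item and location axes.
rotated = slice_2d.T
print(rotated)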

OLAP vs OLTP

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | Based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | Based on Entity Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.

Data Warehousing – Relational OLAP

Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage the warehouse data, relational OLAP uses a relational or extended-relational DBMS.

ROLAP includes the following:

  • Implementation of aggregation navigation logic
  • Optimization for each DBMS back end
  • Additional tools and services

Points to Remember

  • ROLAP servers are highly scalable.

  • ROLAP tools analyze large volumes of data across multiple dimensions.

  • ROLAP tools store and analyze highly volatile and changeable data.

Relational OLAP Architecture

ROLAP includes the following components:

  • Database server
  • ROLAP server
  • Front-end tool.

[Figure: ROLAP architecture]

Advantages

  • ROLAP servers can be easily used with existing RDBMS.
  • Data can be stored efficiently, since no zero facts need to be stored.
  • ROLAP tools do not use pre-calculated data cubes.
  • The DSS server of MicroStrategy adopts the ROLAP approach.

Disadvantages

  • Poor query performance.

  • Some limitations of scalability, depending on the technology architecture that is utilized.

Data Warehousing – Multidimensional OLAP

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Points to Remember:

  • MOLAP tools process information with consistent response time regardless of the level of summarizing or the calculations selected.

  • MOLAP tools need to avoid many of the complexities of creating a relational database to store data for analysis.

  • MOLAP tools need the fastest possible performance.

  • A MOLAP server adopts two levels of storage representation to handle dense and sparse data sets.

  • Denser sub-cubes are identified and stored as an array structure.

  • Sparse sub-cubes employ compression technology.

MOLAP Architecture

MOLAP includes the following components:

  • Database server.
  • MOLAP server.
  • Front-end tool.

[Figure: MOLAP architecture]

Advantages

  • MOLAP permit’s quickest indexing to the pre-complaceed summarized data.
  • Helps the users connected to a ne2rk who need to analyze bigr, less-degoodd data.
  • Easier to use, generally correct now therefore MOLAP is suitable for inexperienced users.

Disadvantages

  • MOLAP is not capable of containing detailed data.
  • The storage utilization may be low if the data set is sparse.

MOLAP vs ROLAP

Sr.No. | MOLAP | ROLAP
1 | Information retrieval is fast. | Information retrieval is comparatively slow.
2 | Uses a sparse array to store data sets. | Uses a relational table.
3 | MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users.
4 | Maintains a separate database for data cubes. | It may not require space other than that available in the data warehouse.
5 | DBMS facility is weak. | DBMS facility is strong.

Data Warehousing – Schemas

A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires maintaining a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this chapter, we will discuss the schemas used in a data warehouse.

Star Schema

  • Each dimension in a star schema is represented with only a one-dimension table.

  • This dimension table contains the set of attributes.

  • The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.

[Figure: Star schema]

  • There is a fact table at the center. It contains the keys to each of the four dimensions.

  • The fact table also contains the measures, namely dollars sold and units sold.

Note: Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
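
To make the fact/dimension relationship concrete, here is a small, purely illustrative Python sketch (the tables, keys, and figures are made up) that joins a fact table with two dimension tables, much as a query against a star schema would.

import pandas as pd

# Dimension tables (illustrative).
item_dim = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["Mobile", "Modem"],
})
location_dim = pd.DataFrame({
    "location_key": [10, 20],
    "city":    ["Vancouver", "Toronto"],
    "country": ["Canada", "Canada"],
})

# Fact table holding keys into each dimension plus the measures.
sales_fact = pd.DataFrame({
    "item_key":     [1, 2, 1],
    "location_key": [10, 10, 20],
    "dollars_sold": [1200.0, 300.0, 950.0],
    "units_sold":   [4, 1, 3],
})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate the measures by descriptive attributes.
report = (sales_fact
          .merge(item_dim, on="item_key")
          .merge(location_dim, on="location_key")
          .groupby(["country", "item_name"], as_index=False)[["dollars_sold", "units_sold"]]
          .sum())
print(report)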

Snowflake Schema

  • Some dimension tables in the snowflake schema are normalized.

  • The normalization splits up the data into additional tables.

  • Unlike in the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table in the star schema is normalized and split into two dimension tables, namely the item and supplier tables.

[Figure: Snowflake schema]

  • Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.

  • The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the snowflake schema, the redundancy is reduced and therefore it becomes easy to maintain and saves storage space.

Fact Constellation Schema

  • A fact constellation has multiple fact tables. It is also known as a galaxy schema.

  • The following diagram shows two fact tables, namely sales and shipping.

[Figure: Fact constellation schema]

  • The sales fact table is the same as that in the star schema.

  • The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.

  • The shipping fact table also contains two measures, namely dollars cost and units shipped.

  • It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Schema Definition

A multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining the data warehouses and data marts.

Syntax for Cube Definition

define cube < cube_name > [ < dimension_list > ]: < measure_list >

Syntax for Dimension Definition

define dimension < dimension_name > as ( < attribute_or_dimension_list > )

Star Schema Definition

The star schema that we have discussed can be defined using Data Mining Query Language (DMQL) as follows:

define cube sales star [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)

Snowflake Schema Definition

The snowflake schema can be defined using DMQL as follows:

define cube sales snowflake [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))

Fact Constellation Schema Definition

The fact constellation schema can be defined using DMQL as follows:

define cube sales [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)

define cube shipping [time, item, shipper, from location, to location]:

dollars cost = sum(cost in dollars), units shipped = count(*)

define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales

Data Warehousing – Partitioning Strategy

Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in balancing the various requirements of the system. It optimizes the hardware performance and simplifies the management of the data warehouse by partitioning each fact table into multiple separate partitions. In this chapter, we will discuss different partitioning strategies.

Why is it Necessary to Partition?

Partitioning is important for the following reasons:

  • For easy management,
  • To assist backup/recovery,
  • To enhance performance.

For Easy Management

The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of the fact table is very hard to manage as a single entity. Therefore it needs partitioning.

To Assist Backup/Recovery

If we do not partition the fact table, then we have to load the complete fact table with all the data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to load and also enhances the performance of the system.

Note: To cut down on the backup size, all partitions other than the current partition can be marked as read-only. We can then put these partitions into a state where they cannot be modified. Then they can be backed up. This means only the current partition is to be backed up.

To Enhance Performance

By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance is enhanced because now the query scans only those partitions that are relevant. It does not have to scan the whole data.

Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period represents a significant retention period within the business. For example, if the user queries for month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them. A small sketch of this strategy follows.
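
The following Python sketch is purely illustrative (the fact rows, column names, and table-naming convention are assumptions): it routes fact rows into one table per month, mimicking partitioning by time into equal segments.

import pandas as pd

# Fact rows tagged with a transaction date (illustrative data).
fact = pd.DataFrame({
    "txn_date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03"]),
    "units":    [3, 1, 5],
})

# Partition by time into equal (monthly) segments:
# one table per month, e.g. sales_2023_01, sales_2023_02, ...
partitions = {
    f"sales_{period}": part.drop(columns="month")
    for period, part in fact.assign(month=fact["txn_date"].dt.strftime("%Y_%m")).groupby("month")
}

for name, table in partitions.items():
    print(name)
    print(table)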

Partition by Time into Different-sized Segments

This kind of partitioning is done where the aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and a larger partition for inactive data.

[Figure: Partitioning by time into different-sized segments]

Points to Note

  • The detaiintroduced information remains available online.

  • The number of physical tables is kept relatively small, which reddish coloured-coloureddish coloureduces the operating cost.

  • This technique is suitable where a mix of data dipping recent background and data mining through entire background is requireddish coloured-coloureddish coloured.

  • This technique is not helpful where the partitioning profile alters on a regular basis, because repartitioning will increase the operation cost of data battleehouse.

Partition on a Different Dimension

The reality table can also be partitioned on the basis of dimensions other than time such as item group, area, supplier, or any kind of kind of other dimension. Let's have an example.

Suppose a market function has been structureddish coloured-coloureddish coloured into unique areaal departments like on a state simply by state basis. If every area wants to query on information captureddish coloured-coloureddish coloured within it’s area, it would prove to be more effective to partition the reality table into areaal partitions. This will cause the queries to speed up because it does not require to scan information thead wear is not relevant.

Points to Note

  • The query does not have to scan irrelevant data which speeds up the query process.

  • This technique is not appropriate where the dimensions are unlikely to alter in future. So, it is worth determining thead wear the dimension does not alter in future.

  • If the dimension alters, then the entire reality table would have to be repartitioned.

Note: We recommend to perform the partition only on the basis of time dimension, unless you are particular thead wear the suggested dimension grouping will not alter within the life of the data battleehouse.

Partition simply by Size of Table

When generally correct now there are no clear basis for partitioning the reality table on any kind of kind of dimension, then we need to partition the reality table on the basis of their size. We can set the preddish coloured-coloureddish colouredetermined size as a critical stage. When the table exceeds the preddish coloured-coloureddish colouredetermined size, a brand brand new table partition is produced.

Points to Note

  • This partitioning is complex to manage.

It requires metadata to identify exactly exactly whead wear data is storeddish coloured-coloureddish coloured in every partition.

Partitioning Dimensions

If a dimension contains big number of entries, then it is requireddish coloured-coloureddish coloured to partition the dimensions. Here we have to check the size of a dimension.

Conaspectr a big style thead wear alters over time. If we need to store all the variations in order to apply comparisons, thead wear dimension may be very big. This would definitely affect the response time.

Round Robin Partitions

In the round robin technique, when a brand brand new partition is needed, the old one is archived. It uses metadata to permit user access tool to refer to the appropriate table partition.

This technique produces it easy to automate table management faciliconnects within the data battleehouse.
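As a rough illustration, the sketch below keeps a fixed number of monthly partitions online and archives the oldest one when a new month starts; the partition names and the simple metadata dictionary are assumptions made for the example, not features of any particular product.

from collections import deque

MAX_ONLINE_PARTITIONS = 12  # assumed retention: twelve monthly partitions kept online

# Metadata that the user access tool consults to find the right partition.
metadata = {"online": deque(), "archived": []}

def add_partition(name: str) -> None:
    """Create a new partition; archive the oldest one if the limit is reached."""
    if len(metadata["online"]) == MAX_ONLINE_PARTITIONS:
        oldest = metadata["online"].popleft()
        metadata["archived"].append(oldest)   # e.g. moved to tape or cheaper storage
    metadata["online"].append(name)

def partition_for(month_key: str) -> str:
    """Resolve a month such as '2013_09' to an online or archived partition."""
    name = f"sales_{month_key}"
    if name in metadata["online"]:
        return name
    if name in metadata["archived"]:
        return f"{name} (archived)"
    raise KeyError(f"no partition holds data for {month_key}")

for year in (2013, 2014):
    for month in range(1, 13):
        add_partition(f"sales_{year}_{month:02d}")

print(partition_for("2014_06"))   # still online
print(partition_for("2013_03"))   # already archived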

Vertical Partition

Vertical partitioning splits the data vertically. The following image depicts how vertical partitioning is done.

Vertical Partitioning

Vertical partitioning can be performed in the following two ways:

  • Normalization
  • Row Splitting

Normalization

Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space. Take a look at the following tables that show how normalization is performed.

Table before Normalization

Product_id  Qty  Value  sales_date  Store_id  Store_name  Location   Region
30          5    3.67   3-Aug-13    16        sunny       Bangalore  S
35          4    5.33   3-Sep-13    16        sunny       Bangalore  S
40          5    2.50   3-Sep-13    64        san         Mumbai     W
45          7    5.66   3-Sep-13    16        sunny       Bangalore  S

Table after Normalization

Store_id  Store_name  Location   Region
16        sunny       Bangalore  S
64        san         Mumbai     W

Product_id  Qty  Value  sales_date  Store_id
30          5    3.67   3-Aug-13    16
35          4    5.33   3-Sep-13    16
40          5    2.50   3-Sep-13    64
45          7    5.66   3-Sep-13    16
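A minimal sketch of the same normalization step is shown below. It takes flat rows like those in the table above and splits them into a store dimension and a slimmer sales table keyed by Store_id; the dictionaries are only an illustration of the idea, not a schema recommendation.

# Flat rows as in "Table before Normalization".
flat_rows = [
    {"Product_id": 30, "Qty": 5, "Value": 3.67, "sales_date": "3-Aug-13",
     "Store_id": 16, "Store_name": "sunny", "Location": "Bangalore", "Region": "S"},
    {"Product_id": 35, "Qty": 4, "Value": 5.33, "sales_date": "3-Sep-13",
     "Store_id": 16, "Store_name": "sunny", "Location": "Bangalore", "Region": "S"},
    {"Product_id": 40, "Qty": 5, "Value": 2.50, "sales_date": "3-Sep-13",
     "Store_id": 64, "Store_name": "san", "Location": "Mumbai", "Region": "W"},
]

store_columns = ("Store_name", "Location", "Region")

stores = {}   # Store_id -> store attributes (the new dimension table)
sales = []    # slimmer sales rows that keep only the Store_id foreign key

for row in flat_rows:
    stores[row["Store_id"]] = {col: row[col] for col in store_columns}
    sales.append({k: v for k, v in row.items() if k not in store_columns})

print(stores)   # one row per store, duplicates collapsed
print(sales)    # sales rows now reference stores by Store_id only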

Row Splitting

Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.

Note: While using vertical partitioning, make sure that there is no requirement to perform a major join operation between two partitions.

Identify Key to Partition

It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table. Let's have an example. Suppose we want to partition the following table.

Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name

We can choose to partition on any key. The two probable keys are:

  • region
  • transaction_date

Suppose the business is organized in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business region.

If we partition by transaction_date instead of region, then the latest transactions from every region will be in one partition. Now the user who wants to look at data within his own region has to query across multiple partitions.

Hence it is worth determining the right partitioning key.
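The effect of the two candidate keys can be sketched as follows. The example counts how many partitions a typical "my region, last quarter" query would have to scan under each scheme; the partition layouts are assumptions chosen to mirror the discussion above.

# Assumed layouts that mirror the discussion above.
REGIONS = [f"region_{i:02d}" for i in range(1, 31)]    # 30 regional partitions
MONTHS = [f"2013_{m:02d}" for m in range(1, 13)]       # 12 monthly partitions

def partitions_scanned(partition_key: str, user_region: str, months_queried: list) -> int:
    """Partitions touched by a query restricted to one region and a few months."""
    if partition_key == "region":
        # The whole history of that region sits in a single regional partition.
        return 1
    if partition_key == "transaction_date":
        # Every month of interest is a separate partition mixing all regions.
        return len(months_queried)
    raise ValueError(f"unknown partition key: {partition_key}")

last_quarter = MONTHS[6:9]   # July to September
print("by region:          ", partitions_scanned("region", REGIONS[4], last_quarter))            # 1
print("by transaction_date:", partitions_scanned("transaction_date", REGIONS[4], last_quarter))  # 3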

Data Warehousing – Metadata Concepts

What is Metadata?

Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to detailed data. In terms of a data warehouse, we can define metadata as follows.

  • Metadata is the road-map to a data warehouse.

  • Metadata in a data warehouse defines the warehouse objects.

  • Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.

Note: In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and the source of the extracted data.

Categories of Metadata

Metadata can be broadly categorized into three categories:

  • Business Metadata – It has the data ownership information, business definition, and changing policies.

  • Technical Metadata – It includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.

  • Operational Metadata – It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and the transformations applied on it.

Metadata Categories

Role of Metadata

Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is different from the warehouse data, yet it plays an important role. The various roles of metadata are explained below.

  • Metadata acts as a directory.

  • This directory helps the decision support system to locate the contents of the data warehouse.

  • Metadata helps the decision support system in mapping data when data is transformed from the operational environment to the data warehouse environment.

  • Metadata helps in summarization between current detailed data and highly summarized data.

  • Metadata also helps in summarization between lightly detailed data and highly summarized data.

  • Metadata is used for query tools.

  • Metadata is used in extraction and cleansing tools.

  • Metadata is used in reporting tools.

  • Metadata is used in transformation tools.

  • Metadata plays an important role in loading functions.

The following diagram shows the roles of metadata.

Role of Metadata

Metadata Repository

The metadata repository is an integral part of a data warehouse system. It holds the following metadata:

  • Definition of data warehouse – It includes the description of the structure of the data warehouse. The description is defined by schema, views, hierarchies, derived data definitions, and data mart locations and contents.

  • Business metadata – It has the data ownership information, business definition, and changing policies.

  • Operational metadata – It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and the transformations applied on it.

  • Data for mapping from operational environment to data warehouse – It includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, data refresh and purging rules.

  • Algorithms for summarization – It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
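To make the categories concrete, here is a minimal, purely illustrative sketch of how a tiny metadata repository entry for one fact table might be organized; the field names and values are assumptions for the example rather than a standard.

# A toy metadata repository entry for a single warehouse table.
metadata_repository = {
    "sales_fact": {
        "business": {
            "owner": "Sales BI team",
            "definition": "One row per item sold per store per day",
        },
        "technical": {
            "columns": {"product_id": "INTEGER", "qty": "INTEGER",
                        "value": "DECIMAL(8,2)", "sales_date": "DATE"},
            "primary_key": ["product_id", "sales_date"],
        },
        "operational": {
            "currency": "active",            # active / archived / purged
            "lineage": ["extracted from EPOS feed", "currency converted to dollars"],
        },
    }
}

def describe(table: str) -> None:
    """Print a small directory-style summary, the way a DSS might consult metadata."""
    entry = metadata_repository[table]
    print(table, "-", entry["business"]["definition"])
    print("  columns:", ", ".join(entry["technical"]["columns"]))
    print("  status :", entry["operational"]["currency"])

describe("sales_fact")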

Challenges for Metadata Management

The importance of metadata can not be overstated. Metadata helps in driving the accuracy of reports, validates data transformation, and ensures the accuracy of calculations. Metadata also enforces the definition of business terms for business end-users. With all these uses of metadata, it also has its challenges. Some of the challenges are discussed below.

  • Metadata in a big organization is scattered across the organization. This metadata is spread in spreadsheets, databases, and applications.

  • Metadata could be present in text files or multimedia files. To use this data for information management solutions, it has to be correctly defined.

  • There are no industry-wide accepted standards. Data management solution vendors have a narrow focus.

  • There are no easy and accepted methods of passing metadata.

Data Warehousing – Data Marting

Why Do We Need a Data Mart?

Listed below are the reasons to create a data mart:

  • To partition data in order to impose access control strategies.

  • To speed up the queries by reducing the volume of data to be scanned.

  • To segment data onto different hardware platforms.

  • To structure data in a form suitable for a user access tool.

Note: Do not data mart for any other reason, since the operation cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.

Cost-effective Data Marting

Follow the steps given below to make data marting cost-effective:

  • Identify the Functional Splits
  • Identify User Access Tool Requirements
  • Identify Access Control Issues

Identify the Functional Splits

In this step, we determine if the organization has natural functional splits. We look for departmental splits, and we determine whether the way in which departments use information tends to be in isolation from the rest of the organization. Let's have an example.

Consider a retail organization, where each merchant is accountable for maximizing the sales of a group of products. For this, the following is the important information:

  • sales transactions on a daily basis
  • sales forecasts on a weekly basis
  • stock positions on a daily basis
  • stock movements on a daily basis

As the merchant is not interested in the products they are not dealing with, the data mart is a subset of the data dealing with the product group of interest (see the sketch after the following diagram). The following diagram shows data marting for different users.

Data Marting
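As a rough sketch of that idea, the snippet below builds a merchant's data mart by keeping only the warehouse rows for the product group of interest; the product groups and row fields are invented for the illustration.

# Warehouse rows: (product_id, product_group, qty, value)
warehouse_sales = [
    (30, "dairy", 5, 3.67),
    (35, "bakery", 4, 5.33),
    (40, "dairy", 5, 2.50),
    (45, "produce", 7, 5.66),
]

def build_data_mart(rows, product_group):
    """A data mart here is just the subset of warehouse data one merchant cares about."""
    return [row for row in rows if row[1] == product_group]

dairy_mart = build_data_mart(warehouse_sales, "dairy")
print(dairy_mart)   # only the dairy rows are loaded into the merchant's mart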

Given below are the issues to be taken into account while determining the functional split:

  • The structure of the department may change.

  • The products may switch from one department to another.

  • The merchant could query the sales trend of other products to analyze what is happening to the sales.

Note: We need to determine the business benefits and technical feasibility of using a data mart.

Identify User Access Tool Requirements

We need data marts to support user access tools that require internal data structures. The data in such structures is outside the control of the data warehouse but needs to be populated and updated on a regular basis.

There are some tools that populate directly from the source system, but some cannot. Therefore additional requirements outside the scope of the tool need to be identified for the future.

Note: In order to ensure consistency of data across all access tools, the data should not be directly populated from the data warehouse; instead, each tool must have its own data mart.

Identify Access Control Issues

There should be privacy rules to ensure the data is accessed by authorized users only. For example, a data warehouse for a retail banking institution ensures that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank.

Data marts allow us to build a complete wall by physically separating data segments within the data warehouse. To avoid possible privacy issues, the detailed data can be removed from the data warehouse. We can create a data mart for each legal entity and load it via the data warehouse, with detailed account data.

Designing Data Marts

Data marts should be designed as a smaller version of the starflake schema within the data warehouse and should match the database design of the data warehouse. It helps in maintaining control over database instances.

Designing Data Mart

The summaries are data marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all dimension data in the starflake schema.

Cost of Data Marting

The cost measures for data marting are as follows:

  • Hardware and Software Cost
  • Network Access
  • Time Window Constraints

Hardware and Software Cost

Although data marts are created on the same hardware, they require some additional hardware and software. To handle user queries, additional processing power and disk storage are required. If detailed data and the data mart exist within the data warehouse, then we would face additional cost to store and manage replicated data.

Note: Data marting is more expensive than aggregations, therefore it should be used as an additional strategy and not as an alternative strategy.

Network Access

A data mart could be in a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.

Time Window Constraints

The extent to which a data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The determination of how many data marts are possible depends on:

  • Network capacity
  • Time window available
  • Volume of data being transferred
  • Mechanisms being used to insert data into a data mart

Data Warehousing – System Managers

System management is mandatory for the successful implementation of a data warehouse. The most important system managers are:

  • System configuration manager
  • System scheduling manager
  • System event manager
  • System database manager
  • System backup recovery manager

System Configuration Manager

  • The system configuration manager is responsible for the management of the setup and configuration of the data warehouse.

  • The structure of the configuration manager varies from one operating system to another.

  • In Unix, the structure of the configuration manager varies from vendor to vendor.

  • Configuration managers have a single user interface.

  • The interface of the configuration manager allows us to control all aspects of the system.

Note: The most important configuration tool is the I/O manager.

System Scheduling Manager

The system scheduling manager is responsible for the successful implementation of the data warehouse. Its purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. The list of features a system scheduling manager must have is as follows:

  • Work across cluster or MPP boundaries
  • Deal with international time differences
  • Handle job failure
  • Handle multiple queries
  • Support job priorities
  • Restart or re-queue the failed jobs
  • Notify the user or a process when a job is completed
  • Maintain the job schedules across system outages
  • Re-queue jobs to other queues
  • Support the stopping and starting of queues
  • Log queued jobs
  • Deal with inter-queue processing

Note: The above list can be used as evaluation parameters for the evaluation of a good scheduler.

Some important jobs that a scheduler must be able to handle are as follows:

  • Daily and ad hoc query scheduling
  • Execution of regular report requirements
  • Data load
  • Data processing
  • Index creation
  • Backup
  • Aggregation creation
  • Data transformation

Note: If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.

System Event Manager

The event manager is a kind of software. The event manager manages the events that are defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex. Therefore we need a tool that automatically handles all the events without any intervention from the user.

Note: The event manager monitors event occurrences and deals with them. The event manager also tracks the myriad of things that can go wrong on this complex data warehouse system.

Events

Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.

Given below is a list of common events that are required to be tracked.

  • Hardware failure
  • Running out of space on certain key disks
  • A process dying
  • A process returning an error
  • CPU usage exceeding an 80% threshold
  • Internal contention on database serialization points
  • Buffer cache hit ratios exceeding or falling below threshold
  • A table reaching the maximum of its size
  • Excessive memory swapping
  • A table failing to extend due to lack of space
  • Disks exhibiting I/O bottlenecks
  • Usage of temporary or sort area reaching a certain threshold
  • Any other database shared memory usage

The most important thing about events is that they should be capable of executing on their own. Event packages define the procedures for the predefined events. The code associated with each event is known as the event handler. This code is executed whenever an event occurs.
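A minimal sketch of that idea is shown below: events are registered with handlers, and the event manager runs the matching handler whenever an occurrence is reported. The event names and handlers are invented for illustration and do not correspond to any particular product.

# Toy event manager: event name -> handler to execute when the event occurs.
handlers = {}

def on(event_name):
    """Register an event handler for a predefined event."""
    def register(func):
        handlers[event_name] = func
        return func
    return register

def raise_event(event_name, **details):
    """Called by monitoring code when a measurable occurrence is observed."""
    handler = handlers.get(event_name)
    if handler is None:
        print(f"unhandled event: {event_name} {details}")
    else:
        handler(**details)

@on("disk_space_low")
def handle_disk_space_low(disk, free_mb):
    print(f"paging operator: {disk} has only {free_mb} MB free")

@on("cpu_over_threshold")
def handle_cpu(usage):
    print(f"throttling batch queue, CPU at {usage}%")

raise_event("disk_space_low", disk="/dwh/data01", free_mb=512)
raise_event("cpu_over_threshold", usage=87)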

System and Database Manager

The system manager and the database manager may be two separate pieces of software, but they do the same job. The objective of these tools is to automate certain processes and to simplify the execution of others. The criteria for choosing a system and database manager are as follows:

  • increase a user's quota
  • assign and de-assign roles to the users
  • assign and de-assign profiles to the users
  • perform database space management
  • monitor and report on space usage
  • tidy up fragmented and unused space
  • add and expand the space
  • add and remove users
  • manage user passwords
  • manage summary or temporary tables
  • assign or de-assign temporary space to and from a user
  • reclaim the space from old or out-of-date temporary tables
  • manage error and trace logs
  • browse log and trace files
  • redirect error or trace information
  • switch on and off error and trace logging
  • perform system space management
  • monitor and report on space usage
  • clean up old and unused file directories
  • add or expand space

System Backup Recovery Manager

The backup and recovery tool makes it easy for operations and management staff to back up the data. Note that the system backup manager must be integrated with the schedule manager software being used. The important features that are required for the management of backups are as follows:

  • Scheduling
  • Backup data tracking
  • Database awareness

Backups are taken only to protect against data loss. Following are the important points to remember.

  • The backup software will keep some form of database of where and when each piece of data was backed up.

  • The backup recovery manager must have a good front-end to that database.

  • The backup recovery software should be database aware.

  • Being aware of the database, the software can then be addressed in database terms, and will not perform backups that would not be viable.

Data Warehousing – Process Managers

Process managers are responsible for maintaining the flow of data both into and out of the data warehouse. There are three different kinds of process managers:

  • Load manager
  • Warehouse manager
  • Query manager

Data Warehouse Load Manager

The load manager performs the operations required to extract and load the data into the database. The size and complexity of a load manager varies between specific solutions from one data warehouse to another.

Load Manager Architecture

The load manager performs the following functions:

  • Extract data from the source system.

  • Fast-load the extracted data into a temporary data store.

  • Perform simple transformations into a structure similar to the one in the data warehouse.

Load Manager

Extract Data from Source

The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

Fast Load

  • In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

  • Transformations affect the speed of data processing.

  • It is more effective to load the data into a relational database prior to applying transformations and checks.

  • Gateway technology is not suitable, since it is inefficient when large data volumes are involved.

Simple Transformations

While loading, it may be required to perform simple transformations. After completing the simple transformations, we can do complex checks. Suppose we are loading the EPOS sales transactions, we need to perform the following checks (a small sketch follows this list):

  • Strip out all the columns that are not required within the warehouse.
  • Convert all the values to the required data types.
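A minimal sketch of those two checks, assuming made-up EPOS record fields and a small list of columns kept by the warehouse:

# Columns the warehouse actually keeps, and their target types.
KEEP = {"product_id": int, "qty": int, "value": float, "sales_date": str}

def simple_transform(record: dict) -> dict:
    """Strip unwanted columns and convert the remaining values to the required types."""
    return {col: cast(record[col]) for col, cast in KEEP.items()}

# Raw EPOS record as it might arrive from the source system (everything is text).
raw = {"product_id": "30", "qty": "5", "value": "3.67",
       "sales_date": "3-Aug-13", "till_id": "7", "operator": "J Smith"}

print(simple_transform(raw))
# {'product_id': 30, 'qty': 5, 'value': 3.67, 'sales_date': '3-Aug-13'}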

Warehouse Manager

The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.

Warehouse Manager Architecture

A warehouse manager includes the following:

  • The controlling process
  • Stored procedures or C with SQL
  • Backup/Recovery tool
  • SQL scripts

Warehouse Manager

Functions of Warehouse Manager

A warehouse manager performs the following functions:

  • Analyzes the data to perform consistency and referential integrity checks.

  • Creates indexes, business views, and partition views against the base data.

  • Generates new aggregations and updates the existing aggregations.

  • Generates normalizations.

  • Transforms and merges the source data of the temporary store into the published data warehouse.

  • Backs up the data in the data warehouse.

  • Archives the data that has reached the end of its captured life.

Note: A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.

Query Manager

The query manager is responsible for directing the queries to suitable tables. By directing the queries to appropriate tables, it speeds up the query request and response process. In addition, the query manager is responsible for scheduling the execution of the queries posted by the user.

Query Manager Architecture

A query manager includes the following components:

  • Query redirection via C tool or RDBMS
  • Stored procedures
  • Query management tool
  • Query scheduling via C tool or RDBMS
  • Query scheduling via third-party software

Query Manager

Functions of Query Manager

  • It presents the data to the user in a form they understand.

  • It schedules the execution of the queries posted by the end-user.

  • It stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.

Data Warehousing – Security

The objective of a data warehouse is to make large amounts of data easily accessible to the users, hence allowing the users to extract information about the business as a whole. But we know that there could be some security restrictions applied on the data that can be an obstacle for accessing the information. If the analyst has a restricted view of data, then it is impossible to capture a complete picture of the trends within the business.

The data from each analyst can be summarized and passed on to management, where the different summaries can be aggregated. As the aggregations of summaries cannot be the same as the aggregation as a whole, it is possible to miss some information trends in the data unless someone is analyzing the data as a whole.

Security Requirements

Adding security features affects the performance of the data warehouse, therefore it is important to determine the security requirements as early as possible. It is difficult to add security features after the data warehouse has gone live.

During the design phase of the data warehouse, we should keep in mind what data sources may be added later and what would be the impact of adding those data sources. We should consider the following possibilities during the design phase.

  • Whether the new data sources will require new security and/or audit restrictions to be implemented?

  • Whether the new users added will have restricted access to data that is already generally available?

This situation arises when the future users and the data sources are not well known. In such a situation, we need to use the knowledge of the business and the objective of the data warehouse to know the likely requirements.

The following activities get affected by security measures:

  • User access
  • Data load
  • Data movement
  • Query generation

User Access

We need to first classify the data and then classify the users on the basis of the data they can access. In other words, the users are classified according to the data they can access.

Data Classification

The following two approaches can be used to classify the data:

  • Data can be classified according to its sensitivity. Highly sensitive data is classified as highly restricted and less sensitive data is classified as less restrictive.

  • Data can also be classified according to the job function. This restriction allows only specific users to view particular data. Here we restrict the users to view only that part of the data in which they are interested and are responsible for.

There are some issues with the second approach. To understand it, let's have an example. Suppose you are building the data warehouse for a bank. Consider that the data being stored in the data warehouse is the transaction data for all the accounts. The question here is, who is allowed to see the transaction data. The solution lies in classifying the data according to the function.
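The sketch below illustrates the second approach in miniature: each account row carries the branch (job function) that owns it, and a user only sees rows for functions they are responsible for. The branch names and user roles are invented for the example.

# Transaction rows tagged with the branch (job function) that owns them.
transactions = [
    {"txn_id": 1, "account_id": "A-100", "branch": "retail", "value": 250.0},
    {"txn_id": 2, "account_id": "A-200", "branch": "corporate", "value": 9800.0},
    {"txn_id": 3, "account_id": "A-101", "branch": "retail", "value": 40.0},
]

# Which branches each user is responsible for.
user_functions = {
    "alice": {"retail"},
    "bob": {"corporate"},
    "carol": {"retail", "corporate"},   # e.g. an auditor sees everything
}

def visible_transactions(user: str):
    """Return only the rows belonging to functions the user is responsible for."""
    allowed = user_functions.get(user, set())
    return [t for t in transactions if t["branch"] in allowed]

print([t["txn_id"] for t in visible_transactions("alice")])  # [1, 3]
print([t["txn_id"] for t in visible_transactions("bob")])    # [2]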

User Classification

The following approaches can be used to classify the users:

  • Users can be classified as per the hierarchy of users in an organization, i.e., users can be classified by departments, sections, groups, and so on.

  • Users can also be classified according to their role, with people grouped across departments based on their role.

Classification on the Basis of Department

Let's have an example of a data warehouse where the users are from the sales and marketing departments. We can have security by a top-to-down company view, with access centered on the different departments. But there could be some restrictions on users at different levels. This structure is shown in the following diagram.

User Access Hierarchy

But if each department accesses different data, then we should design the security access for each department separately. This can be achieved by departmental data marts. Since these data marts are separated from the data warehouse, we can enforce separate security restrictions on each data mart. This approach is shown in the following figure.

Using data marts to enforce restrictions on access to data

Classification on the Basis of Role

If the data is generally available to all the departments, then it is helpful to follow the role access hierarchy. In other words, if the data is generally accessed by all the departments, then access should be granted according to the user's role, as shown in the following diagram.

Role Access Hierarchy

Audit Requirements

Auditing is a subset of security, and a costly activity. Auditing can cause heavy overheads on the system. To complete an audit in time, we require more hardware and therefore, it is recommended that wherever possible, auditing should be switched off. Audit requirements can be categorized as follows:

  • Connections
  • Disconnections
  • Data access
  • Data change

Note: For each of the above-mentioned categories, it is necessary to audit success, failure, or both. From the perspective of security, the auditing of failures is very important. Auditing of failures is important because it can highlight unauthorized or fraudulent access.

Network Requirements

Network security is as important as other forms of security. We cannot ignore the network security requirement. We need to consider the following issues:

  • Is it necessary to encrypt data before transferring it to the data warehouse?

  • Are there restrictions on which network routes the data can take?

These restrictions need to be considered carefully. Following are the points to remember:

  • The process of encryption and decryption will increase overheads. It requires more processing power and processing time.

  • The cost of encryption can be high if the system is already a loaded system, because the encryption is borne by the source system.

Data Movement

There exist potential security implications while moving the data. Suppose we need to transfer some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the following questions are raised:

  • Where is the flat file stored?
  • Who has access to that disk space?

If we talk about the backup of these flat files, the following questions are raised:

  • Do you back up encrypted or decrypted versions?
  • Do these backups need to be made to special tapes that are stored separately?
  • Who has access to these tapes?

Some other forms of data movement, like query result sets, also need to be considered. The questions raised while creating a temporary table are as follows:

  • Where is that temporary table to be held?
  • How do you make such a table visible?

We need to avoid the accidental flouting of security restrictions. If a user with access to the restricted data can generate accessible temporary tables, data can become visible to non-authorized users. We can overcome this issue by having a separate temporary area for users with access to restricted data.

Documentation

The audit and security requirements need to be properly documented. This will be treated as a part of justification. This document can contain all the information gathered from:

  • Data classification
  • User classification
  • Network requirements
  • Data movement and storage requirements
  • All auditable actions

Impact of Security on Design

Security affects the application code and the development timescales. Security affects the following areas:

  • Application development
  • Database design
  • Testing

Application Development

Security affects the overall application development and it also affects the design of the important components of the data warehouse such as the load manager, warehouse manager, and query manager. The load manager may require checking code to filter records and place them in different locations. More transformation rules may also be required to hide certain data. There may also be requirements for extra metadata to handle any extra objects.

To create and maintain extra views, the warehouse manager may require extra code to enforce security. Extra checks may have to be coded into the data warehouse to prevent it from being fooled into moving data into a location where it should not be available. The query manager requires changes to handle any access restrictions. The query manager will need to be aware of all extra views and aggregations.

Database Design

The database layout is also affected because when security measures are implemented, there is an increase in the number of views and tables. Adding security increases the size of the database and hence increases the complexity of the database design and management. It will also add complexity to the backup management and recovery plan.

Testing

Testing the data warehouse is a complex and lengthy process. Adding security to the data warehouse also affects the testing time complexity. It affects the testing in the following two ways:

  • It will increase the time required for integration and system testing.

  • There is added functionality to be tested, which will increase the size of the testing suite.

Data Warehousing – Backup

A data warehouse is a complex system and it contains a huge volume of data. Therefore it is important to back up all the data so that it becomes available for recovery in future as per requirement. In this chapter, we will discuss the issues in designing the backup strategy.

Backup Terminologies

Before proceeding further, you should know some of the backup terminologies discussed below.

  • Complete backup – It backs up the entire database at the same time. This backup includes all the database files, control files, and journal files.

  • Partial backup – As the name suggests, it does not create a complete backup of the database. Partial backup is very useful in large databases because it allows a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-to-day basis, so that the whole database is backed up effectively once a week (see the sketch after this list).

  • Cold backup – A cold backup is taken while the database is completely shut down. In a multi-instance environment, all the instances should be shut down.

  • Hot backup – A hot backup is taken when the database engine is up and running. The requirements of hot backup vary from RDBMS to RDBMS.

  • Online backup – It is very similar to hot backup.
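As a rough illustration of the round-robin partial backup idea, the sketch below cycles through the database segments one weekday at a time so the whole database is covered once a week; the segment names are assumptions made for the example.

# Database segments to be covered by partial backups (illustrative names).
segments = ["sales_q1", "sales_q2", "sales_q3", "sales_q4",
            "customer_dim", "product_dim", "metadata"]

def segment_to_back_up(day_of_week: int) -> str:
    """Pick tonight's segment; over seven nights the whole database is covered."""
    return segments[day_of_week % len(segments)]

for day, name in enumerate(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]):
    print(name, "->", segment_to_back_up(day))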

Hardware Backup

It is important to decide which hardware to use for the backup. The speed of processing the backup and restore depends on the hardware being used, how the hardware is connected, the bandwidth of the network, the backup software, and the speed of the server's I/O system. Here we will discuss some of the hardware choices that are available and their pros and cons. These choices are as follows:

  • Tape Technology
  • Disk Backups

Tape Technology

The tape choice can be categorized as follows:

  • Tape media
  • Standalone tape drives
  • Tape stackers
  • Tape silos

Tape Media

There exist several varieties of tape media. Some tape media standards are listed in the table below:

Tape Media  Capacity  I/O rates
DLT         40 GB     3 MB/s
3490e       1.6 GB    3 MB/s
8 mm        14 GB     1 MB/s

Other factors that need to be considered are as follows:

  • Reliability of the tape medium
  • Cost of tape medium per unit
  • Scalability
  • Cost of upgrades to the tape system
  • Shelf life of the tape medium

Standalone Tape Drives

The tape drives can be connected in the following ways:

  • Direct to the server
  • As network-available devices
  • Remotely to another machine

There can be issues in connecting the tape drives to a data warehouse.

  • Consider the server is a 48-node MPP machine. We do not know which node to connect the tape drive to, and we do not know how to spread the drives over the server nodes to get the optimal performance with minimum disruption of the server and minimum internal I/O latency.

  • Connecting the tape drive as a network-available device requires the network to be up to the job of the huge data transfer rates. Make sure that sufficient bandwidth is available during the time you require it.

  • Connecting the tape drives remotely also requires high bandwidth.

Tape Stackers

The method of loading multiple tapes into a single tape drive is known as a tape stacker. The stacker dismounts the current tape when it has finished with it and loads the next tape, hence only one tape is available at a time to be accessed. The price and the capabilities may vary, but the common ability is that they can perform unattended backups.

Tape Silos

Tape silos provide large storage capacities. Tape silos can store and manage thousands of tapes. They can integrate multiple tape drives. They have the software and hardware to label and store the tapes they hold. It is very common for the silo to be connected remotely over a network or a dedicated link. We should ensure that the bandwidth of the connection is up to the job.

Disk Backups

Methods of disk backups are:

  • Disk-to-disk backups
  • Mirror breaking

These methods are used in OLTP systems. These methods minimize the database downtime and maximize the availability.

Disk-to-Disk Backups

Here the backup is taken on disk rather than on tape. Disk-to-disk backups are done for the following reasons:

  • Speed of initial backups
  • Speed of restore

Backing up the data from disk to disk is much faster than to tape. However, it is an intermediate step of backup; later the data is backed up on tape. The other advantage of disk-to-disk backups is that it gives you an online copy of the latest backup.

Mirror Breaking

The idea is to have disks mirrored for resilience during the working day. When a backup is required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.

Note: The database may need to be shut down to guarantee consistency of the backup.

Optical Jukeboxes

Optical jukeboxes allow the data to be stored near line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or a tape silo. The drawback of this technique is that it has a slower write speed than disks. But the optical media provides long life and reliability, which makes them a good choice of medium for archiving.

Software Backups

There are software tools available that help in the backup process. These software tools come as a package. These tools not only take backups, they can effectively manage and control the backup strategies. There are many software packages available in the market. Some of them are listed in the following table:

Package Name  Vendor
Networker     Legato
ADSM          IBM
Epoch         Epoch Systems
Omniback II   HP
Alexandria    Sequent

Criteria for Choosing Software Packages

The criteria for choosing the best software package are listed below:

  • How scalable is the product as tape drives are added?
  • Does the package have a client-server option, or must it run on the database server itself?
  • Will it work in cluster and MPP environments?
  • What degree of parallelism is required?
  • What platforms are supported by the package?
  • Does the package support easy access to information about tape contents?
  • Is the package database aware?
  • What tape drives and tape media are supported by the package?

Data Warehousing – Tuning

A data warehouse keeps evolving and it is unpredictable what query the user is going to post in the future. Therefore it becomes more difficult to tune a data warehouse system. In this chapter, we will discuss how to tune the different aspects of a data warehouse such as performance, data load, queries, etc.

Difficulties in Data Warehouse Tuning

Tuning a data warehouse is a difficult procedure due to the following reasons:

  • A data warehouse is dynamic; it never remains constant.

  • It is very difficult to predict what query the user is going to post in the future.

  • Business requirements change with time.

  • Users and their profiles keep changing.

  • The user can switch from one group to another.

  • The data load on the warehouse also changes with time.

Note: It is very important to have complete knowledge of the data warehouse.

Performance Assessment

Here is a list of objective measures of performance:

  • Average query response time
  • Scan rates
  • Time used per query
  • Memory usage per process
  • I/O throughput rates

Following are the points to remember.

  • It is necessary to specify the measures in the service level agreement (SLA).

  • It is of no use trying to tune response times if they are already better than those required.

  • It is essential to have realistic expectations while making the performance assessment.

  • It is also essential that the users have feasible expectations.

  • To hide the complexity of the system from the user, aggregations and views should be used.

  • It is also possible that the user can write a query you had not tuned for.

Data Load Tuning

Data load is a critical part of overnight processing. Nothing else can run until the data load is complete. This is the entry point into the system.

Note: If there is a delay in transferring the data, or in the arrival of data, then the entire system is affected badly. Therefore it is very important to tune the data load first.

There are various approaches to tuning the data load that are discussed below:

  • The very common approach is to insert data using the SQL layer. In this approach, normal checks and constraints need to be performed. When the data is inserted into the table, the code will run to check for enough space to insert the data. If sufficient space is not available, then more space may have to be allocated to these tables. These checks take time to perform and are costly in terms of CPU.

  • The second approach is to bypass all these checks and constraints and place the data directly into preformatted blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with whole blocks of data. This can lead to some space wastage.

  • The third approach is that while loading the data into a table that already contains data, we can maintain the indexes.

  • The fourth approach says that to load the data into tables that already contain data, drop the indexes and recreate them when the data load is complete. The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.

Integrity Checks

Integrity checking highly affects the performance of the load. Following are the points to remember.

  • Integrity checks need to be limited because they require heavy processing power.

  • Integrity checks should be applied on the source system to avoid performance degradation of the data load.

Tuning Queries

We have two kinds of queries in a data warehouse:

  • Fixed queries
  • Ad hoc queries

Fixed Queries

Fixed queries are well defined. Following are examples of fixed queries:

  • Regular reports
  • Canned queries
  • Common aggregations

Tuning the fixed queries in a data warehouse is the same as in a relational database system. The only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these execution plans will allow us to spot changing data size and data skew, as they will cause the execution plan to change.

Note: We cannot do much more on the fact table, but while dealing with dimension tables or the aggregations, the usual collection of SQL tweaking, storage mechanisms, and access methods can be used to tune these queries.

Ad hoc Queries

To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For each user or group of users, you need to know the following:

  • The number of users in the group
  • Whether they use ad hoc queries at regular intervals of time
  • Whether they use ad hoc queries frequently
  • Whether they use ad hoc queries occasionally at unknown intervals
  • The maximum size of query they tend to run
  • The average size of query they tend to run
  • Whether they require drill-down access to the base data
  • The elapsed login time per day
  • The peak time of daily usage
  • The number of queries they run per peak hour

Points to Note

  • It is important to track the user's profiles and identify the queries that are run on a regular basis.

  • It is also important that the tuning performed does not affect the performance.

  • Identify similar and ad hoc queries that are frequently run.

  • If these queries are identified, then the database will change and new indexes can be added for those queries.

  • If these queries are identified, then new aggregations can be created specifically for those queries that would result in their efficient execution.
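A rough sketch of that tracking idea is given below: query profiles are counted, and any ad hoc query shape seen more than a threshold number of times is flagged as a candidate for a new index or aggregation. The query shapes and threshold are invented for the illustration.

from collections import Counter

# A "profile" here is just the set of dimensions a query groups by.
observed_queries = [
    ("item", "month"), ("item", "month"), ("region",),
    ("item", "month"), ("region", "month"), ("item", "month"),
]

AGGREGATION_THRESHOLD = 3   # assumed: 3+ occurrences justifies a precomputed aggregation

profile_counts = Counter(observed_queries)

candidates = [profile for profile, count in profile_counts.items()
              if count >= AGGREGATION_THRESHOLD]

print("query profiles seen:", dict(profile_counts))
print("build aggregations for:", candidates)   # [('item', 'month')]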

Data Warehousing – Testing

Testing is very important for data warehouse systems to make them work correctly and efficiently. There are three basic levels of testing performed on a data warehouse:

  • Unit testing
  • Integration testing
  • System testing

Unit Testing

  • In unit testing, each component is separately tested.

  • Each module, i.e., procedure, program, SQL script, Unix shell script, is tested.

  • This test is performed by the developer.

Integration Testing

  • In integration testing, the various modules of the application are brought together and then tested against a number of inputs.

  • It is performed to test whether the various components do well after integration.

System Testing

  • In system testing, the whole data warehouse application is tested together.

  • The purpose of system testing is to check whether the entire system works correctly together or not.

  • System testing is performed by the testing team.

  • Since the size of the whole data warehouse is very large, it is usually possible to perform only minimal system testing before the test plan can be enacted.

Test Schedule

First of all, the test schedule is created in the process of developing the test plan. In this schedule, we predict the estimated time required for the testing of the entire data warehouse system.

There are different methodologies available to create a test schedule, but none of them are perfect because the data warehouse is very complex and large. Also the data warehouse system is evolving in nature. One may face the following issues while creating a test schedule:

  • A simple problem may involve a large query that can take a day or more to complete, i.e., the query does not complete in the desired time scale.

  • There may be hardware failures such as losing a disk, or human errors such as accidentally deleting a table or overwriting a large table.

Note: Due to the above-mentioned difficulties, it is recommended to always double the amount of time you would normally allow for testing.

Testing Backup Recovery

Testing the backup recovery strategy is extremely important. Here is the list of scenarios for which this testing is needed:

  • Media failure
  • Loss or damage of table space or data file
  • Loss or damage of redo log file
  • Loss or damage of control file
  • Instance failure
  • Loss or damage of archive file
  • Loss or damage of table
  • Failure during data movement

Testing Operational Environment

There are a number of aspects that need to be tested. These aspects are listed below.

  • Security – A separate security document is required for security testing. This document contains a list of disallowed operations and devises tests for each of them.

  • Scheduler – Scheduling software is required to control the daily operations of a data warehouse. It needs to be tested during system testing. The scheduling software requires an interface with the data warehouse, which will need the scheduler to control overnight processing and the management of aggregations.

  • Disk Configuration – Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.

  • Management Tools – It is required to test all the management tools during system testing. Here is the list of tools that need to be tested:

    • Event manager
    • System manager
    • Database manager
    • Configuration manager
    • Backup recovery manager

Testing the Database

The database is tested in the following three ways:

  • Testing the database manager and monitoring tools – To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database.

  • Testing database features – Here is the list of features that we have to test:

    • Querying in parallel
    • Create index in parallel
    • Data load in parallel

  • Testing database performance – Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.

Testing the Application

  • All the managers should be integrated correctly and work together in order to ensure that the end-to-end load, index, aggregation, and queries work as per expectations.

  • Each function of each manager should work correctly.

  • It is also essential to test the application over a period of time.

  • Week-end and month-end tasks should also be tested.

Logistics of the Test

The aim of the system test is to test all of the following areas:

  • Scheduling software
  • Day-to-day operational procedures
  • Backup recovery strategy
  • Management and scheduling tools
  • Overnight processing
  • Query performance

Note: The most important point is to test the scalability. Failure to do so will leave us with a system design that does not work when the system grows.

Data Warehousing – Future Aspects

Following are the future aspects of data warehousing.

  • As we have seen, the size of the open database has grown to approximately double its magnitude in the last few years. This shows the significant value that it contains.

  • As the size of databases grows, the estimate of what constitutes a very large database continues to grow.

  • The hardware and software available today do not allow a large amount of data to be kept online. For example, a telco call record requires 10 TB of data to be kept online, and that is just the size of one month's records. If it is also required to keep records of sales, marketing, customers, employees, etc., then the size will be more than 100 TB.

  • The records contain textual information and some multimedia data. Multimedia data cannot be manipulated as easily as text data. Searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.

  • Apart from size planning, it is complex to build and run data warehouse systems that are ever increasing in size. As the number of users increases, the size of the data warehouse also increases. These users will also require access to the system.

  • With the growth of the Internet, there is a requirement for users to access data online.

Hence the future shape of the data warehouse will be very different from what is being created today.

Data Warehousing – Interview Questions

Dear readers, these Data Warehousing Interview Questions have been designed especially to get you acquainted with the nature of questions you may encounter during your interview for the subject of Data Warehousing.

Q: Define data warehouse.

A : A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making process.

Q: What does subject-oriented data warehouse signify?

A : Subject-oriented signifies that the data warehouse stores the information around a particular subject such as product, customer, sales, etc.

Q: List any five applications of data warehouse.

A : Some applications include financial services, banking services, consumer goods, retail sectors, and controlled manufacturing.

Q: What do OLAP and OLTP stand for?

A : OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online Transaction Processing.

Q: What is the basic difference between a data warehouse and operational databases?

A : A data warehouse contains historical information that is made available for analysis of the business, whereas an operational database contains current information that is required to run the business.

Q: List the schemas that a data warehouse system can implement.

A : A data warehouse can implement a star schema, a snowflake schema, and a fact constellation schema.

Q: What is Data Warehousing?

A : Data Warehousing is the process of constructing and using the data warehouse.

Q: List the processes that are involved in Data Warehousing.

A : Data Warehousing involves data cleaning, data integration, and data consolidation.

Q: List the functions of data warehouse tools and utilities.

A : The functions performed by data warehouse tools and utilities are data extraction, data cleaning, data transformation, data loading, and refreshing.

Q: What do you mean by data extraction?

A : Data extraction means gathering data from multiple heterogeneous sources.

Q: Define metadata.

A : Metadata is simply defined as data about data. In other words, we can say that metadata is the summarized data that leads us to the detailed data.

Q: What does a metadata repository contain?

A : A metadata repository contains the definition of the data warehouse, business metadata, operational metadata, data for mapping from the operational environment to the data warehouse, and the algorithms for summarization.

Q: How does a Data Cube help?

A : A data cube helps us represent data in multiple dimensions. The data cube is defined by dimensions and facts.
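
For instance, here is a minimal Python sketch of a cube whose cells are addressed by one value per dimension and hold a sales fact; the dimension names (time, item, location) and the measure are illustrative assumptions only.

```
# Each cell of the cube is addressed by one value per dimension (time, item, location)
# and holds a fact/measure (here, sales in units).
dimensions = ("time", "item", "location")

cube = {
    ("Q1", "mobile", "Delhi"):   100,
    ("Q1", "modem",  "Delhi"):    50,
    ("Q2", "mobile", "Chennai"):  80,
}

# Looking up a fact means supplying one value for every dimension.
cell = ("Q1", "mobile", "Delhi")
print(dict(zip(dimensions, cell)), "-> sales:", cube[cell])
```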

Q: Define dimension.

A : Dimensions are the entities with respect to which an enterprise keeps its records.

Q: Explain data mart.

A : A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization. In other words, we can say that a data mart contains data specific to a particular group.

Q: What is a virtual warehouse?

A : The view over an operational data warehouse is known as a virtual warehouse.

Q: List the phases involved in the data warehouse delivery process.

A : The phases are IT Strategy, Education, Business Case Analysis, Technical Blueprint, Build the Version, History Load, Ad hoc Query, Requirement Evolution, Automation, and Extending Scope.

Q: Define load manager.

A : A load manager performs the operations required to extract and load the data. The size and complexity of a load manager varies between specific solutions from one data warehouse to another.

Q: Define the functions of a load manager.

A : A load manager extracts data from the source system, fast-loads the extracted data into a temporary data store, and performs simple transformations into a structure similar to the one in the data warehouse.
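
A minimal Python sketch of those three steps is shown below, assuming an in-memory CSV export stands in for the source system and SQLite stands in for the staging area and warehouse; the table names, columns, and transformation are illustrative assumptions.

```
import csv
import io
import sqlite3

# Hypothetical CSV export standing in for the real source system.
SOURCE_CSV = """customer,amount,sold_on
alice,120.50,2024-01-15
bob,75.00,2024-01-28
"""

def extract(csv_text):
    """Extract rows from the source system (here, an in-memory CSV export)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def fast_load(conn, rows):
    """Fast-load the extracted rows, unchanged, into a temporary staging table."""
    conn.execute("CREATE TEMP TABLE staging (customer TEXT, amount TEXT, sold_on TEXT)")
    conn.executemany("INSERT INTO staging VALUES (:customer, :amount, :sold_on)", rows)

def simple_transform(conn):
    """Apply simple transformations so the data matches the warehouse structure."""
    conn.execute("CREATE TABLE sales_fact (customer TEXT, amount REAL, sold_on TEXT)")
    conn.execute(
        "INSERT INTO sales_fact "
        "SELECT UPPER(customer), CAST(amount AS REAL), sold_on FROM staging"
    )

conn = sqlite3.connect(":memory:")
fast_load(conn, extract(SOURCE_CSV))
simple_transform(conn)
print(conn.execute("SELECT * FROM sales_fact").fetchall())
```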

Q: Define a warehouse manager.

A : The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.

Q: Define the functions of a warehouse manager.

A : The warehouse manager performs consistency and referential integrity checks; creates indexes, business views, and partition views against the base data; transforms and merges the source data from the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives the data that has reached the end of its captured life.

Q: What is summary information?

A : Summary information is the area in the data warehouse where the predefined aggregations are kept.

Q: What is the query manager responsible for?

A : The query manager is responsible for directing the queries to the suitable tables.

Q: List the types of OLAP servers.

A : There are four types of OLAP servers, namely Relational OLAP, Multidimensional OLAP, Hybrid OLAP, and Specialized SQL Servers.

Q: Which one is faster, Multidimensional OLAP or Relational OLAP?

A : Multidimensional OLAP is faster than Relational OLAP.

Q: List the functions performed by OLAP.

A : OLAP performs functions such as roll-up, drill-down, slice, dice, and pivot.

Q: How many dimensions are selected in a slice operation?

A : Only one dimension is selected for the slice operation.

Q: How many dimensions are selected in a dice operation?

A : For the dice operation, two or more dimensions are selected for a given cube.
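
To make the last three answers concrete, here is a minimal pure-Python sketch of roll-up, slice (one dimension fixed), and dice (selections on two or more dimensions) over a tiny set of fact rows; the dimensions (time, item, location) and the sales measure are illustrative assumptions.

```
from collections import defaultdict

# Hypothetical fact rows: each row has one value per dimension plus a sales measure.
facts = [
    {"time": "Q1", "item": "mobile", "location": "Delhi",   "sales": 100},
    {"time": "Q1", "item": "modem",  "location": "Delhi",   "sales":  50},
    {"time": "Q2", "item": "mobile", "location": "Chennai", "sales":  80},
    {"time": "Q2", "item": "modem",  "location": "Chennai", "sales":  30},
]

def roll_up(rows, dimension):
    """Roll-up: aggregate the sales measure along one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["sales"]
    return dict(totals)

def slice_(rows, dimension, value):
    """Slice: fix ONE dimension to a single value, yielding a sub-cube."""
    return [row for row in rows if row[dimension] == value]

def dice(rows, criteria):
    """Dice: select values on TWO OR MORE dimensions, e.g. {'time': {'Q1'}, 'item': {'mobile'}}."""
    return [
        row for row in rows
        if all(row[dim] in allowed for dim, allowed in criteria.items())
    ]

print(roll_up(facts, "location"))                        # {'Delhi': 150, 'Chennai': 110}
print(slice_(facts, "time", "Q1"))                       # only Q1 rows
print(dice(facts, {"time": {"Q1"}, "item": {"mobile"}})) # Q1 mobile rows
```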

Q: How many fact tables are there in a star schema?

A : There is only one fact table in a star schema.

Q: What is normalization?

A : Normalization splits up the data into additional tables.

Q: Out of the star schema and the snowflake schema, whose dimension tables are normalized?

A : Snowflake schema uses the concept of normalization.

Q: What is the benefit of normalization?

A : Normalization helps in reducing data redundancy.
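
The following sketch contrasts the two schemas using SQL DDL run through Python's sqlite3 module; the table and column names (an item dimension with supplier attributes) are illustrative assumptions, not a prescribed design.

```
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: a single fact table references a denormalized dimension table.
conn.executescript("""
CREATE TABLE dim_item_star (
    item_key      INTEGER PRIMARY KEY,
    item_name     TEXT,
    supplier_name TEXT          -- supplier attributes repeated per item (redundancy)
);
CREATE TABLE sales_fact (
    item_key   INTEGER REFERENCES dim_item_star(item_key),
    units_sold INTEGER
);
""")

# Snowflake schema: the same dimension is normalized into an additional table,
# which reduces redundancy at the cost of extra joins.
conn.executescript("""
CREATE TABLE dim_supplier (
    supplier_key  INTEGER PRIMARY KEY,
    supplier_name TEXT
);
CREATE TABLE dim_item_snowflake (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    supplier_key INTEGER REFERENCES dim_supplier(supplier_key)
);
""")

print([r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])
```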

Q: Which language is used for schema definition?

A : Data Mining Query Language (DMQL) is used for schema definition.

Q: What language is DMQL based on?

A : DMQL is based on Structured Query Language (SQL).

Q: What are the reasons for partitioning?

A : Partitioning is done for various reasons, such as easier management, assisting backup and recovery, and enhancing performance.
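
A minimal sketch of horizontal range partitioning by month is shown below; the fact rows and the date field are illustrative assumptions, and a real warehouse would map each bucket to its own table, file, or tablespace.

```
from collections import defaultdict

# Hypothetical fact rows, each carrying a sale date.
rows = [
    {"sale_id": 1, "sold_on": "2024-01-15", "amount": 120.0},
    {"sale_id": 2, "sold_on": "2024-01-28", "amount": 75.0},
    {"sale_id": 3, "sold_on": "2024-02-03", "amount": 210.0},
]

def partition_by_month(rows):
    """Range-partition rows into per-month buckets (one partition per 'YYYY-MM')."""
    partitions = defaultdict(list)
    for row in rows:
        month = row["sold_on"][:7]                 # 'YYYY-MM'
        partitions[f"sales_{month}"].append(row)
    return dict(partitions)

# Managing, backing up, or querying a single month now touches only its partition.
for name, part in partition_by_month(rows).items():
    print(name, len(part), "rows")
```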

Q: What kind of costs are involved in data marting?

A : Data marting involves hardware and software cost, network access cost, and time cost.
