
Business Objects Data Services Standards & Best Practices


1 Introduction


Objective

To identify the best practices to be followed when developing data conversion routines using Business Objects Data Services.

 

About BODS

Business Objects Data Services provides tools for data integration and data quality within a single product. Data Integrator and Data Quality were formerly two separate products; they have been merged into Data Services, combining the strengths of both into a single best-of-breed product. Data Services provides an intuitive development user interface (UI), an administration environment and a single runtime architecture.

 

2 Naming Standards


2.1 Overview


The use of naming conventions within SAP Business Objects Data Services (DS) helps to support a single- or multi-user development environment in a controlled fashion. It also assists in the production of documentation: with correct naming and object descriptions, DS can produce component-based documentation through its Auto Documentation tool, found within the Management Console web application.

The following sections describe the naming conventions for each type of object in DS.

The use of naming conventions can result in long names being used. To avoid very long object names being truncated in the Designer workspace, it is possible to increase the number of characters displayed for an object. To do so:

 

In the DS Designer, open Tools > Options:

 

The parameter “Number of characters in workspace icon name” defines the maximum number of characters displayed in the workspace. Set this parameter to the desired value.

 

As a general note, DS object names should not have the following embedded in them:

 

  • Object versions (e.g. naming a data flow DF_LOAD_SALES_V0.3). Versioning should be handled by central repositories, not by naming conventions.
  • Environment-specific information (e.g. naming a datastore DS_WAREHOUSE_DEV_1). Environment information should be configured using datastore configurations, not by creating different names for each datastore.

 

 

 

 

 

2.2 Reusable Objects

 


  • Project: PRJ_<Module_Name><RICEFW ID>  (e.g. PRJ_FI2365)
  • Job: JOB_<Module_Name><RICEFW ID>  (e.g. JOB_FI2365)
  • Script: SCR_<Description>  (script names should start with SCR_)
  • Work Flow: WF_<Object_Name>_<Process>  (e.g. WF_VendorMaster_BasicData)
  • Data Flow: DF_<Object_Name>_<Sub_Process>  (e.g. DF_VendorMaster_CompanyCode_Map)
 

2.3 Sources and Targets

 


  • Datastore that connects to a database: DS_<Description>  (e.g. DS_Source)
  • Datastore that connects to a web service: DS_WS_{Description}  (e.g. DS_WS_Customers)
  • Datastore that connects to a custom adapter: DS_{Type}_{Description}  (e.g. DS_HTTP_Legacy_Customers)
  • Application datastore that connects to an application, e.g. SAP R/3: AP_{Application}_{Description}  (e.g. DS_R3_Finance)
  • Application datastore that connects to an SAP BW source: AP_BW_{Description}  (e.g. DS_BW_Sales)
  • File format template: FMT_{Delimiter}_{Description}, where Delimiter is CSV, TAB or FIX  (e.g. FMT_CSV_Customers)
  • Files: text files FMT_<Description>; Excel files <ObjectName>_<Description>
  • DTDs: DTD_{Name}  (e.g. DTD_Customer_Hierarchy)
  • XSD schemas: XSD_{Name}  (e.g. XSD_Customer_Hierarchy)
  • SAP IDocs: IDC_{Name}  (e.g. IDC_Sap_Customers)
  • COBOL copybooks: CCB_{Name}  (e.g. CCB_Account)

 

2.4 Work Flow Objects


  • Script: SCR_<Description>  (script names should start with SCR_)
  • Condition: CON_{Description}
  • While loop: WHL_{Description}  (e.g. WHL_No_More_Files)
  • Try: TRY_{Description}  (e.g. TRY_Dimension_Load)
  • Catch: CAT_{Description}_{Error group}  (e.g. CAT_Dimension_Load_All)

 

2.5 Variables

 


  • Global variable: $G_<Description>  (e.g. $G_Time_Format)
  • Parameter variable, input: $P_I_<Description>  (e.g. $P_I_File_Name)
  • Parameter variable, output: $P_O_<Description>  (e.g. $P_O_Customer_ID)
  • Parameter variable, input/output: $P_IO_<Description>  (e.g. $P_IO_File_Name)
  • Local variable: $L_<Description>  (e.g. $L_Message_Type)
  • Substitution variables

 

 

 

2.6 Transforms


  • Case: CSE_{Description}  (e.g. CSE_JOB_STATUS)
  • Date Generation: DTE_{Description}  (e.g. DTE_GENERATION)
  • Data Transfer: DTF_{Description}  (e.g. DTF_STAGE1)
  • Effective Date: EFD_{Description}  (e.g. EFD_EFFECTIVE_TILL_DATE)
  • Hierarchy Flattening (Horizontal): HFH_{Description}  (e.g. HFH_EMPLOYEE_ID)
  • Hierarchy Flattening (Vertical): HFV_{Description}  (e.g. HFV_EMPLOYEE_ID)
  • History Preservation: HSP_{Description}  (e.g. HSP_ABSENCE_DATA)
  • Map CDC Operation: CDC_{Description}  (e.g. CDC_CHANGED_EMPLOYEE_ID)
  • Map Operation: MAP_{Description}  (e.g. MAP_ADDRESSES)
  • Merge: MRG_{Description}  (e.g. MRG_CUSTOMERS)
  • Pivot: PVT_{Description}  (e.g. PVT_MATERIAL_NUMBER)
  • Query: QRY_{Description}  (e.g. QRY_JOIN_EMPLOYEE_ID)
  • Reverse Pivot: RPT_{Description}  (e.g. RPT_MATERIAL_NUMBER)
  • Row Generation: ROWGEN_{Number of rows}  (e.g. ROWGEN_50)
  • SQL: SQL_{Description}  (e.g. SQL_JOIN_VENDORS)
  • Table Comparison: TCP_{Target table}  (e.g. TCP_STAGE1)
  • Validation: VAL_{Description}  (e.g. VAL_ORG_MANAGEMENT)
  • XML Pipeline: XPL_{Description}  (e.g. XPL_Cust_Hierarchy)
  • Key Generation: KEYGEN_{Description}  (e.g. KEYGEN_ID)
  • Mandatory check: Validate_Mandatory
  • Value mapping failure check: Validate_Lookups
  • Date format check: Validate_Format
  • Referential integrity check: Validate_Integrity

 

 

3 General Design Standards


3.1 Batch Jobs

Batch jobs should generally contain all the logic for a related set of activities. The content and functionality of each job should be driven by the scheduling requirements. This generally separates jobs by the source system accessed and by frequency of execution, i.e. for each period (nightly, weekly, etc.) that needs to be delivered. Different systems have different availability times, and hence the jobs will have different scheduling requirements.

 

Jobs should also be built with the following guidelines:

 

  • Work flows should be the only objects used at the job level. The only exceptions are try/catch blocks and conditionals where job-level replication is required.
  • Parallel work flows should be avoided at the job level, as try/catch cannot be applied when items run in parallel.

 

3.2 Real-Time Jobs


Real-time jobs should only be considered when there is a need to process XML messages in real time, or where real-time integration is required with another application, e.g. SAP R/3 IDocs. Real-time jobs should not be used where:

 

  • Systems only need to be near real-time. A better approach is to create a batch job and run it regularly (e.g. every 5 minutes).
  • Complex ETL processing is required, such as aggregations.

 

Often real-time jobs will be used to process XML into a staging area, and a batch job will run regularly to complete the processing and perform aggregations and other complex business functions.

 

3.3 Comments


Comments should be included throughout DS jobs.  With the Auto Documentation functionality of DS, comments can be passed straight through into the technical documentation. Comments should be added in the following places:

  • Description field of each object. Every reusable object (job, work flow, data flow, etc.) has a description field available. This should include the author, date, and a short description of the object.
  • Scripts and functions – comments are indicated by a # in scripts and functions. At the top of any code there should be the author, create date, and a short description of the script; comments should also be included within the code to describe tasks that are not self-explanatory (see the sketch after this list).
  • Annotations – these should be used to describe areas of a work flow or data flow that are not self-explanatory. It is not necessary to clutter the design areas with irrelevant comments such as “this query joins the table”.
  • Field comments – tables should have comments attached to each field. These can be manually entered, imported from the database, or imported from any tool that supports the Common Warehouse Metamodel (CWM).
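
As a sketch only, a script header in the DS scripting language might look like the following; the script name and author details are placeholders:

    # Script:      SCR_Initialise_Globals
    # Author:      <author name>
    # Date:        <create date>
    # Description: Assigns the global variables used by the job.
    # Comments such as these flow through to the Auto Documentation output.
    $G_Debug = 0;   # default to normal (non-debug) execution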

 

3.4 Global Variables


Variables that are specific to a data flow or work flow should not be declared as global variables. They should be declared as local variables and passed as parameters to the dependent objects. The reasoning behind this is twofold. Firstly, because DS can run these objects in a sequential or parallel execution framework, local variables and parameters allow values to be modified without affecting other processes. Secondly, work flows and data flows can be reused in multiple jobs; by declaring local variables and parameters, you break the reliance on job-level global variables having been configured and assigned the appropriate values. Some examples of variables that should be defined locally are:

  • The filename for a flat file source for a data flow to load
  • Incremental variables used for conditionals or while-loops

 

The global variables that are used should be standardised across the user community. Some examples of valid global variables are:

 

  • $G_Recovery – Recovery flag: indicates that the job should be executed in recovery mode.
  • $G_Start_Datetime – Start date-time: the date and time from which the job should start loading data; often the finish time of the last execution.
  • $G_End_Datetime – End date-time: the date and time up to which the job should load data; set when the job starts in order to avoid overlaps.
  • $G_Debug – Debug flag: tells the job to run in debug mode, which allows custom debug commands to run.
  • $G_Log – Log flag: tells the job to run in logging mode.
  • $G_Exec_ID – Execution ID: identifies the current execution of the job; used as a reference point when writing to audit tables.
  • $G_Job_ID – Job ID: identifies the job; used as a reference point when writing to audit tables.
  • $G_Version – Version number: a constant that represents the version of the job.
  • $G_DB_Type – Database type: when developing generic jobs, it is often useful to know the underlying database type (SQL Server, Oracle, etc.).
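
As a minimal sketch only, an initialisation script at the start of a job might assign these standard globals as follows; the DS_Admin datastore and JOB_CONTROL audit table are hypothetical, and the audit design will vary by project:

    # SCR_Initialise_Globals - assign the standard global variables at job start.
    $G_Version        = '1.0';
    $G_Recovery       = 'N';
    $G_Debug          = 0;
    $G_Log            = 1;
    $G_Job_ID         = job_name();   # for illustration; many designs use a numeric ID from an audit table
    # Capture the end of the load window at job start to avoid overlaps.
    $G_End_Datetime   = sysdate();
    # The start of the window is the finish time of the last successful run
    # (curly braces substitute and quote a variable's value inside the SQL string).
    $G_Start_Datetime = nvl(sql('DS_Admin', 'SELECT MAX(END_DATETIME) FROM JOB_CONTROL WHERE JOB_NAME = {$G_Job_ID}'),
                            to_date('1900.01.01', 'YYYY.MM.DD'));
    $G_Exec_ID        = sql('DS_Admin', 'SELECT MAX(EXEC_ID) + 1 FROM JOB_CONTROL');   # simplistic sequence for illustration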

 

3.5 Work Flows


The following guidelines should be followed when building work flows:

 

  • Objects can be left unconnected to run in parallel if they are not dependent on each other. Parallel execution is particularly useful for work flows that replicate a large number of tables into a different environment, or that mass-load flat files (common in extract jobs). However, care needs to be taken when running parallel data flows, particularly if they use the same source and target tables. A limit can be set on the number of available parallel execution streams under Tools > Options > Job Server > Environment (the default is 8) within the DS Designer.
  • Work flows should not rely on global variables for local tasks; instead, local variables should be declared and passed as parameters to the data flows that require them. It is acceptable to use global variables for environment and global references; however, apart from the “initialisation” work flow that starts a job, work flows should generally only reference global variables, not modify them.

 

3.6 Try/Catch


Try/catch objects should generally be used at the start and at the end of a job. The catch block can be used to log a failure to audit tables, notify someone of the failure, or provide other required custom functionality. Try/catch objects can be placed at the job and work flow level and can also be programmatically referenced within the scripting language.

 

Generally, try/catch should not be used the way it would be in typical programming languages such as Java; in DS, if something goes wrong, the best approach is usually to stop all processing and investigate.

 

It is quite common for the catch object to contain a script that raises an exception (using the raise_exception() or raise_exception_ext() functions). This allows the error to be trapped and logged, while at the same time the job is marked with a red light in the DS Administrator to indicate that it failed.
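
A sketch of such a catch script follows; the AUDIT_ERROR table and DS_Admin datastore are assumptions, and $L_Error_No/$L_Error_Msg are assumed to be declared as local variables of the job:

    # SCR_Catch_All - log the failure, then re-raise so the job is marked as failed.
    $L_Error_No  = error_number();     # error_number(), error_message() and error_context()
    $L_Error_Msg = error_message();    # are available inside a catch block
    print('Job ' || job_name() || ' failed: ' || $L_Error_Msg);
    sql('DS_Admin', 'INSERT INTO AUDIT_ERROR (EXEC_ID, ERROR_NO, ERROR_MSG) VALUES ({$G_Exec_ID}, {$L_Error_No}, {$L_Error_Msg})');
    raise_exception('Job ' || job_name() || ' failed - see AUDIT_ERROR for details.');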

 

3.7 While Loops


While loops are mostly used for jobs that need to load a series of flat files or XML files and perform additional functions on them, such as moving them to a backup directory and updating control tables to indicate load success or failure.

 

The same standards regarding the use of global variables should also be applied to while loops.  This means variables that need to be updated (such as an iteration variable) should be declared as local variables.  The local variables should be passed to underlying data flows using parameters.
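
As an illustrative sketch (the file path, control table and $L_More_Files variable are assumptions), the pieces of such a loop might look like this:

    # Condition entered on the While Loop object itself:
    #     $L_More_Files = 1
    #
    # SCR_Check_For_File - run before the loop and again at the end of each iteration:
    $L_More_Files = file_exists('/data/inbound/customers.csv');   # 1 if the file is waiting, 0 otherwise
    # A data flow then loads the file (its name passed in as a parameter), after which a
    # closing script moves the processed file to a backup directory and updates the
    # control table before the loop condition is evaluated again.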

 

3.8 Conditionals


Conditionals are used to choose which object(s) should be used for a particular execution.  Conditionals can contain all objects that a work flow can.  They are generally used for the following types of tasks:

 

  • Indicating whether a job should run in recovery mode or not.
  • Indicating whether a job should perform an initial or a delta load (see the sketch after this list).
  • Indicating whether a job is the nightly batch or the weekly batch (e.g. the weekly batch may include additional business processing).
  • Indicating whether parts of a job should be executed, such as executing the extract, cleanse and conform steps but not the deliver step.
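
For example (a minimal sketch; the $G_Load_Type variable, DS_Admin datastore and JOB_CONTROL table are assumptions), a script before the conditional can set the driving flag, and the conditional's if-expression is then a simple test:

    # SCR_Set_Load_Type - decide whether this run is an initial or a delta load.
    $G_Load_Type = ifthenelse(sql('DS_Admin', 'SELECT COUNT(*) FROM JOB_CONTROL WHERE JOB_NAME = {$G_Job_ID}') = 0,
                              'INITIAL', 'DELTA');
    # If-expression entered on the Conditional object: $G_Load_Type = 'INITIAL'
    # The Then branch holds the initial-load work flow, the Else branch the delta work flow.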

 

 

 

 

3.9 Scripts and Custom Functions


The following guidelines should be followed when building scripts and custom functions:

 

  • The sql() function should be used only as a last resort, because tables accessed through the sql() function are not visible in the metadata manager. The lookup_ext() function can be used for lookup-related queries, and a data flow should be built for insert/update/delete queries.
  • Custom functions should be written where the logic is too complex to write directly into the mapping part of a data flow, or where the logic needs to be componentised, reused and documented in more detail.
  • Global variables should never be referenced in a custom function; they should be passed in and out as parameters. A custom function can be shared across multiple jobs, so referencing job-level global variables is bad practice (see the sketch after this list).
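
As an illustrative sketch (the function name and parameters are hypothetical), a custom function body should work only with its own parameters rather than job-level globals:

    # Custom function CF_Safe_To_Date
    # Input parameters: $P_I_Date_String (varchar), $P_I_Format (varchar)
    # Returns the parsed date, or NULL when the string cannot be parsed - no globals are referenced.
    $L_Result = ifthenelse(is_valid_date($P_I_Date_String, $P_I_Format) = 1,
                           to_date($P_I_Date_String, $P_I_Format),
                           NULL);
    Return $L_Result;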

 

Note the following areas to be careful of when using custom functions:

  • Custom functions can prevent a data flow’s SQL from being pushed down effectively to the database. This often happens when a custom function is used in the WHERE clause of a query.
  • Calling custom functions in high volume data flows can cause performance degradation (particularly where parallel execution is used).

 

3.10 Data Flows


In general, a data flow should be designed to load information from one or more sources into a single target.  A single data flow should generally not have multiple tables as a target.  Exceptions are:

  • Writing to auditing tables (e.g. writing out the row count).
  • Writing invalid rows to a backup table.

 

The following items should be considered best practices in designing efficient and clean data flows:

  • All template and temporary tables should be imported, approved and optimized by database experts before releasing to a production environment.
  • The pushdown SQL should be reviewed to ensure indexes and partitions are being used efficiently.
  • All redundant code (such as useless transforms or extra fields) should be removed before releasing.
  • Generally the most efficient method of building a data flow is to use as few transforms as possible.

 

There are several common practices that can cause instability and performance problems in the design of data flow.  These are mostly caused when DS needs to load entire datasets into memory in order to achieve a task.  Some tips to avoid these are as follows:

  • Ensure all source tables in the data flow are from the same datastore, thus allowing the entire SQL command to be pushed down to the database.
  • Each data flow should use only one main target table (this excludes tables used for auditing and for rejected rows).
  • Generally the pushdown SQL should contain only one SQL command.  There are cases where more commands are acceptable. For example, if one of the tables being queried only returns a small number of rows. However, generally multiple SQL commands will mean that DS needs to perform in-memory joins, which can cause memory problems.
  • Check that all order by, where and group by clauses in the queries are included in the pushdown SQL.
  • If reverse pivot transforms are used, check that the input volume is known and consistent and can therefore be tested.
  • If the PRE_LOAD_CACHE option is being used on lookups, ensure that the translation table dataset is small enough to fit into memory and will always be of a comparable size.
  • Always try to use the sorted input option on table comparisons, taking care to ensure that the input is sorted in the pushdown SQL.
