
Business Objects Data Services Standards & Best Practices


1 Introduction


Objective

To identify the best practices to be followed when developing data conversion routines using Business Objects Data Services.

 

About BODS

Business Objects Data Services provides tools for data integration and data quality within a single product. Data Integrator and Data Quality were formerly two separate products; they have been merged into Data Services, combining the strengths of both into a single best-of-breed product. Data Services provides an intuitive development user interface (UI), an administration environment and a single runtime architecture.

 

2 Naming Standards


2.1 Overview


The use of naming conventions within SAP Business Objects Data Services (DS) helps to support a single- or multi-user development environment in a controlled fashion. It also assists in the production of documentation: with correct naming and object descriptions, DS can produce component-based documentation through its Auto Documentation tool, found within the Management Console web application.

The following sections describe the naming conventions for each type of object in DS.

The use of naming conventions can result in long names being used. To avoid very long object names being truncated in the Designer workspace, it is possible to increase the number of characters displayed for an object. To do so:

 

In the DS Designer, open Tools > Options:

 

The parameter “Number of characters in workspace icon name” defines the maximum number of characters displayed in the workspace. Set this parameter to the desired value.

 

As a general note, DS object names should not have the following embedded in them:

 

  • Object versions (e.g. naming a data flow DF_LOAD_SALES_V0.3). Versioning should be handled by central repositories, not by naming conventions.
  • Environment-specific information (e.g. naming a datastore DS_WAREHOUSE_DEV_1). Environment information should be configured using datastore configurations, not by creating different names for each datastore.

 

 

 

 

 

2.2 Reusable Objects

 


  • Project: PRJ_<Module_Name><RICEFW ID>  (e.g. PRJ_FI2365)
  • Job: JOB_<Module_Name><RICEFW ID>  (e.g. JOB_FI2365)
  • Script: SCR_<Description>  (script names should start with SCR_)
  • Work Flow: WF_<Object_Name>_<Process>  (e.g. WF_VendorMaster_BasicData)
  • Data Flow: DF_<Object_Name>_<Sub_Process>  (e.g. DF_VendorMaster_CompanyCode_Map)
 

2.3 Sources and Targets

 


  • Datastore that connects to a database: DS_<Description>  (e.g. DS_Source)
  • Datastore that connects to a web service: DS_WS_{Description}  (e.g. DS_WS_Customers)
  • Datastore that connects to a custom adapter: DS_{Type}_{Description}  (e.g. DS_HTTP_Legacy_Customers)
  • Application datastore that connects to an application, e.g. SAP R/3: AP_{Application}_{Description}  (e.g. DS_R3_Finance)
  • Application datastore that connects to an SAP BW source: AP_BW_{Description}  (e.g. DS_BW_Sales)
  • File format template: FMT_{Delimiter}_{Description}, where Delimiter is CSV, TAB or FIX  (e.g. FMT_CSV_Customers)
  • Files: text files FMT_<Description>; Excel files <ObjectName>_<Description>
  • DTDs: DTD_{Name}  (e.g. DTD_Customer_Hierarchy)
  • XSD schemas: XSD_{Name}  (e.g. XSD_Customer_Hierarchy)
  • SAP IDocs: IDC_{Name}  (e.g. IDC_Sap_Customers)
  • COBOL copybooks: CCB_{Name}  (e.g. CCB_Account)

 

2.4 Work Flow Objects


  • Script: SCR_<Description>  (script names should start with SCR_)
  • Condition: CON_{Description}
  • While loop: WHL_{Description}  (e.g. WHL_No_More_Files)
  • Try: TRY_{Description}  (e.g. TRY_Dimension_Load)
  • Catch: CAT_{Description}_{Error group}  (e.g. CAT_Dimension_Load_All)

 

2.5 Variables

 


  • Global variable: $G_<Description>  (e.g. $G_Time_Format)
  • Parameter variable, input: $P_I_<Description>  (e.g. $P_I_File_Name)
  • Parameter variable, output: $P_O_<Description>  (e.g. $P_O_Customer_ID)
  • Parameter variable, input/output: $P_IO_<Description>  (e.g. $P_IO_File_Name)
  • Local variable: $L_<Description>  (e.g. $L_Message_Type)
  • Substitution variables

 

 

 

2.6 Transforms


  • Case: CSE_{Description}  (e.g. CSE_JOB_STATUS)
  • Date Generation: DTE_{Description}  (e.g. DTE_GENERATION)
  • Data Transfer: DTF_{Description}  (e.g. DTF_STAGE1)
  • Effective Date: EFD_{Description}  (e.g. EFD_EFFECTIVE_TILL_DATE)
  • Hierarchy Flattening (Horizontal): HFH_{Description}  (e.g. HFH_EMPLOYEE_ID)
  • Hierarchy Flattening (Vertical): HFV_{Description}  (e.g. HFV_EMPLOYEE_ID)
  • History Preservation: HSP_{Description}  (e.g. HSP_ABSENCE_DATA)
  • Map CDC Operation: CDC_{Description}  (e.g. CDC_CHANGED_EMPLOYEE_ID)
  • Map Operation: MAP_{Description}  (e.g. MAP_ADDRESSES)
  • Merge: MRG_{Description}  (e.g. MRG_CUSTOMERS)
  • Pivot: PVT_{Description}  (e.g. PVT_MATERIAL_NUMBER)
  • Query: QRY_{Description}  (e.g. QRY_JOIN_EMPLOYEE_ID)
  • Reverse Pivot: RPT_{Description}  (e.g. RPT_MATERIAL_NUMBER)
  • Row Generation: ROWGEN_{Number of rows}  (e.g. ROWGEN_50)
  • SQL: SQL_{Description}  (e.g. SQL_JOIN_VENDORS)
  • Table Comparison: TCP_{Target table}  (e.g. TCP_STAGE1)
  • Validation: VAL_{Description}  (e.g. VAL_ORG_MANAGEMENT)
  • XML Pipeline: XPL_{Description}  (e.g. XPL_Cust_Hierarchy)
  • Key Generation: KEYGEN_{Description}  (e.g. KEYGEN_ID)
  • Mandatory check: Validate_Mandatory
  • Value mapping failure check: Validate_Lookups
  • Date format check: Validate_Format
  • Referential integrity check: Validate_Integrity

 

 

3 General Design Standards


3.1 Batch Jobs

Batch jobs should generally contain all the logic for a related set of activities. The content and functionality of each job should be driven by the scheduling requirements. This generally separates jobs by the source system accessed and by frequency of execution, i.e. for each period (nightly, weekly, etc.) that needs to be delivered. Different systems have different availability times, and hence the jobs will have different scheduling requirements.

 

Jobs should also be built with the following guidelines:

 

  • Work flows should be the only objects used at the job level. The only exceptions are try/catch blocks and conditionals where job-level replication is required.
  • Parallel work flows should be avoided at the job level, as try/catch cannot be applied when items run in parallel.

 

3.2 Real-Time Jobs


Real-time jobs should only be considered when there is a need to process XML messages in real time, or where real-time integration is required with another application, e.g. SAP R/3 IDocs. Real-time jobs should not be used where:

 

  • Systems only need to be near real-time. A better approach is to create a batch job and run it regularly (e.g. every 5 minutes).
  • Complex ETL processing is required, such as aggregations.

 

Often real-time jobs will be used to process XML into a staging area, and a batch job will run regularly to complete the processing and perform aggregations and other complex business functions.

 

3.3 Comments


Comments should be included throughout DS jobs.  With the Auto Documentation functionality of DS, comments can be passed straight through into the technical documentation. Comments should be added in the following places:

  • Description field of each object. Every reusable object (job, work flow, data flow, etc.) has a description field available. This should include the author, date, and a short description of the object.
  • Scripts and functions – comments are indicated by a # in scripts and functions. At the top of any code there should be the author, create date, and a short description of the script; comments should also be included within the code to describe tasks that are not self-explanatory (see the sketch after this list).
  • Annotations – these should be used to describe areas of a work flow or data flow that are not self-explanatory. It is not necessary to clutter the design areas with irrelevant comments such as “this query joins the table”.
  • Field comments – tables should have comments attached to each field. These can be manually entered, imported from the database, or imported from any tool that supports the Common Warehouse Metamodel (CWM).
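
As a sketch only, a script header in the DS scripting language might look like the following; the script name and author details are placeholders:

    # Script:      SCR_Initialise_Globals
    # Author:      <author name>
    # Date:        <create date>
    # Description: Assigns the global variables used by the job.
    # Comments such as these flow through to the Auto Documentation output.
    $G_Debug = 0;   # default to normal (non-debug) execution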

 

3.4 Global Variables


Variables that are specific to a data flow or work flow should not be declared as global variables. They should be declared as local variables and passed as parameters to the dependent objects. The reasoning behind this is twofold. Firstly, because DS can run these objects in a sequential or parallel execution framework, local variables and parameters allow values to be modified without affecting other processes. Secondly, work flows and data flows can be reused in multiple jobs; by declaring local variables and parameters, you break the reliance on job-level global variables having been configured and assigned the appropriate values. Some examples of variables that should be defined locally are:

  • The filename for a flat file source for a data flow to load
  • Incremental variables used for conditionals or while-loops

 

The global variables that are used should be standardised across the user community. Some examples of valid global variables are:

 

  • $G_Recovery – Recovery flag: indicates that the job should be executed in recovery mode.
  • $G_Start_Datetime – Start date-time: the date and time from which the job should start loading data; often the finish time of the last execution.
  • $G_End_Datetime – End date-time: the date and time up to which the job should load data; set when the job starts in order to avoid overlaps.
  • $G_Debug – Debug flag: tells the job to run in debug mode, which allows custom debug commands to run.
  • $G_Log – Log flag: tells the job to run in logging mode.
  • $G_Exec_ID – Execution ID: identifies the current execution of the job; used as a reference point when writing to audit tables.
  • $G_Job_ID – Job ID: identifies the job; used as a reference point when writing to audit tables.
  • $G_Version – Version number: a constant that represents the version of the job.
  • $G_DB_Type – Database type: when developing generic jobs, it is often useful to know the underlying database type (SQL Server, Oracle, etc.).
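
As a minimal sketch only, an initialisation script at the start of a job might assign these standard globals as follows; the DS_Admin datastore and JOB_CONTROL audit table are hypothetical, and the audit design will vary by project:

    # SCR_Initialise_Globals - assign the standard global variables at job start.
    $G_Version        = '1.0';
    $G_Recovery       = 'N';
    $G_Debug          = 0;
    $G_Log            = 1;
    $G_Job_ID         = job_name();   # for illustration; many designs use a numeric ID from an audit table
    # Capture the end of the load window at job start to avoid overlaps.
    $G_End_Datetime   = sysdate();
    # The start of the window is the finish time of the last successful run
    # (curly braces substitute and quote a variable's value inside the SQL string).
    $G_Start_Datetime = nvl(sql('DS_Admin', 'SELECT MAX(END_DATETIME) FROM JOB_CONTROL WHERE JOB_NAME = {$G_Job_ID}'),
                            to_date('1900.01.01', 'YYYY.MM.DD'));
    $G_Exec_ID        = sql('DS_Admin', 'SELECT MAX(EXEC_ID) + 1 FROM JOB_CONTROL');   # simplistic sequence for illustration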

 

3.5 Work Flows


The following guidelines should be followed when building work flows:

 

  • Objects can be left unconnected to run in parallel if they are not dependent on each other. Parallel execution is particularly useful for work flows that replicate a large number of tables into a different environment, or that mass-load flat files (common in extract jobs). However, care needs to be taken when running parallel data flows, particularly if they use the same source and target tables. A limit can be set on the number of available parallel execution streams under Tools > Options > Job Server > Environment (the default is 8) within the DS Designer.
  • Work flows should not rely on global variables for local tasks; instead, local variables should be declared and passed as parameters to the data flows that require them. It is acceptable to use global variables for environment and global references; however, apart from the “initialisation” work flow that starts a job, work flows should generally only reference global variables, not modify them.

 

3.6 Try/Catch


Try/catch objects should generally be used at the start and at the end of a job. The catch block can be used to log a failure to audit tables, notify someone of the failure, or provide other required custom functionality. Try/catch objects can be placed at the job and work flow level and can also be programmatically referenced within the scripting language.

 

Generally, try/catch should not be used the way it would be in typical programming languages such as Java; in DS, if something goes wrong, the best approach is usually to stop all processing and investigate.

 

It is quite common for the catch object to contain a script that raises an exception (using the raise_exception() or raise_exception_ext() functions). This allows the error to be trapped and logged, while at the same time the job is marked with a red light in the DS Administrator to indicate that it failed.
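
A sketch of such a catch script follows; the AUDIT_ERROR table and DS_Admin datastore are assumptions, and $L_Error_No/$L_Error_Msg are assumed to be declared as local variables of the job:

    # SCR_Catch_All - log the failure, then re-raise so the job is marked as failed.
    $L_Error_No  = error_number();     # error_number(), error_message() and error_context()
    $L_Error_Msg = error_message();    # are available inside a catch block
    print('Job ' || job_name() || ' failed: ' || $L_Error_Msg);
    sql('DS_Admin', 'INSERT INTO AUDIT_ERROR (EXEC_ID, ERROR_NO, ERROR_MSG) VALUES ({$G_Exec_ID}, {$L_Error_No}, {$L_Error_Msg})');
    raise_exception('Job ' || job_name() || ' failed - see AUDIT_ERROR for details.');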

 

3.7 While Loops


While loops are mostly used for jobs that need to load a series of flat files or XML files and perform additional functions on them, such as moving them to a backup directory and updating control tables to indicate load success or failure.

 

The same standards regarding the use of global variables should also be applied to while loops.  This means variables that need to be updated (such as an iteration variable) should be declared as local variables.  The local variables should be passed to underlying data flows using parameters.
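
As an illustrative sketch (the file path, control table and $L_More_Files variable are assumptions), the pieces of such a loop might look like this:

    # Condition entered on the While Loop object itself:
    #     $L_More_Files = 1
    #
    # SCR_Check_For_File - run before the loop and again at the end of each iteration:
    $L_More_Files = file_exists('/data/inbound/customers.csv');   # 1 if the file is waiting, 0 otherwise
    # A data flow then loads the file (its name passed in as a parameter), after which a
    # closing script moves the processed file to a backup directory and updates the
    # control table before the loop condition is evaluated again.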

 

3.8 Conditionals


Conditionals are used to choose which object(s) should be used for a particular execution.  Conditionals can contain all objects that a work flow can.  They are generally used for the following types of tasks:

 

  • Indicating whether a job should run in recovery mode or not.
  • Indicating whether a job should perform an initial or a delta load (see the sketch after this list).
  • Indicating whether a job is the nightly batch or the weekly batch (e.g. the weekly batch may include additional business processing).
  • Indicating whether parts of a job should be executed, such as executing the extract, cleanse and conform steps but not the deliver step.
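
For example (a minimal sketch; the $G_Load_Type variable, DS_Admin datastore and JOB_CONTROL table are assumptions), a script before the conditional can set the driving flag, and the conditional's if-expression is then a simple test:

    # SCR_Set_Load_Type - decide whether this run is an initial or a delta load.
    $G_Load_Type = ifthenelse(sql('DS_Admin', 'SELECT COUNT(*) FROM JOB_CONTROL WHERE JOB_NAME = {$G_Job_ID}') = 0,
                              'INITIAL', 'DELTA');
    # If-expression entered on the Conditional object: $G_Load_Type = 'INITIAL'
    # The Then branch holds the initial-load work flow, the Else branch the delta work flow.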

 

 

 

 

3.9 Scripts and Custom Functions


The following guidelines should be followed when building scripts and custom functions:

 

  • The sql() function should be used only as a last resort, because tables accessed through the sql() function are not visible in the metadata manager. The lookup_ext() function can be used for lookup-related queries, and a data flow should be built for insert/update/delete queries.
  • Custom functions should be written where the logic is too complex to write directly into the mapping part of a data flow, or where the logic needs to be componentised, reused and documented in more detail.
  • Global variables should never be referenced in a custom function; they should be passed in and out as parameters. A custom function can be shared across multiple jobs, so referencing job-level global variables is bad practice (see the sketch after this list).
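
As an illustrative sketch (the function name and parameters are hypothetical), a custom function body should work only with its own parameters rather than job-level globals:

    # Custom function CF_Safe_To_Date
    # Input parameters: $P_I_Date_String (varchar), $P_I_Format (varchar)
    # Returns the parsed date, or NULL when the string cannot be parsed - no globals are referenced.
    $L_Result = ifthenelse(is_valid_date($P_I_Date_String, $P_I_Format) = 1,
                           to_date($P_I_Date_String, $P_I_Format),
                           NULL);
    Return $L_Result;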

 

Note the following areas to be careful of when using custom functions:

  • Custom functions can prevent a data flow’s SQL from being pushed down effectively to the database. This often happens when a custom function is used in the WHERE clause of a query.
  • Calling custom functions in high volume data flows can cause performance degradation (particularly where parallel execution is used).

 

3.10 Data Flows


In general, a data flow should be designed to load information from one or more sources into a single target.  A single data flow should generally not have multiple tables as a target.  Exceptions are:

  • Writing to auditing tables (e.g. writing out the row count).
  • Writing invalid rows to a backup table.

 

The following items should be considered best practices in designing efficient and clean data flows:

  • All template and temporary tables should be imported, approved and optimized by database experts before releasing to a production environment.
  • The pushdown SQL should be reviewed to ensure indexes and partitions are being used efficiently.
  • All redundant code (such as useless transforms or extra fields) should be removed before releasing.
  • Generally the most efficient method of building a data flow is to use as few transforms as possible.

 

There are several common practices that can cause instability and performance problems in the design of data flow.  These are mostly caused when DS needs to load entire datasets into memory in order to achieve a task.  Some tips to avoid these are as follows:

  • Ensure all source tables in the data flow are from the same datastore, thus allowing the entire SQL command to be pushed down to the database.
  • Each data flow should use only one main target table (this excludes tables used for auditing and for rejected rows).
  • Generally the pushdown SQL should contain only one SQL command.  There are cases where more commands are acceptable. For example, if one of the tables being queried only returns a small number of rows. However, generally multiple SQL commands will mean that DS needs to perform in-memory joins, which can cause memory problems.
  • Check that all order by, where and group by clauses in the queries are included in the pushdown SQL.
  • If reverse pivot transforms are used, check that the input volume is known and consistent and can therefore be tested.
  • If the PRE_LOAD_CACHE option is being used on lookups, ensure that the translation table dataset is small enough to fit into memory and will always be of a comparable size.
  • Always try to use the sorted input option on table comparisons, taking care to ensure that the input is sorted in the pushdown SQL.
