SAP DataServices (DS) supports two ways of accessing data on a Hadoop cluster:
- HDFS:
DS reads directly HDFS files from Hadoop. In DataServices you need to create HDFS file formats in order to use this setup. Depending on your dataflow DataServices might read the HDFS file directly into the DS engine and then handle the further processing in its own engine.
If your dataflow contains more logic that could be pushed down to Hadoop, DS may as well generate a Pig script. The Pig script will then not just read the HDFS file but also handle other transformations, aggregations etc. from your dataflow.
The latter scenario is usually a preferred setup for large amount of data because this way the Hadoop cluster can provide processing power of many Hadoop nodes on inexpensive commodity hardware. The pushdown of dataflow logic to Pig/Hadoop is similar to the pushdown to relational database systems. It depends on whether the dataflow uses functions that could be processed in the underlying systems (and of course whether DS knows about all the capabilities in the underlying system). In general, the most common functions, joins and aggregations in DS should be eligible to be pushed down to Hadoop. - Hive / HCatalog:
DS reads data from Hive. Hive accesses data that is defined in HCatalog tables. In DataServices you need to setup a datastore that connects to a Hive adapter. The Hive adapter will in turn read tables from Hadoop by accessing a hive server. DS will generate HiveQL commands. HiveQL is similar to SQL. Therefore the pushdown of dataflow logic to Hive works similar as the pushdown to relational database systems.
It is difficult to say which approach better to use in DataServices: HDFS files/Pig or Hive? Both, Pig and Hive generate MapReduce jobs in Hadoop and therefore performance should be similar. Nevertheless, some aspects can be considered before deciding which way to go. I have tried to describe these aspects in the table below. They are probably not complete and they overlap, so in many cases they will not identify a clear favorite. But still the table below may give some guidance in the decision process.
Subject | HDFS / Pig | Hive |
---|---|---|
Setup connectivity | Simple via HDFS file formats | Not simple. Hive adapter and Hive datastore need to be setup. The CLASSPATH setting for the hive adapter is not trivial to determine. Different setup scenarios for DS 4.1 and 4.2.2 or later releases (see also Connecting SAP DataServices to Hadoop Hive) |
Overall purpose |
| Hive are queries mainly intended for DWH-like queries. They might suit well for DataServices jobs that need to join and aggregate large data in Hadoop and write the results into pre-aggregated tables in a DWH or some other datastores accessible for BI tools. |
Runtime performance |
|
|
Development aspects |
|
|
Data structure | In general, HDFS file formats will suit better for unstructured or semi-structured data. | There is little benefit from accessing unstructured data via HiveQL, because Hive will first save the results of a HiveQL query into a directory on the local file system of the DS engine. DS will then read the results from that file for further processing. Instead, reading the data directly via a HDFS file might be quicker. |
Future technologies: Stinger, Impala, Tez etc. | Some data access technologies will improve the performance of HiveQL or Pig Latin significantly (some already do so today). Most of them will support HiveQL whereas some of them will support both, HiveQL and Pig. The decision between Hive and Pig might also depend on the (future) data access engines in the given Hadoop environment. |