With the introduction of Foreign Data Wrappers (FDW) in Postgres 9.1, access to distributed systems such as HDFS, HBase, and Hive, with their multiple data formats, became feasible. However, the existing FDW implementations for Big Data systems such as HDFS or Hive lack a few key features and do not share a common framework.
PXF provides a unified, extensible framework for accessing any distributed data source. Existing plugins support loading and querying data stored in HDFS, HBase, and Hive. It supports a wide range of data formats, including Text, Avro, SequenceFile, Hive RCFile, ORC, and Parquet, as well as HBase tables. The pluggable framework makes it easy to add new custom plugins. It also supports advanced statistics and filter pushdown. PXF is an open source project that is currently used by Apache HAWQ’s external tables through PXF’s REST API, and it is in the process of being integrated with other SQL engines.
By integrating PXF with the Postgres FDW interface, we can achieve a single, unified, pluggable framework for reading and writing data in any distributed system. PXF also shields Postgres from remote client dependencies and provides a clean installation mechanism.
Speakers: Shivram Mani