Apache™ Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop®. It enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases. Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache Falcon for these functions, maximizing reuse and consistency across Hadoop applications.
What Falcon Does
Falcon simplifies the development and management of data processing pipelines by introducing a higher layer of abstraction for users to work with. Falcon takes the complex coding out of data processing applications by providing common data management services out of the box, simplifying the configuration and orchestration of data motion, disaster recovery and data retention workflows. Key features of Falcon include:
Data Replication Handling: Falcon replicates HDFS files and Hive Tables between different clusters for disaster recovery and multi-cluster data discovery scenarios.
Data Lifecycle Management: Falcon manages data eviction policies.
Data Lineage and Traceability: Falcon entity relationships enable users to view coarse-grained data lineage.
Process Coordination and Scheduling: Falcon automatically manages the complex logic of late data handling and retries.
Declarative Data Process Programming: Falcon introduces higher-level data abstractions (Clusters, Feeds and Processes) enabling separation of business logic from application logic, maximizing reuse and consistency when building processing pipelines.
Leverages Existing Hadoop Services: Falcon transparently coordinates and schedules data workflows using the existing Hadoop services such as Apache Oozie.
How Falcon Works
Falcon runs as a standalone server as part of your Hadoop cluster.
A user creates entity specifications and submits them to Falcon using the Command Line Interface (CLI) or REST API. Falcon transforms the entity specifications into repeated actions through a Hadoop workflow scheduler. All the functions and workflow state management requirements are delegated to the scheduler. By default, Falcon uses Oozie as the scheduler.
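As a sketch of that submission flow, the Falcon CLI accepts an entity type and an XML specification file; the entity and file names below are hypothetical:

```shell
# Submit the cluster definitions (validated by Falcon and stored in its config store)
falcon entity -type cluster -submit -file primary-cluster.xml
falcon entity -type cluster -submit -file secondary-cluster.xml

# Submit a feed, then schedule it; Falcon translates the schedule
# into recurring actions on the underlying workflow scheduler (Oozie by default)
falcon entity -type feed -submit -file raw-input-feed.xml
falcon entity -type feed -schedule -name rawInputFeed
```

The same operations are available over the REST API, so pipelines can be managed programmatically as well as from the command line.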
The following diagram illustrates the entities defined as part of the Falcon framework:
Cluster: Represents the "interfaces" to a Hadoop cluster.
Feed: Defines a dataset (such as HDFS files or Hive tables) with location, replication schedule and retention policy.
Process: Consumes Feeds, applies processing logic, and produces output Feeds.
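To make these abstractions concrete, here is a minimal sketch of a Feed entity in Falcon's XML syntax; the feed name, paths, dates and retention period are illustrative:

```xml
<feed name="rawInputFeed" description="Hourly raw input data" xmlns="uri:falcon:feed:0.1">
  <!-- How often a new instance of this dataset materializes -->
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Lifecycle management: evict instances older than 90 days -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <!-- Dataset location on HDFS, templated by instance time -->
  <locations>
    <location type="data" path="/data/raw/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl-user" group="etl" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

A Process entity would then name this feed as an input, letting Falcon wire the dataset into the processing pipeline without any custom scheduling code.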
Falcon in Action: Replication
All enterprise conversations involve data replication at some point. These range from the simple “I need multiple copies of data” to the more complex “I need certain subsets of staged, intermediate and presented data replicated across clusters in a failover scenario…and I need each dataset to have a different retention period.”
Typically, these problems are solved with custom-built applications, which can be time-consuming, error-prone and a long-term challenge to maintain. With Falcon, you can avoid the custom code and instead express the processing pipeline and replication policies in a simple declarative language.
In the scenario below, Staged data goes through a sequence of processing steps to be consumed by business intelligence applications. The customer wants a replica of this data in a secondary cluster (for failover in case of cluster downtime). But the secondary cluster is smaller than the primary cluster, so only a subset of data should be replicated.
Hello Falcon. With Falcon, you define the datasets and processing workflow, and at designated points, Falcon replicates the data to the secondary cluster. Falcon orchestrates the processing and schedules the replication events. The end result: in case of failover, critical Staged and Presented data is stored in the secondary cluster.
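This scenario can be expressed in a single Feed entity that pairs a source cluster with a target cluster: Falcon then schedules the replication, and each cluster carries its own retention policy. In this illustrative sketch (all names, dates and periods are hypothetical), the smaller secondary cluster keeps far less history than the primary:

```xml
<feed name="presentedDataFeed" description="Presented data, replicated for failover" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <clusters>
    <!-- Primary cluster: where the data is produced; keep a full year -->
    <cluster name="primaryCluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(365)" action="delete"/>
    </cluster>
    <!-- Smaller secondary cluster: Falcon replicates each instance here,
         but retains only the most recent 30 days -->
    <cluster name="secondaryCluster" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/presented/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="etl-user" group="etl" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Because retention is declared per cluster, the "different retention period per dataset" requirement falls out of the specification itself rather than from custom cleanup jobs.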