Unifying Data Pipelines and Microservices: A Novel Architectural Approach

A new architectural approach that bridges the traditional divide between data pipeline systems and microservice architectures.

By Yichen Jin, Co-Founder & CEO, Fleak

By reconceptualizing the boundaries between batch processing, stream processing, and synchronous API services, we propose a unified framework that enables more flexible and efficient data processing systems. This architecture allows organizations to leverage the strengths of both paradigms while minimizing their respective limitations.

Introduction

Modern data architectures typically maintain a clear separation between data pipelines and application services. Data pipelines handle batch and stream processing of large-scale data, while microservices manage real-time application requests. This separation, while conventional, can lead to unnecessary complexity and redundancy in system design. This paper explores an alternative approach that unifies these traditionally distinct architectural patterns.

Background

Traditional Data Pipelines

Data pipelines serve as the backbone of data processing systems, operating in both batch and streaming contexts. In batch processing, pipelines execute at scheduled intervals using orchestration frameworks like Apache Airflow or Dagster, processing finite sets of data in discrete time windows. Streaming pipelines, conversely, operate continuously on real-time data flows, often utilizing frameworks like Apache Flink or Kafka Streams for immediate data processing.

Regardless of their operational mode, these pipelines share common characteristics: they extract data from source systems (such as databases, applications, or data warehouses), apply transformation logic, and load results into destination systems. While effective for traditional data processing needs, these pipeline architectures typically maintain strict boundaries with application-layer services, leading to potential inefficiencies in modern, real-time data processing scenarios.

Microservice Architecture

Microservices represent a distributed architectural pattern where applications are built as a collection of loosely coupled, independently deployable services. These services typically communicate via HTTP APIs, can be implemented in different programming languages, and scale independently based on demand. While this architecture offers flexibility and scalability for application development, it traditionally maintains a separate domain from data processing systems, often requiring additional infrastructure and complexity when integrating with data pipelines.

The separation of these two architectural paradigms, data pipelines and microservices, has historically served valid purposes, but it also creates artificial boundaries for modern data processing needs. This paper explores how these boundaries can be dissolved to create more efficient and flexible systems.

Proposed Architecture

Modern data architectures typically maintain distinct boundaries between stream processing engines, batch pipelines, and microservices. While this separation serves historical purposes, it introduces complexity in system design and operations. This section presents an architecture that explores the unification of these traditionally separate components through a novel compilation approach.

3.1 Current Architectural Approaches

Consider a simple real-time fraud detection system for transaction data. The system typically depends on several external services: a user history lookup, an IP intelligence check, a user contact lookup, and a machine learning model for fraud scoring.

In production, organizations typically need to handle both real-time stream processing of transactions and on-demand API requests for transaction verification. This requirement commonly leads to maintaining two separate implementations:

Stream Processing Implementation:

In a stream processing implementation using Apache Flink, organizations typically create a single job that processes transactions in real-time through a series of transformations. The job consumes transaction events from Kafka and enriches each transaction by making external service calls for user history data, IP intelligence, and fraud scoring through a machine learning model. These transformations are chained together in a single Flink job, where each step is implemented as a mapping function that handles its own HTTP client connections, retries, and error handling. 

public class FraudDetectionStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new KafkaSource<>("transactions"))
           .map(new UserHistoryEnrichment())      // enrich with user history
           .map(new IPIntelligenceEnrichment())   // enrich with IP intelligence
           .map(new UserContactEnrichment())      // enrich with user contact data
           .map(new ModelInferenceFunction())     // score with the fraud model
           .process(new AlertingFunction())       // raise alerts on high-risk scores
           .addSink(new KafkaSink<>("results"));

        env.execute("Fraud Detection Stream");
    }
}

API Service Implementation:

For API service implementation, organizations typically develop a web service using frameworks like Spring or Express that exposes HTTP endpoints for transaction verification. When a request arrives, the service executes a series of synchronous calls to external services to gather user history, IP intelligence, and fraud scores. The service manages its own connection pools for external service calls and implements retry mechanisms for failed requests. The results are then aggregated and returned as an HTTP response to the client.

@RestController
public class FraudDetectionApiService {
    @PostMapping("/check-transaction")
    public FraudCheckResponse checkTransaction(@RequestBody Transaction tx) {
        UserData userData = userService.getHistory(tx.getUserId());
        IpData ipData = ipService.checkIp(tx.getIpAddress());
        Contact userContact = contactService.checkContact(tx.getContact());
        double score = modelService.predict(tx, userData, ipData, userContact);
        return new FraudCheckResponse(score);
    }
}

This dual implementation approach leads to several challenges:

  • Same business logic maintained in different codebases

  • Separate error handling mechanisms

  • Different deployment and scaling patterns

  • Distinct monitoring and observability setups

When business rules for fraud detection change, both implementations must be updated independently. For example, if risk scoring thresholds need adjustment, developers must modify both the Flink job's ModelInferenceFunction and the API service's scoring logic. This creates opportunities for inconsistencies where one implementation might be updated while the other remains on older logic.

Error handling becomes particularly complex as each implementation develops its own patterns. The Flink job might retry failed external service calls with exponential backoff, while the API service implements circuit breakers. When a user history service is temporarily unavailable, the stream processing job might skip the enrichment step and continue with partial data, while the API service might return an error response. These divergent behaviors can lead to different fraud detection outcomes for the same transaction patterns.
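To make the divergence concrete, the following sketch contrasts the two styles. Both helpers are hypothetical stand-ins for the kind of ad hoc code each implementation tends to accumulate; neither is part of Flink, Spring, or any particular resilience library.

import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helpers illustrating the two divergent error-handling styles;
// neither is part of Flink, Spring, or any specific resilience library.
public class DivergentErrorHandling {

    // Stream-job style: retry with exponential backoff, then fall back to "no data"
    // so the pipeline keeps flowing with a partially enriched record.
    static <T> Optional<T> fetchWithBackoff(Supplier<T> call) throws InterruptedException {
        long delayMs = 100;
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                return Optional.of(call.get());
            } catch (RuntimeException e) {
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff between retries
            }
        }
        return Optional.empty(); // continue with partial data
    }

    // API-service style: a crude circuit breaker that fails fast and surfaces the
    // error to the caller instead of silently degrading the result.
    static class SimpleCircuitBreaker {
        private int consecutiveFailures = 0;
        private static final int THRESHOLD = 5;

        <T> T call(Supplier<T> supplier) {
            if (consecutiveFailures >= THRESHOLD) {
                throw new IllegalStateException("circuit open: dependency unavailable");
            }
            try {
                T result = supplier.get();
                consecutiveFailures = 0; // close the circuit again on success
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                throw e;
            }
        }
    }
}

Because each codebase evolves these patterns independently, the same dependency failure can produce partial data in one path and an outright error in the other.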

Monitoring and alerting also fragment across implementations. The Flink job reports metrics about streaming throughput and processing latency, while the API service tracks request rates and response times. When investigating potential fraud patterns, security teams must correlate data across these separate monitoring systems to understand the full picture.

Connection management to external services becomes another point of divergence. Each implementation maintains its own connection pools and timeout configurations. During high-traffic periods, one implementation might handle resource constraints differently from the other, potentially leading to inconsistent fraud detection behavior under load.

These challenges compound when organizations need to make rapid changes to their fraud detection logic in response to emerging fraud patterns. The requirement to coordinate changes across separate codebases slows down response time and increases the risk of introducing inconsistencies in fraud detection.

3.2 Unified Processing Model

We propose a unified architecture in which data processing workflows are defined once and deployed either as stream processors or as API services. The fundamental insight behind this architecture is that data processing patterns remain consistent regardless of input source or output destination: whether data arrives via a streaming source or an HTTP request, the core transformation logic is the same. This observation allows a single pipeline definition to be deployed in multiple contexts.

Returning to the fraud detection example, here is how the system implements consistent processing across both modes. The core fraud detection logic can be expressed through a JSON configuration such as the following:

{
    "name": "fraud_detection",
    "nodes": [
        {
            "nodeId": "enrich_user_history",
            "commandName": "DBQuery",
            "arg": {
                "query": "{$query}"
                "connection_pool": {
                    "max_connections": 50,
                    "idle_timeout": "30s"
                },
                "error_handling": {
                    "retry_count": 3,
                    "backoff": "exponential",
                    "fallback": "use_cached_history"
                }
            }
        },
        {
            "nodeId": "ip_intelligence",
            "commandName": "httpCall",
            "arg": {
                "service": "ip_check",
                "endpoint": "/verify",
                "connection_pool": {
                    "max_connections": 30,
                    "idle_timeout": "30s"
                },
                "error_handling": {
                    "retry_count": 2,
                    "circuit_breaker": {
                        "error_threshold": 0.5,
                        "reset_timeout": "60s"
                    },
                    "fallback": "default_location"
                }
            }
        },
        {
            "nodeId": "user_contacts",
            "commandName": "DBQuery",
            "arg": {
                "query": "{$query}"
                "connection_pool": {
                    "max_connections": 50,
                    "idle_timeout": "30s"
                },
                "error_handling": {
                    "retry_count": 3,
                    "backoff": "exponential",
                    "fallback": "use_cached_contacts"
                }
            }
        },
        {
            "nodeId": "fraud_scoring",
            "commandName": "mlPredict",
            "arg": {
                "model_id": "fraud_v2",
                "error_handling": {
                    "fallback_model": "fraud_v1"
                }
            }
        }
    ]
}

This configuration captures the complete fraud detection workflow: enriching transaction data with user history, checking IP reputation, looking up user contact information, and scoring the transaction through a machine learning model. The "connection_pool" and "error_handling" sections specify how the system should handle infrastructure concerns such as connection pooling and error recovery for each step. These sections can be hidden from business users if the DevOps team has already defined default policies for each service.

When compiled, this configuration produces an optimized processing unit that behaves consistently whether it is deployed for stream processing or for API serving. The compilation process generates optimized code for each deployment mode while maintaining the properties below (a deployment sketch follows the list):

  • Identical business logic execution

  • Consistent error handling and retry policies

  • Uniform connection pool management

  • Standardized monitoring and observability
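To make this concrete, the sketch below shows how a single compiled unit might be wrapped by two thin deployment adapters. The CompiledPipeline interface and the adapter classes are hypothetical names introduced for illustration; they are not part of any specific framework.

import java.util.Map;

// Hypothetical sketch: one compiled pipeline, two thin deployment adapters.
// CompiledPipeline and the adapter classes are illustrative names, not a real API.
interface CompiledPipeline {
    // The transformation logic compiled from the JSON configuration:
    // enrichment, scoring, and the configured error handling.
    Map<String, Object> process(Map<String, Object> event);
}

// Stream adapter: invokes the compiled logic once per record, e.g. from inside
// a Flink MapFunction or Kafka Streams processor.
class StreamAdapter {
    private final CompiledPipeline pipeline;

    StreamAdapter(CompiledPipeline pipeline) { this.pipeline = pipeline; }

    Map<String, Object> onRecord(Map<String, Object> record) {
        return pipeline.process(record);
    }
}

// API adapter: invokes the identical compiled logic once per HTTP request,
// e.g. from inside a Spring @PostMapping handler.
class ApiAdapter {
    private final CompiledPipeline pipeline;

    ApiAdapter(CompiledPipeline pipeline) { this.pipeline = pipeline; }

    Map<String, Object> onRequest(Map<String, Object> requestBody) {
        return pipeline.process(requestBody);
    }
}

Only the adapters differ between deployment modes; the compiled processing logic, including its error-handling and connection policies, is shared by both.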

This architecture addresses a common challenge in modern data systems where the same business logic needs to be deployed both as stream processing pipelines and as API endpoints. Examples include fraud detection systems that process transaction streams while providing on-demand scoring APIs, and recommendation engines that continuously update user preferences while serving real-time recommendation requests.

Traditional approaches require maintaining separate implementations for stream processing and API serving, leading to duplicated code and potential inconsistencies. The proposed architecture resolves this through a declarative configuration that compiles into an optimized processing unit capable of both stream processing and API serving. The compiled unit maintains identical business logic execution and error handling across both deployment modes, while addressing fundamental distributed system requirements such as connection management and horizontal scaling.

The configuration-based approach also transforms how teams across an organization can collaborate on data processing workflows. Business analysts and policy makers can modify business rules through a familiar interface without needing to understand distributed systems concepts. Data scientists can specify model features and thresholds without writing framework-specific code. Engineers from different language backgrounds can contribute to the same pipeline through a language-agnostic configuration. Organizations can build domain-specific abstractions on top of this configuration layer, whether through visual interfaces for business users or language-specific SDKs for developers, further reducing the technical barriers to pipeline development and maintenance.
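As an illustration of such an abstraction, a language-specific SDK might expose a builder that emits the same JSON configuration shown earlier. The PipelineBuilder below is purely hypothetical and intentionally minimal; a real SDK would also cover connection-pool and error-handling options.

import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical builder-style SDK that emits the JSON pipeline configuration.
// PipelineBuilder and its methods are illustrative, not an existing library API.
class PipelineBuilder {
    private final String name;
    private final List<String> nodes = new ArrayList<>();

    PipelineBuilder(String name) { this.name = name; }

    PipelineBuilder dbQuery(String nodeId, String query) {
        nodes.add(String.format(
            "{\"nodeId\":\"%s\",\"commandName\":\"DBQuery\",\"arg\":{\"query\":\"%s\"}}",
            nodeId, query));
        return this;
    }

    PipelineBuilder httpCall(String nodeId, String service, String endpoint) {
        nodes.add(String.format(
            "{\"nodeId\":\"%s\",\"commandName\":\"httpCall\",\"arg\":{\"service\":\"%s\",\"endpoint\":\"%s\"}}",
            nodeId, service, endpoint));
        return this;
    }

    String toJson() {
        StringJoiner joined = new StringJoiner(",", "[", "]");
        nodes.forEach(joined::add);
        return String.format("{\"name\":\"%s\",\"nodes\":%s}", name, joined);
    }

    public static void main(String[] args) {
        String config = new PipelineBuilder("fraud_detection")
            .dbQuery("enrich_user_history", "{$query}")
            .httpCall("ip_intelligence", "ip_check", "/verify")
            .toJson();
        System.out.println(config); // hand the generated JSON to the pipeline compiler
    }
}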

This unification is particularly valuable in domains where processing consistency is critical, such as fraud detection and risk assessment systems. Organizations can ensure that whether a transaction is processed through their streaming pipeline or verified through an API call, it undergoes identical evaluation with consistent error handling and monitoring, while enabling diverse teams to collaborate effectively on the same processing logic.

3.3 Implementation and Migration Strategy

Adopting this unified architecture does not require organizations to replace their existing systems all at once. Instead, enterprises can implement a gradual migration strategy that minimizes disruption to existing operations while realizing incremental benefits.

A practical approach could begin with identifying a suitable pilot use case. In fraud detection systems, organizations often start with a subset of transaction types or a specific fraud detection rule. The new architecture can run in parallel with existing systems, processing the same transactions but not affecting production outcomes. This allows teams to validate the configuration-based approach and ensure consistent results before directing production traffic.
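One way to run this validation is to replay the same transactions through both the legacy implementation and the new compiled pipeline and compare the resulting scores offline. The harness below is a hypothetical sketch of that comparison; the function parameters stand in for whatever scoring interfaces the two systems expose.

import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical shadow-run harness: replay the same transactions through the legacy
// scorer and the new compiled pipeline, and count disagreements without affecting
// production outcomes. All names here are illustrative.
class ShadowComparison {

    static long countMismatches(
            List<Map<String, Object>> transactions,
            Function<Map<String, Object>, Double> legacyScorer,
            Function<Map<String, Object>, Double> candidateScorer,
            double tolerance) {

        long mismatches = 0;
        for (Map<String, Object> tx : transactions) {
            double legacy = legacyScorer.apply(tx);
            double candidate = candidateScorer.apply(tx);
            if (Math.abs(legacy - candidate) > tolerance) {
                mismatches++; // flag for review before directing production traffic
            }
        }
        return mismatches;
    }
}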

For example, a financial institution might first implement a configuration that monitors only high-value transfer transactions:

{
    "name": "high_value_transfer_monitoring",
    "nodes": [
        {
            "nodeId": "user_risk_check",
            "commandName": "httpCall",
            "arg": {
                "service": "existing_risk_service",
                "endpoint": "/check"
            }
        }
    ]
}

As confidence grows, additional processing steps can be gradually incorporated into the configuration while maintaining connections to existing services. This allows organizations to preserve their investments in current infrastructure while transitioning or adding new policies. Because the architecture can also be deployed as an API service, organizations can start with an on-demand, trigger-only fraud detection service without touching their existing streaming and batch pipelines. As teams become comfortable with the configuration-based approach, they can progressively migrate more functionality.

Applicability and Limitations

This unified architecture is particularly well-suited for scenarios where organizations need to maintain consistent processing logic across streaming and API serving modes. The most compelling use cases are in domains with strict regulatory compliance or business requirements for processing consistency, such as fraud detection, risk assessment, content moderation, and real-time recommendation systems. In these scenarios, the benefits of unified logic and reduced maintenance overhead clearly justify the adoption of this architecture.

It also provides significant advantages in organizations where diverse teams need to collaborate on data processing workflows. When business analysts, data scientists, and engineers from different backgrounds need to contribute to the same processing logic, the configuration-based approach reduces technical barriers and enables domain-specific abstractions tailored to each team's expertise.

The architecture assumes that the processing logic can be expressed through our pipeline configuration format. While this format covers a wide range of data processing needs, highly specialized algorithms or complex processing patterns might require more flexibility than a declarative configuration can provide. In such cases, framework-specific implementations might be more appropriate.

While it supports horizontal scaling to handle significant data volumes in both streaming and API serving modes, organizations with extreme throughput requirements (processing millions of events per second) or ultra-low latency requirements (sub-millisecond) might need specialized solutions optimized specifically for these performance characteristics. For most enterprise use cases, however, the architecture provides sufficient scalability while maintaining the benefits of unified deployment and consistent processing logic.
