The default answer for new data pipelines is batch processing, and that default is usually correct. Streaming adds real-time capabilities at the cost of significant complexity — and most applications don't actually need sub-minute data freshness.
Batch processing: data accumulates, a job runs on a schedule (hourly, daily), processes the accumulated data, and writes results. Tools: dbt (SQL transformations on data warehouses), Apache Spark (large-scale batch on distributed compute), and Airflow, Prefect, or Dagster (orchestration). Cost is low because you pay for compute only during the job run. A typical nightly ETL on Databricks costs $5-50 depending on data volume and cluster size. Debugging is easier because inputs and outputs are persisted between runs: you can inspect state at any point and rerun a failed job against the same data.
Stream processing: events are processed within milliseconds to seconds of arrival. Tools: Apache Flink (stateful streaming, exactly-once), Kafka Streams (JVM library, tight Kafka integration), Apache Spark Structured Streaming (micro-batching, slightly higher latency than true streaming), RisingWave (streaming SQL, easier than Flink), and Amazon Managed Service for Apache Flink (the current name for Kinesis Data Analytics). Cost is higher because you're running infrastructure continuously, not on-demand.
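To make the cost asymmetry concrete, here is a rough sketch of the arithmetic. The $4/hour cluster rate, the 1.5-hour job duration, and the 30-day month are illustrative assumptions, not figures from any provider's price list. The sketch also assumes the streaming job needs the same cluster size as the batch job, which often won't hold in practice.

```python
DAYS_PER_MONTH = 30                      # simplification (assumed)
HOURS_PER_MONTH = 24 * DAYS_PER_MONTH


def monthly_batch_cost(runs_per_day: float, hours_per_run: float, hourly_rate: float) -> float:
    """Batch: compute is billed only while each scheduled job is running."""
    return runs_per_day * DAYS_PER_MONTH * hours_per_run * hourly_rate


def monthly_streaming_cost(hourly_rate: float) -> float:
    """Streaming: the cluster stays up around the clock."""
    return HOURS_PER_MONTH * hourly_rate


# Hypothetical numbers: a $4/hour cluster and one nightly run of 1.5 hours
# ($6 per run, inside the $5-50 range quoted above).
batch = monthly_batch_cost(runs_per_day=1, hours_per_run=1.5, hourly_rate=4.0)
stream = monthly_streaming_cost(hourly_rate=4.0)

print(f"batch: ${batch:.0f}/month, streaming: ${stream:.0f}/month")
# batch: $180/month, streaming: $2880/month
```

Under these assumptions the always-on cluster costs 16x the nightly job. The exact ratio matters less than the shape: batch cost scales with hours of work, streaming cost scales with hours in the month.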
When streaming is justified: fraud detection (decisions must be made before a transaction clears — typically <2 seconds), real-time recommendation systems where stale data causes immediate revenue loss, live dashboards for operations teams, IoT sensor processing, or financial market data. The common thread: the value of data degrades fast enough that 1-hour-old data has significantly less value than 1-second-old data.
Common mistake: adding streaming because it sounds modern, not because there's a real-time requirement. A startup that processes 10K orders/day (about one order every 9 seconds on average, or roughly 420 per hourly run) and already runs an hourly Spark job would be over-engineering by switching to Kafka Streams. The complexity tax is substantial: deployment, state management, late data handling, and watermarking all become your problem.
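For a feel of what "late data handling" and "watermarking" mean in practice, here is a minimal pure-Python sketch of a tumbling-window sum with a watermark. It is a toy model of the idea, not how Flink or Kafka Streams implement it; the function name `process`, the 60-second window, and the 30-second allowed lateness are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # tumbling window size (assumed)
ALLOWED_LATENESS = 30    # watermark trails the max event time by this much (assumed)


def process(events):
    """events: (event_time_seconds, amount) pairs in ARRIVAL order.

    Returns (closed_windows, dropped_late_events, still_open_windows).
    """
    open_windows = defaultdict(float)   # window start -> running sum
    closed = {}
    dropped = []
    max_event_time = float("-inf")

    for event_time, amount in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        window_start = event_time // WINDOW_SECONDS * WINDOW_SECONDS

        if window_start + WINDOW_SECONDS <= watermark:
            # The window this event belongs to was already finalized.
            dropped.append((event_time, amount))
        else:
            open_windows[window_start] += amount

        # Finalize every window whose end is at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW_SECONDS <= watermark:
                closed[start] = open_windows.pop(start)

    return closed, dropped, dict(open_windows)


# The event at t=58 arrives out of order but in time; the one at t=20 is too late.
events = [(5, 10.0), (50, 5.0), (70, 1.0), (58, 2.0), (95, 3.0), (20, 4.0)]
closed, dropped, still_open = process(events)
print(closed)      # {0: 17.0}
print(dropped)     # [(20, 4.0)]
print(still_open)  # {60: 4.0}
```

Even this toy version forces decisions a batch job never faces: how long to wait for stragglers, what to do with dropped events, and where the open-window state lives if the process restarts.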
Lambda architecture (a streaming path for low latency plus a batch path for accuracy) adds even more complexity, because the same logic has to be maintained in two systems. In 2026, the Kappa architecture (streaming only, with a replayable Kafka log used to reprocess historical data) is preferred when streaming is required. But the practical simplicity winner is frequent small batches: run your dbt pipeline every 5 minutes on Snowflake or BigQuery. For most use cases, 5-minute data freshness covers the real-time requirement at batch complexity and cost.
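The frequent-small-batch pattern can be sketched as an incremental load: each run picks up only the rows that arrived since the previous run. The example below uses Python's built-in sqlite3 module as a stand-in for a warehouse so that it runs anywhere. The table names `raw_orders` and `orders_clean` are hypothetical, and the sketch assumes ids only ever increase, so a row arriving late with a lower id would be missed.

```python
import sqlite3

# Hypothetical tables. The WHERE clause is the whole trick: it selects only
# source rows with an id above the highest one already loaded.
INCREMENTAL_SQL = (
    "INSERT INTO orders_clean (order_id, amount_cents) "
    "SELECT id, amount_cents FROM raw_orders "
    "WHERE id > (SELECT COALESCE(MAX(order_id), 0) FROM orders_clean)"
)


def run_micro_batch(conn: sqlite3.Connection) -> int:
    """One scheduled run; returns the number of newly loaded rows."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(INCREMENTAL_SQL)
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE raw_orders (id INTEGER PRIMARY KEY, amount_cents INTEGER);"
    "CREATE TABLE orders_clean (order_id INTEGER PRIMARY KEY, amount_cents INTEGER);"
)

conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 500), (2, 1250), (3, 99)])
first = run_micro_batch(conn)    # loads rows 1-3

conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(4, 700), (5, 42)])
second = run_micro_batch(conn)   # loads only rows 4-5
third = run_micro_batch(conn)    # nothing new, loads nothing

print(first, second, third)  # 3 2 0
```

A scheduler then calls the run every 5 minutes (in cron syntax, `*/5 * * * *`). Because a run with no new rows is a no-op, retries and overlapping schedules are far less dangerous than in a stateful streaming job.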