But an often-underappreciated element of Google’s data portfolio is Dataflow, a powerful managed service that simplifies building data pipelines while allowing the underlying infrastructure to scale rapidly.
Fundamentally, Dataflow addresses one of the biggest challenges around cloud-native data pipelines - that streaming and batch workloads are typically handled in completely different ways.
To deal with streaming data, teams might use functions triggered by event notifications, whereas for batch it is usually a case of running Spark on an ephemeral managed Hadoop service such as Dataproc.
Managing an enterprise estate of data pipelines in this way therefore means maintaining teams with different skills and different deployment practices - counter-intuitive in a world where we are working towards ever more centralised control and understanding of data.
This is where Dataflow (the managed runner for Apache Beam, whose programming model originated as Google’s Dataflow SDK) offers an extremely useful differentiator - it unifies the development and management of both streaming and batch data pipelines in one place.
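To make that unified model concrete, here is a minimal sketch using the Apache Beam Python SDK, which is what Dataflow executes. The bucket, topic and transform names are illustrative placeholders rather than anything from a real project, and a production streaming job would also add windowing before the aggregation.

```python
# A minimal sketch of Beam's unified model: the same transform chain handles
# a bounded (batch) source and an unbounded (streaming) source.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def apply_transforms(pipeline, source):
    # The business logic is identical regardless of where the data comes from.
    return (
        pipeline
        | "Read" >> source                                   # batch or streaming source
        | "Normalise" >> beam.Map(lambda line: line.strip())
        | "CountPerValue" >> beam.combiners.Count.PerElement()
        | "Log" >> beam.Map(print)
    )


# Batch: read a bounded, file-based source (placeholder bucket).
with beam.Pipeline(options=PipelineOptions()) as p:
    apply_transforms(p, beam.io.ReadFromText("gs://my-bucket/events-*.txt"))

# Streaming: swap in an unbounded Pub/Sub source and enable streaming mode;
# the rest of the pipeline is unchanged (placeholder topic, shown commented out).
# with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
#     apply_transforms(
#         p,
#         beam.io.ReadFromPubSub(topic="projects/my-project/topics/events"),
#     )
```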
This unification brings tangible benefits for organisations, such as simpler operations and the ability to build ML solutions around data pipelines with far less friction. For industries that rely on a mix of batch and streaming data sources, Dataflow offers a critical competitive edge - in the gaming sector, for example, the ability to personalise experiences based on many different player touchpoints is essential.
Another great example is Spotify, which serves billions of streams across 61 different markets while adding thousands of new tracks every day, and uses data for functions ranging from music recommendations to business reporting. In 2019, Spotify ran the largest-ever Dataflow job to support its Decade Wrapped round-up, which involved processing "data stories" (personal statistics per user, such as top songs for each year) over a ten-year period - and across 248 million monthly active users (MAUs)!
But Dataflow does not only solve challenges for data ingestion and enterprise-grade data pipelines; it also helps with analytics and Machine Learning workloads. One of the most common issues with productionising ML is that training data arrives in batch form, whereas the data for inference often arrives in real time.
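As a rough illustration of how that gap can be narrowed, the sketch below applies the same Beam RunInference step to a bounded historical dataset and to a live Pub/Sub stream. The model URI, bucket and topic are hypothetical placeholders, and the feature parsing is deliberately simplistic; this is a sketch of the pattern, not a production recipe.

```python
# Same inference transform for batch scoring and streaming scoring,
# using Beam's RunInference API with a scikit-learn model handler.
import numpy as np
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# Model trained offline on batch data and stored as a pickled scikit-learn model
# (placeholder path).
model_handler = SklearnModelHandlerNumpy(model_uri="gs://my-bucket/models/model.pkl")


def to_features(record: str) -> np.ndarray:
    # Placeholder feature extraction: assumes a comma-separated row of floats.
    return np.array([float(x) for x in record.split(",")])


# Batch scoring over historical data (placeholder bucket)...
with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadHistory" >> beam.io.ReadFromText("gs://my-bucket/history-*.csv")
        | "ToFeatures" >> beam.Map(to_features)
        | "Predict" >> RunInference(model_handler)
    )

# ...and the identical ToFeatures/Predict steps on a live stream, simply by
# swapping the source for Pub/Sub and enabling streaming mode (placeholder topic,
# shown commented out).
# with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
#     (
#         p
#         | "ReadLive" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
#         | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
#         | "ToFeatures" >> beam.Map(to_features)
#         | "Predict" >> RunInference(model_handler)
#     )
```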
As we’ll see in future instalments, Google Cloud Platform already offers a sophisticated suite of Machine Learning capabilities, and Dataflow’s convergence of batch and streaming is well placed to underpin them.