The Multi-Cloud Future (5) — How To Do “Data” The Multi-Cloud Way

Kishore Gopalan
7 min read · Jun 17, 2021

It’s important to get both the business motive and the architecture right, to derive the best from your enterprise data in a multi-cloud world.

Photo by NASA on Unsplash

This is the fifth part of the “The Multi-Cloud Future” series. You can read the first part — Why Just “Cloud” and Not “Multi-Cloud” Will Impede Your Business Growth. Second part — Why Exactly Should You Adopt Multi-Cloud?. Third part — To Do Multi-cloud, You Should First ‘Think Multi-cloud’. Fourth part — Five Patterns To Get You To Start ‘Thinking’ Multi-cloud. Now read on…

The most important asset of any organization is its data. But that is also the asset that’s the most challenging to manage. I’ve written about how cloud computing has evolved to keep in tune with the changing dynamics of world economics and trade, and why it’s imperative for every organization to have a multi-cloud strategy. While keeping pace with changing dynamics is important for businesses to thrive and grow, it also brings a new set of challenges.

In particular, when you bring in multiple cloud platforms and your enterprise data is spread across different clouds, how does that change the data science equation? Can you still leverage all the data from all the clouds and on-premise, connect the dots, and extract meaningful analytics out of it?

Here, we take a broad yet deep look at some of the ways Google Cloud lets you make the best use of all of your data, wherever it lives.

Business Intelligence that spans across multiple clouds

Businesses don’t wait for cloud. In other words, cloud cannot inhibit or dictate how and where a business should grow, or how an analyst derives insights and makes intelligent decisions. While traditional Business Intelligence was handy in the old world of “dashboards”, the newer order calls for something more expansive, where deriving insights is agnostic of the data source.

Looker is Google’s multi-cloud Business Intelligence and data analytics platform. The power of Looker comes from its ability to connect to multiple data sources across different clouds or on-premise and create reports and dashboards. Its LookML modeling language describes your data once, and a rich set of APIs lets you embed data, visualizations and insights into your websites, use workflow integrations, and build applications on top of Looker itself.
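As a quick illustration of that API surface, here’s a minimal sketch using the looker_sdk Python package. The Look ID and credentials are hypothetical placeholders, and exact signatures vary slightly between SDK versions:

```python
# pip install looker_sdk
import looker_sdk

# Reads the Looker host and API credentials from a looker.ini file
# or from LOOKERSDK_* environment variables.
sdk = looker_sdk.init40()

# Run a saved Look and fetch its result set as JSON. The same call works
# whichever cloud or on-premise database the Look's connection points at.
result = sdk.run_look(look_id="42", result_format="json")  # "42" is a hypothetical Look ID
print(result)
```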

In the true spirit of multi-cloud — which is kind of what we’ve been harping on all through this series — Looker can be deployed in various locations, it doesn’t have to be Google Cloud. If the majority of the data is located in another cloud or on-premise, Looker is best deployed close to the primary data source.

Using BigQuery Omni to Access Data in Multiple Clouds

Another interesting pattern is using Looker together with BigQuery Omni.

Let’s take a look at how this pattern differs from the previous one. Rather than connecting from Looker to Athena, which would otherwise be needed to query the data in S3 buckets, Looker connects to BigQuery. Technically, it connects to the BigQuery control plane, and the control plane reaches out to the data plane residing in AWS and processes the query against the data in the S3 bucket.

The advantage of this pattern is that the same control mechanism that applies to querying BigQuery managed storage can be applied to the data that resides in AWS, and the dashboard setup becomes much more controlled and simpler. From the Looker perspective, connectivity to BigQuery Omni is exactly the same as connectivity to BigQuery tables in managed storage or in federated data sources on Google Cloud. This also comes in handy when data scientists or BI analysts don’t have access to Athena or some other way to query the data in S3 buckets through a SQL interface. BigQuery Omni becomes that SQL engine, with the convenience and security of controlling it from the familiar Google Cloud platform.
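To see how little changes on the client side, here’s a minimal sketch, assuming a hypothetical Omni dataset my-project.omni_dataset whose tables are backed by data in S3 and pinned to the aws-us-east-1 BigQuery location:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# The dataset lives in an AWS region, but the query reads like any other
# BigQuery query. Only the job location changes.
query = """
    SELECT region, COUNT(*) AS order_count
    FROM `my-project.omni_dataset.orders`
    GROUP BY region
"""
for row in client.query(query, location="aws-us-east-1"):
    print(row.region, row.order_count)
```

Looker sees exactly the same thing: a BigQuery connection, with the Omni machinery hidden behind it.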

The rest of the previous pattern remains in place — you can still connect to other data sources, including BigQuery data on Google Cloud and on-premise data warehouse and visualize it on the same dashboard and build the same types of apps and workflows.

Selectively Migrating Data Using BigQuery Omni

Let’s talk about several patterns where the data actually needs to be transferred into Google Cloud Platform (GCP) in order to enable joins with GCP data sources, or additional processing that can only be done on GCP.

The focus of these patterns is to minimize the amount of data that needs to be transferred, and to make the process as simple as possible. Bringing data over from another cloud platform or an on-premise data center always opens a number of questions related to security, data governance, data management, and the cost of data transfer and data duplication.

Consider the source data in Amazon S3

Consider how BigQuery Omni can be used to selectively transfer data from either AWS or Azure. We are using AWS terminology here, but it works exactly the same on Azure. BigQuery Omni can store the results of a query in an S3 bucket, and there is no limit on the amount of data that can be stored. Think of this as the first two parts of the ETL process: extracting the data from an S3 bucket and transforming it, before finally loading it into BigQuery. Both steps happen in a single EXPORT DATA statement, in which you give BigQuery the SELECT statement to run and the location of the S3 bucket where the data will be output. You can specify one of three output formats (CSV, JSON or Avro) and one of several data compression algorithms. BigQuery will run the statement and produce one or several files.
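Here’s a minimal sketch of that statement, run through the BigQuery Python client. The connection name, bucket and table are hypothetical placeholders; exports to S3 go through a BigQuery connection resource created in the AWS region:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Extract and transform in one statement: BigQuery Omni runs the SELECT
# against the data in AWS and writes the aggregated result back to S3.
export_sql = """
    EXPORT DATA
      WITH CONNECTION `aws-us-east-1.s3_export_connection`  -- hypothetical connection
      OPTIONS (
        uri = 's3://my-export-bucket/daily_orders/*',
        format = 'AVRO',
        compression = 'SNAPPY')
    AS
    SELECT order_date, region, SUM(amount) AS total_amount
    FROM `my-project.omni_dataset.orders`
    GROUP BY order_date, region
"""
client.query(export_sql, location="aws-us-east-1").result()
```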

Compress data to minimize egress cost and ingest into BigQuery

The next step is to transfer the data to a GCS bucket. We recommend using a managed GCP service called Storage Transfer Service, which can reliably transfer large amounts of data from an S3 bucket to a GCS bucket. It can also automatically delete the source files to minimize the total cost of the solution. Once the data is located in a GCS bucket, it can be used as-is as a BigQuery federated data source, or become the dataset used by AI Platform for ML model training. Or, we can ingest the data into BigQuery.
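As a sketch, scheduling that transfer with the google-cloud-storage-transfer client could look like this. The bucket names are hypothetical, and the AWS credentials are elided; in practice you’d load them from a secret store:

```python
# pip install google-cloud-storage-transfer
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

client.create_transfer_job({
    "transfer_job": {
        "project_id": "my-project",                  # hypothetical project ID
        "status": storage_transfer.TransferJob.Status.ENABLED,
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": "my-export-bucket",   # hypothetical S3 bucket
                "aws_access_key": {
                    "access_key_id": "...",          # elided; load from a secret store
                    "secret_access_key": "...",
                },
            },
            "gcs_data_sink": {"bucket_name": "my-landing-bucket"},  # hypothetical GCS bucket
            # Delete the S3 files once transferred so the exported data
            # isn't stored (and paid for) twice.
            "transfer_options": {"delete_objects_from_source_after_transfer": True},
        },
    }
})
```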

If the data was exported in Avro format, loading it is a breeze, because Avro files carry their own schema and Avro is the most performant format for loading data into BigQuery. If needed, the data can be further augmented using a Dataflow pipeline.
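The load itself is only a few lines with the BigQuery client; the bucket and table names are again hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Avro files embed their own schema, so none needs to be specified here.
job = client.load_table_from_uri(
    "gs://my-landing-bucket/daily_orders/*",   # hypothetical GCS path
    "my-project.analytics.daily_orders",       # hypothetical destination table
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
)
job.result()  # wait for the load job to finish
```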

Managing the whole workflow could become complex

Typically, in cases like this, you would want to manage the whole process as a workflow. You can use Google Cloud Workflows to orchestrate everything from start to finish: it lets you orchestrate and automate Google Cloud and HTTP-based API services with serverless workflows.
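Once the export, transfer and load steps are defined in a workflow, kicking the whole pipeline off is a single API call. A minimal sketch with the google-cloud-workflows client, assuming a hypothetical workflow named s3-to-bq already deployed in us-central1:

```python
# pip install google-cloud-workflows
from google.cloud.workflows import executions_v1

client = executions_v1.ExecutionsClient()

# Fully qualified name of a hypothetical, already-deployed workflow that
# chains the EXPORT DATA, Storage Transfer, and BigQuery load steps.
parent = "projects/my-project/locations/us-central1/workflows/s3-to-bq"

execution = client.create_execution(request={"parent": parent})
print("Started execution:", execution.name)
```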

Minimizing data transfer costs, and the trade-offs involved

When applying this pattern, keep the egress cost in mind. There are several trade-offs to consider. For example, you can export the data in Avro format with Snappy compression, or in comma-separated value format with gzip compression. Depending on your data, the compression ratio might be slightly better for the CSV format, but then you would most likely pay the additional cost of unzipping the data once it’s loaded to GCS, while Avro files need no post-processing. It may be a good idea to start with compressed Avro, but keep the other options in mind and ideally evaluate them.

To minimize the egress cost, ensure that only the minimum data set is transferred to GCP. Also respect data residency requirements, and be careful not to accidentally transfer sensitive data into a GCP location that you haven’t protected with appropriate controls.

In this post, we discussed some of the patterns for doing data the multi-cloud way. But there’s more to it. In the next post, we’ll continue exploring multi-cloud data patterns.
