Data engineers reading this probably know that Google Cloud SQL and Cloud Data Fusion are both part of Google’s data and analytics offerings on Google Cloud Platform (GCP). So it should be simple to connect the two into a basic data pipeline, right? Yes, if you don’t mind using public interfaces; no, if both Data Fusion and Cloud SQL are meant to be internal (private) services.
Here are some simple guidelines for making sure that the private versions of Cloud SQL and Data Fusion can work together.
The first consideration is that Cloud SQL and Data Fusion should reside in the same VPC network. Both employ Google’s private services access, which peers the Google-managed VPCs that run these services with our own VPC. This kind of peering does not support transitive connections, so two peered services cannot reach each other through our VPC, and the situation gets worse if we are also peering two of our own VPCs together. So keep them on the same VPC (hint: it can also be a Shared VPC).
For two private services access services to talk to each other, each has to see the connection as coming from within our VPC. This means placing an intermediary between the services, so that GCP treats the connections as originating in our VPC rather than as the result of transitive peering. Here are the two methods that worked for us:
Both solutions have the desired effect: because the VM acts as a proxy, connections from Data Fusion to Cloud SQL appear to originate from a VM in our VPC. As an example, TCP forwarding can be configured with iptables as follows:
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A PREROUTING -p tcp --dport 3306 -j DNAT --to-destination <CloudSQL IP>:3306
sudo iptables -A FORWARD -p tcp -d <CloudSQL IP> --dport 3306 -j ACCEPT
sudo iptables -t nat -A POSTROUTING -j MASQUERADE
The commands above enable IP forwarding in the VM’s kernel (if it is not already on), forward traffic received by the VM on port 3306 to Cloud SQL on port 3306, and masquerade that traffic as coming from the IP of the VM.
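The forwarding rules in this section can also be generated by a small helper, so that the Cloud SQL IP and port are set in one place and the commands can be reviewed before they are applied. The following is only a sketch under a few assumptions: the function name print_forward_rules and the address 10.1.2.3 are hypothetical, the MASQUERADE rule is narrowed to traffic headed for Cloud SQL (the rule shown in this article masquerades everything), and the conntrack rule for return traffic only matters if the default FORWARD policy on the VM is DROP.

```shell
#!/bin/sh
# Print (do not apply) the commands that turn this VM into a TCP forwarder
# for Cloud SQL. Review the output before running it as root.
#   $1 = Cloud SQL private IP (required)
#   $2 = TCP port, defaults to 3306 (MySQL)
print_forward_rules() {
    sql_ip="$1"
    port="${2:-3306}"
    if [ -z "$sql_ip" ]; then
        echo "usage: print_forward_rules <cloud-sql-ip> [port]" >&2
        return 1
    fi
    # The kernel must be allowed to forward packets between hosts.
    echo "sysctl -w net.ipv4.ip_forward=1"
    # Rewrite the destination of incoming traffic to the Cloud SQL instance.
    echo "iptables -t nat -A PREROUTING -p tcp --dport ${port} -j DNAT --to-destination ${sql_ip}:${port}"
    # Let the rewritten packets, and their replies, through the FORWARD chain.
    echo "iptables -A FORWARD -p tcp -d ${sql_ip} --dport ${port} -j ACCEPT"
    echo "iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    # Assumption: masquerade only Cloud SQL traffic instead of all traffic.
    echo "iptables -t nat -A POSTROUTING -p tcp -d ${sql_ip} --dport ${port} -j MASQUERADE"
}
```

Calling print_forward_rules 10.1.2.3 prints the five commands; once they look right, the output can be piped to sudo sh on the proxy VM. Rules added this way do not survive a reboot unless they are saved separately.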
This is a step that we find many implementers miss. After setting up the proxy as described above, the GCP firewall must be configured to allow traffic from the correct source, on the ports used to communicate with the proxy instance; otherwise the setup will not work.
In our testing, we found that we could not change ports for some of the drivers in Data Fusion. Depending on the driver, you may therefore need to allow port 3306 from the allocated internal IP range for Data Fusion, which is defined in the Private Service Connection tab of your VPC (or of the host VPC if you are using a Shared VPC). This range is commonly named cdf-<data fusion instance name>. If you are using a Data Fusion driver that supports changing ports, allow the port you selected instead.
In short, create the following ingress and egress firewall rules:
Note: don’t forget to add other rules depending on other use-cases.
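As an illustration of the firewall step, the sketch below prints gcloud commands for one ingress rule (Data Fusion range to the proxy VM) and one egress rule (proxy VM to Cloud SQL). The rule names, the sql-proxy network tag, the function name and the example values are all assumptions made for this example and should be replaced with your own. The egress rule is only needed if your VPC restricts egress, since GCP allows egress by default.

```shell
#!/bin/sh
# Print (do not run) gcloud commands for the proxy firewall rules.
#   $1 = VPC network name
#   $2 = allocated IP range of the Data Fusion instance, in CIDR form
#   $3 = Cloud SQL private IP
#   $4 = TCP port, defaults to 3306
#   $5 = network tag on the proxy VM, defaults to "sql-proxy" (hypothetical)
print_firewall_rules() {
    network="$1"
    cdf_range="$2"
    sql_ip="$3"
    port="${4:-3306}"
    tag="${5:-sql-proxy}"
    if [ -z "$network" ] || [ -z "$cdf_range" ] || [ -z "$sql_ip" ]; then
        echo "usage: print_firewall_rules <network> <cdf-range> <sql-ip> [port] [tag]" >&2
        return 1
    fi
    # Ingress: Data Fusion's allocated range may reach the proxy VM.
    echo "gcloud compute firewall-rules create allow-cdf-to-proxy --network=${network} --direction=INGRESS --action=ALLOW --rules=tcp:${port} --source-ranges=${cdf_range} --target-tags=${tag}"
    # Egress: the proxy VM may reach the Cloud SQL private IP.
    echo "gcloud compute firewall-rules create allow-proxy-to-sql --network=${network} --direction=EGRESS --action=ALLOW --rules=tcp:${port} --destination-ranges=${sql_ip}/32 --target-tags=${tag}"
}
```

For example, print_firewall_rules my-vpc 10.10.0.0/22 10.1.2.3 prints both commands for review. With a Shared VPC, the rules belong to the host project, so a --project flag pointing at the host project would need to be added to each command.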
So yes, private Data Fusion instances can talk to private Cloud SQL instances; it just isn’t as straightforward as turning both services on. Placing a proxy in the middle also brings other considerations if the traffic volume is high. For example, do you need high availability? Make sure that the proxy is scaled to the volume you expect, and remember to install your Cloud Monitoring and Logging agents on the proxy so that you get a full end-to-end view of your Cloud SQL and Data Fusion pipeline!
Did this article help you? Reach out to us at marketing@matrixc.com or read up more of our blog posts at https://www.matrixc.com/blog/. We’d be happy to help you out on your GCP journey!