Case Study: Solving Cross-Region S3 Access in a Hardened Databricks Environment
Building secure Databricks platforms in AWS often means balancing two competing priorities: maintaining strict network controls while preserving access to the services and data that modern analytics workloads depend on. In theory, connecting a Databricks workspace to an Amazon S3 bucket in another AWS region should be straightforward. In practice, security constraints, proxy architectures, Unity Catalog internals, and AWS networking behaviours can turn a simple requirement into a complex engineering challenge.
This case study describes how our team investigated and resolved cross-region S3 access issues in a highly restricted Databricks environment. Along the way, we uncovered several non-obvious behaviours related to transparent proxies, Unity Catalog external volumes, S3 region discovery, Spark configuration boundaries, and Linux packet processing.
The article is intended for cloud architects, platform engineers, DevOps teams, security engineers, and Databricks practitioners responsible for designing secure data platforms on AWS. Readers working with regulated environments, strict egress controls, private networking, or large-scale Databricks deployments may find the lessons particularly relevant.
Rather than presenting a predefined solution, this case study walks through the actual investigation process, including the approaches that failed, the assumptions that proved incorrect, and the architectural decisions that ultimately led to a more robust and scalable design.
When you deploy Databricks in a security-conscious AWS environment, you quickly discover that “just make it work” isn’t an option. Every packet leaving your VPC needs justification. Every endpoint needs hardening. And when someone asks you to connect to an S3 bucket in another region? That’s when the real fun begins.
This is the story of how we spent weeks navigating the intersection of AWS networking, transparent proxies, Unity Catalog internals, and Linux kernel packet handling—ultimately replacing our entire egress architecture to solve what seemed like a simple cross-region S3 access problem.
The Starting Point: A Locked-Down Databricks Workspace
Our Databricks workspace runs in AWS with both classic and serverless compute. The security requirements are strict:
- HTTP/HTTPS traffic (ports 80/443) only to explicitly allowlisted domains
- SSH (port 22) access only to specific allowlisted hosts
- All traffic must stay within AWS whenever possible
- Everything managed via Terraform
For serverless compute, Databricks provides native controls: allowed_internet_destinations for ports 22/80/443, and allowed_storage_destinations for Unity Catalog external volumes.
For classic compute, we built a more elaborate setup:
- VPC Endpoints for AWS services: STS, Kinesis, Secrets Manager, SES—keeping this traffic inside AWS
- S3 Gateway Endpoint for same-region bucket access (as recommended by Databricks)
- Databricks REST and Relay VPC Endpoints for control plane communication
- CodeArtifact with a VPC endpoint to replace direct PyPI access
- Custom Egress Gateway (CEG)—our name for an EC2-based solution that works like NAT Gateway but adds domain filtering.
The CEG is the heart of our egress control. It’s an Ubuntu 24.04 instance running Squid Proxy, deployed via EC2 user data. We run one CEG per availability zone, with private subnet default routes pointing to them.
For HTTP traffic, iptables redirects packets to Squid’s intercept port (3129). Squid then applies ACLs—matching against our domain allowlist for port 80, and reading SNI (Server Name Indication) for TLS traffic on port 443. We use peek-and-splice mode: Squid inspects the SNI without breaking the TLS session, then decides whether to allow or block.
For SFTP traffic, we use a different approach: a systemd timer resolves allowlisted hostnames to IPs and populates an ipset for iptables matching.
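The refresh job behind that timer can be sketched roughly as follows. This is a minimal illustration, not our production script: the set names `sftp_allow` and `sftp_allow_tmp`, the function names, and the injectable resolver are all hypothetical, and the sketch only builds the `ipset` command lines rather than executing them.

```python
import socket
from typing import Callable, Iterable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 22, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def build_ipset_commands(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ipv4,
    live_set: str = "sftp_allow",      # hypothetical set name
    tmp_set: str = "sftp_allow_tmp",   # hypothetical set name
) -> List[str]:
    """Build commands that fill a temporary set and swap it in atomically."""
    addresses = set()
    for host in hostnames:
        try:
            addresses.update(resolver(host))
        except OSError:
            # Skip hosts that fail to resolve on this run.
            continue

    commands = [
        f"ipset create {live_set} hash:ip -exist",
        f"ipset create {tmp_set} hash:ip -exist",
        f"ipset flush {tmp_set}",
    ]
    commands += [f"ipset add {tmp_set} {ip} -exist" for ip in sorted(addresses)]
    commands += [f"ipset swap {tmp_set} {live_set}", f"ipset destroy {tmp_set}"]
    return commands
```

Filling a temporary set and swapping it in means the iptables rule never sees a half-populated set while the refresh is running.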
This architecture served us well—until we needed to access an S3 bucket in a different region.
The Problem: Cross-Region S3 via Unity Catalog
The requirement seemed straightforward: connect an S3 bucket from us-west-2 to our Unity Catalog as an external volume. Our Databricks workspace lives in us-east-2.
We added the bucket domains to Squid’s allowlist:
- our-bucket.s3.amazonaws.com
- our-bucket.s3.us-west-2.amazonaws.com
And that’s when things broke.
Discovery #1: Transparent Proxies Break S3
Squid in transparent mode uses peek-and-splice for HTTPS. It doesn’t decrypt traffic—it just peeks at the SNI to make a filtering decision. Sounds non-invasive, right?
Wrong. Even in peek-and-splice mode, Squid creates two separate TCP sessions: one from the client to Squid, and another from Squid to the destination. The TLS handshake happens end-to-end, but the TCP connection is split.
S3 doesn’t like this. During our testing, S3 consistently terminated connections during the TLS handshake. After digging through logs and packet captures, we concluded this wasn’t a Squid bug—it’s fundamental to how transparent proxies work. Any transparent proxy that intercepts TCP will exhibit the same behaviour with S3.
Interestingly, Squid in explicit proxy mode works fine with S3. The difference is that in explicit mode, the client knows about the proxy and establishes a CONNECT tunnel. The proxy doesn’t split the TCP session; it just relays bytes.
But we couldn’t easily switch to explicit proxy mode for all traffic. Our architecture relied on transparent interception.
Attempt #2: The Transit VPC Approach
Since transparent proxying was off the table for S3, we decided to route around Squid entirely for this bucket. The plan:
- Create a Transit VPC in us-west-2 (the bucket’s region)
- Set up VPC peering from our Databricks VPC to the Transit VPC
- Deploy an S3 Interface Endpoint in the Transit VPC
- Create a Route53 Private Hosted Zone (PHZ) in the Databricks VPC with a DNS record for the regional bucket endpoint: our-bucket.s3.us-west-2.amazonaws.com
With this setup, DNS queries for the regional bucket name would resolve to the S3 Interface Endpoint in the Transit VPC, and traffic would flow through the VPC peering connection—bypassing CEG entirely.
We tested this in our sandbox environment:
aws s3 ls s3://our-bucket --region us-west-2
It worked. The regional endpoint request went through the Transit VPC, hit the S3 Interface Endpoint, and returned results.
We deployed to production—and it failed.
Discovery #2: Databricks Does Region Discovery First
When Databricks accesses an S3 bucket, it doesn’t go directly to the regional endpoint. First, it queries the global endpoint s3.amazonaws.com to discover which region the bucket lives in. Only after that does it make regional requests.
Our PHZ only covered our-bucket.s3.us-west-2.amazonaws.com. The initial request to our-bucket.s3.amazonaws.com still went to CEG—and got blocked by our transparent proxy issue.
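The gap is easier to see when the two hostnames are laid out next to the PHZ records. The sketch below is a simplified model, not Databricks code: it assumes exact-name PHZ records, and the function names and route labels are made up for illustration.

```python
from typing import Dict, List


def s3_request_hostnames(bucket: str, region: str) -> List[str]:
    """Hostnames contacted in order: global endpoint first, then regional."""
    return [
        f"{bucket}.s3.amazonaws.com",           # region discovery
        f"{bucket}.s3.{region}.amazonaws.com",  # regional data requests
    ]


def route_for(hostname: str, phz_records: Dict[str, str]) -> str:
    """Return the PHZ target for a hostname, or the default route via CEG."""
    return phz_records.get(hostname, "default route via CEG")


# Only the regional name has a PHZ record, as in our first Transit VPC attempt.
phz = {"our-bucket.s3.us-west-2.amazonaws.com": "S3 Interface Endpoint (Transit VPC)"}

for host in s3_request_hostnames("our-bucket", "us-west-2"):
    print(f"{host} -> {route_for(host, phz)}")
```

The first hostname falls through to the default route, which is exactly the request that hit the transparent proxy.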
Attempt #3: PHZ for the Global Endpoint
Okay, we decided, let’s add another PHZ. We created a zone for s3.amazonaws.com with a record for our-bucket.s3.amazonaws.com, pointing to the S3 Interface Endpoint in the Transit VPC.
Discovery #3: Certificate Mismatch
The regional S3 Interface Endpoint presents a certificate for its region—*.s3.us-west-2.amazonaws.com. When a client connects expecting s3.amazonaws.com, the certificate doesn’t match. TLS handshake fails.
There’s no way to make a regional endpoint respond with a global certificate. This approach was dead.
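The mismatch follows from how wildcard certificate names are compared with hostnames. The function below is a simplified sketch of that comparison, assuming the wildcard may only replace the whole leftmost label. It is illustrative only and not the matcher any particular TLS library uses.

```python
def hostname_matches(pattern: str, hostname: str) -> bool:
    """Compare a certificate name with a hostname.

    A leading '*' stands for exactly one leftmost label. Partial wildcards
    such as 'b*.example.com' are deliberately not handled in this sketch.
    """
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        return bool(host_labels[0]) and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


regional_cert = "*.s3.us-west-2.amazonaws.com"
print(hostname_matches(regional_cert, "our-bucket.s3.us-west-2.amazonaws.com"))
print(hostname_matches(regional_cert, "our-bucket.s3.amazonaws.com"))
```

The regional name has the same number of labels as the certificate name and matches. The global name is one label shorter, so no wildcard substitution can make it fit.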
Attempt #4: Skip Region Discovery via S3A Configuration
What if we told Spark to skip region discovery entirely and go straight to the regional endpoint? The S3A filesystem supports per-bucket endpoint configuration:
spark.hadoop.fs.s3a.bucket.our-bucket.endpoint = s3.us-west-2.amazonaws.com
We created a global init script to inject this into /databricks/spark/conf/spark-defaults.conf at cluster startup.
Discovery #4: Standard Clusters Ignore Global Init Scripts
Databricks clusters with the “Standard” access mode (used with Unity Catalog) ignore global init scripts during bootstrap. The script never ran. The configuration never applied.
This is documented behaviour, but not obvious when you’re troubleshooting why your carefully crafted init script has no effect.
Attempt #5: Cluster Policy with S3A Proxy Settings
Time for a different angle. If we can’t avoid region discovery, maybe we can route it through an explicit proxy. We already knew explicit proxy mode worked with S3.
So, we created a cluster policy that injected these Spark configurations:
spark.hadoop.fs.s3a.proxy.host = proxy.ceg.internal
spark.hadoop.fs.s3a.proxy.port = 3128
spark.hadoop.fs.s3a.proxy.ssl.enabled = false
spark.hadoop.fs.s3a.cross.region.access.enabled = true
spark.hadoop.fs.s3a.proxy.non.proxy.hosts = *.s3.us-east-2.amazonaws.com
The proxy.ceg.internal hostname resolved (via Route53 PHZ) to our CEG instances’ private IPs.
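For readers who have not written one, a policy definition pinning these values might look roughly like the output of the sketch below. It assumes each setting is enforced with a `fixed` policy type, which is our illustration rather than a statement of the exact policy we deployed. The helper name `build_policy_definition` is hypothetical.

```python
import json
from typing import Dict

# The Spark settings listed above, expressed as strings.
S3A_PROXY_SETTINGS: Dict[str, str] = {
    "spark.hadoop.fs.s3a.proxy.host": "proxy.ceg.internal",
    "spark.hadoop.fs.s3a.proxy.port": "3128",
    "spark.hadoop.fs.s3a.proxy.ssl.enabled": "false",
    "spark.hadoop.fs.s3a.cross.region.access.enabled": "true",
    "spark.hadoop.fs.s3a.proxy.non.proxy.hosts": "*.s3.us-east-2.amazonaws.com",
}


def build_policy_definition(spark_settings: Dict[str, str]) -> Dict[str, dict]:
    """Map each Spark setting to a 'spark_conf.<key>' attribute with a fixed value."""
    return {
        f"spark_conf.{key}": {"type": "fixed", "value": value}
        for key, value in spark_settings.items()
    }


print(json.dumps(build_policy_definition(S3A_PROXY_SETTINGS), indent=2))
```

The resulting JSON can be pasted into a policy definition or passed to whatever tooling manages cluster policies (Terraform in our case).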
We tested with direct S3 access:
display(dbutils.fs.ls("s3://our-bucket/"))
It worked! The request went through the explicit proxy, region discovery succeeded, and we could list the bucket contents.
Then we connected the bucket as a Unity Catalog external volume and tried again.
It failed. The traffic went back through the transparent proxy path.
Discovery #5: Unity Catalog Ignores S3A Configuration
This was the critical finding. When you access S3 directly via s3:// paths, Spark uses the S3A filesystem, which honours spark.hadoop.fs.s3a.* settings.
But Unity Catalog External Volumes don’t use S3A. They use their own internal S3 client—DeltaSharingFileSystem or UnityFS—implemented in Java with credential vending from the Unity Catalog service. This code path completely ignores spark.hadoop.fs.s3a.* configurations.
No amount of S3A tuning would help us with Unity Catalog.
Attempt #6: System-Level Proxy Environment Variables
If Spark configurations don’t reach Unity Catalog’s S3 client, what about JVM-level settings? The standard HTTPS_PROXY and NO_PROXY environment variables are honoured by most Java HTTP clients.
We removed the S3A proxy settings and added environment variables to the cluster configuration:
NO_PROXY=169.254.169.254,*.s3.us-east-2.amazonaws.com
HTTPS_PROXY=http://proxy.ceg.internal:3128
The NO_PROXY setting ensures that same-region S3 traffic and instance metadata requests bypass the proxy.
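To make the intended effect concrete, here is a small sketch of the bypass decision we expected a client to make. It is an assumption-laden model: real HTTP clients differ in which NO_PROXY syntax they accept (some ignore the `*.` wildcard form), and the function name `proxy_for` is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

HTTPS_PROXY = "http://proxy.ceg.internal:3128"
NO_PROXY = "169.254.169.254,*.s3.us-east-2.amazonaws.com"


def proxy_for(host: str, https_proxy: str, no_proxy: str) -> Optional[str]:
    """Return the proxy URL to use for a host, or None for a direct connection."""
    host = host.lower()
    for pattern in no_proxy.split(","):
        pattern = pattern.strip().lower()
        # Shell-style wildcard match; an illustrative choice, not a standard.
        if pattern and fnmatchcase(host, pattern):
            return None
    return https_proxy


for name in (
    "169.254.169.254",                       # instance metadata
    "some-bucket.s3.us-east-2.amazonaws.com",  # same-region S3
    "our-bucket.s3.amazonaws.com",             # region discovery
    "our-bucket.s3.us-west-2.amazonaws.com",   # cross-region S3
):
    print(name, "->", proxy_for(name, HTTPS_PROXY, NO_PROXY) or "direct")
```

Only the last two hostnames are sent to the explicit proxy, which is the split we wanted.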
We tested both scenarios:
- Direct bucket access with dbutils.fs.ls("s3://our-bucket/"): worked
- Unity Catalog external volume access: worked
We even tested a public bucket in another region (athena-examples) for good measure. It worked too.
Discovery #6: HTTPS_PROXY Works at JVM/System Level
The key insight: HTTPS_PROXY and NO_PROXY environment variables operate at the JVM and system level, affecting all HTTP clients regardless of whether they’re S3A, Unity Catalog’s internal client, or anything else. This covers both code paths.
The Plot Twist: Shell and Python Are Blocked
Victory was short-lived. While S3 access worked, we discovered that shell commands and Python code on the cluster also picked up the HTTPS_PROXY setting—and then failed.
Testing revealed an interesting asymmetry: the Databricks cluster node blocks outbound connections from shell and Python processes on non-standard ports like 3128. When a shell script or Python requests library tries to connect to the proxy, the connection is reset before it leaves the node. Yet S3 traffic through the JVM’s HTTP client works fine on port 3128.
The exact mechanism isn’t fully documented, but the practical implication is clear: Databricks nodes enforce port restrictions differently for JVM processes versus shell/Python subprocesses. This is likely a security feature to prevent arbitrary outbound connections from user code.
This meant we couldn’t use HTTPS_PROXY as a general solution—it worked for S3 but broke other tooling.
Alternative Path: IP-Based Routing Around Squid
We considered another approach: bypass Squid entirely for S3 traffic using IP-based routing.
AWS publishes S3 IP ranges in their ip-ranges.json file. We built a mechanism to:
- Download S3 CIDR blocks for the GLOBAL region
- Add these CIDRs to an ipset
- Configure iptables to route matching traffic directly to the internet gateway, bypassing Squid.
This handles the region discovery request (which goes to the global endpoint). After discovery, regional requests go through our existing Transit VPC + PHZ setup.
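The download-and-filter step can be sketched as below. The `prefixes`, `ip_prefix`, `region`, and `service` fields are those of the published file. The function names are hypothetical, and the sample entries use documentation address ranges rather than real AWS prefixes.

```python
import json
import urllib.request
from typing import Any, Dict, List

IP_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"


def load_ip_ranges(url: str = IP_RANGES_URL) -> Dict[str, Any]:
    """Download and parse the published AWS IP ranges file."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def s3_cidrs(ip_ranges: Dict[str, Any], region: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 CIDRs for S3 in one region label."""
    return sorted(
        {
            prefix["ip_prefix"]
            for prefix in ip_ranges.get("prefixes", [])
            if prefix.get("service") == "S3" and prefix.get("region") == region
        }
    )


# Made-up sample that mirrors the file's structure (not real AWS ranges).
sample = {
    "prefixes": [
        {"ip_prefix": "198.51.100.0/24", "region": "GLOBAL", "service": "S3"},
        {"ip_prefix": "203.0.113.0/24", "region": "us-west-2", "service": "S3"},
        {"ip_prefix": "192.0.2.0/24", "region": "GLOBAL", "service": "AMAZON"},
    ]
}
print(s3_cidrs(sample, "GLOBAL"))
```

In production the result of `s3_cidrs(load_ip_ranges(), ...)` would feed the ipset, refreshed on a schedule because the published ranges change over time.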
Security is maintained through VPC endpoint policies on the Transit VPC’s S3 Interface Endpoint.
This works, but has a significant trade-off: every new region requires a new Transit VPC. And here’s where Terraform fights back.
The Terraform Provider Alias Problem
Creating Transit VPCs across regions requires multiple AWS provider configurations:
provider "aws" {
  alias  = "us-west-2"
  region = "us-west-2"
}
provider "aws" {
  alias  = "eu-west-1"
  region = "eu-west-1"
}
You can’t dynamically generate provider aliases in Terraform. Each new region requires:
- A new provider block
- A new module instantiation with explicit provider reference
- Manual code changes—no elegant for_each over regions
Plus, each S3 Interface Endpoint incurs ongoing costs. For accessing buckets in many regions, this approach becomes expensive and unwieldy.
The Final Solution: NFQUEUE and Suricata
We stepped back and asked: what’s really the problem with Squid?
The issue is TCP session splitting. Squid intercepts the connection and creates two separate sessions. S3 detects this and refuses to play along.
What if we could inspect traffic without splitting the connection? That’s where NFQUEUE comes in.
NFQUEUE: Packet Inspection in Userspace
NFQUEUE (Netfilter Queue) is a Linux kernel mechanism that passes packets to a userspace program for analysis. The program inspects the packet and returns a verdict: accept, drop, or modify.
Unlike transparent proxying, NFQUEUE doesn’t create a new TCP session. The original connection remains intact. The userspace program just observes and decides.
For TLS traffic, we can still read the SNI from the ClientHello—it’s sent in plaintext before encryption begins. We get SNI-based filtering without session splitting.
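To show how little is needed to read the SNI, here is a minimal parser sketch. It assumes the whole ClientHello arrives in a single TLS record within one payload and makes no attempt at reassembly. The names `extract_sni` and `verdict` are hypothetical, and a real filter would also have to let the remaining packets of an approved flow through.

```python
import struct
from typing import Optional, Set


def extract_sni(payload: bytes) -> Optional[str]:
    """Return the SNI hostname from a TLS ClientHello, or None if absent."""
    try:
        if payload[0] != 0x16:            # TLS record type: handshake
            return None
        pos = 5                           # skip 5-byte record header
        if payload[pos] != 0x01:          # handshake type: ClientHello
            return None
        pos += 4                          # handshake type + 3-byte length
        pos += 2 + 32                     # client version + random
        pos += 1 + payload[pos]           # session ID
        (cipher_len,) = struct.unpack_from("!H", payload, pos)
        pos += 2 + cipher_len             # cipher suites
        pos += 1 + payload[pos]           # compression methods
        (ext_total,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        end = min(pos + ext_total, len(payload))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", payload, pos)
            pos += 4
            if ext_type == 0:             # server_name extension
                _list_len, name_type, name_len = struct.unpack_from("!HBH", payload, pos)
                name = payload[pos + 5 : pos + 5 + name_len]
                if name_type != 0 or len(name) != name_len:
                    return None
                return name.decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None
    return None


def verdict(payload: bytes, allowlist: Set[str]) -> str:
    """Decide on a ClientHello: accept only if its SNI is allowlisted."""
    sni = extract_sni(payload)
    return "accept" if sni is not None and sni.lower() in allowlist else "drop"
```

The parser only reads bytes and never answers the client, which is the property that matters here: the original TCP connection is left untouched.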
Enter Suricata
Suricata is an open-source network threat detection engine. It supports NFQUEUE mode and can match traffic against rules including SNI patterns.
We replaced Squid with Suricata on our CEG instances:
- iptables sends traffic to NFQUEUE instead of redirecting to a proxy port
- Suricata receives packets, inspects them, and returns verdicts
- Allowed traffic passes through unchanged
- Blocked traffic is dropped
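An allowlist can be turned into Suricata rules with a small generator like the sketch below. The rule shape (a `pass` rule per exact SNI match followed by a catch-all `drop` for TLS) is one plausible style, not a copy of our production ruleset. The function name and starting SID are arbitrary, and any generated rules should be tested in your own environment before use.

```python
from typing import Iterable, List


def build_sni_rules(domains: Iterable[str], first_sid: int = 1000001) -> List[str]:
    """Generate one pass rule per allowlisted SNI plus a final TLS drop rule."""
    rules = []
    sid = first_sid
    for domain in sorted({d.strip().lower() for d in domains if d.strip()}):
        rules.append(
            f'pass tls any any -> any any (msg:"allow {domain}"; '
            f'tls.sni; content:"{domain}"; nocase; startswith; endswith; '
            f"sid:{sid}; rev:1;)"
        )
        sid += 1
    # Anything that is TLS but matched no pass rule is dropped.
    rules.append(
        f'drop tls any any -> any any (msg:"TLS SNI not in allowlist"; sid:{sid}; rev:1;)'
    )
    return rules


for rule in build_sni_rules(
    ["our-bucket.s3.amazonaws.com", "our-bucket.s3.us-west-2.amazonaws.com"]
):
    print(rule)
```

Combining `startswith` and `endswith` on the `tls.sni` buffer gives an exact hostname match, so an allowlisted name cannot be satisfied by a longer hostname that merely contains it.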
The result: all HTTP and HTTPS traffic, including S3, is now handled correctly. No session splitting. No TLS failures. No special-casing for specific services.
It’s worth noting that AWS offers a managed solution for this: AWS Network Firewall, which—perhaps unsurprisingly—runs Suricata under the hood. It would give us the same SNI-based filtering without the proxy headaches. But Network Firewall isn’t cheap: roughly $0.395/hour per endpoint, plus data processing charges. In a multi-AZ setup, that adds up fast. Our self-managed CEG running Suricata on EC2 achieves the same result at a fraction of the cost—we just own the operational overhead.
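As a rough illustration of how the endpoint charge scales, the calculation below uses the hourly figure quoted above and assumes a 730-hour month. It ignores data processing charges, and the number of availability zones is a parameter, not a statement about our deployment.

```python
HOURLY_ENDPOINT_RATE_USD = 0.395  # figure quoted above
HOURS_PER_MONTH = 730             # assumption: average month length


def monthly_endpoint_cost(az_count: int) -> float:
    """Fixed monthly endpoint cost in USD, excluding data processing."""
    return round(HOURLY_ENDPOINT_RATE_USD * HOURS_PER_MONTH * az_count, 2)


for az_count in (1, 2, 3):
    print(f"{az_count} AZ(s): ${monthly_endpoint_cost(az_count):.2f} per month")
```

That works out to roughly $288 per endpoint per month before any traffic is processed.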
Key Takeaways
This journey revealed several important lessons about Databricks networking:
Configuration Layers and What They Control
| Layer | Scope | Affects S3 access? | Notes |
|---|---|---|---|
| Global Init Scripts | Cluster bootstrap | No | Ignored by clusters in Standard access mode (used with Unity Catalog) |
| Cluster Policy (Spark config) | Spark jobs | S3A paths only | Unity Catalog external volumes ignore spark.hadoop.fs.s3a.* settings |
| Environment Variables | JVM/System | Yes (via HTTPS_PROXY) | Covers both S3A and Unity Catalog, but broke shell and Python tooling on our clusters |
| Network-level (iptables/routing) | All traffic | Yes | Most comprehensive but most complex |
Unity Catalog’s S3 Client Is Different
Unity Catalog External Volumes use DeltaSharingFileSystem/UnityFS, not Spark’s S3A. Any configuration targeting spark.hadoop.fs.s3a.* will be ignored for UC external volume access. This is a critical architectural detail that’s not obvious from documentation.
Transparent Proxies and S3 Don’t Mix
Any proxy that splits TCP sessions will break S3 access. This includes Squid in transparent mode, and likely other transparent proxy solutions. If you must proxy S3 traffic, use explicit proxy mode or NFQUEUE-based inspection.
NFQUEUE Offers the Best of Both Worlds
With NFQUEUE + Suricata, you get:
- SNI-based filtering for TLS traffic
- No TCP session splitting
- Compatibility with S3 and other sensitive protocols
- Transparent operation (clients don’t need proxy configuration)
Transit VPC Is Powerful but Has Limits
The Transit VPC pattern works well for accessing specific buckets in other regions. But scaling it across many regions hits Terraform’s provider alias limitations and incurs per-endpoint costs. It’s best suited for known, stable cross-region access patterns rather than dynamic multi-region requirements.
Conclusion
This project illustrates an important reality of modern Databricks implementations.
The most difficult challenges rarely involve writing Spark code or configuring notebooks. They emerge at the boundaries between platforms, cloud infrastructure, security requirements, governance models and operational constraints.
Organisations invest in Databricks to accelerate analytics, AI and data-driven decision making. Achieving those goals often requires solving problems that extend far beyond the platform itself.
At Software Planet Group, we specialise in helping organisations navigate these complex engineering challenges. Our focus is not simply on implementing technology, but on understanding the underlying problem, evaluating alternatives and designing solutions that remain effective as business requirements evolve.
Sometimes that means building a data platform. Sometimes it means redesigning infrastructure. And occasionally, as in this case, it means rethinking an entire architectural approach to uncover a simpler and more robust solution.
Have you encountered similar challenges with locked-down Databricks environments? We’d love to hear about your approaches—especially if you’ve found alternatives to the patterns described here.