Vertex AI Training jobs can’t connect to W&B.
The google.cloud.aiplatform.training.v1.TrainingService.CreateTrainingJob RPC is failing because the Vertex AI training worker nodes cannot establish a network connection to the Weights & Biases (W&B) backend servers. This is problematic because without this connection, your training runs won’t log metrics, artifacts, or any other valuable data to W&B, rendering your experiment tracking useless.
Here’s a breakdown of common causes and how to fix them:
- W&B API Key/Entity Missing or Incorrect:
  - Diagnosis: Check your W&B environment variables or `wandb-settings.yaml` file within your training container. Look for `WANDB_API_KEY` and `WANDB_ENTITY`. If they’re absent or incorrect, W&B won’t authenticate.
  - Fix: Set these environment variables in your Vertex AI training job definition. For example, in your Python training script, you might have:

    ```python
    import os

    os.environ["WANDB_API_KEY"] = "YOUR_ACTUAL_WANDB_API_KEY"
    os.environ["WANDB_ENTITY"] = "your-wandb-entity"
    ```

    Alternatively, if you’re using a `wandb-settings.yaml` file, ensure it’s correctly populated and mounted into your container.
  - Why it works: W&B uses your API key for authentication. Without a valid key and associated entity, the W&B servers will reject the connection, preventing any logging.
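A fail-fast check at the top of the training script can surface missing credentials before any training time is spent. The following is a minimal sketch of that idea. The helper name `missing_wandb_vars` and the placeholder set are illustrative assumptions, not part of the W&B SDK.

```python
import os

# Variable names taken from the diagnosis step above.
REQUIRED_VARS = ("WANDB_API_KEY", "WANDB_ENTITY")

# Placeholder copied from the example above; treated as "not set" (assumption).
PLACEHOLDERS = {"YOUR_ACTUAL_WANDB_API_KEY"}


def missing_wandb_vars(env=None):
    """Return the required variable names that are unset, blank, or placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = (env.get(name) or "").strip()
        if not value or value in PLACEHOLDERS:
            missing.append(name)
    return missing
```

Calling `missing_wandb_vars()` at startup and raising an error when the returned list is non-empty makes the job fail within seconds with a clear message, rather than partway through training.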
- Network Egress Restrictions (VPC Service Controls, Firewall Rules):
  - Diagnosis: Vertex AI training jobs often run within a Virtual Private Cloud (VPC). If your VPC has strict egress firewall rules or VPC Service Controls configured, outbound traffic to W&B’s public endpoints (`api.wandb.ai`, `files.wandb.ai`) might be blocked.
  - Fix:
    - Firewall Rules: In Google Cloud Console, navigate to VPC network -> Firewall. Create an egress firewall rule allowing TCP traffic on ports 443 (HTTPS) and 80 (HTTP) from your Vertex AI worker IP ranges to `api.wandb.ai` and `files.wandb.ai`.
    - VPC Service Controls: If VPC Service Controls are in place, you’ll need to create an access policy that allows your Vertex AI service perimeter to communicate with W&B’s public endpoints. This often involves configuring authorized networks or identity-aware proxies. Consult the Google Cloud documentation for the most up-to-date methods for allowing egress to specific external services.
  - Why it works: These configurations explicitly permit network traffic from your Vertex AI environment to reach the W&B servers, bypassing any blanket blocking.
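To confirm whether egress is actually blocked, a small probe run inside the training container can attempt a TCP connection to each endpoint. This is a rough sketch: the function names are hypothetical, and a successful TCP connect only shows that the port is reachable, not that TLS or authentication will succeed.

```python
import socket

# Endpoints and port taken from the section above.
WANDB_HOSTS = ("api.wandb.ai", "files.wandb.ai")


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def egress_report(hosts=WANDB_HOSTS, port=443):
    """Map each host to whether a TCP connection on the given port succeeded."""
    return {host: can_connect(host, port) for host in hosts}
```

Printing `egress_report()` at the start of a job gives a quick signal in the job logs: a `False` for either host points toward firewall, routing, or DNS problems rather than credentials.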
- Proxy Server Misconfiguration:
  - Diagnosis: If your Vertex AI environment requires an HTTP/HTTPS proxy for outbound internet access, ensure that the `HTTP_PROXY` and `HTTPS_PROXY` environment variables are correctly set for your training job, and that the proxy itself allows connections to W&B domains.
  - Fix: Set the proxy environment variables in your Vertex AI training job configuration:

    ```shell
    export HTTP_PROXY="http://your-proxy.example.com:8080"
    export HTTPS_PROXY="http://your-proxy.example.com:8080"
    ```

    Verify that your proxy server’s access control lists (ACLs) permit traffic to `api.wandb.ai` and `files.wandb.ai` on port 443.
  - Why it works: The W&B client library respects these standard proxy environment variables. By routing traffic through the correctly configured proxy, the connection to W&B is successfully established.
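If the job is defined through the Vertex AI Python SDK rather than a shell, the same variables can be passed in the container's `env` list. The sketch below only builds the dictionary and assumes the `worker_pool_specs` shape accepted by `aiplatform.CustomJob`. The helper name, image URI, machine type, and proxy URL are all placeholders.

```python
def build_worker_pool_specs(image_uri, env, machine_type="n1-standard-4"):
    """Build a single-replica worker pool spec that passes env vars to the container."""
    return [
        {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": 1,
            "container_spec": {
                "image_uri": image_uri,
                # The container spec takes a list of {"name": ..., "value": ...} entries.
                "env": [{"name": k, "value": v} for k, v in sorted(env.items())],
            },
        }
    ]


proxy = "http://your-proxy.example.com:8080"  # placeholder from the example above
specs = build_worker_pool_specs(
    image_uri="us-docker.pkg.dev/your-project/your-repo/trainer:latest",  # hypothetical
    env={"HTTP_PROXY": proxy, "HTTPS_PROXY": proxy},
)
```

The resulting `specs` list could then be passed as the `worker_pool_specs` argument when constructing the custom job, so the proxy settings live in the job definition instead of being baked into the image.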
- W&B Project/Run Name Conflicts or Permissions:
  - Diagnosis: While less common for initial connection failures, ensure the W&B entity and project you’re trying to log to exist and that your API key has write permissions for that project. Sometimes, a malformed project or entity name in the environment variables can cause issues.
  - Fix: Double-check the spelling of your W&B entity and project names. Log into your W&B account to confirm the project exists and that your API key is associated with an account that has permissions to write to it. If necessary, regenerate your API key from your W&B user settings.
  - Why it works: W&B’s backend requires a valid target project and entity for logging. Incorrect names or insufficient permissions will lead to authentication or authorization failures, preventing the connection from completing.
- Incorrect W&B SDK Version or Installation:
  - Diagnosis: An outdated or improperly installed W&B SDK within your Docker image can lead to unexpected connection errors, especially if W&B backend APIs have changed.
  - Fix: Ensure you are using a recent, stable version of the W&B Python SDK. In your `requirements.txt` or Dockerfile, specify a version like `wandb==0.15.8` (or the latest). Rebuild your Docker image after updating.
  - Why it works: Newer SDK versions are tested against current W&B backend versions and include bug fixes and updated networking logic that might be necessary for successful communication.
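To see which SDK version actually ended up in the image, the installed package metadata can be queried at startup. This is a small sketch using only the standard library; the helper name `installed_version` is an illustrative assumption.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("wandb")
print("wandb version:", version if version else "not installed")
```

Logging this line in every job makes it easy to spot an image that was not rebuilt after the pinned version changed.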
- DNS Resolution Issues:
  - Diagnosis: The Vertex AI worker nodes might be unable to resolve the DNS names for W&B’s servers (`api.wandb.ai`, `files.wandb.ai`). This can happen if your VPC’s DNS configuration is non-standard or if there are network routing problems.
  - Fix: Test DNS resolution from within a similar environment. You could launch a temporary Vertex AI custom job with a simple script that tries to `ping api.wandb.ai` or perform a `curl https://api.wandb.ai`. If resolution fails, examine your VPC’s DNS settings (e.g., Cloud DNS policies, custom name servers) and ensure they can resolve public internet domains.
  - Why it works: Successful DNS resolution is the prerequisite for establishing any network connection. Fixing DNS allows the worker nodes to find the IP addresses of W&B’s servers.
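When `ping` or `curl` is not installed in the training image, the same resolution test can be done from Python. The following sketch uses the standard `socket` module; the function names are hypothetical, and an empty list for a host indicates that name resolution failed in that environment.

```python
import socket


def resolve(host, port=443):
    """Return the sorted, de-duplicated IP addresses for a hostname ([] on failure)."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def dns_report(hosts=("api.wandb.ai", "files.wandb.ai")):
    """Map each W&B hostname from the section above to its resolved addresses."""
    return {host: resolve(host) for host in hosts}
```

Running `dns_report()` inside the temporary custom job separates DNS failures (empty lists) from later-stage problems such as blocked ports or proxy rejections.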
Once all of these are fixed, the next error you may hit is a google.api.rpc.code.Code.UNAUTHENTICATED error. This can occur when you’re using a private W&B setup and haven’t configured service account impersonation, so your Google Cloud service account lacks the permissions needed to write to your W&B project.