You can configure W&B alerts to notify Slack and email when a metric crosses a certain threshold during your training runs.
Let’s see it in action. Imagine you’re training a deep learning model for image classification and you want to be notified if the validation accuracy drops below 80% or if the training loss spikes above 5.0.
Here’s how you’d set up an alert in the W&B UI:
- Navigate to your Project Settings: Go to your W&B project page and click on the "Settings" tab.
- Go to the Alerts Section: In the settings menu, find and click on "Alerts."
- Create a New Alert: Click the "Create Alert" button.
- Name Your Alert: Give it a descriptive name, like "Bad Performance Alert."
- Select the Trigger:
  - Metric: Choose the metric you want to monitor. For example, `val_accuracy` or `train_loss`.
  - Condition: Select the comparison operator. For `val_accuracy`, you’d choose "Less than" (<). For `train_loss`, you’d choose "Greater than" (>).
  - Threshold: Enter the value. For `val_accuracy`, enter `0.80`. For `train_loss`, enter `5.0`.
- Add Conditions (Optional): You can add multiple conditions. For example, you might want to trigger if either `val_accuracy < 0.80` or `train_loss > 5.0`. You can specify whether all conditions must be met or whether any single condition is enough.
- Configure Notifications:
  - Slack:
    - Click "Add Slack Channel."
    - You’ll be prompted to authorize W&B to post to your Slack workspace. Follow the on-screen instructions to connect your Slack account.
    - Select the channel where you want to receive notifications (e.g., `#ml-alerts`).
  - Email:
    - Click "Add Email."
    - Enter the email addresses you want to notify (e.g., `your.email@example.com`).
- Save the Alert: Click the "Create Alert" or "Save" button.
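To make the trigger logic concrete, here is a minimal sketch in plain Python of how conditions like the ones above combine under "any" vs. "all" semantics. This is illustrative only, not the W&B API; names like `AlertCondition` and `alert_fires` are hypothetical.

```python
import operator
from dataclasses import dataclass

OPS = {"<": operator.lt, ">": operator.gt}

@dataclass
class AlertCondition:
    metric: str
    op: str          # "<" or ">"
    threshold: float

    def met(self, logged: dict) -> bool:
        # A condition only fires if the metric was actually logged.
        value = logged.get(self.metric)
        return value is not None and OPS[self.op](value, self.threshold)

def alert_fires(conditions, logged, mode="any"):
    """Combine conditions: fire if any is met, or require all of them."""
    results = [c.met(logged) for c in conditions]
    return any(results) if mode == "any" else all(results)

conditions = [
    AlertCondition("val_accuracy", "<", 0.80),
    AlertCondition("train_loss", ">", 5.0),
]

print(alert_fires(conditions, {"val_accuracy": 0.75, "train_loss": 1.2}))               # True
print(alert_fires(conditions, {"val_accuracy": 0.75, "train_loss": 1.2}, mode="all"))   # False
```

Note the "any" vs. "all" distinction: with "any", a single bad metric fires the alert, which is usually what you want for regression detection.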
Once saved, this alert will monitor all new runs within that project. When a run’s logged metric meets the defined threshold, W&B will send a notification to your configured Slack channel and email addresses. The notification will include the run ID, the metric value, and a link to the run’s dashboard for immediate investigation.
This system is powerful because it automates the monitoring of your experiments, allowing you to catch regressions or performance degradation early without constantly staring at dashboards. It integrates with your existing workflows by pushing notifications to familiar communication channels.
The underlying mechanism involves W&B’s backend periodically polling the metrics logged by active runs. When a logged metric value, for a given run, satisfies the defined condition (e.g., val_accuracy falls below 0.80), the alert is triggered. This polling happens at a regular interval, typically every minute, ensuring near real-time feedback.
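The polling pass can be pictured with a short sketch. This is a simplification, not W&B's actual backend (which is not public); the run representation and function names here are made up for illustration.

```python
def latest_metrics(run):
    """Return the most recently logged metric dict for a run (stub)."""
    return run["history"][-1] if run["history"] else {}

def poll_once(runs, metric, op, threshold):
    """One polling pass: collect IDs of runs whose latest value violates the threshold."""
    triggered = []
    for run in runs:
        value = latest_metrics(run).get(metric)
        if value is not None and (value < threshold if op == "<" else value > threshold):
            triggered.append(run["id"])
    return triggered

runs = [
    {"id": "run-1", "history": [{"val_accuracy": 0.85}, {"val_accuracy": 0.78}]},
    {"id": "run-2", "history": [{"val_accuracy": 0.91}]},
]

print(poll_once(runs, "val_accuracy", "<", 0.80))  # ['run-1']
# A real poller would repeat this pass on a schedule, e.g. once a minute.
```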
A common point of confusion is how W&B handles metrics that are only logged periodically, like validation metrics that might be computed every N epochs. The alert will trigger as soon as the first logged value that meets the threshold condition is observed for that metric in a run. If the metric value later improves and crosses back over the threshold, the alert will not re-trigger for that specific run unless you configure a separate "recovery" alert or the alert is set to trigger on any violation.
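The fire-once-per-run behavior can be sketched like this (again hypothetical, not W&B internals): a set of run IDs that have already alerted suppresses repeat notifications, even if the metric recovers and then violates the threshold a second time.

```python
already_alerted = set()  # run IDs that have already fired this alert

def maybe_alert(run_id, value, threshold=0.80):
    """Fire at most once per run, even if the metric recovers and dips again."""
    if run_id in already_alerted:
        return False
    if value < threshold:
        already_alerted.add(run_id)
        return True
    return False

# Metric dips, recovers, then dips again -- only the first dip fires.
print([maybe_alert("run-1", v) for v in [0.85, 0.75, 0.90, 0.70]])
# [False, True, False, False]
```

A "recovery" alert would be the mirror image: remove the run from the set (or track state transitions) once the metric crosses back above the threshold.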
The next concept to explore is setting up custom dashboards to visualize and compare your most important metrics across multiple runs.