SSH keys are the backbone of secure access for most enterprises, but managing them at scale across thousands of hosts can quickly become a chaotic nightmare.

Here’s how it actually works, and how to tame the beast:

Imagine you’ve got a new developer, Alice, joining the team. She needs access to a dozen servers to deploy her new microservice. Traditionally, this means:

  1. Alice generates an SSH key pair.
  2. She gives her public key to each server administrator.
  3. Each administrator manually adds Alice’s public key to ~/.ssh/authorized_keys on their respective servers.
  4. If Alice leaves, each administrator has to find and remove her key.

Now, multiply this by hundreds of developers, thousands of servers, and a constant churn of new hires and departures. You end up with:

  • Stale keys: Keys of former employees lingering on servers, creating security holes.
  • Key sprawl: Alice has one public key on server A and a different, older key on server B, and nobody knows which one is current.
  • Manual toil: Admins spend hours copying and pasting keys.
  • Lack of auditability: Who has access to what, and when was it granted?

The core problem is that SSH’s default mechanism (authorized_keys) is designed for individual, manual management, not for enterprise-wide orchestration.

The Solution: Centralized SSH Key Management

The goal is to treat SSH keys like any other managed resource: provisioned, audited, and revoked automatically. This is typically achieved with a combination of tools and principles.

1. Centralized Identity Provider (IdP) Integration

Your IdP (like Okta, Azure AD, or Keycloak) is the single source of truth for user identities. The key is to leverage this for SSH access.

  • How it works: Instead of managing individual keys on each host, you configure hosts to trust SSH certificates signed by a certificate authority (CA) that is tied to your IdP. Users authenticate to the IdP and, if authorized, receive a short-lived SSH certificate from the CA.
  • Diagnosis: If a user can’t log in, check the IdP logs for authentication failures. On the client, ssh -v shows whether a certificate was offered and accepted; on the server, the sshd logs show why a certificate was rejected.
  • Fix: Ensure the issuing path works end to end. Most IdPs do not sign SSH certificates themselves, so this usually means running a dedicated SSH CA that accepts the IdP’s login (for example via OIDC) as proof of identity before it signs a user’s key. HashiCorp Vault’s SSH secrets engine, Smallstep’s step-ca, and Teleport are common choices for that CA role.
  • Why it works: The server only needs to trust the CA’s public key, not every individual user’s public key. The certificate contains user identity and authorizes access for its duration.
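
The hand-off from IdP to CA can be sketched as a small policy function: verified IdP claims go in, and a certificate request (or a refusal) comes out. This is only an illustration under assumptions. The claim names (email, groups, active), the group-to-principal mapping, and the CertRequest type are all hypothetical and do not correspond to any particular IdP or CA product.

```python
from dataclasses import dataclass

# Hypothetical mapping from IdP group names to SSH certificate principals.
GROUP_TO_PRINCIPAL = {
    "engineering": "developers",
    "security-audit": "auditors",
}


@dataclass(frozen=True)
class CertRequest:
    key_id: str          # identity recorded in the certificate
    principals: tuple    # names the certificate will carry
    validity_hours: int  # how long the certificate should live


def plan_certificate(claims: dict, validity_hours: int = 8):
    """Turn already-verified IdP claims into a certificate request.

    Returns None when the user is disabled or has no mapped group.
    """
    if not claims.get("active", False):
        return None
    principals = sorted(
        {GROUP_TO_PRINCIPAL[g] for g in claims.get("groups", []) if g in GROUP_TO_PRINCIPAL}
    )
    if not principals:
        return None
    return CertRequest(claims["email"], tuple(principals), validity_hours)
```

Disabling a user in the IdP flips the active claim, so the next request yields None and no new certificate is signed.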

2. SSH Certificate Authority (CA)

This is the heart of a scalable SSH key management system. You set up a CA that signs user SSH certificates.

  • How it works: Users obtain a public/private key pair. They submit their public key (or their IdP-authenticated identity) to the CA. The CA verifies their identity and issues a signed SSH certificate (by convention a file ending in -cert.pub, such as id_ed25519-cert.pub) that’s valid for a specific duration (e.g., 8 hours). The server is configured to trust the CA’s public key.
  • Diagnosis: Run ssh -v and look at the "Offering public key" and "Server accepts key" debug lines. In recent OpenSSH versions a certificate shows up with a key type ending in -CERT (for example ED25519-CERT), while a plain type such as RSA or ED25519 means the client fell back to a traditional key.
  • Fix: Ensure sshd_config on the target hosts contains TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pub, that this file holds the public key of your SSH CA, and that sshd has been reloaded since the change.
  • Why it works: The server trusts the CA, and the certificate is cryptographically signed by the CA, proving the user’s identity and authorization for a limited time.
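
A quick way to tell a certificate from a plain public key is the first field of the .pub file: OpenSSH certificate types end in -cert-v01@openssh.com. The helper below is a minimal sketch of that check; the function name is made up for illustration, and it assumes a one-line .pub file rather than an authorized_keys entry, which may start with options.

```python
CERT_SUFFIX = "-cert-v01@openssh.com"


def describe_public_file(line: str) -> str:
    """Classify one line of a .pub file as a certificate or a plain key."""
    fields = line.split()
    if not fields:
        raise ValueError("empty public key line")
    key_type = fields[0]
    if key_type.endswith(CERT_SUFFIX):
        # Strip the suffix to show the underlying key algorithm.
        return "certificate (" + key_type[: -len(CERT_SUFFIX)] + ")"
    return "plain key (" + key_type + ")"
```

If a user reports that certificate login is not working, running this over the file their client is offering shows whether they are presenting the signed certificate or only the original public key.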

3. Automated Provisioning and Deprovisioning (Orchestration)

Tools like Ansible, Chef, Puppet, or Terraform are essential for distributing the CA’s public key to all hosts and configuring sshd_config.

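  • How it works: These tools ensure that TrustedUserCAKeys is correctly set on all managed servers. When a user leaves the organization, their entry in the IdP is disabled, and they can no longer obtain new certificates. Any certificate they already hold stops working once it expires, so nothing has to be removed from individual hosts.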
  • Diagnosis: Check the sshd_config on a problematic host. Is TrustedUserCAKeys pointing to the correct file? Is the correct CA public key in that file?
  • Fix: Run your configuration management tool. For Ansible, it might be a playbook like:
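    # Sketch of a playbook; adjust hosts and the service name to your environment.
    - hosts: all
      become: true
      tasks:
        - name: Install the user CA public key
          ansible.builtin.copy:
            src: files/ssh_user_ca.pub
            dest: /etc/ssh/trusted-user-ca-keys.pub
            owner: root
            group: root
            mode: '0644'
          notify: restart sshd

        - name: Ensure sshd_config uses TrustedUserCAKeys
          ansible.builtin.lineinfile:
            path: /etc/ssh/sshd_config
            regexp: '^#?\s*TrustedUserCAKeys'
            line: 'TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pub'
            state: present
            # Refuse to write a config that sshd cannot parse.
            validate: '/usr/sbin/sshd -t -f %s'
          notify: restart sshd

      handlers:
        # The service is named "ssh" on Debian/Ubuntu and "sshd" on most others.
        - name: restart sshd
          ansible.builtin.service:
            name: sshd
            state: restarted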
    
  • Why it works: It automates the repetitive task of configuring SSH on every server, ensuring consistency and reducing human error.
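
The diagnosis step for a single host can also be scripted. The sketch below reads the text of an sshd_config and reports which file TrustedUserCAKeys points to. It is deliberately simplified: it only looks at the global section (it stops at the first Match block), ignores Include directives, and assumes the keyword and value are separated by whitespace. The function name is hypothetical.

```python
def trusted_user_ca_path(config_text: str):
    """Return the first global TrustedUserCAKeys value, or None if unset.

    sshd keywords are case-insensitive and the first value obtained wins,
    so the search stops at the first match.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out directives
        parts = line.split(None, 1)
        keyword = parts[0].lower()
        if keyword == "match":
            break  # conditional blocks are outside this sketch
        if keyword == "trustedusercakeys" and len(parts) == 2:
            return parts[1].strip()
    return None
```

A result of None on a host that should accept certificates usually means the configuration run never reached it, or the directive is still commented out.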

4. Role-Based Access Control (RBAC)

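Certificates prove identity; principals determine which accounts that identity may log in to.

  • How it works: Each certificate carries a list of principals (e.g., developers, auditors). By default, sshd accepts a certificate only if one of its principals matches the name of the account being logged in to. Setting AuthorizedPrincipalsFile in sshd_config replaces that rule with a per-account list of principals that are allowed to log in as that account.
  • Diagnosis: If the certificate is valid but the login is still refused, compare the principals shown by ssh-keygen -L with the AuthorizedPrincipalsFile for the target account. In this case the sshd log will typically say that the certificate does not contain an authorized principal.
  • Fix: Set AuthorizedPrincipalsFile /etc/ssh/authorized_principals/%u in sshd_config (sshd expands %u to the account name), then create one file per account listing the principals allowed to log in as it, one per line. For example, if the certificate has the principal webserver-admin, the file /etc/ssh/authorized_principals/alice would contain webserver-admin.
  • Why it works: It separates authentication from authorization. The CA vouches for who the user is, while each host decides which principals may use which accounts. What a user can do after login is still governed by that account’s normal permissions.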
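
The matching rule itself is simple enough to sketch: a login is allowed when at least one principal in the certificate also appears in the account's principals file. The code below is a simplified model of that rule, not sshd's implementation. It ignores the per-line options that a principals file may carry, and the function names are made up for illustration.

```python
def allowed_principals(file_text: str) -> set:
    """Parse a principals file: one name per line, '#' starts a comment line."""
    names = set()
    for raw in file_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names


def may_log_in(cert_principals, file_text: str) -> bool:
    """True if any certificate principal is listed for the target account."""
    return bool(set(cert_principals) & allowed_principals(file_text))
```

With the example above, a certificate carrying webserver-admin passes against a file containing webserver-admin, while a certificate carrying only developers does not.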

5. Short-Lived Certificates

This is a critical security practice.

  • How it works: Certificates are issued with a very short validity period, often just a few hours. Once a certificate expires, sshd rejects new logins with it, so access lapses without anyone having to remove a key. Sessions that are already open are not terminated.
  • Diagnosis: Check the Valid: line in the certificate details (you can inspect a user’s certificate with ssh-keygen -L -f /path/to/user_key-cert.pub).
  • Fix: When issuing certificates, set a short expiry. For example, ssh-keygen -s ca_key -I user_id -n principal -V +8h user_key.pub will issue a certificate valid for 8 hours, written to user_key-cert.pub.
  • Why it works: If a user’s machine is compromised or a certificate is leaked, the window of opportunity for an attacker is severely limited.
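
If issuance is scripted, it helps to build the ssh-keygen call in one place so the lifetime cannot be forgotten. The helper below is a minimal sketch that only assembles the argument list for the command shown in the Fix step; the function name and its parameters are hypothetical, and it does not run anything itself.

```python
def build_sign_command(ca_key, key_id, principals, hours, user_pubkey):
    """Assemble the ssh-keygen arguments for a short-lived user certificate."""
    if hours <= 0:
        raise ValueError("validity must be a positive number of hours")
    if not principals:
        raise ValueError("at least one principal is required")
    return [
        "ssh-keygen",
        "-s", ca_key,                # CA private key used for signing
        "-I", key_id,                # key identity, recorded in logs
        "-n", ",".join(principals),  # comma-separated principals
        "-V", f"+{hours}h",          # valid from now for the given hours
        user_pubkey,
    ]
```

On the CA host the resulting list could be passed to subprocess.run(cmd, check=True), which would write the certificate next to the public key.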

6. Monitoring and Auditing

Log everything.

  • How it works: SSH daemon logs (/var/log/auth.log on Debian and Ubuntu, /var/log/secure on RHEL-family systems) should be collected and sent to a central logging system. This allows you to track who logged in, when, from where, and whether certificate-based authentication was used.
  • Diagnosis: Search logs for Accepted certificate or Invalid user related to SSH.
  • Fix: Configure LogLevel VERBOSE in sshd_config for detailed logs. Ensure these logs are forwarded to a SIEM or log aggregation platform.
  • Why it works: Provides an audit trail for compliance, incident response, and security analysis.
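
Once logs are centralized, a first audit question is who is still logging in with plain keys. The sketch below counts successful public key logins per user and splits them by whether the key type reported by sshd ends in -CERT. The regular expression is an assumption about the layout of sshd's "Accepted publickey" lines; the exact format varies between OpenSSH versions and syslog setups, so check it against your own logs before relying on it.

```python
import re
from collections import Counter

# Assumed layout: "Accepted publickey for USER from IP port N ssh2: KEYTYPE ..."
ACCEPTED = re.compile(
    r"Accepted publickey for (?P<user>\S+) from (?P<ip>\S+) port \d+ ssh2: (?P<keytype>\S+)"
)


def summarize_logins(lines):
    """Count accepted logins per (user, kind), kind being certificate or plain key."""
    counts = Counter()
    for line in lines:
        match = ACCEPTED.search(line)
        if match:
            is_cert = match.group("keytype").endswith("-CERT")
            kind = "certificate" if is_cert else "plain key"
            counts[(match.group("user"), kind)] += 1
    return counts
```

Any user who appears with the plain key kind after the migration still has an entry in an authorized_keys file somewhere.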

By moving from individual authorized_keys to a CA-backed, short-lived certificate system integrated with your IdP, you transform SSH management from a manual chore into a robust, scalable, and secure process.

The next hurdle you’ll face is managing host-based SSH keys for automated services or bastion hosts, which requires a slightly different approach.
