SSH keys are the backbone of secure access for most enterprises, but managing them at scale across thousands of hosts can quickly become a chaotic nightmare.
Here’s how it actually works, and how to tame the beast:
Imagine you’ve got a new developer, Alice, joining the team. She needs access to a dozen servers to deploy her new microservice. Traditionally, this means:
- Alice generates an SSH key pair.
- She gives her public key to each server administrator.
- Each administrator manually adds Alice’s public key to `~/.ssh/authorized_keys` on their respective servers.
- If Alice leaves, each administrator has to find and remove her key.
Now, multiply this by thousands of developers, hundreds of servers, and a constant churn of new hires and departures. You end up with:
- Stale keys: Keys of former employees lingering on servers, creating security holes.
- Key sprawl: Alice’s public key is on server A, but a slightly different version is on server B.
- Manual toil: Admins spend hours copying and pasting keys.
- Lack of auditability: Who has access to what, and when was it granted?
The core problem is that SSH’s default mechanism (`authorized_keys`) is designed for individual, manual management, not for enterprise-wide orchestration.
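To make the stale-key problem concrete, here is a minimal Python sketch that audits the contents of one `authorized_keys` file against a list of active users. It assumes, purely for illustration, that the trailing comment on each key is the owner’s username, which real files do not guarantee. The function name `find_stale_keys` and the sample keys are hypothetical.

```python
# Prefixes that mark the key-type field of an authorized_keys entry.
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def find_stale_keys(authorized_keys_text, active_users):
    """Return (line_number, owner) pairs for keys whose owner is not active.

    Assumption: the comment after the base64 blob is the owner's username.
    Simplification: options containing quoted spaces are not handled.
    """
    stale = []
    for number, line in enumerate(authorized_keys_text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # Options may precede the key type, so locate the key type first.
        type_index = next(
            (i for i, field in enumerate(fields)
             if field.startswith(KEY_TYPE_PREFIXES)),
            None,
        )
        if type_index is None or len(fields) < type_index + 3:
            continue  # malformed line, or no comment to attribute the key to
        owner = fields[type_index + 2]
        if owner not in active_users:
            stale.append((number, owner))
    return stale


sample = """\
# deployed by hand over several years
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExampleKeyOne alice
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExampleKeyTwo bob
no-port-forwarding ssh-rsa AAAAB3NzaC1yc2EAAAADAQABExampleKeyThree carol
"""

print(find_stale_keys(sample, active_users={"alice", "carol"}))
```

Even this toy version has to be run against every account on every host, which is the scaling problem the rest of this article addresses.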
The Solution: Centralized SSH Key Management
The goal is to treat SSH keys like any other managed resource: provisioned, audited, and revoked automatically. This is typically achieved with a combination of tools and principles.
1. Centralized Identity Provider (IdP) Integration
Your IdP (like Okta, Azure AD, or Keycloak) is the single source of truth for user identities. The key is to leverage this for SSH access.
- How it works: Instead of directly managing SSH keys on hosts, you configure hosts to trust SSH certificates issued by your IdP (or a dedicated certificate authority that integrates with your IdP). Users authenticate to the IdP, and if authorized, receive a short-lived SSH certificate.
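- Diagnosis: If a user can’t log in, check the IdP logs for authentication failures. On the client, `ssh -v` shows which keys and certificates are offered and whether the server accepts them; certificate validation errors are recorded in the sshd logs on the server.
- Fix: Ensure certificate issuance is correctly configured. This involves setting up a certificate authority (CA) within the IdP, if it supports one, or a dedicated SSH CA that trusts the IdP’s user information. With Keycloak, for example, you would typically run a separate SSH CA that is configured to trust the IdP’s signing keys and validates the IdP’s tokens before issuing a certificate.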
- Why it works: The server only needs to trust the CA’s public key, not every individual user’s public key. The certificate contains user identity and authorizes access for its duration.
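The issuing step can be sketched as follows. This is a minimal illustration, not a real CA service: it assumes the IdP token has already been verified, that it exposes `preferred_username` and `groups` claims (common in OIDC setups but not guaranteed), and that a hypothetical `GROUP_TO_PRINCIPAL` table maps IdP groups to certificate principals. The function only builds the `ssh-keygen` argument list and does not run it.

```python
# Hypothetical mapping from IdP group names to SSH certificate principals.
GROUP_TO_PRINCIPAL = {
    "eng-backend": "developers",
    "sec-audit": "auditors",
}


def build_signing_command(claims, ca_key_path, user_pubkey_path, ttl_hours=8):
    """Build an ssh-keygen argv that signs a user key for a short period.

    `claims` stands in for an already-verified IdP token payload; the claim
    names used here are assumptions, not a fixed standard for SSH.
    """
    username = claims["preferred_username"]
    principals = sorted(
        {GROUP_TO_PRINCIPAL[group]
         for group in claims.get("groups", [])
         if group in GROUP_TO_PRINCIPAL}
    )
    if not principals:
        raise PermissionError(f"{username} has no group that grants SSH access")
    return [
        "ssh-keygen",
        "-s", ca_key_path,           # CA private key used for signing
        "-I", username,              # key identity recorded in the certificate
        "-n", ",".join(principals),  # principals embedded in the certificate
        "-V", f"+{ttl_hours}h",      # validity window starting now
        user_pubkey_path,
    ]


claims = {"preferred_username": "alice", "groups": ["eng-backend", "all-staff"]}
print(build_signing_command(claims, "ca_key", "alice_key.pub"))
```

Keeping the group-to-principal mapping on the CA side means that removing a user from a group in the IdP takes effect the next time they request a certificate.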
2. SSH Certificate Authority (CA)
This is the heart of a scalable SSH key management system. You set up a CA that signs user SSH certificates.
- How it works: Users generate a public/private key pair. They submit their public key to the CA, usually along with proof of their IdP-authenticated identity. The CA verifies their identity and issues a signed SSH certificate (a `-cert.pub` file stored next to the key) that’s valid for a specific duration (e.g., 8 hours). The server is configured to trust the CA’s public key.
- Diagnosis: In `ssh -v` output, look at the "Offering public key" and "Server accepts key" lines. A plain key type such as RSA or ED25519 means the client is falling back to traditional keys; a type ending in -CERT (for example ED25519-CERT) means the certificate is being used. The exact wording varies between OpenSSH versions.
- Fix: Ensure the `sshd_config` on the target hosts has `TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pub`. This file should contain the public key of your SSH CA.
- Why it works: The server trusts the CA, and the certificate is cryptographically signed by the CA, proving the user’s identity and authorization for a limited time.
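A quick way to check the fix is to read the directive back out of the configuration text. The sketch below is a simplified parser under stated assumptions: it ignores `Match` blocks and `Include` files, and the function name `trusted_user_ca_path` is hypothetical. On a real host, `sshd -T` prints the effective configuration and is the more reliable check.

```python
def trusted_user_ca_path(sshd_config_text):
    """Return the first active TrustedUserCAKeys value, or None if unset.

    sshd keywords are case-insensitive and the first value found wins,
    so the scan stops at the first uncommented match.
    """
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out directive
        parts = line.split(None, 1)
        if parts[0].lower() == "trustedusercakeys" and len(parts) == 2:
            return parts[1].strip()
    return None


config = """\
# TrustedUserCAKeys /etc/ssh/old-ca.pub
PasswordAuthentication no
TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pub
"""

print(trusted_user_ca_path(config))
```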
3. Automated Provisioning and Deprovisioning (Orchestration)
Tools like Ansible, Chef, Puppet, or Terraform are essential for distributing the CA’s public key to all hosts and configuring `sshd_config`.
- How it works: These tools ensure that `TrustedUserCAKeys` is correctly set on all managed servers. When a user leaves the organization, their entry in the IdP is disabled, and they can no longer obtain new certificates; any certificate they already hold stops working when it expires.
- Diagnosis: Check the `sshd_config` on a problematic host. Is `TrustedUserCAKeys` pointing to the correct file? Is the correct CA public key in that file?
- Fix: Run your configuration management tool. For Ansible, it might be a playbook like:

```yaml
# Tasks only: assumes a handler named "restart sshd" is defined in the same play or role.
- name: Configure SSHD to trust user CA
  copy:
    src: files/ssh_user_ca.pub
    dest: /etc/ssh/trusted-user-ca-keys.pub
    owner: root
    group: root
    mode: '0644'
  notify: restart sshd

- name: Ensure sshd_config uses TrustedUserCAKeys
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^#?TrustedUserCAKeys'
    line: 'TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pub'
    state: present
  notify: restart sshd
```

- Why it works: It automates the repetitive task of configuring SSH on every server, ensuring consistency and reducing human error.
4. Role-Based Access Control (RBAC)
While certificates grant identity, RBAC determines what that identity can do.
- How it works: Certificates can carry principals (e.g., `developers`, `auditors`) and custom options. `sshd_config` can then use `AuthorizedPrincipalsFile` to list, for each local account, which principals are allowed to log in as that account.
- Diagnosis: If a user holds a valid certificate but is refused login to a particular account, check the `AuthorizedPrincipalsFile` for that account on the server.
- Fix: Set `AuthorizedPrincipalsFile /etc/ssh/authorized_principals/%u` in `sshd_config` (where `%u` is the target username) and add the principal from the certificate to the matching file. For example, if the certificate has the principal `webserver-admin`, the file `/etc/ssh/authorized_principals/alice` would contain `webserver-admin`.
- Why it works: It gives fine-grained control over which certificate holders may log in to which accounts, separating authentication from authorization.
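The matching rule can be modelled in a few lines. This is a simplified sketch of the decision, not sshd’s actual implementation: it treats every non-comment line of the principals file as one principal name and does not handle per-line options. The function names are hypothetical.

```python
def allowed_principals(principals_file_text):
    """Collect principal names from the text of an authorized principals file.

    Simplification: each non-blank, non-comment line is taken as one name.
    """
    names = set()
    for raw_line in principals_file_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line)
    return names


def may_log_in(cert_principals, principals_file_text):
    """True if the certificate carries at least one principal the account accepts."""
    return bool(set(cert_principals) & allowed_principals(principals_file_text))


# Contents of a per-account file such as /etc/ssh/authorized_principals/alice
alice_file = """\
# principals allowed to log in as alice
webserver-admin
"""

print(may_log_in(["webserver-admin", "developers"], alice_file))
print(may_log_in(["auditors"], alice_file))
```

The first call succeeds because one of the certificate’s principals is listed for the account; the second fails even though the certificate itself may be perfectly valid.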
5. Short-Lived Certificates
This is a critical security practice.
- How it works: Certificates are issued with a very short validity period, often just a few hours. Once a certificate expires, sshd rejects it for new logins, so access lapses without anyone having to remove a key. Sessions that are already open are not cut off.
- Diagnosis: Check the `Valid:` line in the certificate details (you can inspect a user’s certificate with `ssh-keygen -L -f /path/to/user_key-cert.pub`).
- Fix: When issuing certificates, set a short expiry. For example, `ssh-keygen -s ca_key -I user_id -n principal -V +8h user_key.pub` will issue a certificate valid for 8 hours.
- Why it works: If a user’s machine is compromised or a certificate is leaked, the window of opportunity for an attacker is severely limited.
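To check lifetimes in bulk, the validity line can be parsed from saved `ssh-keygen -L` output. The sketch below assumes the `Valid: from <start> to <end>` form with ISO-style local timestamps, as recent OpenSSH releases print it; other forms such as `Valid: forever` are returned as `None`. The sample text is illustrative, not captured from a real certificate, and the function name is hypothetical.

```python
import re
from datetime import datetime

VALID_LINE = re.compile(r"Valid: from (\S+) to (\S+)")


def hours_valid(keygen_output):
    """Return the certificate lifetime in hours, or None if it is unbounded
    or the output is not in the expected 'from ... to ...' form."""
    match = VALID_LINE.search(keygen_output)
    if match is None:
        return None
    start, end = (datetime.fromisoformat(value) for value in match.groups())
    return (end - start).total_seconds() / 3600


# Illustrative excerpt of `ssh-keygen -L` output (values are made up).
sample_output = """\
user_key-cert.pub:
        Key ID: "user_id"
        Serial: 0
        Valid: from 2024-05-01T09:00:00 to 2024-05-01T17:00:00
        Principals:
                principal
"""

lifetime = hours_valid(sample_output)
print(lifetime)
if lifetime is None or lifetime > 8:
    print("certificate lifetime exceeds the 8 hour policy")
```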
6. Monitoring and Auditing
Log everything.
- How it works: SSH daemon logs (`auth.log` or `secure`) should be collected and sent to a central logging system. This allows you to track who logged in, when, from where, and whether certificate-based authentication was used.
- Diagnosis: Search the logs for `Accepted publickey` entries whose key type ends in `-CERT` (these also record the certificate ID and the signing CA) and for `Invalid user` entries.
- Fix: Configure `LogLevel VERBOSE` in `sshd_config` for detailed logs. Ensure these logs are forwarded to a SIEM or log aggregation platform.
- Why it works: Provides an audit trail for compliance, incident response, and security analysis.
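As a starting point for log analysis, the sketch below pulls the user, source address, and certificate ID out of successful login lines. The sample lines are illustrative and follow the shape recent OpenSSH versions use for `Accepted publickey` entries; the exact format can differ between versions and syslog setups, so treat the pattern as an assumption to verify against your own logs.

```python
import re

ACCEPTED = re.compile(
    r"Accepted publickey for (?P<user>\S+) from (?P<source>\S+) port \d+ ssh2: "
    r"(?P<key_type>\S+) (?P<fingerprint>\S+)"
    r"(?: ID (?P<key_id>.+?) \(serial (?P<serial>\d+)\))?"
)


def parse_login(line):
    """Return details of a successful public key login, or None for other lines."""
    match = ACCEPTED.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Certificate logins report a key type ending in -CERT.
    info["used_certificate"] = info["key_type"].endswith("-CERT")
    return info


# Illustrative log lines (hosts, addresses, and fingerprints are made up).
cert_line = (
    "May  1 09:02:11 web01 sshd[2211]: Accepted publickey for alice from 10.0.0.5 "
    "port 51234 ssh2: ED25519-CERT SHA256:abc123 ID alice (serial 42) "
    "CA ED25519 SHA256:def456"
)
plain_line = (
    "May  1 09:05:40 web01 sshd[2230]: Accepted publickey for bob from 10.0.0.9 "
    "port 40000 ssh2: RSA SHA256:xyz789"
)

for line in (cert_line, plain_line):
    login = parse_login(line)
    print(login["user"], login["source"], login["used_certificate"], login["key_id"])
```

Logins where `used_certificate` is false are worth flagging once the migration away from `authorized_keys` is complete.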
By moving from individual `authorized_keys` files to a CA-backed, short-lived certificate system integrated with your IdP, you transform SSH management from a manual chore into a robust, scalable, and secure process.
The next hurdle you’ll face is managing host-based SSH keys for automated services or bastion hosts, which requires a slightly different approach.