Portal:Cloud VPS/Admin/Auth logging

From Wikitech

Since 2022, all Cloud VPS instances forward authentication-related events to a set of syslog servers for security reasons. Forwarding these events to a remote machine mitigates scenarios in which attackers hide their actions by tampering with local log files. Audit logging can also serve as a deterrent security control. Furthermore, after a compromise, these log files can help administrators find the culprits and determine how machines were compromised. For simplicity, all of this is called audit logging or auth logging.

Architecture

The following building blocks are relevant: rsyslog, Cloud VPS (instances), Puppet on Cloud VPS, service records and Cinder.

Client

Whenever a Cloud VPS instance (syslog client) receives a syslog message with the facility auth or authpriv, rsyslog will apply the omfwd action to forward the message to syslog servers. The actual syslog destination is not the FQDN of a syslog server, but a service record.

rsyslog allows forwarding to multiple syslog servers simultaneously. Provided that all syslog servers are up and running, rsyslog sends each syslog message to both syslogaudit1.svc.eqiad1.wikimedia.cloud:6514 and syslogaudit2.svc.eqiad1.wikimedia.cloud:6514. See /etc/rsyslog.d/30-remote-syslog.conf (managed by Puppet) for the RainerScript code.
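The facility-based selection described above can be illustrated with a small Python sketch. It decodes the PRI value at the start of a syslog frame (PRI = facility * 8 + severity) and checks for the facility codes 4 (auth) and 10 (authpriv) from RFC 5424. The function name should_forward is hypothetical. The real selection is done by rsyslog's RainerScript configuration, not by this code.

```python
import re

# Facility codes as listed in RFC 5424, section 6.2.1.
FACILITY_AUTH = 4       # security/authorization messages
FACILITY_AUTHPRIV = 10  # security/authorization messages (private)

_PRI_RE = re.compile(r"^<(\d{1,3})>")


def should_forward(raw_message: str) -> bool:
    """Return True if the message's facility is auth or authpriv.

    Hypothetical helper for illustration only.
    """
    match = _PRI_RE.match(raw_message)
    if match is None:
        return False  # no PRI header, so the facility is unknown
    pri = int(match.group(1))
    if pri > 191:
        return False  # highest valid PRI is 23 * 8 + 7 = 191
    facility = pri >> 3  # integer division by 8 drops the severity bits
    return facility in (FACILITY_AUTH, FACILITY_AUTHPRIV)
```

For example, a frame starting with <38> (auth.info) or <86> (authpriv.info) is selected, while <13> (user.notice) is not.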

Server

Syslog servers run rsyslog as well. rsyslog listens on 6514/tcp. The well-known Transport Layer Security (TLS) standard is used to ensure the confidentiality and integrity of the syslog messages sent to this syslog server. The leaf certificate presented on the port is managed by acme-chief and contains both service records in its subjectAltName.

After receiving a syslog message, rsyslog on the syslog server writes its contents to /srv/syslog/CLIENTIP/syslog.log, where CLIENTIP is the IP address (172.16.x.x) that the syslog client used to communicate with the syslog server. See /etc/rsyslog.d/10-receiver.conf (managed by Puppet) for the RainerScript code.
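The per-client file layout can be sketched in Python as follows. The helper name log_path_for_client is hypothetical, and the sketch only mirrors the path scheme described above. The actual file naming is done by the rsyslog configuration on the server.

```python
import ipaddress
from pathlib import PurePosixPath

SYSLOG_ROOT = PurePosixPath("/srv/syslog")


def log_path_for_client(client_ip: str) -> PurePosixPath:
    """Map a client IP address to its log file path (illustrative helper)."""
    # ip_address() raises ValueError for anything that is not a valid
    # IP address, so strings such as "../etc" never reach the path.
    address = ipaddress.ip_address(client_ip)
    return SYSLOG_ROOT / str(address) / "syslog.log"
```

Validating the input as an IP address before building the path is a design choice of this sketch: it keeps arbitrary strings out of the directory name.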

Each syslog server has a Cinder volume that is mounted as /srv/syslog.

Security considerations

The syslog setup is not fully RFC 5425 compliant.

  • Client authentication: not present (and therefore no mutual authentication either). There is no tested mechanism, deemed secure, to automatically enroll all Cloud VPS instances (existing and new) with a certificate (containing the instance's FQDN) signed by a trustworthy certificate authority. Instead, TCP's connection-oriented nature, which makes it somewhat harder to spoof IP addresses, is relied upon for authentication. Given that an IP address is usually used by exactly one Cloud VPS instance, the client's IP address serves as the client identifier as well. TCP is merely a weak form of authentication, though: a successful TCP sequence prediction attack renders the whole authentication process useless.[1] An authentication scheme based on cryptography (such as mutual TLS) could be much stronger.
  • Server authentication: yes. The client verifies that the syslog server's leaf certificate is signed by a trusted certificate authority (which is the reason rsyslog uses the OpenSSL NetStream driver instead of the GnuTLS one) and validates the subject name (enabled in 2023).
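The two server authentication checks listed above (chain validation and subject name validation) can be reproduced from any machine with a short Python sketch, for instance to test a syslog server manually. This uses Python's standard ssl module and the system trust store, not rsyslog's NetStream driver. The function names and the timeout value are assumptions made for illustration.

```python
import socket
import ssl


def build_client_context() -> ssl.SSLContext:
    """Create a TLS context that verifies the chain and the hostname."""
    # create_default_context() loads the system CA store and enables
    # CERT_REQUIRED and check_hostname by default.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def open_verified_connection(host: str, port: int = 6514,
                             timeout: float = 5.0) -> ssl.SSLSocket:
    """Connect to host:port and fail unless the certificate validates."""
    ctx = build_client_context()
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        # server_hostname drives both SNI and subject name validation.
        return ctx.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
```

Calling open_verified_connection with one of the service records as host raises ssl.SSLCertVerificationError if the chain or the subject name does not validate.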

Searching through the messages

The two syslog servers are syslog-server-audit01.cloudinfra.eqiad1.wikimedia.cloud (service record: syslogaudit1) and syslog-server-audit02.cloudinfra.eqiad1.wikimedia.cloud (service record: syslogaudit2). Log files are organized by client IP address, are named syslog.log and can be found in /srv/syslog. For example, an unrotated log file containing syslog messages from 172.16.1.1 can be found at /srv/syslog/172.16.1.1/syslog.log.
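As a minimal sketch of searching this layout, the following Python function scans every unrotated syslog.log below a root directory for a substring and reports the client IP address alongside each matching line. The function name search_logs is hypothetical, and rotated files are deliberately not covered because their naming scheme is not described here.

```python
from pathlib import Path
from typing import Iterator, Tuple


def search_logs(root: Path, needle: str) -> Iterator[Tuple[str, str]]:
    """Yield (client_ip, line) for every line containing needle."""
    # One directory per client IP address, each with a syslog.log file.
    for log_file in sorted(root.glob("*/syslog.log")):
        client_ip = log_file.parent.name
        # errors="replace" keeps the scan going on undecodable bytes.
        with log_file.open("r", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if needle in line:
                    yield client_ip, line.rstrip("\n")
```

On a syslog server this could be used as: for ip, line in search_logs(Path("/srv/syslog"), "Failed password"): print(ip, line)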

Administration

The syslog servers can be rebooted in any order, as long as at least one syslog server stays up. Each reboot further widens the split-brain situation (see Known limitations), however.
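To get a rough idea of how far the two copies of a client's log have diverged, for example after a reboot, the files from both servers can be compared line by line. The sketch below assumes that both servers write identical lines for the same message, which has not been verified for this setup, and that both files have been copied to the same machine. The function names are hypothetical.

```python
from pathlib import Path
from typing import Set, Tuple


def _read_lines(path: Path) -> Set[str]:
    """Read a log file into a set of lines (duplicate lines collapse)."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return {line.rstrip("\n") for line in handle}


def compare_copies(first: Path, second: Path) -> Tuple[Set[str], Set[str]]:
    """Return (lines only in first, lines only in second)."""
    a = _read_lines(first)
    b = _read_lines(second)
    return a - b, b - a
```

Because sets are used, repeated identical lines are counted once, so this is only an approximation of the real difference.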

Switching syslog servers

The list of syslog servers is managed via Hiera (hieradata/cloud1/eqiad1.yaml) under the key profile::base::remote_syslog_tls. Be aware that this list contains service records, not server FQDNs. How to change the target of these service records is not documented here. A new syslog server must have the WMCS syslog server role (Puppet class: role::wmcs::centralserver_syslog).

Syslog clients will not re-send previous events to syslog servers (partially for the reason that one cannot trust the local log files, since attackers will make sure to hide their actions). To preserve the archive, you can copy rotated log files on the old syslog servers to the new syslog servers.

Known limitations

  • There are no application-level acknowledgements of the receipt of syslog messages. TCP does not provide full resiliency against sender failures.[2]
  • Since the syslog servers store the messages in plain log files and do not synchronize with each other, they may be in constant split-brain mode. A syslog message can exist on both syslog servers, on exactly one, or on neither.
  • The syslog setup is not monitored. An issue regarding the reachability of 6514/tcp and the validity of the certificate will not trigger alerts. The presence of Blackbox exporter would help to alleviate this limitation.[3]
  • The syslog setup deviates from the production logging pipeline. This makes the syslog setup less maintainable (more rsyslog expertise is required) and more maintainable (only rsyslog expertise is required) at the same time.
  • A Cloud VPS instance in the codfw1dev deployment does not send its audit logs to the syslog servers.
  • Log files are not backed up to off-Cloud VPS storage. In the event of a full compromise of Cloud VPS (either direct or indirect), integrity will be reduced greatly.
  • Authentication of syslog clients is relatively weak (see Security considerations for an explanation).
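Until the setup is monitored, the certificate validity mentioned in the list above could be checked ad hoc. The following Python sketch computes the number of days until a certificate expires from its notAfter field, in the format returned by SSLSocket.getpeercert() in Python's ssl module. The function name days_until_expiry is hypothetical, and how the certificate is fetched and when to raise an alert are left open.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days until the given notAfter timestamp.

    not_after uses the format of getpeercert()["notAfter"],
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    # A negative result means the certificate has already expired.
    return (expires_at - now) / 86400.0
```

The now parameter exists only to make the function easy to test with a fixed point in time.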

Miscellaneous

  1. Such an attack becomes feasible if the initial sequence number (ISN) of a TCP connection, whose value space isn't too large on its own (32 bits, so 2^32 possible values), is generated using a weak random number generator. To read more on this topic, see https://datatracker.ietf.org/doc/html/rfc6528#section-1 and https://datatracker.ietf.org/doc/html/rfc9293#section-3.4.1
  2. For more information, see this article written by Rainer Gerhards, lead developer of rsyslog.
  3. Creating a Prometheus/Alertmanager setup for the sole purpose of monitoring the syslog server is undesirable.