RPKI

Signing

Prefixes

All our prefixes have matching ROAs for AS14907.

They are set up through the RIRs' hosted RPKI platforms.

RIR     Prefix                 Max length
RIPE    185.15.56.0/22         up to /24
RIPE    2a02:ec80::/29         up to /48
RIPE    91.198.174.0/24        /24
ARIN    2620:0:860::/46        up to /48
ARIN    198.35.26.0/23         up to /24
ARIN    208.80.152.0/22        up to /24
APNIC   103.102.166.0/24       /24
APNIC   2001:df2:e500::/48     /48

Monitoring

BGPmon: see Network monitoring#RPKI Validation Failed

RIPE: see Network monitoring#Resource Certification (RPKI) alerts

Validation

Tracking task: https://phabricator.wikimedia.org/T220669

Gerrit changes: https://gerrit.wikimedia.org/r/q/topic:%22rpki%22+(status:open%20OR%20status:merged)

VMs: https://netbox.wikimedia.org/virtualization/virtual-machines/?q=rpki (Routinator requirements)

Grafana: https://grafana.wikimedia.org/d/UwUa77GZk/rpki

Current status

In production: RPKI-invalid prefixes are rejected on all external BGP sessions (transit and peering).

[Figure: RPKI validation infrastructure (RPKI validation infra.png)]

Packaging

An official Debian package is in progress: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=929024

There is also a request to SRE on how to package Rust/Go apps for our infra: https://phabricator.wikimedia.org/T220836

In the meantime, the package is built as follows, on a Cloud VM:

sudo apt-get install musl-tools build-essential
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
cargo install cargo-deb
git clone https://github.com/NLnetLabs/routinator/
cd routinator/
git checkout <version>
rustup target add x86_64-unknown-linux-musl
cargo deb --target=x86_64-unknown-linux-musl

Reference: https://rpki.readthedocs.io/en/latest/routinator/installation.html#building-a-statically-linked-routinator

The resulting package is then added to Reprepro.

Router config

First we need the routers to talk to the Validators:

routing-options {
    [...]
    validation {
        group rpki {
            session 2620:0:861:103:10:64:32:19 {
                port 3323;
            }
            session 2620:0:860:101:10:192:0:103 {
                port 3323;
            }
        }
    }
}

Then we classify the learned prefixes:

policy-options {
[...]
    policy-statement BGP_IXP_in (and Transit_in) {
    [...]
        term rpki-classification {
            from policy BGP_rpki;
        }
    [...]
    policy-statement BGP_community_actions {
        term rpki-invalids {
            from community RPKI:INVALID;
            then reject;
        }
        [...]
    }
policy-statement BGP_rpki {
    term valid {
        from {
            protocol bgp;
            validation-database valid;
        }
        then {
            validation-state valid;
            community add RPKI:VALID;
        }
    }
    term invalid {
        from {
            protocol bgp;
            validation-database invalid;
        }
        then {
            validation-state invalid;
            community add RPKI:INVALID;
        }
    }
    term unknown {
        from {
            protocol bgp;
            validation-database unknown;
        }
        then {
            validation-state unknown;
            community add RPKI:UNKNOWN;
        }
    }
}
}

We also set the validation state for prefixes exchanged over iBGP (internal) sessions, based on the community attached at the edge:

policy-statement iBGP_rpki {
    term valid {
        from community RPKI:VALID;
        then validation-state valid;
    }
    term invalid {
        from community RPKI:INVALID;
        then validation-state invalid;
    }
    term unknown {
        from community RPKI:UNKNOWN;
        then validation-state unknown;
    }
}

How-to

Identify if an issue is due to invalid RPKI

Example of an RPKI-invalid prefix, with valid less-specifics.
  • Enter the IP of the user reporting an issue in https://stat.ripe.net/widget/prefix-routing-consistency.
  • Focus on the rows that have YES in the "In RIS" column, as those are the ones advertised in the DFZ.
  • If the emoji is red, the IP belongs to a prefix that is RPKI invalid (wrong origin ASN or prefix length). Hover over the face for more details.
  • If the prefix or IP is not covered by a less-specific prefix (see image), traffic cannot be routed back to the client.
  • The content of a specific ROA can be found at https://rpki-validator.ripe.net/roas. Filter for a specific prefix and verify that the ASN matches and that the prefix length is less than or equal to the defined max length.
  • If the prefix is indeed invalid, reach out to the provider so they can fix their ROA, or disable validation (less preferred).
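The check described in the list above (origin ASN matches, prefix length within the max length) can be sketched in a few lines of Python. This is only an illustration of route origin validation in the style of RFC 6811, not how Routinator or the routers implement it. The `VRP` type and the `validate_origin` function are hypothetical names introduced here, the two sample entries are taken from the prefix table on this page, and AS64496 is a documentation ASN used as a stand-in.

```python
import ipaddress
from typing import List, NamedTuple, Optional, Tuple


class VRP(NamedTuple):
    """Hypothetical container for one validated ROA payload."""
    prefix: str      # e.g. "185.15.56.0/22"
    max_length: int  # longest prefix length the ROA authorizes
    asn: int         # origin ASN the ROA authorizes


def validate_origin(route_prefix: str, origin_asn: int,
                    vrps: List[VRP]) -> Tuple[str, Optional[str]]:
    """Return (state, reason): state is valid/invalid/unknown,
    reason is "as" or "length" for invalid routes, else None."""
    route = ipaddress.ip_network(route_prefix)
    # A VRP "covers" the route if the route is inside the VRP prefix.
    covering = []
    for vrp in vrps:
        net = ipaddress.ip_network(vrp.prefix)
        if net.version == route.version and route.subnet_of(net):
            covering.append(vrp)
    if not covering:
        return ("unknown", None)
    same_asn = [v for v in covering if v.asn == origin_asn]
    if any(route.prefixlen <= v.max_length for v in same_asn):
        return ("valid", None)
    # Covered, but no VRP authorizes this exact announcement.
    return ("invalid", "length" if same_asn else "as")


# Sample data: two rows of the prefix table on this page.
vrps = [VRP("185.15.56.0/22", 24, 14907), VRP("2a02:ec80::/29", 48, 14907)]
print(validate_origin("185.15.56.0/24", 14907, vrps))
print(validate_origin("185.15.56.0/25", 14907, vrps))
print(validate_origin("185.15.56.0/24", 64496, vrps))
```

A route that no VRP covers comes back as unknown rather than invalid, which is why only the invalid state is rejected on the routers.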

Perform a manual RPKI validation

  • SSH into one of the RPKI servers (rpki[12]001 as of Feb. 2020)
  • Query the local daemon for the validity of a prefix for an ASN (replace the values of the parameters):
$ curl "http://localhost:9556/validity?asn=99999999&prefix=10.0.0.0/22"
Example output for a prefix length mismatch (here the queried prefix was 10.0.0.0/24):
{
  "validated_route": {
    "route": {
      "origin_asn": "AS99999999",
      "prefix": "10.0.0.0/24"
    },
    "validity": {
      "state": "Invalid",
      "reason": "length",
      "description": "At least one VRP Covers the Route Prefix, but the Route Prefix length is greater than the maximum length allowed by VRP(s) matching this route origin ASN",
      "VRPs": {
        "matched": [
        ],
        "unmatched_as": [
        ],
        "unmatched_length": [
          {
            "asn": "AS99999999",
            "prefix": "10.0.0.0/22",
            "max_length": "22"
          }
        ]
      }
    }
  }
}

In this case it shows that the maximum length allowed for announcements of this prefix is set to 22, but the advertised subnet is a /24, hence invalid.

Example output for an ASN mismatch:
{
  "validated_route": {
    "route": {
      "origin_asn": "AS99999999",
      "prefix": "10.0.0.0/22"
    },
    "validity": {
      "state": "Invalid",
      "reason": "as",
      "description": "At least one VRP Covers the Route Prefix, but no VRP ASN matches the route origin ASN",
      "VRPs": {
        "matched": [
        ],
        "unmatched_as": [
          {
            "asn": "AS11111111",
            "prefix": "10.0.0.0/22",
            "max_length": "22"
          }
        ],
        "unmatched_length": [
        ]
      }
    }
  }
}

In this case it shows that the ROA specifies AS11111111 as the ASN authorized to advertise the prefix, but the prefix is advertised by AS99999999 (the one passed in the curl query), hence invalid. The advertising ASN can be taken from the RIPEstat page linked above.
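When checking several prefixes, it can help to condense the JSON into one line. The sketch below is a hypothetical helper (`summarize_validity` is not part of Routinator). It relies only on the field names visible in the example outputs above and assumes the response always has that shape. The sample is a trimmed copy of the prefix length mismatch example.

```python
import json

# Trimmed copy of the "prefix length mismatch" example output above.
SAMPLE = """
{
  "validated_route": {
    "route": {"origin_asn": "AS99999999", "prefix": "10.0.0.0/24"},
    "validity": {
      "state": "Invalid",
      "reason": "length",
      "VRPs": {
        "matched": [],
        "unmatched_as": [],
        "unmatched_length": [
          {"asn": "AS99999999", "prefix": "10.0.0.0/22", "max_length": "22"}
        ]
      }
    }
  }
}
"""


def summarize_validity(payload: str) -> str:
    """Turn a /validity JSON response into a one-line summary."""
    doc = json.loads(payload)["validated_route"]
    route, validity = doc["route"], doc["validity"]
    state = validity["state"]
    line = f'{route["prefix"]} from {route["origin_asn"]}: {state}'
    reason = validity.get("reason")
    if state.lower() != "invalid" or reason is None:
        return line
    # The reason tells us which list holds the conflicting VRPs.
    bucket = "unmatched_length" if reason == "length" else "unmatched_as"
    vrps = validity.get("VRPs", {}).get(bucket, [])
    details = ", ".join(
        f'{v["prefix"]} max /{v["max_length"]} for {v["asn"]}' for v in vrps
    )
    return f"{line} (reason: {reason}; conflicting VRPs: {details})"


print(summarize_validity(SAMPLE))
# prints: 10.0.0.0/24 from AS99999999: Invalid (reason: length; conflicting VRPs: 10.0.0.0/22 max /22 for AS99999999)
```

In practice the payload would be the body returned by the curl command above, for example saved to a file and read back in.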

Disable validation

If validation is causing an issue and must be disabled quickly, stopping Routinator will not help: by default the routers keep the validator data in cache for 1 hour.

On the router side, you can either (depending on scope):

  • Disable all validation: deactivate routing-options validation
  • Set a static override: see below

Set a static override (exception)

  1. Add the exception to Homer, see example
  2. Run Homer on target routers

Monitoring

RPKI to router port

  • Check that the process is running (see Process below)
  • Check if the port (3323) is open in iptables
  • Check if routinator listens on the port (sudo netstat -nlpt | grep routinator)
  • Test the port from a monitoring host (e.g. nc -zv <hostname> <port>)
  • Open a task, cc netops/traffic
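If nc is not available on the host you are testing from, the same TCP reachability test can be approximated with the Python standard library. This is a minimal sketch: `port_is_open` is a hypothetical helper name, and it only proves that a TCP connection can be established, not that the RTR protocol itself is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `port_is_open("<hostname>", 3323)` mirrors the nc command above for the RTR port.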

Process

Troubleshoot it like most processes:

  • sudo service routinator status
  • Routinator logs to syslog, check logstash or /var/log/syslog
  • Try to start it again: sudo service routinator start
  • Open a task, cc netops/traffic

Grafana alerts

Valid ROAs decreasing

A possible cause is that Routinator can't download the new ROAs from the repositories.

  • Check the logs for signs of rsync failure (e.g. rsync rpki.ripe.net/repository: rsync: mkstemp "/var/lib/routinator/repository/rpki.ripe.net[...]CAi" failed: Permission denied (13))
  • Try to manually run the rsync from a temporary directory
  • Ensure the server has connectivity to the internet (e.g. check the proxies)

Rsync status > 0

Look at the logs for more information on the failure.

Try to run the rsync manually, from a host not behind a proxy to rule out the proxies.

If the issue is on the rsync server side, ack the alert and monitor the issue (not actionable).

RRDP status

RRDP uses HTTPS to fetch ROAs, so the status is an HTTP status code.

A status of -1 means that the request timed out.

As all Routinator instances fetch from the same source, you can compare them to tell whether the issue is most likely on our side or on the remote side.
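The comparison between instances can be written down as a small decision rule. The sketch below is illustrative only: the function names are hypothetical, the hostnames follow the rpki[12]001 pattern mentioned earlier on this page, and treating 2xx and 304 as healthy is an assumption rather than something this page states.

```python
from typing import Dict


def is_healthy(status: int) -> bool:
    """Assumption: 2xx and 304 (not modified) count as a successful fetch."""
    return 200 <= status < 300 or status == 304


def describe_status(status: int) -> str:
    """Human-readable meaning of one RRDP status value."""
    if status == -1:
        return "request timed out"
    if is_healthy(status):
        return "ok"
    return f"HTTP error {status}"


def likely_fault_side(status_by_instance: Dict[str, int]) -> str:
    """Guess where the problem is for ONE repository, given the status
    reported for it by each Routinator instance."""
    failing = [h for h, s in status_by_instance.items() if not is_healthy(s)]
    if not failing:
        return "none"
    if len(failing) == len(status_by_instance):
        return "remote"  # every instance fails: probably the repository
    return "local"       # only some fail: probably our side


print(likely_fault_side({"rpki1001": -1, "rpki2001": 200}))
print(likely_fault_side({"rpki1001": 503, "rpki2001": 503}))
```

The rule is only a hint: a failure seen by every instance can still be local if they all share the same broken path, such as a common proxy.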

Possible future work

  • Add monitoring on the router side. Currently only screen scraping/NETCONF seems doable (no SNMP).
  • Encrypt the RTR traffic. Not a blocker, as it's not PII and it's not leaving our infrastructure. Not supported on Junos.
  • Implement a mechanism to easily add exceptions.

Resources

Routinator's doc: https://rpki.readthedocs.io/en/latest/routinator/index.html

Juniper's doc: https://www.juniper.net/documentation/en_US/junos/topics/topic-map/bgp-origin-as-validation.html