Logstash/Common Logging Schema

The Observability team embarked on a project to adopt a Common Logging Schema in 2020. After evaluating options, the Observability team determined that adopting the Elastic Common Schema (ECS) had the best chance of success with the lowest barrier to entry.

Goals

  1. Control field growth.
  2. Establish consensus on field types and content.
  3. Provide guiding documentation for developers and users.
  4. Lay the groundwork for log retention beyond 90 days in compliance with our Privacy Policy and Data Usage Guidelines.
  5. Improve the overall user experience.

Rationale

At the time of the initial investigation, the logging cluster stored more than 14,000 unique fields, only a limited subset of which were understood and used. A manual audit of saved objects showed that around 154 fields were actively in use, and that this number could be reduced to 81 by consolidating fields with the same content under a single shared name. Consolidating on a Common Logging Schema enabled us to drop many thousands of fields and provide guidance for users seeking to share fields based on their content.

It was regularly reported that Kibana was slow to respond, partly due to the sheer number of fields it had to handle. This led many users to prefer mwlog over Kibana for diagnosing issues, on grounds of familiarity and performance. The same datapoints, such as a client IP or a URL, were stored in many different fields depending on the choices of the log producer. Leveraging a Common Logging Schema improves Kibana performance and helps users find which field names contain the datapoints they are looking for.

Type conflicts and disagreements about what data a particular field should contain were commonplace. Sometimes this resulted in a "forking" of the index pattern in an attempt to rectify the conflict. This did not fix the problem; it made the conflicting fields unqueryable, and it further degraded the performance of the logging cluster by doubling the number of indexes that had to be held in memory and queried every time. Adopting a Common Logging Schema externalized field definitions from implementation and provided guidance for their use to drive consensus.

The mapping template attempted to guess the type of every field. Strings were configured to be analyzed and duplicated to another field in keyword form. A field's type was determined by the first log encountered with that field name at the point of index creation. If two or more producers used the same field name and Elasticsearch first encountered a less-used type for that field, all logs using the more prevalent type would be dropped until the next index was generated. The Common Logging Schema sought to define types so that compliant producers could expect their logs not to be lost to a type conflict.

Logstash filter configurations were complex, and amending them was a perilous and error-prone process. Most of this problem was mitigated by adding an end-to-end filter verification step in CI and writing test cases. Because this CI step arrived many years after the logging cluster went into production, test coverage was very thin relative to the amount of filter configuration. Adopting a Common Logging Schema incentivized writing tests for log producers whenever Logstash transformations were required, and enabled a "fast-track" path that bypasses most filters when a producer can emit compliant log events.

Elastic Common Schema was chosen because it was readily available, flexible, and covered most use cases out of the box. The process for amending the schema followed the same patch request workflow most of the organization was accustomed to. Documentation could be generated and deployed after each merge. A system to manage multiple versions was added to the logging cluster configuration management so that amendments to the schema would not have to wait until the next index rotation to take effect.

Documentation

Up-to-date documentation and field reference can be found on doc.wikimedia.org.

Generating ECS-compliant Events

Required Fields

The only required field is ecs.version, set to the current ECS version found on the Wikimedia Documentation Portal. All other fields must comply with the definitions found in the field reference section.
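
For example, a minimal compliant event might look like the following (the field values are illustrative; the ECS version shown matches the examples elsewhere on this page):

{
  "ecs.version": "1.7.0",
  "message": "Something happened."
}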

Migration Process

In collaboration with the Observability team, the migration process for non-ECS-compliant logs will largely follow this protocol:

  • Logstash will duplicate a limited volume of log data into an ECS-compatible “staging” index.
  • Relevant Kibana saved objects will be identified in coordination with stakeholders.
  • ECS integration will be improved until identified Kibana saved objects are considered functional.
  • Logs directed at the legacy indexes will be disabled.
  • Saved objects will be updated to reference only the new ECS indexes.

Features

Timestamp

For all ECS-compliant events, the pipeline will attempt to parse and appropriately relocate the timestamp field. The value is parsed as an ISO 8601 datetime string and moved into the ECS-compliant @timestamp field. If the field is missing or unparseable, @timestamp will be set to the time the event was received by the logging pipeline.
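
As an illustration (all field values here are hypothetical), an event submitted with a parseable timestamp field:

{
  "timestamp": "2021-03-01T12:34:56.789Z",
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}

would leave this stage of the pipeline with the value relocated to @timestamp:

{
  "@timestamp": "2021-03-01T12:34:56.789Z",
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}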

Dot-Expansion

ECS-compliant events can be provided as nested JSON objects, as dot-delimited namespaced fields, or as a mixture of the two. For example, the following three events are ECS-compliant and equivalent:

{
  "service": {
    "type": "my_app"
  },
  "log": {
    "level": "INFO",
    "facility": "local7"
  },
  "message": "Something happened.",
  "ecs": {
    "version": "1.7.0"
  }
}

{
  "service.type": "my_app",
  "log.level": "INFO",
  "log.facility": "local7",
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}

{
  "service": {
    "type": "my_app"
  },
  "log": {
    "level": "INFO"
  },
  "log.facility": "local7",
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}

Type field removal

All ECS-compliant events will have the base-level type field removed early in the pipeline to prevent legacy filters from modifying the event.
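
For illustration, in an event like the following (values hypothetical), the base-level type field would be stripped before any legacy filters run:

{
  "type": "syslog",
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}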

Grok failures

If a grok parse failure occurs while processing an ECS-compliant event, the value at log.original will be moved to the message field. This ensures the event is still queryable even if the pipeline is unable to parse it.
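
As a sketch of the outcome (event content hypothetical), an event whose raw line could not be grokked:

{
  "log.original": "unparseable raw line from the producer",
  "ecs.version": "1.7.0"
}

would be indexed with that value moved to message:

{
  "message": "unparseable raw line from the producer",
  "ecs.version": "1.7.0"
}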

Allow only ECS top-level fields

ECS-compliant events pass through filter_on_template, which strips out top-level fields not defined in the ECS template and populates normalized.dropped.* fields.

Dropped fields tracking

When filter_on_template drops fields from an event, the normalized.dropped.* fields are populated. These fields are arrays containing the keys not found in the ECS template.
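
As a hypothetical illustration (the subfield name under normalized.dropped is assumed here for readability, not confirmed), an event submitted with an undefined top-level field my_custom_field might leave the filter as:

{
  "message": "Something happened.",
  "ecs.version": "1.7.0",
  "normalized": {
    "dropped": {
      "keys": ["my_custom_field"]
    }
  }
}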

Planned Features

These features are not yet available.

Level Normalization

Events providing a log.level but not a log.syslog object will have a log.syslog object generated for them based on available data. This is to facilitate level sorting and range queries on log levels between disparate log level naming conventions.

Level to RFC5424 Mapping Table
Lowercase log.level     | RFC5424 definition               | Lowercase RFC5424 severity | Severity code | PHP[1] | Java[2] | NodeJS[3] | Python[4] | Syslog[5]
trace, debug            | debug-level messages             | debug                      | 7             | Yes    | Yes     | Yes       | Yes       | Yes
info, informational     | informational messages           | informational              | 6             | Yes    | Yes     | Yes       | Yes       | Yes
notice                  | normal but significant condition | notice                     | 5             | Yes    | Yes     | Yes       | Yes       | Yes
warning, warn           | warning conditions               | warning                    | 4             | Yes    | Yes     | Yes       | Yes       | Yes
error, err              | error conditions                 | error                      | 3             | Yes    | Yes     | Yes       | Yes       | Yes
critical, crit          | critical conditions              | critical                   | 2             | Yes    | Yes     | Yes       | Yes       | Yes
alert                   | action must be taken immediately | alert                      | 1             | Yes    | Yes     | Yes       | Yes       | Yes
emerg, emergency, fatal | system is unusable               | emergency                  | 0             | Yes    | Yes     | Yes       | Yes       | Yes

If no log level indicator can be identified, log.level will be set to NOTSET.

If log.level cannot be mapped to RFC5424 severity, then log.syslog.severity.name will be set to "alert" and log.syslog.severity.code will be set to "1".
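
As a sketch of the planned behavior (event content hypothetical, severity code shown as a number), an event arriving with log.level "warn" would gain a generated severity of "warning" with code 4 per the table above:

{
  "log.level": "warn",
  "log.syslog.severity.name": "warning",
  "log.syslog.severity.code": 4,
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}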

Maintenance

Deploying an updated schema

Once the patch is merged and CI has built and deployed the new documentation:

  1. Download the mapping template from the ECS docs page and add it to Puppet in the logstash templates directory. The Logstash ECS Cleanup Filter may need updating as well.
  2. Update the version => revision pair in the versions hash. One version can have only one revision available at a given time.
  3. Merge Puppet changes.

Deploying a new version of ECS

Update the ECS repository to check out the new version and resolve build issues, if any, then:

  1. Download the mapping template from the ECS docs page and add it to Puppet in the logstash templates directory. The Logstash ECS Cleanup Filter may need updating as well.
  2. Add the version => revision pair to the versions hash. One version can have only one revision available at a given time.
  3. Merge Puppet changes.

References

  1. https://www.php-fig.org/psr/psr-3/
  2. https://en.wikipedia.org/wiki/Log4j#Log4j_log_levels
  3. https://github.com/trentm/node-bunyan#levels
  4. https://docs.python.org/3/library/logging.html#levels
  5. https://www.rfc-editor.org/rfc/rfc5424#section-6.2.1