Logstash/Common Logging Schema
The Observability team embarked on a project to adopt a Common Logging Schema in 2020. After evaluating options, the team determined that adopting the Elastic Common Schema (ECS) offered the best chance of success with the lowest barrier to entry.
Goals
- Control field growth.
- Establish consensus on field types and content.
- Provide guiding documentation for developers and users.
- Lay the groundwork for log retention beyond 90 days in compliance with our Privacy Policy and Data Usage Guidelines.
- Improve the overall user experience.
Rationale
At the time of the initial investigation, the logging cluster stored more than 14,000 unique fields, only a limited subset of which was understood and used. A manual audit of saved objects showed that around 154 fields were actively in use, and that number could be reduced to 81 by consolidating fields with the same content under a single shared name. Consolidating on a Common Logging Schema enabled us to drop many thousands of fields and provide guidance for users seeking to share fields based on their content.
It was regularly reported that Kibana was slow to respond, partially due to the sheer number of fields it had to handle. This led many users to prefer mwlog over Kibana for diagnosing issues, due to familiarity and performance. The same data points, such as a client IP or a URL, were stored in many different fields depending on the choices of the log producer. Leveraging a Common Logging Schema improves Kibana performance and helps users find which field names contain the data points they are looking for.
Type conflicts and disagreements about what data a particular field should contain were commonplace. Sometimes this resulted in a "forking" of the index pattern in an attempt to rectify the conflict. This did not fix the problem; instead it made the conflicting fields unqueryable and further degraded the performance of the logging cluster by doubling the number of indexes that had to be held in memory and queried for every search. Adopting a Common Logging Schema externalized field definitions from implementation and provided guidance for their use to drive consensus.
The mapping template attempted to guess the type of every field. Strings were configured to be analyzed and duplicated to another field in keyword form. A field's type was determined by the first log encountered with that field name at the point of index creation. If two or more producers used the same field name and Elasticsearch happened to encounter a less common type for that field first, all logs using the more prevalent type for that field would be dropped until the next index was generated. The Common Logging Schema sought to define types so that compliant producers could expect their logs not to be lost due to a type conflict.
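To illustrate the failure mode, consider two hypothetical producers sharing a field name with different types (field names and values here are illustrative, not taken from production logs):

{
  "response": 404,
  "message": "Lookup failed."
},
{
  "response": "404 Not Found",
  "message": "Lookup failed."
}

If the numeric event happens to arrive first when a new index is created, the field is mapped as a number, and every subsequent event carrying the string form is rejected until the next index rotation.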
Logstash filter configurations were complex, and amending them was a perilous and error-prone process. Most of this problem was mitigated by adding an e2e filter verification step in CI and writing test cases, but this CI step came many years after the logging cluster entered production, leaving very little test coverage for the amount of filter configuration. Adopting a Common Logging Schema incentivized writing tests for log producers whenever Logstash transformations were required, and enabled a "fast-track" path that bypassed most filters when a producer was able to emit compliant log events.
Elastic Common Schema was chosen for its availability, its flexibility, and its coverage of most use cases out of the box. The process for amending the schema followed the same patch request workflow most of the organization was accustomed to. Documentation could be generated and deployed after each merge. A system to manage multiple versions was added to the logging cluster configuration management so that amendments to the schema would not require waiting until the next index rotation to take effect.
Documentation
Up-to-date documentation and field reference can be found on doc.wikimedia.org.
Generating ECS-compliant Events
Required Fields
The one required field is ecs.version, set to the current ECS version found on the Wikimedia Documentation Portal. All other fields must comply with the definitions found in the field reference section.
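A minimal compliant event might look like the following (the version string is illustrative; use the current published version):

{
  "ecs.version": "1.7.0",
  "message": "Something happened."
}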
Migration Process
In collaboration with the Observability team, the migration process for non-ECS-compliant logs will largely follow this protocol:
- Logstash will duplicate a limited volume of log data into an ECS-compatible “staging” index.
- Relevant Kibana saved objects will be identified with coordination from stakeholders.
- ECS integration will be improved until identified Kibana saved objects are considered functional.
- Logs directed at the legacy indexes will be disabled.
- Saved objects will be updated to reference only the new ECS indexes.
Features
Timestamp
All ECS-compliant events will attempt to parse and appropriately locate the timestamp field. It will be parsed as an ISO-8601 datetime string and moved into the ECS-compliant @timestamp field. If this field is unavailable or unparseable, @timestamp will be set to a generated timestamp recording when the event was received by the logging pipeline.
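For example, an event arriving with a parseable timestamp field (values illustrative):

{
  "timestamp": "2024-01-01T12:00:00Z",
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}

would be stored with the value relocated to the ECS field:

{
  "@timestamp": "2024-01-01T12:00:00Z",
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}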
Dot-Expansion
All ECS-compliant events can be provided as nested JSON objects, dot-delimited namespaced fields, or a mixture of the two. For example, the following three events are all ECS-compliant and equivalent:
{
"service": {
"type": "my_app"
},
"log": {
"level": "INFO",
"facility": "local7"
},
"message": "Something happened.",
"ecs": {
"version": "1.7.0"
}
},
{
"service.type": "my_app",
"log.level": "INFO",
"log.facility": "local7",
"message": "Something happened.",
"ecs.version": "1.7.0"
},
{
"service": {
"type": "my_app"
},
"log": {
"level": "INFO"
},
"log.facility": "local7",
"message": "Something happened.",
"ecs.version": "1.7.0"
}
Type field removal
All ECS-compliant events will have the base-level type field removed early in the pipeline to prevent legacy filters from modifying the event.
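For instance, an incoming event like this (the type value is hypothetical) would have its base-level type field stripped before any legacy filters see it:

{
  "type": "syslog",
  "message": "Something happened.",
  "ecs.version": "1.7.0"
}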
Grok failures
If a grok parse failure occurs with an ECS-compliant event, the value at log.original will be moved to the message field. This is so that the event will still be queryable even if the pipeline is unable to parse it.
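As a sketch (the payload is hypothetical), an event whose log.original cannot be grokked:

{
  "log": {
    "original": "unparseable payload"
  },
  "ecs.version": "1.7.0"
}

would be stored with the unparsed value preserved in message:

{
  "message": "unparseable payload",
  "ecs.version": "1.7.0"
}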
Allow only ECS top-level fields
ECS-compliant events pass through filter_on_template, which strips out undefined top-level fields and populates normalized.dropped.* fields.
Dropped fields tracking
Events with fields dropped by filter_on_template populate normalized.dropped.* fields. These fields are arrays containing the keys not found in the ECS template.
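As an illustration, an event carrying an undefined top-level field (the exact subfield layout under normalized.dropped is not specified here, so the "keys" name below is an assumption):

{
  "message": "Something happened.",
  "my_custom_field": "some value",
  "ecs.version": "1.7.0"
}

might be stored with the undefined field removed and its key recorded:

{
  "message": "Something happened.",
  "ecs.version": "1.7.0",
  "normalized": {
    "dropped": {
      "keys": ["my_custom_field"]
    }
  }
}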
Planned Features
These features are not yet available.
Level Normalization
Events providing a log.level but not a log.syslog object will have a log.syslog object generated for them based on available data. This is to facilitate level sorting and range queries on log levels across disparate log level naming conventions.
Lowercase log.level | RFC5424 definition | Lowercase RFC5424 Severity | RFC5424 Severity code
---|---|---|---
trace, debug | debug-level messages | debug | 7
info, informational | informational messages | informational | 6
notice | normal but significant condition | notice | 5
warning, warn | warning conditions | warning | 4
error, err | error conditions | error | 3
critical, crit | critical conditions | critical | 2
alert | action must be taken immediately | alert | 1
emerg, emergency, fatal | system is unusable | emergency | 0
If no log level indicator can be identified, log.level will be set to NOTSET.
If log.level cannot be mapped to an RFC5424 severity, then log.syslog.severity.name will be set to "alert" and log.syslog.severity.code will be set to 1.
Maintenance
Deploying an updated schema
Once the patch is merged and CI has built and deployed the new documentation:
- Download the mapping template from the ECS docs page and add it to Puppet in the logstash templates directory. The Logstash ECS Cleanup Filter may need updating as well.
- Update the version => revision pair in the versions hash. One version can have only one revision available at a given time.
- Merge Puppet changes.
Deploying a new version of ECS
Update the ECS repository to check out the new version and resolve build issues, if any, then:
- Download the mapping template from the ECS docs page and add it to Puppet in the logstash templates directory. The Logstash ECS Cleanup Filter may need updating as well.
- Add the version => revision pair to the versions hash. One version can have only one revision available at a given time.
- Merge Puppet changes.