Analytics/Data access guidelines

From Wikitech
tl;dr

What is considered “sensitive data”? Any data or set of data that is covered by one of our privacy policies (e.g. [1], [2], [3]), that can be used to identify a particular user, or whose public release would likely cause reputational damage to the WMF.

Can I download sensitive data to my laptop? No, not without approval from Analytics and Security.

Can I share sensitive data with someone who isn't under NDA? No. See guidelines for public release before sharing data with anyone who is not under NDA.

Can I share non-sensitive data publicly? Yes, but the data must be reviewed by Analytics, Legal, and Security.

Is it ok to store sensitive data on Toolforge or Cloud VPS? It is never ok to store sensitive data on Toolforge or Cloud VPS.

Is it ok to send sensitive data via email? No, never.

Is it ok to share data if I anonymize it? No. Anonymizing data is hard to do right: anonymization does not hold up against a linkage attack, and the more public data there is, the easier a linkage attack becomes.

Do I need special permissions to access sensitive data? In general, getting access requires that you be under a nondisclosure agreement (NDA), have manager approval, and undergo a standard review period. Other requirements may apply.

This is a reference guide for new and existing staff members and contractors who are under a nondisclosure agreement and work with sensitive data. This document describes data access generally, although your department or team may have more explicit policies and guidelines in addition to these.

These guidelines will help prevent unauthorized users from gaining access to sensitive data. Additionally, if you have access to sensitive data, you need to take extra care with your own security to prevent someone from impersonating you to access data.

“Sensitive data” in this context refers to data governed explicitly by the WMF's privacy policy or donor privacy policy, data or sets of data that can be combined to identify a particular user, or any data whose public exposure would likely cause harm to a user or reputational damage to the WMF. A non-exhaustive list of sensitive data includes:

  • IP addresses and User Agent strings
  • Information about donors or donations
  • Search or browsing history of a particular user
  • Reading history of particular users
  • Article content from private wikis
  • Password hashes

Executive summary

  • Do not share any data externally without first receiving approval from your department's privacy contact, Security, and Legal.
  • We have six primary categories of sensitive data:
    • request logs, and derivative datasets;
    • event logs;
    • operational logs;
    • private application data;
    • fundraising data;
    • survey data.

In general, getting access requires that you (1) be under a nondisclosure agreement (NDA), (2) have manager approval, and (3) undergo a standard review period (as explained below). Other requirements below may apply.

  • To share and move data internally, you must transfer the data through one of the Wikimedia servers via SSH. Do not transfer data through any other means. Do not transfer data to anyone not covered by an NDA. There are additional requirements below for transferring sensitive data.
  • To share and move data externally, you must ensure data is free of sensitive information. Doing this well is difficult as the data in question can be recombined with data from other datasets (linkage attack), so both the data release process and example results should be reviewed thoroughly.

Data sources

This section describes the internal data sources that contain potentially sensitive data.

Request logs

Request logs are logs of every HTTP or HTTPS request made to a Wikimedia site. We store a complete log of web requests inside the Analytics network.

To access a machine that has a connection to the Analytics cluster, you must first connect to our network, which requires you to have a private SSH key that has been authorized by the Analytics Operations team.

Operations approval requires that users of the cluster:

  • demonstrate a legitimate need for this data;
  • be under an NDA;
  • have approval from their direct manager to access the data; and
  • undergo a 3-day waiting period in which Operations engineers can raise any concerns they may have about the user's handling of data.
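
The four requirements above can be read as a simple checklist. The following is a minimal sketch of that checklist, not an actual WMF tool: the `AccessRequest` class, its field names, and the `unmet_requirements` function are hypothetical, and the sketch assumes the 3-day waiting period is counted in calendar days from the day the request is submitted, which the guidelines do not specify.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Length of the waiting period stated in the guidelines.
WAITING_PERIOD_DAYS = 3


@dataclass
class AccessRequest:
    """Hypothetical record of a request for Analytics cluster access."""
    has_legitimate_need: bool
    under_nda: bool
    manager_approved: bool
    submitted_on: date


def unmet_requirements(request: AccessRequest, today: date) -> list[str]:
    """Return the requirements that the request does not yet satisfy."""
    unmet = []
    if not request.has_legitimate_need:
        unmet.append("legitimate need not demonstrated")
    if not request.under_nda:
        unmet.append("no NDA on file")
    if not request.manager_approved:
        unmet.append("manager approval missing")
    # Assumption: calendar days, counted from the submission date.
    if today < request.submitted_on + timedelta(days=WAITING_PERIOD_DAYS):
        unmet.append("3-day waiting period not yet over")
    return unmet
```

An empty list means every listed requirement is met; it does not replace the judgment of the Operations engineers, who can still raise concerns during the waiting period.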

The request logs contain various pieces of sensitive information. Of particular note are users' IP addresses, user agents, and in certain cases (such as with the Wikimedia mobile apps) unique identifiers. IP addresses and user agents are considered “personal information” under the Wikimedia Foundation's privacy policy and should not be disclosed publicly or to third parties unless such disclosure falls under a permissible disclosure exception under the privacy policy. Please check with Legal to be sure.
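
Because IP addresses are easy to paste by accident into a task, chat message, or report, a quick scan before sharing any log excerpt can help. The sketch below is a rough heuristic built on Python's standard `ipaddress` module; the function name `find_ip_addresses` is hypothetical. It only flags tokens that parse as IPv4 or IPv6 addresses, it can produce false positives (for example, a four-part version number), and it says nothing about user agents or other sensitive fields, so it does not replace review by Legal or Security.

```python
import ipaddress
import re


def find_ip_addresses(text):
    """Return tokens in `text` that parse as IPv4 or IPv6 addresses."""
    found = []
    # Split on whitespace and common punctuation that surrounds addresses.
    for token in re.split(r"[\s,;()\[\]\"']+", text):
        candidate = token.strip(".")  # drop a trailing sentence period
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not an IP address
        found.append(candidate)
    return found
```

A non-empty result is a signal to stop and check before sharing; an empty result is not proof that the text is safe.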

A particular user's reading history, in aggregate, is considered sensitive, and should not be disclosed externally.

EventLogging

EventLogging is a system for tracking user-side events on the Wikimedia projects. We use this for things like identifying how often people actually click through to search results, or which options on a page people tend to select. Additionally, the Wikimedia mobile apps have per-installation unique identifiers.

The information that an EventLogging table contains varies depending on what its schema is used for, but every table includes various forms of sensitive information by default (specifically, user agents and browsing history) and may contain others depending on the use. Personal information should be kept confidential, as explained above. EventLogging schemas might contain data that has been deleted or suppressed for legal reasons on WMF websites, such as the text of a user's name or a page title, which should be treated as sensitive data.

Operational logs

Operational logs are generated by various systems within the cluster, and are used by application developers and the operations team to monitor the health of those systems and respond to incidents. Operational logs frequently contain IP address and user-agent information, as well as details of particular users' activities (logins, edits, blocks).

Access to this data requires access to the WMF cluster, which requires an NDA. Similar to webrequest logs, access to specific log servers requires access to the analytics network and manager approval. Access to logstash requires that you are in one of the LDAP groups: nda, ops, or wmf.
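
The logstash rule above is a membership test: belonging to any one of the three LDAP groups is enough. As a minimal illustration (the function name `may_access_logstash` is hypothetical, and how a user's group list is obtained from LDAP is outside this sketch):

```python
# LDAP groups named in the guidelines as granting logstash access.
LOGSTASH_LDAP_GROUPS = frozenset({"nda", "ops", "wmf"})


def may_access_logstash(user_groups):
    """Return True if the user belongs to at least one qualifying group."""
    return not LOGSTASH_LDAP_GROUPS.isdisjoint(user_groups)
```

For example, `may_access_logstash(["nda", "some-project"])` is true, while a user with no groups or only unrelated groups is refused.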

Private application data

Within the WMF cluster, several applications store sensitive data. Although this data is available to users with general cluster access (deployers, ops-team), it should only be accessed when debugging issues that cannot be resolved another way, or for operational work at the explicit request of the data owner.

This data includes:

  • Deleted and suppressed user-contributed data
  • IP addresses of editors (CheckUser data)
  • Authentication data (tokens, passwords, shared secrets) and encryption keys ($wgSecretKey)
  • Content from private wikis (wikis listed in private.dblist, including officewiki, arbcom wikis, stewardwiki, etc.).

Sharing and moving data internally

Sometimes the work we do at WMF requires staff and contractors with direct access to sensitive data to share it with other staff and contractors. On other occasions, data needs to be transferred between machines (perhaps to a person's local machine, rather than a server) for analysis or ad-hoc processing. This section provides guidance on how to handle these scenarios.

Sharing data with other staff and contractors

There are situations that require sharing restricted data with other WMF staff and contractors.

If you find yourself in a scenario where you need to share sensitive data with other staff and contractors, or they need to share sensitive data with you, you are expected to keep the data as secure as possible. In particular, no sensitive data should ever be transferred via email, IRC, direct (non-SSH) file transfer, or any other unauthorized method.


Should you need to transfer this data, it should only be transferred through one of the Wikimedia servers via SSH. Within the Analytics network, “stat1004” and “stat1007” are commonly used; within the Production network, “fluorine.” Find a server that works for the sender and receiver, place the files in your home directory on that server, confirm that they have been retrieved, and delete the server-side copy.

Example

An analyst has performed some research and discovered something that needs to be passed to the engineers, and it can only be effectively represented in a form that contains personally identifiable information (PII). They copy this data (including the PII) to fluorine, a machine that engineers can access, and follow the same pattern described above. The data is held until the problem is understood, and then destroyed.

When files are created on the server, ensure that file permissions restrict access as much as possible (e.g., make the file group-readable by a group in which both the sender and the receiver are members, not world-readable).
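
The permission guidance above can be sketched with Python's standard `os` and `stat` modules. This is an illustration under assumptions, not a prescribed procedure: the function names are hypothetical, the sketch assumes a POSIX server, and changing a file's group only works if you are a member of the target group.

```python
import os
import stat


def restrict_to_group(path, gid=-1):
    """Make `path` read/write for the owner, read-only for the group,
    and inaccessible to everyone else (mode 0640).

    If `gid` is given, also change the file's group to that numeric
    group ID; -1 leaves the group unchanged.
    """
    if gid != -1:
        os.chown(path, -1, gid)  # -1 keeps the current owner
    os.chmod(path, 0o640)


def is_world_accessible(path):
    """Return True if users outside the owner and group have any access."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IRWXO)
```

Checking `is_world_accessible` after placing a file in your home directory is a cheap way to confirm that it is not exposed while it waits to be retrieved. The server-side copy should still be deleted once the receiver confirms retrieval.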

Files containing sensitive data should only be transferred to staff and contractors who have signed an NDA and who have demonstrated legitimate use for the data in a Phabricator ticket asking for it. For example, transferring search data to a WMF staff engineer developing a new search suggestion engine could be acceptable, while transferring the same data to a volunteer, or a staff engineer who is simply curious about it, would not.

The Toolforge and Cloud VPS clusters are never an appropriate intermediary for transferring data between staff and contractors, or for anything else involving sensitive data, because Toolforge and Cloud VPS do not guarantee the inability of other system users to access your data.

Sensitive data in Phabricator

Phabricator is a public system, and nearly all data within Phabricator is publicly accessible. As a last resort, sensitive data can be recorded in Phabricator to debug or solve a particular problem. You must set appropriate access controls on the data, and limit the amount shared to the minimum necessary to solve the issue.

Phabricator allows strict View and Edit policies to be applied to Tasks. If including sensitive data in a Phabricator task, you should be sure you understand how to set the view policy correctly. The easiest way to avoid mistakes is to use the following form: Private Task (WMF-NDA). The form has a pre-configured policy that will restrict visibility to only users who are under NDA.

If sensitive data is accidentally added to Phabricator, a Phabricator Administrator should be notified, so that the data can be permanently removed. If you notice sensitive data posted in a public task, consider using the "Protect as security issue" item in the task sidebar to immediately restrict access to the task until an administrator is able to respond.

Surveys and long-term storage

Surveys are an exception to storing data on the Wikimedia servers (via SSH). Certain survey data must be kept long-term in order to conduct year-over-year analyses of major issues in our movement (e.g. gender gap, community health, etc.).

Sharing data externally

One of the things the Wikimedia Foundation prides itself on is its openness and transparency. This extends to our data, and we try to give back to the wider research community to enable research that improves our work and the Internet as a whole.

At the same time, we also pride ourselves on our commitment to respecting privacy. Sharing data externally should be done cautiously with every effort expended to ensure that sensitive information is never released. Raw data containing sensitive information should never, under any circumstances, be released publicly or to individuals not covered by the Wikimedia NDA. If you are unsure whether a type of data constitutes sensitive information, please reach out to your department's privacy contact or the Security team. If you need additional assistance or clarification, you can contact Legal.

Sanitization. Do not do it.

Sanitization is very difficult to perform correctly, mostly because the dataset you are concerned with could be cross-referenced with other datasets that are already public, and this cross-reference might lead to a leak of sensitive data (this is called a linkage attack). Before releasing any dataset or sharing a dataset with an external party (this includes volunteers), please check with Security, Analytics, and Legal. Remember that removing sensitive fields does not ensure you are not leaking sensitive information. An example of how to file a task can be found in phab:T161656 (please note the restricted "Custom Policy" visibility of that task).
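
To make the linkage attack concrete, here is a small sketch using entirely synthetic data. All names, field names, and values are invented for illustration and do not correspond to any real dataset or schema. The "released" rows have had usernames removed, yet joining them with a separate "public" table on the remaining fields (often called quasi-identifiers) re-identifies any row whose combination of values is unique.

```python
# Fields left in the release that also appear in public data.
QUASI_IDENTIFIERS = ("city", "browser", "signup_year")

# Synthetic "sanitized" release: usernames removed, other fields kept.
released = [
    {"city": "Lyon", "browser": "Firefox", "signup_year": 2009, "topic": "topic-a"},
    {"city": "Lima", "browser": "Chrome", "signup_year": 2015, "topic": "topic-b"},
    {"city": "Oslo", "browser": "Safari", "signup_year": 2012, "topic": "topic-c"},
]

# Synthetic public data, e.g. gathered from public profile pages.
public = [
    {"username": "UserA", "city": "Lyon", "browser": "Firefox", "signup_year": 2009},
    {"username": "UserB", "city": "Lima", "browser": "Chrome", "signup_year": 2015},
    {"username": "UserC", "city": "Lima", "browser": "Chrome", "signup_year": 2015},
    {"username": "UserD", "city": "Kyoto", "browser": "Edge", "signup_year": 2020},
]


def link(released_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Map the index of each released row to a username when exactly one
    public record shares its quasi-identifier values."""
    index = {}
    for row in public_rows:
        key = tuple(row[k] for k in keys)
        index.setdefault(key, []).append(row["username"])
    matches = {}
    for i, row in enumerate(released_rows):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # unique match: the row is re-identified
            matches[i] = candidates[0]
    return matches


print(link(released, public))  # {0: 'UserA'}
```

Here the first released row is re-identified even though its username was removed. The second row escapes only because two public users happen to share its values, and every additional public dataset makes such ambiguity less likely.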

References