Data Platform/Data Lake/Traffic/Webrequest/RawIPUsage

The goal of this document is to describe the use cases for the raw IP data stored in the webrequest dataset for 60 days. Use cases are documented per team.

Survey

In November 2016, we sent four questions to three mailing lists (ops, research-internal, wmfproducts) to ask users of the webrequest log table whether hashing the IP addresses in this table would interfere with their use of the data. Below you can find the questions asked and the responses we received, grouped by team.

Questions

1) Have you (or someone you have been the point of contact for) used raw IP addresses in the past 1.5 years?
2) If the answer to the previous question is yes, would you be able to do the same work based on hashed IP addresses at a reasonable cost (where cost can be the amount of time you would have to spend on it)?
3) If the answer to (2) is no, please explain what you needed the raw IP addresses for.
4) If you have never used raw IP addresses in webrequest logs but you know of important use cases for raw IPs, please share them.

Teams

Below you can find responses from different teams in the WMF. Please note that none of the respondents answered on behalf of their team; they responded in their capacity as individual team members. Any generalization of the results should take this into account.


Analytics

The Analytics team runs monthly metrics that rely on raw IPs. In order to allow buffer time to rerun metrics after bugs or data issues (which do happen), we retain raw IPs for about 90 days. We could reduce this interval somewhat, but not significantly. Note that any data WMF or the community needs split per country is also subject to these same restrictions.

Discovery

One response received. Access to raw IP addresses is needed if we want to provide different levels of service at the per-IP level. At the moment there are no such controls, but, for example, if an IP address abuses WDQS, the raw IP address allows us to block it (a hypothetical sketch of such a control follows below).
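Such per-IP controls do not exist today; the following is only a rough, hypothetical sketch of why they would need the raw (or at least stable) client address. The request limit and function names are made up for illustration and are not part of any WDQS implementation.

# Hypothetical per-IP throttle: requests must be attributable to a client
# address (raw or otherwise stable) before any blocking decision can be made.
from collections import Counter

REQUEST_LIMIT = 1000                 # assumed threshold, purely illustrative
counts: Counter[str] = Counter()
blocked: set[str] = set()

def handle_request(client_ip: str) -> bool:
    """Return True if the request is allowed, False if the IP has been blocked."""
    if client_ip in blocked:
        return False
    counts[client_ip] += 1
    if counts[client_ip] > REQUEST_LIMIT:
        blocked.add(client_ip)
        return False
    return True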

Editing

One response received. Raw IP addresses for editors have been used from the CheckUser tables, not from the webrequest log table.


Legal

3 responses received. Raw IP addresses can be useful for detecting ISP-specific blocking, but this has not been undertaken, at least recently. The geolocated data is used more frequently, for example for questions related to trademark (for instance, providing information about local Wikipedia usage at the time of a trademark registration request). Although this questionnaire does not change the way geolocated data is handled, the Legal team points out that MaxMind is not always accurate, and having access to raw logs may be useful in those cases. The respondents don't see an immediate need for keeping raw IPs in the table.
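For illustration only, a geolocation lookup against a MaxMind database (using the geoip2 Python library) might look like the sketch below. The database path and IP address are placeholders, and this is not the Legal team's actual tooling; the city-level result can be inaccurate, which is the caveat raised above.

# Illustrative sketch only: a MaxMind GeoIP2 city lookup.
# The database path and IP address are placeholders, not WMF infrastructure.
import geoip2.database

with geoip2.database.Reader("/path/to/GeoLite2-City.mmdb") as reader:
    record = reader.city("203.0.113.7")
    print(record.country.iso_code)   # e.g. "US"; may be None if unresolved
    print(record.city.name)          # city-level accuracy varies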

Research

4 responses received. The respondents commented that hashed IP addresses plus basic geo-coding information serve their needs (assuming there are no issues with the MaxMind DB). Research could see potential value in keeping raw IP addresses for research on specific IP ranges (for example, if we need to investigate issues around access to content from certain ISPs), but given that the raw data will be available for 60 days outside of this table, this is not a concern for Research at the moment.
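As a rough illustration of what "hashed IP addresses" could mean in practice, the sketch below keys a hash with a salt so that identical IPs can still be grouped while the raw address is not kept. The salt handling and function name are assumptions for illustration, not the actual webrequest pipeline.

# Illustrative sketch only: keyed (salted) hashing of an IP address.
import hashlib
import hmac

SALT = b"hypothetical-secret-salt"   # in practice this would be a managed, rotated secret

def hash_ip(ip: str) -> str:
    """Return a keyed SHA-256 digest of the IP address."""
    return hmac.new(SALT, ip.encode("utf-8"), hashlib.sha256).hexdigest()

print(hash_ip("203.0.113.7"))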

Technical operations

Two responses received. It's imperative for Ops to be able to examine raw IP addresses for a relatively short sliding window of time (for their purposes, a shorter-than-60-day limit may work). This data is useful for Ops for a variety of reasons, including but not limited to investigating DDoS attacks. Ops also needs to slice the data in more granular ways, for example to learn which ASN a request comes from, or whether an IP is IPv6. That being said, the respondent uses the sampled webrequest text logs on oxygen and the raw Kafka stream with kafkacat.
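As a hedged illustration of the kind of per-IP slicing mentioned above (IP version, originating ASN), the Python sketch below uses the standard ipaddress module and an assumed local GeoLite2-ASN database. The database path and addresses are placeholders, not Ops' actual tooling.

# Illustrative sketch only: determine IP version and look up the originating ASN.
import ipaddress
import geoip2.database
import geoip2.errors

ip = "203.0.113.7"                        # placeholder address
print(ipaddress.ip_address(ip).version)   # 4 for IPv4, 6 for IPv6

with geoip2.database.Reader("/path/to/GeoLite2-ASN.mmdb") as reader:   # path is a placeholder
    try:
        asn = reader.asn(ip)
        print(asn.autonomous_system_number, asn.autonomous_system_organization)
    except geoip2.errors.AddressNotFoundError:
        print("ASN not found for", ip)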

Zero

Two responses received. In the course of developing Wikipedia Zero, there was a need to access raw IP and proxy-vouched-for IP addresses. This information was used to track down piracy that was dramatically increasing partners' zero-rated traffic.