Background

Project page: https://www.mediawiki.org/wiki/Analytics/Kraken
Documentation:
- Overview: https://www.mediawiki.org/wiki/Analytics/Kraken/Overview
- Software Overview: https://www.mediawiki.org/wiki/Analytics/Kraken/Overview/Software
- Services and Ports: https://www.mediawiki.org/wiki/Analytics/Kraken/Ports
- Infrastructure / Hardware: https://www.mediawiki.org/wiki/Analytics/Kraken/Infrastructure
Related:
- ACL Ticket: https://rt.wikimedia.org/Ticket/Display.html?id=4433
- Security Meeting Notes: https://www.mediawiki.org/wiki/Analytics/Kraken/Meetings/SecurityReview

Meeting Notes

Minimum Viable Cluster

Our first goal is get to a "minimum viable" cluster setup. To meet our stakeholder obligations, we'll need:

Unsampled storage of Mobile pageview requests as txt files
Unsampled storage of EventLogging data as JSON or txt files
Importing MediaWiki databases
Ability to access raw datafiles by WMF employees
Ability to query data using Pig, Hive, MapReduce, and Hadoop streaming
Ability to schedule jobs and workflows (Oozie)
Automated recovery of data importers (Zookeeper)

Not part of MVC – deferring for later discussion:

Realtime processing layer (Storm)
Guaranteed delivery of web-request data (Kafka producers and future of udp2log)

Dataflow Outline

Mark: Asher says mediawiki import cannot be part of MVC -- he said it can be done, but I don't have all the details. Maybe something about a "binlog API"?
Faidon: This doesn't seem like a very efficient use of hardware...
- Yeah, totally not. It's an evolutionary deployment, but this is a description of how things stand today. The end goal (and scope) looks very different. This was a way to simulate a production setup (including Kafka on the edge) without making production changes. A "production setup" is not a near-term goal any more (as we've scoped out ETL for now), so this looks a bit weird.
Faidon: Rolling resetup/reimage means some number of machines should be detached from the current cluster for the process of reviewing and rebuilding off the mainline puppet
- Kraig: Yeah, but unfortunately the team has limited bandwidth to maintain and work on two cluster, two setups, and so on -- as well as meeting our deliverables.
- Faidon: Well, Ops should probably be contributing more time to this project.
- Otto: Also, if we're not perf testing, we can totally do most of this in labs.

Data Analyst Access

Otto: Only people who currently need access are WMF employees who have shell access. But it's a bit onerous to access all the different services needed to administer the cluster or debug jobs (as they're on many different machines with many different ports) via a bunch of SSH tunnels.
- Mark: Okay, so let's deal with that by restricting access to shell accounts, and talk about the future.
NN user issue: use kerberos for inter-node auth, or iptables on NN to protect data comm ports. Ops will def help in making that happen ASAP.

Privacy & Anonymization

Everyone agrees this privacy & anonymization is a critical requirement, and it needs to happen as part of the Base Cluster.
Hash(IP+Salt) is the standard; definitely need to rotate the salt regularly, not keep records, etc.
- (Note: The salt will need to live in ZK to be accessible and consistent across processing nodes, but mark the key it as volatile so it isn't persisted anywhere.)

Tim: Currently, we keep 3 months of IPs in checkuser; if we keep IPs around for longer than that, it would be problematic.
Mark: Performing anonymization on the edge is feasible, and probably preferable for a bunch of reasons -- especially from ESAMS. We should look into this.
- Might also be able to do it in MediaWiki, but that wouldn't touch all traffic

Followups

Kafka Contractor

Magnus Edenhill, who wrote the librdkafka, the major C library for the Kafka protocol with both producer and consumer support. Patrick reached out to him; he quoted us a super, incredibly reasonable price for:

varnishncsa logging module
Zookeeper support
Offered to add 0.8 protocol support pro-bono

To decide if we want to go forward with this work, we need to decide whether there's any meaningful interest in replacing udp2log, especially as Analytics doesn't have the time or bandwidth to manage this guy.

Comments:

Cost is low, and benefit is huge
Could use LVS to loadbalance the TCP connections
- Would be lumpy, as connections are long-lived, and it wouldn't be balancing on bytes
- Would make ZK support unnecessary

Conclusion:

Definitely followup with the dude
Ops (Mark) will manage/review the work

Minimal Viable Cluster

Immediate work:

Network ACL (Ops)
Set up iptables on the NN to only whitelist cluster nodes on the data ports (Otto)

Initial Base Cluster

At what point should the cutover happen between the MVC (current running cluster), and the newly built Base Cluster? That is: When do we cut over the data import stream?

Puppetization of all components, Op Friendship
Security is satisfactory
Anonymization of IPs

Who on Ops is going to be involved in setting this up and reviewing it? Volunteers:

Mark, Faidon, possibly others

Conclusions & Next Steps

Keep current cluster running as-is
Continue puppet work on Initial Base Cluster
- Find machines: either pull out ~5 datanodes, or do it in labs (prefer labs)
- Start building new cluster