Data Engineering/FAQ

From Wikitech

Data Requests

The Wikimedia Foundation's Data Engineering team provides the infrastructure and services that let you put statistics into a system, and get them out again, with a fairly simple code snippet. We are an infrastructure team: we do not provide teams with numbers to measure their KPIs against. We will support your data access request, grant access to our systems where appropriate, and help you as needed. You can find us on IRC in #wikimedia-analytics.

Please see the Research FAQ on Meta to understand who at WMF owns research-related processes and resources, and where to find data or statistics about a specific Product Audience (such as editors or readers).

What is the Analytics Cluster?

The Analytics Cluster is a catch-all term for compute resources and services running inside of the Analytics VLAN, which itself is inside of WMF production networks. Individual systems in the Analytics Cluster can be referred to more specifically, e.g. Analytics Hadoop Cluster, Druid Analytics, Hive, etc.

How do I transfer files between stat boxes?

An rsync server is set up on the stat boxes that allows pulls from /home or from /srv.

stat1007$ rsync -av stat1006.eqiad.wmnet::srv/path/to/files/ /path/to/files/ 

Note the special '::srv' in the remote source path. The :: indicates an rsync module name. Use ::home/<username> if you want to access files in your home directory on a remote stat box.
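
As a rough illustration of the module syntax, the sketch below builds the remote source string for either module. The function name stat_rsync_src and the project/ directory are hypothetical and not part of any stat box tooling. The commented rsync call also assumes that your username is the same on both hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build "HOST::MODULE/PATH" for an rsync daemon pull.
# A single leading slash on PATH is dropped so the module name is followed
# by exactly one slash.
stat_rsync_src() {
    local host="$1" module="$2" path="$3"
    printf '%s::%s/%s\n' "$host" "$module" "${path#/}"
}

# Print the source string for the srv module on stat1006.
stat_rsync_src stat1006.eqiad.wmnet srv /path/to/files/

# Example pull from your home directory (assumes the same username on both
# hosts). --dry-run lists what would be copied without transferring anything.
# rsync -av --dry-run "$(stat_rsync_src stat1006.eqiad.wmnet home "$USER/project/")" "$HOME/project/"
```

Running the rsync line with --dry-run first is a cheap way to confirm the module and path are right before copying anything; remove the flag to perform the actual transfer.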

What do I do if I don't know what to do?

Wait 5 minutes and try again. It might work, and even if it doesn't you'll be generally happier for having taken a short break.