User:Razzi/grand SRE IC plan

From Wikitech

Trying to become the most capable SRE at WMF. Or at least really capable; the competition is really with myself.

I already have root ssh credentials, so there's nothing I can't do from a permission standpoint. But realistically it's worth treating ops work as separate from root; at this point I don't need root on the MediaWiki servers etc. It's pretty dangerous: somebody like me could easily dump some misinformation into the pipes and it would cause problems.

Also I've been in software for about a decade, so I have the technical foundation to learn anything. If you're trying to replicate this, know it'll take a few years to get up to speed on Unix, programming, networking, and cryptography. But really all you need is a good command line and willingness to be patient with yourself. Software stuff can be really dense; learning a single command line tool like `ssh` or `git` can be a lifelong process, or at least take a few weeks of concerted effort.

Ok so I'm starting on the data engineering team, and I have some open tickets to get things like Superset and Presto (eventually Trino) working. If I were an expert in all of these things this would all be easier, so I'll become an expert in each. First let me check my backlog to see what else I should do.

Aside: here's the job spec for a senior SRE for search platform:

    Deployment, scaling, monitoring, provisioning, and support of our Search and SPARQL endpoints
    Developing and maintaining automation tools and processes
    Providing guidance and expertise to the team on productionizing and operating our applications
    Configuration management and deployment tools
    Ensuring the continuous improvement and evolution of services on our platform
    Monitoring of systems and services, optimization of performance and resource utilization
    Incident response, diagnosis and follow-up on system outages or alerts
    Assisting in software updates

Skills and Experience:

    5+ years experience in an SRE/Operations/DevOps role as part of a team
    Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby, etc., we use primarily Python)
    Comfortable with Open Source configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack)
    Good understanding of Linux systems 
    Experience in automating tasks and processes, identifying process gaps, and finding automation opportunities
    Open to supporting JVM-based applications
    Strong English language skills and ability to work independently, as an effective part of a globally distributed team
    B.S. or M.S. in Computer Science or related field or equivalent in related work experience

And here's the job I have:

The Wikimedia Foundation is hiring a Site Reliability Engineer to support and maintain the data and statistics infrastructure that powers a big part of decision making in the Foundation and in the Wiki community. This includes everything from eliminating boring things from your daily workflow by automating them, to upgrading a multi-petabyte Hadoop cluster to the next upstream version without impacting uptime and users.

We're looking for an experienced candidate who's excited about working with big data systems. Ideally you will already have some experience working with software like Hadoop, Kafka, ElasticSearch, Spark and other members of the distributed computing world. Since you'll be joining an existing team of SREs you'll have plenty of space and opportunities to get familiar with our tech (Analytics, Search, WDQS), so there's no need to immediately have the answer to every question.

We are a full-time distributed team with no one working out of the actual Wikimedia office, so we are all together in the same remote boat. Part of the team is in Europe and part in the United States. We see each other in person two or three times a year, either during one of our off-sites (most recently in Europe), the Wikimedia All Hands (once a year), or Wikimania, the annual international conference for the Wiki community.

Here are some examples of projects we've been tackling lately that you might be involved with:

    Integrating an open-source GPU software platform like AMD ROCm in Hadoop and in the Tensorflow-related ecosystem
    Improving the security of our data by adding Kerberos authentication to the analytics Hadoop cluster and its satellite systems
    Scaling the Wikidata query service, a semantic query endpoint for graph databases
    Building the Foundation's new event data platform infrastructure
    Implementing alarms that alert the team of possible data loss or data corruption
    Building a new and improved Jupyter notebooks ecosystem for the Foundation and the community to use
    Building and deploying services in Kubernetes with Helm
    Upgrading the cluster to Hadoop 3
    Replacing Oozie with Airflow as a workflow scheduler

And these are our more formal requirements:

    Couple years experience in an SRE/Operations/DevOps role as part of a team
    Experience in supporting complex web applications running highly available and high traffic infrastructure based on Linux
    Comfortable with configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.), and modern observability infrastructure (monitoring, metrics and logging)
    An appetite for the automation and streamlining of tasks
    Willingness to work with JVM-based systems  
    Comfortable with shell and scripting languages used in an SRE/Operations engineering context (e.g. Python, Go, Bash, Ruby, etc.)
    Good understanding of Linux/Unix fundamentals and debugging skills
    Strong English language skills and ability to work independently, as an effective part of a globally distributed team
    B.S. or M.S. in Computer Science, related field or equivalent in related work experience. Do not feel you need a degree to apply; we value hands-on experience most of all.

Ok back to the backlog. I see:

T293083 Superset SQL Lab fails to stop query
T288975 Cookbook to reboot cassandra nodes
T294772 Superset Timeout Logging
T294771 Increase Superset Timeout
T294768 Triage Superset Dashboard Timeouts
T292087 Setup Presto UI in production
T273004 Presto should warn or prevent users from querying without Hive partition predicates
T277553 varnishkafka / ATSkafka should support setting the kafka message timestamp
T273850 Superset caching doesn't enforce data access permissions
T269832 Add a presto query logger
T279738 Superset annotation text overlaps illegibly
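
To see where the work clusters, the list above can be grouped by system. This is only a rough sketch, assuming plain keyword matching on the ticket titles is good enough; the `COMPONENTS` tuple and both function names are made up for illustration and are not part of any Phabricator tooling.

```python
import re
from collections import defaultdict

# The backlog as listed above, one "T<number> <title>" entry per line.
BACKLOG = """\
T293083 Superset SQL Lab fails to stop query
T288975 Cookbook to reboot cassandra nodes
T294772 Superset Timeout Logging
T294771 Increase Superset Timeout
T294768 Triage Superset Dashboard Timeouts
T292087 Setup Presto UI in production
T273004 Presto should warn or prevent users from querying without Hive partition predicates
T277553 varnishkafka / ATSkafka should support setting the kafka message timestamp
T273850 Superset caching doesn't enforce data access permissions
T269832 Add a presto query logger
T279738 Superset annotation text overlaps illegibly
"""

# Hypothetical keyword list; the first keyword found in a title wins.
COMPONENTS = ("superset", "presto", "cassandra", "kafka")


def parse_backlog(text):
    """Return a list of (ticket_id, title) tuples from the raw text."""
    tickets = []
    for line in text.splitlines():
        match = re.match(r"(T\d+)\s+(.+)", line.strip())
        if match:
            tickets.append((match.group(1), match.group(2)))
    return tickets


def group_by_component(tickets):
    """Group ticket IDs by the first component keyword in the title."""
    groups = defaultdict(list)
    for ticket_id, title in tickets:
        lowered = title.lower()
        component = next((c for c in COMPONENTS if c in lowered), "other")
        groups[component].append(ticket_id)
    return dict(groups)


if __name__ == "__main__":
    for component, ids in group_by_component(parse_backlog(BACKLOG)).items():
        print(f"{component}: {len(ids)} ticket(s): {', '.join(ids)}")
```

Most of the backlog lands under Superset, with Presto second, which matches the plan of becoming an expert in each of those first.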