User:BryanDavis/Developing community norms for critical bots and tools
Developing community norms for critical bots and tools
Don't let your brother-in-law's niece be the only person who keeps your community running
Variations on this talk have been presented multiple times.
I got started in Toolforge by helping maintain an IRC bot. This particular bot had not been running for very long, but it was doing work that the Release Engineering team had wanted for quite a while. The original author had recently left the WMF to take another job (an awesome job too, but that's another story), but there were co-maintainers so it wasn't orphaned, just neglected. One day the bot was down, and when I poked a maintainer to restart it they added me as a maintainer instead and pointed me to the on-wiki doc about how to run the bot. This might sound like the maintainer being lazy, but really it was them being smart. I was paying attention to the bot and motivated to keep it running. Since then I've become the most active code contributor to the bot and caught the Toolforge bug.
Over time I became more and more involved in Toolforge. I have spent evenings and weekends idling in the IRC channel to try and help answer questions. I filed bugs, triaged bugs, and wrote some docs on Wikitech. One of the things that I started to notice was that some tools came up more often than others, both on IRC and in the bug reports. There are lots of things that are broken in Toolforge (and Cloud VPS and production and the Internet in general) all the time, but there were a number of tools that users actually seemed to care about. I've been doing software for quite a while and I know that if someone goes to all the trouble of writing a bug report or joining an IRC channel to ask for help, they really care.
So here's where I make a big confession. I'm not really a Wikipedian. I consider myself a Wikimedian, but most of my interest and contributions are on the FLOSS software side. I wasn't really clueful about how things get done behind the scenes in the various wiki projects. I started asking around, talking to "real Wikipedians" and learned quite a bit about how important some bots and tools can be on-wiki. There are tools that help with detecting bad edits, tools that help with making proper citations, tools that help you find pages that are in need of various kinds of attention, tools that help stewards and patrollers figure out things about the edit habits of others. All of you in this room probably can tell me about a tool or two or three that you use regularly.
Learning all of this led to my thesis that bots and tools are a vital resource for many on-wiki content creation and curation activities. I'm not certain that this thesis has been proven yet, but it certainly has not been disproven. I used it as the core argument to change my job at the WMF and create a new group (of size one) to look into ways to make it easier to build and maintain tools. The hoped-for outcome of this project is more tools that are better maintained, with the presumption that this will also make things better, faster, and easier for on-wiki communities.
All tools are unique snowflakes
A typical bot or tool project begins life as a way for a motivated Wikimedia community member to make some on-wiki task easier (or possible!). These individuals are "scratching their own itch" in the best tradition of open source development. Many of these projects have a short lifecycle due to factors such as loss of interest by the maintainer, insurmountable technical hurdles, or discovery of a better means to solve the original problem. A few, however, become popular and tightly integrated into the workflows of one or more on-wiki communities.
There is a wide range of experience and practices among the Toolforge developer community. Some tools are developed by professional software engineers with years of real-world experience in designing and building highly reliable and maintainable software. Other tools are built by people who are just learning to write code by following online tutorials. Some tool maintainers have years of experience as contributors to the Wikimedia projects and others are just discovering the Wikimedia world. Some tools are built with 100% from-scratch code and others use many third-party frameworks and libraries. Some start with a group of like-minded developers and some are solo works that have never been discussed with others. Tools are built using both well-known and esoteric programming languages.
No level of experience, programming language, or process is intrinsically better or worse than another. The differences emerge over time. In my opinion, the best tools are the ones that end up fulfilling a need for an on-wiki community and have maintainers who remain responsive to requests from their users.
Some ways that things can go wrong
I've got two real-world examples of tools that have serious issues that could have been avoided. I don't want you to leave today thinking that the developers of these tools are bad people or that they have failed the movement. These examples are presented as a retrospective to illustrate my broader points. We are not here to point fingers or place blame; we are here to learn what not to do next time. In that spirit, I'm going to try not to "name and shame" the tools involved directly. If you really want to know which exact tools I'm talking about you can dig around on Phabricator.
It's the little things
A Phabricator bug is filed about a tool that is often down. Nothing too new there, except this particular tool is linked to in templates on many wikis. And these templates are used in quite a few pages: something like 20k direct transclusions on enwiki and 120k on dewiki.
Toolforge admins are aware of this tool and its stability issues. Yuvi (who is awesome by the way) has migrated it from the older Ubuntu Precise job grid hosts to the newer Ubuntu Trusty hosts and given it almost double the memory that a tool is granted by default to try and make it more stable. A user takes on monitoring as a pet project and updates the ticket regularly when the tool is down. Yuvi and Valhalla do a lot of restarts, but nobody ever seems to hear from the maintainer.
The tool is actually doing worse things than just being intermittently down, however. It causes a memory leak in its webserver that begins to affect other tools on the job grid. Massive amounts of memory are being consumed and are only freed by stopping and starting the tool's webservice. Valhalla and Yuvi continue to investigate the issue, but they really need some support from the tool's maintainer.
After repeated pings on Phabricator and wiki talk pages, the maintainer shows up and explains that they have lost interest in this particular tool. They provide a link to the source code but decline to choose a software license. They instead state "you can do what you will with it". I tried a couple of times to get them to change their mind about this point, but thus far it has not happened. That's pretty much the end of the story. For reasons I'll explain in more detail later, an unlicensed, unmaintained tool is a dead tool.
A perfect storm
This next example shows how multiple small issues can compound over time. I watched this particular project go from needing a small update to being forced to shut down entirely. Many people tried to help along the way, but ultimately some small omissions by the original tool author and external forces created a perfect storm that killed the tool.
The tool itself was a collection of cron jobs that had been approved to do many different tasks for a large Wikipedia project. I don't know the full history here, but I imagine that it followed similar patterns I have seen elsewhere. The author wrote a script to do some task that was needed on wiki. When that task was taken care of and things were working well someone pointed out another task that could use attention from a bot. Eventually this tool grew to have control over a large number of related curation tasks and made hundreds of useful edits on any given day. Everything was fine for days, weeks, months, maybe even years.
The first chink in the armor showed up in August of last year (2015). The bot was found to be among a number of Action API consumers that were still using HTTP after the global switch to HTTPS. This was possible due to a loophole that had been left open in the server configuration for POST traffic. The tool maintainer responded that they would have to get new software packages installed on Toolforge in order to fix things properly due to issues in the third-party libraries they used.
This particular software upgrade had been looked into previously and, for various technical reasons, deemed too difficult to do properly. Toolforge is a shared hosting environment: lots of different tools and bots are running on the same computers, which makes things like software version upgrades trickier than they would be on a typical computer with a single user. The Wikimedia Foundation also has various internal guidelines about where we should get software updates from to help ensure that the computers remain as secure as we can make them. Our approved channels did not provide the desired version of the software.
Fast forward to December 2015. The desired package upgrade still hasn't happened and another user opens a new bug asking for it. This is investigated and again found to be a problematic upgrade for Toolforge. When the upgrade bug is closed as declined, we open a new bug specifically to look for a solution that will work for the tool. The maintainer again asserts that the fix would be easy if only we would provide the software upgrade that has now been rejected twice.
In May of 2016 the discussion on the task heats up again. Cutoff dates have been announced for closing the POST loophole and the bot is near the top of the list for known violations per day. The solo maintainer has recently posted on another task that they do not have the ability to work on anything wiki related for at least a few weeks.
By this point, after talking to folks on the wiki who would be impacted, I was pretty invested in finding a solution and willing to pitch in to try to keep the tool running. Since we knew the maintainer was going to be AFK for a while, I used my Toolforge admin powers to look into how the tool was put together.
After looking at the tool's scripts and configuration I found more bad news. The tool was written in Java (no problem, I know Java), but there is no source code on the Toolforge server, only compiled jar files (shit). This greatly limits what I can do to try and fix things. One of the ideas that has come up is to run a dedicated proxy service that converts the insecure HTTP POST requests that the bot is making to HTTPS POST requests that the Wikimedia servers are expecting and will soon require. Based on this I created and deployed a trivial HTTP to HTTPS transparent proxy (nginx to the rescue!).
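The proxy described above was built with nginx, and its configuration is not reproduced in this talk. Purely to illustrate the idea, here is a minimal Python stand-in that accepts proxied plain-HTTP POST requests and replays them over HTTPS. The function and class names, the set of skipped headers, and the port are assumptions made for this sketch, not details of the real deployment.

```python
"""Illustrative HTTP-to-HTTPS forwarding proxy (a sketch, not the nginx setup)."""
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen

# Headers that should not be copied to the upstream request. Accept-Encoding
# is dropped so the upstream reply is not compressed behind the client's back.
SKIPPED_HEADERS = {"host", "connection", "proxy-connection",
                   "content-length", "accept-encoding"}


def upgrade_url(url):
    """Rewrite an http:// URL to https://, leaving everything else intact."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError(f"expected an http:// URL, got {url!r}")
    return urlunsplit(parts._replace(scheme="https"))


class UpgradeHandler(BaseHTTPRequestHandler):
    """Replay each proxied plain-HTTP POST against the same URL over HTTPS."""

    def do_POST(self):
        # A client configured to use an HTTP proxy sends the absolute URL
        # in the request line, so self.path looks like "http://host/w/api.php".
        try:
            target = upgrade_url(self.path)
        except ValueError:
            self.send_error(400, "expected an absolute http:// URL")
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        headers = {name: value for name, value in self.headers.items()
                   if name.lower() not in SKIPPED_HEADERS}
        request = Request(target, data=body, headers=headers, method="POST")
        try:
            with urlopen(request) as upstream:
                status = upstream.status
                reply_headers = upstream.headers
                payload = upstream.read()
        except HTTPError as err:
            # Pass upstream error responses back to the client unchanged.
            status, reply_headers, payload = err.code, err.headers, err.read()
        self.send_response(status)
        self.send_header("Content-Type",
                         reply_headers.get("Content-Type", "application/octet-stream"))
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def main(port=8080):
    """Serve forever on localhost; the port number is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), UpgradeHandler).serve_forever()
```

Calling `main()` starts the proxy. A client only benefits if it can be told to send its requests through it, which is exactly where this story runs into trouble.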
The proxy was up, but we still had no response from the maintainer to help with testing it and time was running out. I decided that I would try to play the hero and make the fixes myself. I edited the TWENTY EIGHT job startup scripts to pass the correct arguments to the Java runtime and crossed my fingers that this would be all that was needed.
Sadly it was not. The particular libraries used by the tool didn't work with the standard Java HTTP proxy configuration values. I did further investigation and determined that there would be no way to fix the problem without changing the source code. Source I did not have access to because it was not present on the Toolforge server and not published by the tool maintainer.
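For reference, the standard Java HTTP proxy configuration values are the `http.proxyHost` and `http.proxyPort` system properties, passed to the runtime as `-D` options before `-jar`. The sketch below shows in Python how a startup wrapper might assemble such a command line. The jar name, host, port, and extra arguments are hypothetical; the real startup scripts are not shown in this talk.

```python
import shlex


def java_command(jar_path, proxy_host, proxy_port, extra_args=()):
    """Build a `java` argument list that sets the standard HTTP proxy properties."""
    return [
        "java",
        # System properties must come before -jar, or they are treated as
        # arguments to the program instead of options for the JVM.
        f"-Dhttp.proxyHost={proxy_host}",
        f"-Dhttp.proxyPort={proxy_port}",
        "-jar",
        jar_path,
        *extra_args,
    ]


# Hypothetical example values, for illustration only.
argv = java_command("bot.jar", "localhost", 8080, ["--task", "archive"])
print(shlex.join(argv))
```

These properties are honored by Java's built-in `java.net` HTTP client code. A third-party HTTP library is free to ignore them, which matches what happened here.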
The Wikimedia techops team was nice enough to add an exemption to the initial breaking change for insecure POSTs that gave the tool an extra week that no one else got to complete their fixes. The maintainer suddenly appeared on Phabricator right around this time and yet again asked for more special treatment for their tool. This is the same request for a newer version of Java that had been examined and rejected twice previously with quite detailed analysis as to why it was not a reasonable solution for Toolforge at the time.
On June 20th, 329 days after the first Phabricator contact trying to warn of the issue, I shut down the cron jobs for the tool because the requests were all erroring out. One absentee maintainer, no source code, no license, procrastination, and demands for special treatment had killed the project.
Best practices that would have helped
Any project (bot, web tool, gadget, ...) that is valuable for an on-wiki workflow (editing, patrolling, archiving, ...) really must have certain things to protect the community. Developers put in a lot of effort to build new things and keep them running. No one should feel that they must be available 24/7/365 to support their applications. On the other hand, the sun never sets on the Wikimedia movement and useful and popular projects will experience issues at all times of the day and night. By adopting a few simple practices, a tool maintainer can make it easier for others to help them out and keep their tool running.
Here's my short list:
- Pick a license
- Publish the code
- Have multiple maintainers
- Write some documentation
Pick a license
Go to https://opensource.org/, pick one of the licenses that they have reviewed, and apply it. Done.
There is a great resource at http://choosealicense.com that tries to explain the differences and why you might prefer one license over another. It does list some non-OSI licenses though, so after you find one you like there, double-check it against the OSI list.
Publish the source code
The Wikimedia movement is all about open knowledge. Publishing your source code has numerous potential benefits and in my mind very few downsides. It is fairly trivial to set up a Git repository with public browsing and downloading capabilities. The Wikimedia Foundation, Bitbucket and GitHub all provide this for free as in beer. The Wikimedia Foundation hosting is even free as in speech if that is important to you. As a bonus, you get a backup of your code that lives somewhere besides Toolforge and/or your laptop.
If version control isn't for you for some reason, that's fine. Find another way to expose your source code. I have a couple of trivial tools that just include a "?source" link that will dump the source code in response. Brad's AnomieBOT has a whole subroutine that publishes its source to enwiki as pretty pages with a browsing interface.
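A "?source" link of this kind can be very small. The following is a minimal sketch using only the Python standard library, not the code of any actual tool; `make_app` and the placeholder page text are names invented for this example.

```python
from pathlib import Path
from urllib.parse import parse_qs


def make_app(source_path):
    """Return a WSGI app that dumps the file at source_path for ?source requests."""
    source_file = Path(source_path)

    def app(environ, start_response):
        # keep_blank_values=True so a bare "?source" (no "=value") is recognized.
        query = parse_qs(environ.get("QUERY_STRING", ""), keep_blank_values=True)
        if "source" in query:
            body = source_file.read_bytes()
            content_type = "text/plain; charset=utf-8"
        else:
            # Placeholder for whatever the tool normally shows.
            body = b'<p>Hello from the tool. <a href="?source">View source</a></p>'
            content_type = "text/html; charset=utf-8"
        start_response("200 OK", [
            ("Content-Type", content_type),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app
```

To try it locally, a script could pass its own path, for example `wsgiref.simple_server.make_server("", 8000, make_app(__file__)).serve_forever()`. A tool made of several files would need something more, such as serving an archive.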
Have multiple maintainers
Everybody needs a day off now and then. Sometimes we even need a long wiki break to recharge. Having multiple maintainers makes doing that easier without putting a burden on your users or the Toolforge admins. A co-maintainer doesn't have to be another highly technical person either. At the bare minimum you just want someone who can restart things when they are down and shout for help when restarting doesn't make things better. One of the easier ways to handle co-maintainers is just to find some mates who also have tools and become co-maintainers of each other's projects. Turn four tools with solo maintainers into four tools with four maintainers each in a few mouse clicks.
Don't forget to update your contact documentation to let others know who to poke when they have issues!
Write some documentation
Write just enough documentation that every maintainer can help with restarts and basic troubleshooting. Where? A wiki page is always ready for you in the Tool namespace on Wikitech, but really anywhere is fine. Add a MAINTAINERS file to your git repo, make a subpage in your User namespace on Wikitech or your home wiki, put it wherever you like, but put it somewhere. It can be really handy even if you haven't found a co-maintainer yet. I'm not the greatest at remembering details about this and that, so I often read my own docs to remind myself how to update the code for a particular tool or restart it.
Spread the word
So how do we get from where we are today to a glorious world where all of the awesome community-developed software that helps make life on wiki easier has a FLOSS license and public source code and multiple maintainers and basic support documentation? (Seriously, how do we do that?) I can keep giving talks like this one and write things on wiki pages and send emails out to mailing lists. And I will do that. But I need help to both spread the word and to provide a market incentive. I'm asking people to do more work than they are doing today, so they need to have some reason to change their behavior.
The people we would most like to adopt these practices are people who are working with and for on-wiki communities. On most of our wikis there are various rules and guidelines and best practices for how to edit certain types of articles and how to cite certain original sources, etc. I would really like to see these same communities come up with norms for the software that they use. We don't need these norms to be universally applied or to come top down from outside the wikis. We do, however, need the wikis to protect themselves from wasted effort on providing requirements and testing and bug reports and training and documentation for tools/gadgets/bots/... that then fail because the original maintainer lost interest in the community, or changed jobs, or had a health crisis, or won the lottery, or whatever else happened that took their time away from keeping the project healthy.
I confessed earlier that I'm not a Wikipedian, or a Commonist, Wikisourceror, Wiktionarian, Wikinewsie, or really much other than a software engineer who is enthusiastic about the Wikimedia movement and mission, and FLOSS software. So I need folks like you who are active on the wikis to think about this problem and find ways to encourage people to do the things that will help make their projects and your wikis more successful. I'd love to have some follow-up conversations here or online after we all go back to our respective homes about how to move forward.
We can start by trying to make this list well known:
- Pick a license
- Publish the code
- Have multiple maintainers
- Write some documentation
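Three of the four items leave traces in a tool's directory, so a maintainer could check them mechanically. The sketch below is only an illustration of that idea: the file names it looks for (`LICENSE`, `MAINTAINERS`, `README` and variants) are common conventions assumed for this example, not Toolforge requirements, and whether the code is actually published still has to be checked by a human.

```python
from pathlib import Path

# Assumed conventional file names; adjust to whatever a community agrees on.
LICENSE_NAMES = ("LICENSE", "LICENSE.txt", "COPYING")
README_NAMES = ("README", "README.md", "README.txt")


def check_tool_basics(tool_dir):
    """Return a list of problems found; an empty list means the basics are there."""
    root = Path(tool_dir)
    problems = []

    if not any((root / name).is_file() for name in LICENSE_NAMES):
        problems.append("no license file found")

    maintainers = root / "MAINTAINERS"
    if maintainers.is_file():
        # One maintainer per line; blank lines and "#" comments are ignored.
        names = [line.strip()
                 for line in maintainers.read_text(encoding="utf-8").splitlines()
                 if line.strip() and not line.lstrip().startswith("#")]
        if len(names) < 2:
            problems.append("fewer than two maintainers listed")
    else:
        problems.append("no MAINTAINERS file found")

    if not any((root / name).is_file() for name in README_NAMES):
        problems.append("no README found")

    return problems
```

Running `check_tool_basics(".")` inside a tool's directory returns the list of gaps as plain strings, which is easy to paste into a bug report or a talk page.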
Going beyond the minimum
There are always more things that can be done to make a tool more stable and secure. After the basics are covered, here are some stretch goals to think about:
- Use a version control system to track how your code changes over time and make it easier to figure out if you are running the same version of the code that you and others are looking at when something goes wrong.
- Make a public bug tracker (and check it for bug reports)
- Put your maintenance documentation on Wikitech in the Tool: namespace so it's easy to find
- Put some end-user documentation on your favorite wiki and link to it from your tool itself (if it has a web interface) and/or from the Tool: namespace page for your tool.
- Build your tool using other Wikimedia and 3rd-party FLOSS projects (API client libraries, application frameworks) to save time and make it a bit more likely that others can contribute.
- Make a new tool for each unrelated activity. Tool accounts are cheap and basically disposable. It will be easier to attract co-maintainers, and to eventually hand your tool off to someone else if you lose interest, when the tool does just one thing. If you want a vanity page that shows all of the tools you work on that's fine. Put something on a wiki or even make a tool just to host a page that points to your other tools.
Where to get help
- Talk to other wikis
- Talk to other devs
- IRC / wiki / phab / cloud-l mailing list
- Interest in forming a group to review projects and mentor other developers?