User:Phuzion/Documentation Initiative

From Wikitech

The purpose of my account on this wiki is to help document as much as possible about the inner workings of the Wikimedia Foundation servers and the software running on them. However, as a one-man operation without access to the local machines, I cannot document much of this on my own and will need significant help from the operations team. I ask for your help in documenting whatever you know.

Documentation is extremely important for any organization. An experienced administrator can, of course, start poring over source code to figure out how things are run, but documentation is the shortcut that explains how to do the necessary day-to-day tasks. It also helps administrators who have performed a process a few times but only do so on rare occasions, as it guides them through steps they may have forgotten since the last time.

I'm hereby stepping up to the role of "Documentation Czar", who will ensure that all documentation on this wiki is kept up to date, accurate, and easy for the operations team to follow. This will facilitate faster responses to incidents, easier server deployment, a more homogeneous server farm, and fewer headaches for sysadmins. With proper documentation, a minor incident may be resolved by an ops team member who does not typically deal with that specific portion of the server farm. For example, if a squid machine goes down and the only person available at that time has never touched a squid machine in their career, that person can check the documentation on squid incidents, perform a preliminary triage of the system in question, and follow the recovery procedure documented on this wiki to attempt to bring the server back to life.

My nickname on Freenode is phuzion, and I am more than willing to discuss suggestions or recommendations on how to proceed with this initiative with any member of the operations team. I would love to see this be as successful as possible, and any help you folks can provide would be greatly appreciated. Please feel free to add items to the to-do list below. Even if you cannot write documentation for tasks that you are well-versed in, a conversation via IRC would be enough for me to begin drafting a procedure for the task. Once my initial rough draft is written, I will submit it to you for modification and approval, after which it will be marked as ready for use.

To-Do List

  • Verify that existing documentation is valid and current
  • Document any undocumented procedures and recurring tasks that are not automated
  • Establish "death documents" (tentative name) for critical members of the operations team - these are lists of the procedures that each member performs on a regular basis and that are critical to the operation of the Foundation (not necessarily tied to a specific person, but possibly to a role, such as database administrator, system administrator, network engineer, etc.). The board will keep these "death documents" in case an operations team member gets hit by the proverbial bus and is no longer available to perform their tasks or to train a replacement.
  • Establish a list of all active members of the ops team, along with their specialties and abilities, so that it is clear whom to contact when something goes down. - Currently being worked on at List of Admins.

Please feel free to add to this to-do list as you see fit.