- Design for change
- Use automatic, repeatable builds
- Use redundancy
- Use backups
- Keep monitoring specific
- Graph data, keep exact historical data
- Log useful information, use multiple streams of data
- Understand your data storage, databases
- Scale out a lot, up a little
- Asynchronous jobs
- Security and patrols
- Learn from many sources
- Try many things
- Understand redundancy
- Understand scalability
- Become a troubleshooting superstar
- Work with IT
- Work with developers
- Work with ops
- Fix it now, not later
- Automate everything
- Change what's necessary
- Practice updating content, fast
- Standardize, stick to the standard
- Document well
- Use source control
- Hire well
- Avoid vendor lock in, and keep a good relationship with the vendors you do use
- Give Open Source a serious try
Dormando's [crappy] Operations Mantras
Ops Mantras (as made popular by Dormando).
I've been doing this shit for a while now. I'm presently acting as a MySQL DBA for SixApart, but these views are mine and not those of my employer. This is an omega post of all of the generalized one-off mantras I find valuable when approaching operations management. Even if these end up being idealistic, my humble view is to shoot for these and you'll be better off with what you end up with.
It's uh, long, sorry. This _was_ inspired by another post which I'll not be direct linking. Aside from the list-style, I've not stolen anything else. The Mantras are broken up into major sections:
- The Technical Element
- The Human Element
- The Practice
The Technical Element
Design for change
- The old google mantra is right. Design for change. Change is having to deploy new software, upgrade existing software, scaling, equipment breaking, and people shifting around.
- Everything in this mantra is about finding balance. You might think it's a good idea to tightly marry your system to a particular OS or Linux distro. It's just as bad of an idea to separate them entirely. Use layers and a _little_ indirection if you must.
- This does not mean complete and total platform agnosticism. It's about making one system two, two systems twenty. It's about coping when a sysadmin gets hit by a bus, when that dangling harddrive dies, when someone runs rm -rf /. It's for the incremental changes. Security updates, pushing new corporate content.
Use automatic, repeatable builds
- Don't build anything by hand. If you do, do it twice, and grab every single command the second time around.
- I cannot stress how important this is. It should take no more than 15 minutes from bare metal to production for new hardware. There does not need to be a human element to screw it up, or get punished when a server goes down and no one knows how to replace it.
- This is true for anything. There is _no_ such thing as a "one off" server build. If you've built it once, and it only needs to exist once, it will exist twice. The second time around is when it breaks, or if you need to do a major upgrade or consolidation two years down the road and have no frickin' clue how it was put together.
- Test, vet new builds. This should be easy because your builds are all automatic, correct?
- Scripted builds mean that an upgrade from Linux Distro Version 3 to Version 4 is absolutely clear cut. Install Version 4 and test the scripts. Read documentation and fix until it works again. This should be a week's worth of work at most, not a yearlong project (finishing just in time for Version 5 to come out!).
Use redundancy
- Just because something might be easy to rebuild, doesn't mean you can ignore redundancy. Jump boxes, mail servers, billing gateways, whatever. Wouldn't it be a hell of a lot easier if you could swap out one half of the equation without causing downtime for your customers?
- ... and along those lines, you get to "deal with it later!" when a box goes down at 3am, and the redundant machine kicks in.
- Even if it's not ideal, go for it anyway. Rsync'ing configs to a second box is a step above nothing. DRBD might not be perfect but it can provide an amazing service.
Use backups
- We shouldn't even joke about this. Use harddrives, burn the tapes. Compress them, move them, run them in parallel. Backup EVERYTHING!
- If your builds are automatic, the entire process can be backed up. If you're following along past this point a *real* Disaster Recovery plan might not seem so far fetched.
Keep monitoring specific
- Monitor every damn thing you can, but do it right. Don't get a thousand alerts if your NFS server craps its pants. Don't alert on timeouts if it doesn't make sense for your system. Test for success at the most specific level; sure the service might allow a new TCP connection, and it might even say hello, but does it remember how to do its job?
- If you have 500 webservers, you probably don't need to know immediately if one goes down. You _should_ know if the load balancer decided to not take it out of rotation and real human people are seeing its uglyriffic error messages.
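Here's a rough sketch of the "don't get a thousand alerts" idea in Python. It assumes you keep a simple map of which check depends on which (the check names and the `depends_on` map are made up for illustration), and it only pages for the failures that aren't explained by something upstream also being down.

```python
def alerts_to_send(failed, depends_on):
    """Return only root-cause failures.

    A failed check is suppressed when anything it depends on, directly
    or transitively, has also failed.
    """
    failed = set(failed)
    roots = []
    for check in sorted(failed):
        parent = depends_on.get(check)
        seen = set()
        suppressed = False
        # Walk up the dependency chain; `seen` guards against cycles.
        while parent is not None and parent not in seen:
            if parent in failed:
                suppressed = True
                break
            seen.add(parent)
            parent = depends_on.get(parent)
        if not suppressed:
            roots.append(check)
    return roots


# Hypothetical layout: two webserver docroots mounted from one NFS server.
DEPENDS_ON = {
    "web1-docroot": "nfs1",
    "web2-docroot": "nfs1",
}
```

With that map, a dead `nfs1` plus two unhappy docroot checks turns into one alert for `nfs1`. If only `web1-docroot` fails, you still hear about it.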
Graph data, keep exact historical data
- Graphs are for visualizing trends. Historical data is for crunching numbers. Don't mix the two! It's too easy to get wrong numbers from eyeballing graphs. Many sites use rrd's or other aggregating data systems which will average and smooth out data over time to save on storage space. This means it's not only hard to read, it's wrong.
- Don't get trapped having to skim through hundreds of graphs just to pinpoint an issue. If you're trying to find outliers in the graphs, you can pull those out via scripts as well.
- If you must use graphs for troubleshooting, try to aggregate high level concepts into a single page, which link into drill-down pages from there. If you can see a spike in the database load, you'll know to click to the page overviewing the databases, then you'd see the one or two iffy machines in question. The idea is to narrow something down fast. Remove as much guesswork as possible.
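To make the "averaged data is wrong" point concrete, here's a small Python sketch. It assumes one sample per second and a store that keeps one averaged value per minute; the function names and the threshold are mine, not from any particular tool.

```python
from statistics import mean, pstdev


def consolidate(samples, bucket):
    """Average fixed-size buckets, roughly what an aggregating store does."""
    return [mean(samples[i:i + bucket]) for i in range(0, len(samples), bucket)]


def outliers(samples, threshold=3.0):
    """Return (index, value) pairs more than `threshold` standard
    deviations away from the mean of the exact data."""
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(samples) if abs(v - mu) / sigma > threshold]


# 59 quiet seconds and one ugly spike (illustrative numbers).
samples = [10] * 59 + [600]
averaged = consolidate(samples, 60)
spikes = outliers(samples)
```

The averaged minute comes out just under 20, which looks perfectly healthy on a graph. The exact data still has the 600 in it, and `outliers` pulls it out without anyone squinting at a chart.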
Log useful information, use multiple streams of data
- Work on your own, and with development, to log as much useful information as you can. Doesn't matter if you live analyze it and store the data somewhere, or lump it into a database and run reports. Information is useful.
- Useful examples: Page rendering time (what page, what box, etc), user-facing errors, database and internal service errors, bandwidth usage, etc.
- Establish graphs, reports, and do historical comparisons from generalized data.
- Reports are really important. Get digested data week-to-week or day-to-day about changes in your infrastructure.
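A minimal sketch of the day-to-day report idea, in Python. It assumes your logs can be boiled down to `(page, box, milliseconds)` records; the record shape, the function names, and the 1.5x threshold are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def render_report(records):
    """Summarize page rendering time per page from (page, box, ms) records."""
    by_page = defaultdict(list)
    for page, _box, ms in records:
        by_page[page].append(ms)
    return {
        page: {"count": len(times), "avg_ms": mean(times), "max_ms": max(times)}
        for page, times in by_page.items()
    }


def regressions(yesterday, today, factor=1.5):
    """Pages whose average render time grew by more than `factor`."""
    return sorted(
        page
        for page in today
        if page in yesterday
        and today[page]["avg_ms"] > factor * yesterday[page]["avg_ms"]
    )


# Illustrative data only.
monday = render_report([("/home", "web1", 100), ("/home", "web2", 120), ("/search", "web1", 300)])
tuesday = render_report([("/home", "web1", 110), ("/search", "web2", 900), ("/search", "web1", 700)])
slower = regressions(monday, tuesday)
```

Here `/search` went from a 300 ms average to 800 ms, so it shows up in `slower`, while `/home` stays quiet. The same shape works week-to-week.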
Understand your data storage, databases
- There's an entirely separate set of understanding about operating databases, but sometimes you can't leave all of this up to your DBA.
- Having multiple, redundant databases affords you many luxuries. Operations that were once many hours of downtime can be done "online" without shelling out for a huge Oracle instance. MySQL and replication is a fantastic thing.
- Work with the DBAs to get the best possible hardware for the database in question. RAID10, gobs of RAM, many fast spindles, and potentially RAM disks and SSD's. Ops has access to the vendors, DBA's can beat the pants off the hardware. Find out what works best and save tons of cash in the long run.
- Database configurations are changing. Software like HiveDB, MySQL Proxy, DPM exist now. We're absolutely doing partitioned data for huge datasets. We're also thinking outside of the box with software like starling and Gearman. Learn what these are, and understand that not everything will be in a database.
- Get a good grip on your filers! If the data's important, back it up! Snapshots on monolithic NFS servers are fantastic, wonderful, and NOT a backup!
- Consider alternatives. MogileFS gets better year after year. There're likely other projects for freely and cheaply maintaining massive stores of files. Similar systems were developed for youtube.com, archive.org, etc. We're finally free of expensive NFS filers being the standard!
Scale out a lot, up a little
- You've seen all of the papers. Scale out is really the way to go. Get commodity (read: available, affordable, standard, NOT super cheap) hardware and work with everyone to ensure all aspects possible can scale out.
- Scaling out starts at two, work from there. This also happens to encompass redundancy.
- Scale out as far as you can without being idiotic about it. The example of MySQL replication with single master, many slaves, is a fantastic example of one form of scale-out sucking. All slaves must do all writes, so as the number of writes scale up with the reads (if they do for your app, which I bet they certainly do), you get less capacity per slave you add.
- Keep alternatives in mind. User or range partitioning onto many databases, avoiding production slaves where possible, etc. Good ideas, many ways to implement them.
- Everything can scale if you give it a chance! Routers, switches, load balancers, webservers, databases.
- Remember scale up? Big evil machines with many slow cores, lots of IO boards, and very expensive storage equipment? They're coming back. Well, the CPU part is.
- RAM is cheap.
- Combine the two, and you just may end up combining services again. A load balancer here, a webserver there... If an application can use many CPUs (apache) this is perfect. If it can't (memcached doesn't get much benefit from it, usually) you can end up wasting tons of available resources by segregating services too much.
- Job systems could potentially fill in gaps here. Where there're extra cores, slap up more workers.
- Caching is good. Developers, sysops, etc. Get on this! Yes, it's weird. It's different. Sometimes you may even need to, gasp, make a tradeoff for it. Effective use of caching can have as much as a ten times increase in overall system performance. That's a giant magnifying glass over the systems you have already and a fraction of the overall cost.
- Memcached. Service cache, denormalize DB structures (where it makes performance sense!), squid cache, or even make better usage of OS caches.
- Test it, toy with it, and break it. There will be new and different problems with caching. Be prepared for it.
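For the caching bullets, a tiny read-through cache sketch in Python. This is an in-process stand-in, not a memcached client: the `TTLCache` class, its method names, and the injectable clock are assumptions made so the example runs on its own.

```python
import time


class TTLCache:
    """In-process stand-in for a cache server such as memcached."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, or call `loader(key)` and cache the result."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader(key)         # the slow path: database, service call, etc.
        self._store[key] = (now + self.ttl, value)
        return value
```

The tradeoff mentioned above is right there in `ttl_seconds`: for up to that long, readers may see stale data. That is exactly the kind of new and different problem worth testing and breaking before production does it for you.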
Asynchronous jobs
- Starling, Gearman, The Schwartz, whatever. Job systems allow much more application flexibility. Workers can be spawned one-off, be persistent (load cached data, prepare data, etc), be on different hardware, different locations, and be synchronous or asynchronous.
- Maintaining these things is an ops issue. Using them is both a developer and an ops issue.
- User clicks "send all my friends an e-mail". Schedule a job, immediately say "okay done! Your friends will receive your spam shortly!" - let the job service multiplex and deal with the issue.
- Job systems are great places to bridge services. Blog post -> IM notification, billing cron -> billing services, authentication gateways, etc.
- Easy to scale. There will be choke points for where requests come in, and all the workers need to do is pull. This is in contrast with the largely push/pull state of HTTP.
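Here's the "send all my friends an e-mail" flow as a Python sketch. It uses an in-process `queue.Queue` and threads as a stand-in for a real job system like the ones named above; the function names and job format are invented for illustration.

```python
import queue
import threading


def start_workers(jobs, handler, count=2):
    """Start `count` threads that pull jobs until they see a None sentinel."""
    def loop():
        while True:
            job = jobs.get()
            try:
                if job is None:
                    return
                handler(job)
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=loop, daemon=True) for _ in range(count)]
    for t in threads:
        t.start()
    return threads


def handle_request(jobs, user, friends):
    """Web-facing side: schedule the slow work and answer right away."""
    for friend in friends:
        jobs.put({"type": "email", "from": user, "to": friend})
    return "okay done! Your friends will receive your mail shortly."
```

The request handler never waits on the mail being sent. Workers only pull, so adding capacity means starting more of them, on this box or another one once the queue is a network service.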
Security and patrols
- Install security updates! Seriously! There's a whole crazy network of people who are dedicated to giving these to you in the shortest period of time possible. Don't let them sit for _years_ because you're afraid of change.
- Security is in layers. Accept what you can and cannot secure. Just because mysql has password access doesn't mean it gets to be directly accessible from the internet.
- Disable passwords over ssh. Use passphrase encrypted key auth. Remote users _cannot_ guess your private key. They _have_ to get it from you. Keep it safe, and there's no point in firewalling off your ssh port.
- Understand how the application works, exactly what it needs to do, and work that to your advantage. If the only part of your application which needs outbound internet access _at all_ are the billing pages and some twitter-posting service, those can easily become job workers. Put the job workers on specific boxes and allow those access to specific hosts. Keep the rest of your network in the dark.
- The above is especially important for php sites, but probably works great elsewhere. If someone breaks in, it's most likely going to be through your application. When someone gets in through the front gate, they'll need to haul in their toolbox to get into the safe. Don't let them pull in data and get what they need, or upload the contents of your database somewhere!
- These specific suggestions aside, read a lot. Use your best judgement, and test. If you have no understanding of how a security model works, that might not immediately make it worthless, but you certainly don't know where its limits are or even if it works.
- Secure based on testing, theory, attack trees, don't stab in the dark. I love it when people dream up obscure security models and ordinary folks like me can smush it to crumbles.
- Patrol what you can! Audit logins, logouts, commands used. All accesses to external facing services, including all arguments given in the request. Find outliers, outright ban input outside of the scope of your application, and do what you can actively and have the data to work retroactively.
- If you suspect something's been cracked, *take proper precaution* and understand a little computer forensics (or get a company that does). Respond by removing network access, checking the system through serial console or direct terminal, and avoiding using any service, config file, or data on the compromised machine. Too many people "clean up a trojan" and never understand how it got there, or if they've _actually cleaned it up_.
- If you do have a security team, forensics expert, or anyone else onhand, you must touch the machine as little as possible and isolate it. This means not rebooting it to "clear out some funky running processes". They need to be able to get at those. If you need to half ass it, go ahead, but remember to wipe the system completely clean, apply any security updates, and do your best to figure out if they've compromised any important data. Do what you can.
- Security is an incredible balancing act. If you do it wrong, developers, users, etc, will revolt and find ways around it. If they _can_ get around it, you're not doing your job right. If they _can't_ get around it, they might just give up and leave.
- Keep an iron grip on access control. This means ops must absolutely provide windows for what doors have been locked. Kicking development off of production entirely means they get to stab in the dark on fixing hard problems. Providing logging, debugging tools, etc, without allowing them to directly change the service, will be a win for all aspects of production.
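One way to read "outright ban input outside of the scope of your application" is an allowlist per endpoint. A minimal Python sketch follows; the parameter names and patterns are hypothetical, and a real application would have its own.

```python
import re

# Hypothetical rules for one endpoint: every parameter the application
# expects, and the exact shape each value must have.
ALLOWED_PARAMS = {
    "user_id": re.compile(r"[0-9]{1,10}"),
    "page": re.compile(r"[0-9]{1,4}"),
    "sort": re.compile(r"(?:date|title)"),
}


def check_request(params):
    """Return a list of reasons to reject the request; empty means acceptable."""
    problems = []
    for name, value in params.items():
        rule = ALLOWED_PARAMS.get(name)
        if rule is None:
            problems.append(f"unexpected parameter: {name}")
        elif not rule.fullmatch(value):
            problems.append(f"bad value for {name}")
    return problems
```

Anything that comes back non-empty is worth both rejecting and logging with the full arguments, so the data is there when you need to work retroactively.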
The Human Element
Learn from many sources
- Fill up some RSS feeds, and read at least a few good articles per week. LWN, kerneltrap, undeadly.org, whatever's relevant, or even loosely related, to what you do.
- Read blogs from smart people. Sometimes they post interesting topics, and comment streams give us the unique ability to directly converse with the masters.
- Read a few blogs from not so smart people. Get a feel for what stumps them, or what they do that doesn't work so well.
- Get to know people who can kick your ass, at anything. Stay humble.
- Help find your own strengths by taking in from many sources, and gobbling up what invigorates you.
- Read up on success and failure stories from other companies. Ring up their CTO's and get them to divulge advice over free lunch.
Try many things
- You'll be amazed at what you can do if you keep trying. Never seen something before? Give it a shot.
- Try to not be a dangerous newbie. Play in the sandbox until you're comfortable enough to not burn down the house.
Understand redundancy
- Really understand how redundancy affects things. How it works, how it doesn't work.
- Break redundant systems in a test lab, sometimes in production. Learn what you can while you're in control. Unplug the power, yank cards out, kill processes, run the box out of memory, yank a harddrive, yank ethernet.
- Test replacing and upgrading systems in a redundant setup. Maybe you can toss in that brand new hal-o-tron 8000 without taking downtime.
Understand scalability
- There're tons of papers on making scalable systems. Even if you can't write one yourself, try to understand the theory.
- Learn with virtualization. Set up a few virtual machines and try tossing up applications to multiple machines. Run multiple instances locally on different ports.
- It's usually the job of operations to do proper capacity planning. You won't know what to add unless you truly understand where resources should be added.
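As a capacity planning exercise, take the single-master, many-slaves MySQL case from the scale-out section. This Python sketch assumes a deliberately crude model: every box can do a fixed number of operations per second, a write costs the same as a read, and every replica must apply every write. The function names and numbers are illustrative only.

```python
def replica_read_capacity(replicas, ops_per_box, writes_per_sec):
    """Total reads/sec available when every replica also applies every write."""
    per_replica = max(ops_per_box - writes_per_sec, 0)
    return replicas * per_replica


def replicas_needed(reads_per_sec, ops_per_box, writes_per_sec):
    """Replicas required for a read load, or None if no number will do."""
    per_replica = ops_per_box - writes_per_sec
    if per_replica <= 0:
        return None  # writes alone saturate a box; time to partition
    return -(-reads_per_sec // per_replica)  # ceiling division
```

With boxes good for 1000 ops/sec, four replicas give 3200 reads/sec at 200 writes/sec but only 800 reads/sec at 800 writes/sec. Same hardware, a quarter of the benefit. That's the point where adding boxes stops being the answer.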
Become a troubleshooting superstar
- The moment something breaks the clock is ticking. You must be able to pull out your arsenal and use it effectively.
- Practice troubleshooting. Pick a perfectly good, working page, and try to track down how it works.
- strace, ltrace, lsof, logs.
- Understand that load != load. Look at all available information as to how a host is performing or behaving.
- Be very familiar with the tools for your IO system. Often "mysterious" performance problems happen because your RAID or SAN setup isn't happy for some reason.
- Leave documentation. Checklists, troubleshooting tips, build tools.
- Build more tools. For yourself, for other people, or add features to existing ones.
Work with IT
- Believe it or not, there is overlap.
- Ops has to maintain high bandwidth network access for servers. IT has to do the same for people, and is often the bridge ops has *into* the datacenter. It may make sense to work together on this one.
- Draw the right line. IT should manage mail, but ops should manage development servers. Don't offload things you don't need to, and offer to do what you do best if necessary.
- Don't alienate people. Macs are popular, linux is (slowly) gaining share. Believe it or not, forcing everyone to use microsoft productivity software can bite you. There are plenty of alternatives, try one. Odds are more people in your company are familiar with google apps than they are with outlook.
- Don't make it more difficult than you have to for people to run a unix system natively. Unless your backend is a windows shop, wouldn't you want people to have more familiarity with the OS they're supposed to support?
Work with developers
- You both work on the same product, for the same purpose. Try working together a little more.
- Having strategy meetings is not working together.
- Development understands the code resources the best, and operations understands the hardware and deployment the best. You can design something more efficient by taking all of this into mind.
- Cross training. Disseminating information can show how tools and designs on both sides can be improved to be more manageable and resilient.
- Be careful of being too demanding on either side. It's not an Us vs Them. Everyone's human. Everyone should be doing as much as they can for the company, not for themselves.
- It's more pleasant to handle crunch times and emergencies when everyone gets along.
Work with ops
- Ops folks have their specialties. Networking, databases, OS. Don't forget to talk to each other!
- Getting stuck in a rut is demotivating, boring, and a good way to lose people. Even if all your systems ops guy can do is look over the shoulder of the network guy, they have the opportunity to learn.
- Always give people an opportunity to try, learn, and grow.
- Be careful of rewarding your best with too much work. If there're people who can pick up slack, use them.
- Bad eggs. It happens. Be tough enough to deal with them. Most people can be turned around with a little help, but they need to be able to be independent.
Fix it now, not later
- If a webserver goes offline, don't care about it. You have ten spare, right?
- Pick a day during the week to sweep up broken crap. Replace any broken hardware, ensure everything's 100% before swinging into the weekend.
- If small, annoying problems crop up, fix them permanently first thing in the morning. Logs fill up the disk twice last week? Come in fresh the next day, and fix it for good. These stack up, and suck.
- If you have automated builds, use this to your advantage to fix what you can right away, or in bulk.
Automate everything
- Humans can't screw up scripted tasks (as easily).
- Do it twice. Once by hand if you must, then roll up what you did into a script.
- Commented scripts make fantastic documentation. Instead of writing twenty pages detailing how to install something (which is up to interpretation of the reader!), write a script which explains what it does.
- Scripts can be rolled up into automated builds. The more often something is done, the closer it should get to becoming a zero time task.
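A bare-bones version of "do it twice, then roll it into a script", sketched in Python. The step names and commands are placeholders (they just call the Python interpreter so the example runs anywhere); real steps would install packages, lay down configs, and so on.

```python
import subprocess
import sys


def run_build(steps, dry_run=False):
    """Run each named step in order and stop at the first failure.

    Returns the names of the steps that completed. With dry_run=True the
    commands are printed but not executed.
    """
    done = []
    for name, argv in steps:
        # The printed log doubles as documentation of what was done.
        print(f"[build] {name}: {' '.join(argv)}")
        if not dry_run:
            result = subprocess.run(argv, capture_output=True, text=True)
            if result.returncode != 0:
                raise RuntimeError(f"step {name!r} failed: {result.stderr.strip()}")
        done.append(name)
    return done


# Placeholder steps for illustration only.
EXAMPLE_STEPS = [
    ("check-interpreter", [sys.executable, "--version"]),
    ("smoke-test", [sys.executable, "-c", "print('hello from the new box')"]),
]
```

Every command you typed by hand the second time around becomes one entry in the list. Comment each entry and the script is the install document.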
Change what's necessary
- Make small, isolated changes.
- If you don't have to change it, leave it.
- This also means you must understand _when_ to change. Find what's necessary and upgrade it, switch it out, make it standard.
Design for change
- If you can't do it right immediately, get on the road to it being right.
- This means if you don't have time to do something right, get the basics going with a clear migration roadmap to the right thing. While your new mail system might not be the crazy cool redundant bounce-processing spam monster you dream of, installing postfix and setting up two hosts with a clean configuration gets you closer than you might think.
- This does have a tendency to leave unfinished projects everywhere, but you were going to do that anyway. :)
Practice updating content, fast
- It's usually the job of operations to push out code. Don't suck at it. Push in parallel, apply rolling restarts, be an efficient machine.
- This includes software updates, security patches, and configuration changes.
- Use puppet, cfengine, whatever you need to control the configuration. Keep it clean, simple, and easy.
- The fewer files one must change to make a necessary adjustment the better. If you're adding one line to 20 files just to push out a new database, you're doing it wrong. Build simple templates, build outward, and don't repeat data which needs to be edited by hand.
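A small Python sketch of the "build simple templates, don't repeat data" idea, using the standard library's `string.Template`. The config format, file names, and hostnames are made up; tools like puppet or cfengine have their own templating, and this is not how either of them works internally.

```python
from string import Template

# Hypothetical config format for illustration.
APP_CONF = Template("[app]\nname = $name\nport = $port\ndb_host = $db_host\n")

# The one place a human edits the database host.
SITE = {"db_host": "db1.example.internal"}

SERVICES = [
    {"name": "web", "port": 8080},
    {"name": "api", "port": 8081},
]


def render_all(site, services):
    """Render one config per service from shared site-wide values."""
    return {
        f"{svc['name']}.conf": APP_CONF.substitute(site, **svc)
        for svc in services
    }
```

Pushing out a new database is then a one-line change to `SITE`, and every generated file follows. If you find yourself editing the rendered files by hand, the template is missing a variable.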
Standardize, stick to the standard
- Pick one or two standard OS's, httpd's, databases, package systems.
- Stick with them. Adjust and upgrade methods as it makes sense.
- Don't stick with that major version forever. Unless your product is going to be feature frozen forever, you'll need to keep the standard rolling forward, and everything behind it.
- The _more_ that is standard, the more places your tools will work. The more packages for other parts of the operation will "just work" everywhere else too.
Document well
- Document process
- Document product
- Categorize into shallow trees.
- Don't redundantly document. If a script has a long help, ask the reader to refer to that. The closer the documentation is to the program being discussed, the more likely it is to stay accurate.
- Marry documentation into code. perldoc, pydoc, etc.
- Out of date documentation is poisonous. Reserve time to keep things up to date. Sit down with new employees and update documentation as they run into problems.
- Use ticketing systems, with moderation. Documentation of history is important as well. Forcing people to create detailed process tickets for a DNS change is just pissing in other people's cheerios.
Use source control
- Use git, or mercurial. Avoid SVN like the black plague.
- Put all of your configurations, scripts, hacks, whatever, into source control.
- Keep checkouts everywhere...
- Keep strict, clean, master checkouts. No one should be able to push changes that aren't committed, but it should also be easy to test changes (in a VM, directly on a single test machine) without having to wrestle with the source control.
Hire well
- Discern between stubborn and smart
- Don't avoid hiring senior. Some people really know their shit. Some _seem_ like they do. Others are "senior" in a particular area and will fall behind as technology changes. While you might want to avoid some, there are definitely rockstars out there.
- Don't avoid hiring junior. I know so many people who've started really junior (including myself! I still view myself as junior), who've shot up through the ranks and now have firmly established careers. I'd believe most of us have. Except there are ones who don't learn, don't have the motivation, or are in the wrong field.
Avoid vendor lock in, and keep a good relationship with the vendors you do use
- Buying proprietary hardware has the major downside of potentially locking you into always using it. It might be a particular SAN, NAS, special-case direct attached storage, backup systems, etc. Avoid getting sucked in. If you follow all of the above design advice, you should be able to build test environments on different platforms quickly. You're then able to keep on top of hardware evaluations and keep choices open.
- If everything's deep, dark, gnarled, undocumented, and directly dependent on your fancy proprietary load balancer, you'll never wriggle free of it.
- Be nice to the vendors you do end up using. If you "push them _hard_ on price!" for every single purchase, expect some shit hardware to show up.
- Datacenters these days have a lot of potentially useful resources. Try to throw some free remote hands service into your contract and abuse that to get harddrives replaced, vendor items shipped/RMA'ed, and some basic hardware installs. I've had entire racks of equipment delivered and installed with barely a visit from an employee... and damn, it's nice.
Give Open Source a serious try
- nginx, mongrel, lighttpd, apache, perlbal, mogilefs, memcached, squid, OpenBGPD, PF, IPTables, LVS, MySQL, Postgres, blah, blah, blah. Before you hop back on that trusty, reliable, expensive proprietary setup, give open source a shot. You might find yourself adding plugins, extensions, code fixes or contracting help to bring features you'd never be able to do otherwise. In my own experience OSS is just as reliable, often more so, than big expensive hardware when put under significant load.
- The idea of "you get what you pay for" is a complete lie. If you can't make OSS work for you and need the hand holding, you _can_ still go with a vendor. If you have a smart, motivated team, who really want to learn and understand how their infrastructure runs, you just can't beat some hardy GPL'ed or BSD'ed systems.
- MySQL and Postgres are fine. Call them tradeoffs if you will; nothing's going to crawl out of your closet at night and eat your data. Sure, it does happen, but you're much more likely to be screwed over with monolithic oracle instances going offline (it happens!) than you are with a well tested and stable MySQL instance, in a redundant master<->master cluster pair.
- I'd say 'cite references' - but go look around. Check out any number of articles on the LAMP stack. Most major dot coms, ISP's, and even corporations now are adopting. Give it a shot. The worst you'll have is some lost time, and another product to scare your vendor into dropping price with.