Wednesday, May 23, 2018

Introducing qless — Our New Job Queue

(Cached from https://moz.com/devblog/introducing-qless-our-new-job-queue/)

Recently we’ve been looking over various job queues, and there are a lot of attractive options out there. In particular, certain internal projects have had great success with Resque, and there are myriad others, from Celery to beanstalkd and so on. We wanted something a little new and different, and we wound up with qless: a Redis-based job queue with strong guarantees that jobs don’t get dropped, high performance, stats, job tracking and more. In case you’re thinking about bailing out after the first paragraph, I’ll try to keep you interested with a compendium of selling points (I apologize for the slight shamelessness of it):
  1. Language agnostic: we use it from Ruby and Python, and have stubbed out C++ and Node bindings
  2. Completely client-managed: a Redis instance is all you need
  3. Jobs don’t get dropped
  4. Jobs have priority
  5. Queues can be organized into pipelines, though that’s not required
  6. Jobs keep a history of what’s happened to them
  7. Jobs can be tagged (for search / debugging / tracking)
  8. Jobs can be scheduled (and made recurring)
  9. Jobs can have interdependencies that unlock each other
  10. Qless keeps extensive stats about how long jobs wait and how long they take to run
  11. A powerful web interface
  12. Oh, and blocking event notifications (for things like Growl and Campfire)
A bit of context: Those of you who are customers of our campaign crawl know that it has been, and continues to be, actively worked on. The full spectrum of changes and difficulties we’ve encountered is beyond the scope of this post, but I’d like to call out in particular the queueing system we were using until very recently. At its heart, a job queue is a simple idea: a worker asks for something to do, the job gets handed off, completed, and that’s that. One of the major problems we had is that sometimes jobs would get dropped by a worker and we wouldn’t notice. The other is that scheduling jobs is a pain. One model is to periodically look for all the jobs that should be run and put them into the queue. The problem is that that process can fall over, and for us that means customers missing crawls. With the scene set, we can start the bragging (rather, we humbly present qless).

Qless makes heavy use of a feature new to Redis 2.6: server-side Lua scripting. Transactions are important when making a job queue, and since Lua scripts are executed atomically on the Redis server, this alleviates a lot of concerns about locking, semaphores, etc. The other huge win for us in using Lua scripts is that new language bindings can use the exact same scripts. When it comes time to get serious about bindings for Node or C++ (we’ve done a little poking around to make sure we haven’t backed ourselves into a corner), it’s just a matter of writing a language-specific wrapper that loads the same Lua scripts as any other binding and invokes them. Easy-peasy.

Completely client-managed. The Lua scripts that comprise the core library do all the maintenance, too. The very act of popping cleans out expired locks, and checks for dropped jobs, and job completion does some tidying up as well. This way there’s no need for a nanny process — just a Redis instance you can point your workers at.

Jobs don’t get dropped. We try our best to write clean code, but it’s easy to make mistakes, sometimes workers fail, and it can be difficult to ensure that jobs don’t disappear. To this end, Resque takes a very reasonable stance by forking off a child process for every task that’s going to get done. In this way, the parent process can be trusted to make sure that everything happens as it should: either the job completes, or an error is caught, and in any case some appropriate action takes place. Our tack is to use heartbeating. As opposed to a lot of job queues that have a large number of tasks that each take maybe a few minutes, we have some very large jobs (some take about a week) and some very small ones (a few seconds). Rather than trusting that workers will complete their jobs, they have to check in as they make progress, or qless gives the job to another willing worker. We can (and often do) completely nuke worker boxes, and the jobs get picked up somewhere else with no problem.
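
The heartbeat-and-reclaim idea described above can be sketched in a few lines. This is a toy in-memory model, not qless's real implementation (the real thing does this atomically in Lua against Redis); the class and method names here are made up for illustration:

```python
class HeartbeatQueue:
    """Toy model of heartbeat-based job locks: a worker must check in
    before its lock expires, or pop() hands the job to someone else."""

    def __init__(self, lock_ttl=60.0):
        self.lock_ttl = lock_ttl
        self.waiting = []        # job ids waiting to be popped
        self.running = {}        # job id -> lock expiry timestamp

    def put(self, jid):
        self.waiting.append(jid)

    def pop(self, now):
        # The very act of popping reclaims jobs whose worker went silent.
        for jid, expires in list(self.running.items()):
            if expires <= now:
                del self.running[jid]
                self.waiting.append(jid)
        if not self.waiting:
            return None
        jid = self.waiting.pop(0)
        self.running[jid] = now + self.lock_ttl
        return jid

    def heartbeat(self, jid, now):
        # A worker checks in mid-job, renewing its lock.
        if jid not in self.running:
            return False         # lock already lost to another worker
        self.running[jid] = now + self.lock_ttl
        return True
```

Note that pop() doubles as the maintenance pass, which is what makes the whole thing client-managed: there is no nanny process, because every pop sweeps up after dead workers.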

Jobs have priority. Some other systems have support for this, and it was important to us, too. In particular, if a customer writes in with a problem, we want to be able to bump the priority of their jobs to make sure they rocket through and we can get back to them as soon as possible. Job priorities can be adjusted mid-flight, too.
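
A priority queue that allows mid-flight bumps could be sketched like this (a toy model using Python's `heapq`; qless actually keeps its queues in Redis sorted sets, and these names are invented for illustration):

```python
import heapq
import itertools

class PriorityJobQueue:
    """Toy sketch: higher priority pops first; ties pop FIFO; a job's
    priority can be bumped while it is still waiting."""

    def __init__(self):
        self.heap = []
        self.priorities = {}                 # jid -> current priority
        self.counter = itertools.count()     # FIFO tie-breaker

    def put(self, jid, priority=0):
        self.priorities[jid] = priority
        # heapq is a min-heap, so negate to pop the highest priority first.
        heapq.heappush(self.heap, (-priority, next(self.counter), jid))

    def bump(self, jid, priority):
        # Re-push with the new priority; the stale entry is skipped on pop.
        self.priorities[jid] = priority
        heapq.heappush(self.heap, (-priority, next(self.counter), jid))

    def pop(self):
        while self.heap:
            neg, _, jid = heapq.heappop(self.heap)
            if self.priorities.get(jid) == -neg:
                del self.priorities[jid]
                return jid                   # live entry
        return None                          # only stale entries remained
```

The "lazy deletion" trick (leaving stale entries in the heap and skipping them at pop time) is a standard way to support priority changes without rebuilding the heap.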

Queues are pipeline-oriented and jobs know their history. Like many large tasks, ours are broken into stages of a pipeline. As an example, we crawl a customer site, analyze it, aggregate Mozscape data, and then generate prematerialized views. Qless lets us describe a pipeline in a single job class and run it with a single job entity. And the job keeps track of events as it moves along: put in the crawl queue at such and such time, then popped and completed by such-and-such worker and so forth. It’s surprisingly helpful in debugging.
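
The pipeline-plus-history behavior can be modeled in miniature. This is a toy sketch (the stage names mirror the example above; the class and event format are invented, not qless's API):

```python
class PipelineQueues:
    """Toy sketch: completing a job in one stage advances it to the next,
    and every event is appended to the job's history."""

    def __init__(self, stages):
        self.stages = stages
        self.queues = {s: [] for s in stages}
        self.history = {}                    # jid -> list of (event, stage)

    def put(self, jid, stage):
        self.queues[stage].append(jid)
        self.history.setdefault(jid, []).append(('put', stage))

    def complete(self, jid, stage):
        self.queues[stage].remove(jid)
        self.history[jid].append(('done', stage))
        nxt = self.stages.index(stage) + 1
        if nxt < len(self.stages):
            self.put(jid, self.stages[nxt])  # advance to the next stage
```

Because every put and complete records an event, answering "where has this job been, and when?" is just a matter of reading the history list back.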

Jobs are tagged and tracked. Not all jobs are created equal, and some are more interesting than others; problem jobs and customer complaints are of particular interest. Qless lets us flag jobs that we want to keep a close eye on (more on this at the very end), and tag them with useful information. Every job has a unique identifier, but a project might have additional ways it wants to look up jobs. Qless builds an index of these tags for efficient lookup, so when we want to find all the running jobs for a given customer, it’s as easy as querying that customer’s tag.
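
The tag index is just a reverse mapping from tag to job ids. A toy sketch (the `customer:` tag convention here is an invented example, not something qless prescribes):

```python
from collections import defaultdict

class TagIndex:
    """Toy sketch of a tag index: look up all jobs carrying a tag
    without scanning every job."""

    def __init__(self):
        self.tags_by_jid = defaultdict(set)
        self.jids_by_tag = defaultdict(set)

    def tag(self, jid, *tags):
        for t in tags:
            self.tags_by_jid[jid].add(t)
            self.jids_by_tag[t].add(jid)

    def untag(self, jid, *tags):
        for t in tags:
            self.tags_by_jid[jid].discard(t)
            self.jids_by_tag[t].discard(jid)

    def jobs(self, tag):
        return sorted(self.jids_by_tag[tag])
```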

Jobs can be scheduled and made recurring. This solves a particularly painful problem for us: scheduling crawls. You describe a recurring job much the same way you would a normal job: it’s bound to a queue, it has data, priority, etc. When a pop request detects that a recurring job should be run, it creates a copy right then and there and returns it as one of the results.
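
The key trick is that recurrence lives inside pop itself, so there is no separate scheduler process to fall over. A toy sketch of that idea (names and the jid-numbering scheme are invented for illustration):

```python
class RecurringQueue:
    """Toy sketch: pop() materializes fresh copies of a recurring job
    template whenever its interval has elapsed."""

    def __init__(self):
        self.recurring = {}      # template jid -> (interval, next_due)
        self.count = {}          # template jid -> copies spawned so far

    def recur(self, jid, interval, first_due=0):
        self.recurring[jid] = (interval, first_due)
        self.count[jid] = 0

    def pop(self, now):
        jobs = []
        for jid, (interval, due) in self.recurring.items():
            # Spawn one copy per elapsed interval, right at pop time.
            while due <= now:
                n = self.count[jid]
                jobs.append(f'{jid}-{n}')
                self.count[jid] = n + 1
                due += interval
            self.recurring[jid] = (interval, due)
        return jobs
```

If workers were down for a while, the next pop catches up by emitting every copy that came due in the meantime, as the second assertion below shows.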

Jobs can have dependencies. To borrow a particularly good example from a co-worker, imagine making Thanksgiving dinner. You need to make pies, and put the turkey in the oven, and make the gravy; they’re all separate tasks, but some depend on others. You can’t make the gravy until you’ve cooked the turkey, and you can’t cook the turkey until you’ve made the stuffing, and so on. When one job depends on another, completing the independent job automatically unlocks the dependent one to be popped.

Qless keeps performance stats. We keep summary stats about how long jobs wait in various queues on any given day, and how long they take to run. But more than that, it also provides a histogram of runtime so you can look at the distribution (or just admire your handiwork). This doesn’t have to be the extent of your benchmarking, but it seemed like an intuitive fit that you should be able to have access to these metrics in the same place you’re managing your queues.
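
Histogram-style stats of this kind boil down to bucketing durations by threshold. A toy sketch of how such a dashboard might count wait and run times (the bucket boundaries here are arbitrary examples, not qless's actual buckets):

```python
def duration_histogram(durations, bounds=(1, 5, 15, 60, 300)):
    """Toy sketch: count how many durations (seconds) fall below each
    successive bound; the final bucket catches everything at or above
    the last bound."""
    counts = [0] * (len(bounds) + 1)
    for d in durations:
        for i, bound in enumerate(bounds):
            if d < bound:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts
```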

A powerful web interface. Inspired by the great web interface from Resque, and armed with Twitter’s Bootstrap, it was important to us to make a web app for managing these queues. In particular, we wanted to make something that would easily enable our ops and help teams to quickly gain insight about customer or server issues, whether it’s requeueing jobs, or just tracking their progress.
Before I get to the last little flourish: thank you for making it this far. I’ve done my best to keep this brief, but that clearly hasn’t worked. This has been the product of a lot of thought, time, effort, and grief with our last job queue, so it’s hard not to talk about the details. (If you’re curious, you should check it out on GitHub and then let us know what you think.)

Notifications. It’s all well and good to track jobs (the web interface aggregates a short list of the jobs you’re tracking so you have a quick summary), but we thought we’d take it one step further and get notifications about job progress. We use Campfire pretty heavily internally, and I personally love Growl notifications. Start up a little daemon, and it uses Redis’ pub/sub to get notified of any changes to tracked jobs: when they fail, complete, get popped, get put, etc. Gone are the days of hitting refresh to check up on trouble tickets!
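
The daemon side of this is essentially a publish/subscribe dispatch loop. A toy in-memory sketch of the pattern (qless really does this over Redis PUBLISH/SUBSCRIBE; the channel name and event tuple below are made up for illustration):

```python
from collections import defaultdict

class Notifier:
    """Toy sketch of pub/sub notification dispatch: publishing an event
    on a channel invokes every callback subscribed to that channel."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)

# A daemon would subscribe once at startup, then forward each event to
# Growl, Campfire, or whatever it's wired to.
```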

In parting, as always, contributions, suggestions, bug reports and so on are always welcome. Happy queueing!
