For those who don’t know… (all of you?) retickr is a Mac application that lets users create customized playlists of RSS, Facebook, and Twitter content and then passively stream them across a monitor. While this has been a ‘cool’ project, our goals have always been loftier than what we currently offer. We want to understand what news you care about and how you consume it so we can get more meaningful content in front of you without you really having to do anything. How? We collect, monitor, and deliver content to our end users through a RESTful API, which means that when you’re using retickr you’re really using a client-server application.
When we started retickr we were (and still are) very young. I was 21, a junior in Computer Science, still wet behind the ears but eager to do something with what I had learned so far in school, eager to change people’s lives. What I quickly learned was that applications often work well in test environments but fail in the real world, and without monitoring tools (things we knew existed but had never actually used) we were more or less blind to what was actually happening on a second-by-second, if not day-by-day, basis. This blog post is a brief run-down of some problems we have encountered and the tools we used to solve them.
our development stack
At retickr we deal with a lot of data (relative to our team, hardware, and experience of course) and we needed a solution that could do the following:
- Deliver content to tens of thousands of users (and growing)
- Crawl over a hundred thousand sources (and growing)
- Respond in under a second (and hopefully get faster)
- All while costing next to nothing (we hate spending money)
The above is a tall order, and we knew that we needed to take some risks, adopt some newer technologies, and think creatively. It became clear we needed an incredible amount of news cached locally in our database so that we could slice and dice it for our users as quickly as possible. We accomplished this with a combination of Cassandra, a “NoSQL” data-store that scales well for large numbers of elements; MySQL, which we use for data that is either “inherently relational” or that would currently be too much trouble to shoe-horn into a non-relational data model; and Django on Apache, which we use to stitch together requests and spit out data (it’s a very thin layer of controller code).
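To make the split concrete, here is a hypothetical sketch of how that layering works. Plain dicts stand in for both stores: the list of subscriptions plays the role of our relational MySQL data, and the per-source map of timestamped headlines plays the role of a wide row in Cassandra. All names and data here are made up for illustration, not taken from our actual schema.

```python
# MySQL stand-in: "inherently relational" data (who subscribes to what)
subscriptions = [
    {"user_id": 1, "source": "nytimes"},
    {"user_id": 1, "source": "hackernews"},
]

# Cassandra stand-in: one wide row of (timestamp -> headline) per source
news_store = {
    "nytimes": {1700000300: "Markets rally", 1700000100: "Election results"},
    "hackernews": {1700000200: "Show HN: a Mac news ticker"},
}

def playlist_for(user_id, limit=10):
    """Stitch a user's sources into one reverse-chronological playlist."""
    sources = [s["source"] for s in subscriptions if s["user_id"] == user_id]
    items = [
        (ts, headline)
        for src in sources
        for ts, headline in news_store.get(src, {}).items()
    ]
    items.sort(reverse=True)  # newest first
    return [headline for _, headline in items[:limit]]
```

The thin controller layer does exactly this kind of stitching: look up relational facts, fan out to the cached news, merge, and return.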
This is our young company’s new take on the LAMP stack, a sort of LAMP + NoSQL solution. LAMNP… or maybe it’s LANMP. I always forget what we agreed on.
what’s going on with the servers?
Being young, collegiate programmers, our only experience ‘monitoring’ was through UNIX tools like top, ps, du, and df. These command-line tools are great for spot-checking a server but obviously not ideal for long-term monitoring and debugging. We had no way of knowing in real time when, for instance, a significant error would cause our server to frantically write to an error log until it filled up the hard drive and crashed our system (this sad story actually happened to us). It is probably safe to say that over the last year we have started to understand the difference between ‘school’ and the ‘real world’.
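For what it’s worth, even a tiny scripted check would have caught that runaway error log before the disk filled. Here is a minimal sketch of such a check using the standard library; the 90% threshold is just an assumed value you would tune for your own partitions.

```python
import shutil

WARN_THRESHOLD = 90.0  # percent full; an assumed value, tune to taste

def disk_usage_percent(path="/"):
    """Return the percentage of disk space used at the given mount point."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check_disk(path="/"):
    """A crude cron-able check: warn well before the log partition fills up."""
    pct = disk_usage_percent(path)
    if pct >= WARN_THRESHOLD:
        return "ALERT: %s is %.1f%% full" % (path, pct)
    return "OK: %s is %.1f%% full" % (path, pct)
```

A real monitoring system does this continuously and across every host, which is exactly why we stopped rolling our own.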
Luckily we have discovered some awesome monitoring tools that help us track the health of our web servers, load balancers (more on these later), and database servers. We currently use Zabbix, and we can’t imagine how we lived without it. You simply install an agent daemon on each server you’d like to monitor and then install the Zabbix server itself on a dedicated server, VM, or cloud instance. Zabbix creates really awesome visualizations like this one, which have really helped us grasp load patterns and behaviors.
With minimal setup you can quickly see CPU load and utilization across multiple servers, as well as memory, storage, and network usage statistics. You can also configure triggers that send alerts (we use email alerts, but there are options for other alert systems) to a single person, a mailing list, or, in our case, Pager Duty.
it’s 4 in the morning, what do you mean Chinese users crashed it?
One pleasant surprise we’ve discovered is that retickr has more international appeal than we originally anticipated for how far along we currently are. This up-side comes with a down-side, though: people are using retickr while we are attempting to sleep. Because of this we established a 24/7 on-call system so that we can rotate team members who fix problems in the middle of our night. We currently have two people in our rotation, and the on-call engineer often feels like he is holding the football for days at a time.
This system is made much easier through Pager Duty, because we can configure on-call schedules based on days of the week, and the service offers multiple notification methods including email, text, and actual phone calls. In an attempt to eliminate a single point of failure, we also set up alerts that can escalate to other team members who aren’t necessarily on-call, so that if the actual on-call engineer doesn’t pick up his phone we can move down the list until somebody is conscious/sober enough to answer and get to work.
How does Pager Duty integrate with our monitoring solutions? Remember those Zabbix email notifications I talked about earlier? Instead of setting them up to alert a single employee, we configure them to point to an email address Pager Duty operates. Pager Duty then consults the rules we’ve configured to determine which engineer gets a notification and how they should be notified. It then steps through its playbook of notifications until someone informs Pager Duty that the current problem is being taken care of.
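The integration really is just email under the hood: the monitor builds a message and hands it to the Pager Duty intake address for the service. Here is a minimal sketch of what that handoff might look like; the intake address, sender, and host names are all made up for illustration (Pager Duty issues the real intake address per service).

```python
import smtplib
from email.message import EmailMessage

# Hypothetical intake address; Pager Duty provides the real one per service.
PAGERDUTY_INTAKE = "our-service@example.pagerduty.com"

def build_alert(host, problem):
    """Build the alert email a monitor would hand to Pager Duty."""
    msg = EmailMessage()
    msg["From"] = "zabbix@example.com"          # assumed sender address
    msg["To"] = PAGERDUTY_INTAKE
    msg["Subject"] = "PROBLEM on %s: %s" % (host, problem)
    msg.set_content("Host: %s\nProblem: %s\n" % (host, problem))
    return msg

def send_alert(host, problem, smtp_host="localhost"):
    """Send the alert; Pager Duty's escalation rules take it from there."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(build_alert(host, problem))
```

From there, Pager Duty parses the incoming message, opens an incident, and starts working through the notification playbook described above.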
everything looks good but people are still complaining
So far we’ve dealt with the obvious metrics to monitor, like CPU load and network usage, but what about all those times users complain about slow service? How do you know, for instance, what the average response time is for your application, which line of code is dragging its feet, or which patch is misbehaving? At retickr we operate as an agile development shop. We’re pushing new code every day because our service needs to get better faster, and we don’t necessarily have time to rigorously test everything. While moving fast is inherently a good thing, it can get us in trouble, and we need robust monitoring to get us out of the ‘sad’ times that occur every couple of weeks when something doesn’t go to plan.
Sometimes this means patches for bugs (hey, we aren’t perfect), sometimes it means new features, but things are constantly in flux, and knowing the response times for our various API endpoints is very important. We struggled with this for some time, and frankly there were times that service was really bad for our customers. One of my favorite tools that we’ve discovered is a product called New Relic. Simply put, their product is amazing. Their monitoring tool is made up of two parts: a middleware layer that sits between Django and Apache (in our case; they also have tools for PHP, Java, and Ruby, though we’ve never used them) and a hosted dashboard. The middleware keeps track of all the requests coming into our API and records timing information about function calls, network requests to remote services (Facebook, Twitter, Google), and database queries. The dashboard has an excellent interface that shows us our average response times and a listing of any exceptions our code is throwing.
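To give a feel for what a timing layer records, here is a stripped-down, hypothetical analog: a decorator that tracks per-endpoint call counts and cumulative wall-clock time. New Relic’s real agent hooks in far deeper (individual DB queries, external calls, tracebacks), and the endpoint and function names below are invented for the sketch.

```python
import time
from collections import defaultdict

# Per-endpoint stats: how many calls, and total time spent serving them.
timings = defaultdict(lambda: {"calls": 0, "total_s": 0.0})

def timed(endpoint):
    """Decorator that records how long each call to an endpoint takes."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                stats = timings[endpoint]
                stats["calls"] += 1
                stats["total_s"] += time.time() - start
        return inner
    return wrap

@timed("/v1/stories")
def get_stories(user_id):
    time.sleep(0.01)  # stand-in for database and network work
    return ["story-%d" % user_id]

def average_response_ms(endpoint):
    """Average response time for an endpoint, in milliseconds."""
    stats = timings[endpoint]
    return 1000.0 * stats["total_s"] / max(stats["calls"], 1)
```

The value of a hosted product is everything around this core: aggregation across servers, history, percentiles, and alerting, none of which we wanted to build ourselves.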
New Relic also recently began integrating directly with Pager Duty which offers a few nice features that email alone fails to deliver.
yeah but your service breaks pretty regularly, why should I listen to you?
I’ll be the first to admit that retickr’s infrastructure has struggled at times. We learned many hard lessons in a marketplace that is less than forgiving, and we had to discover tools that worked for our monitoring with no prior experience operating a 24/7 service. I hope these tools will help other young entrepreneurs take their idea and turn it into a reliable product.