The problem
At Affine, we are in the business of providing classification data about online video content for advertisers. What this means for our engineering team is that we needed to develop a system that is capable of handling and processing an extremely high volume of traffic. For us, there are two places where this is particularly important: handling and responding, with very low latency, to billions of queries a month, as well as crunching and using that data to allow us to process and report on millions of unique URLs.
A typical request will look like this:
http://api.affinesystems.com/query_line_item?key=c57c84e817643751bba41dca7e1a68ebe2d0c3bb&line_item_id=613&url=[YOURURL]
With no brackets around your url.
We are able to handle at least all of this traffic to our query service with a relatively small number of what we refer to as our ‘Query Service’ machines. These machines have two jobs: to listen for queries and provide responses to them, and to tell the rest of our platform that they received a query about a new url that needs to be processed. The query service receive more than 20 million queries a day.
MySQL Fail
The initial approach to this problem was to essentially have two MySQL tables, a video index and a processing queue. The query service machine checks to see if a request url is in the index, and if it isn’t, it adds the url to the processing queue, where our video ingestion and detection machines pick off urls, process them, and then add their results to the video index. However, this approach quickly runs into a number of problems: latency, url duplication, and queue prioritization.
Latency
Using direct MySQL inserts whenever a new url is seen, the query service machines very quickly ended up spending most of their time waiting to insert into the processing queue – which, among other things, causes them to miss requests being sent to them.
URL de-duplication
Using per-url inserts, if the machines get hit the by the same unknown url 5000 times before that url is picked off the processing queue and added to the video index, that url would then be put in the processing queue 5000 times. Even if you have your system set up in a way where that url gets picked off and then ignored the other 4999 times, this is still very undesirable behavior.
Queue Prioritization
The above problem also highlights the final problem: url prioritization. A url that we only see once, should NOT be processed before a url that we are seeing thousands of times, yet if every time we see a new url we just add it to the queue, there is no easy way to figure out which urls are ‘popular’ (besides doing counts of entries in the processing queue, which would lock up the queue and the query service machines even worse, since that query forces a table level lock)
Instead what we decided to do was to set up a mysql read-only slave for the query service machines to
access the video index. In order to capture the data needed for inserts, we log all query service requests and responses. So now, instead of having all the machines writing to a queue every time they get a new url, we have thousands of urls buried in their log files that we need to somehow crunch so we can figure out what urls to process.
Enter MapReduce
With MapReduce we had our answer. With Pig as our query language, we are able to write high level sql-like statements that told MapReduce how to crunch our data. In the case of loading data in the queue, we did essentially a group by unique url, then outputted a sorted two column file with the two columns being the url, and the number of times we saw it. This list, now sorted by priority, is bulk loaded into the queue for future processing. This way we manage to take the load off of our query service machines(They aren’t waiting for SQL to catch up), we don’t have to deal with duplicates in our processing queue, because we already de-duplicated the urls when we crunched them, and we get them loaded into the queue already sorted by priority order!


