During the latest European football tournament, Euro 2012, we monitored the e-reputation of every player in real time through the Stats n’Tweets project. Over the month of the competition, we collected more than 33 million tweets. Now that the tournament is over, we want to share some feedback and insights on what we learned during this experiment.
To better understand our needs, let’s briefly describe what the Stats n’Tweets arena has to offer. It provides a way to compare players on two levels: Twitter-based e-reputation analysis and football performance. Our main task in this project was to handle the e-reputation analysis, from data collection to the end-user reports.
The language: Python
At Syllabs, we mainly program in C++ and Python. For this task, we chose Python mainly because it allows lightning-fast development, but on top of that, I think it is simply a great programming language.
The data: Twitter
The Twitter API is basically plug-and-play: it is well documented and very easy to get started with. The only drawback is that the Streaming API lacks client libraries. I found one that seemed quite reliable; however, it remained very basic, so I ended up rewriting that part to tweak it as I needed.
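At its core, such a client is a thin consumer of Twitter’s line-delimited JSON stream. A minimal sketch of the parsing side (the function name and the handling details are illustrative, not our actual code):

```python
import json

def parse_stream(lines):
    """Yield decoded tweets from a line-delimited JSON stream,
    skipping the blank keep-alive lines the Streaming API sends."""
    for line in lines:
        line = line.strip()
        if not line:           # keep-alive newline, ignore
            continue
        try:
            yield json.loads(line)
        except ValueError:     # truncated or malformed payload
            continue
```

With a live connection, `lines` would simply be the iterator over the HTTP response body of the streaming endpoint.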
The databases: MongoDB and Redis
When choosing a database, I find that most of the headaches come from not knowing what you are going to do with your data, but also from not knowing for sure how much data you are going to get. In our case, at the beginning of the project, we had no idea about the amount of data we would collect: “Will it be thousands? Millions? Billions?”. So here is the guideline we followed: assume small data, but stay sharp and anticipate. With that in mind, we considered three databases: MySQL, MongoDB and Riak (why those three? Because we are currently using them).
MySQL was quickly put aside. We previously managed to store about 160 million entries in a single MySQL instance. It was not a great experience: it required a lot of fine tuning to keep everything running smoothly, and we had to either implement sharding logic in our application or supersize our servers. Furthermore, we were using none of the cool SQL features, so we went NoSQL, as those datastores are a lot easier to scale.
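For illustration, the kind of application-side sharding logic we wanted to avoid looks roughly like this (the instance names and shard count are made up for the example):

```python
import hashlib

def shard_for(key, n_shards=4):
    """Pick a MySQL instance for a key by hashing it.
    Simple enough, but now the application owns routing,
    rebalancing and failure handling for every shard."""
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return "mysql-%d" % (h % n_shards)
```

Every query path then has to go through this routing, which is exactly the burden a natively sharded datastore takes off your hands.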
We currently have Riak in production (with over one terabyte of data). Overall performance is quite outstanding and it is designed to be easy to shard. As far as I know, it is by far the easiest database to deploy when it comes to sharding with replication, as there is basically nothing to do. However, we have noted a few inconveniences: it is not suited to frequent document updates, and it performs quite poorly when one has to scan through the whole dataset (key listing when reprocessing data, for instance). We wanted to delegate as much as possible to the datastore, and therefore to steer clear of keeping an index of keys in Redis, for instance. The main reason is that if we suddenly had to rescale, we wanted to avoid the single point of failure that Redis can be (I am hoping to see built-in master/slave failover and recovery, which would greatly help in this case).
MongoDB is also well crafted and able to process “humongous” data by design. It has replication with automated master switchover on failure, and sharding is supported and easy to set up. It performs well on data updates and has a very fancy query language.
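A hypothetical example of the kind of atomic counter update that makes MongoDB comfortable for this workload (the collection and field names are assumptions for illustration, not our actual schema):

```python
def mention_update(player_id, sentiment):
    """Build the filter and atomic $inc update applied for one
    incoming tweet mentioning a player."""
    update = {"$inc": {"mentions": 1, "sentiment.%s" % sentiment: 1}}
    return {"_id": player_id}, update

# Against a live server, with pymongo, this would be applied as:
#   coll = pymongo.MongoClient().euro2012.players
#   filt, update = mention_update("ribery", "positive")
#   coll.update_one(filt, update, upsert=True)
```

The point is that each tweet becomes a single in-place operation on the server, with no read-modify-write round trip in the application.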
I do not want to turn this into a MongoDB vs. Riak post, as that is not the point, but here are the main arguments for why we chose MongoDB over Riak. Riak is very pleasant to administrate, but in order to keep it running at its best, I often find that quite a few things have to be handled in the application. MongoDB deployment, on the other hand, is more classical: one has to define shards and replicate the usual way, but Mongo offers a data logic closer to that of a SQL database. In this case, we had not planned to have a myriad of nodes, so we stuck with MongoDB.
We also used Redis. It is a high-performance key-value store, well designed for datasets that fit in the RAM of a single server. It does not offer much cluster functionality, but it is a great tool for caching or inter-process communication.
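As a minimal sketch of the caching pattern (assuming the redis-py client interface; the key and TTL are illustrative):

```python
import json

def cached(client, key, compute, ttl=60):
    """Cache-aside lookup: return the cached value if present,
    otherwise compute it, store it with a TTL, and return it.
    `client` only needs get/setex, as in redis-py."""
    raw = client.get(key)
    if raw is not None:
        return json.loads(raw)
    value = compute()
    client.setex(key, ttl, json.dumps(value))
    return value
```

With redis-py this would be called as `cached(redis.Redis(), "player:ribery:stats", expensive_query)`, shielding MongoDB from repeated reads of hot keys.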
I think a proper setup in our case would have been at least a two-member replica set with an arbiter for MongoDB but, for various reasons, we were short on resources and on a very tight schedule (aren’t those the best ingredients for the most thrilling projects?). It always comes down to this: in the ideal world, you take a few days to plan out what needs to be done, get a few servers, each with dedicated tasks, and use clean dev cycles with tests. In the real world, you make the most of what you have. So instead of the steadiest setup, I will be introducing the survivor toolkit: a basic server (with HDDs) hosted by a French provider, with a backup hosted at home.
A few statistics
Overall, I have been very impressed by MongoDB: it performed very well and required no tuning at all. To give some context on the amount of data we handled, here are a few notable statistics.
About the data gathering:
- during a match, we gather 50 tweets per second on average, with peaks of 70-100 tweets per second (goals?? fouls??),
- during an average daytime period, we generally process 10 to 20 tweets per second.

About the database load:
- about 300 ops/s on average (where an “op” is an atomic database operation such as a query, insert or update),
- during a match, 1,000 to 1,500 ops/s depending on the popularity of the playing teams.
A few times during the competition, we improved our analysis tools and had to reprocess the whole dataset while still collecting incoming data. This led to some very interesting moments where database usage jumped to 8k ops/s (instead of the usual 0.3k ops/s). It went so smoothly, I was really amazed: the load went up, but there were no major disturbances, and everything went well.
Double thumbs up to the team at 10gen, who did a great job with MMS (MongoDB Monitoring Service). It is a very nice monitoring tool and easy to set up: start a client that gathers the key statistics, and you can access the summary through a clean web UI. As usual in these matters, some background on how MongoDB works is very helpful for reading the graphs correctly.
Based on the data, we feel we are dangerously approaching the limits of “standard” (or deprecated, as some might say) hard disk drives. We are seeing the IO lock percentage go up as time goes by, which is expected since we are processing more and more data, but it can still become a major hindrance. During the first week, the lock percentage held nicely at 0% on average, but as we kept inserting data, by the third week we were getting closer to a 50% average. The competition has almost ended, but if it had gone on much longer, we would have had to lighten the load on the database, probably by improving the storage model, for instance by splitting the data into multiple collections based on its hotness.
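That tiered-storage idea would look something along these lines (the collection names and the seven-day cutoff are assumptions for the sake of the example):

```python
from datetime import datetime, timedelta

def target_collection(tweet_time, now=None, hot_days=7):
    """Route a tweet to a hot or cold collection by age, so that
    frequent updates and scans mostly touch the small, recent
    collection instead of the whole dataset."""
    now = now or datetime.utcnow()
    if now - tweet_time <= timedelta(days=hot_days):
        return "tweets_hot"
    return "tweets_cold"
```

A nightly job would then migrate aged-out documents from the hot collection to the cold one, keeping the working set (and its index) small.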
At some point in the design process, we weighed whether or not to go realtime. The initial idea was that it would be easier to update the statistics on the previous day’s tweets in a batch process running every night. Such opinions are outdated! Indeed, it took no more effort to build a realtime application. I would say that when you combine MongoDB and Redis, setting up a realtime application that processes a reasonable amount of data can be done quickly. In retrospect, we compared the former batch version to the final realtime app, and realtime wins hands down: lower CPU usage, less overall stress, more flexibility. Right now, if you are planning an application that runs as batch jobs, I would seriously advise you to consider going realtime.
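In spirit, going realtime just means folding each tweet into the running statistics as it arrives, instead of recounting everything at night. A toy version of that fold (the tweet fields are illustrative):

```python
def update_stats(stats, tweet):
    """Fold one incoming tweet into per-player running counters,
    the realtime alternative to a nightly batch recount."""
    for player in tweet.get("players", []):
        s = stats.setdefault(player, {"tweets": 0, "positive": 0, "negative": 0})
        s["tweets"] += 1
        sentiment = tweet.get("sentiment")
        if sentiment in ("positive", "negative"):
            s[sentiment] += 1
    return stats
```

Each tweet costs a handful of increments, which is why the realtime version ends up cheaper than periodically re-scanning millions of rows.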
Last but not least, the text analysis was performed using the Syllabs Text Mining technology (homebrew tech, quite easy to use ;p). It combined most of our NLP techs, from language detection to term extraction and sentiment analysis, with a little spark of content generation. Our greatest regret in this respect is that the application was only in French, while most of our tools support at least English and Spanish. It would have been great to set up an English or Spanish front end, for instance. That will be for another run of Stats n’Tweets!
The most challenging part of all this is properly defining the processing chain, from data collection to the user end. Going realtime requires a bit of tuning and proper caching to tolerate fast online data updates. To this end, combining Redis and MongoDB gives you a lot of flexibility with great throughput. In our case, we managed to optimize the whole data pipeline so that it fits on a single server, but we know for sure that it scales out well in a multi-node environment ;)
That’s all folks!
Putting up a framework capable of processing a few million entries, in just a few weeks, was very challenging. Going realtime was a good idea and turned out to be one of the strong points of the project.
Do not get scared when others speak of 20-node clusters, but do not try to imitate them prematurely either. Start out reasonably sized and do not overthink your app. Most people started small, and I believe this type of project is a good introduction to the big data world.