Past Christmas I found the perfect pet project for that season: Twitter Storm.
Basically is a excellent piece of software that will allow you to process real time information in a ‘kind’ of map reduce style. Also it’s based on the concept of topologies (workflows), quite familiar to me after working at Yahoo! with the precursor of Apache Oozie (workflows for Haddop!)
With that background, and after taking a look to this amazing presentation by Nathan Marz in Strange Loop 2011, I really wanted to try the Storm goodies and how it suppose to solve some of the problems that you can find in Hadoop (I recommend you to watch the whole presentation, it gives a perfect introduction on what you can do with this software).
Before continuing explaining what did I build (yes!! there is a demo!), let me do some -not precise, probably not 100% accurate – questions and answer that will help me to explain what is Storm:
What is storm?
Is a framework (magic word!) that will help you to process data streams in a distributed enviroment (\m/). It’s based on the concept of topologies and aims to process the information continuously.
What is a topology?
Topology, workflow, defines different tasks (spouts and bolts) that will process the information, execute a task and maybe output something. The input for one of this tasks could be the output of another one and so on. In this topology you also define the number of threads running each task in parallel.
What are those spouts and bolts?
A spout is a source of tuples, it doesn’t have any input and generate tuples, meanwhile a bolt can consume and process the tuples and maybe output something.
… and what’s a tuple?
Is kind of the unit of information that you pass form task to task.Remember tuples must be serializable.
So with all this new concepts, and topologies, streams and clusters (YES! didn’t I mention it before?) .. what you want to build?
Simple, the most common and accesible data stream is the twitter firehose, let’s try to build a web application that will tell us what’s going on in the Android world.
What means what’s going on in the Android world? Simple:
- Let’s process those tweets that have the android word on them.
- First, filter the one with bad words or the one that we knows that are rubbish.
- From those tweets filtered, extract hashtags, tweets being retweeted and tweets that contains links.
- Also from the tweets that contains links, if that link is pointing to the android market, just let’s scrape the information as well.
- And … why not? For the important tweets (those retweeted), if they contain a link, try to extract the article metadata using goose (a html article extractor).
Let’s try to see this in a graphic way?
That’s how the information will flow, and the kind of tasks that we will execute. Yes it’s more effective to group some of those tasks, but remember, we just wanted to give this a try ;P
In order to create this monstrosity I just followed the superb well documented project storm-starter where the creators of Storm, do an excellent example of how to create the topology, write the different bolts and spouts. With this storm-starter project you could run storm locally, in a single machine, but also will guide you on how to run then in a cluster. If you are a hardcore hacker and directly want to try an example on a proper cluster, also recommend you to take a look storm-deploy, another incredible software that will help you to automatically deploy your storm topology and tasks across different Amazon EC2 servers.
Ok, let’s continue with our idea of check what’s going on LIVE in the Android world. What do we need if we want to show the information that we are processing real time?
In my case, just decided to setup a couple of technologies that work all quite good together. Let’s explain how:
- If we are talking about live data, the first thing that come to our minds is using PubSub, where the Storm cluster is publishing all the data that comes from twitter, and our intrepid browsers will subscribe to the information coming for them. In order to do this, each bolt (unit that process information), for example, the tags spout (extract hashtags from tweets), will subscribe to an specific channel … let’s call it tags and will emit any hashtag coming from Twitter related to android. What technology should we pick for this … don’t know, let’s see … REDIS is so convenient with it’s pubsub capabilities, also could let us implement counters (quite easy).
- Our web server will be in charge of subscribing to the redis pubsub channels, and tell all the clients connected that new information arrived. Basically, this is acting as a pubsub proxy for the clients … wow, we are doing pubproxsub!! (lol!). In this case there is also a candidate that will help us to expose all this live information on a easy weay. Let’s use node.js with the awesome library Socket.io and it’s ability to push information to the clients is really helpful.
- And our client, in this case a webpage … well, I don’t have the best design skills ever, so let’s trust some library that put funny and well done css for us. Wait, we are building something around twitter … so why not using Twitter Bootstrap as well! hahaha 😛 yes I’m becoming twittermental!
As a result of all this, you can point your browser to:
wait for some seconds, and start seeing how your screen starts becoming populated with real time information coming from twitter.
Ok, and after more than 1000 thousand words … no single line of code … buddy: SHOW ME THE CODE.
You can find all the code use for that example there: all the storm code (topology, spouts, bolts) and the node.js + socket.io + js needed to setup the server and display the information in the web client. Again, this is just a test, if you realise, the storm project is configured to run in local mode (in a single machine), and everything is quite hacky … but it’s enough to test the technology and figure out what you really can build!
Remember, if you want to take storm seriously:
- Watch this presentation.
- Read the storm tutorial.
- Build, study and understand the storm-starter project.
- … and don’t forget to visit the storm user group.