News: Gathering Data for Fun and Profit

Oh Data, You so Awesome!

We are going to use Node.js to gather us some data.  Given Node's plethora of well-abstracted network abilities and its deeply evented nature, it will make quick work of plugging into various data sources, gathering the data, and making good use of it.

Data sources we will be using are: Twitter, IRC and RSS.

Step 1 Install yourself some Node.js

If you are on most modern systems, there is a package manager in place. Use it to install Node with whichever of these commands matches your system:

  apt-get install node.js node.js-dev
  pkg_add -vi node
  brew install node
  yum install node

If your system is crazy, use the following steps to build Node from source:

  1. curl -O http://nodejs.org/dist/v0.6.13/node-v0.6.13.tar.gz
  2. tar -zxvf node-v0.6.13.tar.gz
  3. cd node-v0.6.13
  4. ./configure --prefix=/usr/local
  5. make && make install 

Step 2 Identify the data you wish to gather

I have chosen to gather data pertaining to the "That's what she said" joke.  I have a bot named "mcchunkie" (the link takes you to mcchunkie's brain, so you can see which words it thinks are "funny") that has crowdsourced some "twss" data.  It uses a naive Bayesian classifier to identify words in a sentence that are "twss"-worthy.
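
For anyone who has not met the technique, here is a minimal naive Bayes sketch. This is not mcchunkie's code: the `NaiveBayes` constructor, the `twss` / `not` labels, and the training sentences are all invented for illustration, and a real bot would train on far more data.

```javascript
// Tiny naive Bayes text classifier with add-one (Laplace) smoothing.
// Everything here is a hypothetical illustration, not mcchunkie's brain.
function NaiveBayes() {
    // Null-prototype objects so words like "constructor" are safe as keys.
    this.wordCounts = Object.create( null ); // label -> { word -> count }
    this.totalWords = Object.create( null ); // label -> words seen
    this.docCounts = Object.create( null );  // label -> sentences seen
    this.vocab = Object.create( null );      // every word seen in training
    this.totalDocs = 0;
}

// Lowercase the text and split it into simple word tokens.
function tokenize( text ) {
    return text.toLowerCase().match( /[a-z']+/g ) || [];
}

NaiveBayes.prototype.train = function( label, text ) {
    var self = this;
    if ( !( label in this.docCounts ) ) {
        this.wordCounts[ label ] = Object.create( null );
        this.totalWords[ label ] = 0;
        this.docCounts[ label ] = 0;
    }
    this.docCounts[ label ]++;
    this.totalDocs++;
    tokenize( text ).forEach( function( w ) {
        self.wordCounts[ label ][ w ] = ( self.wordCounts[ label ][ w ] || 0 ) + 1;
        self.totalWords[ label ]++;
        self.vocab[ w ] = true;
    } );
};

NaiveBayes.prototype.classify = function( text ) {
    var self = this;
    var words = tokenize( text );
    var vocabSize = Object.keys( this.vocab ).length;
    var best = null;
    var bestScore = -Infinity;
    Object.keys( this.docCounts ).forEach( function( label ) {
        // log P(label) + sum over words of log P(word | label)
        var score = Math.log( self.docCounts[ label ] / self.totalDocs );
        words.forEach( function( w ) {
            var count = self.wordCounts[ label ][ w ] || 0;
            score += Math.log( ( count + 1 ) / ( self.totalWords[ label ] + vocabSize ) );
        } );
        if ( score > bestScore ) {
            bestScore = score;
            best = label;
        }
    } );
    return best;
};

// Made-up training data.
var nb = new NaiveBayes();
nb.train( 'twss', 'that is really hard' );
nb.train( 'twss', 'it is so big' );
nb.train( 'twss', 'put it in' );
nb.train( 'not', 'the meeting is at noon' );
nb.train( 'not', 'please review my patch' );
nb.train( 'not', 'the server is down' );

console.log( nb.classify( 'wow that is so hard' ) );
```

With only six training sentences the classifier is easy to fool, which is exactly why more data helps.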

But it needs MOAR DATA! 

So we will keep any string that matches /twss/i (along with the meta information provided by the sources).
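
A minimal sketch of that filter, assuming each incoming item has already been reduced to a plain string. The helper name `isTwss` and the sample lines are made up for illustration and do not come from the project.

```javascript
// Case-insensitive match for "twss" anywhere in the text.
var TWSS = /twss/i;

// Hypothetical helper: true when the text mentions "twss".
function isTwss( text ) {
    // Guard against non-string input, such as a missing field.
    return typeof text === 'string' && TWSS.test( text );
}

// Made-up sample lines, just to show the filter in action.
var samples = [
    'That is a big one. TWSS!',
    'Nothing to see here.',
    'twss #lol'
];

var kept = samples.filter( isTwss );
console.log( kept );
```

The regex deliberately has no `g` flag, so repeated calls to `test` do not carry state from one string to the next.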

Step 3 Start writing some code!

I am only going to post short snippets of the code with a brief description in this section.  The end of the article contains a link to the GitHub project with all the running code.

Gather Twitter data:

twitter.stream( 'statuses/filter', { track: data_source.twitter }, function( str ) {
    // Each tweet matching the tracked terms arrives as an event on the stream.
    str.on( 'tweet', function( tw ) {
        gather_data( 'tweet', tw );
    });
});

Gather RSS Data:

rss.on( 'article', function( article ) {
    if ( article.content.match( /twss/i ) ) {
        gather_data( 'rss', article );
    }
});
rss.start();

Gather IRC Data:

irc_client.addListener( 'message', function( from, to, msg ) {
    if ( msg.match( /twss/i ) ) {
        gather_data( 'irc', { from: from, to: to, msg: msg } );
    }
});

As you can see, there isn't much in the way of code.  Each block establishes an event listener through the library being used.

When a new event is triggered (a new line sent to IRC, or a new article published), the listener calls the function that we passed to it.
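
The listener pattern itself can be sketched with Node's built-in `events` module. The `source` emitter and the `'line'` event name are stand-ins for whatever the real Twitter, RSS, and IRC libraries expose.

```javascript
var EventEmitter = require( 'events' ).EventEmitter;

// Stand-in for a data source such as an IRC client.
var source = new EventEmitter();
var seen = [];

// Register a listener: the function we pass runs every time 'line' fires.
source.on( 'line', function( from, msg ) {
    if ( /twss/i.test( msg ) ) {
        seen.push( { from: from, msg: msg } );
    }
} );

// Simulate two incoming events; only the second one matches the filter.
source.emit( 'line', 'alice', 'hello there' );
source.emit( 'line', 'bob', 'that was hard... twss' );

console.log( seen );
```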

The function then hands the data to the "gather_data" function, which simply logs the data to STDOUT.
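
A sketch of what such a function could look like. The one-JSON-object-per-line output and the `source` / `time` fields are assumptions made for illustration; the real "gather_data" lives in the linked project and may well differ.

```javascript
// Hypothetical formatter: wrap the payload with its source and a timestamp.
function format_record( source, data ) {
    return JSON.stringify( {
        source: source,
        time: new Date().toISOString(),
        data: data
    } );
}

// Log one JSON record per line to STDOUT so the output is easy to grep or pipe.
function gather_data( source, data ) {
    var line = format_record( source, data );
    console.log( line );
    return line;
}

gather_data( 'irc', { from: 'bob', to: '#chan', msg: 'twss' } );
```

Writing to STDOUT keeps the gatherer simple: redirecting the output to a file is enough to start building a data set.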

Conclusion

Data is awesome. 

Link to the code: https://github.com/qbit/whtgather

Image by Behrig
