Dogukan Sonmez

Currently in Munich, Germany


August 26 2014

Let's get our hands dirty

We know we want to do some “big data stuff”. So how are we going to start building a big data solution?

First of all, where can we find free and publicly available data?

So we have data, but now the question is: where are we going to store it?

  • HDFS: HDFS is a distributed file system that is well suited for the storage of large files.
    It is the primary storage system used by Hadoop applications.

  • HBase: It is built on top of HDFS and provides fast record lookups (and updates) for large tables.
    HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
    Its documentation says: “HBase database should have hundreds of millions or -- even better -- billions of rows.
    Anything less, and you're advised to stick with an RDBMS”.

  • Cassandra: One of the most widely used NoSQL databases. With version 2.0, the project says:
    "Cassandra powers massive data sets quickly and reliably without compromising performance, whether running in the cloud or partially on-premise in a hybrid data store."

  • MongoDB: A document-based NoSQL database. It may not be a perfect fit for big data, but its makers claim MongoDB is good for big data as well.

You could also list many other solutions here (Google Bigtable, Amazon DynamoDB, OrientDB, Titan, Neo4j).

The right solution depends on how big your data is and how you want to use it (batch, streaming, real time, etc.); the choice might change based on your requirements.
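To make the HBase idea of indexed StoreFiles a bit more concrete, here is a minimal pure-Python sketch (this is not HBase code; the class name and layout are invented for illustration) of an immutable, key-sorted store with binary-search lookups, which is roughly what makes fast record lookups possible:

```python
import bisect

class StoreFile:
    """Toy stand-in for an HBase StoreFile: an immutable,
    key-sorted list of (row_key, value) pairs."""

    def __init__(self, rows):
        # HBase keeps rows sorted by key on disk; we sort once here.
        self.rows = sorted(rows)
        self.keys = [k for k, _ in self.rows]

    def get(self, key):
        # Binary search gives O(log n) point lookups on the sorted keys.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i][1]
        return None

store = StoreFile([("user42", "Alice"), ("user07", "Bob"), ("user99", "Carol")])
print(store.get("user07"))  # Bob
print(store.get("user00"))  # None
```

Real StoreFiles add block indexes, bloom filters, and compactions on top, but the core trick is the same: keep the data sorted so lookups never need a full scan.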

Some notes to keep in mind about these solutions:

  • HDFS's strength is its ability to tackle big data use cases and most of the characteristics that comprise them (data velocity, variety, and volume). HDFS is not good at handling small files; if you are going to use it, keep in mind not to produce many small files.

  • Cassandra and MongoDB are easier than HDFS or HBase to try locally; you can install them straight from the command line.
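One common way around the small-files problem is to pack many small records into one large file before writing, similar in spirit to Hadoop's SequenceFile container. A minimal sketch (the length-prefixed layout and function names here are invented, not a real Hadoop format):

```python
import io

def pack_records(records):
    """Pack many small records into one large blob using a simple
    length-prefixed layout (a toy stand-in for e.g. a SequenceFile)."""
    buf = io.BytesIO()
    for rec in records:
        data = rec.encode("utf-8")
        buf.write(len(data).to_bytes(4, "big"))  # 4-byte length prefix
        buf.write(data)
    return buf.getvalue()

def unpack_records(blob):
    """Read the length-prefixed records back out of the blob."""
    records, pos = [], 0
    while pos < len(blob):
        n = int.from_bytes(blob[pos:pos + 4], "big")
        pos += 4
        records.append(blob[pos:pos + n].decode("utf-8"))
        pos += n
    return records

# One large blob instead of 1000 tiny files on HDFS.
blob = pack_records([f"event-{i}" for i in range(1000)])
print(len(unpack_records(blob)))  # 1000
```

Writing one big container file instead of thousands of tiny ones keeps the NameNode's metadata load down, which is exactly the point of the note above.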

So far we have found the data and stored it somewhere :) Now the question is: how are we going to process our data?

What I mean by processing is taking unstructured data and converting it into structured form, so we can do some analytics on top of it. The most time-consuming part of big data applications is cleaning the data: removing noise and transforming it into the input form for the next application that will use it. For a large data set, what can we do:

  • MapReduce with Hadoop; use Cascading or Scalding to make it even easier and more fun.
  • Use Apache Spark, an open source system developed at the UC Berkeley AMPLab. It keeps data in memory and is faster than Hadoop.
  • Apache Storm is basically for real-time analytics: like Hadoop, but without the batching.
  • Create your own data processing pipeline:

    Create your microservices and connect them with a binary protocol (Thrift, Avro, Protobuf)
    using Twitter Finagle; use Kafka for publishing and subscribing to events;
    make it non-blocking with Akka; index your data in Elasticsearch;
    and use a graph database to create relations out of your data. (Looks fancy, doesn't it?)
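The MapReduce option above is easiest to see with the classic word-count example. Here is a minimal pure-Python sketch of the map, shuffle, and reduce phases (this only illustrates the programming model; a real Hadoop job distributes each phase across machines):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word, as in the
    # classic Hadoop word-count example.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle: sort and group pairs by key (Hadoop does this
    # between the map and reduce phases), then sum each group.
    counts = {}
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        counts[word] = sum(n for _, n in group)
    return counts

lines = ["big data is big", "data is fun"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'big': 2, 'data': 2, 'fun': 1, 'is': 2}
```

Cascading and Scalding let you express exactly this kind of flow at a higher level instead of writing raw mapper and reducer classes.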

So what is next? Yes, now it's time to have fun with the data!

  • If you decided to use MapReduce with Hadoop, then you can use Apache Mahout,
    which is already integrated with Hadoop and has many data mining and machine learning algorithms.
    The algorithms in Mahout are not state of the art, but it is a very good starting point, especially if you want to build a simple recommendation engine,
    do frequent pattern mining, topic modeling, etc.

  • If you go for Apache Spark, it already has MLlib for machine learning,
    and there is plenty of documentation for it.

  • I would also recommend GraphLab, which claims better performance and offers graph-based algorithms.
    It also has HDFS integration and many algorithms of its own.

  • For NLP tasks you can use GATE, for example for tasks like language identification, POS tagging, etc. It's free and powerful.
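To show the idea behind a simple Mahout-style recommendation engine, here is a toy co-occurrence recommender in pure Python (the function and data here are made up for illustration; Mahout's real item-based recommenders work on much larger matrices, distributed over Hadoop):

```python
from collections import Counter
from itertools import combinations

def recommend(baskets, user_items, top_n=2):
    """Toy item-based recommender in the spirit of co-occurrence
    recommendation: items that often appear together with what the
    user already has are recommended first."""
    # Count how often each pair of items appears in the same basket.
    cooc = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    # Score candidate items by their co-occurrence with the user's items.
    scores = Counter()
    for item in user_items:
        for (a, b), n in cooc.items():
            if a == item and b not in user_items:
                scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]

baskets = [["milk", "bread", "eggs"],
           ["milk", "bread"],
           ["bread", "butter"],
           ["milk", "eggs"]]
print(recommend(baskets, {"milk"}))  # e.g. ['bread', 'eggs']
```

This is the "people who bought X also bought Y" pattern in its smallest possible form; the hard part at big data scale is computing the co-occurrence matrix, which is exactly the kind of job MapReduce is good at.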

In addition to these technologies there is Weka (machine learning tools for data mining),
Giraph (with which you can build your own graph and apply your own algorithms on top of it),
R for machine learning, and plenty of other solutions you can find with a simple Google search.

Keep in mind:

  • Don’t move data all the time; move the computation
  • Create a feedback loop for your algorithms
  • Build non-blocking pipelines
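As a small sketch of the last point, here is a non-blocking two-stage pipeline using Python's asyncio as a stand-in for Akka-style stages (the stage names, queue size, and sample data are invented for illustration): the producer keeps feeding events while the consumer cleans them, with a bounded queue providing backpressure between the stages.

```python
import asyncio

async def producer(queue, items):
    # Source stage: push raw events into the queue without blocking
    # the event loop; `await queue.put` yields when the queue is full.
    for item in items:
        await queue.put(item)
    await queue.put(None)  # sentinel: no more data

async def consumer(queue, out):
    # Processing stage: clean/transform each event as it arrives.
    while True:
        item = await queue.get()
        if item is None:
            break
        out.append(item.strip().lower())

async def pipeline(items):
    # The bounded queue gives simple backpressure between the stages.
    queue = asyncio.Queue(maxsize=10)
    out = []
    await asyncio.gather(producer(queue, items), consumer(queue, out))
    return out

result = asyncio.run(pipeline(["  Hello ", "BIG Data "]))
print(result)  # ['hello', 'big data']
```

Neither stage ever sits in a blocking call, so you can add more stages (index into Elasticsearch, publish to Kafka, and so on) without one slow step stalling the whole process.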