We know we want to do some “big data stuff”. So how do we start building a big data solution?
HDFS: HDFS is a distributed file system that is well suited for the storage of large files.
It is the primary storage system used by Hadoop applications.
HBase: It is built on top of HDFS and provides fast record lookups (and updates) for large tables.
HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
According to its documentation, “HBase isn't suitable for every problem... your HBase database should have hundreds of millions or -- even better -- billions of rows.
Anything less, and you're advised to stick with an RDBMS.”
Cassandra: One of the most widely used NoSQL databases. As of version 2.0, its documentation says:
"Cassandra powers massive data sets quickly and reliably without compromising performance, whether running in the cloud or partially on-premise in a hybrid data store."
MongoDB: A document-based NoSQL database. It might not be a perfect fit for big data, but they also claim MongoDB is good for big data.
You could also list many other solutions here (Google Bigtable, Amazon DynamoDB, OrientDB, Titan, Neo4j).
Which solution fits depends on how big your data is and how you want to use it (batch, streaming, real time, etc.); based on your requirements, the right solution might change.
Some notes to keep in mind about these solutions:
HDFS's strength is its ability to tackle big data use cases and most of the characteristics that define them (data velocity, variety, and volume). However, HDFS is not good at handling small files; if you are going to use it, keep in mind not to produce many small files.
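Since the small-files problem comes up so often, here is a minimal sketch (file names and paths are made up for illustration) of pre-merging many small files into one large file before loading it into HDFS, so the NameNode tracks one large file instead of thousands of tiny ones:

```python
import os
import tempfile

# Hypothetical illustration: merge many small log files into one
# large file before loading it into HDFS.
def merge_small_files(paths, out_path):
    with open(out_path, "w") as out:
        for p in paths:
            with open(p) as f:
                out.write(f.read())

# Usage with throwaway files standing in for real small logs:
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, f"part-{i}.log")
    with open(p, "w") as f:
        f.write(f"record {i}\n")
    paths.append(p)

merged = os.path.join(tmp, "merged.log")
merge_small_files(paths, merged)
```

A real setup would do the same kind of consolidation before an `hdfs dfs -put`, or use container formats like SequenceFile/Avro for the same effect.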
What I mean by processing is taking unstructured data and converting it into a structured form so we can do some analytics on top of it. The most time-consuming part of big data applications is usually cleaning the data: removing noise and transforming it into the input form for the next application that will consume it. For a large data set, what can we do?
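To make the cleaning step concrete, here is a minimal sketch (the log format and field names are illustrative) that parses unstructured web-server log lines into structured records and drops noise:

```python
import re

# Illustrative pattern for common-log-style lines; unparseable lines
# are treated as noise and skipped.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})'
)

def clean(lines):
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:  # noise: drop lines that do not match
            continue
        rec = m.groupdict()
        rec["status"] = int(rec["status"])  # convert to structured types
        yield rec

raw = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200',
    'garbage line that should be removed',
]
records = list(clean(raw))
```

The structured records can then feed whatever analytics step comes next, whether that is a MapReduce job or an Elasticsearch index.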
Create your own data processing pipeline:
- Create your microservices and connect them with a binary protocol (Thrift, Avro, Protobuf), for example using Twitter Finagle.
- Use Kafka for publishing and subscribing to events.
- Make it non-blocking with Akka.
- Index your data in Elasticsearch.
- Use a graph database to create relations out of your data. (Looks fancy, doesn't it?)
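To show the shape of such a pipeline, here is a toy in-memory publish/subscribe broker standing in for Kafka (all names are illustrative; a real pipeline would use the Kafka client and Elasticsearch rather than plain Python lists):

```python
from collections import defaultdict

# Toy in-memory broker illustrating the publish/subscribe pattern;
# in production this role would be played by Kafka.
class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

broker = Broker()
index = []  # stand-in for the Elasticsearch index

# The "indexing" service subscribes to cleaned events...
broker.subscribe("cleaned-events", index.append)
# ...and an upstream cleaning service publishes to the topic.
broker.publish("cleaned-events", {"user": "u1", "action": "click"})
```

The point is the decoupling: the publisher never knows who consumes the events, so you can add indexing, graph-building, or analytics consumers independently.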
If you decided to use Hadoop MapReduce, then you can use Apache Mahout (https://mahout.apache.org/),
which is already integrated with Hadoop and has many data mining and machine learning algorithms.
The algorithms in Mahout are not state of the art, but it is a very good starting point, especially if you want to build a simple recommendation engine,
do frequent pattern mining, topic modeling, etc.
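As an example of the kind of thing Mahout gives you, here is a minimal item-to-item co-occurrence recommender in plain Python; it only illustrates the idea behind item-based recommendation and does not use Mahout's actual API:

```python
from collections import defaultdict
from itertools import combinations

# Count how often item pairs co-occur in baskets, then score unseen
# items for a user by summing co-occurrence counts with items they have.
def recommend(baskets, user_items, top_n=2):
    cooc = defaultdict(int)
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    scores = defaultdict(int)
    for item in user_items:
        for (a, b), count in cooc.items():
            if a == item and b not in user_items:
                scores[b] += count
    return [item for item, _ in
            sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
recs = recommend(baskets, {"milk"})
```

Mahout implements the same idea as distributed MapReduce jobs, which is what makes it usable on data that does not fit in memory.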
I would also recommend GraphLab
(http://graphlab.org/projects/index.html), which claims better performance and offers graph-based algorithms.
It also has HDFS integration and provides many algorithms as well.
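To illustrate what "graph-based algorithms" means here, this is a tiny PageRank iteration of the kind GraphLab (or Giraph) runs at scale; the graph and parameters are made up, and the sketch assumes every node has at least one outgoing edge:

```python
# Power-iteration PageRank over an edge list; frameworks like GraphLab
# distribute exactly this kind of per-vertex update across a cluster.
def pagerank(edges, n, damping=0.85, iterations=50):
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1 - damping) / n] * n
        for src, dst in edges:
            new[dst] += damping * rank[src] / out_deg[src]
        rank = new
    return rank

# 0 -> 1 -> 2 -> 0 : on a simple cycle the ranks converge to be equal.
ranks = pagerank([(0, 1), (1, 2), (2, 0)], 3)
```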
In addition to these technologies, there is Weka (http://www.cs.waikato.ac.nz/ml/weka/),
a collection of machine learning tools for data mining; Giraph, with which you can build your own graph and apply your own algorithms on top of it;
R for machine learning; and plenty of other solutions that can be found with a simple Google search.