You want to know what the real answer to all this Big Data challenge is ? It's in us!!!
Fantastic isn't it. Well till that becomes "commercially viable" , let's talk about what we can do today.
The right tool for the right job – that’s no doubt a cliché to many. But it’s surprising how often the tools at hand are used for any kind of job. In my last post, I talked about why dealing with Big Data is not just about data, but also about a new set of tools.
Let’s dissect a use case to understand the heart of the problem. In earlier posts, I talked about clickstream data. Clickstream data is data that is generated by user actions on web pages – this can include everything from components on the web page that were downloaded when a user clicked on something, the ip address, the time of the interaction, the session id, the length of time, number of downloads triggered, bytes transferred, referral URL etc – in “tech” speak, you can say it’s the electronic record of actions that a user triggers on a web page. All of this is recorded in your web server logs. On business web sites, these logs can grow to several gigabytes a day easily. Also, like I mentioned in my previous post, analysis of this data can lead to some very beneficial insights and potentially more business. To get some perspective, if you are an Archer Administrator, check out the size of the largest log on your IIS server that hosts the Archer web application. I am guessing the largest file is easily a few hundred megabytes if not close to gigabytes. OR WAIT, Archer History Log anyone?
Here’s a snippet from the web server log on my local Archer instance (on my laptop), which incidentally was about 17 MB in about a day (used primarily by me a couple of times a day):
Now there are those who would argue that log data is not the best example of “Big Data” – part of the reasoning being that it does somewhat have a structure. Besides, weren’t businesses doing click stream analysis already before it was characterized as Big Data?
Yes they were – but there’s a little thing they do that is very inconspicuously described as “pre-processing”. Pre-processing is a diversion that hides fundamental challenges in dealing with all of the data in the logs. Logs themselves or raw data in the logs are second class citizens or even worse “homeless”. The web servers don’t want to hang on to them since the size of the logs can impact the web server host itself in terms of performance and storage. The systems that are going to use this data don’t want them in the raw format and don’t want all of it.
Typically, “pre-processing” involves some very expensive investments to clean the data, validate certain elements, aggregate and conform to quite often, a relational database schema or a data warehouse schema. Not only that, but both the content and the time window are crunched to accommodate what the existing infrastructure can handle. At the tail end of this transformation is the loading of this data into the warehouse or relational database system. Not only does this data now represent a fraction of the raw data from the logs, it could be several days between the raw data coming in and the final output into the target data source. In other words, by the time somebody looks at a report on usage stats and patterns for the day the actions were recorded, weeks could have gone by. And the raw data that was input to this is usually thrown away. There’s another big problem in this whole scenario in that you are clogging your network with terabyte data movements. Let’s face it – where and how do you cost-effectively store data that is coming in at a rate of several hundred gigabytes a day? And once you do how do cost-effectively and efficiently process several terabytes or petabytes of this data later? This is a Big Data problem. We need a different paradigm to break this barrier.
What if, instead of trying to pump that 200 GB daily weblog into a SAN, you could break it apart and store it on a commodity hardware cluster comprised of a couple of machines with local storage?
And what if, you could push the work you want to do onto those machines with the units of data that constitute the file? In parallel? Instead of moving the data around?
Say hello to Hadoop – a software framework that has come to play a pivotal role in solving Big Data problems. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop at it’s core consists of two components :
1) HDFS or the Hadoop Distributed File System that provides high throughput access to data and designed to run on commodity hardware
2) Map-Reduce: A Programming model for processing large data sets
So what makes Hadoop the right tool for this problem?
Storing very large volumes of data across multiple commodity machines: With HDFS, large sets of large files can be distributed across a cluster of machines.
Fault tolerant: In computations involving a large number of nodes, failure of nodes is expected. This notion is built into Hadoop. Data from all files is duplicated across multiple nodes.
Move the computation, not the data: This is one of the core fundamental assumptions in Hadoop “Moving the computation is cheaper than moving the data”. Moving the processing of the data to where the data is not only reduces network congestion, but increases the overall throughput of the system. This is known as “data locality”.
In my next blog post, I'll explore some more aspects of Hadoop and talk about addtional tools in the Big Data quiver that gets you completely armed for your Big Data challenges.