In my last post, I covered certain elements of Big Data and how to recognize whether you are dealing with it. Not everybody needs to deal with Big Data, but those who do quickly realize that the hammers and wrenches they have been using on traditional data are no longer the right tools.
Popular websites and portals easily get several million visitors and several billion page views per day. This “clickstream” data is very log-like, and while it has a pattern, it does not necessarily fit the definition of “structured”. Further, the rate at which this data streams in is very, very fast.
Facebook gets over 2 billion likes and shares a day – to many, this is “fun” data, and nobody really looks behind the scenes (nor do they need to) to see how Facebook manages it. Today, this type of data (social media) is actually being mined by organizations to do things like “sentiment analysis”. This technique is very useful to businesses in making “course corrections” based on their interpretation of “sentiment” toward, say, their products or product campaigns. Similarly, “Likes” can be utilized for targeted advertising and marketing if the page is owned and operated by a business. When you “like” something, news feeds and ads related to that product or service are constantly fed to you.
Consider the realm of security. Protecting cardholder information is critical and a top priority for financial institutions, and understanding purchase patterns and buying behaviors is key to detecting fraud early and accurately. Payment platforms have to deal with several sources: point of sale systems, websites and mobile devices. Although many institutions do fraud detection today, they rely on smaller subsets of the data – a technique known as “sampling” – to build the datasets they eventually run analytics on. The rest of the data is pruned onto magnetic tapes (for regulatory requirements) and may never see the light of day or the “probing of BI tools”.
Problems like these are easy to relate to when you think outside of business and IT. Let’s take this very simple (though slightly exaggerated) scenario. Say I am helping my school-goer with a data collection project to be done during spring break. My son wants to count and group cars that come into our neighborhood street by color – say 7-8 am and 5-6 pm, for 5 days. We have 4 people in our household. One way we could do this is to assign one person to each day, with one person covering both day 1 and day 5. Another approach could be to have one person cover the morning hour and another cover the evening hour.
Pretty straightforward, right? The tools we would use are no more than paper, a pencil and potentially a calculator (maybe mental math is more than enough). The process is also not too complicated – look out the window or sit outside by the door; start marking off counts by color:
At the end of the 5 days, we sit together at the breakfast table and total up the counts of the different car colors.
Now let’s say you want to cover two streets (two neighborhoods) – you can’t just sit outside your door anymore or look out your window. You can either enlist a friend for help in the other neighborhood or sit in front of your friend’s house for a couple of hours every day. That’s not bad – you have to go out of your way to enlist an additional resource (a friend) or an additional system in the process (your friend’s house) – but it’s still doable with paper, pencil and maybe a calculator (you probably don’t need one, but it’s a handy tool lying around the house).
You post this little project on your Facebook page innocuously. Social media swings into action – ten subdivisions are now wildly interested in knowing the count of cars grouped by color in all of their neighborhoods, not just for 2 hours a day but for 8 hours every day. They want you to lead this effort. They will anxiously, excitedly wait for the results on the 6th day.
Ten subdivisions – let’s say each subdivision has 10 streets. That’s counting cars in and out on 100 streets for 8 hours a day.
This is a wildly exaggerated scenario – but consider this: even if it were three subdivisions with 30 streets and 2 hours a day, your process for 1 street – sitting outside your doorstep for an hour, counting cars on paper and finally tallying at the end of day 5 – is no longer feasible. It’s certainly not feasible for 100 streets and 8 hours each day.
Your process needs to change to handle this scenario – you will need new tools (probably spreadsheets, maybe a tablet) and new ways to efficiently divide up the work and do the final aggregation. You will certainly need more people involved and doing work.
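That divide-then-aggregate pattern is, at heart, what frameworks like MapReduce formalize. Here is a minimal sketch in Python of the car-counting project done that way – the street names and observations are made up for illustration, and each per-street tally stands in for work a separate person (or machine) could do in parallel:

```python
from collections import Counter

# Hypothetical per-street observations: each worker records the colors
# of the cars they see on their assigned street.
street_observations = {
    "street_1": ["red", "blue", "red", "white"],
    "street_2": ["white", "white", "black"],
    "street_3": ["blue", "red"],
}

def count_colors(cars):
    """The 'map' step: one worker tallies counts for a single street."""
    return Counter(cars)

# Each street is counted independently -- this is the work that can be
# handed out to different people (or machines) in parallel.
partial_counts = [count_colors(cars) for cars in street_observations.values()]

# The 'reduce' step: merge the per-street tallies into one grand total,
# like sitting at the breakfast table and adding up everyone's sheets.
total = Counter()
for partial in partial_counts:
    total += partial

print(dict(total))
```

The key property is that no worker needs to see anyone else’s street: the partial tallies are small and independent, so adding more streets just means adding more workers, and only the final merge brings everything together.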
So, if your business needs require you to ingest and process this type of data – where the volume, velocity and variety are far greater than any scale that has been dealt with before – you need a different approach. You need new tools and new ways of handling this data deluge. This is really what Big Data is about.
I am purposely peeling the layers of the Big Data onion slowly. All too often, people think about the data deluge in terms of just one characteristic of Big Data (often volume, sometimes variety) and immediately run off to acquire tools for that element alone.
In the next blog post, I will start delving into some of the technologies and tools that are necessary when you start down the Big Data path.