Taming The Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics
Ebook · 447 pages · 11 hours

About this ebook

You receive an e-mail. It contains an offer for a complete personal computer system. It seems like the retailer read your mind since you were exploring computers on their web site just a few hours prior….

As you drive to the store to buy the computer bundle, you get an offer for a discounted coffee from the coffee shop you are getting ready to drive past. It says that since you’re in the area, you can get 10% off if you stop by in the next 20 minutes….

As you drink your coffee, you receive an apology from the manufacturer of a product that you complained about yesterday on your Facebook page, as well as on the company’s web site….

Finally, once you get back home, you receive notice of a special armor upgrade available for purchase in your favorite online video game.  It is just what is needed to get past some spots you’ve been struggling with….

Sound crazy? Are these things that can only happen in the distant future? No. All of these scenarios are possible today! Big data. Advanced analytics. Big data analytics. It seems you can’t escape such terms today. Everywhere you turn people are discussing, writing about, and promoting big data and advanced analytics. Well, you can now add this book to the discussion.

What is real and what is hype? Such attention can lead one to the suspicion that perhaps the analysis of big data is something that is more hype than substance. While there has been a lot of hype over the past few years, the reality is that we are in a transformative era in terms of analytic capabilities and the leveraging of massive amounts of data. If you take the time to cut through the sometimes-over-zealous hype present in the media, you’ll find something very real and very powerful underneath it. With big data, the hype is driven by genuine excitement and anticipation of the business and consumer benefits that analyzing it will yield over time.

Big data is the next wave of new data sources that will drive the next wave of analytic innovation in business, government, and academia. These innovations have the potential to radically change how organizations view their business. The analysis that big data enables will lead to decisions that are more informed and, in some cases, different from what they are today. It will yield insights that many can only dream about today. As you’ll see, there are many consistencies with the requirements to tame big data and what has always been needed to tame new data sources. However, the additional scale of big data necessitates utilizing the newest tools, technologies, methods, and processes. The old way of approaching analysis just won’t work. It is time to evolve the world of advanced analytics to the next level. That’s what this book is about.

Taming the Big Data Tidal Wave isn’t just the title of this book, but rather an activity that will determine which businesses win and which lose in the next decade. By preparing and taking the initiative, organizations can ride the big data tidal wave to success rather than being pummeled underneath the crushing surf. What do you need to know and how do you prepare in order to start taming big data and generating exciting new analytics from it? Sit back, get comfortable, and prepare to find out!

Language: English
Publisher: Wiley
Release date: Mar 19, 2012
ISBN: 9781118241172
Author

Bill Franks

Bill Franks is Chief Analytics Officer for The International Institute For Analytics (IIA), where he provides perspective on trends in the analytics, data science, AI, and big data space and helps clients understand how IIA can support their efforts to improve analytics performance. Franks is also the author of the books Taming The Big Data Tidal Wave and The Analytics Revolution. He is a sought-after speaker and frequent blogger who has been ranked a top 10 global big data influencer and a top big data and artificial intelligence influencer, and was an inaugural inductee into the Analytics Hall of Fame in 2019. His work, including several years as Chief Analytics Officer for Teradata (NYSE: TDC), has spanned clients in a variety of industries, from Fortune 100 companies to small non-profit organizations.

    Book preview

    Taming The Big Data Tidal Wave - Bill Franks

    PART ONE: The Rise of Big Data

    CHAPTER 1

    What Is Big Data and Why Does It Matter?

    Perhaps nothing will have as large an impact on advanced analytics in the coming years as the ongoing explosion of new and powerful data sources. When analyzing customers, for example, the days of relying exclusively on demographics and sales history are past. Virtually every industry has at least one completely new data source coming online soon, if it isn’t here already. Some of the data sources apply widely across industries; others are primarily relevant to a very small number of industries or niches. Many of these data sources fall under a new term that is receiving a lot of buzz: big data.

    Big data is sprouting up everywhere and using it appropriately will drive competitive advantage. Ignoring big data will put an organization at risk and cause it to fall behind the competition. To stay competitive, it is imperative that organizations aggressively pursue capturing and analyzing these new data sources to gain the insights that they offer. Analytic professionals have a lot of work to do! It won’t be easy to incorporate big data alongside all the other data that has been used for analysis for years.

    This chapter begins with some background on big data and what it is all about. Then it will cover a number of considerations in terms of how an organization can make use of big data. Readers will need to understand what is in this chapter as much as or more than anything else in the book if they are to tame the big data tidal wave successfully.

    WHAT IS BIG DATA?

    There is not a consensus in the marketplace as to how to define big data, but there are a couple of consistent themes. Two sources have done a good job of capturing the essence of what most would agree big data is all about. The first definition is from Gartner's Merv Adrian in a Q1 2011 Teradata Magazine article. He said, "Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population."¹ Another good definition is from a paper by the McKinsey Global Institute in May 2011: "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."²

    These definitions imply that what qualifies as big data will change over time as technology advances. What was big data historically or what is big data today won't be big data tomorrow. This aspect of the definition of big data is one that some people find unsettling. The preceding definitions also imply that what constitutes big data can vary by industry, or even organization, if the tools and technologies in place vary greatly in capability. We will talk more about this later in the chapter in the section titled "Today's Big Data Is Not Tomorrow's Big Data."

    A couple of interesting facts in the McKinsey paper help bring into focus how much data is out there today:

    $600 today can buy a disk drive that will store all of the world’s music.

    There are 30 billion pieces of information shared on Facebook each month.

    Fifteen of 17 industry sectors in the United States have more data per company on average than the U.S. Library of Congress.³

    THE BIG IN BIG DATA ISN’T JUST ABOUT VOLUME

    While big data certainly involves having a lot of data, big data doesn’t refer to data volume alone. Big data also has increased velocity (i.e., the rate at which data is transmitted and received), complexity, and variety compared to data sources of the past.

    Big data isn’t just about the size of the data in terms of how much data there is. According to the Gartner Group, the big in big data also refers to several other characteristics of a big data source.⁴ These aspects include not just increased volume but increased velocity and increased variety. These factors, of course, lead to extra complexity as well. What this means is that you aren’t just getting a lot of data when you work with big data. It’s also coming at you fast, it’s coming at you in complex formats, and it’s coming at you from a variety of sources.

    It is easy to see why the wealth of big data coming toward us can be likened to a tidal wave and why taming it will be such a challenge! The analytics techniques, processes, and systems within organizations will be strained up to, or even beyond, their limits. It will be necessary to develop additional analysis techniques and processes utilizing updated technologies and methods in order to analyze and act upon big data effectively. We will talk about all these topics before the book is done with the goal of demonstrating why the effort to tame big data is more than worth it.

    IS THE BIG PART OR THE DATA PART MORE IMPORTANT?

    It is already time to take a brief quiz! Stop for a minute and consider the following question before you read on: What is the most important part of the term big data? Is it (1) the big part, (2) the data part, (3) both, or (4) neither? Take a minute to think about it and once you've locked in your answer, proceed to the next paragraph. In the meantime, imagine the "contestants are thinking" music from a game show playing in the background.

    Okay, now that you’ve locked in your answer let’s find out if you got the right answer. The answer to the question is choice (4). Neither the big part nor the data part is the most important part of big data. Not by a long shot. What organizations do with big data is what is most important. The analysis your organization does against big data combined with the actions that are taken to improve your business are what matters.

    Having a big source of data does not in and of itself add any value whatsoever. Maybe your data is bigger than mine. Who cares? In fact, having any set of data, however big or small it may be, doesn’t add any value by itself. Data that is captured but not used for anything is of no more value than some of the old junk stored in an attic or basement. Data is irrelevant without being put into context and put to use. As with any source of data big or small, the power of big data is in what is done with that data. How is it analyzed? What actions are taken based on the findings? How is the data used to make changes to a business?

    Reading a lot of the hype around big data, many people are led to believe that just because big data has high volume, velocity, and variety, it is somehow better or more important than other data. This is not true. As we will discuss later in the chapter in the section titled "Most Big Data Doesn't Matter," many big data sources have a far higher percentage of useless or low-value content than virtually any historical data source. By the time you trim down a big data source to what you actually need, it may not even be so big anymore. But that doesn't really matter, because whether it stays big or ends up small when you're done processing it, the size isn't important. It's what you do with it.

    IT ISN’T HOW BIG IT IS. IT’S HOW YOU USE IT!

    We’re talking about big data of course! Neither the fact that big data is big nor the fact that it is data adds any inherent value. The value is in how you analyze and act upon the data to improve your business.

    The first critical point to remember as we start into the book is that big data is both big and it’s data. However, that’s not what’s going to make it exciting for you and your organization. The exciting part comes from all the new and powerful analytics that will be possible as the data is utilized. We’re going to talk about a number of those new analytics as we proceed.

    HOW IS BIG DATA DIFFERENT?

    There are some important ways that big data is different from traditional data sources. Not every big data source will have every feature that follows, but most big data sources will have several of them.

    First, big data is often automatically generated by a machine. Instead of a person being involved in creating new data, it’s generated purely by machines in an automated way. If you think about traditional data sources, there was always a person involved. Consider retail or bank transactions, telephone call detail records, product shipments, or invoice payments. All of those involve a person doing something in order for a data record to be generated. Somebody had to deposit money, or make a purchase, or make a phone call, or send a shipment, or make a payment. In each case, there is a person who is taking action as part of the process of new data being created. This is not so for big data in many cases. A lot of sources of big data are generated without any human interaction at all. A sensor embedded in an engine, for example, spits out data about its surroundings even if nobody touches it or asks it to.

    Second, big data is typically an entirely new source of data. It is not simply an extended collection of existing data. For example, with the use of the Internet, customers can now execute a transaction with a bank or retailer online. But the transactions they execute are not fundamentally different transactions from what they would have done traditionally. They’ve simply executed the transactions through a different channel. An organization may capture web transactions, but they are really just more of the same old transactions that have been captured for years. However, actually capturing browsing behaviors as customers execute a transaction creates fundamentally new data which we’ll discuss in detail in Chapter 2.

    Sometimes more of the same can be taken to such an extreme that the data becomes something new. For example, your power meter has probably been read manually each month for years. An argument can be made that automatic readings every 15 minutes by a Smart Meter are just more of the same. It can also be argued that it is so much more of the same, enabling such a different and more in-depth level of analytics, that the data is really a new source. We'll discuss this data in Chapter 3.

    Third, many big data sources are not designed to be friendly. In fact, some of the sources aren't designed at all! Take text streams from a social media site. There is no way to ask users to follow certain standards of grammar, sentence ordering, or vocabulary. You are going to get what you get when people make a posting. Such data can be difficult to work with at best and very, very ugly at worst. We'll discuss text data in Chapters 3 and 6. Most traditional data sources, by contrast, were designed up-front to be friendly. Systems used to capture transactions, for example, provide data in a clean, preformatted template that makes the data easy to load and use. This was driven in part by the historical need to be highly efficient with storage space. There was no room for excess fluff.

    BIG DATA CAN BE MESSY AND UGLY

    Traditional data sources were very tightly defined up-front. Every bit of data had a high level of value or it would not be included. With the cost of storage space becoming almost negligible, big data sources are not always tightly defined up-front and typically capture everything that may be of use. This can lead to having to wade through messy, junk-filled data when doing an analysis.

    Last, large swaths of big data streams may not have much value. In fact, much of the data may even be close to worthless. Within a web log, there is information that is very powerful. There is also a lot of information that doesn’t have much value at all. It is necessary to weed through and pull out the valuable and relevant pieces. Traditional data sources were defined up-front to be 100 percent relevant. This is because of the scalability limitations that were present. It was far too expensive to have anything included in a data feed that wasn’t critical. Not only were data records predefined, but every piece of data in them was high-value. Storage space is no longer a primary constraint. This has led to the default with big data being to capture everything possible and worry later about what matters. This ensures nothing will be missed, but also can make the process of analyzing big data more painful.

    HOW IS BIG DATA MORE OF THE SAME?

    As with any new topic getting a lot of attention, there are all sorts of claims about how big data is going to fundamentally change everything about how analysis is done and how it is used. If you take the time to think about it, however, it really isn’t the case. It is an example where the hype is going beyond the reality.

    The fact that big data is big and poses scalability issues isn’t new. Most new data sources were considered big and difficult when they first came into use. Big data is just the next wave of new, bigger data that pushes current limits. Analysts were able to tame past data sources, given the constraints at the time, and big data will be tamed as well. After all, analysts have been at the forefront of exploring new data sources for a long time. That’s going to continue.

    Who first started to analyze call detail records within telecom companies? Analysts did. I was doing churn analysis against mainframe tapes at my first job. At the time, the data was mind-bogglingly big. Who first started digging into retail point-of-sale data to figure out what nuggets it held? Analysts did. Originally, the thought of analyzing data about tens to hundreds of thousands of products across thousands of stores was considered a huge problem. Today, not so much.

    The analytical professionals who first dipped their toe into such sources were dealing with what at the time were unthinkably large amounts of data. They had to figure out how to analyze it and make use of it within the constraints in place at the time. Many people doubted it was possible, and some even questioned the value of such data. That sounds a lot like big data today, doesn’t it?

    Big data really isn’t going to change what analytic professionals are trying to do or why they are doing it. Even as some begin to define themselves as data scientists, rather than analysts, the goals and objectives are the same. Certainly the problems addressed will evolve with big data, just as they have always evolved. But at the end of the day, analysts and data scientists will simply be exploring new and unthinkably large data sets to uncover valuable trends and patterns as they have always done. For the purposes of this book, we’ll include both traditional analysts and data scientists under the umbrella term analytic professionals. We’ll also cover these professionals in much more detail in Chapters 7, 8, and 9. The key takeaway here is that the challenge of big data isn’t as new as it first sounds.

    YOU HAVE NOTHING TO FEAR

    In many ways, big data doesn’t pose any problems that your organization hasn’t faced before. Taming new, large data sources that push the current limits of scalability is an ongoing theme in the world of analytics. Big data is simply the next generation of such data. Analytical professionals are well-versed in dealing with these situations. If your organization has tamed other data sources, it can tame big data, too.

    Big data will change some of the tactics analytic professionals use as they do their work. New tools, methods, and technologies will be added alongside traditional analytic tools to help deal more effectively with the flood of big data. Complex filtering algorithms will be developed to siphon off the meaningful pieces from a raw stream of big data. Modeling and forecasting processes will be updated to include big data inputs on top of currently existing inputs. We'll discuss these topics more in Chapters 4, 5, and 6.
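[Editor's illustration] To make the idea of a filtering algorithm concrete, here is a minimal sketch in Python. The record format, the field names, and the "meaningful" threshold are all invented for illustration; the book does not prescribe a specific filter.

```python
def filter_stream(raw_records, keep_fields, min_value):
    """Yield only the records worth keeping, trimmed to the useful fields.

    A toy stand-in for the 'complex filtering algorithms' described in the
    text: discard low-value records, then drop the fields analysis won't use.
    """
    for record in raw_records:
        # Skip records that carry no analytic value (hypothetical criterion).
        if record.get("value", 0) < min_value:
            continue
        # Keep only the fields the downstream analysis actually needs.
        yield {k: record[k] for k in keep_fields if k in record}


raw = [
    {"id": 1, "value": 0, "noise": "x" * 50},
    {"id": 2, "value": 9, "noise": "y" * 50},
]
kept = list(filter_stream(raw, keep_fields=("id", "value"), min_value=1))
```

A generator is a natural fit here because a real big data stream may be far too large to hold in memory at once.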

    The preceding tactical changes don’t fundamentally alter the goals or purpose of analysis, or the analysis process itself. Big data will certainly drive new and innovative analytics, and it will force analytic professionals to continue to get creative within their scalability constraints. Big data will also only get bigger over time. However, incorporating big data really isn’t that much different from what analysts have always done. They are ready to meet the challenge.

    RISKS OF BIG DATA

    Big data does come with risks. One risk is that an organization will be so overwhelmed with big data that it won’t make any progress. The key here, as we will discuss in Chapter 8, is to get the right people involved so that doesn’t happen. You need the right people attacking big data and attempting to solve the right kinds of problems. With the right people addressing the right problems, organizations can avoid spinning their wheels and failing to make progress.

    Another risk is that costs escalate too fast as too much big data is captured before an organization knows what to do with it. As with anything, avoiding this is a matter of making sure that progress moves at a pace that allows the organization to keep up. It isn’t necessary to go for it all at once and capture 100 percent of every new data source starting tomorrow. What is necessary is to start capturing samples of the new data sources to learn about them. Using those initial samples, experimental analysis can be performed to determine what is truly important within each source and how each can be used. Building from that base, an organization will be ready to effectively tackle a data source on a large scale.
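[Editor's illustration] One standard way to capture a manageable sample of a stream before committing to full-scale capture is reservoir sampling. The technique is this editor's choice for illustration, not one the book names; it keeps a uniform random sample of fixed size from a stream of unknown length:

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Each later item replaces a reservoir slot with probability
            # k/(i+1), which keeps every item equally likely to survive.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Sample 100 records from a (simulated) stream of 10,000 without ever
# storing the whole stream.
sample = reservoir_sample(range(10_000), k=100, seed=42)
```

The appeal for an organization dipping its toe into a new source is that storage cost stays fixed at k records no matter how large the feed turns out to be.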

    Perhaps the biggest risk with many sources of big data is privacy. If everyone in the world were good and honest, then we wouldn't have to worry much about privacy. But not everyone is good and honest. In fact, in addition to individuals, there are also companies that are not good and honest. There are even entire governments that are not good and honest. Big data has the potential to be problematic here. Privacy will need to be addressed with respect to big data, or it may never meet its potential. Without proper restraints, big data has the potential to unleash such a groundswell of protest that some sources of it may be shut down completely.

    Consider the attention received by recent security breaches that led to credit card numbers and classified government documents being stolen and posted online. It isn't a stretch to say that if data is being stored, somebody will try to steal it. Once the bad guys get their hands on data, they will do bad things with it. There have also been high-profile cases of major organizations getting into trouble for having ambiguous or poorly defined privacy policies. This has led to data being used in ways that consumers didn't understand or support, causing a backlash. Both self-regulation and legal regulation of the uses of big data will need to evolve as the use of big data explodes.

    Self-regulation is critical. After all, it shows that an industry cares. Industries should regulate themselves and develop rules that everyone can live with. Self-imposed rules are usually better and less restrictive than those created when a government entity steps in because an industry didn’t do a good job of policing itself.

    PRIVACY WILL BE A HUGE ISSUE WITH BIG DATA

    Given the sensitive nature of many sources of big data, privacy concerns will be a major focal point. Once data exists, dishonest people will try to use it in ways you wouldn’t approve of without your consent. Policies and protocols for the handling, storage, and application of big data are going to need to catch up with the analysis capabilities that already exist. Be sure to think through your organization’s approach to privacy up front and make your position totally clear and transparent.

    People are already concerned about how their web browsing history is tracked. There are also concerns about the tracking of people’s locations and actions through cell phone applications and GPS systems. Nefarious uses of big data are possible, and if it is possible someone will try it. Therefore, steps need to be taken to stop that from happening. Organizations will need to clearly explain how they will keep data secure and how they will use it if the general population is going to accept having their data captured and analyzed.

    WHY YOU NEED TO TAME BIG DATA

    Many organizations have done little, if anything, with big data yet. Luckily, if you have ignored big data so far, your organization is not too far behind as of 2012, unless you are in an industry, such as ecommerce, where analyzing big data is already standard. That will change soon, however, as momentum is picking up rapidly. So far, most organizations have missed only the chance to be on the leading edge. That is actually just fine with many organizations. Today, they have a chance to get ahead of the pack. Within a few years, any organization that isn't analyzing big data will be late to the game and will be stuck playing catch-up for years to come. The time to start taming big data is now.

    It isn’t often that a company can leverage totally new data sources and drive value for its business while the competition isn’t doing the same thing. That is the huge opportunity in big data today. You have a chance to get ahead of much of your competition and beat them to the punch. We will continue to see examples in the coming years of businesses transforming themselves with the analysis of big data. Case studies will tell the story about how the competition was left in the dust and caught totally off guard. It is already possible to find compelling results being discussed in articles, at conferences, and elsewhere today. Some of these case studies are from companies in industries considered dull, old, and stodgy. It isn’t just the sexy, new industries like ecommerce that are involved. We’ll look at a variety of examples of how big data can be used in Chapters 2 and 3.

    THE TIME IS NOW!

    Your organization needs to start taming big data now. As of today, you’ve only missed the chance to be on the bleeding edge if you’ve ignored big data. Today, you can still get ahead of the pack. In a few years, you’ll be left behind if you are still sitting on the sidelines. If your organization is already committed to capturing data and using analysis to make decisions, then going after big data isn’t a stretch. It is simply an extension of what you are already doing today.

    The fact is that the decision to start taming big data shouldn’t be a big stretch. Most organizations have already committed to collecting and analyzing data as a core part of their strategy. Data warehousing, reporting, and analysis are ubiquitous. Once an organization has bought into the idea that data has value, then taming and analyzing big data is just an extension of that commitment. Don’t let a naysayer tell you it isn’t worth exploring big data, or that it isn’t proven, or that it’s too risky. Those same excuses would have prevented any of the progress made in the past few decades with data and analysis. Focus those who are uncertain or nervous about big data on the fact that big data is simply an extension of what the organization is already doing. There is nothing earth-shatteringly new and different about it and nothing to fear.

    THE STRUCTURE OF BIG DATA

    As you read about big data, you will come across a lot of discussion on the concept of data being structured, unstructured, semi-structured, or even multi-structured. Big data is often described as unstructured and traditional data as structured. The lines aren’t as clean as such labels suggest, however. Let’s explore these three types of data structure from a layman’s perspective. Highly technical details are out of scope for this book.

    Most traditional data sources are fully in the structured realm. This means traditional data sources come in a clear, predefined format that is specified in detail. There is no variation from the defined formats on a day-to-day or update-to-update basis. For a stock trade, the first field received might be a date in a MM/DD/YYYY format. Next might be an account number in a 12-digit numeric format. Next might be a stock symbol that is a three- to five-digit character field. And so on. Every piece of information included is known ahead of time, comes in a specified format, and occurs in a specified order. This makes it easy to work with.
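[Editor's illustration] As a sketch of how easy fully structured data is to work with, the snippet below parses a fixed-width stock-trade record along the lines of the example above. The exact field widths are hypothetical; a real feed would come with a precise layout specification.

```python
# Hypothetical fixed-width trade record, following the example in the text:
# date (MM/DD/YYYY, 10 chars), account number (12 digits),
# stock symbol (up to 5 chars, space-padded).
def parse_trade(line):
    """Split one fixed-width trade record into named fields."""
    return {
        "date": line[0:10],
        "account": line[10:22],
        "symbol": line[22:27].strip(),
    }


record = parse_trade("03/19/2012000123456789AAPL ")
```

Because every field occupies a known position, the parser is trivial; this predictability is exactly what the text means by structured data being "easy to work with."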

    Unstructured data sources are those that you have little or no control over. You are going to get what you get. Text data, video data, and audio data all fall into this classification. A picture has a format of individual pixels set up in rows, but how those pixels fit together to create the picture seen by an observer is going to vary substantially in each case. There are sources of big data that are truly unstructured such as those preceding. However, most data is at least semi-structured.

    Semi-structured data has a logical flow and format to it that can be understood, but the format is not user-friendly. Sometimes semi-structured data is referred to as multi-structured data. There can be a lot of noise or unnecessary data intermixed with the nuggets of high value in such a feed. Reading semi-structured data to analyze it isn’t as simple as specifying a fixed file format. To read semi-structured data, it is necessary to employ complex rules that dynamically determine how to proceed after reading each piece of information.

    Web logs are a perfect example of semi-structured data. Web logs are pretty ugly when you look at them; however, each piece of information does, in fact, serve a purpose of some sort. Whether any given piece of a web log serves your purposes is another question. See Figure 1.1 for an example of a raw web log.

    Figure 1.1 Example of a Raw Web Log


    WHAT STRUCTURE DOES YOUR BIG DATA HAVE?

    Many sources of big data are actually semi-structured or multi-structured, not unstructured. Such data does have a logical flow to it that can be understood so that information can be extracted from it for analysis. It just isn’t as easy to deal with as traditional structured data sources. Taming semi-structured data is largely a matter of putting in the extra time and effort to figure out the best way to process it.

    There is logic to the information in the web log even if it isn’t entirely clear at first glance. There are fields, there are delimiters, and there are values just like in a structured source. However, they do not follow each other consistently or in a set way. The log text generated by a click on a web site right now can be longer or shorter than the log text generated by a click from a different page one minute from now. In the end, however, it’s important to understand that semi-structured data does have an underlying logic. It is possible to develop relationships between various pieces of it. It simply takes more effort than structured data.
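[Editor's illustration] Since Figure 1.1 cannot be reproduced here, a sketch of the kind of rules-based extraction the text describes: the snippet below parses one line in the Apache common log format, one widely used web log layout. The sample line is fabricated.

```python
import re

# Apache "common log format": host, identity, user, timestamp,
# request line, status code, response size. One of many web log layouts.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

line = ('192.0.2.1 - - [19/Mar/2012:10:00:00 -0500] '
        '"GET /index.html HTTP/1.1" 200 1043')

hit = LOG_PATTERN.match(line).groupdict()
```

Unlike the fixed-width record earlier, the fields here vary in length from hit to hit, so a pattern rather than fixed offsets is needed to pull out the pieces of value.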

    Analytic professionals will be more intimidated by truly unstructured data than by semi-structured data. They may have to wrestle with semi-structured data to bend it to their will, but they can do it. Analysts can get semi-structured data into a form that is well structured and can incorporate it into their analytical processes. Truly unstructured data can be much harder to tame and will remain a challenge for organizations even as they tame semi-structured data.

    EXPLORING BIG DATA

    Getting started with big data isn’t difficult. Simply collect some big data and let your organization’s analytics team start exploring what it offers. It isn’t necessary for an organization to design a production-quality, ongoing data feed to start. It just needs to get the analytics team’s hands and tools on some of the data so that exploratory analysis can begin. This is what analysts and data scientists
