Guidance and results for the data-rich, yet information-poor.

Messy Data

When you’re working with predictive modeling, you’re typically going to run into two different types of data sets.

Structured data sets are the easiest to deal with. Your financial data may be mostly structured, for example. You have a fixed dollar amount which came in during a specific period of time. The numbers always mean the same thing.

“Unstructured” data also contains important information, but it doesn’t live in any numeric form. You may have thousands of customer service calls or records to sort through, or you may be tracking social media content, like the subject matter of your customers’ tweets. Before that data can be used, it has to be prepared and converted into numeric form.
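To make that concrete, here is a minimal sketch of one common way to do that conversion, assuming Python with pandas and scikit-learn. The tweet text and column name are hypothetical placeholders, not data from any real project.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical free-text records, e.g. the subjects of customer tweets.
tweets = pd.DataFrame({
    "tweet_text": [
        "love the new product",
        "billing problem again",
        "shipping was late again",
    ]
})

# Turn each record into word counts: one numeric column per term.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(tweets["tweet_text"])

print(vectorizer.get_feature_names_out())
print(features.toarray())
```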

Most modern data environments are a very messy mix of structured and unstructured data, and usually that data is stored in huge quantities. It’s easy to get overwhelmed when you look at the sheer scope of the data that is available to you.

Fortunately, the problem is not insurmountable if you approach your data in the right way.

First and foremost, you must have a clear idea of what you are trying to accomplish with this data. What are your objectives? Once you know your objectives it’s easier to understand which data won’t make sense for this particular problem set.

Next, you’ll need to take a sample. You’re never going to analyze all of the data that’s available to you. You only need to put together a good enough representation of the whole solution space.
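As a rough illustration of that sampling step, here is one way it might look, assuming Python with pandas. The file name, the “churned” outcome column, and the 10% fraction are all hypothetical.

```python
import pandas as pd

# Hypothetical source file and outcome column.
full_data = pd.read_csv("customer_history.csv")

# Sample 10% within each outcome group so the sample keeps the same
# mix of outcomes as the full data set.
sample = (
    full_data
    .groupby("churned", group_keys=False)
    .apply(lambda grp: grp.sample(frac=0.10, random_state=42))
)

print(len(full_data), len(sample))
```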

In fact, one of the biggest mistakes modelers make is going into the data assuming that “more” is better. In truth, piling on more and more data is actually a good way to ensure the failure of your model.

Fortunately, knowing your targets makes it far easier to decide what you’re going to include. You’ll dive in, pick the data that is relevant to your objective, and then work with that data (and only that data).

Don’t jump into the data without setting your course. That’s like starting a book on Chapter 5. Both actions all but ensure you won’t have a clue what’s going on–which means you’re likely to create an even bigger mess than the one you started with.

As you may know, data preparation is usually the most labor- and time-intensive part of a predictive analytics project. What you may not recognize is that the entire preparation phase needs to be documented on an ongoing basis as you complete this phase of the project.

It may not seem important as you’re doing it. But if your model ever needs any revisions you’re going to need to know what you did–and you’re probably not going to be very accurate if you simply try to pull that information out of your own memory banks.

You may also have to use your model again at a later date, with a new set of data. That model simply won’t give you consistent results if you don’t prepare the new data set in the same way you went about preparing the original data set. Remember you may be translating unstructured data (such as the content of recorded customer service calls) into numbers so that your model can actually read what happened. If you use different numbers the second time around the model’s going to produce very different results.
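One simple way to guard against that drift, sketched below under the assumption that your preparation is done in Python with scikit-learn, is to save the fitted encoder from the original build and reuse it on every later data set. The call-reason categories and file name are hypothetical.

```python
import joblib
from sklearn.preprocessing import OrdinalEncoder

# Original build: learn the category-to-number mapping once, then save it.
call_reasons = [["billing complaint"], ["shipping delay"], ["praise"]]
encoder = OrdinalEncoder()
encoder.fit(call_reasons)
joblib.dump(encoder, "call_reason_encoder.joblib")

# Later scoring run: load the same mapping so "billing complaint" is
# translated to the same number it was given the first time around.
encoder = joblib.load("call_reason_encoder.joblib")
print(encoder.transform([["billing complaint"]]))
```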

This documentation process should continue through all six phases of the model’s development.

Fortunately, TMA has created a way to make your data prep documentation much easier. TMA offers an Excel Workbook template which makes it very easy to document and capture your progress on nearly any kind of predictive analytics project. You will receive this spreadsheet with your other course materials when you sign up for one of TMA’s training classes.

Documentation helps you deal with any data preparation issues on an ongoing basis. If you don’t use TMA’s process, it’s still very important to develop your own method. Otherwise, you’ll be starting from scratch every time there’s a question about the model or models you’ve been developing.

Predictive analytics offers one outstanding strength: it helps to eliminate the inconsistencies in human behavior. It even helps you eliminate your own inconsistencies. You can start making decisions without any fear of unconscious bias.

For example, many employers are biased against hiring employees who have significant gaps in their resume. This is harming a huge portion of the population who lost their jobs, often through no fault of their own, at the beginning of the Great Recession. Predictive analytics has now shown that this attitude is actually costing companies money.

Carl Tsukahara of Evolv writes:

It’s now recognized that the use of predictive analytics can surface powerful conclusions from disparate data sources that can, in turn, serve as the catalyst to foster change in business culture, improve hiring and management practices, and enable more Americans to find gainful employment in fulfilling roles. In my own experience at Evolv, just one of the harmful hiring biases we’ve used predictive analytics to debunk is that “People who haven’t worked recently aren’t viable candidates.” Our technology platform looked across millions of data points on employees across our customer network to prove that the long term unemployed perform no worse than those without an extended jobless spell and have empowered our clients (including several of the companies that supported this week’s legislation) to hire those candidates using a predictive score based on this same technology. We hope this finding in particular helps that 32 percent get that interview, that call back – that chance to show employers that they too, can be great additions to a team.

-Wired.com

Some 300 companies are making significant changes in response to these findings, and they are seeing a benefit. According to the above-referenced article, Xerox was one of these companies. Changing their policies in response to these findings resulted in a 20% reduction in their attrition rate, which in turn saved the company a great deal of money.

So what’s the takeaway for you, and your business? Well, you’ve already gotten the benefit of someone else’s predictive analytics project today: you know that you can hire the long-term jobless without creating any problems for your business. You might even solve some.

But you can also get a great deal more out of predictive analytics just by recognizing that you can use predictive analytics to challenge your own assumptions. You could be making other assumptions about hiring, about employee benefits, about vendors, or about any other aspect of your business. You may think those assumptions are bringing great results…but does the data support your claims?

Asking yourself these questions takes your business intelligence program to the next level. You begin by diving into the problems you know about. You know you want more sales, so you start there. Challenging your assumptions, however, will grant you a path to the problems you just aren’t aware of yet–problems that are costing you time, costing you money, and costing you your next star employee.

Paul Kautza, Director of Education for TDWI, took some time to talk to TMA’s Training Director Tony Rathburn at the 10th Anniversary European TDWI Conference. You can watch the interview in the video below.

Tony’s comments on his European classes could easily apply to TMA’s training classes as well: “One of the major hurdles that organizations are facing is that we have a business perspective on the problem, we have an IT perspective on the problem, and we have a quantitative perspective on analytics projects. People are very skilled at what they’re doing, but they don’t talk to each other very well.”

All of these stakeholders must learn how to talk to each other if data mining and predictive analytics projects are to succeed.


Does Quantity Really Trump Quality?

 

During June’s Q&A session, TMA’s Senior Consultant and Training Director Tony Rathburn said, “Many organizations need to worry about small data before they worry about big data.” Tony added that a greater quantity of data can either enhance precision or add distortion.
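Here is a small illustration of that point using synthetic data and scikit-learn (not anything from TMA’s sessions or client work): the same model is scored with only its informative columns, then again after a pile of irrelevant columns is tacked on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Five genuinely informative columns...
X_useful, y = make_classification(
    n_samples=300, n_features=5, n_informative=5, n_redundant=0, random_state=0
)
# ...plus 200 columns of pure noise bolted on.
rng = np.random.default_rng(0)
X_padded = np.hstack([X_useful, rng.normal(size=(300, 200))])

model = LogisticRegression(max_iter=1000)
print("informative columns only:", cross_val_score(model, X_useful, y).mean())
print("with 200 noise columns:  ", cross_val_score(model, X_padded, y).mean())
```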

These sentiments are echoed by James Guszcza and Bryan Richardson in their excellent essay, Two Dogmas of Big Data: Understanding the Power of Analytics for Predicting Human Behavior.

What is small data, and what can organizations do with it?

Small data is readily available, less controversial, and much higher in value.

Guszcza and Richardson used the example of a university admissions office to drive their points home. The office wanted to admit the students who truly had the best chance of succeeding at the university. The data set needed to solve the problem was already at hand: high school transcript data, which students had already voluntarily provided as part of the routine admissions process.

This data is not particularly glamorous or unstructured. There’s nothing new about using transcript information to make admissions decisions. The difference was in the tools the office was now using to make better decisions.

Takeaway question: what readily available, uncontroversial data sources might your organization have on hand right now? Chances are that data already hints at some aspect of human behavior, something your organization wants either more of or less of. And although the vast majority of most large organizations’ data is text or otherwise unstructured, the biggest bang for an initial effort comes from your readily available structured data.

For example, suppose you take credit applications for an item you sell. Those applications contain voluntary bits of small data, such as ZIP codes, which can hint at average home values. Each application also tells you how much money each prospect makes.

In the retail industry, perhaps some portion of your customer base regularly experiences “buyer’s remorse,” which leads to chargebacks and a loss of money. You want to minimize this behavior (chargebacks). The relatively small structured data located on those credit apps may be the primary source of information you need in order to make a substantial impact on reducing returns. It may tell you that you need to stop sending your reps to homes that are worth $300,000 or more. You and your reps have been assuming that a $300,000 home means the buyer has enough money to purchase the product. In truth, the data may reveal that the homeowners at that price point, who leverage certain levels of debt and a few other criteria that the model recognizes in combination, are commonly “in over their head” and unable to take on more credit obligations. The real sweet spot for your sales force may instead lie within homes in the $150,000 to $225,000 range, along with several other factors that the predictive model considers for an overall targeting score.
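As a rough sketch of how a model like that might be put together, here is a toy example in Python with scikit-learn. The field names, the handful of made-up applications, and the choice of logistic regression are all illustrative assumptions, not TMA’s or any client’s actual model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical structured fields pulled from past credit applications,
# with 1 marking an order that ended in a chargeback.
apps = pd.DataFrame({
    "home_value": [310000, 160000, 450000, 210000, 120000, 380000],
    "income":     [72000,  58000,  95000,  61000,  47000,  88000],
    "debt_ratio": [0.55,   0.22,   0.61,   0.25,   0.30,   0.58],
    "chargeback": [1,      0,      1,      0,      0,      1],
})

model = LogisticRegression(max_iter=1000)
model.fit(apps[["home_value", "income", "debt_ratio"]], apps["chargeback"])

# Score a new prospect before sending a rep to the door.
new_prospect = pd.DataFrame(
    {"home_value": [320000], "income": [70000], "debt_ratio": [0.50]}
)
print(model.predict_proba(new_prospect)[:, 1])
```

In practice you would build this on thousands of applications and validate it on held-out data before letting it steer your reps.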

You don’t have to dig deeply into your existing data in order for a model to arrive at highly impactful results. You probably could have run the same highly effective predictions with the same data decades ago, as well.

Small data does not require Big Data’s “3Vs.”

“Big” data comes with problems of its own. By its very definition, it has “such high volume, variety, and velocity so as to create problems for traditional data processing and analysis technologies” (Guszcza and Richardson). Most organizations don’t truly need or want to get anywhere near Big Data, even if they don’t realize it. Warehousing all that data becomes a huge project in its own right, and one that is low value for most companies.

Small data holds high value with low effort.

As The Modeling Agency insists to its clients: even if the vast majority of the data you collect and store is unstructured (textual) “Big Data,” start your analytics with your available structured (numeric) data.

Structured data is like pressure-laden oil reserves at the surface, as opposed to Big Data, which is deep and challenging to extract. There is certainly value in Big Data, but like drilling deep and sideways, it takes far more effort for incremental gain. Further, organizations will have far greater impact with a sound strategic approach to analytics that ensures actionable models that fit the environment and align well to organizational objectives than they will by attempting to transform textual data into an optimized model that fails for a host of strategic reasons.

The Modeling Agency guides organizations to conduct a comprehensive assessment of their resources, environment, situation, and objectives to identify valid opportunities for actionable and measurable results that leadership will value. Part of the assessment is to consider all available data resources – big or small, structured and unstructured – to ensure they support organizational priorities, and to develop a sound project definition for a superior analytic process. From there, TMA guides the organization’s analytic practitioners through a six-phase modeling process so that they assume full ownership of a highly effective and impactful analytic practice.

Don’t buy into the industry’s Big Data hype. Those who step away from the buzz and apply sound strategic implementation, with training and experienced guidance, will leapfrog their competition.

 
