

Common Errors in the Use of Overlay Data

By Jim Wheaton
Principal, Wheaton Group 

Original version of an article that appeared in the September 12, 1994 issue of "DM News"  

Individual and household overlay data often play major roles in descriptive as well as predictive research.  But the careless use of this information can result in more harm than good when it comes to making accurate marketing decisions.  However, by adhering to certain methods of incorporating overlay data into research and by properly interpreting the results, many of these errors can be avoided.

Descriptive Research Applications 

Handling Missing Data 
One common error in descriptive research applications stems from the fact that individual and household overlay data invariably cannot be applied to a significant percentage of a given file.  The portion of a file to which a specific data element cannot be applied generally ranges from 20% to 95%.  As a result, whenever marketing decisions are based on a given demographic or lifestyle variable, it is often implicitly assumed that the individuals for whom no data coverage exists have the same profile as those for whom it does.

Consider a file in which the average age of the codeable records is 44.  Any marketing decisions that result from this information will be appropriate only if the uncodeable portion of the file also has an average age that approximates 44. 

Unfortunately, uncodeable individuals almost always are demographically different from the codeable, because representation on major overlay databases is skewed towards older, more stable individuals.  I call this the "Ozzie and Harriet factor."  The extent to which an individual has a mortgage, children, credit cards, and the like is the extent to which this individual is likely to be represented on a given overlay database. Conversely, those individuals who cannot be matched to an overlay database tend to be young renters who move frequently.  These people generally also are not affluent and not married.

Let's get back to our example, in which the average age of the codeable records is 44.  This is exactly what happened to the client of a major data compiler who, finding this average to be counter-intuitive, sought a second opinion.  Fortunately, techniques exist to adjust demographic and lifestyle profiles for the systematic bias that is inherent in missing data.  Application of one such adjustment algorithm shifted the average age of the file from 44 to 30!  This lower estimate agreed exactly with the client's "gut instinct," as well as with extensive survey research.
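
The adjustment algorithm itself is not described in this article.  As a rough illustration of why the correction can be so large, the sketch below blends the codeable average with a separately estimated average for the uncodeable records, weighted by coverage.  The function name, the 40% coverage rate and the uncodeable estimate of 28 are all hypothetical; in practice that estimate would have to come from an outside source such as survey research.

```python
def coverage_weighted_mean(coded_mean, uncoded_mean_estimate, coverage):
    """Blend the mean of coded records with an estimate for uncoded records.

    coverage is the fraction of the file the overlay could be applied to.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return coverage * coded_mean + (1.0 - coverage) * uncoded_mean_estimate


# Hypothetical inputs: 40% of the file is codeable with an average age of 44;
# the uncodeable 60% is assumed (from some outside estimate) to average 28.
naive_age = 44.0
adjusted_age = coverage_weighted_mean(coded_mean=44.0,
                                      uncoded_mean_estimate=28.0,
                                      coverage=0.40)
print(f"naive: {naive_age:.1f}, adjusted: {adjusted_age:.1f}")
```

The lower the coverage, the more the adjusted figure is driven by the uncodeable estimate rather than by the overlay data.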

Hazards of "Marketing to the Mean" 
Another frequent mistake is what I'll call "marketing to the mean."  It is critical to look beyond the average to the distribution of a given variable.  A real-life example is a well-known fashion magazine whose average adjusted subscriber age is 36, but which in actuality has two target audiences: 

  • New-to-the-workforce 18-to-22-year-old women who view the magazine as a "wish book."
  • Affluent women in their late-40s to mid-50s who reference the magazine when making purchase decisions.

In fact, individuals who are the average age of 36 are very poor prospects because many are parents with mortgages and little discretionary spending power for high-priced fashion merchandise. 
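
A minimal sketch of the same point, using made-up subscriber ages and age bands (both are assumptions chosen only to mimic a two-audience file): the average lands in the mid-30s even though no subscriber falls in the band that contains it.

```python
from statistics import mean

# Hypothetical subscriber ages with two clusters, as in the magazine example.
ages = [18, 19, 20, 21, 22, 20, 19, 21,
        48, 50, 52, 54, 49, 53, 51, 55]

# Hypothetical age bands (inclusive bounds).
bands = [(18, 29), (30, 44), (45, 64)]


def count_by_band(values, bands):
    """Return {(low, high): count} for each inclusive age band."""
    return {(lo, hi): sum(lo <= v <= hi for v in values) for lo, hi in bands}


average_age = mean(ages)
band_counts = count_by_band(ages, bands)

print(f"average age: {average_age:.2f}")
for (lo, hi), n in band_counts.items():
    print(f"{lo}-{hi}: {n}")
```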

Multiple Overlay Variables 
Another common error is the assumption that the demographic and lifestyle overlay variables that stand out or "pop" on a file all describe the same group of individuals.  Assume, for example, that the following characteristics are over-represented on a file of diamond ring buyers:  "young," "male," "affluent" and "married."  It could be hazardous to conclude that the target audience is young, affluent, married males.  Multiple audiences could just as likely exist, such as: 

  • Young (single) males (of various income levels) who purchased an engagement ring.
  • Affluent couples (of various ages) who bought a ring to commemorate an important wedding anniversary.

This distinction has profound marketing implications.  Fortunately, multivariate statistical techniques such as CHAID (Chi-Square Automatic Interaction Detection) have the power to identify situations in which multiple target audiences exist. 
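
The sketch below is not CHAID; it is only a simple cross-check, on hypothetical buyer records, that shows why one is needed.  Each trait taken alone covers a large share of the file, yet no record carries all four traits together.  The records, trait names and `share_with` helper are assumptions made for illustration.

```python
# Hypothetical buyer records; each flag is 1 if the overlay trait applies.
young_single_males = [{"young": 1, "male": 1, "affluent": 0, "married": 0}] * 5
affluent_couples = (
    [{"young": 0, "male": 1, "affluent": 1, "married": 1}] * 3
    + [{"young": 0, "male": 0, "affluent": 1, "married": 1}] * 2
)
buyers = young_single_males + affluent_couples
traits = ["young", "male", "affluent", "married"]


def share_with(records, required):
    """Fraction of records in which every trait in `required` is present."""
    hits = sum(all(r[t] for t in required) for r in records)
    return hits / len(records)


# Share of the file carrying each trait on its own.
marginal_shares = {t: share_with(buyers, [t]) for t in traits}
# Share of the file carrying all four traits at once.
joint_share = share_with(buyers, traits)

print(marginal_shares)
print(f"all four traits together: {joint_share:.0%}")
```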

Predictive Research Applications 

Problems with Static Data 
A frequent problem when overlay demographics are incorporated into predictive models is that static data — sources purchased outright and not periodically updated — change meaning over time.  This occurs because of the large percentage of individuals who move every year.  This, in turn, results in an ever-increasing overlay rate for older, more stable people compared with their younger, more mobile counterparts. 

A real-life example of this phenomenon is a regression model in which two "political affiliation" overlay variables, "conservative" and "liberal," both "popped" positively.  The reason is that the variables were several years old and were rapidly becoming surrogates for the target audience: stable individuals in their 40s and 50s.
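
The drift can be approximated with simple arithmetic.  The sketch below assumes that a record stops matching once the person moves and that a fixed fraction of each group moves every year; both the model and the move rates are hypothetical simplifications, not measured figures.

```python
def surviving_match_rate(initial_rate, annual_move_rate, years):
    """Share of a group still matchable after `years` with no data refresh.

    Assumes a record stops matching once the person moves, and that the
    same fraction of the group moves each year (a simplification).
    """
    return initial_rate * (1.0 - annual_move_rate) ** years


# Hypothetical annual move rates for two groups.
move_rates = {"younger, mobile": 0.30, "older, stable": 0.08}

rates_after_3_years = {
    group: surviving_match_rate(1.0, rate, years=3)
    for group, rate in move_rates.items()
}

for group, rate in rates_after_3_years.items():
    print(f"{group}: {rate:.1%} still matched after 3 years")
```

Under these assumptions the gap between the two groups widens every year, so the mere presence of any static variable increasingly signals "older and stable" regardless of what the variable nominally measures.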

Hazards of Short-Term Data Fluctuations 
Unfortunately, even non-static data can change meaning over time.  An excellent example is "length of residence," a common overlay variable.  Because of peculiarities in the update cycle of at least one major data compiler, for three months every year essentially no one on its file shows a "length of residence" of less than one year.  In the absence of an adjustment to reflect this phenomenon, this would be problematic for a model developed for "new mover" merchandise such as window treatments. 

Predictive Power of Missing Data 
Many statisticians are unaware of the often remarkable explanatory power that is inherent in missing data.  Sometimes, for a given individual, the inability to apply specific demographic or psychographic information is more predictive than the information itself.  This has to do with the missing data bias discussed earlier. 

As an example, let's revisit the "length of residence" variable, which is created in significant part by comparing names at specific addresses in phone directories from one year to another.  Besides the usual problem of younger, mobile individuals having lower hit rates, we have additional bias because of those demographic groups that have a higher probability of opting for an unlisted telephone number.  With the unlisted-number group, it is very likely that the information required to calculate "length of residence" cannot be obtained.  These people generally fall into one of the following categories: 

  • Single women, urban residents, and the very affluent (who opt for unlisted numbers for security reasons).
  • The very poor (who cannot afford phones).

Therefore, the absence of "length of residence" information increases the probability that a given individual belongs to one or more of the groups listed above.  This might very well be more predictive than the knowledge that a given individual, for example, has resided at his or her address for three years.

In order to capture the predictive power of missing demographic and psychographic information, it is critical that missing data for a given predictor variable be assigned its own value when building a model.  This is contrary to the practice of many statisticians, who set missing data to the mean of all the observations for which information exists.  Others default to the equivalent Census-level variable, which is an improvement but still not optimal. 
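
A minimal sketch of the two treatments, using hypothetical "length of residence" values in which `None` marks a record the overlay could not code.  The function names and the fill value of 0 are assumptions; the point is only that the second encoding keeps "missing" visible to the model while mean imputation erases it.

```python
from statistics import mean

# Hypothetical overlay values; None means the element could not be applied.
length_of_residence = [3, None, 10, None, 1, 7, None]


def impute_mean(values):
    """Common practice: replace missing values with the mean of known ones."""
    known_mean = mean(v for v in values if v is not None)
    return [known_mean if v is None else v for v in values]


def encode_with_missing_flag(values, fill=0):
    """Give missing data its own value: a filled value plus a 0/1 indicator."""
    return [(fill, 1) if v is None else (v, 0) for v in values]


imputed = impute_mean(length_of_residence)
encoded = encode_with_missing_flag(length_of_residence)

print(imputed)   # missing records are indistinguishable from average ones
print(encoded)   # the second element of each pair flags missing records
```

With the indicator in place, a model can estimate a separate effect for the uncodeable records instead of treating them as typical ones.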

A wonderful example of the potential predictive power of missing data is what I refer to as "The Unmodel," which was constructed to segment several large outside rental lists.  The top decile was driven largely by the absence of information on multiple overlay elements.  This is because the target audience was composed of "un-Ozzie and Harriets": single, downscale renters of apartment units. 

Consider, for example, the univariate relationship to response of several "Unmodel" predictor variables, where the "Missing" categories all correlate very highly with response:   

[Charts not reproduced: Response Rate by Income, Response Rate by Credit Card, Response Rate by Age]
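
A univariate view of this kind can be tabulated as sketched below.  The records, band labels and response flags are hypothetical and do not reproduce the "Unmodel" figures; the sketch only shows "Missing" being reported as a category of its own rather than dropped.

```python
from collections import defaultdict

# Hypothetical mailing records: (income band or None if missing, responded).
records = [
    ("Under $30K", 0), ("Under $30K", 1), ("Under $30K", 0), ("Under $30K", 0),
    ("$30K and over", 0), ("$30K and over", 0),
    ("$30K and over", 0), ("$30K and over", 1),
    (None, 1), (None, 1), (None, 0), (None, 1),
]


def response_rate_by_category(rows):
    """Return {category: responders / mailed}, with None shown as 'Missing'."""
    mailed = defaultdict(int)
    responded = defaultdict(int)
    for category, flag in rows:
        key = "Missing" if category is None else category
        mailed[key] += 1
        responded[key] += flag
    return {key: responded[key] / mailed[key] for key in mailed}


rates = response_rate_by_category(records)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%}")
```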

The resulting performance was quite good for a prospecting model, with "lift" — top 10% to average — of 209 (and "lift" — top 10% to bottom 10% — of 475).
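
For readers unfamiliar with the measure, the sketch below shows one common way such figures are computed, assuming that "lift" is expressed as an index in which 100 means no difference.  The decile response rates are hypothetical and are not the "Unmodel" results.

```python
def lift_index(group_rate, base_rate):
    """Index a response rate against a base rate (100 = same as base)."""
    return 100.0 * group_rate / base_rate


# Hypothetical response rates (%) for ten equal-sized deciles, best first.
decile_rates = [4.0, 3.0, 2.5, 2.0, 1.8, 1.6, 1.4, 1.3, 1.2, 1.2]

# With equal-sized deciles, the file average is the mean of the decile rates.
average_rate = sum(decile_rates) / len(decile_rates)
top_to_average = lift_index(decile_rates[0], average_rate)
top_to_bottom = lift_index(decile_rates[0], decile_rates[-1])

print(f"top 10% to average: {top_to_average:.0f}")
print(f"top 10% to bottom 10%: {top_to_bottom:.0f}")
```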

Conclusion 
The use of overlay data can have a powerful impact on direct marketing research, if applied properly.  To ensure the effective incorporation of overlay data and the correct interpretation of results, there are several rules to keep in mind. 

First, for descriptive research, demographic and lifestyle profiles must be adjusted to reflect the "Ozzie and Harriet" bias that is inherent in major overlay databases.  It is also important to consider profile distributions rather than means when drawing marketing conclusions.  And finally, never assume that multiple overlay variables that "pop" on a file all describe the same group of individuals. 

For predictive research, incorporate static data into models with caution, recognizing that their meanings will change over time as they become surrogates for older, more stable individuals.  Also, be mindful of the fact that even non-static data can change meaning as suppliers update their databases.  And finally, recognize and take advantage of the fact that missing data often can provide remarkable explanatory power. 

Jim Wheaton is a Principal at Wheaton Group, and can be reached at 919-969-8859 or jim.wheaton@wheatongroup.com.  The firm specializes in direct marketing consulting and data mining, data quality assessment and assurance, and the delivery of cost-effective data warehouses and marts.  Jim is also a Co-Founder of Data University (www.datauniversity.org). 


Copyright © 2004 Wheaton Group LLC. All rights reserved.