Statistical modeling is now well and truly in the spotlight, but it is very clear, from the way that stories depending on statistical models are being covered in the media, that many journalists have only the haziest notion of how to present and interpret numbers. Worse still, it seems that the public is very uncertain too. Numbers are not something you can leave to the finance department. They are the basis of how companies are measured, how external bodies like banks assess a business, and, in the last analysis, whether or not a business can continue: is it a ‘going concern.’ Unfortunately, in business, most numbers are ‘fuzzy,’ and a lot may be missing or not available at a reasonable cost. A model can be used to set out a scenario, but it is not a shortcut to the gods of prophecy.
“LIFE IS LIKE A SEWER” – Tom Lehrer
Models, like sewers, can only put out what’s put into them. Numbers are a representation of reality, not reality itself. How we look at them and how we produce them affect how we can use them.
Classification sounds incredibly dreary, but it governs how we look at numbers. Is a tomato a fruit, a vegetable, or an ingredient? If tomato production is included within fruit output (alongside apples), or vegetables (alongside potatoes), or as an ingredient (alongside skim milk powder), how does that affect the scale and behavior of your measurement of these sectors? Is ‘big’ or ‘little’ more or less important than moving ‘fast’ or ‘slow’? Is your business mainly paid for being ‘big’ – in which case, being slow may not be a problem – or being ‘fast’? How much does it cost to be ‘fast’? If it’s expensive and unimportant, why measure it? Any measurement of people must also recognize that people are not ball bearings; they have opinions, memories, and instincts, all of which affect behavior. In times of stress they display many reactions which are not about cool, measured calculation.
In the shorthand developed by behavioral economists, they are not ‘econs.’ They are ‘humans.’ This might be inconvenient, but it is unavoidable.
Lots or a Little
Let’s start with arithmetic: 2 is smaller than 4. If your preference survey tells you that 2 out of 10 users of a product prefer your version, but 4 out of 10 prefer your competitor’s, what does that mean? Is it a reason to change your product? Now change the context: 2 out of 10 users still prefer your product, but only 1 in 100 people actually use the product at all. What is your biggest problem? Now raise the risk type: 2 out of 10 of your customers are getting a really bad allergic reaction to your product. What do you do? Now change the risk level: 2 out of 100 of your customers get a really bad reaction. What do you do? Now make that 2 in 1,000. Does the fact that it is 2 make it less important than if it were 4? The fact that 2 is smaller than 4 is a fact, but not always the most important fact, or even one that requires action. Now add another factor. Your measurement system is accurate to plus or minus 25%.
‘2’ might really be anything from 1.5 to 2.5.
‘4’ might really be anything from 3 to 5.
Two may not be that much smaller than four. ‘A thousand people have died’ – that’s a lot of people, and a lot of friends and family who will be affected. But over what time period, and how many people in the group continue to live? Is that unusually high, unusually low, or average for that group of people, over that timescale? Is the number actually a round, whole 1,000, no more and no less? When was it measured?
“I wouldn’t go down that road, 3 people got killed.”
“How?”
“In a car crash.”
“What happened?”
“Driver had a heart attack.”
“When was this?”
“Three years ago.”
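Going back to the plus-or-minus 25% example: a minimal sketch in Python of what that measurement band (the hypothetical tolerance from above) does to a ‘2’ and a ‘4’:

```python
# Hypothetical example: two survey readings, each with a +/-25% measurement band.
def error_band(reading, tolerance=0.25):
    """Return the (low, high) range a reading could really represent."""
    return reading * (1 - tolerance), reading * (1 + tolerance)

for reading in (2, 4):
    low, high = error_band(reading)
    print(f"A measured '{reading}' could really be anywhere from {low} to {high}")
```

Note the bands: ‘2’ runs up to 2.5 and ‘4’ runs down to 3. The gap between the two readings is much narrower than ‘2 versus 4’ suggests.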
Numbers: Rarely Plain and Never Simple
Some ‘gee whiz’ numbers are magnets for news headlines: notably anything ending in ‘illion.’ ‘Half a million’ sounds bigger than 500,000, and a lot bigger than ‘1 in 20’ or ‘1 in 100.’ For modeling purposes, any real-world number must be assumed to be an estimate. The numbers included in the model need to be assessed for their impact on the model, not for how easy they are to collect; how they are collected and recorded must also be assessed. Many figures published with great assurance in the media are estimates or sample surveys – GDP, money supply, inflation, the producer price index. It can be assumed that all data on consumer attitudes and buying behavior is an estimate derived from a sample. In that case, the framing of the question and the size of the sample are essential information.
So How Do We Treat ‘Numbers’?
Statistics can only produce a model which is as good as the underlying data. Poor data cannot be changed into good data by statistical manipulation. (That’s called ‘making it up.’) Some things are easy to measure – the number of tons of sugar delivered to a biscuit factory, for example – but ‘easy’ does not necessarily mean ‘useful.’ There are several sayings to bear in mind here:
‘What can be measured will be managed.’ Measurable factors can be acted on and altered.
Use: If your order for sugar is US$20 per ton more expensive than an alternative source, this fact can be used in your cost profile. The sugar supply can be characterized by numerical values such as cost, or by product measurements such as water content.
Risk: By default, other factors which cannot be measured or managed in this way may be discounted.
‘What can be measured may not matter.’
Use: Measurements made for one purpose, such as production factors like weight or water content, may not be applicable to others. ‘Quality’ means different things to the production manager, the product designer, and, maybe, the consumer. It is important to identify what is measurable, but not to apply a factor where it has no meaning. In addition, if the cheap source of sugar is not reliable but the expensive source is, the extra cost can be seen as a form of insurance on reliability of supply.
Risk: Price is measurable, definable and important. If you get your cost profile wrong, you’re out of business. Some measurements are mission-critical – but not all.
‘What cannot be measured may matter most.’
Use: For most food companies supplying consumer markets, what matters most is their reputation. Buying Fairtrade sugar does not deliver direct measurables – it is unlikely to be cheaper or even to be of different measurable physical quality. It can, however, be integral to a product promise and a source of business differentiation.
Risk: If a factor is highly subjective, it is very hard to manage. Measures of ‘approval’ or ‘preference’ have no innate basis, because my opinion of my ‘approval’ can change for all sorts of reasons – including whether it’s a sunny day outside.
Complexity and Coin Flips
A complex model is not necessarily better than a simple one; in fact, there are good reasons why utility declines with complexity. Each set of data comes with its own potential for error. If you combine 2 datasets, their accuracies multiply together. Example: Dataset 1 is 95% accurate. Dataset 2 is 90% accurate. The combined accuracy for both datasets together is 0.95 multiplied by 0.90, or roughly 0.86. If you add another dataset at anything less than 100% accuracy, overall accuracy falls again. Adding datasets to increase the number of factors in a model increases error. A model with 2 factors has 2 sources of error; a model with 6 has 6. (There may be more if factors start to ‘interfere’ with each other because they are not truly independent, but let’s keep it simple.) Even if accuracy for each factor is an impossibly high 98%, the 2-factor model will have accuracy of 96%, while the 6-factor model will have accuracy of 89%. At 90% accuracy – still extremely high – 2 factors take results down to 81% accuracy and 6 take it to 53%.
Once accuracy deteriorates to approaching 50%, those are the odds of flipping a coin. The model adds nothing.
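A short sketch of the arithmetic, assuming (as above) that the factors are independent, so their accuracies simply multiply:

```python
# Combined accuracy of a multi-factor model: the product of the per-dataset
# accuracies, assuming the error sources are independent of each other.
def combined_accuracy(accuracies):
    result = 1.0
    for accuracy in accuracies:
        result *= accuracy
    return result

print(combined_accuracy([0.95, 0.90]))  # ~0.855: the two-dataset example
print(combined_accuracy([0.98] * 6))    # ~0.886: six factors at 98% each
print(combined_accuracy([0.90] * 6))    # ~0.531: six factors at 90% - coin-flip territory
```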
Scale
Statistics is, from one point of view, the art of making a few represent the many. You do not need to count every item or event if you have a representative sample. This is faster, cheaper, and for practical purposes as accurate as a ‘census’ in which every item or event is counted. If you have a truly random sample, it will be representative, but true mathematical randomness is hard to achieve. We also hit a credibility problem: unless trained in statistics, people find it hard to believe that ‘random’ is better than picking the ‘right’ sample.
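As a toy illustration of a few representing the many, here is a sketch with an invented population (all figures hypothetical):

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical 'universe': 100,000 customers, 30% of whom prefer our product.
population = [1] * 30_000 + [0] * 70_000

census = sum(population) / len(population)   # count everyone: slow and costly
sample = random.sample(population, 1_000)    # or ask a random 1,000 of them
estimate = sum(sample) / len(sample)

print(f"Census: {census:.1%}; random sample of 1,000: {estimate:.1%}")
```

A random sample of 1,000 will typically land within a couple of percentage points of the census figure, at a fraction of the cost.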
Researching on a big scale only matters if you think there will be a lot of subsets with distinctively different profiles; in other words, that your ‘universe’ is really composed of several very distinct and separate universes, which can be shown to act differently and not to influence each other. The type of risk also plays a role. Phase 3 of pharmaceutical research is large scale because the risk for a small subset of people who may be sensitive to a drug could be death. Scale has to be sufficient to make the results statistically reliable, and the choice of subjects, if not random, also has to be matched to the profile of the universe. That means the whole universe, not the parts that seem ‘likely,’ because the purpose of the exercise is to determine what ‘likely’ means.
Where subgroups are very small, the risk of extreme results rises sharply. Suppose the real incidence of a preference for your product is 10%. In a sample of 12, each respondent represents 1 in 12, or about 8%, of the total, so the ‘real’ figure will usually show up as 1 respondent (8%) or, less frequently, as 2 (17%). The study may throw an outlier result of 3 or 4, but this is unlikely to be repeated if you sample again. In a sample of 4, each respondent is 25% of the total, so the real value can only be represented by zero or by 1. Zero is misleading: there is a value there. But if the score is 1, that’s 25%, which is also very misleading. The potential results are either that there is nothing there, or that incidence is more than double the real figure. Repeating the measurement with another small sample does not help much.
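A quick simulation of that small-sample problem (hypothetical figures: a true incidence of 10%, as in the example above):

```python
import random
from collections import Counter

random.seed(1)
TRUE_RATE = 0.10   # the 'real' incidence from the example
TRIALS = 10_000    # how many times we repeat each survey

def observed_scores(sample_size):
    """Tally the count observed in each of many repeated small samples."""
    return Counter(
        sum(random.random() < TRUE_RATE for _ in range(sample_size))
        for _ in range(TRIALS)
    )

for n in (12, 4):
    counts = observed_scores(n)
    shares = {score: f"{hits / TRIALS:.0%}" for score, hits in sorted(counts.items())}
    print(f"sample of {n}: {shares}")
```

With a sample of 12, the most common result is 1 respondent (about 8%), with 2 not far behind, bracketing the true 10%; with a sample of 4, the result is almost always 0 (nothing there) or 1 (25%, more than double the real figure).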
Approximation and Suspicion
If your calculations and models produce a really surprising result – check them. One way to know whether a result is out of line is a quick and rough calculation: not 2,563 times 246, but 2,600 times 250, or even just 260 times 25 – since 25 is a quarter of 100, multiply 260 by 100 and divide by 4, then restore the dropped zeros. This will at least identify results which are out by a factor of 10 or 100. Excel spreadsheets – and other types of spreadsheet – are not perfect. If your Excel insists on giving a strange or wrong answer, check that there is no existing calculation lurking in what looks like a blank cell, or put the calculation in a new spreadsheet. Excel formulas follow strict mathematical logic, including the use of brackets. Break a complex calculation down into smaller formulae to check each step. Test each step for quality of data, the calculation used, and how likely the result is. The arrangement of brackets can make a big difference to the end result. (All this is old news to people with a maths or physics background, but not everyone has that.) If pasting results or figures across from one sheet to another, test with an approximation. If it looks wrong, check that calculations have been pasted correctly.
In conclusion – Excel does not always tell the truth.
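A minimal version of that sanity check in code (the figures are those from the example above):

```python
exact = 2_563 * 246   # the calculation you actually need
rough = 2_600 * 250   # rounded figures you can do in your head

# If exact and rough disagree by a factor approaching 10 or 100, something
# is wrong - often a misplaced decimal point or a stray cell in the sheet.
ratio = exact / rough
print(f"exact={exact:,}  rough={rough:,}  ratio={ratio:.2f}")
assert 0.5 < ratio < 2, "Result is out of line with the rough estimate - check it"
```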
Nobody Lives in a Test-Tube
When creating a model, the main factors which affect an outcome may not be directly related ones. As an extreme example, looking at sales data for ready-to-eat sandwiches from January 2020 through to April 1st, and comparing it to sales data from 2019, will show a big drop in volume and value. This will not be due to standard marketing factors such as differences in costs or prices between 2019 and 2020, nor to competitive activity. It is due to the enforced closure of outlets selling ready-to-eat (RTE) food, and the disappearance of the primary market for RTE food, namely office workers. Such extreme events can come from anywhere: an outbreak of salmonella at a supplier, a fire at a main factory, suspension of production on a distant supply line due to an earthquake. You cannot ‘model’ these events because their frequency is very low, but you can run a scenario to check out the potential level of impact. There is a phrase often used in models – ‘present trends continuing.’ The problem is that present trends (i.e., all the factors that have not been modelled to change) do not always continue.
The model may give a baseline for calculations, but it is not a forecast, and should not be treated as one, because real life contains many unknowable factors.
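A scenario run can be as simple as applying an assumed shock to a baseline. A hypothetical sketch (every figure here is invented, including the size of the shock):

```python
# Hypothetical baseline: monthly RTE sandwich sales, in units.
baseline = {"Jan": 100_000, "Feb": 98_000, "Mar": 101_000, "Apr": 99_000}

SHOCK = 0.80          # assumption: outlets close and volume falls 80%...
SHOCK_START = "Mar"   # ...from March onward

scenario, shock_active = {}, False
for month, units in baseline.items():
    shock_active = shock_active or month == SHOCK_START
    scenario[month] = round(units * (1 - SHOCK)) if shock_active else units

print(scenario)  # shows the potential level of impact - not a forecast
```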
If It’s Broken, Fix It
When a factor can be seen to affect an outcome, the next question is what to do about it. Simply creating a measurement without a real-life response mechanism is a waste of effort: like putting a fire alarm on a house, but not having a fire brigade to answer it. If your model brings up negative results, what actions does that lead to?
Over what timescale, in which sequence?
How will you know if your actions:
a) are succeeding and so need to continue?
b) are not succeeding, and so need to change?
When do you make a decision to move from a) to b), or the other way?
Is that decision impeded by behavioral factors, like management commitment to solution (a), or a financial, social or legal problem about acting on (b)?
How much is that behavioral decision going to cost you?
Is it possible to avoid the behavioral aspect, or is that, in fact, a key component of the decision?
Continuity Matters
This author has just been taking part in a ‘citizen science’ project to digitize rainfall readings taken across the UK since the 1830s. The data currently exists on handwritten sheets. A full run of data will create more information for weather forecasting models over a long period, allowing recurring and non-recurring patterns to be identified. Some weather patterns are measured in decades, and some events are very, very infrequent but not impossible: unlikely over any individual decade, but more likely to be seen over a century. Continuous data becomes more valuable the longer the dataset runs. While a short run of a few datapoints may indicate short-term business responses to immediate problems or opportunities, to get a reading on a longer-term underlying trend you need long-run data.
Business data is very rarely available in a consistent form for more than a few years. This makes business modeling difficult. The frequency with which data is collected may also not be enough for long-term trending: quarterly figures over 5 years only give you 20 data points. So long-run data is additionally valuable. The other way to get more data is to run several sets side by side: not only company A over the last 5 years, but companies B, C and D, if they are in the same markets and have similar structures. Many of the external factors impacting the companies will be the same, so identifiable differences could be traceable back to different actions. However, if all the companies are affected by a single, extraordinary event – such as COVID-19 – then their differences may disappear, and this data loses value as well: it cannot be used to draw distinctions or rationales for different performance.
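The arithmetic of running sets side by side, as a sketch (company names and counts hypothetical):

```python
YEARS, QUARTERS_PER_YEAR = 5, 4       # five years of quarterly figures
companies = ["A", "B", "C", "D"]      # comparable companies in the same markets

single = YEARS * QUARTERS_PER_YEAR    # 20 data points from one company
panel = single * len(companies)       # 80 data points from the panel
print(f"One company: {single} points; panel of {len(companies)}: {panel} points")
```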
Models: Useful, But Not the Real Thing
Because we all now have access to vast computing power on our desks, there is a temptation to think that all these numbers can be put into strange and wonderful new patterns which will reveal the future. Sometimes this can be true, but the application is limited. A large, plastic doll is a model of a human being. The factors which are represented are those chosen by the doll-maker. If it’s a crash test dummy, the doll will replicate the action of flesh and bone under physical stress, but you wouldn’t be fooled if it sat beside you in a cafe. The doll-maker may choose to represent human internal organs, but they won’t work like the real thing. For a crash test dummy, this extra modeling may be worthwhile; for a shop mannequin, it is not. Putting excessive weight on a single model, or, even worse, making it beyond criticism, helps no one. Use models, but don’t worship them.