The idea is that, using historical data, one can determine the relationship between retail sales and the answers to each question. By following the tree down, one can then assess the probability that retail sales rose given the other information.
This process also shows how much information each question contributes to predicting retail sales. Suppose the question is “is the first day of the month a weekday?”. If retail sales rose in the same proportion of months for both possible answers, then this question doesn’t provide any additional information.
That result indicates there is no apparent link between retail sales and the day on which the month starts. Comparing questions in this way shows which ones provide the most information.
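The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name and the eight months of history are made up, not ANZ data, and a real decision tree would use a formal measure such as information gain rather than eyeballing two proportions.

```python
from collections import defaultdict

def rise_rate_by_answer(answers, rose):
    """Share of months in which retail sales rose, grouped by the answer to one yes/no question."""
    counts = defaultdict(lambda: [0, 0])  # answer -> [months with a rise, total months]
    for answer, did_rise in zip(answers, rose):
        counts[answer][0] += int(did_rise)
        counts[answer][1] += 1
    return {answer: rises / total for answer, (rises, total) in counts.items()}

# Made-up history for illustration only (hypothetical values, not ANZ data).
sales_rose        = [True, False, True, False, True, False, False, True]
starts_on_weekday = [True, True, False, False, True, False, True, False]
employment_rose   = [True, True, True, False, True, False, False, False]

weekday_rates = rise_rate_by_answer(starts_on_weekday, sales_rose)
employment_rates = rise_rate_by_answer(employment_rose, sales_rose)

print("weekday start:", weekday_rates)       # same rate for both answers: no information
print("employment rose:", employment_rates)  # rates differ: the question is informative
```

In this toy history the weekday question splits the months into two groups with identical rise rates, while the employment question separates them, so only the second would be worth asking.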
The leap from a single decision tree to a random forest is similar to the concept of the wisdom of crowds. Each person brings their own knowledge to answering a question, so aggregating the answers increases the total amount of input information.
To construct a random forest, the results from many decision trees are calculated and averaged. Each of the trees is constructed from a random subset of the possible explanatory variables and is estimated using a random subset of the historical data.
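A minimal sketch of that construction follows, using only the Python standard library. It is not ANZ Research's implementation. Several simplifications are assumed: each "tree" asks a single question (one split), the random subset of history is drawn with replacement (a bootstrap sample, the common choice), and the data are synthetic. All function names are hypothetical.

```python
import random

def fit_stump(rows, targets, candidate_vars):
    """Fit a one-question tree: choose the variable and threshold with the lowest squared error."""
    best = None
    for var in candidate_vars:
        for threshold in sorted({row[var] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[var] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[var] > threshold]
            if not left or not right:
                continue  # this threshold does not split the sample
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, var, threshold, left_mean, right_mean)
    if best is None:  # no usable split: fall back to the sample mean
        mean = sum(targets) / len(targets)
        return (None, None, mean, mean)
    return best[1:]

def predict_stump(stump, row):
    var, threshold, left_mean, right_mean = stump
    if var is None:
        return left_mean
    return left_mean if row[var] <= threshold else right_mean

def fit_forest(rows, targets, n_trees=50, vars_per_tree=2, seed=0):
    rng = random.Random(seed)
    all_vars = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        # Random subset of the history (assumption: sampled with replacement).
        idx = rng.choices(range(len(rows)), k=len(rows))
        # Random subset of the explanatory variables.
        candidate_vars = rng.sample(all_vars, vars_per_tree)
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx],
                                candidate_vars))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)

# Synthetic data for illustration: four variables, only the first one matters.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
targets = [0.5 * row[0] + data_rng.gauss(0, 0.1) for row in rows]

forest = fit_forest(rows, targets, n_trees=25, vars_per_tree=2, seed=0)
prediction = predict_forest(forest, rows[0])
print(round(prediction, 3))
```

Because every tree sees different months and different variables, the trees make different mistakes, and averaging them is what gives the forest its "wisdom of crowds" character.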
ANZ Research’s model considered 49 variables as well as lagged values of these (and of retail sales) for up to 12 months. In other words, the current value and 12 monthly lags of each of the 49 variables (49 × 13 = 637), plus 12 lags of retail sales. This gave a total of 649 potential variables.
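The arithmetic behind the 649 figure can be checked directly. The variable names below are hypothetical, and the only assumption is that each of the 49 indicators enters with its current value plus lags of 1 to 12 months, while retail sales (the thing being forecast) enters only through its 12 lags.

```python
n_indicators = 49
max_lag = 12

# Current value plus lags 1..12 for each indicator.
indicator_features = n_indicators * (max_lag + 1)
# Retail sales is the target, so only its lagged values are available as inputs.
retail_lag_features = max_lag

total_features = indicator_features + retail_lag_features
print(indicator_features, retail_lag_features, total_features)  # 637 12 649
```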
Each decision tree was built from a random subset of those 649 variables. The ability of this technique to handle such a large number of explanatory variables is precisely the advantage it has over regular linear regression methods, which are more limited in the number of variables that can be considered for a given sample size.
To estimate and evaluate the model, ANZ Research used data from 2010 to May 2018. Of that data, 75 per cent was used to estimate the model while the remainder was used to evaluate it. This means the model has not already ‘seen’ the data upon which it is evaluated.
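One way to picture the 75/25 split is sketched below. Two things are assumed here rather than taken from the model description: that the sample starts in January 2010, and that the split is chronological (earliest months for estimation, latest for evaluation). The split could equally have been drawn at random.

```python
def month_range(start_year, start_month, end_year, end_month):
    """List months as 'YYYY-MM' strings, inclusive of both ends."""
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(f"{year}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

# Assumption: the sample runs from January 2010 to May 2018.
months = month_range(2010, 1, 2018, 5)

# Assumption: a chronological split, with the first 75 per cent used for estimation.
split = int(len(months) * 0.75)
train_months, test_months = months[:split], months[split:]

print(len(months), len(train_months), len(test_months))
```

Whatever the split method, the point is the same: no month appears in both sets, so the evaluation is on data the model has not seen.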
The mean absolute error for the model’s forecast of monthly retail sales growth was 0.31 percentage points – slightly less than the Bloomberg consensus of 0.36 percentage points.
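For readers unfamiliar with the measure, the mean absolute error is simply the average size of the forecast miss, ignoring its direction. The sketch below uses made-up growth figures in percentage points, not the actual forecasts or outcomes.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute gap between outcomes and forecasts."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Made-up monthly growth rates in percentage points (illustration only).
actual_growth = [0.4, -0.1, 0.3, 0.0]
forecast_growth = [0.2, 0.1, 0.5, 0.2]

mae = mean_absolute_error(actual_growth, forecast_growth)
print(round(mae, 2))  # every forecast misses by about 0.2, so the MAE is about 0.2
```

A lower figure means forecasts were closer to the outcome on average, which is the basis for comparing the model's 0.31 with the consensus's 0.36.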
As noted, random forests also show which variables are the most important in predicting retail sales. ANZ Research found two of the most important variables were employment-related: the change in total employment four months earlier and the change in the NAB business indicator for employment nine months earlier.
Jack Chambers is an Economist and David Plank is Head of Australian Economics at ANZ.