In any industry, it’s important to keep abreast of the latest advancements in our field. This is of course also true in the world of location planning. At Geolytix we recently held our Model Innovation Day, an internal event which aims to give our Data Scientists a chance to challenge themselves whilst learning new modelling methodologies. It’s also a platform for us to share these ideas with each other, and a good excuse for ordering in pizza!
This session’s objective was to create a sales forecasting model for a Malaysian retailer, using a mix of demographic, mobile, affluence, competitor, POI and road data. The accuracy of each model was assessed against a hold-out sample of stores, which was revealed at the end of the day. The primary goal though was for each of us to challenge ourselves to try something new, whether it be a new approach, language or software.
By the end of the day we had a real mix of different approaches, including machine learning, scorecards and catchment models. Many of these techniques worked well, and it was interesting to see how a wide range of solutions could produce similar results. The team used a mix of platforms and software, including Postgres, Python (in IDEs and notebooks), R and Excel. It was great to see so many of the team trying new things and helping each other as the day progressed!
Unsurprisingly, however, there were also challenges and limitations in building a model in a day. For example, one topic of debate was one of the most highly correlated variables (a binary store operation factor), which was arguably a little questionable. Variables like these would likely have been removed or ‘turned down’ before any of these models hit production in a live scenario. Another important consideration is how best to describe each model: which variables contribute to store sales, and exactly how they do so. This understanding can be key to building user confidence in a model’s forecasts. For example, my own attempt, which used an Auto-ML package (a hybrid of many ML models), generated very accurate results but was extremely difficult to understand - probably not a great combination for use in the real world! (Other machine learning approaches on the day had much better explainability.)
Our winner for the day was Dan Dungate with his gradient boosted decision tree model. His was the most accurate model, and nicely explainable too, thanks to the clear variable contribution charts he produced alongside the results (congrats Dan!). Most importantly though, the team had a great day and will reconvene soon to share their learnings from the session and discuss topics for our next innovation day.
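For readers curious what that kind of approach looks like in practice, here is a minimal sketch of a gradient boosted decision tree regressor with a variable contribution summary, using scikit-learn. The data, feature names and parameters below are purely illustrative assumptions, not the actual Geolytix model or dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Illustrative store-level features (names are assumptions for this sketch)
feature_names = ["catchment_population", "competitor_count", "affluence_index"]
X = np.column_stack([
    rng.uniform(1_000, 50_000, n),  # catchment_population
    rng.uniform(0, 10, n),          # competitor_count
    rng.uniform(0, 1, n),           # affluence_index
])

# Synthetic "weekly sales" target with some noise
y = 0.5 * X[:, 0] - 2_000 * X[:, 1] + 30_000 * X[:, 2] + rng.normal(0, 5_000, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Gain-based variable importances: the raw material for a contribution chart
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.2f}")
```

The appeal of this family of models is exactly the trade-off discussed above: strong predictive accuracy while still exposing per-variable contributions that can be charted and explained to stakeholders.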
Danny Hart, Head of Data Science at GEOLYTIX
Main Image: Photo Credit Lisa Taylor