LASSO: an algorithm for automated brand self-evaluation
In order to allow users to decide whether their own brand could be further extended, an algorithm was developed to determine a brand’s extension based on the LASSO scoring framework. This algorithm allows users to self-evaluate their brand according to the guidelines provided in this book, and receive recommendations as to how optimally the brand is being extended. Using state-of-the-art statistical techniques, the algorithm aims to simulate an expert assessment of brand extension in an automated manner, allowing both consistent and objective brand evaluation.
To develop an algorithm that accurately characterizes complex phenomena, the most effective current methods rely on fitting a statistical model to a verified, known set of training examples in a process known as ‘supervised learning’. For the purpose of developing such an algorithm to characterize brand extension, a “gold-standard” dataset of brand evaluations was generated by an expert panel of three brand specialists, and this was used to optimize, train, and evaluate the model. The resulting algorithm produced by this analysis performs both accurately and consistently, providing a robust solution with which users may evaluate their own brands.
The dataset generated by the expert panel consists of both the LASSO scores and a corresponding determination of brand extension for 56 brands, including brands as famous and large as Coca-Cola, Mickey Mouse, and the NFL, and as different as Chupa Chups, FIFA World Cup, Nerf, and World of Warriors. The brands that were evaluated and each expert’s determination of brand extension (as either under-extended or optimally/over-extended) are listed in Table 1.
Roughly half of the brands characterized in this dataset were under-extended and the other half were either optimally-extended or over-extended. Note that this group of brands was selected by the panel to include companies across a diverse range of industries. By including brands of companies both large and small across many industries in this training dataset, the algorithm is able to generalize effectively to characterize a wide spectrum of brands. Indeed, the inclusivity of this training dataset should enable this algorithm to accurately classify brand extension.
To further improve the accuracy and real-world relevance of the algorithm, a subset of 25 of the brands was independently rated by each of the three experts. This overlap allows the model to capture the intrinsic, yet entirely valid, variation in these metrics. In addition, the overlapping set of examples allows a direct comparison of the agreement between predictions made by the algorithm and those made by human experts. See below a process flow chart depicting the steps taken from Data Collection through Model Selection and Training.
The task of determining whether or not a brand is extended to an optimal degree is best suited to the group of statistical models that aim to classify examples into one category or another, a process known as ‘binary classification’. Many binary classification models exist, each with different strengths and weaknesses for various types of datasets and variables. To identify the best model for the problem of classifying brand extension, several of the most powerful model families from conventional statistics and modern machine learning were evaluated and compared.
According to best practices in statistics, a model cannot be evaluated against the data that it was trained on; an example used to “fit” the model cannot be used to judge the model’s performance, or serious biases will invalidate the results. There are many ways to avoid this bias, and all generally involve splitting the entire dataset into both “training” and “testing” subsets.
Here, the model being evaluated is fit on the “training” data, and then predictions are made on the “testing” dataset. The accuracy of these predictions is then used to measure the performance of the model. One of the most effective methods for generating these training and testing datasets is a technique known as ‘cross validation’.
The main benefit of this technique, over other techniques for validating a model, lies in the fact that it evaluates the model on every single example in the original dataset. Because of this, a model that classifies some types of brands much better than others will always be penalized to the same extent, while other methods of model evaluation may rate this model higher or lower in a fairly random manner.
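The train/test procedure described above can be sketched in a few lines of Python. This is a minimal illustration of k-fold cross validation, not the code used to build the LASSO algorithm: the function names, the toy "nearest mean" model, the three-score brand vectors, and the label strings are all hypothetical, chosen only to show that every example is predicted exactly once by a model that never saw it during fitting.

```python
def k_fold_indices(n_examples, k):
    """Split indices 0..n_examples-1 into k disjoint, nearly equal test folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_examples):
        folds[i % k].append(i)
    return folds


def fit_nearest_mean(train_x, train_y):
    """Toy stand-in model (an assumption, not one of the five models in the text).

    It remembers the average score of each class and predicts the class
    whose average is closest to the new brand's average score.
    """
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        sums[y] = sums.get(y, 0.0) + sum(x) / len(x)
        counts[y] = counts.get(y, 0) + 1
    centers = {y: sums[y] / counts[y] for y in sums}

    def predict(x):
        score = sum(x) / len(x)
        return min(centers, key=lambda y: abs(centers[y] - score))

    return predict


def cross_validate(features, labels, k, fit):
    """Return one held-out prediction per example.

    Each fold is predicted by a model fitted only on the other folds,
    so no example is ever used to judge a model that was trained on it.
    """
    predictions = [None] * len(labels)
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = fit(train_x, train_y)
        for i in test_idx:
            predictions[i] = predict(features[i])
    return predictions


# Hypothetical scores on a 1-5 scale for six made-up brands.
features = [[1, 2, 1], [2, 2, 1], [2, 1, 2], [5, 4, 5], [4, 5, 4], [4, 4, 5]]
labels = ["under", "under", "under", "not_under", "not_under", "not_under"]

predictions = cross_validate(features, labels, k=3, fit=fit_nearest_mean)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

Because the accuracy is computed over all held-out predictions, a model that handles some kinds of brands poorly is penalized for them regardless of how the folds happen to fall.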
Using this cross validation framework, five models were chosen that showed promise in predicting the expert classification of a particular brand’s extension. However, to further enhance the accuracy and reliability of predictions made by this algorithm, one additional step was added. Rather than choose the single best of the five top-performing models to generate the final predictions, the predictions of all five were combined with a machine learning technique known as ‘ensembling’.
Essentially, this technique generates individual predictions for each model, and each model then casts a “vote” for its prediction; these votes are then tallied and the prediction given by a majority of the models is used as the final prediction. For example, if three models predict that a brand is under-extended while the other two predict it is optimally-extended, the final “ensembled” model will predict that the brand is under-extended.
The power in this technique arises from the fact that the individual models, although performing fairly similarly to each other, make mistakes that are not identical. Because the models do not make exactly the same predictions, the majority consensus will more often be right than any individual model. After applying ensembling to the five best-performing models identified with cross validation, the algorithm’s performance increased significantly to nearly the same level as human experts, as detailed below.
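The voting scheme described above can be illustrated with a short Python sketch. The label strings and the five lists of model predictions are hypothetical and were constructed by hand so that each model makes a different mistake; they are not outputs of the actual LASSO models.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most models.

    Assumes an odd number of models and two possible labels,
    so a tie cannot occur.
    """
    label, _count = Counter(votes).most_common(1)[0]
    return label


def accuracy(predictions, truth):
    """Fraction of predictions that match the reference labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def flip(label):
    """Return the other label of the binary pair."""
    return "not_under" if label == "under" else "under"


# The example from the text: three votes against two.
print(majority_vote(["under", "under", "under", "optimal", "optimal"]))  # under

# Hypothetical reference labels for five brands.
truth = ["under", "under", "not_under", "not_under", "under"]

# Five hypothetical models, each wrong on a different brand.
models = [
    [flip(t) if i == m else t for i, t in enumerate(truth)]
    for m in range(5)
]

# One ensembled prediction per brand: tally the five votes for that brand.
ensemble = [majority_vote(votes) for votes in zip(*models)]

print([accuracy(m, truth) for m in models])  # each model: 0.8
print(accuracy(ensemble, truth))             # ensemble: 1.0
```

In this contrived case every model is right 80% of the time, yet the ensemble is right every time because no two models err on the same brand. Real models make partly overlapping mistakes, so the gain is smaller in practice.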
Agreement Between Experts
The 25 common brands, which the brand specialists independently scored on the LASSO rubric and assessed for brand extension, provide an observation of the true, inherent variability in brand assessment. While the LASSO framework provides a powerful, quantitative approach to brand assessment, variability is present in all real-world datasets and this must be considered both when generating a model and when evaluating it.
While training a model, including this inherent variability actually improves the performance of the resulting model. And, when evaluating the model, the agreement between human experts sets an upper limit on the predictive capabilities that can be expected of such an algorithm. To quantify this variability, the standard deviation between the expert scores for each of the LASSO metrics was determined for each brand in this set (Figure 1).
These were then averaged for each brand, providing a look at the inherent ambiguity or complexity in rating each brand (grey bars in Figure 1), as well as for each metric, providing a comparison of the variability for each of the LASSO variables (rightmost set of bars in Figure 1). Finally, the agreement between the experts’ classifications of each brand’s extension is documented in the bar coloring of Figure 2.
Overall, the panel of brand experts showed a very high level of agreement in their LASSO scoring metrics. The overwhelming majority of scores deviated by at most 1 point in only one of the three experts’ scores (on a scale of 1 – 5), suggesting that the LASSO rubric, when properly deployed, is capable of precisely and quantitatively characterizing brands.
With regard to the classification of brands as under-extended or not, the expert panel produced a unanimous classification for 19 of the 25 brands, suggesting that it is straightforward to determine brand extensibility in roughly 80% of cases, while one out of every five cases may be more involved and require further consideration.
Note that, as a point of reference in Figure 1, 0.47 is the standard deviation for a score where one expert differs from the other two by exactly 1 (e.g. expert scores of 4, 4, and 5). The standard deviation for MLB was 0.00, as indicated in the figure. Please note that the experts first rated brand extension on a more fine-grained five-point scale that was later down-sampled to a simpler ‘under or not-under extended’ rubric.
This was done due to limitations with the size of the dataset available, and may have resulted in the experts placing brands in the “slightly-under extended” category with differing frequency. Thus, a coarser rubric may have resulted in slightly higher unanimity between expert classifications.
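The 0.47 reference value quoted for Figure 1 can be checked directly. The sketch below uses Python's standard `statistics` module. The value matches the population standard deviation (dividing by the number of experts), which suggests, though the text does not state it, that this is the formula behind the figure. The `brand_variability` helper and its metric names are hypothetical and only illustrate the per-brand averaging described above.

```python
import statistics

# One expert differs from the other two by exactly 1 point.
scores = [4, 4, 5]

print(round(statistics.pstdev(scores), 2))  # 0.47 (population formula, divides by n)
print(round(statistics.stdev(scores), 2))   # 0.58 (sample formula, divides by n - 1)

# Full agreement between the three experts gives zero spread.
print(statistics.pstdev([3, 3, 3]))


def brand_variability(scores_by_metric):
    """Average the per-metric standard deviations for a single brand.

    `scores_by_metric` maps a metric name to the three expert scores.
    """
    return statistics.mean(
        statistics.pstdev(expert_scores)
        for expert_scores in scores_by_metric.values()
    )


# Hypothetical brand scored on two placeholder metrics.
example_brand = {"metric_1": [4, 4, 5], "metric_2": [3, 3, 3]}
print(round(brand_variability(example_brand), 2))  # 0.24
```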
The final algorithm, using an ensemble of five well-performing models, was evaluated using two methods. As both methods involve cross-validation, which is difficult to apply to the scores for more than one expert at a time, only the scores and classifications for one expert (Expert 1) were used to generate the model and predictions for this step.
For both methods of algorithm evaluation, cross-validation was used to first predict the extensibility of each brand in the dataset. The first method aims to assess the absolute capabilities of the algorithm to model this dataset, while the second compares the model’s performance with that of the human expert brand specialists.
For the first evaluation, these predictions were compared to the “true” classifications chosen by Expert 1. In this test, the algorithm correctly predicted the brand extensibility for 39 of the 49 total brands (79.6%) assessed by Expert 1. Notably, three of the five best performing individual models, all of which were used in the final ensemble, correctly predicted the classification of 36 of the 49 (73.5%) brands when evaluated individually.
Although this does not seem to be drastically different from the success rate for the complete ensembled algorithm, the fact that all three top-performing models are able to predict exactly the same number of brand assessments correctly suggests that this may be an upper limit to the accuracy of individual models with this data, and that ensembling or other such techniques may be required to exceed it.
The second method of evaluation used the set of brands scored by all three experts to determine how the algorithm’s predictions compare to a human’s predictions. By comparing the number of times that all three experts agreed on a classification of a brand’s extension to the number of times the algorithm correctly predicted the classifications of Expert 1, it is possible to characterize how well the algorithm performs both on ambiguous cases and on brands with more well-defined brand extension.
When expert judgment could not consistently determine a brand’s extension, the model performed poorly, correctly predicting only 3 of 6 (50%) of the classifications of Expert 1, a result no better than random.
When all three experts agree on the brand’s extension, however, the algorithm correctly classifies 15 of 19 brands (78.9%), indicating that the algorithm is truly capturing the intricacies of the LASSO scores that impact a brand’s extensibility.
Lastly, with this evaluation we are able to directly compare how well Experts 2 and 3 agreed with, or “predicted”, Expert 1’s classification of these brands with how well the algorithm predicted these classifications. In total, the two other experts predict Expert 1’s classification for 19 of these 25 example brands (76%) while the model predicts Expert 1’s choices on 18 of 25 examples (72%).
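The percentages in this second evaluation follow directly from the counts reported above, and the two subsets add up to the 25 shared brands. The short Python check below uses only those reported counts; the dictionary keys are descriptive labels added here for readability.

```python
def percent(correct, total):
    """Share of correct predictions, as a percentage rounded to one decimal."""
    return round(100 * correct / total, 1)


# (correct, total) pairs as reported for the 25 brands rated by all three experts.
reported = {
    "algorithm, experts not unanimous": (3, 6),
    "algorithm, experts unanimous": (15, 19),
    "algorithm, all shared brands": (18, 25),
    "Experts 2 and 3 vs Expert 1": (19, 25),
}

for name, (correct, total) in reported.items():
    print(f"{name}: {correct}/{total} = {percent(correct, total)}%")

# The unanimous and non-unanimous subsets partition the 25 shared brands.
assert 6 + 19 == 25
assert 3 + 15 == 18
```

The gap between the algorithm (72%) and the other two experts (76%) is a single brand out of 25, which is why the text cautions against detailed inferences from a set this small.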
Again, with such a relatively small set of data it is difficult to make detailed inferences from these results, but the results do suggest that the algorithm performs respectably, even when compared to expert brand specialists. This is expected, as the algorithm is trained on data generated by these very experts. Also, since we believe that it is capturing the information relating to brand extension contained in the LASSO metrics, it follows that a larger volume of expert training data in the future will allow the model to better represent this information and become increasingly robust.
Additional enhancements, such as including industry information in the model and the previously mentioned fine-grained categories for brand extension, are likely to further boost the model’s accuracy and precision. In summary, the algorithm, as it currently exists, provides a repeatable, widely-deployable, and inherently objective method for both expert and amateur owners to evaluate their brands.
LASSO Methodology Q&A
Below are some detailed responses put together by our team of Brand Licensing Experts for some frequently asked questions.
- How have you determined what is 'gold-standard'? How many inputs were used in your dataset? How many companies? How did you qualify those companies and products?
Specifically, each of the experts scored between 28 and 50 brands, totaling 127 brand evaluations of 56 unique brands that served as the “gold-standard” dataset on which the algorithm was trained. These brands belonged to companies in 22 different industries, and included products, services, and media.
- The LASSO Model seems highly based on interpretation. Is that right? If so, how do you maintain consistency across scorers, or across your expert panel? Can the LASSO Model be run with true comparative value with any old person scoring the brand?
It is true that self-scoring based on the LASSO rubric will be subject to personal interpretations of the metric descriptions published in the book and that the scores will be affected by biases common in self-reported surveys. These pitfalls are to some extent unavoidable in this type of self-evaluation, but these drawbacks may be counterbalanced by the ability of the LASSO scoring assessment to reach a much wider audience and user pool by not requiring a user to retain a brand expert in each case. Regardless, there are mechanisms by which the impact of these response biases and individual interpretations has been blunted.
While it is impossible to phrase any survey question or evaluation description in a perfectly objective manner, guidelines that are specifically defined and neutrally worded will minimize these effects by reducing ambiguity and unintended, unconscious bias. The self-reported LASSO scoring model differs from many surveys and self-evaluations in that very detailed descriptions of not just the scoring methods, but also the actual basis and background behind the metrics, are provided in the chapters of the book.
The users, when evaluating their brand, are provided with much more than a few lines describing the scoring guidelines. Rather, they are given a thorough delineation of the concepts that they are being asked to evaluate their brands on. This minimizes the ambiguity that arises when non-experts are required to perform these evaluations and maximizes the consistency in responses. Further, the questions have been phrased in a way that seeks to minimize the emotion involved in evaluating a user’s own brand, reducing potential unintended biases on the user’s part.
Given that any user conducting self-evaluations, and especially non-experts, will always have both biases and individual interpretations regardless of the question’s formulation, computational approaches must also be applied to reduce the effects of these confounding factors. (Note: this part has not been done yet, as we have no data on non-expert user scores. However, it can be easily implemented once the algorithm has been publicly deployed and has more than roughly 20 users.) By expecting and accounting for these inevitable issues, the LASSO Model is designed to predict and model out the effects of these confounding factors to a certain extent.
For example, a user evaluating their own brand on a quantitative metric for which lower responses indicate a deficiency in their brand or product is highly likely to inflate their score. By having our expert panel assess brands that were also evaluated by non-expert brand owners using the LASSO web application, it is possible to compare responses to the rubric between the non-biased expert panel and the heavily invested, non-expert brand owners. With enough of these comparisons, a model is able to incorporate this information to predict this overestimation in self-reported scores and subsequently make its final prediction of brand extension more robust to these effects.
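One simple way such a correction could work is sketched below in Python. This is an assumption made for illustration only: the text does not specify how the correction would be modeled. The sketch takes the simplest possible form, a constant average offset between owner and expert scores on one metric, and all names and paired scores are hypothetical.

```python
from statistics import mean


def estimate_inflation(paired_scores):
    """Average amount by which owners score above experts on one metric.

    `paired_scores` is a list of (self_score, expert_score) tuples for
    brands rated by both their owner and the expert panel.
    """
    return mean(self_score - expert_score for self_score, expert_score in paired_scores)


def correct_score(self_score, inflation, low=1, high=5):
    """Subtract the estimated inflation and keep the result on the 1-5 scale."""
    return min(high, max(low, self_score - inflation))


# Hypothetical paired ratings for a single metric (no such data exists yet).
pairs = [(5, 4), (4, 3), (5, 4), (3, 3)]

inflation = estimate_inflation(pairs)
print(inflation)                    # 0.75
print(correct_score(5, inflation))  # 4.25
print(correct_score(1, inflation))  # 1 (clamped to the bottom of the scale)
```

A deployed version would more likely learn the adjustment as part of the model itself, and could let it vary by metric or by brand, but the constant offset shows the idea of using expert scores as the unbiased reference.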
- How did you make the numbers 'relative' across the various sizes of companies and industries they were in? Does this matter here? I'm partly playing devil's advocate, but these kinds of questions immediately came to mind as I was reading. And if your goal is to have readers use your LASSO Model, and come to you for consult on brand extension, no doubt, they will have these questions too.
It is certainly important, when training the model, to have a set of brands that represents the diversity of the companies which the users will be trying to evaluate with this algorithm. For example, if only one industry were surveyed, the model would not learn how to use the LASSO metrics to determine a brand’s optimal extension, but rather it would learn how to predict this extension based on arbitrary features of companies in that industry.
This would lead to the model performing very poorly in other industries, since these industry-specific features would not be present or useful with these new industries. Note that the training data does not need to contain every industry that a user might want to evaluate with the algorithm, but just a diverse enough set so that information from any industry-specific features becomes ‘drowned-out’ relative to the information from the LASSO variables.
Here, brands were selected that the members of the expert panel were familiar with and felt comfortable ranking. The panel members did choose a set of brands that they felt to be representative across companies and industries. Although not perfectly stratified across these domains, over 20 industries were sampled, and no one industry represented more than half of the surveyed brands.
While there are several industries that were more highly represented than others, most significant being the entertainment industry under which nearly half of the surveyed brands fall, the overall diversity of the training set makes this a reliable dataset to train the algorithm on.
Company size was not specifically controlled for in the training set, and it is a fact that almost all of the brands in this set arise from large companies. This is a consequence of requiring the expert panel members to only consider brands with which they were familiar, in order to ensure that the expert scores were robust and repeatable.
This does lead to a potential for the model to perform better on larger companies than on smaller companies. Still, given the wide and disparate kinds of companies sampled, across very different industries and product types, the algorithm should be using the information from the LASSO variables to generalize well to companies that it has not seen before. While it is currently trained using brands from these large companies, the model’s better predictive ability for larger companies will diminish as the LASSO web application is used more often and the model is able to incorporate information from additional companies across industry, sector, and size.
* Note that brands could be counted twice if they fell within multiple industries
- “…the inclusivity of this training dataset should enable this algorithm to classify accurately brand extension even for industries not present in this dataset.” - this seems like a big claim... almost implausible.
We understand why this seems to be a grandiose or overconfident statement, but it is rooted in a more formal idea of the ability of a robust model to “generalize” to examples that it hasn’t been trained on yet, even if they are unlike the examples it has been trained on. We touched on this above when talking about how a training dataset does not need to include all industries to generalize well to industries that it hasn’t seen. We’ll try to expand on this a bit here to make this clearer.
Having an inclusive training dataset is important for multiple reasons. The most immediately obvious reason to have as inclusive of a training dataset as possible is that if an industry is in your training dataset, the model will be trained on it, and the next time the model sees a company or brand from that industry it may be able to apply specific “knowledge” from having been trained on companies in that industry to improve its prediction. However, a less obvious benefit from having an inclusive and representative dataset is that information in the data which is more generally relevant has more of an impact in the model’s training, and the model captures these more widely-applicable “ideas” better.
If, for example, the model were only trained using brands in the gaming industry, where addictiveness may be overwhelmingly predictive of high brand extensibility regardless of other factors, the model might perform very poorly when faced with brands in the non-profit industry, where other factors such as being ownable and storied are also important. However, if the model were trained using examples from both industries, its use of all three metrics in informing its prediction would improve its performance on sports brands, where again all three metrics are highly useful.
A more generic illustration of how using a diverse and inclusive training dataset allows a prediction engine to “generalize” better by considering more relevant features comes from how a young child might learn the definition of a pet. A toddler brought up in a household with only dogs and cats may identify pets as being any animal with four legs, fur, and a tail. A child raised with dogs, cats, and fish, however, would not consider the legs, fur, or tail, but more accurately understand pets to be any animal which the family actively tends to and keeps.
Finally, a child who grows up on a farm would correctly learn that pets are animals which the family cares for and takes into their own home, as opposed to animals which are tended to but kept as livestock. In all cases, the “knowledge” learned is not inaccurate, but with more diverse examples the child learns to use features that define the true underlying concept better. When provided with animals that none of the children had seen, the first child may not identify a caged bird as a pet while the second child might incorrectly assume that a goat was a pet. Although not guaranteed, the third child would be most likely to categorize both of these examples correctly, despite not having seen them before, due to their learning with more inclusive and diverse “training data”.
- “To further improve the accuracy and real-world relevance of the algorithm, a subset of 25 of the brands was independently rated by each of the three experts. This overlap allows the model to capture the intrinsic, yet entirely valid, variation in these metrics. In addition, the overlapping set of examples allows a direct comparison of the agreement between predictions made by the algorithm and those made by human experts.” - did a statistician help you create your model?
One of our team members has a significant amount of formal training in statistics at the graduate level, and although he has mostly applied statistics to biological data (he’s a data scientist specializing in genomics), he has a good understanding of the necessary assumptions and best practices behind using these techniques for general data analysis.
Regarding the statement here about incorporating information from overlapping training examples from all the experts, it is important to note that for this dataset, we used techniques more commonly classified as machine learning, as opposed to traditional statistics. Although there is considerable overlap between the two fields and a lot of ambiguity over what constitutes their differences, the general difference between the two lies in who selects the features (variables) that are used in the model. In statistical modeling, the data analyst performs this feature selection and manually sets up the model, which is then automatically fitted (trained) using the data.
In machine learning, however, both the feature selection and the model training are performed automatically by the computer, with minimal input into the feature selection by the analyst. Both methods have benefits. Because statistical models are designed by the analyst, it is possible to interpret the model; statistical modeling lets you explain the relationship between the variables in the model.
However, partly because of their reliance on human curation as well as certain computational limits, statistical models are limited to relatively simple forms. Machine learning, on the other hand, strives foremost to predict the dependent variable with the best accuracy possible, and because the feature selection and model choice are performed computationally, it is able to generate very complex models that predict complicated phenomena with state-of-the-art results. A consequence of this model complexity, however, is that models generated by advanced machine learning methods usually cannot be interpreted by humans.
We began this analysis trying to use only traditional statistical methods such as logistic regression, because we felt it would be helpful to be able to interpret the effects of the LASSO values on brand extension. However, we quickly found that machine learning methods performed a lot better for generating predictions in this dataset, as they often do when the relationships between the variables are highly intricate and nonlinear, as they are here.
As an aside, note that we do use our original logistic regression model in the final predictive algorithm, but it is only one “vote” among several other models. We bring all this up because it helps to explain why having this overlap in scores from the expert panel “allows the model to capture the intrinsic, yet entirely valid, variation in these metrics”. If using traditional statistical models, the formal way to add this intrinsic variation in metric scoring between users would be to include an additional random effect feature representing user-judgement in the model. Machine learning, again, does not require or often allow the user to perform this kind of manual feature selection, and simply learns its own features if they improve the final predictions. This is why including these common examples from all three brand experts lets the final model incorporate this additional variation.
- “One out of every five cases may be more involved and require further consideration” - how does the LASSO Model help the layperson distill whether they fall in the 80% camp or the 20% exception camp?
This is a valid concern, and one that is more difficult to address. Given the vast complexity of determining brand extension and the many intangible factors that affect this phenomenon, it is a challenging problem to objectively quantify and predict. At this point, 80% seems to be the best that can be expected from either human or algorithmic predictors. As more validated training data is collected, the power of big data machine learning techniques may make it possible to model this phenomenon better and possibly more objectively than even expert humans can; techniques such as artificial neural networks have shown this kind of revolutionary success when given very large, high quality datasets in many fields, such as business analytics and advertising.
In the short term however, we have a few more techniques to try which may be able to determine if a prediction is correct, even with this fairly small amount of curated data we currently have from this expert panel. Still, there will always be an upper limit to how complex this algorithm can get when trained on small datasets.
- “In all, however, the algorithm as it currently exists provides a repeatable, widely-deployable, and inherently objective method for both expert and amateur owners to evaluate their brand.” - I still question this claim. But I'm not a statistician.
As discussed above, biases and misinterpretations of the scoring rubric are inevitable in this kind of application, but through education of the user from the book, well-worded and clear guidelines, and algorithmic correction for biases once data begins to be collected, the effects of these challenges can be minimized. At the end of the day, the availability of this algorithm and online self-evaluation tool will allow much wider adoption of the LASSO Model than would be possible solely through expert consultation, and the benefits created by this higher accessibility must be weighed against the inaccuracies that go along with it.
By observing the mistakes that are common and surveying amateur users of the application, over time it will be possible to incrementally improve the phrasing and user understanding of these metrics alongside the improvements to the algorithm.