Often in business, it’s helpful to understand how two metrics of interest relate to one another. This allows us to determine whether patterns occur between them, which can ultimately be exploited to make better decisions. Good examples of related metrics that commonly exhibit exploitable patterns include advertising spend vs. the interactions those ads receive, or investment in potential clients vs. the return on investment from those clients.

Typically, when exploring the relationship between two metrics, a business analyst’s first instinct is to reach for a scatterplot or a regression model that attempts to quantify the correlation. However, patterns can occur in ways other than a direct relationship, and a typical regression model cannot uncover them without first generating subsets or clustering the observations. While there are endless ways to find clusters within a data set, most require more advanced programming than many people wish to use, and many are complicated to explain to decision-makers.

Below, we’ll discuss a technique that is widely used in descriptive analytics as a means of understanding the distribution of numerical metrics relative to one another. As an added bonus, it only requires a spreadsheet: the Dual Matrix. This technique is useful in finding clusters within your data and can help identify patterns that can be used to justify allocation of resources towards a business objective. First, we’ll simply focus on the creation of the matrix and get your wheels turning around how it can be applied in your business context. Second, we’ll dive a bit deeper into the statistics and talk about how the granularity of your matrix can impact the analytical results.

To create a dual matrix, simply segment and label each of the two metrics of interest into percentile groups. This can be done in Excel (or your analytics tool of choice) through the following steps:

- Order your observations from low to high for the first metric of interest.
- Create an incremented index column starting at “1” and ending at the number of rows in your data set.
- Decide how many groupings to use. This will be the granularity of your matrix. Using 10 groups as an example, the number of observations in each group would be the number of observations in your entire data set divided by 10.
- This is the most mathematical step. Create another column equal to the index column you created, divided by the group size from Step 3 and rounded up to the nearest whole number. In Excel, this would look something like: =CEILING(index_column / (number_of_observations / number_of_groups), 1)

See the image below for reference, where the number of observations equals 1,026, and we’re creating 10 groups.

- Repeat the above process for the second metric of interest, starting with re-sorting the values for Metric 2 from low to high.
- Create a pivot table. The columns created from the metric groupings should serve as the columns and rows, while the count of either one of the metric columns should serve as the values. With 10 groupings for each, this will give you a 10 x 10 grid with the count of observations in each of the cells.
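The steps above can also be sketched in a few lines of pure Python. This is a minimal illustration of the same sort / index / CEILING / pivot logic, not a replacement for the spreadsheet workflow; the function and variable names are my own.

```python
import math
from collections import Counter

def percentile_groups(values, n_groups):
    """Assign each observation a percentile-group label (1..n_groups),
    mirroring the sort / index / CEILING steps described above."""
    n = len(values)
    group_size = n / n_groups  # observations per group (Step 3)
    # rank each observation from 1..n, as if sorted low to high
    order = sorted(range(n), key=lambda i: values[i])
    groups = [0] * n
    for rank, i in enumerate(order, start=1):
        groups[i] = math.ceil(rank / group_size)  # the CEILING step
    return groups

def dual_matrix(metric1, metric2, n_groups):
    """Count observations in each (group1, group2) cell,
    i.e. the pivot-table step."""
    g1 = percentile_groups(metric1, n_groups)
    g2 = percentile_groups(metric2, n_groups)
    return Counter(zip(g1, g2))
```

For example, two perfectly anti-correlated metrics would put every observation in the off-diagonal corners of a 2 x 2 version of the matrix, while correlated metrics would concentrate counts along the diagonal.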

This process produces a quick and intuitive way of visually segmenting your data. Patterns could emerge in the values, further allowing you to group adjacent cells into “regions” of business significance and meaning. The question, then, becomes, “How many percentile groups should we create to receive the greatest insights?” Continue below as we dive into the question around how the granularity of grouping impacts the analytical result.

In an application of these matrices, we encountered a business situation where dollars were being spent towards minimizing “Customer Minutes Interrupted” (also referred to as “CMI”). Figure 1 shows the dual matrices with the axes of CMI vs. Spend. “CMI High, Spend Low” is labeled as “CHSL,” etc. Ideally, the more money spent on the given customer’s infrastructure, the lower we’d like to see CMI.

We can consider the diagonal (top left to bottom right) as representing the “average” customer. Along the diagonal, the Customer Minutes Interrupted are arguably what you should expect, relative to other customers, given what was spent on them. Customers who arguably experienced a lower return on investment for the same level of CMI are shown above the diagonal: the entire top right of the matrix contains observations where more money was spent relative to others in the same CMI percentile group. The opposite holds true for the bottom left, and the distance from the diagonal is one measure of the disparity. This phenomenon holds regardless of the number of percentile groups used.

The number of percentile groups used to divide the metrics, however, creates different sized matrices – in this case, a 20 x 20 versus a 10 x 10 matrix – and can produce different insights. The smaller matrix aggregates the cells of the bigger matrix at a ratio of 4:1: each cell of the 10 x 10 matrix is the sum of a 2 x 2 block of cells from the 20 x 20 matrix. And because the matrix is constructed with simple percentile groups, we can fairly easily see which four cells in the 20 x 20 matrix get aggregated into which cell of the 10 x 10.
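The 4-to-1 aggregation itself is simple to express: summing each 2 x 2 block of the larger count matrix yields the corresponding cell of the smaller one. A minimal sketch (the function name is my own):

```python
def aggregate_matrix(big, factor=2):
    """Collapse a square count matrix by summing factor x factor blocks,
    e.g. a 20 x 20 matrix with factor=2 becomes a 10 x 10 matrix."""
    n = len(big)
    small = [[0] * (n // factor) for _ in range(n // factor)]
    for r in range(n):
        for c in range(n):
            small[r // factor][c // factor] += big[r][c]
    return small
```

Note that integer division of the row and column indices is exactly what maps a 20 x 20 coordinate to its 10 x 10 home, which is why off-diagonal cells such as (0, 1) and (1, 0) both land on the (0, 0) diagonal cell of the smaller matrix.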

*Figure 1: Spend – Effect (CMI) matrix on 20 and 10 percentile groups – count of entities*

For example, looking at the top left of the (more granular) 20 x 20 matrix, you can see that there are two cells at coordinates (0, 1) and (1, 0) that are “off-diagonal” but become part of the diagonal at (0, 0) in the smaller (aggregated) matrix. As shown below in Figure 2, 36 entities that are “off-diagonal” in the 20 x 20 matrix are included among the 62 “on-diagonal” entities in the 10 x 10 matrix. This is an elementary result of simple math, but it determines how segments of our population are treated as a result of the analysis. Furthermore, as we’ll demonstrate below, the granularity impacts your confidence in whether observations in different cells should, in fact, be treated differently for your business purposes.

*Figure 2: The end cells of 20 x 20 and 10 x 10 matrix diagonals*

Diving deeper into the math, the mean of the feature used in building the percentiles is affected by the different composition of the cells, as seen in Figure 3. The four smaller cells reveal more granular values that might provide helpful insights; on the other hand, each has a smaller sample size, which can reduce confidence in the results.

*Figure 3: Means of the feature within the end cells of the 20 x 20 and 10 x 10 matrix diagonals*

*Figure 4: Estimating the confidence intervals around the cell means*

Looking at the confidence interval estimates, you’ll see that the upper left cell of the 20 x 20 matrix has a CI width of 48,910 against a mean of 149,346 – a width of roughly a third of the mean, which is reasonably tight. The lower right cell, however, has a very broad interval (465,527) relative to its mean (294,253), so the result in that cell should be treated with some skepticism.
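A cell’s CI width can be estimated directly from its observations. A minimal sketch, using a normal critical value as an approximation (a t critical value would be more exact for small cells, e.g. via Excel’s CONFIDENCE.T):

```python
from math import sqrt
from statistics import NormalDist, stdev

def ci_width(sample, confidence=0.95):
    """Approximate width of the confidence interval around a cell's mean:
    twice the critical value times the standard error of the mean."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.96 for 95%
    half_width = z * stdev(sample) / sqrt(len(sample))
    return 2 * half_width
```

Comparing this width to the cell mean, as done above, is a quick screen for which cells are estimated precisely enough to act on.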

To assess the statistical confidence in the differences across the remainder of the cells in each matrix, a P-value of the test statistic is obtained for each cell, with the null hypothesis being that there is no difference between each cell and its corresponding diagonal value.

A two-sample t-test was conducted on the difference between cell means, assuming unequal variances between the groups (Welch’s t-test). The result is shown in Figure 5.
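A minimal sketch of the unequal-variance (Welch’s) t statistic and its degrees of freedom, with a normal approximation for the two-sided p-value. In practice you would use the exact t distribution instead – e.g. Excel’s T.DIST.2T or a stats library – so treat the p-value here as an approximation that is only adequate for larger cells:

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with (possibly) unequal variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

def approx_p_value(t):
    """Two-sided p-value via a normal approximation to the t distribution
    (use the exact t CDF with the df above for small samples)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))
```

Each off-diagonal cell’s observations would be compared against its corresponding diagonal cell’s observations with this test, producing the grid of p-values shown in Figure 5.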

*Figure 5: P-values indicating the confidence that the entries in each individual cell are significantly different from their respective, corresponding diagonal values*

Green cells indicate high confidence that a particular cell is, indeed, different from the cell on the diagonal. In a business context, this is the confidence we have that we could be spending too much, or too little, for the return that is being observed (in our case, customer outages compared to other customers experiencing the same level of outages). But what’s most interesting is the trade-off between statistical significance and granularity, evident in the distribution of green cells relative to the red. The 10 x 10 matrix is “better behaved” in its significance pattern: in general, red values lie along the diagonal, with confidence increasing as you move away from it. On the other hand, while the 20 x 20 matrix gives more granularity, there’s far more uncertainty that the population within each cell should be treated differently from the others.

In summary, the Dual Matrix approach helped provide some quick insights into a decision to be made: With which potential customers should we spend more or less money, given the historical return on investment? Identifying where customers fall relative to the diagonal, while measuring the averages of the observations in each cell, helped decide where money should be spent and how much. So, next time you have an analytics task involving two metrics of interest, it might be worth trying a Dual Matrix approach.