Part 1: Introduction – Part 2: Timer – Part 3: Volatile formulas – Part 4: Lookup formulas – Part 5: Sorting source data – Part 6: Table or List? – Part 7: Multiple criteria lookups – Part 8: Sum with multiple criteria – Part 9: Other factors – Part 10: Summary
Part 5: Sorting source data
In the previous part, we looked at the speed of 6 different lookup formulas. Whilst OFFSET/MATCH was the fastest formula, we decided that VLOOKUP and INDEX/MATCH were probably better options. In this part, we will be using the same 6 formulas, but comparing how they perform with sorted and unsorted data. I have read a number of times that sorting data will improve calculation speed, but how much of a speed boost will it actually give us?
Setting up the test
For this test, we will be using the same data as the previous part, but I have added an additional worksheet with the same data sorted. As a reminder: I started with 99,999 individual lines of data to represent all the sales of various salesmen. I removed the duplicate records and totaled the sales for each unique record. This left me with 42,941 unique data lines, which had the same total value as the original 99,999 lines.
From the 42,941 I selected 100 lines at random for the test. I carried out 5 timed tests on each formula. Using the MircoTimer I recorded how long the 100 lines took to recalculate.
The formulas being used were:
All of the data and workings from the tests can be downloaded here.
The results – Test #1
The table below shows the results:
Just take a few seconds to let the data in the table soak in. Did you notice that every formula based on the sorted data was slower? Between 30 and 61% slower! This really goes against the conventional wisdom found elsewhere on the web – which states that sorted data is faster.
The results – Test #2
It is possible that the data I selected to test was actually weighted towards the end of the source data, therefore it would take longer to calculate. I decided to carry out a second test – what would happen if all of the records being searched were at the top of the sorted data? Here are the results from that test:
The VLOOKUP, INDEX/MATCH and OFFSET/MATCH formulas based on the sorted data calculated in just 3% of the time! That’s 33 times faster than unsorted list.
VLOOKUP, INDEX/MATCH and OFFSET/MATCH formulas will all stop iterating through the source data once they find a matching record. As a result, if the searched data is weighted towards records earlier in the source data then it could produce much faster calculation times.
The surprise for me was still that SUMPRODUCT, SUMIF and SUM (Array) were 24%-29% slower. These formulas function by working through every line in the source data to calculate a TRUE/FALSE result, multiply that TRUE/FALSE result against the value and then summing all the results. Therefore, Excel is carrying out a large number unnecessary calculations. This is especially true when using a unique data source. The same number of unnecessary internal calculations are occurring on sorted and unsorted data, so I would have expected the times to be much closer.
If you know why a sorted list would take longer for SUMPRODUCT, then please contact me, as I would love to find out.
VLOOKUP, INDEX/MATCH and OFFSET/MATCH are again running at comparable speeds. If the results we are looking up are in the earlier records of the source data, then there can be a significant increase in speed. However, if they are not then we may experience no speed increase, or maybe even slightly slower speeds. Testing the speeds on each model we build will help to identify if it is worth sorting data the data to get a speed boost.
In the next part, we will consider the speed improvements we can achieve from laying out our data in a 2-dimensional table, rather than a single list.