Recent customer issue:
We are using the Lookup Transform to do joins in our Data Flows. Most Data Flows have more than one Lookup. The process worked fine in our development environment, but fails when we run in production because the sources of the lookups have between 4 and 6 billion records each. How can we resolve this?
You have a number of design choices you can make when you’re doing lookups against a really big reference table.
First, review the different Lookup Cache Modes to get a better understanding of how each one affects the Lookup and the way you'd design your Data Flows. Note that the default mode (Full cache) pulls every record from your reference data source into memory. This makes the lookup very fast at runtime, but it means your data flow can't start processing anything until the entire lookup table has been retrieved. The Lookup won't spool its cache to disk, either: if the process runs out of memory, it fails with an error. This is by design; if you can't fit the cache in memory, you should be using an alternate design.
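To make the difference concrete, here's a rough sketch of the kind of SQL each cache mode ends up issuing against the reference table. The table and column names (dbo.Reference, ReferenceKey, ReferenceValue) are placeholders, and the Partial cache statement is a simplification of what the Lookup actually generates:

```sql
-- Full cache: one query at pre-execute time pulls the entire reference set into memory.
-- Against a 4-6 billion row table, this is the step that exhausts memory.
SELECT ReferenceKey, ReferenceValue
FROM dbo.Reference;

-- Partial cache: the Lookup issues a parameterized query for each incoming key it
-- hasn't seen yet, and keeps only those results (up to the configured cache size).
SELECT ReferenceKey, ReferenceValue
FROM dbo.Reference
WHERE ReferenceKey = ?;
```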
So what are your choices?
- Switch the Lookup to Partial cache mode, so only the values it actually encounters get cached
- If you have multiple Lookups going against the same reference data set, consider using the Cache Connection Manager
- If certain values are more common than others, use a Cascading Lookup Pattern: put the high-frequency values in a Full cache Lookup, and handle the rest with a Partial cache Lookup
- Load the data into a staging table, and use the SQL engine to perform the join (ELT rather than ETL) – there's a sketch of this approach below
- Use a Merge Join transform instead of a Lookup (more below)
Each of these solutions can be a viable alternative, depending on the amount of data in your reference table, and the amount of incoming data you have. You don’t necessarily need to choose just one of them, either – you might find a combination of approaches works best.
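For the ELT option, the idea is to land the incoming rows in a staging table and let the SQL engine do the join, where it can take advantage of indexes and statistics on the large reference table instead of caching it in SSIS. A minimal sketch, assuming hypothetical dbo.StagingInput and dbo.Reference tables (all table and column names are placeholders):

```sql
-- Land the incoming rows first (for example, with a data flow using a fast-load
-- destination), then resolve the lookup with a set-based join inside SQL Server.
SELECT  s.BusinessKey,
        s.Amount,
        r.SurrogateKey      -- the value you would otherwise fetch with a Lookup
FROM    dbo.StagingInput AS s
INNER JOIN dbo.Reference AS r
        ON r.BusinessKey = s.BusinessKey;
```

An index on the reference table's join key makes this join far cheaper than trying to hold billions of rows in a Lookup cache.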
Using Merge Join instead of Lookup
If you need to do a single join in your data flow (as opposed to multiple lookups), consider using a Merge Join transform instead of a Lookup transform. Jamie Thomson has a great post that compares the two approaches and demonstrates that a Merge Join can be a lot more efficient than a Lookup. The main reason is that Merge Join streams both inputs rather than taking time to pre-cache values. The streaming logic was further improved in SSIS 2012 as well: the Merge Join now limits how far ahead one input can get (in terms of buffers) when one source is a lot faster than the other.
Keep the following things in mind when considering this approach:
- Both inputs must be sorted. Ideally, the sort can be pushed into the source query (there's a sketch of this after the list). If the data isn't already sorted (e.g. there's no index on the join keys), the cost of the sort might outweigh the benefits of this approach.
- A source component doesn't finish until it has read all of its data, so if your incoming data has a small number of rows and you're joining it against a much larger data set (which is the case in this particular customer's scenario), the Merge Join approach isn't going to be ideal. A Partial cache Lookup tends to work best in these types of scenarios.
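If you do go the Merge Join route, push the sort into the source queries, and then tell SSIS the data is already ordered by setting the IsSorted property on each source's output and SortKeyPosition on the join column (via the Advanced Editor), so the engine doesn't add Sort transforms. A sketch of the two source queries, again with placeholder table and column names:

```sql
-- Source 1: the incoming data, ordered on the join key
SELECT BusinessKey, Amount
FROM dbo.IncomingData
ORDER BY BusinessKey;

-- Source 2: the large reference table, ordered on the same key
SELECT BusinessKey, SurrogateKey
FROM dbo.Reference
ORDER BY BusinessKey;
```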
Do you have any other options for dealing with this scenario?
Similar design patterns can be found in the SQL Server 2012 Integration Services Design Patterns book available from Apress.