Recent customer issue:
We are using the Lookup Transform to do joins in our Data Flows. Most Data Flows have more than one Lookup. The process worked fine in our development environment, but fails when we run in production because the sources of the lookups have between 4 and 6 billion records each. How can we resolve this?
You have a number of design choices you can make when you’re doing lookups against a really big reference table.
First, review the different Lookup Cache Modes to get a better understanding of how each one affects the Lookup and the way you'd design your Data Flows. Note that the default mode (Full cache) pulls every record from your reference data source into memory. This makes the lookup very fast at runtime, but it means your data flow can't start processing anything until the entire lookup table has been retrieved. The Lookup won't spool its cache to disk, either: if the process runs out of memory, it fails with an error. This is by design; if you can't fit the cache in memory, you should be using an alternate design.
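To make the difference concrete, here's a rough sketch of the kind of SQL each cache mode ends up issuing against the reference table. The table and column names (dbo.Reference, ReferenceKey, ReferenceValue) are placeholders, and the Partial cache statement is a simplification of what the Lookup actually generates:

```sql
-- Full cache: one query at pre-execute time pulls the entire reference set into memory.
-- Against a 4-6 billion row table, this is the step that exhausts memory.
SELECT ReferenceKey, ReferenceValue
FROM dbo.Reference;

-- Partial cache: the Lookup issues a parameterized query for each incoming key it
-- hasn't seen yet, and keeps only those results (up to the configured cache size).
SELECT ReferenceKey, ReferenceValue
FROM dbo.Reference
WHERE ReferenceKey = ?;
```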
So what are your choices?
- Switch the Lookup to Partial cache mode, so only the values it actually encounters get cached
- If you have multiple Lookups going against the same reference data set, consider using the Cache Connection Manager
- If certain values are more common than others, use a Cascading Lookup Pattern: put the high-frequency values in a Full cache Lookup, and handle the rest with a Partial cache Lookup
- Load the data into a staging table, and use the SQL engine to perform the join (ELT rather than ETL) – there's a sketch of this approach below
- Use a Merge Join transform instead of a Lookup (more below)
Each of these solutions can be a viable alternative, depending on the amount of data in your reference table, and the amount of incoming data you have. You don’t necessarily need to choose just one of them, either – you might find a combination of approaches works best.
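For the ELT option, the idea is to land the incoming rows in a staging table and let the SQL engine do the join, where it can take advantage of indexes and statistics on the large reference table instead of caching it in SSIS. A minimal sketch, assuming hypothetical dbo.StagingInput and dbo.Reference tables (all table and column names are placeholders):

```sql
-- Land the incoming rows first (for example, with a data flow using a fast-load
-- destination), then resolve the lookup with a set-based join inside SQL Server.
SELECT  s.BusinessKey,
        s.Amount,
        r.SurrogateKey      -- the value you would otherwise fetch with a Lookup
FROM    dbo.StagingInput AS s
INNER JOIN dbo.Reference AS r
        ON r.BusinessKey = s.BusinessKey;
```

An index on the reference table's join key makes this join far cheaper than trying to hold billions of rows in a Lookup cache.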
Using Merge Join instead of Lookup
If you need to do a single join in your data flow (as opposed to multiple lookups), consider using a Merge Join transform instead of a Lookup transform. Jamie Thomson has a great post that compares the two approaches and demonstrates that a Merge Join can be a lot more efficient than a Lookup. The main reason is that Merge Join streams both inputs rather than taking time to pre-cache values. The streaming logic was further improved in SSIS 2012 as well: the Merge Join now limits how far ahead one input can get (in terms of buffers) when one source is a lot faster than the other.
Keep the following things in mind when considering this approach:
- Both inputs must be sorted. Ideally, the sort can be pushed into the source query (there's a sketch of this after the list). If the data isn't already sorted (e.g. there's no index on the join keys), the cost of the sort might outweigh the benefits of this approach.
- A source component doesn't finish until it has read all of its data, so if your incoming data has a small number of rows and you're joining it against a much larger data set (which is the case in this particular customer's scenario), the Merge Join approach isn't going to be ideal. A Partial cache Lookup tends to work best in these types of scenarios.
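If you do go the Merge Join route, push the sort into the source queries, and then tell SSIS the data is already ordered by setting the IsSorted property on each source's output and SortKeyPosition on the join column (via the Advanced Editor), so the engine doesn't add Sort transforms. A sketch of the two source queries, again with placeholder table and column names:

```sql
-- Source 1: the incoming data, ordered on the join key
SELECT BusinessKey, Amount
FROM dbo.IncomingData
ORDER BY BusinessKey;

-- Source 2: the large reference table, ordered on the same key
SELECT BusinessKey, SurrogateKey
FROM dbo.Reference
ORDER BY BusinessKey;
```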
Do you have any other options for dealing with this scenario?
Similar design patterns can be found in the SQL Server 2012 Integration Services Design Patterns book available from Apress.