Using Excel to easily coalesce data

Using Excel to easily coalesce data

February 2023 update:

This is a task that can be accomplished many ways, but I still find that understanding these core concepts is valuable to working with lists of data and many business systems that make use of this kind of data. If the google analytics are to be trusted, a surprising number of people still referenced this post. Even a decade after I originally posted it. Go figure!

So it's with this in mind that I'm re-posting to my consolidated blog, since I recently sunset the original site.

Original post from October 2012:

Let's imagine you're working with a list of Contacts that lists the company that they work at, doing a little clean-up before it gets used.  Only some of the Contacts have an address listed, but they all have a Company address listed.  This needs to be updated so each Contact record has an address.

A list with data gaps to fill in with "something"--I've seen it a hundred times (or in some lists, a hundred thousand times)

In this case, the business requirements are:

  • We need to have one address on file for each contact
  • If we have that specific contact's address...great, use that
  • If we don't have a contact's address, use their Company's address instead

Doing a simple Copy/paste could be your friend if the list is only a handful of Contacts long...but this can be tedious if you're looking at dozens (or more).

I was doing something very similar to this for a client recently.  My go-to tool for this is really makes this problem simple to solve.  Come to think of it I use this same concept in a lot of different places when doing data prep--so I'll run with this example, but recognize that the same method can be used elsewhere in Excel.

OK I'm interested, what do I do?

Fortunately, the requirements above give us exactly the direction we need.  We only need one address for each contact, and each row has as either one or two.  We know that one address wins out (the contact's address) if it's there.  So from a list standpoint, we need to do the following:

  • Devise a systematic way to determine the correct address
  • Choose the correct address for each contact
  • Eliminate the extraneous data

Accomplishing this isn't too much work, but it does involve an Excel formula that's going to look like Greek when you first see it.  We'll get there in a second, but indulge me for a slight diversion.

When working with a list of data, my best practice in Excel is to keep the source data as pristine as possible.  This means I don't like to add/delete columns in the original list.  This ends up being important later on...when somebody comes with an updated list, or a slight variation of what rules to use when identifying the desired data.  If you've made your changes directly in the original workbook it becomes much more likely that you'll have to start from square one, whereas my method allows you to more easily sub in new data or change one of the underlying rules.

To accomplish this, I'm using two Worksheets (tabs) in one Excel Workbook (file).  "Raw List" (the raw list of data) and "Final Product" (where our modified version of the data will go).  When you see formulas such as ='Raw List'!C3 that really means this cell should equal the cell C3 in the "Raw List" worksheet.

Hopefully that little explanation will keep this next screenshot from looking like total hoodoo.

Excel Screenshot
Don't run away yet!

We're going to use this formula for each address field (Street 1, Street 2, City, State, Zip).  We're also going to make the assumption that if the "Contact Street 1" field is blank, that it is not a valid address and the Company address should be used (make sure this is a valid assumption based on your situation :)).  This translates to "If the Contact's Street value is blank, use Company Street, otherwise use Contact Street".

Adding the visual context in Excel:

Excel Formula Diagram

Sweet, now we're cooking.  Next, we want to use this same formula for each of the rest of the Address fields.

It looks so purrrrdy....

Note that when we're checking the subsequent fields like City, State, etc, we're always checking the Contact Street value.  This is important because we don't want to use parts of both addresses...we want the whole contact address or the whole company address.

At this point, we've done the most complicated part.  The other fields (like name, company name, etc) don't need a funky formula, they just need to be identified and added as columns in the data set.  You can take some time to review how this is done by downloading the Excel file I worked on, which has all of the formulas preserved on each of the two worksheets.

Great, so what's next?

From here, there are a few common things that I'll do as next steps:

  • Save results to a separate file, then save it as a new file.  This is a great way to force only the "final" version of the data is available moving forward.  A basic copy/paste has some problems since the formulas reference the original data set (and would start showing errors).  Below are two common approaches.
  1. One option for doing this is to do a copy/paste values into a new workbook.  This is great when you want to ensure there are no references to the original set of data
  2. An alternative is to copy the entire worksheet to a new workbook (by right-clicking on the worksheet name).  This will also update the formula references to look back to the original file
Note that either of these options is acceptable if the new file will be saved as a .csv file, since Excel will remove the incompatible features (formulas) when you click Save.  A .csv file is great if you want to import the data into Dynamics CRM.
  • Clean up the original workbook, making it less potentially-confusing to the next person to touch it.  Below are a couple of clean up ideas:
  1. Hide the "Raw List" worksheet (tab) so it isn't selected by accident
  2. Freeze the column headers in the Final Product worksheet by clicking Freeze Panes->Freeze Top Row on the View tab
  3. Protect the list from being accidentally modified by clicking Protect Sheet on the Review tab

Keep in mind, there are many other ways to accomplish this same task.  If there's another method you'd like to share, feel free to do so in the comments.