August 2007 – The Coding Pad

Sometimes when working with large excel spreadsheets, you run into the problem of duplicates, and it helps to be able to find them and delete them without having to manually go through the spreadsheet. If you have two or more columns of data, and you need to find and delete the duplicates, this is one way to do it. Say you have all your data in columns A and B:

Create a column that concatenates the information in A and B. You can either use the “CONCATENATE” function or use the ampersand (&) by entering the formula A1&B1 to cell C1. Remember, as with all formulae, you have to use the = sign before the formula, thus =A1&B1.
Copy and paste the formula down the column to the end of your data. For example, if your data runs from rows 1 to 120, copy and paste your data down to cell C120. An easy way to do this is, in Windows, select from the cell where your formula is to the last cell, and press Ctrl+D, or Edit -> Fill Down. For a mac, I’m not sure there’s a short-cut, except using Edit -> Fill Down.
The next step is to find and mark the duplicate entries. Go to column D, and in the first cell enter the formula enter the formula =IF(COUNTIF($C$1:C1,C1)>1,”Duplicate”,”Unique”
Copy and paste this formula down the column to the end of your data using the same process as in step 2.
This column now shows you what rows have duplicates and which have uniques. You now want to delete, or do something else to just the duplicates, so you want to sort your spreadsheet by duplicates and uniques. You can’t do this with your spreadsheet as it is now because of the formula, so copy all the data to a new worksheet, but you only want to paste values, not the formulas, so that you can sort them by column C. Use Edit -> Paste Special -> Values only.
Sort your spreadsheet using Column C, so that you all the duplicates are at the end of your spreadsheet, and once you delete them, or do whatever you want to with them, you can then delete column C.

This leaves you with only unique entries in your worksheet. You can extend this method to multiple columns for more complex spreadsheets and it should work fine.

Happy Coding!

In my last few posts I explained in a four-part tutorial how to convert a csv file to a sql file. I am going to summarize the steps in this post so that you have them all in one reference post. The tools that I prefer for this process are PHPMyAdmin and HeidiSQL, and of course your csv file. The assumption is that you have Apache, PHP, and MySQL all running on your machine, and you can log in as root admin.

Create a mySQL database and then create a table that corresponds to your csv file. Make sure that the fields correspond, and I recommend that you create an autoincrement unique ID field as your primary field, even if it doesn’t exist in your csv file. Read more on Creating the Database.
Using HeidiSQL, import your csv file into the database table that you have created, making sure to deselect the ID field if it doesn’t exist in your csv file. Read more on Importing the csv file.
Look through your newly populated database table to ensure that the data has all been imported correctly and that you have the correct number of entries. The best tool for this is PHPMyAdmin.
Finally, create the sql dump file using the export to sql command in either PHPMyAdmin or HeidiSQL. Make sure to select the correct exporting options depending on whether your sql dump file will create the database and populate it or will be updating an already existing database. Read more on Creating the sql dump file.

I hope that you have found the series of tutorials helpful, and that you are more comfortable working with all the database tools that I have been talking about.

Happy Coding!

Month: August 2007

Finding and Deleting Duplicates in Excel

From csv to sql – in brief