So today I noticed for the first time that there is a very annoying subtle, quirky behavior with how SQL Server imports excel spreadsheets. Yesterday I had imported a spreadsheet that my co-worker sent me (I originally provided her with a CSV file, and she made her changes, and re-saved as an excel spreadsheet (XLS)). I also imported a related file, that I was asked to merge with the data from the excel file. Pretty easy, two minute job -- or so I thought.
Well, today I was running a query to create a summary of the data, when I noticed that some of the APNs (Assessor's Parcel Number) were NULL. I was thinking "wtf?" So I double-checked the results, to make sure I was looking at records that were supposed to have the APN populated, and everything checked out. Some stress/panic set in. The next step, then, was to find out at what point the APNs got NULLed. I initially figured that somehow the individual fields got scrambled when my co-worker modified the data. A quick check on the excel spreadsheet would reveal whether this was true. I opened up the file, and all the APNs were there -- although some of them did get mangled by excel's formatting.
More stress. So then I had to backtrack one step further; by checking the results I gave to my co-worker. Maybe, I thought, I somehow fucked up the sql view. But everything was fine there too! Now I was really getting stressed, and so I started scratching my head (my head gets itchy when I'm stressing).
What could it be? I figured, if it's not my co-workers file, if it's not the file I sent her, then it has to be that somehow the data is getting corrupted during the import process. Maybe somehow I missed an error message when I imported the data? So I tried re-importing the data into a brand new table, and the exact same thing happened -- all the records got imported, and the import wizard displayed a 'import successful' message. But clearly, the data was still corrupted.
I tried to determine which fields were affected by comparing her modified data with the original data by using SQL EXCEPT. I executed the query, and compared one of the records against the original record to try to see where the difference lied, and as I found which field had problems, I'd remove it and start over again. I ended up with 7 or 8 fields that were corrupted. If it had just been the APNs that were affected, I could have just replaced the mangled data with the original data, but there were too many fields affected, and there was no unique field to merge the individual record back to the original data. I'd have to find several fields to uniquely identify a record, but that's painful, and there wasn't enough time to do it anyway.
I did not have a lot more time to research the issue, so I ended up re-using my original data, and merging it to the additional data my co-worker sent.
After I got some free time on my hands, I did some googling, and found on Microsoft's Books Online an import/export wizard sub-topic (
Excel Source) on importing excel data. And this is where I found my little nugget of information under the "Usage Considerations" heading:
"
Missing values. The Excel driver reads a certain number of rows (by default, 8 rows) in the specified source to guess at the data type of each column. When a column appears to contain mixed data types, especially numeric data mixed with text data, the driver decides in favor of the majority data type, and returns null values for cells that contain data of the other type. (In a tie, the numeric type wins.) Most cell formatting options in the Excel worksheet do not seem to affect this data type determination. You can modify this behavior of the Excel driver by specifying Import Mode. To specify Import Mode, add IMEX=1 to the value of Extended Properties in the connection string of the Excel connection manager in the Properties window. For more information, see PRB: Excel Values Returned as NULL Using DAO OpenRecordset."The problem is the user is not given, at the very least, a warning that this has happened. Who knows how many excel spreadsheets I've imported, and I've never received any type of error or warning about it. I trusted the import wizard that it copied the data exactly as it is found on the spreadsheet. What shocks me is that I did not notice until today.