For anyone familiar with programming these words are said often, perhaps accompanied by some more colorful language. Such is the case while I learned to use PROC SQL in SAS. As I started I thought to myself, ‘I’ve written hundreds of SQL queries this shouldn’t be hard.’ I honestly think I’d have been better off without knowing SQL before hand because there some subtle differences between the language of PROC SQL and that in SQL Server Management Studio. Standard Query Language indeed. HA!
Let’s start with the simple act of renaming a variable or column header. Here’s how I’d write this statement in SQL Server Management Studo:
SELECT VAR1 AS ‘My_New_Name’
So I did this in my first PROC SQL snippet, quite a bit. Highlight, Run:
What do you mean expecting a name, SAS? It’s right there! You even underlined it. Maybe there’s a stray comma somewhere (or one missing) that can throw off a SQL query. I checked…no stray commas. Maybe I have the variable names wrong. I checked…nope that’s not it. Oh wait, I see it now! I didn’t put the semicolon at the end of the SQL statement. That must be it. Let me fix that.
Alright let’s run that again.
Off to Google to see what’s going on here. Turns out this is a quirk in Microsoft’s SQL Server Management Studio – which is where I learned SQL and use it most often. Most every other place SQL leaves the variable alias unquoted just like PROC SQL. So subtle difference, lesson learned.
Just as soon as I got that fixed I ran into another problem – the bane of every data analysts existence – inconsistently formatted data. In this particular case a zip code variable with a mixture of 5-digit zips and 9-digit zips+4. This should be easy enough with the SUBSTR and PUT functions. So we use SUBSTR(PUT(zip_code, 9.), 1, 5). This should convert the zip code numbers into a character string and then starting at the first character return that and the next 5. Let’s take a look at that output:
It would appear that it’s treating those zip codes with 5-digits as if they have 4 zeros in the front. So it’s returning 00002. Time to Google…
Aha! We can use the COMPRESS function inside the SUBSTR and make sure those leading zeros don’t come into play. So instead we use SUBSTR(COMPRESS(PUT(zip_code, 9.)),1,5). Let’s run that.
There’s a supposed general 80/20 rule in the field of analytics. That is you spend 80% of your time on data processing, cleaning and manipulation just to get it prepared for the 20% of the time you do actual analysis. In my little time in the field this has certainly been the case, though sometimes it feels more like 99% of the time I’m prepping the data (that could be due to our particular database which is notoriously difficult to get data out of and has a ton of poorly formatted data.
So I’ve learned to pay attention to the small syntax differences in PROC SQL and SQL Server Management Studio AND to be careful with leading characters. Hopefully, this will help me cut down on the time I spend doing data prep and on to the fun stuff. Now on to the analysis!