It probably seems like I complain a lot about SAS (I have), so today I’ll write about something I’ve learned from SAS that is genuinely useful and a huge time saver when building a linear regression model. Proc GLMSelect makes variable selection and transformation really easy.
Specifically, the EFFECT statement takes care of a number of transformations I’d normally do by hand. The polynomial option in particular makes quick work of crossing variables, raising them to powers, and standardizing them.
So, as an example, the code:
proc glmselect data = myData;
effect myPoly = polynomial(x1 x2 x3 / degree=3);
…
yields variables such as:
x1, x2, x3, x1*x2, x1*x3, x2*x3, x1^2, x2^2, x3^2, x1^3, x2^3, x3^3, x1^2*x2, and it continues on and on.
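If you want to see exactly which terms an expansion like this produces, here’s a quick sketch (in Python, purely for illustration — SAS builds these columns internally, and options like standardization affect what it actually fits):

```python
from itertools import combinations_with_replacement

# Enumerate every monomial of total degree 1 through 3 in x1, x2, x3,
# mirroring what polynomial(x1 x2 x3 / degree=3) expands to.
variables = ["x1", "x2", "x3"]
terms = [
    "*".join(combo)
    for degree in range(1, 4)
    for combo in combinations_with_replacement(variables, degree)
]
print(len(terms))  # 19 terms
print(terms[:6])   # ['x1', 'x2', 'x3', 'x1*x1', 'x1*x2', 'x1*x3']
```

Three variables at degree 3 already give 19 terms, which is why doing this by hand gets old fast.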
You can see how this saves A LOT of time when you have a large number of variables to include and you suspect their interactions matter. For instance, in a project attempting to predict housing prices I used the following polynomial:
- house_size = living_area, num_floors, r_total_rms, r_bdrms, r_full_bth, land_sf, R_HALF_BTH, R_FPLACE, living_area_log, land_log, num_floors_log, total_rms_log, bdrms_log, full_bath_log
If you’re not into counting, that’s 14 variables. I had a few other polynomials as well; in total, 67 variables went into Proc GLMSelect. With the polynomials expanded in the EFFECT statements, SAS considered more than 4,800 candidate variables, including all the powers and interaction terms. That would have taken quite a while to transform by hand, and I’m not sure how long Proc Reg would have taken to evaluate 4,800+ variables.
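As a rough sanity check on how fast these expansions grow, the standard monomial-counting formula works (this is a back-of-the-envelope figure, not SAS’s exact effect count, which depends on the options used):

```python
from math import comb

def n_poly_terms(n_vars: int, degree: int) -> int:
    # Distinct monomials of total degree 1..degree in n_vars variables:
    # C(n_vars + degree, degree) - 1, where the -1 drops the constant term.
    return comb(n_vars + degree, degree) - 1

print(n_poly_terms(3, 3))   # the x1 x2 x3 example: 19 terms
print(n_poly_terms(14, 3))  # the 14-variable house_size polynomial: 679 terms
```

A single 14-variable, degree-3 polynomial already yields 679 terms, so a handful of such polynomials over 67 variables easily pushes past a few thousand candidates.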
So there you have it. I’ve written a positive post about SAS. I’m sure the more I use SAS, the more things I’ll find that I like. But I will say, the program I wrote for the housing price predictions was nearly 800 lines long. I know the same program in R would have been fewer than 200 lines, so there is that. Dang it. I couldn’t just end it without the R comparison. Sigh.