This week in learning SAS we began working with Proc Fastclus, a procedure that groups observations into clusters based on two or more variables. This is particularly useful for data that show little linear relationship. Once you’ve identified clusters, it helps to find characteristics shared within each cluster to aid in predicting outcomes for other observations with similar characteristics.
For example, we were working with prescription details of opiate medications vs. other types from Medicare providers in NC. Each dot is a provider, and the plot indicates the ratio of opioid prescriptions (x-axis) to nonopioid prescriptions (y-axis) by the total days of supply in the prescription. When plotted, the data looks like this:
Proc Fastclus takes a number (k) of clusters set by the analyst, places that many “centroids” in the data, and assigns each point to the cluster whose centroid is nearest. Using Proc Fastclus we re-plot the data, marking each point by the cluster it was assigned. We get the resulting scatter plot:
Now it’s much easier to see some patterns in the plot and describe the data:
- Cluster 1 – Outliers of high ratio of opioids to nonopioids
- Cluster 2 – The larger group that has a high opioid/nonopioid prescription rate
- Cluster 3 – Those that have the higher nonopioid/opioid rate
- Cluster 4 – The outliers in the high nonopioid/opioid rate
- Cluster 5 – The vast majority of providers that have relatively low rates of both opioid and nonopioid prescriptions
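For reference, the clustering and re-plot described above could be sketched roughly as follows. The data set name `work.providers` and the variable names `opioid_rate` and `nonopioid_rate` are placeholders I made up for illustration, not the names from the class data, and `maxiter=100` is just an illustrative limit.

```sas
/* Sketch only: data set and variable names are hypothetical. */
proc fastclus data=work.providers   /* one row per provider */
              maxclusters=5         /* k, the number of clusters chosen by the analyst */
              maxiter=100           /* illustrative cap on centroid-update iterations */
              out=work.clustered;   /* input rows plus CLUSTER and DISTANCE variables */
   var opioid_rate nonopioid_rate;  /* the variables used to form the clusters */
run;

/* Re-plot the same two variables, grouped by assigned cluster */
proc sgplot data=work.clustered;
   scatter x=opioid_rate y=nonopioid_rate / group=cluster;
run;
```

The `out=` data set is what makes the second plot possible: it carries a `CLUSTER` variable holding each provider's assigned cluster number.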
Now we could explore each cluster to find commonalities in the providers – Are they of similar specialties? organizations? geographic areas? – that could be used to predict future prescription rates.
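One way to start that exploration, assuming the clustered output is in a data set called `work.clustered` and has a provider specialty column (here hypothetically named `specialty`), is a simple cross-tabulation plus per-cluster summary statistics. The rate variable names are the same made-up placeholders.

```sas
/* Sketch only: work.clustered, specialty, and the rate variables are hypothetical names. */

/* How are specialties distributed across the clusters? */
proc freq data=work.clustered;
   tables cluster*specialty / nocol nopercent;  /* keep counts and row percentages */
run;

/* Summary statistics of the clustering variables within each cluster */
proc means data=work.clustered n mean min max;
   class cluster;
   var opioid_rate nonopioid_rate;
run;
```

The same `tables` statement could be repeated for organization or geographic area if those columns are available.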
This could be very useful to me as I have a very similar looking scatter plot when I look at response vs. return on investment (ROI) for marketing efforts sent from our office:
So if I run Proc Fastclus with 10 clusters and replot:
I may need to fine-tune a little, but at least now it’s easy to see some of the commonalities. Just at a quick glance, clusters 7 and 8 appear to be the efforts that had a negative ROI. It also seems that ROI was the most important variable in determining the clusters. It’s certainly a start. Time to keep working with Proc Fastclus and see how I can help improve our direct marketing at work.
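As a possible first step in that fine-tuning: Proc Fastclus works from Euclidean distances, so a variable measured on a much larger scale can dominate the clusters, and that may be part of why ROI looks so influential here. A rough sketch of standardizing both variables before clustering is below. The data set `work.efforts` and the variables `response` and `roi` are placeholder names, and I have not confirmed that scale is actually the cause in my data.

```sas
/* Sketch only: data set and variable names are hypothetical. */

/* Put both variables on a common scale (mean 0, standard deviation 1) */
proc stdize data=work.efforts out=work.efforts_std method=std;
   var response roi;
run;

/* Cluster the standardized values into 10 groups */
proc fastclus data=work.efforts_std maxclusters=10 maxiter=100
              out=work.efforts_clus;
   var response roi;
run;

/* Replot, grouped by cluster (axes are now in standardized units) */
proc sgplot data=work.efforts_clus;
   scatter x=response y=roi / group=cluster;
run;
```

If the clusters still split mainly along ROI after standardizing, that would suggest the pattern is real rather than an artifact of scale.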