In this lab, you will be introduced to two data sets that we will use in subsequent labs. The goal will be to successfully read both data sets in using SAS and R. In doing so, you will familiarize yourself with the data sets, several R and SAS help pages, and a few commonly used SAS procedures and R functions.
The Drug Overdose Data Set contains provisional counts for drug overdose deaths based on a current flow of mortality data in the National Vital Statistics System. The Data Set includes reported and predicted provisional counts of deaths due to drug overdose occurring nationally and in each jurisdiction (50 states and the District of Columbia). Counts represent the numbers of deaths due to drug overdose occurring in the 12-month periods ending in the month indicated. The Drug Overdose Data Set was obtained from the Data.gov which contains hundreds of thousands of freely available data sets, mostly from federal and other governmental organizations. Data.gov also provides some built-in tools for data analysis and visualization. More information about the Drug Overdose Data Set can be found at the following page:
https://catalog.data.gov/dataset/vsrr-provisional-drug-overdose-death-counts-54e35#
Citation: Ahmad FB, Rossen LM, Spencer MR, Warner M, Sutton P. Provisional drug overdose death counts. National Center for Health Statistics. 2018
Variables included in the Drug Overdose Data Set (and values each variable may take) are as follows:
Variable Name | Type | Length | Description |
---|---|---|---|
state | Text | 2 | Two letter state/jurisdiction abbreviation |
state_name | Text | 2 | State/jurisdiction name or abbreviation |
year | Text | 4 | Year (2015, 2016, 2017, or 2018) |
month | Text | 9 | Month (September, October, November, etc.) |
indicator | Text | 45 | Description of the quantity contained within data value |
data value | Num | 8 | Numeric value described by indicator |
predicted value | Num | 8 | Predicted value adjusting for reporting delays |
percent complete | Text | 5 | Percent completeness of data (100, 99.5+), used as a quality metric |
percent pending investigation | Num | 8 | Percent with cause of death pending investigation, used as a quality metric |
footnote | Text | 36 | Notes regarding potential underreporting, used as a quality metric |
There are 10,108 observations in the Drug Overdose Data Set contained within the VSRR_Provisional_Drug_Overdose_Death_Counts.csv file. The first four lines of VSRR_Provisional_Drug_Overdose_Death_Counts.csv are as follows:
State,State Name,Year,Month,Indicator,Data Value,Predicted Value,Percent Complete,Percent Pending Investigation,Footnote
AK,AK,2015,January,Number of Deaths,4034,,100,0,
AK,AK,2015,February,Number of Deaths,4084,,100,0,
AK,AK,2015,March,Number of Deaths,4101,,100,0,
Using the information provided above, write a SAS data
step to read in the VSRR_Provisional_Drug_Overdose_Death_Counts.csv file. Since the first line of the .csv file contains a list of the variable names, you will need to use the firstobs
option on the infile
statement to instruct SAS to start reading data in on the second line of the .csv file.
data overdose;
infile "/folders/myfolders/Lab 1/VSRR_Provisional_Drug_Overdose_Death_Counts.csv" dsd firstobs=2;
input state $
state_name $
year $
month $
indicator $
data_value
predicted_value
percent_complete $
percent_pending_investigation
footnote $;
run;
Now that we have successfully read in the Drug Overdose Data Set, let's suppose we want to print only the first 15 observations of the data set. How could this be accomplished?
proc print data=overdose (obs = 15);
run;
Does it appear as if some of the character variables have been truncated, meaning that only a certain number of characters were read in? By default, SAS reads and stores character input variables using 8 bytes. How can we avoid this truncation?
There are a few different ways to avoid this truncation. Perhaps the simplest is using a length
statement before the input
statement.
data overdose;
length state $ 2
state_name $ 2
year $ 4
month $ 9
indicator $ 45
percent_complete $ 5
footnote $ 36;
infile "/folders/myfolders/Lab 1/VSRR_Provisional_Drug_Overdose_Death_Counts.csv" dsd firstobs=2;
input state $
state_name $
year $
month $
indicator $
data_value
predicted_value
percent_complete $
percent_pending_investigation
footnote $;
run;
Now that we have read in the Drug Overdose Data Set without truncating character variable values, let’s print the first 15 observations once again but this time let’s only print the variables year, month, and indicator.
proc print data=overdose (obs = 15);
var year month indicator;
run;
By adding the length statement preceding the infile
statement we have fixed our truncation problem.
The University Rankings Data Set contains U.S. News & World Report 2017 Best Colleges rankings. These rankings are produced and published annually to help students worldwide compare the academic quality of U.S.-based schools. Princeton University takes the top spot in the Best National Universities category.
Variables included in the University Rankings Data Set are the following:
Name
Location
Rank
Tuition and fees
In-state
Undergrad Enrollment
There are 232 observations in the University Rankings Data Set contained within the US_University_Rankings.csv file. The first four lines of US_University_Rankings.csv are look as follows:
Name,Location,Rank,Tuition and fees,In-state,Undergrad Enrollment
Princeton University,"Princeton, NJ",1,"$45,320 ",,"5,402"
Harvard University,"Cambridge, MA",2,"$47,074 ",,"6,699"
University of Chicago,"Chicago, IL",3,"$52,491 ",,"5,844"
When you open this file you notice that UCLA is ranked 24th. In the recently published 2018 rankings, UCLA has jumped up in the rankings and now occupies the 19th spot. Go Bruins!
Using the information provided above, write a SAS data step to read in the US_University_Rankings.csv file
data rankings;
infile "/folders/myfolders/Lab 1/US_University_Rankings.csv" dsd;
input name $
location $
rank
tuition_and_fees $
in_state $
undergrad_enrollment $;
run;
After reading in the University Rankings Data Set print the first 25 observations using a proc print
procedure.
proc print data=rankings (obs = 25);
run;
Did the data set get read in correctly? Can we use firstobs
to tell SAS to ignore the first line of the US_University_Rankings.csv file, corresponding to the variable names, and to read observations starting with the second line? Can we make sure that truncation doesn't occur using a length
statement?
To deal with other issues that might arise when reading in data in SAS, it is a good idea to try familiarizing yourself with the following UCLA IDRE SAS Help Page: https://stats.idre.ucla.edu/sas/faq/what-are-some-common-options-for-the-infile-statement-in-sas/
data rankings;
length name $ 50
location $ 50;
infile "/folders/myfolders/Lab 1/US_University_Rankings.csv" dsd firstobs=2;
input name $
location $
rank
tuition_and_fees $
in_state $
undergrad_enrollment $;
run;
After having successfully read in the University Rankings Data Set skipping the first line containing the variable names, use a proc print
to verify and a proc contents
to examine the attributes associated with this newly created SAS data set.
proc print data=rankings (obs = 25);
run;
proc contents data=rankings;
run;
If instead, we wanted to print the 5 most expensive universities, we would use the following SAS program:
proc sort data=rankings;
by descending tuition_and_fees ;
run;
proc print data=rankings (obs = 5);
run;
Notice USC appears on this short list of expensive universities! In the future, we will use the proc sort
procedure frequently in lab and lecture.
In RStudio, create a new R Script by going to File -> New File -> R Script
. Write and execute your code from within this R Script.
Using the information provided above, write an R command to read in the VSRR_Provisional_Drug_Overdose_Death_Counts.csv file. Note: R does not accept “\” in path names. Instead, you will need to replace “\” with “/” or “\” when using path names in R. As a reminder, R is case sensitive.
Solution:
overdose <- read.table("Z:/myfolders/Lab 1/VSRR_Provisional_Drug_Overdose_Death_Counts.csv", sep=",", header=TRUE)
After reading in the Drug Overdose Data Set print the first 15 observations using the head
function. To learn more about the head
function, execute the following command in the command line:
help(head)
Now execute the head
function:
head(overdose,15)
Notice that R has assigned variable names (or, as R refers to them, column names) to the overdose
object using information from the first row in VSRR_Provisional_Drug_Overdose_Death_Counts.csv (which R refers to as the header
). Since column names are not allowed to have spaces, R has automatically named columns using '.' to take the place of the space it encountered when reading data in.
For instance, Predicted Value
was named Predicted.Value
.
Suppose we would like to change the variable (column) names for this R data set to match the variable names we used previously (when reading in the SAS data set). Use the following to explore use of the colnames
function to make this change.
help(colnames)
Now execute the colnames
function:
colnames(overdose) <- c("state","state_name","year","month",
"indicator","data_value","predicted_value",
"percent_complete","percent_pending_investigation","footnote")
Make sure the function behaved as intended by executing either of the following commands:
head(overdose)
colnames(overdose)
Using the information provided above, write an R command to read in the US_University_Rankings.csv file. Note: you may want to consider an alternate version of the read.table command called read.csv
.
help(read.csv)
rankings <- read.csv("Z:/myfolders/Lab 1/US_University_Rankings.csv", header=TRUE)
After reading in the University Rankings Data Set, print the last 15 observations using the tail
function.
tail(rankings,15)
You should now feel comfortable reading in both the Drug Overdose and University Ranking Data Sets in SAS and R. We will return to these data sets in future labs.