Lab 1: Introduction to SAS and R

In this lab, you will be introduced to two data sets that we will use in subsequent labs. The goal will be to successfully read both data sets in using SAS and R. In doing so, you will familiarize yourself with the data sets, several R and SAS help pages, and a few commonly used SAS procedures and R functions.

The Drug Overdose Data Set

The Drug Overdose Data Set contains provisional counts for drug overdose deaths based on a current flow of mortality data in the National Vital Statistics System. The Data Set includes reported and predicted provisional counts of deaths due to drug overdose occurring nationally and in each jurisdiction (50 states and the District of Columbia). Counts represent the numbers of deaths due to drug overdose occurring in the 12-month periods ending in the month indicated. The Drug Overdose Data Set was obtained from the Data.gov which contains hundreds of thousands of freely available data sets, mostly from federal and other governmental organizations. Data.gov also provides some built-in tools for data analysis and visualization. More information about the Drug Overdose Data Set can be found at the following page:

https://catalog.data.gov/dataset/vsrr-provisional-drug-overdose-death-counts-54e35#

Citation: Ahmad FB, Rossen LM, Spencer MR, Warner M, Sutton P. Provisional drug overdose death counts. National Center for Health Statistics. 2018

Variables included in the Drug Overdose Data Set (and values each variable may take) are as follows:

Variable Name Type Length Description
state Text 2 Two letter state/jurisdiction abbreviation
state_name Text 2 State/jurisdiction name or abbreviation
year Text 4 Year (2015, 2016, 2017, or 2018)
month Text 9 Month (September, October, November, etc.)
indicator Text 45 Description of the quantity contained within data value
data value Num 8 Numeric value described by indicator
predicted value Num 8 Predicted value adjusting for reporting delays
percent complete Text 5 Percent completeness of data (100, 99.5+), used as a quality metric
percent pending investigation Num 8 Percent with cause of death pending investigation, used as a quality metric
footnote Text 36 Notes regarding potential underreporting, used as a quality metric

There are 10,108 observations in the Drug Overdose Data Set contained within the VSRR_Provisional_Drug_Overdose_Death_Counts.csv file. The first four lines of VSRR_Provisional_Drug_Overdose_Death_Counts.csv are as follows:

State,State Name,Year,Month,Indicator,Data Value,Predicted Value,Percent Complete,Percent Pending Investigation,Footnote
AK,AK,2015,January,Number of Deaths,4034,,100,0,
AK,AK,2015,February,Number of Deaths,4084,,100,0,
AK,AK,2015,March,Number of Deaths,4101,,100,0,

Reading the Nursery Data Set in Using SAS

Using the information provided above, write a SAS data step to read in the VSRR_Provisional_Drug_Overdose_Death_Counts.csv file. Since the first line of the .csv file contains a list of the variable names, you will need to use the firstobs option on the infile statement to instruct SAS to start reading data in on the second line of the .csv file.

In [ ]:
data overdose;
 infile "/folders/myfolders/Lab 1/VSRR_Provisional_Drug_Overdose_Death_Counts.csv" dsd firstobs=2;
  input state $
        state_name $ 
        year $ 
        month $
        indicator $
        data_value 
        predicted_value 
        percent_complete $
        percent_pending_investigation
        footnote $;
run;

Now that we have successfully read in the Drug Overdose Data Set, let's suppose we want to print only the first 15 observations of the data set. How could this be accomplished?

In [ ]:
proc print data=overdose (obs = 15);
run;

Does it appear as if some of the character variables have been truncated, meaning that only a certain number of characters were read in? By default, SAS reads and stores character input variables using 8 bytes. How can we avoid this truncation?

There are a few different ways to avoid this truncation. Perhaps the simplest is using a length statement before the input statement.

In [ ]:
data overdose;
 length state $ 2
        state_name $ 2
        year $ 4
        month $ 9
        indicator $ 45
        percent_complete $ 5
        footnote $ 36;
 infile "/folders/myfolders/Lab 1/VSRR_Provisional_Drug_Overdose_Death_Counts.csv" dsd firstobs=2;
  input state $
        state_name $ 
        year $ 
        month $
        indicator $
        data_value 
        predicted_value 
        percent_complete $
        percent_pending_investigation
        footnote $;
run;

Now that we have read in the Drug Overdose Data Set without truncating character variable values, let’s print the first 15 observations once again but this time let’s only print the variables year, month, and indicator.

In [ ]:
proc print data=overdose (obs = 15);
 var year month indicator;
run; 

By adding the length statement preceding the infile statement we have fixed our truncation problem.

The University Rankings Data Set

The University Rankings Data Set contains U.S. News & World Report 2017 Best Colleges rankings. These rankings are produced and published annually to help students worldwide compare the academic quality of U.S.-based schools. Princeton University takes the top spot in the Best National Universities category.

Variables included in the University Rankings Data Set are the following:

   Name      
   Location
   Rank      
   Tuition and fees
   In-state      
   Undergrad Enrollment

There are 232 observations in the University Rankings Data Set contained within the US_University_Rankings.csv file. The first four lines of US_University_Rankings.csv are look as follows:

Name,Location,Rank,Tuition and fees,In-state,Undergrad Enrollment
Princeton University,"Princeton, NJ",1,"$45,320 ",,"5,402"
Harvard University,"Cambridge, MA",2,"$47,074 ",,"6,699"
University of Chicago,"Chicago, IL",3,"$52,491 ",,"5,844"

When you open this file you notice that UCLA is ranked 24th. In the recently published 2018 rankings, UCLA has jumped up in the rankings and now occupies the 19th spot. Go Bruins!

Reading the University Rankings Data Set in Using SAS

Using the information provided above, write a SAS data step to read in the US_University_Rankings.csv file

In [ ]:
data rankings;
  infile "/folders/myfolders/Lab 1/US_University_Rankings.csv" dsd;
  input name $ 
        location $ 
        rank 
        tuition_and_fees $
        in_state $
        undergrad_enrollment $;
run;

After reading in the University Rankings Data Set print the first 25 observations using a proc print procedure.

In [ ]:
proc print data=rankings (obs = 25);
run;

Did the data set get read in correctly? Can we use firstobs to tell SAS to ignore the first line of the US_University_Rankings.csv file, corresponding to the variable names, and to read observations starting with the second line? Can we make sure that truncation doesn't occur using a length statement?

To deal with other issues that might arise when reading in data in SAS, it is a good idea to try familiarizing yourself with the following UCLA IDRE SAS Help Page: https://stats.idre.ucla.edu/sas/faq/what-are-some-common-options-for-the-infile-statement-in-sas/

In [ ]:
data rankings;
  length name $ 50
         location $ 50;
  infile "/folders/myfolders/Lab 1/US_University_Rankings.csv" dsd firstobs=2;
  input name $ 
        location $ 
        rank 
        tuition_and_fees $
        in_state $
        undergrad_enrollment $;
run;

After having successfully read in the University Rankings Data Set skipping the first line containing the variable names, use a proc print to verify and a proc contents to examine the attributes associated with this newly created SAS data set.

In [ ]:
proc print data=rankings (obs = 25);
run;

proc contents data=rankings;
run;

If instead, we wanted to print the 5 most expensive universities, we would use the following SAS program:

In [ ]:
proc sort data=rankings;
 by descending tuition_and_fees ;
run;

proc print data=rankings (obs = 5);
run;

Notice USC appears on this short list of expensive universities! In the future, we will use the proc sort procedure frequently in lab and lecture.

Reading the Nursery Data Set in Using R

In RStudio, create a new R Script by going to File -> New File -> R Script. Write and execute your code from within this R Script.

Using the information provided above, write an R command to read in the VSRR_Provisional_Drug_Overdose_Death_Counts.csv file. Note: R does not accept “\” in path names. Instead, you will need to replace “\” with “/” or “\” when using path names in R. As a reminder, R is case sensitive.

Solution:

overdose <- read.table("Z:/myfolders/Lab 1/VSRR_Provisional_Drug_Overdose_Death_Counts.csv", sep=",", header=TRUE)

After reading in the Drug Overdose Data Set print the first 15 observations using the head function. To learn more about the head function, execute the following command in the command line:

help(head)

Now execute the head function:

head(overdose,15)

Notice that R has assigned variable names (or, as R refers to them, column names) to the overdose object using information from the first row in VSRR_Provisional_Drug_Overdose_Death_Counts.csv (which R refers to as the header). Since column names are not allowed to have spaces, R has automatically named columns using '.' to take the place of the space it encountered when reading data in.

For instance, Predicted Value was named Predicted.Value.

Suppose we would like to change the variable (column) names for this R data set to match the variable names we used previously (when reading in the SAS data set). Use the following to explore use of the colnames function to make this change.

help(colnames)

Now execute the colnames function:

colnames(overdose) <- c("state","state_name","year","month",
                        "indicator","data_value","predicted_value",
                        "percent_complete","percent_pending_investigation","footnote")

Make sure the function behaved as intended by executing either of the following commands:

head(overdose)
colnames(overdose)

Reading the University Rankings Data Set in Using R

Using the information provided above, write an R command to read in the US_University_Rankings.csv file. Note: you may want to consider an alternate version of the read.table command called read.csv.

 help(read.csv)
rankings <- read.csv("Z:/myfolders/Lab 1/US_University_Rankings.csv", header=TRUE)

After reading in the University Rankings Data Set, print the last 15 observations using the tail function.

tail(rankings,15)

You should now feel comfortable reading in both the Drug Overdose and University Ranking Data Sets in SAS and R. We will return to these data sets in future labs.