How do I read data into R?

There is no such thing as an R system file similar to a Stata .dta or an SPSS .sav file. Instead, R reads data from a variety of formats, including files created in other statistical packages, directly into working memory. R generally lacks intuitive commands for data management, so users typically prefer to clean and prepare data with SAS, Stata, or SPSS. Once the data are ready, several functions are available for getting the data into R.

Reading Data Files in SPSS, Stata, and SAS Formats

The foreign package can be used to read data stored as SPSS .sav files, Stata .dta files, or SAS XPORT libraries. If foreign is not already installed on your local computer, then perform the following:

  1. Select "Packages > Install Package(s)" from the menu bar.
  2. If prompted, select the closest CRAN mirror.
  3. On the "Packages" dialog box, scroll down to select "foreign" and click "OK."
  4. To use the commands in foreign, attach the package using the "library" function. At the prompt, type "> library(foreign)".
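
The same steps can also be carried out from the R console instead of the menus. The following is a minimal sketch; it assumes an internet connection and a configured CRAN mirror if the package actually has to be downloaded.

```r
# Install foreign only if it is not already available, then attach it.
if (!requireNamespace("foreign", quietly = TRUE)) {
  install.packages("foreign")
}
library(foreign)

# Confirm that the package is attached to the search path.
print("package:foreign" %in% search())
```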

For example, if there is an SPSS file called "survey.sav" saved in the C:\mydata directory, then the "read.spss" function from the foreign package will read the file into R as:

> dataSPSS<-read.spss("C:/mydata/survey.sav", to.data.frame=TRUE)

This would create a data object called "dataSPSS" that is ready for analysis. Setting the "to.data.frame" argument, whose default value is FALSE, to TRUE tells R to return the data as a data frame rather than as a list.

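When specifying the pathname, use forward slashes, which R understands on every platform, rather than the single backward slashes that Windows displays. Inside an R string a backslash starts an escape sequence, so a literal backslash must be doubled ("C:\\mydata"). If you must read several data files from the same directory, then the amount of typing can be reduced by first setting the working directory and then using the relative pathname.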

For example:

> setwd("C:/mydata")
> dataSPSS<-read.spss("survey.sav", to.data.frame=TRUE)
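
The point about slashes in pathnames can be checked without any data file. The sketch below builds the same pathname three ways; the directory and file names are only the illustrative ones used in this section.

```r
# Forward slashes work in R on every platform, including Windows.
p_forward <- "C:/mydata/survey.sav"

# A backslash starts an escape sequence in an R string, so a literal
# backslash has to be typed twice.
p_escaped <- "C:\\mydata\\survey.sav"

# file.path() joins the pieces with forward slashes.
p_built <- file.path("C:", "mydata", "survey.sav")

# All three refer to the same location once the slashes are normalized.
same_built   <- identical(p_forward, p_built)
same_escaped <- identical(p_forward, gsub("\\", "/", p_escaped, fixed = TRUE))
print(c(same_built, same_escaped))
```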

To browse for a data file instead of typing its pathname, type:

> dataSPSS<-read.spss(file.choose(), to.data.frame=TRUE)

This will open a dialog box that can be used to navigate to the appropriate folder.

R will assume that any value labels recorded in the SPSS file refer to factors (categorical variables) and will store the labels rather than the original numbers. For example, a variable named "gender" may be coded 0=male and 1=female, and the labels are saved in the .sav file. When R reads in the data from SPSS, the values of the variable will be "male" and "female" instead of "0" and "1". This is the default behavior, but it can be changed in the call to the read.spss function as:

> dataSPSS<-read.spss(file.choose(), use.value.labels=FALSE)
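
To see what the two settings mean for a variable such as "gender", the sketch below builds the labelled and unlabelled versions by hand in base R. It does not read an SPSS file; the codes and labels are the hypothetical ones from the example above.

```r
# Raw codes as they would be stored in the data file (hypothetical values).
gender_codes <- c(0, 1, 1, 0)

# With value labels applied, the variable becomes a factor whose levels
# are the labels rather than the numbers.
gender_labelled <- factor(gender_codes, levels = c(0, 1),
                          labels = c("male", "female"))
print(gender_labelled)
print(table(gender_labelled))

# Without value labels, the variable stays numeric and can be used
# in arithmetic, for example the share of cases coded 1.
print(mean(gender_codes))
```

Factors are convenient for tabulations and for categorical predictors in models, while the numeric codes are easier to work with when the numbers themselves carry meaning.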

Reading Stata files is equally straightforward using the read.dta function. Assuming there is a Stata data file survey.dta in the C:\mydata folder, then the appropriate syntax is:

> dataStata<-read.dta("C:/mydata/survey.dta")

-OR-

> dataStata<-read.dta(file.choose())

The created object is automatically a data frame. The default is to convert value labels into factor levels, but this can be turned off by using the following syntax:

> dataStata<-read.dta(file.choose(), convert.factors=FALSE)
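
The effect of convert.factors can be tried without Stata by writing a small data frame with write.dta, which is also part of foreign, and reading it back. This is a sketch only: the data are made up, and the file goes to a temporary location.

```r
library(foreign)

# A small made-up data set with one labelled (factor) variable.
survey <- data.frame(
  gender = factor(c("male", "female", "female"), levels = c("male", "female")),
  age    = c(34, 29, 41)
)

# Write it to a temporary .dta file.
dta_file <- tempfile(fileext = ".dta")
write.dta(survey, dta_file)

# Default: value labels become factor levels.
with_labels <- read.dta(dta_file)
str(with_labels$gender)

# convert.factors=FALSE: the underlying numeric codes are kept instead.
codes_only <- read.dta(dta_file, convert.factors = FALSE)
str(codes_only$gender)

unlink(dta_file)
```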

NOTE: Stata sometimes changes how it stores data files from one version to the next, and the foreign package may lag behind. If the read.dta command returns an error, then save the data in Stata using the saveold command. This creates a .dta file in the format of an earlier version of Stata, which read.dta is more likely to recognize.

R can also read SAS XPORT libraries with the read.xport function. The function takes only a single argument with the pathname:

> dataXPORT<-read.xport("C:/mydata/survey")

The function returns a data frame if there is a single dataset in the library or a list of data frames if there are multiple datasets.
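
Because the return type depends on how many datasets the library holds, code that should work in both cases can normalize the result first. The helper below, as_dataset_list, is a hypothetical name and not part of foreign; the sketch uses stand-in objects rather than a real XPORT file.

```r
# Hypothetical helper: always return a list of data frames, whether
# read.xport() gave back a single data frame or a list of them.
as_dataset_list <- function(x) {
  if (is.data.frame(x)) list(x) else x
}

# Stand-ins for the two possible return values of read.xport().
single   <- data.frame(id = 1:2)
multiple <- list(A = data.frame(id = 1:2), B = data.frame(id = 3:5))

print(length(as_dataset_list(single)))
print(length(as_dataset_list(multiple)))
```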

Reading in ASCII Files

R can also easily read in space-, tab-, and comma-delimited text files. The read.table function handles the first two cases and read.csv handles the other. For example, an ASCII data file, survey.dat, includes white space that separates the values for each variable. The following syntax reads in this data:

> dataTEXT<-read.table("C:/mydata/survey.dat", header=TRUE, sep="")

The header argument tells R that the first row includes variable names. Its default is FALSE. The sep argument specifies how the values are separated. Its default, the empty string "", treats any run of white space (spaces, tabs, or newlines) as a separator, whereas sep=" " would split on every single space.

If the values are separated by tabs, then set the sep argument to "\t":

> dataTAB<-read.table("C:/mydata/survey.dat", header=TRUE, sep= "\t")
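
The two settings of sep can be compared on small files without touching real data. The sketch below writes made-up values to temporary files, one padded with spaces and one separated by tabs, and reads each back.

```r
# Write two small made-up files to temporary locations.
space_file <- tempfile(fileext = ".dat")
tab_file   <- tempfile(fileext = ".dat")
writeLines(c("id age  income", "1  34  52000", "2  29  48000"), space_file)
writeLines(c("id\tage\tincome", "1\t34\t52000", "2\t29\t48000"), tab_file)

# sep="" (the default) splits on any run of white space.
dataTEXT <- read.table(space_file, header = TRUE, sep = "")

# sep="\t" splits on tab characters only.
dataTAB <- read.table(tab_file, header = TRUE, sep = "\t")

# Compare the two results.
print(dataTEXT)
print(all.equal(dataTEXT, dataTAB))

unlink(c(space_file, tab_file))
```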

The read.csv function reads data files with comma-separated values using the following syntax:

> dataCOM<-read.csv("C:/mydata/survey.csv", header=TRUE)

The following are also equivalent:

> setwd("C:/mydata")
> dataCOM<-read.csv("survey.csv", header=TRUE)

-AND-

> dataCOM<-read.csv(file.choose(), header=TRUE)
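
A quick way to try read.csv is a round trip through a temporary file. The sketch below uses made-up data and also shows that header=TRUE is already the default for read.csv, unlike read.table.

```r
# A small made-up data set written to a temporary CSV file.
survey <- data.frame(id = 1:3,
                     age = c(34, 29, 41),
                     region = c("north", "south", "east"))
csv_file <- tempfile(fileext = ".csv")
write.csv(survey, csv_file, row.names = FALSE)

# header=TRUE is the default for read.csv, so both calls agree.
dataCOM  <- read.csv(csv_file, header = TRUE)
dataCOM2 <- read.csv(csv_file)
print(identical(dataCOM, dataCOM2))
str(dataCOM)

unlink(csv_file)
```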

It is possible to read fixed-format ASCII files (those with prespecified columns and no delimiters) using the read.fwf function. However, this task is time consuming. We recommend using the available setup files to read fixed-format data into another package and then using the commands in R's foreign package.
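
For a small file with a known layout, a read.fwf call looks like the sketch below. The records, column widths, and variable names are made up for illustration; for a real file they would have to come from its codebook.

```r
# Made-up fixed-format records: columns 1-3 id, column 4 sex, columns 5-6 age.
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("001M34", "002F29", "003F41"), fwf_file)

# widths gives the number of characters in each field, in order.
dataFWF <- read.fwf(fwf_file, widths = c(3, 1, 2),
                    col.names = c("id", "sex", "age"))
print(dataFWF)

unlink(fwf_file)
```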

Data in Excel Format

The easiest way to get Excel data into R is to save the spreadsheet as a comma-separated file and use R's read.csv function. The file type can be altered in Excel by changing the "Save As Type" option to "CSV (Comma Delimited)".