Read CSV
Read Excel files
readxl_example()
read_excel()
excel_sheets()
Import data from other Statistical software
Read sas
Read STATA
Read SPSS
Best practices for Data Import
Read CSV
One of the most widely data store is the .csv (comma-separated values) file formats. R loads an array of libraries during the start-up, including the utils package. This package is convenient to open csv files combined with the reading.csv() function. Here is the syntax for read.csv
read.csv(file, header = TRUE, sep = “,”)
Argument:
file: PATH where the file is stored
header: confirm if the file has a header or not, by default, the header is set to TRUE
sep: the symbol used to split the variable. By default, ,
.
We will read the data file name mtcats. The csv file is stored online. If your .csv file is stored locally, you can replace the PATH inside the code snippet. Don’t forget to wrap it inside ‘ ‘. The PATH needs to be a string value. For mac user, the path for the download folder is:
“/Users/USERNAME/Downloads/FILENAME.csv”
For windows user:
“C:\Users\USERNAME\Downloads\FILENAME.csv”
Note that, we should always specify the extension of the file name.
.csv .xlsx .txt …
PATH <- ‘https://raw.githubusercontent.com/guru99-edu/R-Programming/master/mtcars.csv'
df <- read.csv(PATH, header = TRUE, sep = ‘,’)
length(df)
Output:
[1] 12
class(df$X)
Output:
[1] “factor”
R, by default, returns character values as Factor. We can turn off this setting by adding stringsAsFactors = FALSE.
PATH <- ‘https://raw.githubusercontent.com/guru99-edu/R-Programming/master/mtcars.csv' df <-read.csv(PATH, header =TRUE, sep = ‘,’, stringsAsFactors =FALSE) class(df$X)
Output:
[1] “character”
The class for the variable X is now a character.
Read Excel files
Excel files are very popular among data analysts. Spreadsheets are easy to work with and flexible. R is equipped with a library readxl to import Excel spreadsheet. Use this code
require(readxl)
to check if readxl is installed in your machine. If you install r with r-conda-essential, the library is already installed. You should see in the command window:
Output:
Loading required package: readxl.
If the package does not exit, you can install it with the conda library or in the terminal, use conda install -c mittner r-readxl. Use the following command to load the library to import excel files.
library(readxl)
readxl_example()
We use the examples included in the package readxl during this tutorial. Use code
readxl_example()
to see all the available spreadsheets in the library.
To check the location of the spreadsheet named clippy.xls, simple use
readxl_example(“geometry.xls”)
If you install R with conda, the spreadsheets are located in Anaconda3/lib/R/library/readxl/extdata/filename.xls
read_excel()
The function read_excel() is of great use when it comes to opening xls and xlsx extention. The syntax is:
read_excel(PATH, sheet = NULL, range= NULL, col_names = TRUE) arguments: -PATH: Path where the excel is located -sheet: Select the sheet to import. By default, all -range: Select the range to import. By default, all non-null cells -col_names: Select the columns to import. By default, all non-null columns
We can import the spreadsheets from the readxl library and count the number of columns in the first sheet.
Store the path of datasets.xlsx
example <- readxl_example(“datasets.xlsx”)
Import the spreadsheet
df <- read_excel(example)
Count the number of columns
length(df)
Output:
[1] 5
excel_sheets()
The file datasets.xlsx is composed of 4 sheets. We can find out which sheets are available in the workbook by using excel_sheets() function
example <- readxl_example(“datasets.xlsx”)
excel_sheets(example)
Output:
[1] “iris” “mtcars” “chickwts” “quakes”
If a worksheet includes many sheets, it is easy to select a particular sheet by using the sheet arguments. We can specify the name of the sheet or the sheet index. We can verify if both function returns the same output with identical().
example <- readxl_example(“datasets.xlsx”) quake <- read_excel(example, sheet = “quakes”) quake_1 <-read_excel(example, sheet = 4) identical(quake, quake_1)
Output:
[1] TRUE
We can control what cells to read in 2 ways
Use n_max argument to return n rows Use range argument combined with cell_rows or cell_cols
For example, we set n_max equals to 5 to import the first five rows.
Read the first five row: with header
iris <-read_excel(example, n_max =5, col_names =TRUE)
If we change col_names to FALSE, R creates the headers automatically.
Read the first five row: without header
iris_no_header <-read_excel(example, n_max =5, col_names =FALSE)
iris_no_header In the data frame iris_no_header, R created five new variables named X__1, X__2, X__3, X__4 and X__5
We can also use the argument range to select rows and columns in the spreadsheet. In the code below, we use the excel style to select the range A1 to B5.
Read rows A1 to B5
example_1 <-read_excel(example, range = “A1:B5”, col_names =TRUE) dim(example_1)
Output:
[1] 4 2
We can see that the example_1 returns 4 rows with 2 columns. The dataset has header, that the reason the dimension is 4×2.
In the second example, we use the function cell_rows() which controls the range of rows to return. If we want to import the rows 1 to 5, we can set cell_rows(1:5). Note that, cell_rows(1:5) returns the same output as cell_rows(5:1).
Read rows 1 to 5
example_2 <-read_excel(example, range =cell_rows(1:5),col_names =TRUE) dim(example_2)
Output:
[1] 4 5
The example_2 however is a 4×5 matrix. The iris dataset has 5 columns with header. We return the first four rows with header of all columns
In case we want to import rows which do not begin at the first row, we have to include col_names = FALSE. If we use range = cell_rows(2:5), it becomes obvious our data frame does not have header anymore.
iris_row_with_header <-read_excel(example, range =cell_rows(2:3), col_names=TRUE) iris_row_no_header <-read_excel(example, range =cell_rows(2:3),col_names =FALSE)
We can select the columns with the letter, like in Excel.
Select columns A and B
col <-read_excel(example, range =cell_cols(“A:B”)) dim(col)
Output:
[1] 150 2
Note : range = cell_cols(“A:B”), returns output all cells with non-null value. The dataset contains 150 rows, therefore, read_excel() returns rows up to 150. This is verified with the dim() function. read_excel() returns NA when a symbol without numerical value appears in the cell. We can count the number of missing values with the combination of two functions
sum is.na
Here is the code
iris_na <-read_excel(example, na =“setosa”) sum(is.na(iris_na))
Output:
[1] 50
We have 50 values missing, which are the rows belonging to the setosa species.
Import data from other Statistical software
We will import different files format with the heaven package. This package support SAS, STATA and SPSS softwares. We can use the following function to open different types of dataset, according to the extension of the file:
SAS: read_sas() STATA: read_dta() (or read_stata(), which are identical) SPSS: read_sav() or read_por(). We need to check the extension
Only one argument is required within these function. We need to know the PATH where the file is stored. That’s it, we are ready to open all the files from SAS, STATA and SPSS. These three function accepts an URL as well.
library(haven)
haven comes with conda r-essential otherwise go to the link or in the terminal conda install -c conda-forge r-haven
Read sas
For our example, we are going to use the admission dataset from IDRE.
PATH_sas <- ‘https://github.com/guru99-edu/R-Programming/blob/master/binary.sas7bdat?raw=true' df <- read_sas(PATH_sas) head(df)
Output:
# A tibble: 6 x 4
ADMIT GRE GPA RANK
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
Read STATA
For STATA data files you can use read_dta(). We use exactly the same dataset but store in .dta file.
PATH_stata <- ‘https://github.com/guru99-edu/R-Programming/blob/master/binary.dta?raw=true' df <- read_dta(PATH_stata) head(df)
Output:
# A tibble: 6 x 4
admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
Read SPSS
We use the read_sav()function to open a SPSS file. The file extension “.sav”
PATH_spss <- ‘https://github.com/guru99-edu/R-Programming/blob/master/binary.sav?raw=true' df <- read_sav(PATH_spss) head(df)
Output:
# A tibble: 6 x 4
admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
Best practices for Data Import
When we want to import data into R, it is useful to implement following checklist. It will make it easy to import data correctly into R:
The typical format for a spreadsheet is to use the first rows as the header (usually variables name). Avoid to name a dataset with blank spaces; it can lead to interpreting as a separate variable. Alternatively, prefer to use ‘’ or ‘-.’ Short names are preferred Do not include symbol in the name: i.e: exchange_rate$_€ is not correct. Prefer to name it: exchange_rate_dollar_euro Use NA for missing values otherwise; we need to clean the format later.
Summary
Following table summarizes the function to use in order to import different types of file in R. The column one states the library related to the function. The last column refers to the default argument. Following table shows the different ways to import a selection with read_excel() function.