Chapter 2 Import

2.1 Introduction

The first step in the typical data science project involves importing data into R. There are numerous packages for different data types all with varying preferences on speed and efficiency. Here are some R packages for importing data into R:

2.2 Tabular Data

Tabular data consists of variables, observations and values to form data frames. This is the most common format of organized data and many packages are developed to work with this type of data.

2.2.1 readr

readr: Read flat/tabular text files from disk (or a connection). readr has some benefits over the base/utils version as smart column type parsing and not automatically converting strings into factors.

2.2.1.1 Examples

Here is an example of credit card applications data set from the UCI Machine Learning Repository using readr:

library(readr)
cc_apps <- read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data", col_names = F)
head(cc_apps)

## # A tibble: 6 × 16
##      X1    X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12
##   <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1     b 30.83 0.000     u     g     w     v  1.25     t     t    01     f
## 2     a 58.67 4.460     u     g     q     h  3.04     t     t    06     f
## 3     a 24.50 0.500     u     g     q     h  1.50     t     f     0     f
## 4     b 27.83 1.540     u     g     w     v  3.75     t     t    05     t
## 5     b 20.17 5.625     u     g     w     v  1.71     t     f     0     f
## 6     b 32.08 4.000     u     g     m     v  2.50     t     f     0     t
## # ... with 4 more variables: X13 <chr>, X14 <chr>, X15 <int>, X16 <chr>

2.2.2 readxl

readxl: Import excel files into R. Supports ‘.xls’ via the embedded ‘libxls’ C library (http://sourceforge.net/projects/libxls/) and ‘.xlsx’ via the embedded ‘RapidXML’ C++ library (http://rapidxml.sourceforge.net). Works on Windows, Mac and Linux without external dependencies.

2.2.2.1 Examples

Here is an example from the default of credit card clients data set from the UCI Machine Learning Repository using readxl:

# download the excel file first from the link
library(readxl)
default_cc <- read_excel("default of credit card clients.xls")

# alternative reading from a URL
require(RCurl)
require(gdata)
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
default_cc <- read.xls(url)

head(default_cc)

##    X        X1  X2        X3       X4  X5    X6    X7    X8    X9   X10
## 1 ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
## 2  1     20000   2         2        1  24     2     2    -1    -1    -2
## 3  2    120000   2         2        2  26    -1     2     0     0     0
## 4  3     90000   2         2        2  34     0     0     0     0     0
## 5  4     50000   2         2        1  37     0     0     0     0     0
## 6  5     50000   1         2        1  57    -1     0    -1     0     0
##     X11       X12       X13       X14       X15       X16       X17
## 1 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
## 2    -2      3913      3102       689         0         0         0
## 3     2      2682      1725      2682      3272      3455      3261
## 4     0     29239     14027     13559     14331     14948     15549
## 5     0     46990     48233     49291     28314     28959     29547
## 6     0      8617      5670     35835     20940     19146     19131
##        X18      X19      X20      X21      X22      X23
## 1 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 2        0      689        0        0        0        0
## 3        0     1000     1000     1000        0     2000
## 4     1518     1500     1000     1000     1000     5000
## 5     2000     2019     1200     1100     1069     1000
## 6     2000    36681    10000     9000      689      679
##                            Y
## 1 default payment next month
## 2                          1
## 3                          1
## 4                          0
## 5                          0
## 6                          0

2.3 Hierarchical Data

Hierarchical Data is a tree-structure data format such as XML, HTML, JSON. Popular methods for accessing this data are known as web scraping or web data mining when the goal is to parse data on a web page into a analysis-ready format such as a data frame.

2.3.1 jsonlite

jsonlite: A fast JSON parser and generator optimized for statistical data and the web.

2.3.1.1 Examples

TBD

2.3.2 xml2

xml2: Work with XML files using a simple, consistent interface. Built on top of the ‘libxml2’ C library.

2.3.2.1 Examples

TBD

2.3.3 rvest

rvest: Wrappers around the ‘xml2’ and ‘httr’ packages to make it easy to download, then manipulate, HTML and XML.

2.3.3.1 Examples

TBD

2.4 Relational Data

Relational Data consists of a collection of data items (tables) organized as a set based on the data contents and its relation.

2.4.1 DBI

DBI: A database interface definition for communication between R and relational database management systems. All classes in this package are virtual and need to be extended by the various R/DBMS implementations.

2.4.1.1 Examples

TBD

2.4.2 RMySQL

RMySQL: Implements ‘DBI’ Interface to ‘MySQL’ and ‘MariaDB’ Databases.

2.4.2.1 Examples

TBD

2.4.2.2 RPostgreSQL

RPostgreSQL:Database interface and PostgreSQL driver for R This package provides a Database Interface (DBI) compliant driver for R to access PostgreSQL database systems. In order to build and install this package from source, PostgreSQL itself must be present your system to provide PostgreSQL functionality via its libraries and header files. These files are provided as postgresql-devel package under some Linux distributions. On Microsoft Windows system the attached libpq library source will be used. A wiki and issue tracking system for the package are available at Google Code at https://code.google.com/p/rpostgresql/

2.4.2.3 Examples

TBD

2.5 Distributed Data

Distributed Data consists of non-relational formats with quick access to data over a large number of nodes (data spread over many different computers).

2.5.1 sparklyr

sparklyr: Filter and aggregate Spark datasets then bring them into R for analysis and visualization.

2.5.1.1 Examples

TBD

2.6 Additional Import Methods

Different Data Formats: The R programming language and environment is continuously increasing its capacity with new packages to work with different types of proprietory data formats from statistical software packages that are used on industry teams.

2.6.1 haven

haven: Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files.

2.6.1.1 Examples

Here is an example from Macquarie University data repository for the applied finance and actuarial studies of importing a SAS data set:

library(haven)
claims <- read_sas("http://www.businessandeconomics.mq.edu.au/our_departments/Applied_Finance_and_Actuarial_Studies/acst_docs/glms_for_insurance_data/data/claims_sas_miner.sas7bdat")
head(claims)

## # A tibble: 6 × 33
##          ID KIDSDRIV   PLCYDATE TRAVTIME    CAR_USE POLICYNO BLUEBOOK
##       <chr>    <dbl>     <date>    <dbl>      <chr>    <dbl>    <dbl>
## 1 100058542        0 1996-03-17 17.09181    Private 36292520     9860
## 2 100093408        0 1993-07-26 17.98656    Private 31958061     1500
## 3 100208113        0 1994-06-06 47.00727 Commercial 42433312    30460
## 4 100237269        0 1999-01-19 31.24381    Private 49896544    16580
## 5  10042968        0 1999-05-18 13.96243 Commercial 79298192    23030
## 6 100737644        0 1996-02-28 45.79204    Private 43393435    20730
## # ... with 26 more variables: INITDATE <date>, RETAINED <dbl>,
## #   NPOLICY <dbl>, CAR_TYPE <chr>, RED_CAR <chr>, OLDCLAIM <dbl>,
## #   CLM_FREQ <dbl>, REVOLKED <chr>, MVR_PTS <dbl>, CLM_AMT <dbl>,
## #   CLM_DATE <date>, CLM_FLAG <chr>, BIRTH <date>, AGE <dbl>,
## #   HOMEKIDS <dbl>, YOJ <dbl>, INCOME <dbl>, GENDER <chr>, MARRIED <chr>,
## #   PARENT1 <chr>, JOBCLASS <chr>, MAX_EDUC <chr>, HOME_VAL <dbl>,
## #   SAMEHOME <dbl>, DENSITY <chr>, YEARQTR <chr>

2.6.2 foreign

foreign: Functions for reading and writing data stored by some versions of Epi Info, Minitab, S, SAS, SPSS, Stata, Systat and Weka and for reading and writing some dBase files.

2.6.2.1 Examples

TBD

2.6.3 Zipped Data

Accessing Zipped Data files: Zip archives are actually more a ‘filesystem’ with content, meta data, and/or documentation.

Create a temp file. file name (eg tempfile())
Use download.file() to download the file into the temp object that is being reserved for the file
Use unzip() to extract the target file from temp file by reading the meta data on what specific data set you want which is contained in the zip file
Remove the temp file via unlink()

temp <- tempfile()
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip",temp)
unzip(temp, "bank.csv")
bank_marketing <- read.csv("bank.csv", sep=";") # sometimes its the default
unlink(temp)
head(bank_marketing)

##   age         job marital education default balance housing loan  contact
## 1  30  unemployed married   primary      no    1787      no   no cellular
## 2  33    services married secondary      no    4789     yes  yes cellular
## 3  35  management  single  tertiary      no    1350     yes   no cellular
## 4  30  management married  tertiary      no    1476     yes  yes  unknown
## 5  59 blue-collar married secondary      no       0     yes   no  unknown
## 6  35  management  single  tertiary      no     747      no   no cellular
##   day month duration campaign pdays previous poutcome  y
## 1  19   oct       79        1    -1        0  unknown no
## 2  11   may      220        1   339        4  failure no
## 3  16   apr      185        1   330        1  failure no
## 4   3   jun      199        4    -1        0  unknown no
## 5   5   may      226        1    -1        0  unknown no
## 6  23   feb      141        2   176        3  failure no