This page contains links to data sets I commonly use across classes.
Most are real-world data and, as such, are not fully prepared
for analysis. Data preparation tasks such as creating new variables,
refactoring categorical variables, and dealing with missing values may be
required.
Each entry contains the following pieces.
- Information about the data set and where it comes from.
Citation/credits where possible.
- The link to the data set. ❌ Right click to save
this file to your class folder. Do not left click to open.
- [optional] A link to the raw data file. There may
be times when the raw, or unprocessed, data is needed.
- A codebook where available. Also called a
data dictionary, this document tells you what each variable in the data set represents. This
will most often be the codebook for the raw data (it will not include
variables created by me in the data management script).
- A data management script where available. Presented
as an RMarkdown script, this will show you exactly what data processing
steps have already been taken.
- The code to read the data into R and any notes that may go along
with it. ⚠️ You will always have to change this path to match the location of the file on your own computer.
- The dimensions (number of rows and columns) of the data set. You
should use this information to confirm that the data set you download
and import into R matches this information.
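The last two items above (changing the path and confirming the dimensions) can be sketched roughly as follows. This is only an illustration under stated assumptions: `check_dims()` is a hypothetical helper that does not appear anywhere on this page, and the folder, file name, and values are stand-ins written to a temporary directory so the example runs anywhere.

```r
# Hypothetical helper (not from this page): stop with a clear message when
# an imported data frame does not have the documented dimensions.
check_dims <- function(data, expected_rows, expected_cols) {
  actual <- dim(data)
  expected <- as.integer(c(expected_rows, expected_cols))
  if (!identical(actual, expected)) {
    stop(sprintf("Expected %d x %d but found %d x %d",
                 expected[1], expected[2], actual[1], actual[2]))
  }
  invisible(TRUE)
}

# Write a tiny stand-in file so the sketch runs anywhere. The folder and
# file name are placeholders for your own class folder and download.
class_folder <- tempdir()
data_path <- file.path(class_folder, "demo.csv")
write.csv(data.frame(id = 1:5, score = c(10, 12, 9, 14, 11)),
          data_path, row.names = FALSE)

demo <- read.csv(data_path, header = TRUE)
dim(demo)   # number of rows, then number of columns
check_dims(demo, expected_rows = 5, expected_cols = 2)
```

In practice you would point `data_path` at the file you saved and pass the two numbers listed under the entry for that data set.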
The National Longitudinal Study of Adolescent to Adult Health (Add
Health) is a longitudinal study of a nationally representative sample of
adolescents in grades 7-12 in the United States during the 1994-95
school year. The Add Health cohort has been followed into young
adulthood with four in-home interviews, the most recent in 2008, when
the sample was aged 24-32. Add Health is re-interviewing cohort members
in a Wave V follow-up from 2016-2018 to collect social, environmental,
behavioral, and biological data with which to track the emergence of
chronic disease as the cohort moves through their fourth decade of life.
More info at: http://www.cpc.unc.edu/projects/addhealth
- Clean Data, Raw data
- Data Management script
- The cleaned AddHealth data set is provided as an R data file, not as a
single external data set. The code below uses the load()
function to load the data directly into your environment. ⚠️ Note the
absence of the assignment arrow <-. This is intentional,
and your data will not load correctly if you try to use the arrow.
##  6504 992
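To see why the arrow matters, here is a minimal sketch of how `load()` behaves. The object name `addhealth_demo` and the file name are hypothetical stand-ins created in a temporary folder; they are not the actual AddHealth file, which you would load from wherever you saved it.

```r
# Create a small stand-in R data file so the sketch is runnable.
# The object and file names here are hypothetical.
addhealth_demo <- data.frame(id = 1:3, wave = c(1, 1, 2))
rdata_path <- file.path(tempdir(), "addhealth_demo.Rdata")
save(addhealth_demo, file = rdata_path)
rm(addhealth_demo)

# Call load() on its own line: the stored object reappears in the
# environment under the name it was saved with.
load(rdata_path)
dim(addhealth_demo)

# With the arrow, the left-hand side receives only a character vector
# of object names, not the data itself.
wrong <- load(rdata_path)
class(wrong)   # "character"
```

This is the documented return value of `load()`: it restores objects as a side effect and returns their names, which is why assigning its result does not give you a data frame.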
- Ames: All residential home sales in Ames,
Iowa between 2006 and 2010. The data set contains many explanatory
variables on the quality and quantity of physical attributes of
the homes sold. Most of the
variables describe information a typical home buyer would like to know
about a property (square footage, number of bedrooms and bathrooms, size
of lot, etc.). A detailed discussion of variables can be found in the
original paper: De Cock D. 2011. Ames, Iowa: Alternative to the
Boston Housing Data as an End of Semester Regression Project. Journal of
Statistics Education; 19(3).
ames <- read.csv("ames.csv", header=TRUE)
##  2930 82
- countyComplete: Characteristics of
different counties in the United States. Information on this data set
can be found in the full Open Intro Data Codebook.
Just search for the data set name.
county <- read.csv("countyComplete.csv", header=TRUE, stringsAsFactors = FALSE)
##  3116 56
- Crime_Data: State and regional level
information on crime and murder rates.
crime <- readxl::read_excel("Crime_Data.xlsx")
##  51 11
- Depress: Tab-delimited text file.
The depression data set is from the first set of interviews of a
prospective study of depression in the adult residents of Los Angeles
County and includes 294 observations. More details on the origin and
study design can be found in Practical Multivariate Analysis, 5th
edition by Afifi, May and Clark. The codebook can be downloaded as a text file.
depress <- read.delim("depress_081217.txt", header=TRUE,sep="\t")
##  294 37
- Download Times: An experiment run by
a student to detect whether his internet speed varied across different times
of the day. This tab-delimited data set contains two variables:
time as a categorical time of day (Early, Evening, Late),
and the time (in sec) it took to download a particular
file. The file downloaded was the same each time to the same
dt <- read.delim("DownloadTimes.txt", header=TRUE, sep="\t")
##  48 2
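As a rough sketch of what working with a file like this looks like, the example below reads tab-delimited text and stores the time-of-day variable as a factor with an explicit level order. The rows are made-up stand-in values and the column names `time_of_day` and `seconds` are hypothetical; they are not the contents or variable names of DownloadTimes.txt.

```r
# Made-up stand-in rows; the column names are hypothetical and are
# not taken from DownloadTimes.txt.
raw_text <- "time_of_day\tseconds\nEarly\t68\nEvening\t299\nLate\t216\nEarly\t138"

# read.delim() already defaults to header = TRUE and sep = "\t";
# they are spelled out here to match the import code on this page.
dt_demo <- read.delim(text = raw_text, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)

# Store time of day as a factor with an explicit level order.
dt_demo$time_of_day <- factor(dt_demo$time_of_day,
                              levels = c("Early", "Evening", "Late"))

dim(dt_demo)                 # rows, then columns
table(dt_demo$time_of_day)   # counts per time of day
```

Setting the levels explicitly keeps tables and plots in a fixed order rather than leaving the order to R's default (alphabetical) sorting.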
- dsmall: This is a randomly drawn sample from the
diamonds data set found in the ggplot2
package. For those not using
R, I have provided a tab-delimited
text file for download. For those using R, use the code below
to create the
dsmall data set. The codebook
can be found on the ggplot2 documentation site.
set.seed(1410) # Make the sample reproducible
diamonds <- ggplot2::diamonds # load the data without loading the ggplot2 package
dsmall <- diamonds[sample(nrow(diamonds), 1000), ] # create the subset
##  1000 10
- email: Right click and select
save link as to save this file to your class
folder. These data represent incoming emails for the first
three months of 2012 for the Gmail account of David Diez (an OpenIntro Statistics
textbook author). All
personally identifiable information has been removed. Email Codebook
email <- read.delim("email.txt", header=TRUE, stringsAsFactors = FALSE, sep="\t")
##  3921 21
- Full moon on Dementia A study
observed 15 nursing home patients with dementia and recorded the number
of aggressive incidents each day for 12 weeks. Then they totaled the
counts of aggressive incidents per patient on “moon” days (full moon
+/-1 day) and “other” days.
moon <- read.delim("dementia_moon.txt", sep="\t", header=TRUE)
##  15 3
HS and Beyond
- High School and Beyond. The High School and
Beyond (HS&B) Longitudinal Study was the second study conducted as
part of NCES’ National Longitudinal Studies Program. This program was
established to study the educational, vocational, and personal
development of young people, beginning with their elementary or high
school years and following them over time as they take on adult roles
and responsibilities. http://nces.ed.gov/statprog/handbook/pdf/hsb.pdf
hsb2 <- read.delim("hsb2.txt", sep="\t")
##  200 11
- Lung: Tab-delimited text file. These
data come from a study on chronic respiratory disease and the effects of
various types of smog on lung function of children and adults in the Los
Angeles area. More details on the origin and study design can be found
in Practical Multivariate Analysis, 5th edition by Afifi, May and Clark.
The codebook can be downloaded as a text file.
fev <- read.delim("Lung_081217.txt", header=TRUE,sep="\t")
##  150 32
- NCbirths: Publicly released data on a
random sample of births recorded in North Carolina in 2004. Codebook
ncbirths <- read.csv("NCbirths.csv", header=TRUE, stringsAsFactors = FALSE)
##  1000 13
- ParentalHIV: Data collected as part
of a clinical trial to evaluate behavioral interventions for families
with a parent with HIV. The data include information on a subset of 252
adolescent children of parents with HIV. The Codebook describes the variables and gives
a brief description of their meaning. The data are owned by Dr. Mary
Jane Rotheram-Borus, Professor of Psychology and Behavioral Sciences,
Director of the Center for Community Health, Neuropsychiatric Institute,
UCLA, and are used with permission in conjunction with the textbook Practical
Multivariate Analysis by Afifi et al.
parHIV <- read.delim("PARHIV_081217.txt", header=TRUE, stringsAsFactors = FALSE, sep="\t")
##  252 123
Physical Activity & BMI
pabmi <- read.delim("PABMI.txt", header=TRUE,sep="\t")
##  100 3
washpost <- readxl::read_excel("fatal-police-shootings-data.xlsx")
This page last updated on 2022-11-26 13:13:39