This page contains links to data sets I commonly use across classes. Most data sets are real world data, and are as such not fully prepared for analysis. Data preparation tasks such as creating new variables, refactoring categorical variable and dealing with missing values may be necessary.

Each entry contains the following pieces.

  1. Information about the data set and where it comes from. Citation/credits where possible.
  2. The link to the data set. ❌ Right click to save this file to your class folder. Do not left click to open.
    1. [optional] A link to the raw data file. There may be times when the raw, or unprocessed, data is needed.
  3. A codebook where available. Also called a data dictionary this document tells you what data This will most often be the codebook for the raw data (will not include variables created by me in the data management script).
  4. A data management script where available. Presented as a RMarkdown script, this will show you exactly what data processing steps have been taken already.
  5. The code to read the data into R and any notes that may go along with it. ⚠️ You will always have to change this path to your specific data location.
  6. The dimensions (number of rows and columns) of the data set. You should use this information to confirm that the data set you download and import into R matches this information.


The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. Add Health is re-interviewing cohort members in a Wave V follow-up from 2016-2018 to collect social, environmental, behavioral, and biological data with which to track the emergence of chronic disease as the cohort moves through their fourth decade of life. More info at:

  • Clean Data, Raw data
  • Codebook
  • Data Management script
  • The cleaned AddHealth data set is provided as an R data file, not single external data set. The code below uses the load() function to load the data directly into your environment. ⚠️ Note the absence of the assignment arrow <-. This is intentional and your data will not load correctly if you try to use the arrow.
## [1] 6504  992


Ames: All residential home sales in Ames, Iowa between 2006 and 2010. The data set contains many explanatory variables on the quality and quantity of physical attributes of residential homes in Iowa sold between 2006 and 2010. Most of the variables describe information a typical home buyer would like to know about a property (square footage, number of bedrooms and bathrooms, size of lot, etc.). A detailed discussion of variables can be found in the original paper: De Cock D. 2011. Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education; 19(3).

ames <- read.csv("ames.csv")
## [1] 2930   82


  • countyComplete: Characteristics of different counties in the United States. Information on this data set can be found in the full Open Intro Data Codebook. Just search for the data set name.
county <- read.csv("countyComplete.csv")
## [1] 3142   15

Crime Data

  • Crime_Data: State and regional level information on crime and murder rates.
crime <- readxl::read_excel("Crime_Data.xlsx")
## [1] 51 11


  • Depress Tab delimited text file. The depression data set is from the first set of interviews of a prospective study of depression in the adult residents of Los Angeles County and includes 294 observations. More details on the origin and study design can be found in Practical Multivariate Analysis Afifi The codebook can be downloaded as a text file.
depress <- read.delim("Depress.txt")
## [1] 294  37

Download Times

  • Download Times An experiment run by a student to detect if his internet speed varied across different times of the day. This tab-delimited data set contains two variables: time as a categorical time of day (Early, Evening, Late), and the time (in sec) it took to download a particular file. The file downloaded was the same each time to the same computer.
dt <- read.delim("DownloadTimes.txt")
## [1] 48  2

Diamonds (small)

  • dsmall This is a randomly drawn sample from the diamonds data set found in the ggplot2 package. For those not using R I have provided a tab delimited text file for download. For those using R, use the code below to create the dsmall data set. The codebook can be found on the ggplot2 documentation site.
set.seed(1410) # Make the sample reproducible
diamonds <- ggplot2::diamonds # load the data without loading the ggplot2 package
dsmall <- diamonds[sample(nrow(diamonds), 1000), ] # create the subset
## [1] 1000   10


  • email: Right Click and select save link as to save this file to your class folder. These data represent incoming emails for the first three months of 2012 for David Diez’s (An Open Intro Statistics Textbook author) Gmail Account, early months of 2012. All personally identifiable information has been removed. Email Codebook
email <- read.delim("email.txt")
## [1] 3921   21


  • Full moon on Dementia A study observed 15 nursing home patients with dementia and recorded the number of aggressive incidents each day for 12 weeks. Then they totaled the counts of aggressive incidents per patient on “moon” days (full moon +/-1 day) and “other” days.
moon <- read.delim("dementia_moon.txt")
## [1] 15  3

HS and Beyond

  • High School and Beyond. The High School and Beyond (HS&B) Longitudinal Study was the second study conducted as part of NCES’ National Longitudinal Studies Program. This program was established to study the educational, vocational, and personal development of young people, beginning with their elementary or high school years and following them over time as they take on adult roles and responsibilities.
hsb2 <- read.delim("hsb2.txt")
## [1] 200  11

Lung Function

  • Lung Tab-delimited text file. This data come from a study on chronic respiratory disease and the effects of various types of smog on lung function of children and adults in the Los Angeles area. More details on the origin and study design can be found in Practical Multivariate Analysis by Afifi et al. The codebook can be downloaded as a text file.
fev <- read.delim("Lung.txt")
## [1] 150  32

NC Births

  • NCbirths: Publicly released data on a random sample of births recorded in North Carolina in 2004.Codebook
ncbirths <- read.csv("NCbirths.csv")
## [1] 1000   13

Parental HIV

  • ParentalHIV: Data collected as part of a clinical trial to evaluate behavioral interventions for families with a parent with HIV. The data include information on a subset of 252 adolescent children of parents with HIV. The Codebook describes the variables and gives a brief description of their meaning. The data is owned by by Dr. Mary Jane Rotheram-Borus, Professor of Psychology and Behavioral Sciences, Director of the Center for Community Health, Neuropsychiatric Institute, UCLA and used with permission in conjunction with the textbook Practical Multivariate Analysis by Afifi
parHIV <- read.delim("Parhiv.txt")
## [1] 251 111

Physical Activity & BMI

pabmi <- read.delim("PABMI.txt")
## [1] 100   3

Police Shootings

washpost <- read_excel("fatal-police-shootings-data.xlsx")

This page last updated on 2024-01-05 15:03:47