This page contains links to data sets I commonly use across classes. Most are real-world data sets and, as such, are not fully prepared for analysis. Data preparation tasks such as creating new variables, recoding categorical variables, and dealing with missing values may be necessary.
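As a rough illustration of the preparation tasks mentioned above, here is a minimal sketch on a toy data frame. The variable names and the use of `9` as a "refused" code are assumptions made for this example only; they do not come from any data set on this page.

```r
# Toy data frame standing in for a raw data set (all names are hypothetical)
raw <- data.frame(
  height_cm = c(170, 165, NA, 180),
  weight_kg = c(70, 60, 82, NA),
  smoke     = c(1, 0, 9, 1)   # assumption: 9 is a "refused" code
)

# Deal with missing values: treat the assumed "refused" code as missing
raw$smoke[raw$smoke == 9] <- NA

# Recode a categorical variable: numeric codes become a labeled factor
raw$smoke_f <- factor(raw$smoke, levels = c(0, 1), labels = c("No", "Yes"))

# Create a new variable: body mass index from height and weight
raw$bmi <- raw$weight_kg / (raw$height_cm / 100)^2

summary(raw$bmi)
table(raw$smoke_f, useNA = "ifany")
```

Always check the codebook for the actual missing-value codes before recoding.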
Each entry contains the following pieces.
- Information about the data set and where it comes from. Citation/credits where possible.
- The link to the data set. ❌ Right click to save this file to your class folder. Do not left click to open.
    - [optional] A link to the raw data file. There may be times when the raw, or unprocessed, data is needed.
- A codebook where available. Also called a data dictionary, this document describes the variables in the data set. This will most often be the codebook for the raw data (it will not include variables created by me in the data management script).
- A data management script where available. Presented as an R Markdown script, this will show you exactly what data processing steps have already been taken.
- The code to read the data into R and any notes that may go along with it. ⚠️ You will always have to change this path to your specific data location.
- The dimensions (number of rows and columns) of the data set. You should use this information to confirm that the data set you download and import into R matches this information.
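The last two items can be sketched as follows. This is only an illustration: it writes a small toy file to a temporary folder so the code runs anywhere, and `data_dir`, `toy.csv`, and `expected_dim` are hypothetical names, not files or values from this page.

```r
# Write a small toy file so the example runs anywhere (stand-in for a download)
data_dir <- tempdir()                       # replace with your class folder
toy <- data.frame(id = 1:5, group = c("a", "b", "a", "b", "a"))
write.csv(toy, file.path(data_dir, "toy.csv"), row.names = FALSE)

# Build the path from your own data location, then read the file
toy_path <- file.path(data_dir, "toy.csv")
stopifnot(file.exists(toy_path))
dat <- read.csv(toy_path, header = TRUE, stringsAsFactors = FALSE)

# Compare the imported dimensions with the ones listed for the data set
expected_dim <- c(5L, 2L)
stopifnot(identical(dim(dat), expected_dim))
dim(dat)
```

If the `stopifnot()` check fails, the file was most likely saved incorrectly or read with the wrong delimiter.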
The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. Add Health is re-interviewing cohort members in a Wave V follow-up from 2016-2018 to collect social, environmental, behavioral, and biological data with which to track the emergence of chronic disease as the cohort moves through their fourth decade of life. More info at: http://www.cpc.unc.edu/projects/addhealth
- Clean Data, Raw data
- Data Management script
- The cleaned AddHealth data set is provided as an R data file, not as a single external data set. The code below uses the `load()` function to load the data directly into your environment. ⚠️ Note the absence of the assignment arrow `<-`. This is intentional, and your data will not load correctly if you try to use the arrow.
##  6504 992
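To see why the assignment arrow does not do what you might expect with `load()`, here is a small sketch using a toy object saved to a temporary file. The object name `toy` and the temporary path are assumptions for illustration and are not the AddHealth file.

```r
# Create and save a toy object to an R data file (stand-in for the real file)
toy <- data.frame(id = 1:3, score = c(10, 20, 30))
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)                        # remove it so we can see load() restore it

# load() recreates the saved objects under their original names.
# Its return value is only a character vector of those names, so
# assigning the result captures the names, not the data itself.
loaded_names <- load(path)
loaded_names                   # the name(s) of the restored objects
dim(toy)                       # the data frame is back under its saved name
```

In practice you call `load()` on its own line and then use the object by the name it was saved under.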
- Ames: All residential home sales in Ames, Iowa between 2006 and 2010. The data set contains many explanatory variables on the quality and quantity of physical attributes of the homes sold. Most of the variables describe information a typical home buyer would like to know about a property (square footage, number of bedrooms and bathrooms, size of lot, etc.). A detailed discussion of the variables can be found in the original paper: De Cock D. 2011. Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education; 19(3).
ames <- read.csv("../../../static/data/ames.csv", header=TRUE)
dim(ames)
##  2930 82
- countyComplete: Characteristics of different counties in the United States. Information on this data set can be found in the full [Open Intro Data Codebook](../../../data/Open Intro Data Codebook.pdf). Just search for the data set name.
county <- read.csv("../../../static/data/countyComplete.csv", header=TRUE, stringsAsFactors = FALSE)
dim(county)
##  3116 56
- Crime_Data: State and regional level information on crime and murder rates.
crime <- readxl::read_excel("../../../static/data/Crime_Data.xlsx")
dim(crime)
##  51 11
- Depress: Tab-delimited text file. The depression data set is from the first set of interviews of a prospective study of depression in the adult residents of Los Angeles County and includes 294 observations. More details on the origin and study design can be found in Practical Multivariate Analysis, 5th edition by Afifi, May and Clark. The codebook can be downloaded as a text file.
depress <- read.delim("../../../static/data/depress_081217.txt", header=TRUE, sep="\t")
dim(depress)
##  294 37
- Download Times: An experiment run by a student to detect whether his internet speed varied across different times of the day. This tab-delimited data set contains two variables: `time` as a categorical time of day (Early, Evening, Late), and the time (in `sec`) it took to download a particular file. The same file was downloaded to the same computer each time.
dt <- read.delim("../../../static/data/DownloadTimes.txt", header=TRUE, sep="\t")
dim(dt)
##  48 2
- dsmall: This is a randomly drawn sample from the `diamonds` data set found in the `ggplot2` package. For those not using R, I have provided a tab-delimited text file for download. For those using R, use the code below to create the `dsmall` data set. The codebook can be found on the ggplot2 documentation site.
set.seed(1410) # Make the sample reproducible
diamonds <- ggplot2::diamonds # load the data without loading the ggplot2 package
dsmall <- diamonds[sample(nrow(diamonds), 1000), ] # create the subset
dim(dsmall)
##  1000 10
- email: Right click and select `save link as` to save this file to your class folder. These data represent incoming emails to the Gmail account of David Diez (an Open Intro Statistics textbook author) during the first three months of 2012. All personally identifiable information has been removed. Email Codebook
email <- read.delim("../../../static/data/email.txt", header=TRUE, stringsAsFactors = FALSE, sep="\t")
dim(email)
##  3921 21
- Full moon on Dementia: A study observed 15 nursing home patients with dementia and recorded the number of aggressive incidents each day for 12 weeks. The researchers then totaled the aggressive incidents per patient on “moon” days (full moon +/- 1 day) and “other” days.
moon <- read.delim("../../../static/data/dementia_moon.txt", sep="\t", header=TRUE)
dim(moon)
##  15 3
HS and Beyond
- High School and Beyond. The High School and Beyond (HS&B) Longitudinal Study was the second study conducted as part of NCES’ National Longitudinal Studies Program. This program was established to study the educational, vocational, and personal development of young people, beginning with their elementary or high school years and following them over time as they take on adult roles and responsibilities. http://nces.ed.gov/statprog/handbook/pdf/hsb.pdf
hsb2 <- read.delim("../../../static/data/hsb2.txt", sep="\t")
dim(hsb2)
##  200 11
- Lung: Tab-delimited text file. These data come from a study on chronic respiratory disease and the effects of various types of smog on the lung function of children and adults in the Los Angeles area. More details on the origin and study design can be found in Practical Multivariate Analysis, 5th edition by Afifi, May and Clark. The codebook can be downloaded as a text file.
fev <- read.delim("../../../static/data/Lung_081217.txt", header=TRUE, sep="\t")
dim(fev)
##  150 32
- NCbirths: Publicly released data on a random sample of births recorded in North Carolina in 2004. Codebook
ncbirths <- read.csv("../../../static/data/NCbirths.csv", header=TRUE, stringsAsFactors = FALSE)
dim(ncbirths)
##  1000 13
- ParentalHIV: Data collected as part of a clinical trial to evaluate behavioral interventions for families with a parent with HIV. The data include information on a subset of 252 adolescent children of parents with HIV. The Codebook lists the variables and gives a brief description of their meaning. The data are owned by Dr. Mary Jane Rotheram-Borus, Professor of Psychology and Behavioral Sciences, Director of the Center for Community Health, Neuropsychiatric Institute, UCLA, and are used with permission in conjunction with the textbook Practical Multivariate Analysis by Afifi et al.
parHIV <- read.delim("../../../static/data/PARHIV_081217.txt", header=TRUE, stringsAsFactors = FALSE, sep="\t")
dim(parHIV)
##  252 123
Physical Activity & BMI
- Physical Activity and BMI: Physical activity is measured as the number of steps in thousands.
pabmi <- read.delim("../../../static/data/PABMI.txt", header=TRUE, sep="\t")
dim(pabmi)
##  100 3
- Police Shootings: Excel file on police shootings in 2015 as compiled by the Washington Post. Data downloaded from https://github.com/washingtonpost/data-police-shootings on 9/11/16.
washpost <- readxl::read_excel("../../../static/data/fatal-police-shootings-data.xlsx")
dim(washpost)
This page last updated on 2021-01-27 17:48:00