The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. Add Health is re-interviewing cohort members in a Wave V follow-up from 2016-2018 to collect social, environmental, behavioral, and biological data with which to track the emergence of chronic disease as the cohort moves through their fourth decade of life.
More info at: http://www.cpc.unc.edu/projects/addhealth
This is a CSV file, so we could read it in using read.csv, but I prefer the functionality of read_csv, found in the readr package.
library(readr)
rawdata <- read_csv(file="AddHealth_Wave_IV.csv")
Goof-ups are bound to happen. Let’s copy rawdata to mydata and do all our data cleaning on mydata. That way, if/when we goof up, we just need to rerun the code chunk below to reset mydata to its pre-recode state, rather than reading the entire CSV data set from the hard drive all over again.
mydata <- rawdata
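The reset works because R copies a data frame when it is modified, so changes made to mydata never reach rawdata. A minimal sketch of that behavior, using a made-up two-row data frame in place of the real file (the values are illustrative, not Add Health records):

```r
# Toy stand-in for the raw data (made-up values, not Add Health records)
rawdata <- data.frame(BIO_SEX = c(1, 2), H4GH1 = c(3, 5))

mydata <- rawdata        # working copy
mydata$BIO_SEX <- NA     # a "goof up" on the working copy only

mydata <- rawdata        # reset: rawdata still holds the original values
mydata
```

Rerunning the single assignment is enough; nothing has to be read from disk again.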
We could use dplyr's select() in the code chunk above to keep only the variables we want, but we can also “clean up” the data set at the end, before saving the clean file to disk.
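For reference, here is roughly what the select() approach could look like. This is only a sketch on a toy data frame; the OTHER column is a hypothetical placeholder for variables we do not need, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Toy stand-in for the working data (made-up values; OTHER is hypothetical)
mydata <- data.frame(BIO_SEX = c(1, 2), H4GH1 = c(3, 5), OTHER = c(9, 9))

# Keep only the columns needed for the analysis
kept <- mydata %>% select(BIO_SEX, H4GH1)
names(kept)
```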
BIO_SEX is coded as 1=male, 2=female, 6=missing. I want to recode this into an indicator of being female.
mydata$female <- mydata$BIO_SEX-1
mydata$female[mydata$BIO_SEX==6] <- NA
table(mydata$BIO_SEX, mydata$female, useNA="always")
##
##           0    1 <NA>
##   1    3147    0    0
##   2       0 3356    0
##   6       0    0    1
##   <NA>    0    0    0
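The same recode can also be written in a single step with ifelse(). The sketch below is an alternative to the two-line version above, not a replacement for it, and runs on a made-up vector rather than the real BIO_SEX column:

```r
# Toy codes: 1 = male, 2 = female, 6 = missing (made-up values)
bio_sex <- c(1, 2, 6, 2, NA)

# Code 6 becomes NA; otherwise subtract 1 so that 0 = male, 1 = female
female <- ifelse(bio_sex == 6, NA, bio_sex - 1)
female
```

An existing NA stays NA, because the comparison `NA == 6` is itself NA.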
For plotting purposes, I would also like to have a categorical version of gender.
mydata$female_c <- factor(mydata$female, labels=c("Male", "Female"))
table(mydata$female, mydata$female_c, useNA="always")
##
##        Male Female <NA>
##   0    3147      0    0
##   1       0   3356    0
##   <NA>    0      0    1
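When only labels are supplied, factor() matches them to the sorted unique values that happen to be present. A slightly more defensive variant, sketched here on a made-up vector, also passes levels so that the mapping is pinned down even if one category is absent:

```r
# Toy indicator with no males present (made-up values)
female <- c(1, 1, NA, 1)

# Explicit levels pin 0 -> "Male" and 1 -> "Female" even if a value is absent
female_c <- factor(female, levels = c(0, 1), labels = c("Male", "Female"))
table(female, female_c, useNA = "always")
```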
table(mydata$H4GH1)
##
##    1    2    3    4    5
##  979 1963 1683  434   55
No missing-value codes show up here (keep in mind that table() drops NA values unless useNA is set), but I want to apply labels and convert to a factor variable.
mydata$genhealth <- factor(mydata$H4GH1,
                           labels = c("Excellent", "Very good", "Good", "Fair", "Poor"))
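As with the female indicator, the new factor can be cross-tabulated against the original codes to confirm that each label landed on the right value and to count any NAs. A minimal sketch on made-up responses (the vector below is illustrative, not the real H4GH1 column):

```r
# Toy responses on the 1-5 scale, with one NA (made-up values)
h4gh1 <- c(1, 3, 5, NA, 2, 3)

genhealth <- factor(h4gh1, levels = 1:5,
                    labels = c("Excellent", "Very good", "Good", "Fair", "Poor"))

# useNA = "always" adds the NA row and column that table() omits by default
table(h4gh1, genhealth, useNA = "always")
sum(is.na(genhealth))
```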
With how many people have you had a romantic or sexual relationship that lasted less than 6 months since 2001?
table(mydata$H4TR6)
##
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
## 3049  515  424  261  176  137   74   41   46   18   95    6   23    9    6
##   15   16   17   18   20   21   22   23   24   25   26   28   30   37   38
##   28    1    2    3   25    2    1    1    1   11    3    4   10    1    1
##   40   45   50   60   65   75   90   95  996  998
##    3    1    8    4    1    1    1    7   40   74
# Treat the codes above 990 (996 and 998) as missing rather than as counts
mydata$H4TR6[mydata$H4TR6 > 990] <- NA
boxplot(mydata$H4TR6)
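In the table above, 996 and 998 are the only values over 990, so the threshold catches exactly those two codes. An alternative is to name the codes explicitly with %in%, which avoids sweeping up any other large value by accident. A sketch on a made-up vector:

```r
# Toy counts with the two missing-value codes mixed in (made-up values)
h4tr6 <- c(0, 2, 10, 996, 95, 998, 1)

# Name the codes explicitly rather than relying on a threshold
h4tr6[h4tr6 %in% c(996, 998)] <- NA

summary(h4tr6)   # NA's are reported separately from the real counts
```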