The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. Add Health is re-interviewing cohort members in a Wave V follow-up from 2016-2018 to collect social, environmental, behavioral, and biological data with which to track the emergence of chronic disease as the cohort moves through their fourth decade of life.
More info at: http://www.cpc.unc.edu/projects/addhealth
This is a CSV file, so we could read it in using read.csv, but I prefer the functionality of read_csv found in the readr package.
library(readr)
rawdata <- read_csv(file="AddHealth_Wave_IV.csv")
Goof-ups are bound to happen. Let’s copy rawdata to mydata, and do all our data cleaning on mydata. That way, if/when we goof up, we just need to rerun the code chunk below to reset mydata to its pre-recode state, without having to read the entire CSV data set from the hard drive all over again.
mydata <- rawdata
We could use dplyr to %>% select() only the variables we want to keep in the code chunk above, but we can also “clean up” the data set at the end, before saving the clean file to disk.
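As a rough sketch of that first option, the copy and the subset can happen in one step. The toy data frame below is made up for illustration (it is not the Add Health file), and the choice of columns is only an assumption about what we might keep.

```r
library(dplyr)

# Toy stand-in for rawdata; values are invented for illustration
rawdata_toy <- data.frame(
  AID     = c("a1", "a2", "a3"),   # hypothetical ID column
  BIO_SEX = c(1, 2, 6),
  H4GH1   = c(1, 3, 5),
  H4BMI   = c(22.5, 31.0, 999)
)

# Copy and subset in one step: keep only the columns we plan to clean
mydata_toy <- rawdata_toy %>% select(BIO_SEX, H4GH1, H4BMI)

names(mydata_toy)
```

The downside is that every time a new variable is needed, this chunk has to be edited and rerun, which is why trimming at the end can be more convenient.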
BIO_SEX is coded as 1=male, 2=female, 6=missing. I want to recode this into an indicator of being female.
mydata$female <- mydata$BIO_SEX-1
mydata$female[mydata$BIO_SEX==6] <- NA
table(mydata$BIO_SEX, mydata$female ,useNA="always")
##       
##           0    1 <NA>
##   1    3147    0    0
##   2       0 3356    0
##   6       0    0    1
##   <NA>    0    0    0
For plotting purposes, I would also like to have a categorical version of gender.
mydata$female_c <- factor(mydata$female, labels=c("Male", "Female"))
table(mydata$female, mydata$female_c ,useNA="always")
##       
##        Male Female <NA>
##   0    3147      0    0
##   1       0   3356    0
##   <NA>    0      0    1
table(mydata$H4GH1)
##
## 1 2 3 4 5
## 979 1963 1683 434 55
No missing values, but I want to apply labels and convert to a factor variable.
mydata$genhealth <- factor(mydata$H4GH1,
labels = c("Excellent", "Very good", "Good", "Fair", "Poor"))
With how many people have you had a romantic or sexual relationship that lasted less than 6 months since 2001?
table(mydata$H4TR6)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 3049 515 424 261 176 137 74 41 46 18 95 6 23 9 6
## 15 16 17 18 20 21 22 23 24 25 26 28 30 37 38
## 28 1 2 3 25 2 1 1 1 11 3 4 10 1 1
## 40 45 50 60 65 75 90 95 996 998
## 3 1 8 4 1 1 1 7 40 74
mydata$H4TR6[mydata$H4TR6 >990] <- NA
boxplot(mydata$H4TR6)
Median number of opposite-sex partners in lifetime among sexually experienced men and women aged 25-44 years of age 2002, 2006-2010 and 2011-2015: 6.7 for men, 3.8 for women. Key Statistics from the National Survey of Family Growth - N Listing https://www.cdc.gov/nchs/nsfg/key_statistics/n.htm
Using the above information, I am going to cap (top-code) the number of short-term partners at 10.
mydata$casual_part <- ifelse(mydata$H4TR6 > 10, 10, mydata$H4TR6)
boxplot(mydata$casual_part)
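The same cap can be written with pmin(), which takes the element-wise minimum and leaves missing values as missing. This is just an alternative sketch on a made-up vector, not a change to the recode above.

```r
# Toy counts standing in for H4TR6 after the special codes were set to NA
partners <- c(0, 3, 10, 25, 95, NA)

# Anything above 10 becomes 10; NA stays NA
capped <- pmin(partners, 10)
capped
```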
table(mydata$H4LM3)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15
## 54 889 1138 1090 685 481 255 138 75 40 93 8 18 2 14
## 16 18 20 22 23 25 30 50 96 97 98
## 2 2 16 2 1 2 3 3 3 96 4
mydata$H4LM3[mydata$H4LM3 >90] <- NA
boxplot(mydata$H4LM3)
mydata$njobs <- mydata$H4LM3
summary(mydata$H4BMI)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.40 23.70 27.80 42.49 33.40 999.00 1390
There are out-of-range values here. Check the codebook and remove them using an ifelse statement that says: make a new variable BMI. If H4BMI is less than 100, set BMI equal to H4BMI. Otherwise (if H4BMI is \(\geq\) 100), set BMI to missing.
mydata$BMI <- ifelse(mydata$H4BMI < 100, mydata$H4BMI, NA)
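Since this "anything at or above a cutoff is a missing code" pattern comes up repeatedly, it could be wrapped in a small function. The helper name na_above is hypothetical (it is not from any package), and the toy values are invented.

```r
# Hypothetical helper: replace any value at or above `cutoff` with NA
na_above <- function(x, cutoff) {
  x[!is.na(x) & x >= cutoff] <- NA
  x
}

# Toy BMI values, with 999 standing in for an out-of-range code
bmi_raw <- c(21.4, 27.8, 33.4, 999, NA)
bmi <- na_above(bmi_raw, 100)

range(bmi, na.rm = TRUE)
```

The cutoff should always come from the codebook rather than from eyeballing the summary.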
table(mydata$H4BPCLS)
##
## 1 2 3 4 6 7 9
## 1718 2269 791 205 48 4 36
Set missing and change labels
mydata$H4BPCLS[mydata$H4BPCLS %in% c(6, 7, 9)] <- NA
mydata$bp_class <- factor(mydata$H4BPCLS, labels = c('Normal', 'Pre-HTN', 'HTN-I', 'HTN-II'))
table(mydata$H4BPCLS, mydata$bp_class, useNA="always")
##       
##        Normal Pre-HTN HTN-I HTN-II <NA>
##   1      1718       0     0      0    0
##   2         0    2269     0      0    0
##   3         0       0   791      0    0
##   4         0       0     0    205    0
##   <NA>      0       0     0      0 1521
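The two-way table is a good check because factor() with only labels assigns them to the sorted unique values that happen to be present. A slightly more defensive sketch, on a made-up vector where nobody falls in category 3, names the levels explicitly so each code always gets its intended label.

```r
# Toy codes: category 3 is absent in this sample
bp_codes <- c(1, 2, 2, 4, NA)

# Naming the levels ties each label to its code, present or not
bp_toy <- factor(bp_codes, levels = 1:4,
                 labels = c("Normal", "Pre-HTN", "HTN-I", "HTN-II"))

table(bp_codes, bp_toy, useNA = "always")
```

Without the levels argument, the same call on this toy vector would stop with an error, because there would be four labels for only three observed values.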
Time they wake up on workdays consists of three variables: hours (H4SP1H), minutes (H4SP1M), and whether the time listed is AM or PM (H4SP1T). I want to convert this to a continuous 24-hour time variable.
mydata$H4SP1T[mydata$H4SP1T %in% c(6, 8)] <- NA
mydata$H4SP1M[mydata$H4SP1M %in% c(96, 98)] <- NA
mydata$H4SP1H[mydata$H4SP1H %in% c(96, 98)] <- NA
Confirm
table(mydata$H4SP1H, useNA="always")
##
## 1 2 3 4 5 6 7 8 9 10 11 12 <NA>
## 40 48 95 309 1007 1582 1078 406 261 139 72 64 1403
summary(mydata$H4SP1M)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 13.77 30.00 59.00 1404
table(mydata$H4SP1T, useNA="always")
##
## 1 2 <NA>
## 4843 258 1403
mydata$ampm <- car::recode(mydata$H4SP1T, "1=0; 2=12")
But we don’t want to add 12 hours to 12pm, so we need to change the rows when hours = 12 and time = pm to not add any time.
mydata$ampm[mydata$H4SP1H == 12 & mydata$H4SP1T == 2] <- 0
mydata$wakeup <- mydata$H4SP1H + mydata$ampm + mydata$H4SP1M/60
summary(mydata$wakeup)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 5.750 6.500 7.023 7.500 23.500 1404
Wake up times range from 1am to 11:30 pm. No out of range times seen.
Check to make sure the times added up correctly. Print out the top 5 rows (using slice) for only the time columns (select), dropping the rows with missing values (na.omit).
library(dplyr)
mydata %>% select(H4SP1H, H4SP1M, ampm, wakeup) %>% na.omit %>% slice(1:5)
## # A tibble: 5 x 4
## H4SP1H H4SP1M ampm wakeup
## <int> <int> <dbl> <dbl>
## 1 6 0 0 6.0
## 2 9 0 0 9.0
## 3 6 30 0 6.5
## 4 9 30 0 9.5
## 5 5 0 0 5.0
Row 1 check: 6 hours, 0 minutes, am -> 0600 (6.0). Row 3 check: 6 hours, 30 minutes, am -> 0630 (6.5). Time was calculated correctly.
mydata$H4SP2T[mydata$H4SP2T %in% c(6, 8)] <- NA
mydata$H4SP2M[mydata$H4SP2M %in% c(96, 98)] <- NA
mydata$H4SP2H[mydata$H4SP2H %in% c(96, 98)] <- NA
mydata$ampm2 <- car::recode(mydata$H4SP2T, "1=0; 2=12")
mydata$ampm2[mydata$H4SP2H == 12 & mydata$H4SP2T == 2] <- 0
mydata$bedtime <- mydata$H4SP2H + mydata$ampm2 + mydata$H4SP2M/60
summary(mydata$bedtime)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 12.00 22.00 17.23 23.00 23.98 1402
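This recode mishandles people who went to bed or woke up between 12 and 1am: the hour 12 is kept as 12 instead of being converted to 0, so those times are stored as 12.x (just after noon) rather than 0.x - whups.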
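One way to avoid the midnight problem noted above is to take the hour modulo 12 before adding the PM offset, so that 12am becomes 0 and 12pm becomes 12. The function name to_24h is hypothetical, and the sketch assumes the same coding used above (1 = AM, 2 = PM).

```r
# Hypothetical helper: hour is 1-12, minute is 0-59, ampm is 1 = AM, 2 = PM
to_24h <- function(hour, minute, ampm) {
  # 12 %% 12 is 0, so 12am -> 0 and 12pm -> 0 + 12
  hour24 <- (hour %% 12) + ifelse(ampm == 2, 12, 0)
  hour24 + minute / 60
}

# 12:30am, 6:00am, 12:00pm, 11:30pm
times <- to_24h(hour = c(12, 6, 12, 11),
                minute = c(30, 0, 0, 30),
                ampm = c(1, 1, 2, 2))
times
# 0.5  6.0 12.0 23.5
```

Missing values in any of the three inputs carry through as NA, just as in the recode above.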
Indicator for people who likely work the night shift - go to bed between 5am and 3pm
odd.sleep.records <- which(mydata$bedtime > 5 & mydata$bedtime < 15)
# Not run -- run this line if you want to limit your analysis sample to only those with "normal" sleep schedule
#mydata <- mydata[-odd.sleep.records,]
Only useful after eliminating those with odd sleep patterns (above)
# people who go to bed early are expected to wake up before midnight.
mydata$sleep_duration <- mydata$wakeup - mydata$bedtime
# these people go to bed after 3pm, but before midnight. Get list of records where this occurs
normal.sleep.records <- which(mydata$bedtime < 24 & mydata$bedtime > 15)
# for these people who sleep over midnight, calculate their sleep duration.
mydata$sleep_duration[normal.sleep.records] <- (24-mydata$bedtime[normal.sleep.records]) +
mydata$wakeup[normal.sleep.records]
summary(mydata$sleep_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -9.500 6.000 7.500 5.467 8.000 31.000 1404
boxplot(mydata$sleep_duration)
# not run
# also kick out ppl with calculated sleep time over 12 h
# (note: filter() also drops rows where sleep_duration is NA)
#mydata <- filter(mydata, sleep_duration <= 12)
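A more compact way to handle sleep that crosses midnight is modular arithmetic: subtract bedtime from wake-up time and wrap negative results around a 24-hour clock. This is a sketch on invented times, and it assumes both variables are correctly coded on a 0-24 scale and that nobody sleeps 24 hours or more.

```r
# Toy 24-hour clock times (decimal hours)
bedtime <- c(22.5, 23.0, 1.0, 9.0, NA)
wakeup  <- c(6.5, 7.0, 8.0, 16.5, 7.0)

# In R, %% returns a non-negative result for a positive divisor,
# so -16 %% 24 is 8
sleep_duration <- (wakeup - bedtime) %% 24
sleep_duration
# 8.0 8.0 7.0 7.5  NA
```

This treats night-shift sleepers (row 4) the same way as everyone else, so no separate list of "normal" records is needed.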
table(mydata$H4TO1)
##
## 0 1 6 8
## 1773 3324 11 6
Recode 6 & 8 to missing, create a copy that is a factor variable.
mydata$H4TO1[mydata$H4TO1 %in% c(6,8)] <- NA
mydata$eversmoke_c <- factor(mydata$H4TO1, labels=c("Non Smoker", "Smoker"))
Now think about your personal earnings. In {2006/2007/2008}, how much income did you receive from personal earnings before taxes, that is, wages or salaries, including tips, bonuses, and overtime pay, and income from self-employment?
Personal Income is highly skewed right.
mydata$H4EC2[mydata$H4EC2 > 999995 ] <- NA
boxplot(mydata$H4EC2)
summary(mydata$H4EC2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 15000 30000 35046 45000 999995 1658
The federal poverty level was $10,210 for a single-person household in 2007. Let’s do two things. 1. Make an indicator variable called poverty for whether a person’s income is below the federal poverty level.
mydata$poverty <- ifelse(mydata$H4EC2 < 10210, 1, 0)
table(mydata$poverty)
##
## 0 1
## 3850 996
2. Make a new variable called income that contains the personal earnings for individuals who make at least this poverty limit, but no more than $250,000.
mydata$income <- ifelse(mydata$H4EC2<10210 | mydata$H4EC2>250000, NA, mydata$H4EC2)
summary(mydata$income)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10225 25000 35000 40150 50000 250000 2681
boxplot(mydata$income)
Income is still skewed right. Let’s take a log transformation of it in an effort to make the distribution more normal.
mydata$logincome <- log(mydata$income)
boxplot(mydata$logincome)
qqnorm(mydata$logincome)
qqline(mydata$logincome, col="red")
Still slightly skewed right, but better than before.
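To put a number on "better than before", sample skewness can be compared before and after the log. The sketch below uses simulated lognormal values, not the Add Health income variable, and the skewness function is a hypothetical helper written here rather than taken from a package.

```r
set.seed(2018)

# Simulated right-skewed "income"; not Add Health data
income_sim <- rlnorm(1000, meanlog = log(35000), sdlog = 0.6)

# Hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

skew_raw <- skewness(income_sim)
skew_log <- skewness(log(income_sim))
c(raw = skew_raw, log = skew_log)
```

A value near 0 suggests a roughly symmetric distribution; positive values indicate a right skew.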
Optional, but helpful when dealing with large data sets: only include the variables that I will be using in analysis. This uses the select() function from the dplyr library.
addhealth <- mydata %>% select(female, female_c, genhealth,BMI, bp_class, wakeup, bedtime, sleep_duration, poverty, income, logincome, eversmoke_c, njobs, casual_part)
str(addhealth)
Save the cleaned addhealth data set as a .Rdata file for use in graphing and analysis.
save(addhealth, file="addhealth_clean.Rdata")
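For reference, here is a small round-trip sketch of how a file saved this way is read back in later. It uses a made-up data frame and a temporary file path (both assumptions for illustration) so that it does not touch the real clean file.

```r
# Toy data frame standing in for the cleaned data
addhealth_toy <- data.frame(female = c(0, 1), BMI = c(22.5, 31.0))

# Save to a temporary location, then remove the object from the session
path <- file.path(tempdir(), "addhealth_toy.Rdata")
save(addhealth_toy, file = path)
rm(addhealth_toy)

# load() restores the object under its original name and
# returns the names of everything it restored
loaded <- load(path)
loaded
```

Because load() brings objects back under their saved names, it can silently overwrite an object of the same name in the current session.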
This document was compiled on 2018-01-25 18:31:57 and with the following system information:
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2 dplyr_0.7.2 readr_1.1.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.14 bindr_0.1 knitr_1.18
## [4] magrittr_1.5 hms_0.3 splines_3.4.1
## [7] MASS_7.3-47 lattice_0.20-35 R6_2.2.2
## [10] rlang_0.1.1 minqa_1.2.4 stringr_1.2.0
## [13] car_2.1-5 tools_3.4.1 parallel_3.4.1
## [16] nnet_7.3-12 pbkrtest_0.4-7 grid_3.4.1
## [19] nlme_3.1-131 mgcv_1.8-17 quantreg_5.33
## [22] MatrixModels_0.4-1 htmltools_0.3.6 assertthat_0.2.0
## [25] yaml_2.1.14 lme4_1.1-13 rprojroot_1.2
## [28] digest_0.6.12 tibble_1.3.3 Matrix_1.2-10
## [31] nloptr_1.0.4 glue_1.1.1 evaluate_0.10.1
## [34] rmarkdown_1.8 stringi_1.1.5 compiler_3.4.1
## [37] backports_1.1.0 SparseM_1.77 pkgconfig_2.0.1