The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. Add Health is re-interviewing cohort members in a Wave V follow-up from 2016-2018 to collect social, environmental, behavioral, and biological data with which to track the emergence of chronic disease as the cohort moves through their fourth decade of life.

More info at: http://www.cpc.unc.edu/projects/addhealth

Import

This is a CSV file, so we could read it in using read.csv, but I prefer the functionality of read_csv found in the readr package.

library(readr)
rawdata <- read_csv(file="AddHealth_Wave_IV.csv")

Trim down variables

Goof ups are bound to happen. Let’s rename our rawdata as mydata, and do all our data cleaning on mydata. That way if/when we goof up, we just need to run the code chunk below and it will reset mydata back to pre-recodes (and not have to read the entire CSV data set from the hard drive all over again.)

mydata <- rawdata 

We could use dplyr’s select() in the code chunk above to keep only the variables we want, but we can also “clean up” the data set at the end before saving the clean file to disk.

Recode variables

Gender

BIO_SEX is coded as 1=male, 2=female, 6=missing. I want to recode this into an indicator of being female.

mydata$female <- mydata$BIO_SEX-1
mydata$female[mydata$BIO_SEX==6] <- NA
table(mydata$BIO_SEX, mydata$female ,useNA="always")
##       
##           0    1 <NA>
##   1    3147    0    0
##   2       0 3356    0
##   6       0    0    1
##   <NA>    0    0    0
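The subtraction works because the only valid codes are 1 and 2. As a minimal alternative sketch, using a made-up vector `bio_sex` and a hypothetical lookup `sex_lookup` rather than the real data, the codes can be mapped through a named vector so that any code not listed (such as 6) becomes NA without a separate step:

```r
# Toy vector standing in for mydata$BIO_SEX (values are made up for illustration)
bio_sex <- c(1, 2, 2, 6, 1)

# Named lookup: names are the original codes, values are the new codes
sex_lookup <- c("1" = 0, "2" = 1)

# Codes absent from the lookup (here 6) come back as NA
female <- unname(sex_lookup[as.character(bio_sex)])
female
```

The cross-tabulation against the original variable is still worth running afterwards, whichever approach is used.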

For plotting purposes, I would also like to have a categorical version of gender.

mydata$female_c <- factor(mydata$female, labels=c("Male", "Female"))
table(mydata$female, mydata$female_c ,useNA="always")
##       
##        Male Female <NA>
##   0    3147      0    0
##   1       0   3356    0
##   <NA>    0      0    1

General Health

table(mydata$H4GH1)
## 
##    1    2    3    4    5 
##  979 1963 1683  434   55

No missing values, but I want to apply labels and convert to a factor variable.

mydata$genhealth <- factor(mydata$H4GH1, 
                              labels = c("Excellent", "Very good", "Good", "Fair", "Poor")) 
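Supplying only `labels` relies on every code from 1 to 5 appearing in the data, which is the case here. A slightly more defensive sketch, shown on a made-up vector `gh` in which code 5 never occurs, also passes `levels` so each label stays tied to its code:

```r
# Toy vector standing in for mydata$H4GH1; note that code 5 never occurs here
gh <- c(1, 2, 2, 3, 4)

# Explicit levels keep each label tied to its code even when a code is absent
genhealth <- factor(gh,
                    levels = 1:5,
                    labels = c("Excellent", "Very good", "Good", "Fair", "Poor"))
table(genhealth)
```

With `levels` given, the empty "Poor" category still shows up in tables with a count of zero.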

Relationships

With how many people have you had a romantic or sexual relationship that lasted less than 6 months since 2001?

table(mydata$H4TR6)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 3049  515  424  261  176  137   74   41   46   18   95    6   23    9    6 
##   15   16   17   18   20   21   22   23   24   25   26   28   30   37   38 
##   28    1    2    3   25    2    1    1    1   11    3    4   10    1    1 
##   40   45   50   60   65   75   90   95  996  998 
##    3    1    8    4    1    1    1    7   40   74
mydata$H4TR6[mydata$H4TR6 >990] <- NA
boxplot(mydata$H4TR6)

Median number of opposite-sex partners in lifetime among sexually experienced men and women aged 25-44 years of age 2002, 2006-2010 and 2011-2015: 6.7 for men, 3.8 for women. Key Statistics from the National Survey of Family Growth - N Listing https://www.cdc.gov/nchs/nsfg/key_statistics/n.htm

Using the above information, I am going to cap the number of short-term partners at 10: any value above 10 is set to 10.

mydata$casual_part <- ifelse(mydata$H4TR6 > 10, 10, mydata$H4TR6)
boxplot(mydata$casual_part)                           
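The same capping can be written with base R's `pmin()`. This is only a sketch on a made-up vector `partners`, not the real H4TR6 column:

```r
# Toy counts standing in for mydata$H4TR6 after the missing codes were set to NA
partners <- c(0, 3, 10, 25, NA, 95)

# pmin() takes the element-wise minimum, so anything above 10 becomes 10
casual_part <- pmin(partners, 10)
casual_part
```

Missing values stay missing, as they do with the ifelse() version.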

Number of jobs

table(mydata$H4LM3)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   15 
##   54  889 1138 1090  685  481  255  138   75   40   93    8   18    2   14 
##   16   18   20   22   23   25   30   50   96   97   98 
##    2    2   16    2    1    2    3    3    3   96    4
mydata$H4LM3[mydata$H4LM3 >90] <- NA
boxplot(mydata$H4LM3)

mydata$njobs <-mydata$H4LM3

BMI

summary(mydata$H4BMI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.40   23.70   27.80   42.49   33.40  999.00    1390

There are out-of-range values here. Check the codebook, then remove them using an ifelse statement that says: make a new variable BMI. If H4BMI is less than 100, set BMI equal to H4BMI. Otherwise (if H4BMI is \(\geq\) 100), set BMI to missing.

mydata$BMI <- ifelse(mydata$H4BMI < 100, mydata$H4BMI, NA)
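A small sketch on made-up values shows how this ifelse() behaves when the input already contains missing values. The vector `h4bmi` is hypothetical, with 999 playing the role of an out-of-range code:

```r
# Toy values standing in for mydata$H4BMI; 999 plays the role of an out-of-range code
h4bmi <- c(22.5, 31.0, 999, NA, 27.8)

# ifelse() returns NA wherever the test itself is NA, so existing NAs carry through
bmi <- ifelse(h4bmi < 100, h4bmi, NA)
bmi
```

Both the out-of-range value and the originally missing value end up as NA, so the count of NAs in the new variable is at least as large as in the original.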

Blood Pressure Class

table(mydata$H4BPCLS)
## 
##    1    2    3    4    6    7    9 
## 1718 2269  791  205   48    4   36

Set missing and change labels

mydata$H4BPCLS[mydata$H4BPCLS %in% c(6, 7, 9)] <- NA
mydata$bp_class <- factor(mydata$H4BPCLS, labels = c('Normal', 'Pre-HTN', 'HTN-I', 'HTN-II'))
table(mydata$H4BPCLS, mydata$bp_class, useNA="always")
##       
##        Normal Pre-HTN HTN-I HTN-II <NA>
##   1      1718       0     0      0    0
##   2         0    2269     0      0    0
##   3         0       0   791      0    0
##   4         0       0     0    205    0
##   <NA>      0       0     0      0 1521

Sleep

Wake up on workdays

The time they wake up on workdays consists of three variables: hours (H4SP1H), minutes (H4SP1M), and whether the time listed is AM or PM (H4SP1T). I want to convert this to a continuous 24-hour time variable.

  • First set values to missing
mydata$H4SP1T[mydata$H4SP1T %in% c(6, 8)] <- NA
mydata$H4SP1M[mydata$H4SP1M %in% c(96, 98)] <- NA
mydata$H4SP1H[mydata$H4SP1H %in% c(96, 98)] <- NA

Confirm

table(mydata$H4SP1H, useNA="always")
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 <NA> 
##   40   48   95  309 1007 1582 1078  406  261  139   72   64 1403
summary(mydata$H4SP1M)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00   13.77   30.00   59.00    1404
table(mydata$H4SP1T, useNA="always")
## 
##    1    2 <NA> 
## 4843  258 1403
  • Recode am/pm to numbers of hours to add (0 for AM, 12 for PM).
mydata$ampm <- car::recode(mydata$H4SP1T, "1=0; 2=12")

But we don’t want to add 12 hours to 12pm, so we need to change the rows when hours = 12 and time = pm to not add any time.

mydata$ampm[mydata$H4SP1H == 12 & mydata$H4SP1T == 2] <- 0
  • Now create 24 hour time by adding hours + am/pm + minutes
mydata$wakeup <- mydata$H4SP1H + mydata$ampm + mydata$H4SP1M/60
summary(mydata$wakeup)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   5.750   6.500   7.023   7.500  23.500    1404

Wake up times range from 1am to 11:30 pm. No out of range times seen.

Check to make sure the times added up correctly. Print out the top 5 rows (using slice) for only the time columns (select), and dropping the rows with missing values (na.omit)

library(dplyr)
mydata %>% select(H4SP1H, H4SP1M, ampm, wakeup) %>% na.omit %>% slice(1:5)
## # A tibble: 5 x 4
##   H4SP1H H4SP1M  ampm wakeup
##    <int>  <int> <dbl>  <dbl>
## 1      6      0     0    6.0
## 2      9      0     0    9.0
## 3      6     30     0    6.5
## 4      9     30     0    9.5
## 5      5      0     0    5.0

Row 1 check: 6 hours, 0 minutes, am, –> 0600 (6). Row 3 check: 6 hours, 30 minutes, am –> 0630 (6.5). Time was calculated correctly.

Bedtime on workdays

mydata$H4SP2T[mydata$H4SP2T %in% c(6, 8)] <- NA
mydata$H4SP2M[mydata$H4SP2M %in% c(96, 98)] <- NA
mydata$H4SP2H[mydata$H4SP2H %in% c(96, 98)] <- NA

mydata$ampm2 <- car::recode(mydata$H4SP2T, "1=0; 2=12")
mydata$ampm2[mydata$H4SP2H == 12 & mydata$H4SP2T == 2] <- 0
mydata$bedtime <- mydata$H4SP2H + mydata$ampm2 + mydata$H4SP2M/60
summary(mydata$bedtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   12.00   22.00   17.23   23.00   23.98    1402

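Times between 12:00 am and 12:59 am are not handled correctly: the recode adds 0 hours to an hour value of 12, so people who went to bed or woke up in that window are stored as 12.x (just after noon) instead of 0.x - whups.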
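One way to avoid the midnight problem is to take the hour modulo 12 before adding the AM/PM offset. The helper `to_24h` below is hypothetical and not part of the original analysis; it assumes the AM/PM variable is coded 1 = AM, 2 = PM, as in the recodes above.

```r
# Hypothetical helper (not part of the original analysis):
# convert 12-hour clock parts to decimal 24-hour time.
# Assumes ampm is coded 1 = AM, 2 = PM, as in the recodes above.
to_24h <- function(hour, minute, ampm) {
  # hour %% 12 turns 12 into 0, so 12 AM -> 0 and 12 PM -> 0 + 12 = 12
  (hour %% 12) + ifelse(ampm == 2, 12, 0) + minute / 60
}

# Made-up clock times: 12:30 AM, 6:30 AM, 12:00 PM, 11:30 PM
times_24h <- to_24h(hour   = c(12, 6, 12, 11),
                    minute = c(30, 30, 0, 30),
                    ampm   = c(1, 1, 2, 2))
times_24h
```

Because the modulo handles both the 12 AM and 12 PM cases, no separate correction for hour 12 is needed. The summaries shown in this document were produced with the original code, so they would change if this helper were used instead.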

Indicator for people who likely work the night shift - go to bed between 5am and 3pm

odd.sleep.records <- which(mydata$bedtime > 5 & mydata$bedtime < 15)

# Not run -- run this line if you want to limit your analysis sample to only those with "normal" sleep schedule
#mydata <- mydata[-odd.sleep.records,]
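The `which()` call returns row numbers rather than an indicator variable. As a sketch of the alternative, on a made-up data frame `toy`, a logical column can be stored on the data and used for filtering. Unlike negative indexing with the result of `which()`, this also behaves sensibly when no rows are flagged.

```r
# Toy data frame standing in for mydata (bedtimes are made up)
toy <- data.frame(id = 1:5, bedtime = c(22.5, 9.0, 23.0, NA, 1.5))

# Logical indicator: TRUE for bedtimes between 5am and 3pm, NA when bedtime is missing
toy$night_shift <- toy$bedtime > 5 & toy$bedtime < 15

# Keep rows not flagged as night shift; %in% treats NA as "not TRUE", so
# rows with a missing bedtime are kept rather than dropped
normal_sleepers <- toy[!(toy$night_shift %in% TRUE), ]
normal_sleepers$id
```

Whether rows with a missing bedtime should be kept is an analysis decision; this sketch keeps them.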

Sleep Duration

Only useful after eliminating those with odd sleep patterns (above)

# default: bedtime and wake-up fall on the same calendar day, so duration is wake-up minus bedtime
mydata$sleep_duration <- mydata$wakeup - mydata$bedtime 

# these people go to bed after 3pm, but before midnight. Get list of records where this occurs
normal.sleep.records <- which(mydata$bedtime < 24 & mydata$bedtime > 15)

# for these people who sleep over midnight, calculate their sleep duration. 
mydata$sleep_duration[normal.sleep.records] <- (24-mydata$bedtime[normal.sleep.records]) +
                                                      mydata$wakeup[normal.sleep.records]
summary(mydata$sleep_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -9.500   6.000   7.500   5.467   8.000  31.000    1404
boxplot(mydata$sleep_duration)
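The two-branch calculation can also be sketched with modular arithmetic, which wraps any negative difference around midnight. The vectors below are made up, and the sketch assumes every sleep period is shorter than 24 hours and that both times are valid 24-hour values:

```r
# Made-up decimal clock times: bedtime and wake-up on a 24-hour scale
bedtime <- c(22.0, 23.5, 1.0, 9.0)
wakeup  <- c(6.0, 7.0, 8.5, 16.5)

# %% 24 wraps negative differences around midnight (e.g. 6 - 22 = -16 -> 8)
sleep_duration <- (wakeup - bedtime) %% 24
sleep_duration
```

This never produces a negative duration, but it does not replace screening for implausible records: a miscoded time still yields a number between 0 and 24 that looks legitimate.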

# not run
# also kick out ppl with calculated sleep time over 12 h (keep 12 h or less)
#mydata <- filter(mydata, sleep_duration <= 12)

Smoking (H4TO1)

table(mydata$H4TO1)
## 
##    0    1    6    8 
## 1773 3324   11    6

Recode 6 & 8 to missing, create a copy that is a factor variable.

mydata$H4TO1[mydata$H4TO1 %in% c(6,8)] <- NA
mydata$eversmoke_c <- factor(mydata$H4TO1, labels=c("Non Smoker", "Smoker"))

Income

Now think about your personal earnings. In {2006/2007/2008},how much income did you receive from personal earnings before taxes, that is, wages or salaries, including tips, bonuses, and overtime pay, and income from self-employment?

Personal Income is highly skewed right.

mydata$H4EC2[mydata$H4EC2 > 999995 ] <- NA
boxplot(mydata$H4EC2)

summary(mydata$H4EC2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   15000   30000   35046   45000  999995    1658

The federal poverty level was $10,210 for a single-person household in 2007. Let’s do two things:

  1. Make an indicator variable called poverty for whether a person’s income is below the federal poverty level.

mydata$poverty <- ifelse(mydata$H4EC2 < 10210, 1, 0)
table(mydata$poverty)
## 
##    0    1 
## 3850  996
  2. Make a new variable called income that contains the personal earnings for individuals who make above this poverty limit, but below 250,000
mydata$income <- ifelse(mydata$H4EC2<10210 | mydata$H4EC2>250000, NA, mydata$H4EC2)
summary(mydata$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10225   25000   35000   40150   50000  250000    2681
boxplot(mydata$income)

Income is still skewed right. Let’s take a log transformation of it in an effort to make the distribution more normal.

mydata$logincome <- log(mydata$income)
boxplot(mydata$logincome)

qqnorm(mydata$logincome)
qqline(mydata$logincome, col="red")
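The effect of the log transformation can also be checked numerically. The sketch below uses simulated log-normal values, not Add Health data, and a hypothetical helper `skewness` that computes the moment-based sample skewness (positive values indicate a right skew):

```r
# Simulated right-skewed "income" values (log-normal); not Add Health data
set.seed(2018)
income_sim <- rlnorm(1000, meanlog = 10.5, sdlog = 0.5)

# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5
}

skew_raw <- skewness(income_sim)
skew_log <- skewness(log(income_sim))
c(raw = skew_raw, log = skew_log)
```

Applied to the real data, the same helper could be called on mydata$income and mydata$logincome to put a number on what the boxplot and Q-Q plot show.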

Still slightly skewed right, but better than before.

Trim down variables

Optional but helpful when dealing with large data sets. Only include variables that I will be using in analysis. This uses the dplyr library select() function.

addhealth <- mydata %>% select(female, female_c, genhealth,BMI, bp_class, wakeup, bedtime, sleep_duration, poverty, income, logincome, eversmoke_c, njobs, casual_part) 
str(addhealth)

Export analysis data

Save the cleaned addhealth data set as a .Rdata file for use in graphing and analysis.

save(addhealth, file="addhealth_clean.Rdata")
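A quick round trip confirms that the saved file restores the data set under its original name. This sketch uses a made-up data frame `addhealth_toy` and a temporary file so that nothing in the working directory is touched:

```r
# Toy data frame standing in for the cleaned addhealth data set
addhealth_toy <- data.frame(female = c(0, 1, 1), BMI = c(22.5, 31.0, NA))

# Write to a temporary file so nothing in the working directory is touched
tmp <- tempfile(fileext = ".Rdata")
save(addhealth_toy, file = tmp)

# load() restores objects under their original names and returns those names
rm(addhealth_toy)
restored <- load(tmp)
restored
```

In an analysis script, load("addhealth_clean.Rdata") would bring back an object called addhealth in the same way.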

Session Info

This document was compiled on 2018-01-25 18:31:57 and with the following system information:

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2 dplyr_0.7.2  readr_1.1.1 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.14       bindr_0.1          knitr_1.18        
##  [4] magrittr_1.5       hms_0.3            splines_3.4.1     
##  [7] MASS_7.3-47        lattice_0.20-35    R6_2.2.2          
## [10] rlang_0.1.1        minqa_1.2.4        stringr_1.2.0     
## [13] car_2.1-5          tools_3.4.1        parallel_3.4.1    
## [16] nnet_7.3-12        pbkrtest_0.4-7     grid_3.4.1        
## [19] nlme_3.1-131       mgcv_1.8-17        quantreg_5.33     
## [22] MatrixModels_0.4-1 htmltools_0.3.6    assertthat_0.2.0  
## [25] yaml_2.1.14        lme4_1.1-13        rprojroot_1.2     
## [28] digest_0.6.12      tibble_1.3.3       Matrix_1.2-10     
## [31] nloptr_1.0.4       glue_1.1.1         evaluate_0.10.1   
## [34] rmarkdown_1.8      stringi_1.1.5      compiler_3.4.1    
## [37] backports_1.1.0    SparseM_1.77       pkgconfig_2.0.1