The present document includes the pre-processing steps to read the intensive longitudinal data collected with the Fitbit Charge 3 (FC3) and Survey Sparrow from a sample of 93 adolescents with and without insomnia.
The types of data collected in this study were the following:
dailyAct = diurnal daily activity data including the day-by-day total No. of steps and the resting heart rate recorded with the FC3, and the associated metrics automatically averaged by the Fitabase system, for each participant
hourlySteps = hourly steps count data including the hour-by-hour No. of steps recorded with the FC3 device, automatically organized by the Fitabase system (the same information is included at a lower temporal resolution in the dailyAct dataset), for each participant
sleepLog = nocturnal daily sleep data including the log-by-log sleep measures recorded with the FC3 device, automatically organized by the Fitabase system with one row per identified sleep period (sleepLog), for each participant
sleepEBE = nocturnal sleep data including the 30-sec epoch-by-epoch categorization of sleep periods as either ‘wake’ or ‘light,’ ‘deep,’ or ‘REM’ sleep, for each participant
classicEBE = nocturnal sleep data including the 60-sec epoch-by-epoch categorization of sleep periods as either ‘wake’ or ‘sleep,’ automatically organized by the Fitabase system with one row per epoch, for each participant
HR.1min = diurnal and nocturnal heart rate (HR) data including 60-sec epoch-by-epoch HR values recorded by the Fitbit device, automatically organized by the Fitabase system with one row per epoch, for each participant
dailyDiary = daily diary reports including day-by-day appraisal of psychological distress recorded and automatically stored by Survey Sparrow (SurveySparrow Inc.) for each participant
demos = demographic data (i.e., group, sex, and BMI) manually stored for each participant
Here, we remove all objects from the R global environment, and we set the system time zone.
# removing all objects from the workspace
rm(list=ls())
# setting system time zone to GMT (for consistent temporal synchronization)
Sys.setenv(TZ="GMT") # the environment variable name is case-sensitive on most systems
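As a quick check that the time zone setting took effect, the sketch below parses a timestamp without an explicit tz argument and verifies that it is interpreted as GMT. It assumes that the session time zone is controlled through the upper-case TZ environment variable; the object names (epoch, tz_ok) are illustrative only.

```r
# setting the session time zone through the TZ environment variable
Sys.setenv(TZ = "GMT")

# the Unix epoch parsed without an explicit 'tz' should map to 0 seconds under GMT
epoch <- as.POSIXct("1970-01-01 00:00:00")
tz_ok <- Sys.getenv("TZ") == "GMT" && as.numeric(epoch) == 0
cat("time zone check:", tz_ok, "\n")
```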
The following R packages are used in this document (see References section):
# required packages
packages <- c("ggplot2","gridExtra","lubridate","tidyr","dplyr","tcltk")
# generate packages references
knitr::write_bib(c(.packages(), packages),"packagesProc.bib")
# # run to install missing packages
# xfun::pkg_attach2(packages, message = FALSE); rm(list=ls())
Here, we use the multidata.read function to read the data downloaded from the Fitabase and the Survey Sparrow clouds. The data were downloaded separately for each participant, and stored into separate folders.
#' @title Reading data from multiple files (one per subject)
#' @param data.path = character vector indicating the path to the folder including the data file.
#' @param idChar = numeric vector of length 2 indicating the first and the last letter in the file names to be used as participants' identification code
#' @param groupChar = numeric vector of length 2 indicating the first and the last letter in the file names to be used as group identification code (default: NA)
#' @param nSubj = integer indicating the number of participants that should be included in the data files (for sanity check)
#' @param surveySparrow = ad-hoc argument to process the data files collected with Survey Sparrow
multidata.read <- function(data.path,idChar,groupChar=NA,nSubj=NA,surveySparrow=FALSE){
# reading files from data.path
paths <- list.files(data.path)
# iteratively reading each file and adding it to the same data.frame
for(path in paths){
new.data <- read.csv(paste(data.path,path,sep="/"))
new.data$ID <- as.factor(paste("s",substr(path,idChar[1],idChar[2]),sep="")) # participant's identifier from file name
if(!is.na(groupChar[1]) & !is.na(groupChar[2])){
new.data$group <- as.factor(substr(path,groupChar[1],groupChar[2])) # group's identifier from file name
new.data <- new.data[,c(ncol(new.data)-1,ncol(new.data),1:(ncol(new.data)-2))] # sorting columns
} else { new.data <- new.data[,c(ncol(new.data),1:(ncol(new.data)-1))] }
# ad-hoc code for surveySparrow data (shortening column names, adding missing columns)
if(surveySparrow==TRUE){
colnames(new.data) <-
gsub("PleaseindicateifthestresswasrelatedtooneormoreofthefollowingfactorsafteryoumadeyourselectionsclickOK","",
gsub("AreyouworriedaboutanyormoreofthefollowingafteryoumadeyourselectionsclickOK","",
gsub("\\.","",colnames(new.data))))
if(ncol(new.data)==40){ new.data[,c("SubmissionId","TimeZone","DeviceType","BrowserLanguage")] <- NULL }
if(ncol(new.data)==34){ # adding COVID-related columns (empty) in first subjects' data
new.data <- cbind(new.data[,1:6],Coronavirusrelatednegmydailyactivitieswererestrictedduetocoronavirus=NA,
new.data[,7:14],Coronavirusrelatednegmydailyactivitieswererestrictedduetocoronavirus=NA,
new.data[,15:ncol(new.data)]) }
if(new.data[1,"ID"]=="s095"){ colnames(new.data)[32:33] <- colnames(data)[32:33] } # fixing wrongly encoded col names
colnames(new.data)[which(colnames(new.data)=="Coronavirusrelatednegmydailyactivitieswererestrictedduetocoronavirus")] <-
c("COVIDrestrictions_stress","COVIDrestrictions_worry") }
# adding the new participant to the dataset
if(path == paths[1]){ data <- new.data } else { data <- rbind(data,new.data) }}
# sanity check based on No. of subjects
if(!is.na(nSubj)){ cat("sanity check:",nlevels(as.factor(data$ID)) == nSubj) }
return(data) }
# Fitbit - dailyAct (6,019 days from 93 participants)
dailyAct <- multidata.read(data.path="DATA/Fitbit All Daily Activity",idChar=c(5,7),groupChar=c(8,11),nSubj=93)
## sanity check: TRUE
# Fitbit - hourlySteps (143,961 hours from 93 participants)
hourlySteps <- multidata.read(data.path="DATA/Fitbit Hourly Steps",idChar=c(5,7),groupChar=c(8,11),nSubj=93)
## sanity check: TRUE
# Fitbit - sleepLog (5,403 nights from 93 participants)
sleepLog <- multidata.read(data.path="DATA/Fitbit Sleep Log",idChar=c(5,7),groupChar=c(8,11),nSubj=93)
## sanity check: TRUE
# Fitbit - sleepEBE (4,243,037 epochs from 93 participants)
sleepEBE <- multidata.read(data.path="DATA/Fitbit Sleep Stages (30sec)",idChar=c(5,7),groupChar=c(8,11),nSubj=93)
## sanity check: TRUE
# Fitbit - classicEBE (2,505,034 epochs from 93 participants)
classicEBE <- multidata.read(data.path="DATA/Fitbit Sleep Classic (1min)",idChar=c(5,7),groupChar=c(8,11),nSubj=93)
## sanity check: TRUE
# Fitbit - HR.1min (6,986,307 epochs from 93 participants)
HR.1min <- multidata.read(data.path="DATA/Fitbit HR (1min)",idChar=c(5,7),groupChar=c(8,11),nSubj=93)
## sanity check: TRUE
# Survey Sparrow - dailyDiary (5,133 daily responses from 93 participants)
dailyDiary <- multidata.read(data.path="DATA/Survey Sparrows",idChar=c(5,7),groupChar=c(8,11),nSubj=93,surveySparrow=TRUE)
## sanity check: TRUE
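The idChar and groupChar arguments index character positions within each file name. The sketch below illustrates the extraction on a hypothetical file name: the actual file names are not shown in this document, so the pattern is only assumed from the id values stored in the demographics file (e.g., INSA001FBSR).

```r
# hypothetical file name, assumed to follow the 'INSA' + ID + group pattern
path <- "INSA001FBSR_dailyActivity.csv"
idChar <- c(5, 7)     # characters 5 to 7 -> participant code
groupChar <- c(8, 11) # characters 8 to 11 -> group code

# same extraction logic as in multidata.read
ID <- paste("s", substr(path, idChar[1], idChar[2]), sep = "")
group <- substr(path, groupChar[1], groupChar[2])
cat(ID, group, "\n")
```

With this naming pattern, the extracted values correspond to the first levels of the ID and group factors reported in the dataset structures.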
Then, we read the demos dataset, including demographic variables (N = 107) that were recorded in the demographics.csv data file.
# Demographics - demos (107 participants)
demos <- read.csv2("DATA/demographics.csv",header=TRUE) # csv2 because saved with ; as column separator
Here, we inspect the structure of each dataset. Note that each FC3-derived dataset includes all variables available from the Fitabase platform.
# dailyAct
str(dailyAct)
## 'data.frame': 6019 obs. of 20 variables:
## $ ID : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ActivityDate : chr "1/7/2019" "1/8/2019" "1/9/2019" "1/10/2019" ...
## $ TotalSteps : int 11023 16524 14904 8000 0 0 0 0 5273 6718 ...
## $ TotalDistance : num 9.66 13.17 11.97 6.62 0 ...
## $ TrackerDistance : num 9.66 13.17 11.97 6.62 0 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 2.84 0 1.97 1.24 0 ...
## $ ModeratelyActiveDistance: num 4.55 3.54 2.28 0.21 0 ...
## $ LightActiveDistance : num 2.27 9.47 7.71 5.06 0 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 16 0 21 11 0 0 0 0 9 7 ...
## $ FairlyActiveMinutes : int 54 39 34 3 0 0 0 0 3 3 ...
## $ LightlyActiveMinutes : int 113 414 266 153 0 0 0 0 147 222 ...
## $ SedentaryMinutes : int 1257 266 453 735 1440 1440 1440 1440 1281 1208 ...
## $ Calories : int 2138 2669 2519 2013 1505 1505 1505 1505 1945 2085 ...
## $ Floors : int 4 9 13 10 0 0 0 0 0 0 ...
## $ CaloriesBMR : int 1505 1505 1505 1505 1505 1505 1505 1505 1505 1505 ...
## $ MarginalCalories : int 479 790 716 356 0 0 0 0 308 360 ...
## $ RestingHeartRate : int 60 64 64 65 NA NA NA NA NA NA ...
# hourlySteps
str(hourlySteps)
## 'data.frame': 143961 obs. of 4 variables:
## $ ID : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ActivityHour: chr "1/7/2019 0:00" "1/7/2019 1:00" "1/7/2019 2:00" "1/7/2019 3:00" ...
## $ StepTotal : int 0 0 0 0 0 0 0 0 0 0 ...
# sleepLog
str(sleepLog)
## 'data.frame': 5403 obs. of 30 variables:
## $ ID : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LogId : num 2.08e+10 2.08e+10 2.08e+10 2.11e+10 2.11e+10 ...
## $ StartTime : chr "1/7/2019 10:06:30 PM" "1/8/2019 9:59:30 PM" "1/9/2019 10:45:00 PM" "1/25/2019 9:53:00 PM" ...
## $ Duration : int 36060000 35460000 32280000 36780000 34800000 34560000 31620000 31560000 29040000 27240000 ...
## $ Efficiency : int 89 87 90 90 92 91 91 93 93 93 ...
## $ IsMainSleep : chr "True" "True" "True" "True" ...
## $ SleepDataType : chr "stages" "stages" "stages" "stages" ...
## $ MinutesAfterWakeUp : int 0 0 0 0 0 0 0 0 0 2 ...
## $ MinutesAsleep : int 507 502 469 511 515 496 448 451 423 387 ...
## $ MinutesToFallAsleep : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TimeInBed : int 601 591 538 613 580 576 527 526 484 454 ...
## $ ClassicAsleepCount : int NA NA NA NA NA NA NA NA NA NA ...
## $ ClassicAsleepDuration : int NA NA NA NA NA NA NA NA NA NA ...
## $ ClassicAwakeCount : int NA NA NA NA NA NA NA NA NA NA ...
## $ ClassicAwakeDuration : int NA NA NA NA NA NA NA NA NA NA ...
## $ ClassicRestlessCount : int NA NA NA NA NA NA NA NA NA NA ...
## $ ClassicRestlessDuration: int NA NA NA NA NA NA NA NA NA NA ...
## $ StagesWakeCount : int 53 43 43 53 45 46 36 34 42 24 ...
## $ StagesWakeDuration : int 94 89 69 102 65 80 79 75 61 67 ...
## $ StagesWakeThirtyDayAvg : int 0 94 92 84 89 84 83 83 82 79 ...
## $ StagesLightCount : int 45 32 31 46 39 42 22 28 29 24 ...
## $ StagesLightDuration : int 353 297 283 330 272 287 258 269 243 178 ...
## $ StagesLightThirtyDayAvg: int 0 353 325 311 316 307 304 297 294 288 ...
## $ StagesDeepCount : int 2 2 3 3 6 4 3 4 4 4 ...
## $ StagesDeepDuration : int 66 91 97 92 119 104 75 65 88 103 ...
## $ StagesDeepThirtyDayAvg : int 0 66 79 85 87 93 95 92 89 89 ...
## $ StagesREMCount : int 17 20 17 16 18 15 20 16 19 8 ...
## $ StagesREMDuration : int 88 114 89 89 124 105 115 117 92 106 ...
## $ StagesREMThirtyDayAvg : int 0 88 101 97 95 101 102 103 105 104 ...
# sleepEBE
str(sleepEBE)
## 'data.frame': 4243037 obs. of 7 variables:
## $ ID : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LogId : num 2.08e+10 2.08e+10 2.08e+10 2.08e+10 2.08e+10 ...
## $ Time : chr "1/7/2019 22:06" "1/7/2019 22:07" "1/7/2019 22:07" "1/7/2019 22:08" ...
## $ Level : chr "light" "light" "light" "light" ...
## $ ShortWakes: chr "" "" "" "" ...
## $ SleepStage: chr "light" "light" "light" "light" ...
# classicEBE
str(classicEBE)
## 'data.frame': 2505034 obs. of 5 variables:
## $ ID : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group: Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ date : chr "1/7/2019 22:06" "1/7/2019 22:07" "1/7/2019 22:08" "1/7/2019 22:09" ...
## $ value: int 1 1 1 2 1 1 2 2 1 1 ...
## $ logId: num 2.08e+10 2.08e+10 2.08e+10 2.08e+10 2.08e+10 ...
# HR.1min
str(HR.1min)
## 'data.frame': 6986307 obs. of 4 variables:
## $ ID : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group: Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : chr "1/7/2019 3:50:00 PM" "1/7/2019 3:51:00 PM" "1/7/2019 3:52:00 PM" "1/7/2019 3:53:00 PM" ...
## $ Value: int 84 74 69 69 68 68 70 65 70 70 ...
# dailyDiary
str(dailyDiary)
## 'data.frame': 5133 obs. of 36 variables:
## $ ID : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Howstressfulwasyourday : chr "Not so stressful" "Not so stressful" "Not at all stressful" "Somewhat stressful" ...
## $ SchoolegIhadanexam : chr "School (e.g., I had an exam)" "" "" "" ...
## $ FamilyegIhadanargumentwithmyparents : chr "" "" "" "" ...
## $ HealthegIhadanaccident : chr NA NA NA NA ...
## $ COVIDrestrictions_stress : chr NA NA NA NA ...
## $ RelationswithyourpeersegIhadafightwithmyfriend : chr "" "" "" "" ...
## $ Other : chr "Other" "Other" "" "Other" ...
## $ Howisyourmoodrightnow : chr "Somewhat good" "Very good" "Very good" "Very good" ...
## $ Howworrieddoyoufeelrightnow : chr "Not so worried" "Not so worried" "Not so worried" "Not at all worried" ...
## $ SchoolegtomorrowIhaveanexam : chr "School (e.g., tomorrow I have an exam)" "" "" "" ...
## $ FamilyegtomorrowIneedtodosomethingimportantwithmyparents : chr "" "" "" "" ...
## $ HealthegtomorrowIhaveanimportantvisittothedoctor : chr NA NA NA NA ...
## $ RelationswithyourpeersegmyfriendaskedmetotalkandIdonotknowwhatitisabout: chr "" "" "" "" ...
## $ COVIDrestrictions_worry : chr NA NA NA NA ...
## $ SleepegIamworriedthatIamnotgoingtosleepwelltonight : chr NA NA NA NA ...
## $ OtheregIamworriedaboutsomethingelsehappeningtomorrow : chr "Other (e.g., I am worried about something else happening tomorrow)" "Other (e.g., I am worried about something else happening tomorrow)" "Other (e.g., I am worried about something else happening tomorrow)" "" ...
## $ TotalScore : int 2 1 NA 2 1 NA NA 2 6 3 ...
## $ StartedTime : chr "4/9/2019 21:10" "4/8/2019 21:08" "4/7/2019 21:29" "4/6/2019 21:01" ...
## $ SubmittedTime : chr "4/9/2019 21:10" "4/8/2019 21:08" "4/7/2019 21:29" "4/6/2019 21:01" ...
## $ CompletionStatus : chr "Completed" "Completed" "Completed" "Completed" ...
## $ IPAddress : chr "66.180.182.236" "66.180.182.232" "172.56.34.144" "172.58.231.24" ...
## $ Location : logi NA NA NA NA NA NA ...
## $ DMSLatLong : logi NA NA NA NA NA NA ...
## $ ChannelType : chr "EMAIL" "EMAIL" "EMAIL" "EMAIL" ...
## $ ChannelName : chr NA NA NA NA ...
## $ DeviceID : logi NA NA NA NA NA NA ...
## $ DeviceName : logi NA NA NA NA NA NA ...
## $ Browser : chr NA NA NA NA ...
## $ OS : chr NA NA NA NA ...
## $ ContactName : chr NA NA NA NA ...
## $ ContactEmail : chr NA NA NA NA ...
## $ ContactMobile : logi NA NA NA NA NA NA ...
## $ ContactPhone : logi NA NA NA NA NA NA ...
## $ ContactJobTitle : logi NA NA NA NA NA NA ...
# demos
str(demos)
## 'data.frame': 107 obs. of 13 variables:
## $ id : chr "INSA001FBSR" "INSA002FSRB" "INSA004MBSR" "INSA005FSRB" ...
## $ sex : int 0 0 1 0 1 0 0 1 1 0 ...
## $ age : chr "19.3322473" "19.28570288" "18.80930706" "18.754663" ...
## $ BMI : chr "20.59570313" "18.54801038" "31.09001041" "19.93383743" ...
## $ insomnia : int 0 0 0 0 1 0 0 0 0 0 ...
## $ DSMinsomnia : int 0 0 0 0 1 0 0 0 0 0 ...
## $ sub_insomnia: int 0 0 0 0 NA 0 0 0 0 0 ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : logi NA NA NA NA NA NA ...
## $ X.2 : logi NA NA NA NA NA NA ...
## $ X.3 : logi NA NA NA NA NA NA ...
## $ X.4 : logi NA NA NA NA NA NA ...
## $ X.5 : logi NA NA NA NA NA NA ...
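In the structure above, age and BMI are read as character. One possible reason, offered here as an assumption rather than a diagnosis of the actual file, is that read.csv2 expects a comma as the decimal mark by default, whereas the values use a dot. The toy example below (made-up values, in-memory text rather than the real file) shows the effect of the dec argument.

```r
# toy semicolon-separated data with dot decimals (made-up values)
txt <- "id;age;BMI\nA01;19.3;20.6\nA02;18.8;31.1"

default_read <- read.csv2(text = txt)        # dec = "," by default
dot_read <- read.csv2(text = txt, dec = ".") # dot as decimal mark

class(default_read$age) # not numeric: "19.3" is not a valid number when dec = ","
class(dot_read$age)     # numeric
```

An alternative that leaves the reading step unchanged is to convert the affected columns afterwards with as.numeric.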
Here, we recode the time format of each dataset to consistently synchronize the temporal coordinates across measurement modalities. For each dataset, depending on its temporal resolution, we recode or create the ActivityDate variable (i.e., indicating the day of the year using the ‘yyyy-mm-dd’ format) and the time variable (i.e., indicating the time within each day using the ‘hh:mm:ss’ format).
The timeCheck function is used to check whether the temporal coordinates of the data points match the data collection interval (i.e., from January 2019 to April 2021), and whether missing data points are present within the data.
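To make the target formats concrete, the sketch below converts one date string and one date-time string from the raw Fitabase encoding (as shown in the dataset structures above) into the formats described here. It uses only base R, and the object names are illustrative.

```r
# raw encodings as they appear in the exported files
raw_day <- "1/7/2019"
raw_hour <- "1/7/2019 22:06"

# day of the year in 'yyyy-mm-dd' format
day <- as.Date(raw_day, format = "%m/%d/%Y")

# date-time as POSIXct in GMT, and time within the day in 'hh:mm:ss' format
hour <- as.POSIXct(raw_hour, format = "%m/%d/%Y %H:%M", tz = "GMT")
time_within_day <- format(hour, "%H:%M:%S")

print(day)
print(time_within_day)
```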
#' @title Recoding and checking data temporal synchronization
#' @param data = data.frame.
#' @param day = character string indicating the name of the variable including the day of the year
#' @param hour = character string indicating the name of the variable including the time within the day (optional, default: NA)
#' @param returnInfo = logical indicating whether the participants' compliance dataset should be returned instead of the recoded dataset (default: FALSE)
#' @param printInfo = logical indicating whether the summary information on participants' compliance should be printed (default: TRUE)
#' @param input.dayFormat = character string indicating the current format of the 'day' variable (default: "%m/%d/%Y"). See ?strptime for details
#' @param output.dayFormat = character string indicating the desired format of the 'day' variable (default: "%Y-%m-%d"). See ?strptime for details
#' @param output.hourFormat = character string indicating the desired format of the 'hour' variable (default: "%m/%d/%Y %H:%M"). See ?strptime for details
timeCheck <- function(data,ID="ID",day="ActivityDate",hour=NA,returnInfo=FALSE,printInfo=TRUE,
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M",
add30=FALSE,LogId=NA,day.withinNight=FALSE){
require(ggplot2); require(gridExtra) # required packages for data visualization
# setting column names
if(!is.na(day)){ colnames(data)[which(colnames(data)==day)] <- "day" }
colnames(data)[which(colnames(data)==ID)] <- "ID"
if(!is.na(hour)){ colnames(data)[which(colnames(data)==hour)] <- "hour"
if(!is.na(LogId)){ colnames(data)[which(colnames(data)==LogId)] <- "LogId"
data$LogId <- as.factor(as.character(data$LogId))}}
# setting time format
if(!is.null(data$day[1])){ # setting day format
data$day <- as.Date(format(as.POSIXct(data$day,format=input.dayFormat),format=output.dayFormat)) }
if(!is.na(hour)){ # setting hour format
data$hour <- as.POSIXct(data$hour,format=output.hourFormat,tz="GMT")
# resolution from minutes to seconds (i.e., for 30-sec EBE data: adds 30 sec to every other epoch)
if(add30==TRUE & !is.na(LogId)){
for(LOG in levels(data$LogId)){
data[data$LogId==LOG,"hour"][seq(from=2,to=nrow(data[data$LogId==LOG,]),by=2)] <-
data[data$LogId==LOG,"hour"][seq(from=2,to=nrow(data[data$LogId==LOG,]),by=2)] + 30
if(difftime(head(data[data$LogId==LOG,"hour"],2)[2],head(data[data$LogId==LOG,"hour"],1),units="secs")>30){
data[data$LogId==LOG,"hour"][1] <- data[data$LogId==LOG,"hour"][1] + 30 # adds 30 sec when sleep starts at :30
}}}
if(is.na(day) | is.null(data$day[1])){ # when hour but not day is specified, day is computed from hour
data$day <- as.Date( format( as.POSIXct(data$hour,format=input.dayFormat), output.dayFormat))
if(!is.na(hour) & !is.na(LogId) & day.withinNight==TRUE){ # date as the first epoch's date
for(LOG in levels(data$LogId)){ data[data$LogId==LOG,"day"] <- data[data$LogId==LOG,"day"][1]}}}}
# sorting data by participant, day, and time
if(is.na(hour)){ data <- data[order(data$ID,data$day),]
}else{ data <- data[order(data$ID,data$day,data$hour),] }
# checking number of data and temporal interval for each participant
info <- data.frame(ID=levels(data$ID))
if(!is.na(hour)){
data$IDday <- as.factor(paste(data$ID,data$day,sep="_")) # participant X day identifier
dataDay <- data[!duplicated(data$IDday),] # taking only the first row for each participant X day combination
} else { dataDay <- data }
for(i in 1:nrow(info)){
IDdata <- dataDay[dataDay$ID==info[i,"ID"],]
info[i,"nDays"] <- nrow(IDdata) # nDays = No. of data points (days) per participant
info[i,"Tint"] <- as.integer(difftime(tail(IDdata$day,1),head(IDdata$day,1),units="days")) + 1
# counting No. of missing days
nMissingDays <- maxdayDiff <- 0
if(nrow(IDdata)>1){
for(j in 2:nrow(IDdata)){ dayDiff <- as.integer(difftime(IDdata[j,"day"],IDdata[j-1,"day"],units="days"))
if(dayDiff>1){ nMissingDays <- nMissingDays + as.integer(dayDiff) - 1
if(dayDiff>maxdayDiff){ maxdayDiff <- dayDiff }}}
info[i,"nMissingDays"] <- nMissingDays
info[i,"maxdayDiff"] <- maxdayDiff }
}
# counting No. of duplicates
if(!is.na(hour)){
data$IDhour <- as.factor(paste(data$ID,data$hour,sep=""))
dupHour <- nrow(data[duplicated(data$IDhour),])
} else {
dataDay$IDday <- as.factor(paste(dataDay$ID,dataDay$day,sep="_"))
dupDay <- nrow(dataDay[duplicated(dataDay$IDday),]) }
# plotting the temporal distribution of ActivityDate
if(printInfo==TRUE){
grid.arrange(ggplot(data[,which(duplicated(colnames(data))==FALSE)],aes(day)) + geom_histogram(bins=30) +
ggtitle(paste(day,"distribution ( example format:",data$day[1]," )")),
ggplot(info,aes(nDays)) + geom_histogram(bins=30) + ggtitle("No. of non-missing days per participant"),nrow=2)
# printing information
if(!is.na(hour)){ cat(nrow(data),"observations in",nrow(dataDay),"days from",nlevels(data$ID),"participants:")
} else { cat(nrow(data),"days from",nlevels(data$ID),"participants:") }
cat("\n\n- mean No. of days/participant =",
round(mean(info$nDays),2)," SD =",round(sd(info$nDays),2)," min =",min(info$nDays)," max =",max(info$nDays),
"\n- mean data collection duration (days) =",
round(mean(info$Tint),2),"- SD =",round(sd(info$Tint),2)," min =",min(info$Tint)," max =",max(info$Tint),
"\n\n- mean No. of missing days per participant =",round(mean(info$nMissingDays),2),
" SD =",round(sd(info$nMissingDays),2)," min =",min(info$nMissingDays)," max =",max(info$nMissingDays),
"\n- mean No. of consecutive missing days per participant =",round(mean(info$maxdayDiff),2),
" SD =",round(sd(info$maxdayDiff),2)," min =",min(info$maxdayDiff)," max =",max(info$maxdayDiff))
if(!is.na(hour)){ cat("\n\n- No. of duplicated cases by hour (same ID and hour) =",dupHour)
}else{ cat("\n\n- No. of duplicated cases by day (same ID and day) =",dupDay) }}
# resetting column names
if(!is.na(day)) { colnames(data)[which(colnames(data)=="day")] <- day }
colnames(data)[which(colnames(data)=="ID")] <- ID
if(!is.na(hour)){ colnames(data)[which(colnames(data)=="hour")] <- hour }
# data output
if(returnInfo==TRUE){ return(info) }else{ return(data) }
}
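The missing-day counts computed inside timeCheck can be illustrated on a small vector of dates. The sketch below is a vectorized equivalent of the loop for a single participant, not the function's actual code; as in timeCheck, the largest gap is expressed as the difference in days between two consecutive observations. The dates are made up.

```r
# made-up observation dates for one participant (sorted)
days <- as.Date(c("2019-01-07", "2019-01-08", "2019-01-11", "2019-01-12"))

dayDiff <- as.integer(diff(days))           # differences between consecutive days
nMissingDays <- sum(pmax(dayDiff - 1L, 0L)) # days skipped between observations
maxdayDiff <- if (any(dayDiff > 1)) max(dayDiff) else 0L # largest gap (0 if none)
Tint <- as.integer(difftime(max(days), min(days), units = "days")) + 1L # covered interval

cat("nDays =", length(days), " Tint =", Tint,
    " nMissingDays =", nMissingDays, " maxdayDiff =", maxdayDiff, "\n")
```

In this toy case, the number of observed days plus the number of missing days equals the length of the covered interval, which is a convenient consistency check on the compliance table.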
dailyAct data are stored in a dataset with one row per day. Thus, in this dataset we only need to recode the ActivityDate variable.
# recoding day and hour, and checking time and missing data points
dailyAct <- timeCheck(data=dailyAct,ID="ID",day="ActivityDate",hour=NA,
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d")
## Warning: package 'ggplot2' was built under R version 4.0.5
## 6019 days from 93 participants:
##
## - mean No. of days/participant = 64.72 SD = 5.25 min = 49 max = 100
## - mean data collection duration (days) = 64.72 - SD = 5.25 min = 49 max = 100
##
## - mean No. of missing days per participant = 0 SD = 0 min = 0 max = 0
## - mean No. of consecutive missing days per participant = 0 SD = 0 min = 0 max = 0
##
## - No. of duplicated cases by day (same ID and day) = 0
Comments:
the ActivityDate variable has been successfully recoded with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 63 non-missing days of data, with only one participant (s039) having less than 50 days (i.e., 49), and only two participants having more than 75 days (i.e., s001: 88 days; s041: 100 days)
no missing days are observed
no duplicates are included in the dataset
Finally, we plot epochs order against Time for each participant in order to better inspect the pattern of missing data.
par(mfrow=c(3,3))
for(ID in levels(dailyAct$ID)){
plot((1:nrow(dailyAct[dailyAct$ID==ID,])~dailyAct[dailyAct$ID==ID,"ActivityDate"]),main=ID,xlab="",ylab="") }
Comments:
dailyAct data
Here, we save the processed dailyAct dataset to be used in the following steps.
save(dailyAct,file="DATA/datasets/dailyAct_timeProcessed.RData")
hourlySteps data are stored in a dataset with one row per hour. Thus, in this dataset we need to recode the ActivityHour variable, based on which the ActivityDate variable is computed.
# recoding day and hour, and checking time and missing data points
hourlySteps <- timeCheck(data=hourlySteps,ID="ID",day="ActivityDate",hour="ActivityHour",
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")
## 143961 observations in 6006 days from 93 participants:
##
## - mean No. of days/participant = 64.58 SD = 5.21 min = 49 max = 99
## - mean data collection duration (days) = 64.58 - SD = 5.21 min = 49 max = 99
##
## - mean No. of missing days per participant = 0 SD = 0 min = 0 max = 0
## - mean No. of consecutive missing days per participant = 0 SD = 0 min = 0 max = 0
##
## - No. of duplicated cases by hour (same ID and hour) = 0
Comments:
the ActivityDate variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 63 nonmissing days of data, with only one participant (s039) having less than 50 days (i.e., 49), and only two participants having more than 75 days (i.e., s001: 88 days; s041: 100 days)
no missing days are observed
Finally, we plot epochs order against Time for each participant in order to better inspect the pattern of missing data.
par(mfrow=c(3,3))
for(ID in levels(hourlySteps$ID)){
plot((1:nrow(hourlySteps[hourlySteps$ID==ID,])~hourlySteps[hourlySteps$ID==ID,"ActivityHour"]),main=ID,xlab="",ylab="",
cex=0.5) }
Comments:
hourlySteps data
Here, we save the processed hourlySteps dataset to be used in the following steps.
save(hourlySteps,file="DATA/datasets/hourlySteps_timeProcessed.RData")
sleepLog data exported from Fitabase are stored in a dataset with one row per sleep period, and multiple sleep periods can be found in the same participant × day combination. Thus, in this dataset we need to recode the StartTime variable (i.e., the “date and time the sleep record started”), based on which the ActivityDate variable is computed.
Here, before applying the timeCheck function, we need to standardize the format of this variable (i.e., sometimes encoded with the “PM”/“AM” specification, and sometimes not).
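As a toy illustration of this problem, the sketch below parses a vector mixing the two main encodings by trying the AM/PM format first and falling back to the 24-hour format for the entries that fail. The input strings are made up, parse_mixed is a hypothetical helper that is not part of this pipeline, and parsing %p assumes a locale that uses AM/PM indicators (e.g., English or C).

```r
# hypothetical helper: try the 12-hour AM/PM format first, then the 24-hour format
parse_mixed <- function(x) {
  out <- as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = "GMT")
  failed <- is.na(out) # entries that did not match the AM/PM format
  out[failed] <- as.POSIXct(x[failed], format = "%m/%d/%Y %H:%M", tz = "GMT")
  out
}

# made-up examples of the two encodings
x <- c("1/7/2019 10:06:30 PM", "1/25/2019 21:53")
parsed <- parse_mixed(x)
format(parsed, "%Y-%m-%d %H:%M:%S")
```

Entries that match neither format remain NA, so a final count of missing values is a simple way to spot further encodings.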
# standardizing StartTime format and converting as POSIXct
sleepLog$lastTimeChar <- substr(sleepLog$StartTime,nchar(sleepLog$StartTime),nchar(sleepLog$StartTime))
sleepLog[sleepLog$lastTimeChar=="M","StartTime2"] <-
as.POSIXct(sleepLog[sleepLog$lastTimeChar=="M","StartTime"],format="%m/%d/%Y %I:%M:%S %p",tz="GMT") # AM/PM specification
sleepLog[sleepLog$lastTimeChar!="M","StartTime2"] <-
as.POSIXct(sleepLog[sleepLog$lastTimeChar!="M","StartTime"],format="%m/%d/%Y %H:%M",tz="GMT") # no AM/PM specification
sleepLog[is.na(sleepLog$StartTime2),"StartTime2"] <-
as.POSIXct(sleepLog[is.na(sleepLog$StartTime2),"StartTime"],format="%d/%m/%Y %H.%M",tz="GMT") # no AM/PM + '.' instead of ':'
sleepLog$StartTime <- sleepLog$StartTime2 # updating StartTime
# recoding day and hour, and checking time and missing data points
p <- timeCheck(data=sleepLog,ID="ID",day="ActivityDate",hour="StartTime",
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")
## 5403 observations in 4387 days from 93 participants:
##
## - mean No. of days/participant = 47.17 SD = 12.57 min = 6 max = 86
## - mean data collection duration (days) = 76.82 - SD = 38.93 min = 7 max = 331
##
## - mean No. of missing days per participant = 29.65 SD = 40.59 min = 0 max = 275
## - mean No. of consecutive missing days per participant = 16.95 SD = 37.07 min = 0 max = 268
##
## - No. of duplicated cases by hour (same ID and hour) = 1
Comments:
the ActivityDate variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 40-60 nonmissing days of data, with only one participant showing less than 10 days (s089: 6 days), and three further participants showing less than 20 days (s038, s041, and s105)
two cases have the same participant’s identifier and temporal coordinate (see section 2.3.1)
a substantial No. of missing days is observed, with 28% of the sample showing 10+ consecutive missing days, and 38% showing 20+ consecutive missing days. Here, we can see that at least some of the long missing data periods are due to one or two isolated data points (nights) that were recorded some months later than the other data points. These cases are discussed further in the data cleaning section.
(sleepLog_compliance <-
timeCheck(data=sleepLog,ID="ID",day="ActivityDate",hour="StartTime",returnInfo=TRUE,printInfo=FALSE,
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M"))
# plotting missing days
par(mfrow=c(1,2))
hist(sleepLog_compliance$nMissingDays,main="No. of missing days",breaks=30)
hist(sleepLog_compliance$maxdayDiff,main="Max No. of consecutive missing days",breaks=30)
# showing examples of participants with > 50 consecutive missing days
for(ID in sleepLog_compliance[sleepLog_compliance$maxdayDiff>50,"ID"]){
print(tail(sleepLog[sleepLog$ID==ID,c("ID","LogId","StartTime")],3)) }
## ID LogId StartTime
## 257 s006 21983522520 2019-04-15 23:17:00
## 258 s006 21996799829 2019-04-16 23:17:00
## 259 s006 25381321835 2020-01-09 23:28:30
## ID LogId StartTime
## 1191 s025 22737210199 2019-06-16 02:02:30
## 1192 s025 22750117624 2019-06-17 00:54:00
## 1193 s025 24233083393 2019-10-10 22:52:30
## ID LogId StartTime
## 1511 s029 22924261808 2019-07-01 01:12:00
## 1512 s029 23768589455 2019-09-06 22:18:30
## 1513 s029 24046024418 2019-09-27 22:57:00
## ID LogId StartTime
## 1582 s030 23099294456 2019-07-15 01:09:00
## 1583 s030 23110901994 2019-07-16 02:02:00
## 1584 s030 23834537413 2019-09-10 22:45:00
## ID LogId StartTime
## 1913 s038 23951489399 2019-09-19 22:14:30
## 1914 s038 24914189245 2019-12-05 22:03:00
## 1915 s038 24997629259 2019-12-12 22:47:30
## ID LogId StartTime
## 1981 s040 23056415228 2019-07-08 08:59:30
## 1982 s040 23056415229 2019-07-09 00:50:30
## 1983 s040 25340851080 2020-01-07 22:28:00
## ID LogId StartTime
## 2470 s053 24978955724 2019-12-09 00:27:00
## 2471 s053 24987038952 2019-12-12 00:29:00
## 2472 s053 25885992780 2020-02-12 23:20:30
## ID LogId StartTime
## 2711 s058 25000689911 2019-12-11 00:06:30
## 2712 s058 25000689912 2019-12-12 01:04:00
## 2713 s058 25886123469 2020-02-12 23:40:00
Here, we recode further important variables, as defined by Fitabase (accessed on March 10th, 2021):
LogId: identifying separate sleep periods for each participant. Here, we remove one case (i.e., the same highlighted above with the timeCheck function) showing the same LogId but a less realistic sleep duration (i.e., 1.3h) than another case (i.e., sleep duration = 11.3h).
# LogId as factor
sleepLog <- p # dataset processed with timeCheck
sleepLog$LogId <- as.factor(sleepLog$LogId)
# sanity check
nrow(sleepLog) - nlevels(sleepLog$LogId) # only one LogId is included twice
## [1] 1
sleepLog[sleepLog$LogId==names(summary(sleepLog$LogId)[which(summary(sleepLog$LogId)==2)]),] # showing LogId
# removing duplicated LogId with SleepDataType = "stages"
sleepLog <- sleepLog[!(sleepLog$LogId==names(summary(sleepLog$LogId)[which(summary(sleepLog$LogId)==2)]) &
sleepLog$SleepDataType=="stages"),]
cat("sanity check:",nlevels(as.factor(as.character(sleepLog$LogId)))==nrow(sleepLog)) # now each row has its own identifier
## sanity check: TRUE
SleepDataType: indicates the sleep algorithm used for the sleep record, i.e., “stages” (84%) or “classic” (16%). Note that information on sleep stage duration is reported only in the former cases.
# SleepDataType as factor
sleepLog$SleepDataType <- as.factor(sleepLog$SleepDataType)
summary(sleepLog$SleepDataType) # "classic" in 16% of cases
## classic stages
## 857 4545
IsMainSleep
: indicates whether the sleep record is the main sleep record for that day (TRUE; i.e., nocturnal sleep period) or not (FALSE; i.e., considered as nocturnal or diurnal nap).
# standardizing IsMainSleep encoding
sleepLog$IsMainSleep <- as.logical(toupper(gsub("FALSO","FALSE",gsub("VERO","TRUE",sleepLog$IsMainSleep))))
summary(sleepLog$IsMainSleep) # only 9% with IsMainSleep = FALSE
## Mode FALSE TRUE
## logical 500 4902
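The recoding above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the vector `raw` is hypothetical, and the sketch converts to upper case before mapping the localized labels, so it is slightly more permissive than the nested `gsub()` calls used above.

```r
# toy IsMainSleep values as they might be exported (hypothetical)
raw <- c("VERO", "FALSO", "True", "false", "TRUE")

# lookup table from localized labels to R logical labels
recode <- c(VERO = "TRUE", FALSO = "FALSE")

# upper-casing first makes the lookup case-insensitive
up <- toupper(raw)
localized <- up %in% names(recode)
up[localized] <- unname(recode[up[localized]])

# final logical vector
is_main <- as.logical(up)
```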
Then, to better inspect timing and duration of the recorded sleep periods, we recode the StartTime
variable to create EndTime
(i.e., StartTime + TimeInBed
), as well as StartHour
and EndHour
, indicating only the time (and not the date). This is done with the StartTime_rec
function.
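Before the full function, the core of this recoding can be sketched in a few lines of base R. The toy values, the helper `to_hour`, and the fixed reference date are assumptions made for illustration only; the function below uses the current date as reference and applies the one-day shift under narrower time-of-day conditions.

```r
# toy sleep records (hypothetical values)
start_time <- as.POSIXct(c("2019-09-19 22:14:30", "2019-12-06 00:50:30"), tz = "GMT")
time_in_bed <- c(480, 420.5) # minutes

# EndTime = StartTime + TimeInBed (POSIXct arithmetic is in seconds)
end_time <- start_time + time_in_bed * 60

# time of day only, anchored to an arbitrary reference date
ref_date <- "2000-01-01"
to_hour <- function(x) as.POSIXct(paste(ref_date, format(x, "%H:%M:%S")), tz = "GMT")
start_hour <- to_hour(start_time)
end_hour <- to_hour(end_time)

# periods crossing midnight end "before" they start on the clock:
# shift their EndHour by one day so that EndHour > StartHour
crosses <- end_hour < start_hour
end_hour[crosses] <- end_hour[crosses] + 24 * 60 * 60
```

Anchoring all clock times to a single reference date is what allows StartHour and EndHour to be compared and plotted on a common 24-hour axis.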
#' @title Recoding and plotting StartTime and EndTime
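#' @param data = data.frame including at least the start column.
#' @param start = character string indicating the name of the variable including the starting time
#' @param end = character string indicating the name of the variable including (or to be created with) the ending time (optional, default: NA)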
#' @param duration = character string indicating the name of the variable including the recording duration (optional, default: NA)
#' @param duration.unit = character string indicating the measurement unit of the duration variable: either "mins" (default) or "secs"
#' @param doPlot = logical indicating whether StartTime and EndTime should be plotted (default: TRUE)
#' @param returnData = logical indicating whether the recoded dataset should be returned (default: TRUE)
StartTime_rec <- function(data,start="StartTime",end=NA,duration=NA,duration.unit="mins",doPlot=TRUE,returnData=TRUE){
require(lubridate) # required package
# setting columns names
colnames(data)[colnames(data)==start] <- "start"
if(!is.na(end)){ colnames(data)[colnames(data)==end] <- "end" }
if(!is.na(duration)) {
colnames(data)[colnames(data)==duration] <- "duration"
if(duration.unit=="mins"){ TIB = data$duration*60 }else if(duration.unit=="secs"){ TIB = data$duration
}else{ stop("duration.unit can only be 'mins' or 'secs'") }}
# creating EndTime (if both end and duration are specified)
if(!is.na(end) & !is.na(duration)){ data$end <- data$start + TIB }
# creating and plotting StartHour and EndHour
data$StartHour <- as.POSIXct(paste(hour(data$start),minute(data$start)),format="%H %M",tz="GMT")
if(!is.na(end)){ data$EndHour <- as.POSIXct(paste(hour(data$end),minute(data$end)),format="%H %M",tz="GMT") }
if(doPlot==TRUE){ require(ggplot2); require(gridExtra) # required packages for data visualization
p <- ggplot(data,aes(StartHour))+geom_histogram(bins=30)+ggtitle(start)+xlab("")+scale_x_datetime(date_labels="%H:%M")
if(!is.na(end)){
grid.arrange(p,ggplot(data,aes(EndHour))+geom_histogram(bins=100)+ggtitle(end)+xlab("") +
scale_x_datetime(date_labels="%H:%M"))
}else{ print(p) }}
# updating EndHour so that it indicates the following days if after 00:00
if(!is.na(end)){
h18 <- as.POSIXct(paste(substr(Sys.time(),1,10),"18:00:00"),tz="GMT")
h23 <- as.POSIXct(paste(substr(Sys.time(),1,10),"23:59:59"),tz="GMT")
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
data[data$StartHour > h18 & data$StartHour < h23 & data$EndHour >= h00 & data$EndHour < h18,"EndHour"] <-
data[data$StartHour > h18 & data$StartHour < h23 & data$EndHour >= h00 & data$EndHour < h18,"EndHour"] + 1*24*60*60 }
# resetting columns names
colnames(data)[colnames(data)=="start"] <- start
if(!is.na(end)){ colnames(data)[colnames(data)=="end"] <- end }
if(!is.na(duration)) { colnames(data)[colnames(data)=="duration"] <- duration }
# returning recoded data
if(returnData==TRUE){ return(data) }}
# recoding and plotting StartTime and EndTime
sleepLog <- StartTime_rec(data=sleepLog,start="StartTime",end="EndTime",duration="TimeInBed",duration.unit="mins",
doPlot=TRUE,returnData=TRUE)
Comments:
most StartTime
are between 10:00 PM and 2:00 AM
only a minority of cases has a StartTime in diurnal hours. Most of these cases are diurnal naps, as suggested by the following graph, showing the distribution of StartTime values in cases with IsMainSleep
= TRUE (above) and IsMainSleep
= FALSE (below)
# showing StartTime by IsMainSleep
StartTime_rec(data=sleepLog[sleepLog$IsMainSleep==TRUE,],returnData=FALSE) # IsMainSleep = TRUE
StartTime_rec(data=sleepLog[sleepLog$IsMainSleep==FALSE,],returnData=FALSE) # IsMainSleep = FALSE
Comments:
as expected, StartTime
values in diurnal hours are more likely to be observed when IsMainSleep
= FALSE
in some cases with IsMainSleep
= FALSE, the StartTime
is between 10:00 PM and 3:00 AM, probably due to a bunch of sleep epochs incorrectly encoded as “naps”
there is still a (very low) number of cases with IsMainSleep
= TRUE and StartTime
in diurnal hours
Here, we want to temporally recode the sleepLog data based on our definition of nocturnal sleep periods, that is, a period of inactivity (as detected by the FC3 device) characterized by the following conditions:
- Starting between 6 PM and 6 AM
- At least 180 min (3 hours) of Total Sleep Time
- Possibly being interrupted by an indefinite number of wake periods of indefinite duration, but with the last sleep period starting before 11 AM
- Consecutive sleep periods between 6 PM and 11 PM, and between 6 AM and 11 AM, are combined only when separated by less than 1.5 hours
Here, we use the sleepPeriodRecode
function to filter and recode the data based on Condition #1 (i.e., by excluding sleep periods with StartHour before 6 PM or after 6 AM), whereas Condition #2 will be applied in the data cleaning section below.
We also apply Condition #3 by combining consecutive sleep periods with StartTime up to 11 AM. Indeed, sometimes Fitbit encodes short bouts of morning sleep (or early evening sleep) as separate sleep periods. Usually, but not necessarily always, these short bouts are encoded with IsMainSleep
= FALSE
and SleepDataType
= "classic"
. For combined sleep periods, TIB is recomputed as the No. of minutes between sleep1’s StartTime
and sleep2’s EndTime
, whereas TST
is recomputed as the sum of sleep1 and sleep2’s TST
(the time between sleep1 and sleep2 is considered as wake).
Note that Condition #3 is applied conditional on Condition #4: short sleep periods ending before 11 PM or starting after 6 AM that are preceded or followed by 1.5 or more hours of wake are not combined with the preceding or following sleep period, but are considered naps rather than nocturnal sleep periods.
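The combination rule can be illustrated with a deliberately simplified sketch. It merges consecutive periods of the same subject whenever the wake gap is shorter than a threshold, recomputing TIB and TST as described above. The function `combine_periods`, the data frame `toy`, and its values are hypothetical; the sketch ignores the time-of-day constraints (Conditions #1 and #3) and the exclusion window handled by the full function below.

```r
# simplified merge of consecutive sleep periods (illustration only)
combine_periods <- function(d, max_gap_hours = 1.5) {
  d <- d[order(d$ID, d$StartTime), ] # sorting by subject and time
  out <- d[1, ]
  for (i in seq_len(nrow(d))[-1]) {
    last <- nrow(out)
    # wake time between the end of the last kept period and the current start
    gap <- as.numeric(difftime(d$StartTime[i], out$EndTime[last], units = "hours"))
    if (d$ID[i] == out$ID[last] && gap < max_gap_hours) {
      out$EndTime[last] <- d$EndTime[i] # EndTime of the last period
      out$MinutesAsleep[last] <- out$MinutesAsleep[last] + d$MinutesAsleep[i] # summing TST
      out$TimeInBed[last] <- as.numeric(difftime(out$EndTime[last], out$StartTime[last],
                                                 units = "mins")) # TIB = EndTime - StartTime
    } else {
      out <- rbind(out, d[i, ]) # not consecutive: kept as a separate period
    }
  }
  rownames(out) <- NULL
  out
}

# toy data: s001 has a 30-min wake gap on the first night (hypothetical values)
toy <- data.frame(
  ID = c("s001", "s001", "s001", "s002"),
  StartTime = as.POSIXct(c("2019-05-01 23:00:00", "2019-05-02 06:30:00",
                           "2019-05-02 23:30:00", "2019-05-01 22:00:00"), tz = "GMT"),
  EndTime = as.POSIXct(c("2019-05-02 06:00:00", "2019-05-02 07:30:00",
                         "2019-05-03 07:00:00", "2019-05-02 06:00:00"), tz = "GMT"),
  MinutesAsleep = c(380, 50, 400, 430),
  stringsAsFactors = FALSE)
toy$TimeInBed <- as.numeric(difftime(toy$EndTime, toy$StartTime, units = "mins"))

res <- combine_periods(toy)
```

In this toy example the first two rows of s001 are merged into a single period (TIB = 510 min, TST = 430 min), whereas the following night and the other subject are left untouched.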
#' @title Recoding Fitabase-derived sleep periods
#' @param data = data.frame of Fitabase-derived sleepLog data (one row per night).
#' @param sleep_limits = character vector indicating the minimum and maximum starting time ("hh:mm") of sleep periods.
#' @param combine.sleep = logical value (default: TRUE) indicating whether consecutive sleep periods should be combined in a single sleep period. If FALSE, the function deletes all cases with IsMainSleep = FALSE.
#' @param lastSleep_startTime = character string indicating the maximum StartHour ("hh:mm") of the last combined sleep period
#' @param max_wakeNumber = integer indicating the maximum number of wake periods separating consecutive sleep periods (used only when combine = TRUE).
#' @param max_wakeDuration = numeric indicating the maximum duration (in hours) of wake periods separating consecutive sleep periods (used only when combine = TRUE). By default (NA) it is computed as the hour difference between the first element of sleep_limits and the lastSleep_startTime value
#' @param max_wakeDuration_exclude = character vector indicating the minimum and maximum times ("hh:mm") between which the max_wakeDuration parameter is NOT applied.
#' @param notCombined_LogId = character vector indicating the LogId of sleep periods that should not be combined with preceding/subsequent sleep periods
#' @param doPlot = logical value (default: FALSE) indicating whether combined sleep periods should be plotted
sleepPeriodRecode <- function(data,
sleep_limits = c("18:00","06:00"),
combine.sleep=TRUE,
lastSleep_startTime = "11:00",
max_wakeNumber=Inf,
max_wakeDuration=1.5,
max_wakeDuration_exclude = c("23:00","06:00"),
notCombined_LogId = NA,
doPlot=FALSE){
N.original = nrow(data)
cat("\n\n---\nRECODING SLEEP PERIODS (combine = ",combine.sleep,")\nOriginal number of cases = ",N.original,sep="")
data <- data[order(data$ID,data$StartTime),] # sorting by ID and time
if(combine.sleep==FALSE){
# .................................................
# Not combining sleep periods
# .................................................
data <- data[data$IsMainSleep!=FALSE,]
N.IsMainSleep=nrow(data)
cat("\n\n - Removing",N.original-N.IsMainSleep,"cases with IsMainSleep = FALSE")
data <- data[!(data$StartHour < as.POSIXct(paste(substr(Sys.time(),1,10),
paste(sleep_limits[1],"00",sep=":")),tz="GMT") &
data$StartHour > as.POSIXct(paste(substr(Sys.time(),1,10),
paste(sleep_limits[2],"00",sep=":")),tz="GMT")),]
N.sleep_limits <- nrow(data)
cat("\n\n - Removing",N.IsMainSleep-N.sleep_limits,"cases with StartHour outside the",sleep_limits[1],
"-",sleep_limits[2],"interval")
} else {
# .................................................
# Combining consecutive sleep periods
# .................................................
cat("\n\nCombining consecutive sleep periods...")
# filtering data based on sleep_limits[1] and lastSleep_startTime
data <- data[!(data$StartHour < as.POSIXct(paste(substr(Sys.time(),1,10),
paste(sleep_limits[1],"00",sep=":")),tz="GMT") &
data$StartHour > as.POSIXct(paste(substr(Sys.time(),1,10),
paste(lastSleep_startTime,"00",sep=":")),tz="GMT")),]
N.sleep_limits <- nrow(data)
cat("\n\n - Removing",N.original-N.sleep_limits,"cases with StartHour outside the",sleep_limits[1],
"-",lastSleep_startTime,"interval")
# defining constraints on nocturnal wake duration (default: sleep_limits[1] - lastSleep_startTime)
if(is.na(max_wakeDuration)){
max_wakeDuration <- difftime(as.POSIXct(paste(substr(Sys.time(),1,10),
paste(sleep_limits[1],"00",sep=":")),tz="GMT"),
as.POSIXct(paste(substr(Sys.time(),1,10),
paste(lastSleep_startTime,"00",sep=":")),tz="GMT")) }
# defining first row of new.data (now just taking the first row, or the second when the first is a nap) --> to fix later (!)
data$combined <- FALSE
combinedLogId <- NA
if(!(data[1,"StartHour"] < as.POSIXct(paste(substr(Sys.time(),1,10),paste(sleep_limits[1],"00",sep=":")),tz="GMT") &
data[1,"StartHour"] > as.POSIXct(paste(substr(Sys.time(),1,10), paste(sleep_limits[2],"00",sep=":")),tz="GMT"))){
new.data <- cbind(data[1,],nCombined=0,combinedLogId=combinedLogId,combType=as.character(data[1,"IsMainSleep"]),
combSeq=paste(round(data[1,"MinutesAsleep"]/60,2),"S",sep=""))
} else { new.data <- cbind(data[2,],nCombined=0,combinedLogId=combinedLogId,combType=as.character(data[2,"IsMainSleep"]),
combSeq=paste(round(data[2,"MinutesAsleep"]/60,2),"S",sep="")) }
# iteratively adding sleep periods to new.data OR combining them when meeting the criteria identifying them as consecutive
for(i in 2:nrow(data)){
# updating cases to be compared
sleep1 <- new.data[nrow(new.data),]
sleep2 <- data[i,]
# identification of consecutive sleep periods: (1) within the same subject
if(sleep2$ID == sleep1$ID &
# (2) AND sleep1 StartHour between 18:00 and 00:00 AND sleep2 StartHour before 11:00 of the following day
(
(sleep1$StartHour <= as.POSIXct(paste(substr(Sys.time(),1,10),"23:59:59"),tz="GMT") &
sleep1$StartHour >= as.POSIXct(paste(substr(Sys.time(),1,10),"18:00:00"),tz="GMT") &
sleep2$StartTime <= as.POSIXct(paste(as.character(sleep1$ActivityDate + 1),"11:00:00"),tz="GMT"))
|
# OR sleep1 StartHour between 00:00 and 11:00 AND sleep2 StartHour before 11:00 of the same day
(sleep1$StartHour >= as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT") &
sleep1$StartHour <= as.POSIXct(paste(substr(Sys.time(),1,10),"11:00:00"),tz="GMT") &
sleep2$StartTime <= as.POSIXct(paste(as.character(sleep1$ActivityDate),"11:00:00"),tz="GMT"))
) &
# (3) AND the No. of combined sleep periods in this night is lower than max_wakeNumber (OLD TO BE CHECKED (!))
sleep1$nCombined - 1 < max_wakeNumber & # (2) within the defined max No. of nocturnal wake periods
# (4) AND unspecified max_wakeDuration_exclude AND sleep2 StartTime - sleep1 EndTime being < than max_wakeDuration
(
(is.na(max_wakeDuration_exclude[1]) & difftime(sleep2$StartTime,sleep1$EndTime,units="hours") < max_wakeDuration)
|
# OR specified max_wakeDuration_exclude AND wake between sleep1 and sleep2 is < than max_wakeDuration ...
(!is.na(max_wakeDuration_exclude[1]) &
!(
difftime(sleep2$StartTime,sleep1$EndTime,units="hours") > max_wakeDuration &
(
# ... to be applied ONLY when sleep1 ends before 23:00
sleep1$EndHour < as.POSIXct(paste(substr(Sys.time(),1,10),"23:00:00"),tz="GMT")
|
# OR when sleep2 starts after 6:00
sleep2$StartHour > as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT") ) )) )){
# excluding those cases reported in the notCombined_LogId argument (taking only the longest one)
if(!is.na(notCombined_LogId)[1] & (sleep1$LogId%in%notCombined_LogId | sleep2$LogId%in%notCombined_LogId)){
sleep1_TIB <- difftime(sleep1$EndTime,sleep1$StartTime)
sleep2_TIB <- difftime(sleep2$EndTime,sleep2$StartTime)
if(sleep2_TIB>sleep1_TIB){ # replacing sleep1 with sleep2 when sleep2 is longer (otherwise ignoring sleep2)
combinedLogId <- NA
new.data[nrow(new.data),] <- cbind(data[i,],
nCombined=0,combinedLogId=combinedLogId,
combType=as.character(data[i,"IsMainSleep"]),
combSeq=paste(round(data[i,"MinutesAsleep"]/60,2),"S",sep="")) }
} else { # updating the information of the first sleep period by integrating the consecutive one
new.data[nrow(new.data),c("EndTime","EndHour")] <- data[i,c("EndTime","EndHour")] # updating EndTime = last period
new.data[nrow(new.data),"MinutesAsleep"] <- new.data[nrow(new.data),"MinutesAsleep"] + data[i,"MinutesAsleep"] # summing TST
new.data[nrow(new.data),"TimeInBed"] <- difftime(as.POSIXct(as.character(new.data[nrow(new.data),"EndTime"]),
tz="GMT"), # TIB as EndTime - StartTime in GMT (like in EBE data)
as.POSIXct(as.character(new.data[nrow(new.data),"StartTime"]),
tz="GMT"),units="min")
new.data[nrow(new.data),"combined"] <- data[c(i,i-1),"combined"] <- TRUE # marking as combined
new.data[nrow(new.data),"nCombined"] <- new.data[nrow(new.data),"nCombined"] + 1 # updating nCombined
combinedLogId <- ifelse(is.na(combinedLogId),
as.character(data[i,"LogId"]),
paste(combinedLogId,as.character(data[i,"LogId"]),sep="_"))
new.data[nrow(new.data),"combinedLogId"] <- combinedLogId
new.data[nrow(new.data),"combType"] <- paste(new.data[nrow(new.data),"combType"],data[i,"IsMainSleep"],sep="-")
new.data[nrow(new.data),"combSeq"] <- paste(new.data[nrow(new.data),"combSeq"],
"-",round(difftime(data[i,"StartTime"],
new.data[nrow(new.data),"EndTime"],
units="hours"),2),"W",
"-",round(data[i,"MinutesAsleep"]/60,2),"S",sep="")
# when not identified as consecutive sleep periods
} } else { combinedLogId <- NA
new.data <- rbind(new.data,cbind(data[i,],
nCombined=0,combinedLogId=combinedLogId,combType=as.character(data[i,"IsMainSleep"]),
combSeq=paste(round(data[i,"MinutesAsleep"]/60,2),"S",sep=""))) }}
# recoding and printing information on combined sleep periods
new.data$combType <- as.factor(gsub("TRUE","Main",gsub("FALSE","Short",new.data$combType)))
new.data[,c("nCombined","combSeq")] <- lapply(new.data[,c("nCombined","combSeq")],as.factor)
cat("\n\n - ",nrow(new.data[new.data$combined==TRUE,]),"identified groups of consecutive sleep periods:")
N.combined <- nrow(new.data)
cat("\n Removing",N.sleep_limits-N.combined,"cases (integrated with previous sleep periods)")
# filtering non-combined sleep periods starting between sleep_limits[2] and lastSleep_startTime
new.data <- new.data[!(new.data$StartHour > as.POSIXct(paste(substr(Sys.time(),1,10),
paste(sleep_limits[2],"00",sep=":")),tz="GMT") &
new.data$StartHour <= as.POSIXct(paste(substr(Sys.time(),1,10),
paste(lastSleep_startTime,"00",sep=":")),tz="GMT")),]
cat("\n Removing further",N.combined-nrow(new.data),"cases of non-combined sleep starting between",
sleep_limits[2],"and",lastSleep_startTime,
"\n\n\nUpdated number of cases =",nrow(new.data))
if(doPlot==TRUE){ require(ggplot2); require(gridExtra)
p <- new.data[new.data$combined==TRUE,]
cat("\n\nPlotting",nrow(p),"cases of combined sleep periods...")
for(i in 1:nrow(p)){
p.data <- data[data$ID==p[i,"ID"] & data$combined==TRUE,] # selecting data within the same two days
p.data <- p.data[difftime(p.data$ActivityDate,p[i,"ActivityDate"],units="days")>=0 &
difftime(p.data$ActivityDate,p[i,"ActivityDate"],units="days")<=1,]
# removing first night if it is the same night plotted in the previous case
if(i > 1) { if(p.data[1,"EndTime"] == p[i-1,"EndTime"]){ p.data <- p.data[2:nrow(p.data),] }}
# removing nights recorded in the following day
if(p.data[nrow(p.data),"StartHour"] >= as.POSIXct(paste(substr(Sys.time(),1,10),
paste(sleep_limits[1],"00",sep=":")),tz="CET") &
p.data[nrow(p.data),"StartHour"] <= as.POSIXct(paste(substr(Sys.time(),1,10),"23:59:59"),tz="CET") &
p.data[nrow(p.data),"ActivityDate"] == p.data[1,"ActivityDate"] + 1){
p.data <- p.data[1:(nrow(p.data)-1),] }
# updating EndHour when StartHour is after EndHour
p.data[p.data$EndHour<p.data$StartHour,"EndHour"] <- p.data[p.data$EndHour<p.data$StartHour,"EndHour"] + 1*24*60*60
if(p[i,"EndHour"]<p[i,"StartHour"]){ p[i,"EndHour"] <- p[i,"EndHour"] + 1*24*60*60 }
# updating StartHour and EndHour when StartHour is after the previous EndHour
for(j in 2:nrow(p.data)){
if(p.data[j,"StartHour"] < p.data[j-1,"EndHour"]){
p.data[j,c("StartHour","EndHour")] <- p.data[j,c("StartHour","EndHour")] + 1*24*60*60 }}
# updating all times to align with the current day
if(p.data[1,"StartHour"] < as.POSIXct(paste(substr(Sys.time(),1,10),"18:00:00"),tz="CET")){
p.data[,c("StartHour","EndHour")] <- p.data[,c("StartHour","EndHour")] + 1*24*60*60
p[i,c("StartHour","EndHour")] <- p[i,c("StartHour","EndHour")] + 1*24*60*60
}
cat("\n\nCase",i,": Subject",as.character(p[i,"ID"]),"on",as.character(p[i,"ActivityDate"]))
p1 <- qplot(data=p.data,
ymin=StartHour,ymax=EndHour,x=ID,geom="linerange",size=I(1)) +
coord_flip() + theme_bw() + theme(panel.grid = element_blank()) + xlab("") + ylab("") +
scale_y_datetime(labels = function(x) format(x, format = "%H:%M"),
limits = c(as.POSIXct(paste(substr(Sys.time(),1,10),
paste(sleep_limits[1],"00",sep=":")),tz="CET"),
p.data[nrow(p.data),"EndHour"])) +
geom_text(aes(label=format(StartHour,format = "%H:%M"),y=StartHour),hjust = 0.5, nudge_x = 0.1) +
geom_text(aes(label=format(EndHour,format = "%H:%M"),y=EndHour),hjust = 0.5, nudge_x = -0.1) +
ggtitle(paste("Original TIB =", paste(round(p.data$TimeInBed/60,2), collapse = ', '),"hours - TST =",
paste(round(p.data$MinutesAsleep/60,1), collapse = ', '),"hours"))
p2 <- qplot(data=p[i,],ymin=StartHour,ymax=EndHour,x=ID,geom="linerange",size=I(1)) +
coord_flip() + theme_bw() + theme(panel.grid = element_blank()) + xlab("") + ylab("") +
scale_y_datetime(labels = function(x) format(x, format = "%H:%M"),
limits = c(as.POSIXct(paste(substr(Sys.time(),1,10),
paste(sleep_limits[1],"00",sep=":")),tz="CET"),
p[i,"EndHour"])) +
geom_text(aes(label=format(StartHour,format = "%H:%M"),y=StartHour),hjust = 0.5, nudge_x = 0.1) +
geom_text(aes(label=format(EndHour,format = "%H:%M"),y=EndHour),hjust = 0.5, nudge_x = 0.1) +
ggtitle(paste("Combined TIB =",round(p[i,"TimeInBed"]/60,1),
"hours - TST =",round(p[i,"MinutesAsleep"]/60,1),"hours"))
grid.arrange(p1,p2,nrow=2) }}
# correcting EndHour when EndHour < StartHour (i.e., adding one day)
new.data[difftime(new.data$EndHour,new.data$StartHour,units="min")<0,"EndHour"] <-
new.data[difftime(new.data$EndHour,new.data$StartHour,units="min")<0,"EndHour"] + 1*24*60*60
# updating data
data <- new.data }
return(data) }
Here, we use the sleepPeriodRecode
function to filter and combine sleep periods according to the criteria specified above.
Four cases of consecutive sleep periods are not combined based on visual inspection (see PLOTTING): s025 on 2019-05-20 (TIB = 11.7h, LogId = 22420732071), s026 on 2019-05-10 (TIB = 14h, LogId = 22287155636), s026 on 2019-05-17 (TIB = 12.7h, LogId = 22381565258), and s093 on 2020-11-04 (TIB = 14.9h, LogId = 29568099991).
sleepLog.new <- sleepPeriodRecode(data=sleepLog, # data to be processed
sleep_limits=c("18:00","06:00"), # minimum and maximum StartHour
combine.sleep=TRUE, # should consecutive sleep periods be combined?
max_wakeNumber=Inf, # max No. of wake periods separating consecutive sleep periods
max_wakeDuration=1.5, # max hours of wake periods between consecutive sleep periods
max_wakeDuration_exclude=c("23:00","06:00"), # time limits between which max_wakeDur is NOT applied
lastSleep_startTime="11:00", # max StartHour of the last combined sleep period
notCombined_LogId=c("22420732071","22287155636","22381565258","29568099991")) # to not combine
##
##
## ---
## RECODING SLEEP PERIODS (combine = TRUE)
## Original number of cases = 5402
##
## Combining consecutive sleep periods...
##
## - Removing 339 cases with StartHour outside the 18:00 - 11:00 interval
##
## - 62 identified groups of consecutive sleep periods:
## Removing 67 cases (integrated with previous sleep periods)
## Removing further 75 cases of non-combined sleep starting between 06:00 and 11:00
##
##
## Updated number of cases = 4921
Here, we can see the patterns of sequences of combined Main and Short sleep periods, and the associated duration (in hours) of sleep and wake periods for the combined cases.
In other words, the following table shows the number of non-combined Main (IsMainSleep=TRUE
) and Short sleep periods (IsMainSleep=FALSE
), and the number of cases recoded by combining a Main and the following Short sleep period (Main-Short), a Short and the following Main sleep period (Short-Main), or more specific cases (Short-Short-Main)
# sequences of Main and non-Main sleep periods
as.data.frame(summary(sleepLog.new$combType))
Here, we visualize all cases of automatically combined consecutive sleep periods. From this visual inspection, we decided to not combine four cases of automatically combined consecutive sleep periods: s025 on 2019-05-20 (TIB
= 11.7h, LogId
= 22420732071), s026 on 2019-05-10 (TIB
= 14h, LogId
= 22287155636), s026 on 2019-05-17 (TIB
= 12.7h, LogId
= 22381565258), and s093 on 2020-11-04 (TIB
= 14.9h, LogId
= 29568099991).
p <- sleepPeriodRecode(data = sleepLog, # data to be processed
sleep_limits = c("18:00","06:00"), # minimum and maximum StartHour
combine.sleep = TRUE, # should consecutive sleep periods be combined?
max_wakeNumber = Inf, # max No. of wake periods separating consecutive sleep periods
max_wakeDuration = 1.5, # max hours of wake periods between consecutive sleep periods
max_wakeDuration_exclude = c("23:00","06:00"), # time limits between which max_wakeDur is NOT applied
lastSleep_startTime = "11:00", # max StartHour of the last combined sleep period
doPlot=TRUE)
##
##
## ---
## RECODING SLEEP PERIODS (combine = TRUE)
## Original number of cases = 5402
##
## Combining consecutive sleep periods...
##
## - Removing 339 cases with StartHour outside the 18:00 - 11:00 interval
##
## - 66 identified groups of consecutive sleep periods:
## Removing 67 cases (integrated with previous sleep periods)
## Removing further 75 cases of non-combined sleep starting between 06:00 and 11:00
##
##
## Updated number of cases = 4921
##
## Plotting 66 cases of combined sleep periods...
##
## Case 1 : Subject s007 on 2019-05-18
##
##
## Case 2 : Subject s019 on 2019-04-30
##
##
## Case 3 : Subject s021 on 2019-05-05
##
##
## Case 4 : Subject s023 on 2019-04-23
##
##
## Case 5 : Subject s025 on 2019-05-09
##
##
## Case 6 : Subject s025 on 2019-05-20
##
##
## Case 7 : Subject s026 on 2019-05-10
##
##
## Case 8 : Subject s026 on 2019-05-17
##
##
## Case 9 : Subject s026 on 2019-06-23
##
##
## Case 10 : Subject s026 on 2019-08-14
##
##
## Case 11 : Subject s027 on 2019-07-03
##
##
## Case 12 : Subject s028 on 2019-06-11
##
##
## Case 13 : Subject s029 on 2019-05-04
##
##
## Case 14 : Subject s030 on 2019-06-30
##
##
## Case 15 : Subject s031 on 2019-06-15
##
##
## Case 16 : Subject s033 on 2019-07-28
##
##
## Case 17 : Subject s034 on 2019-07-07
##
##
## Case 18 : Subject s034 on 2019-07-29
##
##
## Case 19 : Subject s050 on 2019-10-11
##
##
## Case 20 : Subject s050 on 2019-11-08
##
##
## Case 21 : Subject s050 on 2019-11-09
##
##
## Case 22 : Subject s050 on 2019-11-14
##
##
## Case 23 : Subject s050 on 2019-11-17
##
##
## Case 24 : Subject s050 on 2019-11-18
##
##
## Case 25 : Subject s050 on 2019-11-20
##
##
## Case 26 : Subject s050 on 2019-11-21
##
##
## Case 27 : Subject s050 on 2019-11-22
##
##
## Case 28 : Subject s050 on 2019-11-23
##
##
## Case 29 : Subject s050 on 2019-11-24
##
##
## Case 30 : Subject s052 on 2019-10-12
##
##
## Case 31 : Subject s055 on 2019-10-19
##
##
## Case 32 : Subject s055 on 2019-11-12
##
##
## Case 33 : Subject s055 on 2019-11-19
##
##
## Case 34 : Subject s056 on 2019-11-28
##
##
## Case 35 : Subject s059 on 2019-11-22
##
##
## Case 36 : Subject s065 on 2020-01-27
##
##
## Case 37 : Subject s065 on 2020-02-07
##
##
## Case 38 : Subject s065 on 2020-02-10
##
##
## Case 39 : Subject s065 on 2020-02-25
##
##
## Case 40 : Subject s071 on 2020-02-08
##
##
## Case 41 : Subject s072 on 2019-12-22
##
##
## Case 42 : Subject s072 on 2020-01-04
##
##
## Case 43 : Subject s072 on 2020-01-28
##
##
## Case 44 : Subject s072 on 2020-02-05
##
##
## Case 45 : Subject s074 on 2020-02-10
##
##
## Case 46 : Subject s080 on 2020-03-21
##
##
## Case 47 : Subject s080 on 2020-04-09
##
##
## Case 48 : Subject s081 on 2020-03-08
##
##
## Case 49 : Subject s083 on 2020-04-02
##
##
## Case 50 : Subject s083 on 2020-04-07
##
##
## Case 51 : Subject s085 on 2020-09-21
##
##
## Case 52 : Subject s091 on 2020-09-10
##
##
## Case 53 : Subject s091 on 2020-09-26
##
##
## Case 54 : Subject s092 on 2020-10-14
##
##
## Case 55 : Subject s093 on 2020-11-04
##
##
## Case 56 : Subject s102 on 2020-11-09
##
##
## Case 57 : Subject s103 on 2020-10-27
##
##
## Case 58 : Subject s104 on 2020-11-07
##
##
## Case 59 : Subject s108 on 2020-12-17
##
##
## Case 60 : Subject s109 on 2020-11-13
##
##
## Case 61 : Subject s112 on 2020-12-20
##
##
## Case 62 : Subject s114 on 2021-03-24
##
##
## Case 63 : Subject s115 on 2021-03-12
##
##
## Case 64 : Subject s120 on 2021-03-27
##
##
## Case 65 : Subject s120 on 2021-04-01
##
##
## Case 66 : Subject s120 on 2021-04-03
Here, we visualize the differences in sleep timing and duration between the original data and those including combined cases.
Here, we visualize the distribution of StartTime
(red dots) and TIB
(black lines from StartTime
to EndTime
) in the original data by plotting TIB
intervals for each subject (note that more than one TIB
is plotted for each subject). We can see that in the original data several cases start after 6 AM and before 6 PM.
In the first plot, all sleep periods detected for a given participant are shown in the same line.
# plotting one line per subject
qplot(data=sleepLog,
ymin=StartHour,ymax=EndHour,x=ID,geom="linerange") +
geom_point(aes(y=StartHour),col="red") +
geom_hline(yintercept = as.POSIXct(paste(substr(Sys.time(),1,10),paste(c("06","18"),"00:00",sep=":")),tz="GMT"))+
coord_flip() + theme_bw() + theme(panel.grid = element_blank(),axis.text.y=element_text(size=8)) + xlab("") + ylab("") +
scale_y_datetime(position="right",labels = function(x) format(x, format = "%H:%M"),
breaks = c(as.POSIXct(paste(substr(Sys.time(),1,10),
paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT"),
as.POSIXct(paste(substr(Sys.time(),1,10),
paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT")+1*24*60*60)) +
ggtitle("Original TIBs")
In this second plot, each sleep period is visualized on a different line.
# plotting one line per sleep period
qplot(data=sleepLog,
ymin=StartHour,ymax=EndHour,x=as.factor(paste(ID,ActivityDate,sep=".")),geom="linerange") +
geom_point(aes(y=StartHour),col="red") +
geom_hline(yintercept = as.POSIXct(paste(substr(Sys.time(),1,10),paste(c("06","18"),"00:00",sep=":")),tz="GMT"))+
coord_flip() + theme_bw() + theme(panel.grid = element_blank(),axis.text.y=element_text(size=8)) + xlab("") + ylab("") +
scale_y_datetime(position="right",labels = function(x) format(x, format = "%H:%M"),
breaks = c(as.POSIXct(paste(substr(Sys.time(),1,10),
paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT"),
as.POSIXct(paste(substr(Sys.time(),1,10),
paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT")+1*24*60*60)) +
ggtitle("Original TIBs")
Here, we visualize the distribution of StartTime
(red dots) and TIB
(black lines from StartTime
to EndTime
) in the processed data (combined sleep periods are shown in blue). We can see that now no cases start between 6 AM and 6 PM.
In the first plot, all sleep periods detected for a given participant are shown in the same line.
qplot(data=sleepLog.new,
ymin=StartHour,ymax=EndHour,x=ID,geom="linerange",size=I(1),col=combined) + geom_point(aes(x = ID,y=StartHour),col="red") +
geom_hline(yintercept = as.POSIXct(paste(substr(Sys.time(),1,10),paste(c("06","18"),"00:00",sep=":")),tz="GMT"))+
coord_flip() + theme_bw() + theme(panel.grid = element_blank(),axis.text.y=element_text(size=8)) + xlab("") + ylab("") +
scale_color_manual(values=c("black","lightblue")) +
scale_y_datetime(position="right",labels = function(x) format(x, format = "%H:%M"),
breaks = c(as.POSIXct(paste(substr(Sys.time(),1,10),
paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT"),
as.POSIXct(paste(substr(Sys.time(),1,10),
paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT")+1*24*60*60)) +
ggtitle("Combined TIBs")
In this second plot, each sleep period is visualized on a different line.
# plotting one line per sleep period
qplot(data=sleepLog.new,
ymin=StartHour,ymax=EndHour,x=as.factor(paste(ID,ActivityDate,sep=".")),geom="linerange",size=I(1),col=combined) +
geom_point(aes(y=StartHour),col="red") +
geom_hline(yintercept = as.POSIXct(paste(substr(Sys.time(),1,10),paste(c("06","18"),"00:00",sep=":")),tz="GMT"))+
coord_flip() + theme_bw() + theme(panel.grid = element_blank(),axis.text.y=element_text(size=8)) + xlab("") + ylab("") +
scale_color_manual(values=c("black","lightblue")) +
scale_y_datetime(position="right",labels = function(x) format(x, format = "%H:%M"),
breaks = c(as.POSIXct(paste(substr(Sys.time(),1,10),
paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT"),
as.POSIXct(paste(substr(Sys.time(),1,10),
paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT")+1*24*60*60)) +
ggtitle("Combined TIBs")
Here, we visually compare the distribution of TIB
and TST
between the original (in yellow) and the recoded data (in red).
# TIB
ggplot(sleepLog,aes(TimeInBed/60)) + geom_histogram(fill=rgb(1,1,0,alpha=.5),col="black") +
geom_histogram(data=sleepLog.new,fill=rgb(1,0,0,alpha=.5),col="black") +
ggtitle("TIB (hours) in the original (yellow) and combined (red) sleep periods")
# TST
ggplot(sleepLog,aes(MinutesAsleep/60)) + geom_histogram(fill=rgb(1,1,0,alpha=.5),col="black") +
geom_histogram(data=sleepLog.new,fill=rgb(1,0,0,alpha=.5)) +
ggtitle("TST (hours) in the original (yellow) and combined (red) sleep periods")
Comments:
we can notice a decrease in the number of TIB
< 5h, which have been combined into longer sleep periods. We also note that our procedure produced some outliers with extremely long TIB
(> 11h)
similar results can be observed with TST
Then, we update the ActivityDate
variable so that it indicates the previous day when the StartTime
is between 00:00 and 06:00. This helps clarify the distinction between consecutive nocturnal sleep periods.
# setting times
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h06 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")
# updating ActivityDate
sleepLog.new[sleepLog.new$StartHour >= h00 & sleepLog.new$StartHour <= h06,"ActivityDate"] <-
sleepLog.new[sleepLog.new$StartHour >= h00 & sleepLog.new$StartHour <= h06,"ActivityDate"] - 1
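The date shift above can be sketched on toy data as follows. The values are hypothetical, and the sketch derives the time of day as decimal hours rather than from the StartHour variable, which is an assumption made to keep the example self-contained.

```r
# toy start times (hypothetical values)
start_time <- as.POSIXct(c("2019-12-11 23:40:00", "2019-12-12 01:04:00",
                           "2019-12-12 06:30:00"), tz = "GMT")
activity_date <- as.Date(start_time, tz = "GMT")

# time of day as decimal hours
hrs <- as.numeric(format(start_time, "%H")) + as.numeric(format(start_time, "%M")) / 60

# sleep periods starting between 00:00 and 06:00 are assigned to the previous day
after_midnight <- hrs <= 6
activity_date[after_midnight] <- activity_date[after_midnight] - 1
```

With this shift, a sleep period starting at 23:40 and one starting at 01:04 of the following calendar day are both indexed by the same evening date.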
We can use the updated ActivityDate
variable to check for double cases with the same ID
and ActivityDate
value.
# No. of duplicated IDday values before updating ActivityDate
nrow(sleepLog.new[duplicated(sleepLog.new$IDday),]) # 630
## [1] 630
# No. of duplicated IDday values after updating ActivityDate
sleepLog.new$IDday <- as.factor(paste(sleepLog.new$ID,sleepLog.new$ActivityDate,sep="_"))
nrow(sleepLog.new[duplicated(sleepLog.new$IDday),]) # 57
## [1] 57
# showing duplicates
dupl <- sleepLog.new[duplicated(sleepLog.new$IDday),"IDday"]
sleepLog.new[sleepLog.new$IDday%in%levels(as.factor(as.character(dupl))),
c("ID","LogId","ActivityDate","StartTime","EndTime","SleepDataType")]
Comments:
in 57 cases (1%), there are two observations with the same ID and ActivityDate value
the inspection of these cases suggests that all of them are short early-evening naps (mostly SleepDataType = “classic”) followed by longer nocturnal sleep periods (mostly SleepDataType = “stages”), and thus, only the latter are kept for the analyses.
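The duplicate check above can be illustrated with a small sketch on invented data. Here, the `toy` data frame and its LogId values are hypothetical, and `duplicated()` is applied in both directions so that both members of each duplicated pair are flagged.

```r
# toy data: participant s001 has two sleep logs on the same ActivityDate
toy <- data.frame(ID = c("s001", "s001", "s001", "s002"),
                  ActivityDate = as.Date(c("2020-01-10", "2020-01-10",
                                           "2020-01-11", "2020-01-10")),
                  LogId = c("a1", "a2", "a3", "b1"),
                  stringsAsFactors = FALSE)

# key joining participant identifier and day
toy$IDday <- paste(toy$ID, toy$ActivityDate, sep = "_")

# No. of duplicated keys (the first occurrence is not counted)
nDupl <- sum(duplicated(toy$IDday))

# flagging both members of each duplicated pair
isDupl <- duplicated(toy$IDday) | duplicated(toy$IDday, fromLast = TRUE)
toy[isDupl, ]
```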
dupl <- sleepLog.new[sleepLog.new$IDday%in%levels(as.factor(as.character(dupl))),]
# shortNaps (mostly "classic")
(shortNaps <- dupl[seq(1,nrow(dupl)-1,2),])
summary(as.numeric(difftime(shortNaps$EndTime,shortNaps$StartTime,units="hours")))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.050 1.450 1.767 1.931 2.267 4.167
# longSleeps (mostly "stages")
(longSleeps <- dupl[seq(2,nrow(dupl),2),])
summary(as.numeric(difftime(longSleeps$EndTime,longSleeps$StartTime,units="hours")))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.133 5.767 6.550 6.701 7.883 10.700
# plotting
par(mfrow=c(2,2))
hist(as.numeric(difftime(shortNaps$EndTime,shortNaps$StartTime,units="hours")),main="TIB shortNaps (hours)",xlab="")
hist(shortNaps$StartHour,main="StartHour shortNaps",breaks=30,col="gray",xlab="")
hist(as.numeric(difftime(longSleeps$EndTime,longSleeps$StartTime,units="hours")),main="TIB longSleeps (hours)",xlab="")
hist(longSleeps$StartHour,main="StartHour longSleeps",breaks=40,col="gray",xlab="")
Here, we remove the 57 cases of early-evening naps recorded before the subsequent nocturnal sleep periods. Thus, now there are no further cases with the same ID
and ActivityDate
value.
# removing 57 cases of early-evening naps
memory_sleepLog.new <- sleepLog.new
sleepLog.new <- sleepLog.new[!(sleepLog.new$LogId%in%levels(as.factor(as.character(shortNaps$LogId)))),]
# printing info
cat("excluded",nrow(memory_sleepLog.new)-nrow(sleepLog.new),"cases of early-evening naps preceding nocturnal sleep")
## excluded 57 cases of early-evening naps preceding nocturnal sleep
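The code above separates naps from nocturnal sleep based on the position of each record within the duplicated pairs. A possible alternative, sketched below on invented data, is to keep the record with the longest TIB within each IDday. This is not the procedure used in this document, and the `toy` data frame is hypothetical.

```r
# toy data: two ID-days, each with an early-evening nap and a nocturnal sleep period
toy <- data.frame(
  IDday = c("s001_2020-01-10", "s001_2020-01-10", "s002_2020-01-10", "s002_2020-01-10"),
  LogId = c("a1", "a2", "b1", "b2"),
  StartTime = as.POSIXct(c("2020-01-10 19:00:00", "2020-01-10 23:00:00",
                           "2020-01-10 18:30:00", "2020-01-10 22:45:00"), tz = "GMT"),
  EndTime = as.POSIXct(c("2020-01-10 20:30:00", "2020-01-11 06:30:00",
                         "2020-01-10 20:00:00", "2020-01-11 07:00:00"), tz = "GMT"),
  stringsAsFactors = FALSE)

# TIB (hours) as the difference between EndTime and StartTime
toy$TIBhours <- as.numeric(difftime(toy$EndTime, toy$StartTime, units = "hours"))

# keeping, within each IDday, only the record with the longest TIB
toy$isLongest <- toy$TIBhours == ave(toy$TIBhours, toy$IDday, FUN = max)
kept <- toy[toy$isLongest, ]
kept[, c("IDday", "LogId", "TIBhours")]
```

Selecting by duration does not rely on the records being sorted in nap-then-sleep order, but ties (two records with identical TIB) would need to be handled separately.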
In the San Francisco area, Daylight Saving Time (DST) changes occurred on March 10th (1h forward) and November 3rd, 2019 (1h backward), again on March 8th (1h forward) and November 1st, 2020 (1h backward), and finally on March 14th, 2021 (1h forward). Here, we inspect the distributions of StartHour values in the 5 days preceding and the 5 days following each of these dates, in order to check whether time was automatically updated by the wristband.
# setting DST changing times
DST.changes <- as.Date(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"))
# selecting cases with ActivityDate = DST.changes + or - 5 days
DST <- as.data.frame(matrix(nrow=0,ncol=4))
for(i in 1:length(DST.changes)){
DST <- rbind(DST,sleepLog.new[difftime(sleepLog.new$ActivityDate,DST.changes[i],units="days")>(-5) &
difftime(sleepLog.new$ActivityDate,DST.changes[i],units="days")<5,
c("ID","ActivityDate","StartTime","StartHour","EndTime","TimeInBed","Duration")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$ActivityDate%in%DST.changes,"DST"] <- TRUE
# computing time (hours) from midnight
DST$timeFrom00 <- as.POSIXct(paste(lubridate::hour(DST$StartTime), lubridate::minute(DST$StartTime)), format="%H %M")
DST$timeFrom00 <- as.numeric(difftime(DST$timeFrom00,
as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT"),units="hours"))
# subtracting 24 hours from cases with timeFrom00 > 12
DST[DST$timeFrom00>12,"timeFrom00"] <- DST[DST$timeFrom00>12,"timeFrom00"] - 24
# plotting StartTime trends
for(i in 1:length(DST.changes)){
DSTs <- c(substr(DST.changes[i],1,7),
paste(substr(DST.changes[i],1,6),as.integer(substr(DST.changes[i],7,7))-1,sep=""))
print(ggplot(data=DST[substr(DST$ActivityDate,1,7)%in%DSTs,],aes(x=ActivityDate,y=timeFrom00)) +
geom_line(aes(colour=ID)) + geom_point(aes(colour=ID),size=3) + ggtitle(DST.changes[i]) +
geom_vline(xintercept=DST.changes[i]) +
theme(axis.text.x=element_text(angle=45),legend.position = "none"))}
Comments:
the visual inspection of StartTime trends in those participants who recorded their sleep during the days around DST changes does not suggest systematic shifts coinciding with time changes
the only DST change that shows some substantial shift in StartTime is the first one (2019-03-10), with four participants out of eight showing an upward trend of one to four hours
similarly, the comparison between the TimeInBed (minutes) or Duration (in ms) automatically computed by the Fitbit and the difference between EndTime and StartTime does not highlight systematic biases associated with DST changes
# TimeInBed vs. EndTime-StartTime
DST$End_minus_Start <- as.numeric(difftime(DST$EndTime,DST$StartTime,units="mins"))
DST$timeDiff <- DST$TimeInBed - DST$End_minus_Start
DST$timeDiff2 <- DST$Duration/1000/60 - DST$End_minus_Start
DST[,c("StartTime","EndTime","TimeInBed","Duration","End_minus_Start","timeDiff","timeDiff2")]
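The hours-from-midnight transformation used for the DST plots in this section can be sketched on a few invented start times. The `start` vector is hypothetical, and the hour of day is computed with `format()` instead of the lubridate calls used above.

```r
# hypothetical sleep start times (GMT)
start <- as.POSIXct(c("2020-03-07 22:30:00",
                      "2020-03-08 00:45:00",
                      "2020-03-08 23:50:00"), tz = "GMT")

# hours elapsed from midnight of the same calendar day
timeFrom00 <- as.numeric(format(start, "%H")) + as.numeric(format(start, "%M")) / 60

# evening start times (> 12h) are expressed as negative hours before the next midnight
timeFrom00[timeFrom00 > 12] <- timeFrom00[timeFrom00 > 12] - 24
timeFrom00
```

With this coding, a start time of 22:30 becomes -1.5 and a start time of 00:45 becomes 0.75, so that evening and early-morning bedtimes lie on a continuous scale centered on midnight.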
Finally, we plot the order of sleep periods against StartTime for each participant in order to better inspect the pattern of missing data.
par(mfrow=c(3,3))
for(ID in levels(sleepLog.new$ID)){
plot((1:nrow(sleepLog.new[sleepLog.new$ID==ID,])~sleepLog.new[sleepLog.new$ID==ID,"StartTime"]),main=ID,xlab="",ylab="") }
Comments:
most participants show several clusters of missing data during the period of participation, with participants s038, s041, s048, s052, s063, s064, s086, s089, s090, s105, s109, s119, and s120 showing the longest and most frequent periods of missing data
participants s038 and s089 show very few nonmissing days (13 and six, respectively)
a final group of cases recorded several hours/days after the previous ones is observed in a substantial number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s028, s029, s030, s031, s033, s040, s042, s053, s055, s058, s060, s062, s063, s064, and s094), consistent with what was observed at the beginning of section 2.3.
these cases will be better discussed in the data cleaning section.
Here, we update and save the recoded sleepLog dataset with the 4,921 included cases.
# updating and saving sleepLog dataset with combined TIBs
sleepLog_noncomb <- sleepLog
save(sleepLog_noncomb,file="DATA/datasets/sleepLog_nonComb.RData") # saving original dataset
sleepLog <- sleepLog.new
save(sleepLog,file="DATA/datasets/sleepLog_combined.RData") # saving combined dataset
sleepEBE data exported from Fitabase consist of 30-sec epoch-by-epoch categorization of sleep periods as either ‘wake’ or ‘light,’ ‘deep,’ or ‘REM’ sleep. Specifically, we focus on the SleepStage data column (i.e., accounting for short detected awakenings/arousals to adjust for wake misdetection).
Here, we recode the Time variable (i.e., the “Date and time within a defined sleep period in mm/dd/yy hh:mm:ss format”), based on which the ActivityDate variable is computed. To change the resolution from minutes (as exported from Fitabase) to seconds, we use the add30 argument of the timeCheck function, which adds 30 seconds to every other epoch within the same LogId. Moreover, the day.withinNight argument is used to keep the same ActivityDate value for those epochs recorded before and after midnight within the same LogId.
p <- timeCheck(data=sleepEBE,ID="ID",day="ActivityDate",hour="Time",
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M",
add30=TRUE,LogId="LogId",day.withinNight=TRUE)
## 4243037 observations in 4049 days from 93 participants:
##
## - mean No. of days/participant = 43.54 SD = 12.55 min = 4 max = 62
## - mean data collection duration (days) = 73.82 - SD = 36.14 min = 5 max = 331
##
## - mean No. of missing days per participant = 30.28 SD = 36.17 min = 1 max = 278
## - mean No. of consecutive missing days per participant = 15.3 SD = 32.61 min = 2 max = 268
##
## - No. of duplicated cases by hour (same ID and hour) = 0
Comments:
the ActivityDate
variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 40-60 nonmissing days of data, with only one participant showing fewer than 10 days, and six further participants showing fewer than 20 days
no cases have the same participant’s identifier and temporal coordinate
the (substantial) No. of missing days and consecutive missing days is in line with that shown for sleepLog data
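The logic of the add30 step can be illustrated with the sketch below. It assumes that each minute is exported as two consecutive epochs sharing the same minute-level timestamp; this is an assumption about the input, the `toy` data are invented, and the actual implementation of timeCheck may differ.

```r
# toy data: two LogIds, each with two epochs per minute-level timestamp
toy <- data.frame(
  LogId = rep(c("a", "b"), each = 4),
  Time = as.POSIXct(c("2020-01-10 23:00:00", "2020-01-10 23:00:00",
                      "2020-01-10 23:01:00", "2020-01-10 23:01:00",
                      "2020-01-11 22:30:00", "2020-01-11 22:30:00",
                      "2020-01-11 22:31:00", "2020-01-11 22:31:00"), tz = "GMT"),
  stringsAsFactors = FALSE)

# position of each epoch within its LogId (1, 2, 3, ...)
pos <- ave(seq_len(nrow(toy)), toy$LogId, FUN = seq_along)

# epochs in even positions are the second half of each minute: adding 30 sec
toy$Time[pos %% 2 == 0] <- toy$Time[pos %% 2 == 0] + 30

# time difference (secs) between consecutive epochs of the first LogId
diffs <- as.numeric(diff(toy$Time[toy$LogId == "a"]), units = "secs")
diffs
```

After the correction, consecutive epochs within the same LogId are separated by 30 seconds.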
As in the sleepLog
data, sleepEBE
data are associated with specific LogId
identifying separate sleep periods for each participant.
# LogId as factor
sleepEBE <- p # dataset processed with timeCheck
sleepEBE$LogId <- as.factor(sleepEBE$LogId)
# sanity check: no cases with double logId and the same day
sleepEBE$dayLog <- as.factor(paste(sleepEBE$LogId,sleepEBE$ActivityDate,sep="_"))
dayLog <- sleepEBE[!duplicated(sleepEBE$dayLog),]
cat("sanity check:",length(which(summary(dayLog$LogId)==2))==0)
## sanity check: TRUE
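The idea behind the day.withinNight argument and the sanity check above can be sketched as follows. The rule used here (propagating the date of the first epoch to all epochs of the same LogId) is only one possible rule and an assumption on our side, since the internals of timeCheck are not shown; the `toy` data are invented.

```r
# toy data: LogId "a" crosses midnight, LogId "b" does not
toy <- data.frame(
  LogId = rep(c("a", "b"), each = 3),
  Time = as.POSIXct(c("2020-01-10 23:59:00", "2020-01-10 23:59:30", "2020-01-11 00:00:00",
                      "2020-01-12 01:00:00", "2020-01-12 01:00:30", "2020-01-12 01:01:00"),
                    tz = "GMT"),
  stringsAsFactors = FALSE)

# naive calendar date: it changes at midnight within LogId "a"
toy$naiveDate <- as.Date(toy$Time, tz = "GMT")

# assumed rule: all epochs of a LogId take the date of its first epoch
toy$ActivityDate <- as.Date(ave(as.character(toy$naiveDate), toy$LogId,
                                FUN = function(x) x[1]))

# sanity check: No. of distinct ActivityDate values per LogId (expected: 1)
nDatesPerLog <- tapply(as.character(toy$ActivityDate), toy$LogId,
                       function(x) length(unique(x)))
nDatesPerLog
```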
We can notice that the number of LogIds from sleepEBE data (N = 4,573) is lower than that shown by sleepLog data (N = 5,402). This difference is partially accounted for by cases of SleepDataType = “classic” (N = 857), not included in sleepEBE data, in addition to some cases (N = 41) only included in sleepEBE data.
# sleepLog (870 sleepLog only)
data.frame(NsleepLog=nrow(sleepLog_noncomb),NsleepLog.Stages=nrow(sleepLog_noncomb[sleepLog_noncomb$SleepDataType!="classic",]),
NsleepLog.Classic=nrow(sleepLog_noncomb[sleepLog_noncomb$SleepDataType=="classic",]),
NsleepLog_NOsleepEBE=length(levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(sleepEBE$LogId)]))
# sleepEBE (41 sleepEBE only)
stagesLogs <- levels(as.factor(as.character(sleepLog_noncomb[sleepLog_noncomb$SleepDataType=="stages","LogId"])))
classicLogs <- levels(as.factor(as.character(sleepLog_noncomb[sleepLog_noncomb$SleepDataType=="classic","LogId"])))
data.frame(NsleepEBE=nlevels(sleepEBE$LogId),
NsleepEBE_INsleepLog=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]),
NsleepEBE_INsleepLog.Stages=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%stagesLogs]),
NsleepEBE_INsleepLog.Classic=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%classicLogs]),
NsleepEBE_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]))
Comments:
the single LogId value that is included in both sleepEBE and sleepLog “classic” is the same double case highlighted in section 2.3.1. Consistent with the retained case in sleepLog data, this case shows a TIB of 11.3h.
# duration (hours) of the single case included in both sleepEBE and sleepLog classic LogIds
nrow(sleepEBE[sleepEBE$LogId=="24433907842",])/2/60
## [1] 11.30833
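The overlap counts computed in this section with `%in%` on factor levels can also be obtained with base R set operations, as sketched below. The two identifier vectors are invented and stand in for `levels(sleepLog_noncomb$LogId)` and `levels(sleepEBE$LogId)`.

```r
# hypothetical LogId values from two datasets
sleepLogIds <- c("101", "102", "103", "104")
sleepEBEIds <- c("102", "103", "105")

# No. of LogIds shared by the two datasets or included in only one of them
counts <- c(inBoth = length(intersect(sleepLogIds, sleepEBEIds)),
            onlySleepLog = length(setdiff(sleepLogIds, sleepEBEIds)),
            onlySleepEBE = length(setdiff(sleepEBEIds, sleepLogIds)))
counts
```

Since `intersect()` and `setdiff()` work on character vectors, this form avoids the repeated conversions between factors and characters.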
Here, we inspect EBE Time values in those epochs immediately preceding or following the DST changes highlighted in section 2.3.5. Note that DST changes in the San Francisco area always occurred at 2 AM.
# setting DST changing times
DST.changes <- as.POSIXct(paste(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"),
"02:00:00",sep=" "),tz="GMT")
# selecting epochs with Time within 1 minute of each DST change
DST <- as.data.frame(matrix(nrow=0,ncol=2))
for(i in 1:length(DST.changes)){
DST <- rbind(DST,sleepEBE[difftime(sleepEBE$Time,DST.changes[i],units="mins")>(-1) &
difftime(sleepEBE$Time,DST.changes[i],units="mins")<1,c("ID","Time")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$Time%in%DST.changes,"DST"] <- TRUE
DST[order(DST$ID,DST$Time),]
Comments:
the inspection of the temporal coordinates of the epochs immediately preceding or following DST changes suggests that time is not updated by the Fitbit device during DST changes, consistent with what was concluded for sleepLog data in section 2.3.5
indeed, there are no 1-hour ‘holes’ in the sleepEBE data corresponding to DST changes, but the Time is continuously updated by adding 30 sec from one epoch to the following one
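The selection of epochs around a DST change can be sketched on invented data as follows. The `epochs` data frame is hypothetical and contains seven 30-sec epochs centered on one of the DST change times.

```r
# one DST change time (GMT) and seven hypothetical 30-sec epochs around it
change <- as.POSIXct("2020-03-08 02:00:00", tz = "GMT")
epochs <- data.frame(ID = "s001",
                     Time = change + seq(-90, 90, by = 30),
                     stringsAsFactors = FALSE)

# epochs recorded less than 1 minute before or after the DST change
near <- abs(as.numeric(difftime(epochs$Time, change, units = "mins"))) < 1

# marking the epoch that coincides with the DST change
epochs$DST <- epochs$Time == change
epochs[near, ]

# checking that consecutive epochs are always 30 sec apart (no 1-hour 'holes')
noGaps <- all(as.numeric(diff(epochs$Time), units = "secs") == 30)
noGaps
```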
The temporal continuity discussed above can also be inspected throughout the whole sleepEBE dataset, by counting the No. of consecutive epochs within the same LogId value whose Time values differ by more than 60 sec (i.e., 30 sec as expected + further 30 due to issues related to time rounding). This is done with the checkTimeContinuity function.
checkTimeContinuity <- function(data,temporalDiff,doPlot=FALSE){
  nHoles <- 0 # No. of LogIds with at least one gap longer than temporalDiff (secs)
  for(LOG in levels(as.factor(as.character(data$LogId)))){
    LogData <- data[data$LogId==LOG,c("ID","Time")]
    # time difference (secs) between each epoch and the preceding one
    LogData$Time.LAG <- dplyr::lag(LogData$Time, n = 1, default = NA)
    LogData$diffTime <- as.numeric(difftime(LogData$Time,LogData$Time.LAG,units="secs"))
    LogData$IDtime <- as.factor(paste(LogData$ID,LogData$Time))
    diffs <- na.omit(LogData[LogData$diffTime>temporalDiff,])
    if(nrow(diffs)>0){
      # printing the epochs recorded within 2 min from the gap
      diffs <- LogData[LogData$ID%in%levels(as.factor(as.character(diffs$ID))) &
                         LogData$Time>=na.omit(LogData[LogData$diffTime>temporalDiff,"Time"])-120 &
                         LogData$Time<=na.omit(LogData[LogData$diffTime>temporalDiff,"Time"])+120,]
      print(diffs)
      nHoles <- nHoles + 1 }}
  # note: the message below always reports 60 secs, regardless of temporalDiff
  cat(nHoles,"consecutive epochs separated by more than 60 secs")
  # optionally plotting epoch order against Time for each participant
  if(doPlot==TRUE){
    par(mfrow=c(3,3))
    for(ID in levels(data$ID)){ plot((1:nrow(data[data$ID==ID,])~data[data$ID==ID,"Time"]),main=ID) }}
}
# sorting sleepEBE by ID, ActivityDate, and Time
sleepEBE <- sleepEBE[order(sleepEBE$ID,sleepEBE$ActivityDate,sleepEBE$Time),]
# checking temporal continuity setting temporalDiff to 60 sec
checkTimeContinuity(data=sleepEBE,temporalDiff=60)
## ID Time Time.LAG diffTime
## 2714657 s081 2020-03-08 09:00:00 2020-03-08 07:59:30 3630
## 2714658 s081 2020-03-08 09:00:30 2020-03-08 09:00:00 30
## 2714659 s081 2020-03-08 09:01:00 2020-03-08 09:00:30 30
## 2714660 s081 2020-03-08 09:01:30 2020-03-08 09:01:00 30
## 2714661 s081 2020-03-08 09:02:00 2020-03-08 09:01:30 30
## IDtime
## 2714657 s081 2020-03-08 09:00:00
## 2714658 s081 2020-03-08 09:00:30
## 2714659 s081 2020-03-08 09:01:00
## 2714660 s081 2020-03-08 09:01:30
## 2714661 s081 2020-03-08 09:02:00
## 1 consecutive epochs separated by more than 60 secs
# one case with diffTime > 60 sec
sleepEBE[which(rownames(sleepEBE)%in%as.character(2714654:2714660)),c("ID","LogId","ActivityDate","Time")]
Comments:
only in one case (LogId = 26201445747) there are two consecutive epochs separated by more than 60 seconds, namely 1 hour, and this case is observed precisely on March 8th, 2020 (DST change)
nevertheless, no other ‘holes’ are observed in correspondence with DST changes, consistent with our conclusions in the section above
none of the other cases show time shifts of more than 60 secs
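The gap detection performed above can also be written without an explicit loop over LogIds. The sketch below uses `ave()` to compute the time difference from the preceding epoch within each LogId; the `toy` data are invented and reproduce a 1-hour gap similar to the one shown above.

```r
# toy data: LogId "a" has a gap between 07:59:30 and 09:00:00, LogId "b" has none
toy <- data.frame(
  LogId = rep(c("a", "b"), each = 4),
  Time = as.POSIXct(c("2020-03-08 07:58:30", "2020-03-08 07:59:00",
                      "2020-03-08 07:59:30", "2020-03-08 09:00:00",
                      "2020-03-09 23:00:00", "2020-03-09 23:00:30",
                      "2020-03-09 23:01:00", "2020-03-09 23:01:30"), tz = "GMT"),
  stringsAsFactors = FALSE)
toy <- toy[order(toy$LogId, toy$Time), ]

# seconds elapsed from the preceding epoch of the same LogId (NA for the first epoch)
toy$diffTime <- ave(as.numeric(toy$Time), toy$LogId,
                    FUN = function(x) c(NA, diff(x)))

# epochs preceded by a gap longer than 60 sec
gaps <- toy[!is.na(toy$diffTime) & toy$diffTime > 60, ]
gaps
```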
In a number of cases (28%), there is a time shift of 60 secs between the last and the preceding epoch. Here, we correct these cases by subtracting 30 seconds from the last epoch.
# counting and correcting cases with 60 secs between the last and the preceding epoch
n60 <- 0
for(LOG in levels(as.factor(as.character(sleepEBE$LogId)))){
LogData <- sleepEBE[sleepEBE$LogId==LOG,c("ID","Time")]
if(as.numeric(difftime(tail(LogData$Time,1),tail(LogData$Time,2)[1],units="secs"))==60){
n60 <- n60 + 1
sleepEBE[sleepEBE$LogId==LOG,"Time"][nrow(LogData)] <- sleepEBE[sleepEBE$LogId==LOG,"Time"][nrow(LogData)] - 30 }}
n60 # No. of corrected cases
## [1] 1289
# re-checking temporal continuity setting temporalDiff to 30 sec
checkTimeContinuity(data=sleepEBE,temporalDiff=30)
## ID Time Time.LAG diffTime
## 2714657 s081 2020-03-08 09:00:00 2020-03-08 07:59:30 3630
## 2714658 s081 2020-03-08 09:00:30 2020-03-08 09:00:00 30
## 2714659 s081 2020-03-08 09:01:00 2020-03-08 09:00:30 30
## 2714660 s081 2020-03-08 09:01:30 2020-03-08 09:01:00 30
## 2714661 s081 2020-03-08 09:02:00 2020-03-08 09:01:30 30
## IDtime
## 2714657 s081 2020-03-08 09:00:00
## 2714658 s081 2020-03-08 09:00:30
## 2714659 s081 2020-03-08 09:01:00
## 2714660 s081 2020-03-08 09:01:30
## 2714661 s081 2020-03-08 09:02:00
## 1 consecutive epochs separated by more than 60 secs
Comments:
all cases were effectively corrected
now, no cases have pairs of consecutive epochs differing by more than 30 sec, with the only exception of LogId 26201445747 (see section 2.5.3)
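The correction of the last epoch can be illustrated on a small invented example. In the `toy` data below, only the first LogId ends with an epoch recorded 60 sec after the preceding one.

```r
# toy data: the last epoch of LogId "a" is 60 sec after the preceding one
toy <- data.frame(
  LogId = rep(c("a", "b"), each = 3),
  Time = as.POSIXct(c("2020-01-10 06:58:30", "2020-01-10 06:59:00", "2020-01-10 07:00:00",
                      "2020-01-11 07:10:00", "2020-01-11 07:10:30", "2020-01-11 07:11:00"),
                    tz = "GMT"),
  stringsAsFactors = FALSE)

n60 <- 0 # No. of corrected cases
for(LOG in unique(toy$LogId)){
  rows <- which(toy$LogId == LOG)
  if(length(rows) > 1){
    last <- rows[length(rows)]     # last epoch of the LogId
    prev <- rows[length(rows) - 1] # preceding epoch
    if(as.numeric(difftime(toy$Time[last], toy$Time[prev], units = "secs")) == 60){
      toy$Time[last] <- toy$Time[last] - 30 # subtracting 30 sec from the last epoch
      n60 <- n60 + 1 }}}
n60
```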
Finally, we plot epochs order against Time
for each participant in order to better inspect the pattern of missing data.
par(mfrow=c(3,3))
for(ID in levels(sleepEBE$ID)){
plot((1:nrow(sleepEBE[sleepEBE$ID==ID,])~sleepEBE[sleepEBE$ID==ID,"Time"]),main=ID,xlab="",ylab="",cex=0.5) }
Comments:
most participants show several clusters of missing data during the period of participation, with participants s038, s040, s041, s048, s052, s063, s064, s086, s089, s090, s105, s109, s119, and s120 showing the longest and most frequent periods of missing data, partially consistent with what was reported in section 2.3.6 for sleepLog data
participants s038 and s089 show very few nonmissing days (13 and six, respectively)
a final group of epochs recorded several hours/days after the previous ones is observed in a substantial number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s028, s029, s030, s031, s033, s042, s053, s055, s058, s062, s063, s064, s074, and s094), partially consistent with what was reported in section 2.3.6 for sleepLog data
these cases will be better discussed in the data cleaning section.
Here, we save the processed sleepEBE
dataset to be used in the following steps.
save(sleepEBE,file="DATA/datasets/sleepEBE_timeProcessed.RData")
classicEBE data exported from Fitabase consist of 60-sec epoch-by-epoch categorization of sleep periods as either ‘wake’ or ‘sleep.’ classicEBE data are processed to integrate sleepEBE with those cases (currently not included) with SleepDataType = “classic”.
Here, we recode the date variable (here renamed as Time; i.e., the “Date and minute of that day within a defined sleep period in mm/dd/yy hh:mm:ss format”), based on which the ActivityDate variable is computed.
colnames(classicEBE)[which(colnames(classicEBE)=="date")] <- "Time"
colnames(classicEBE)[colnames(classicEBE)=="logId"] <- "LogId"
p <- timeCheck(data=classicEBE,ID="ID",day="ActivityDate",hour="Time",
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M",
day.withinNight=TRUE,LogId="LogId")
## 2505034 observations in 4663 days from 93 participants:
##
## - mean No. of days/participant = 50.14 SD = 11.95 min = 6 max = 86
## - mean data collection duration (days) = 77.67 - SD = 38.65 min = 7 max = 331
##
## - mean No. of missing days per participant = 27.53 SD = 41.14 min = 0 max = 275
## - mean No. of consecutive missing days per participant = 15.84 SD = 37.03 min = 0 max = 268
##
## - No. of duplicated cases by hour (same ID and hour) = 0
Comments:
the ActivityDate
variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 55-65 nonmissing days of data, consistent with sleepEBE and sleepLog data
no cases have the same participant’s identifier and temporal coordinate
the (substantial) No. of missing days is slightly lower than that shown by sleepEBE and sleepLog data, whereas the No. of consecutive missing days is similar across the three datasets
As in the sleepLog
data, classicEBE
data are associated with specific LogId
identifying separate sleep periods for each participant.
# LogId as factor
classicEBE <- p # dataset processed with timeCheck
classicEBE$LogId <- as.factor(classicEBE$LogId)
# sanity check: no cases with double logId and the same day
classicEBE$dayLog <- as.factor(paste(classicEBE$LogId,classicEBE$ActivityDate,sep="_"))
dayLog <- classicEBE[!duplicated(classicEBE$dayLog),]
cat("sanity check:",length(which(summary(dayLog$LogId)==2))==0)
## sanity check: TRUE
We can notice that the number of LogIds from classicEBE data (N = 5,759) is higher than that shown by both sleepLog data (N = 5,402) and sleepEBE data (N = 4,573).
# sleepLog (1 sleepLog only)
data.frame(NsleepLog=nrow(sleepLog_noncomb),NStages=nrow(sleepLog_noncomb[sleepLog_noncomb$SleepDataType!="classic",]),
NsClassic=nrow(sleepLog_noncomb[sleepLog_noncomb$SleepDataType=="classic",]),
NsleepLog_NOsleepEBE=length(levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(sleepEBE$LogId)]),
NsleepLog_NOclassicEBE=length(levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(classicEBE$LogId)]),
NsleepLog_NOstageORclassic=length(levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(sleepEBE$LogId)
& !levels(sleepLog_noncomb$LogId)%in%levels(classicEBE$LogId)]))
# sleepEBE (41 sleepEBE only)
data.frame(Nstages=nlevels(sleepEBE$LogId),
Nstages_INsleepLog=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]),
Nstages_INsleepLog.Stages=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%stagesLogs]),
Nstages_INsleepLog.Classic=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%classicLogs]),
Nstages_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]))
# classicEBE (377 classicEBE only)
data.frame(Nclassic=nlevels(classicEBE$LogId),
Nclassic_INsleepLog=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]),
Nclassic_INsleepLog.Stages=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%stagesLogs]),
Nclassic_INsleepLog.Classic=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%classicLogs]),
Nclassic_NOsleepLog=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]))
# classicEBE vs. sleepEBE
data.frame(Nstages_INclassic=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
Nstages_NOclassic=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
Nclassic_INsleepEBE=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]),
Nclassic_NOstages=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))
# only sleepEBE and/or classicEBE
data.frame(NstagesANDclassic_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
Nstages_NOsleepLogORclassic=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
Nclassic_NOsleepLogORstages=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]),
NclassicANDsleepLogNostages=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))
Comments:
only one case is included in sleepLog but not in sleepEBE or classicEBE data (N = 1)
41 cases are included in sleepEBE
but not in sleepLog
336 cases are only included in classicEBE
but not in sleepEBE
or sleepLog
869 cases are included in classicEBE
and sleepLog
but not in sleepEBE
in contrast, no cases are only included in sleepEBE
or in both sleepEBE
and classicEBE
but not in sleepLog
From the above, we can identify four main groups of cases that are included in some datasets but not in the others:
Those cases only included in sleepLog
but not in sleepEBE
or classicEBE
(N = 1) will be removed from the analyses (see data cleaning)
# identifying and showing 1 case only included in sleepLog data
uniqueLogId <- levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(sleepEBE$LogId) &
!levels(sleepLog_noncomb$LogId)%in%levels(classicEBE$LogId)]
length(uniqueLogId)
## [1] 1
sleepLog[sleepLog$LogId%in%uniqueLogId,c("ID","LogId","ActivityDate","StartTime","EndTime","Duration")]
Those cases only included in sleepEBE
but not in sleepLog
(N = 41) will be processed separately based on the number of epochs included in sleepEBE.
Note that these are all cases of nocturnal sleep with TIB
between 5.7 and 11.25 hours.
# summarizing TIB and StartTime of 41 cases only included in sleepEBE data
uniqueEBElogs <- levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]
length(uniqueEBElogs)
## [1] 41
n <- nrow(sleepEBE[sleepEBE$LogId==uniqueEBElogs[1],])
start <- head(sleepEBE[sleepEBE$LogId==uniqueEBElogs[1],"Time"],1)
for(i in 2:length(uniqueEBElogs)){
n <- c(n,nrow(sleepEBE[sleepEBE$LogId==uniqueEBElogs[i],]))
start <- c(start,head(sleepEBE[sleepEBE$LogId==uniqueEBElogs[i],"Time"],1))}
StartHour <- as.POSIXct(paste(lubridate::hour(start),lubridate::minute(start)),format="%H %M",tz="GMT")
# plotting
par(mfrow=c(2,1))
hist(n/2/60,breaks=35,col="black",main="TIB (hours) in uniqueEBElogs")
hist(StartHour,breaks=35,col="black",xlab="",main="StartTime in uniqueEBElogs")
Those cases only included in classicEBE but not in sleepLog or sleepEBE (N = 336) will also be processed separately, based on the number of epochs included in classicEBE. Most of these cases seem to be cases of nocturnal sleep, with TIB from 2 to 13h.
# summarizing TIB and StartTime of 336 cases only included in classicEBE
uniqueClassiclogs <- levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId) &
!levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]
length(uniqueClassiclogs)
## [1] 336
n <- nrow(classicEBE[classicEBE$LogId==uniqueClassiclogs[1],])
start <- head(classicEBE[classicEBE$LogId==uniqueClassiclogs[1],"Time"],1)
for(i in 2:length(uniqueClassiclogs)){
n <- c(n,nrow(classicEBE[classicEBE$LogId==uniqueClassiclogs[i],]))
start <- c(start,head(classicEBE[classicEBE$LogId==uniqueClassiclogs[i],"Time"],1))}
StartHour <- as.POSIXct(paste(lubridate::hour(start),lubridate::minute(start)),format="%H %M",tz="GMT")
par(mfrow=c(2,1))
hist(n/60,breaks=35,col="black",main="TIB (hours) in uniqueClassiclogs")
hist(StartHour,breaks=35,col="black",xlab="",main="StartTime in uniqueClassiclogs")
Those cases included in both classicEBE and sleepLog but not in sleepEBE (N = 869) will also be processed separately, based on the number of epochs included in classicEBE.
Only a minority of these cases seem to be cases of nocturnal sleep, with TIB from 5 to 13h, whereas most cases are naps (TIB < 5h).
# summarizing TIB and StartTime of 869 cases only included in classicEBE and sleepLog
ClassicAndSleepLog <- levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId) &
!levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]
length(ClassicAndSleepLog)
## [1] 869
par(mfrow=c(2,1))
hist(sleepLog_noncomb[sleepLog_noncomb$LogId%in%ClassicAndSleepLog,"TimeInBed"]/60,breaks=35,col="black",
main="TIB (hours) in ClassicAndSleepLog")
hist(sleepLog_noncomb[sleepLog_noncomb$LogId%in%ClassicAndSleepLog,"StartHour"],breaks=35,col="black",xlab="",
main="StartTime in ClassicAndSleepLog")
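For the cases not included in sleepLog, the TIB values summarized above are derived from the number of epochs per LogId (30-sec epochs in sleepEBE, 60-sec epochs in classicEBE). The sketch below shows this computation with a hypothetical helper function, `epochsToHours`, applied to invented data.

```r
# toy data: epoch-by-epoch rows for hypothetical LogIds
sleepToy <- data.frame(LogId = c(rep("a", 960), rep("b", 720)),
                       stringsAsFactors = FALSE)   # 30-sec epochs
classicToy <- data.frame(LogId = c(rep("c", 120), rep("d", 450)),
                         stringsAsFactors = FALSE) # 60-sec epochs

# hypothetical helper: TIB (hours) from the No. of epochs per LogId
epochsToHours <- function(LogId, epochLength){
  n <- table(LogId) # No. of epochs per LogId
  setNames(as.numeric(n) * epochLength / 3600, names(n)) }

TIBsleep <- epochsToHours(sleepToy$LogId, epochLength = 30)
TIBclassic <- epochsToHours(classicToy$LogId, epochLength = 60)
TIBsleep
TIBclassic
```

Using `table()` returns the epoch counts of all LogIds at once, without the need to grow the `n` vector within a loop.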
Here, we inspect EBE Time values in those epochs immediately preceding or following the DST changes highlighted in section 2.3.5. Note that DST changes in the San Francisco area always occurred at 2 AM.
# setting DST changing times
DST.changes <- as.POSIXct(paste(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"),
"02:00:00",sep=" "),tz="GMT")
# selecting epochs with Time within 2 minutes of each DST change
DST <- as.data.frame(matrix(nrow=0,ncol=2))
for(i in 1:length(DST.changes)){
DST <- rbind(DST,classicEBE[difftime(classicEBE$Time,DST.changes[i],units="mins")>(-2) &
difftime(classicEBE$Time,DST.changes[i],units="mins")<2,c("ID","Time","LogId")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$Time%in%DST.changes,"DST"] <- TRUE
DST[order(DST$ID,DST$Time),]
Comments:
the inspection of the temporal coordinates of the epochs immediately preceding or following DST changes suggests that time is not updated by the Fitbit device during DST changes, consistent with what was concluded for sleepLog data in section 2.3.5
indeed, there are no 1-hour ‘holes’ in the classicEBE data corresponding to DST changes, but the Time is continuously updated by adding 60 sec from one epoch to the following one
The temporal continuity discussed above can also be inspected throughout the whole classicEBE dataset, by counting the No. of consecutive epochs within the same LogId value whose Time values differ by more than 60 sec (i.e., the expected distance between consecutive 60-sec epochs).
# sorting classicEBE by ID, ActivityDate, and Time
classicEBE <- classicEBE[order(classicEBE$ID,classicEBE$ActivityDate,classicEBE$Time),]
# checking temporal continuity
checkTimeContinuity(data=classicEBE,temporalDiff=60)
## 0 consecutive epochs separated by more than 60 secs
Comments:
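no pairs of consecutive epochs separated by more than 60 sec are found in classicEBE data, not even for LogId = 26201445747, which showed a shift of 1 hour between 7:59 and 9:00 in sleepEBE data
Here, we more closely inspect sleepLog, sleepEBE, and classicEBE times for this case.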
# computing and showing sleep times for LogId 26201445747
times <- data.frame(dataType=c("sleepLog","sleepEBE","classicEBE"),
start=c(sleepLog[sleepLog$LogId=="26201445747","StartTime"],head(sleepEBE[sleepEBE$LogId=="26201445747","Time"],1),
head(classicEBE[classicEBE$LogId=="26201445747","Time"],1)),
end=c(sleepLog[sleepLog$LogId=="26201445747","EndTime"],tail(sleepEBE[sleepEBE$LogId=="26201445747","Time"],1),
tail(classicEBE[classicEBE$LogId=="26201445747","Time"],1)),
duration=c(sleepLog[sleepLog$LogId=="26201445747","TimeInBed"],nrow(sleepEBE[sleepEBE$LogId=="26201445747",])/2,
nrow(classicEBE[classicEBE$LogId=="26201445747",])))
times$timeDiff <- difftime(times$end,times$start,units="mins")
times
Comments:
both sleepEBE
and classicEBE
show shorter TIB than sleepLog
since classicEBE
times are closer to sleepLog
times, with no missing epochs (in contrast to sleepEBE
data), we keep only these
Here, we discard the epochs of LogId 26201445747 from sleepEBE.
sleepEBE <- sleepEBE[sleepEBE$LogId!="26201445747",] # removing case from sleepEBE
sleepEBE$LogId <- as.factor(as.character(sleepEBE$LogId)) # resetting LogIds
ClassicAndSleepLog <- c(ClassicAndSleepLog,"26201445747")
Finally, we plot epochs order against Time
for each participant in order to better inspect the pattern of missing data.
par(mfrow=c(3,3))
for(ID in levels(classicEBE$ID)){
plot((1:nrow(classicEBE[classicEBE$ID==ID,])~classicEBE[classicEBE$ID==ID,"Time"]),main=ID,xlab="",ylab="",cex=0.5) }
Comments:
most participants show several clusters of missing data during the period of participation, with participants s038, s040, s041, s048, s052, s063, s064, s089, and s090 showing the longest and most frequent periods of missing data, partially consistent with what was reported in section 2.3.6 for sleepLog data
participants s038 and s089 show very few nonmissing days (13 and six, respectively)
a final group of epochs recorded several hours/days after the previous ones is observed in a substantial number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s028, s029, s030, s031, s033, s040, s042, s052, s053, s055, s060, s062, s063, and s064), partially consistent with what was reported in section 2.3.6 for sleepLog data
these cases will be better discussed in the data cleaning section.
Here, we save the processed classicEBE
dataset to be used in the following steps. We also update the sleepEBE
dataset, and we save the special LogId
cases.
# saving updated classicEBE and sleepEBE datasets
save(classicEBE,file="DATA/datasets/classicEBE_timeProcessed.RData")
save(sleepEBE,file="DATA/datasets/sleepEBE_timeProcessed2.RData")
# saving LogId special cases
LogId_special <- list(uniqueLogId,uniqueEBElogs,uniqueClassiclogs,ClassicAndSleepLog)
save(LogId_special,file="DATA/datasets/LogId_special.RData")
HR.1min
data exported from Fitabase consist of 60-sec epoch-by-epoch heart rate data recorded by the Fitbit device. This variable will be used for recomputing both diurnal and nocturnal HR values.
Here, we recode the Time
variable (i.e., the “Date and hour value in mm/dd/yyyy hh:mm:ss format”), based on which the ActivityDate
variable is computed.
# standardizing Time format and converting as POSIXct
HR.1min[substr(HR.1min$Time,nchar(HR.1min$Time),nchar(HR.1min$Time))=="M","time2"] <- # timestamps with AM/PM specification
as.POSIXct(HR.1min[substr(HR.1min$Time,nchar(HR.1min$Time),nchar(HR.1min$Time))=="M","Time"],format="%m/%d/%Y %I:%M:%S %p",tz="GMT")
HR.1min[is.na(HR.1min$time2),"time2"] <-
as.POSIXct(HR.1min[is.na(HR.1min$time2),"Time"],format="%m/%d/%Y %I:%M:%S %p",tz="GMT") # cases requiring time zone
HR.1min[substr(HR.1min$Time,nchar(HR.1min$Time),nchar(HR.1min$Time))!="M","time2"] <-
as.POSIXct(HR.1min[substr(HR.1min$Time,nchar(HR.1min$Time),nchar(HR.1min$Time))!="M","Time"],
format="%m/%d/%Y %H:%M",tz="GMT") # timestamps without AM/PM specification
HR.1min[is.na(HR.1min$time2),"time2"] <-
as.POSIXct(HR.1min[is.na(HR.1min$time2),"Time"],format="%m/%d/%Y %H:%M",tz="GMT") # timestamps requiring time zone specification
HR.1min$Time <- HR.1min$time2 # keeping only the corrected timestamps
HR.1min$time2 <- NULL
# recoding day and hour, and checking time and missing data points
p <- timeCheck(data=HR.1min,ID="ID",day="ActivityDate",hour="Time",
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")
## 6986307 observations in 5800 days from 93 participants:
##
## - mean No. of days/participant = 62.37 SD = 9.32 min = 12 max = 80
## - mean data collection duration (days) = 78.39 - SD = 38.42 min = 25 max = 332
##
## - mean No. of missing days per participant = 16.02 SD = 40.16 min = 0 max = 266
## - mean No. of consecutive missing days per participant = 12.9 SD = 36.76 min = 0 max = 267
##
## - No. of duplicated cases by hour (same ID and hour) = 0
Comments:
the ActivityDate
variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 60-70 nonmissing days of data, with only a few participants showing fewer than 20 days
no cases have the same participant identifier and temporal coordinate
the No. of missing days and consecutive missing days, although substantial, is lower than that shown for sleepLog
, sleepEBE
and classicEBE
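The timestamp recoding above handles two input formats (12-hour with AM/PM and 24-hour without seconds). As a minimal sketch of the same idea on toy strings, the hypothetical helper `parseMixedTime` below (not part of the original pipeline) selects the format by checking whether the string ends with AM/PM. Note that `%p` is locale-dependent, so the sketch assumes an English or C locale.

```r
# Hypothetical raw timestamps mixing 12-hour (AM/PM) and 24-hour formats
raw <- c("3/10/2019 1:59:00 AM", "3/10/2019 11:30:00 PM", "03/11/2019 14:05")

# Hypothetical helper: parse each string with the format matching its layout
parseMixedTime <- function(x, tz = "GMT") {
  out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = tz)
  ampm <- grepl("[AP]M$", x)  # TRUE for timestamps ending with AM or PM
  out[ampm] <- as.POSIXct(x[ampm], format = "%m/%d/%Y %I:%M:%S %p", tz = tz)
  out[!ampm] <- as.POSIXct(x[!ampm], format = "%m/%d/%Y %H:%M", tz = tz)
  out
}

parsed <- parseMixedTime(raw)
print(format(parsed, "%Y-%m-%d %H:%M:%S"))
```

Any timestamp that matches neither format stays `NA`, which makes unparsed cases easy to count with `sum(is.na(parsed))`.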
Here, we inspect EBE Time
values in those epochs immediately preceding or following the DST changes highlighted in section 2.3.5. Note that DST times in the San Francisco area always changed at 2 AM.
HR.1min <- p[order(p$ID,HR.1min$Time),]
# setting DST changing times
DST.changes <- as.POSIXct(paste(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"),
"02:00:00",sep=" "),tz="GMT")
# selecting cases with Time = DST.changes + or - 3 minutes
DST <- as.data.frame(matrix(nrow=0,ncol=2))
for(i in 1:length(DST.changes)){
DST <- rbind(DST,HR.1min[difftime(HR.1min$Time,DST.changes[i],units="mins")>(-3) &
difftime(HR.1min$Time,DST.changes[i],units="mins")<3,c("ID","Time")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$Time%in%DST.changes,"DST"] <- TRUE
DST[order(DST$ID,DST$Time),]
Comments:
the inspection of the temporal coordinates of the epochs immediately preceding or following DST changes suggests that time is sometimes updated by the Fitbit device during DST changes, contrary to what was concluded for sleepLog
, sleepEBE
and classicEBE
data
specifically, time is updated (i.e., 1 hour forward) only in March but not in November DST changes
Here, we adjust those cases that show 1-hour shifts forward in March 2019-2021 by subtracting 1h from all Time
values after DST changes in March.
# subtracting 1h from cases with 1h shift forward in March 2019-2021
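As a minimal illustration of this adjustment, the sketch below applies the same 1-hour subtraction to a toy series for one hypothetical participant (the `toy` data frame and the `change` object are illustrative assumptions, not study data). Since POSIXct arithmetic operates in seconds, 1 hour corresponds to 60*60.

```r
# Toy 60-sec series around a March DST change:
# the clock jumps from 01:59 to 03:00 (1 hour forward)
toy <- data.frame(
  ID = "s001",
  Time = as.POSIXct(c("2019-03-10 01:58:00", "2019-03-10 01:59:00",
                      "2019-03-10 03:00:00", "2019-03-10 03:01:00"), tz = "GMT"))
change <- as.POSIXct("2019-03-10 02:00:00", tz = "GMT")

# subtracting 1 hour (in seconds) from all epochs after the change
after <- toy$Time > change
toy$Time[after] <- toy$Time[after] - 60 * 60

# after the correction, consecutive epochs are again 1 min apart
print(diff(toy$Time))
```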
for(ID in levels(as.factor(as.character(DST$ID)))){
if(nrow(DST[DST$ID==ID,])==2){ # selecting only cases with time shifts associated with DST changes
if(substr(DST[DST$ID==ID,"Time"][2],1,10)%in%substr(DST.changes,1,10)[seq(1,5,2)]){ # selecting only DST changes in March
HR.1min[HR.1min$ID==ID & HR.1min$Time>DST[DST$ID==ID,"Time"][2],"Time"] <-
HR.1min[HR.1min$ID==ID & HR.1min$Time>DST[DST$ID==ID,"Time"][2],"Time"] - 1*60*60 }}}
# sanity check
DST <- as.data.frame(matrix(nrow=0,ncol=2))
for(i in 1:length(DST.changes)){
DST <- rbind(DST,HR.1min[difftime(HR.1min$Time,DST.changes[i],units="mins")>(-3) &
difftime(HR.1min$Time,DST.changes[i],units="mins")<3,c("ID","Time")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$Time%in%DST.changes,"DST"] <- TRUE
DST[order(DST$ID,DST$Time),]
Comments:
Time
shifts of 1 hour corresponding to DST changes in March
Finally, we plot epochs order against Time
for each participant in order to better inspect the pattern of missing data.
par(mfrow=c(3,3))
for(ID in levels(HR.1min$ID)){
plot((1:nrow(HR.1min[HR.1min$ID==ID,])~HR.1min[HR.1min$ID==ID,"Time"]),main=ID,xlab="",ylab="",cex=0.5) }
Comments:
most participants show several clusters of missing data during the period of participation, with participants s038, s040, s041, s052, s060, s063, s089, and s090 showing the longest and most frequent periods of missing data
a final group of epochs recorded several hours/days after the previous ones is observed in a substantial number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s028, s029, s030, s031, s033, s040, s042, s052, s053, s055, s060, s062, s063, and s064), partially consistent with what was reported in section 2.3.6 for sleepLog
data
a final group of epochs recorded several hours/days after the previous ones is observed in a substantial number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s026, s028, s029, s030, s031, s033, s038, s040, s042, s053, s055, s060, s063, and s064)
participants s038 and s089 show very few nonmissing days (13 and six, respectively)
these cases will be better discussed in the data cleaning section.
Here, we save the processed HR.1min
dataset to be used in the following steps.
save(HR.1min,file="DATA/datasets/HR.1min_timeProcessed.RData")
dailyDiary
data were recorded with Survey Sparrow (SurveySparrow Inc.), and include the daily diary reports on psychological distress and other psychosocial variables self-reported each evening by participants. dailyDiary data are stored in a dataset with one row per day, with the StartedTime
and SubmittedTime
variables indicating the survey start and submission time, respectively. Thus, in this dataset we only need to recode the StartedTime
variable, based on which the ActivityDate
variable is computed.
# recoding day and hour, and checking time and missing data points
dailyDiary <- timeCheck(data=dailyDiary,ID="ID",day="ActivityDate",hour="StartedTime",
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")
## 5133 observations in 4302 days from 93 participants:
##
## - mean No. of days/participant = 46.26 SD = 9.01 min = 23 max = 72
## - mean data collection duration (days) = 63.28 - SD = 8.47 min = 38 max = 111
##
## - mean No. of missing days per participant = 17.02 SD = 10.27 min = 0 max = 73
## - mean No. of consecutive missing days per participant = 4.14 SD = 2.59 min = 0 max = 17
##
## - No. of duplicated cases by hour (same ID and hour) = 43
# updating SubmittedTime
subTime <- timeCheck(data=dailyDiary,ID="ID",day="ActivityDate",hour="SubmittedTime",printInfo = FALSE,
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")
subTime <- subTime[order(subTime$ID,subTime$StartedTime),] # sorting by StartedTime (now sorted by SubmittedTime)
dailyDiary$SubmittedTime <- subTime$SubmittedTime
Comments:
the ActivityDate
variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 40-60 nonmissing days of data, with only a few participants showing less than 20 days
43 cases have the same participant identifier and temporal coordinate
the overall No. of days is lower than that shown by sleepLog
data, although the (substantial) No. of missing days and consecutive missing days is also lower than that shown for sleepLog
. Specifically, despite the substantial No. of missing days (i.e., 78% with 10+ missing days vs. 76% of sleepLog), the number of consecutive missing days is lower, with only three participants (s052) showing 10+ consecutive missing days. This and other cases will be better discussed in the data cleaning section.
(dailyDiary_compliance <-
timeCheck(data=dailyDiary,ID="ID",day="ActivityDate",hour="StartedTime",returnInfo=TRUE,printInfo=FALSE,
input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M"))
# plotting missing days
par(mfrow=c(1,2))
hist(dailyDiary_compliance$nMissingDays,main="No. of missing days",breaks=30)
hist(dailyDiary_compliance$maxdayDiff,main="Max No. of consecutive missing days",breaks=30)
First, we better inspect the 43 duplicated responses observed above.
dailyDiary$IDhour <- as.factor(paste(dailyDiary$ID,dailyDiary$StartedTime,sep="_"))
dup <- dailyDiary[duplicated(dailyDiary$IDhour),c("ID","ActivityDate","StartedTime","IDhour")]
cat("Detected",nrow(dup),"cases of double responses recorded between",
as.character(dup[dup$StartedTime==min(dup$StartedTime),"StartedTime"]),"and",
as.character(dup[dup$StartedTime==max(dup$StartedTime),"StartedTime"]))
## Detected 43 cases of double responses recorded between 2019-03-28 21:56:00 and 2021-03-30 13:46:00
dailyDiary[dailyDiary$IDhour%in%levels(as.factor(as.character(dup$IDhour))),
c("ID","StartedTime","SubmittedTime",colnames(dailyDiary)[c(3,10,11)])]
Comments:
each duplicated case consists of two responses with the same StartedTime
and SubmittedTime
values
critically, although duplicated cases do not systematically show missing responses, the responses differ within the same pair of duplicated cases
Here, we remove all duplicated cases by keeping only the first one.
# excluding double responses (keeping only the first one)
new.data <- dailyDiary[!duplicated(dailyDiary$IDhour),]
cat("Excluded",nrow(dailyDiary)-nrow(new.data),"double responses")
## Excluded 43 double responses
# checking again for double responses (no more cases)
new.data$IDhour <- as.factor(paste(new.data$ID,new.data$StartedTime,sep="_"))
cat("Detected",nrow(new.data)-nlevels(new.data$IDhour),"cases of double responses")
## Detected 0 cases of double responses
# updating dataset
dailyDiary <- new.data
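The exclusion above relies on the fact that `duplicated()` flags only the second and later occurrences of a value, so that negating it keeps the first response of each pair. A minimal sketch on toy data (the `toy` data frame and its `rating` column are hypothetical):

```r
# Toy diary with one double response (rows 1 and 2 share ID and StartedTime)
toy <- data.frame(
  ID = c("s001", "s001", "s001", "s002"),
  StartedTime = as.POSIXct(c("2019-04-01 21:00:00", "2019-04-01 21:00:00",
                             "2019-04-02 21:05:00", "2019-04-01 21:00:00"), tz = "GMT"),
  rating = c(2, 4, 1, 3))

# building the ID-by-time key and keeping only the first case per key
toy$IDhour <- paste(toy$ID, toy$StartedTime, sep = "_")
deduped <- toy[!duplicated(toy$IDhour), ]
print(deduped[, c("ID", "StartedTime", "rating")])
```

Because the two responses within a pair differ, the choice of which one to keep affects the retained ratings; here the first row (rating = 2) is kept and the second (rating = 4) is dropped.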
Then, we inspect the CompletionStatus
variable looking for cases of "Partial Completion"
.
# printing info
cat(nrow(dailyDiary[is.na(dailyDiary$SubmittedTime),]),"cases with missing SubmittedTime and",
nrow(dailyDiary[dailyDiary$CompletionStatus=="Partially Completed",]),"cases of Partial Completion")
## 13 cases with missing SubmittedTime and 13 cases of Partial Completion
# showing 13 cases of Partial Completion
dailyDiary[dailyDiary$CompletionStatus=="Partially Completed",
c("ID","StartedTime","SubmittedTime","CompletionStatus",colnames(dailyDiary)[c(3,10,11)])]
Comments:
in 13 cases the CompletionStatus
is "Partially Completed"
, and the SubmittedTime
is missing
however, none of these cases shows missing values in the focal variables (stress, negative mood, and worry), and thus we keep them by assigning the median surveyDuration
in the sample (see below)
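A minimal sketch of this median imputation on toy data is shown here (the `toy` data frame is hypothetical). Note that adding a number to a POSIXct value adds seconds, so a duration expressed in minutes is multiplied by 60 before being added to the start time.

```r
# Toy diary: the last case is partially completed, with missing SubmittedTime
toy <- data.frame(
  StartedTime = as.POSIXct(c("2019-04-01 21:00:00", "2019-04-01 21:10:00",
                             "2019-04-01 22:00:00", "2019-04-01 22:30:00"), tz = "GMT"),
  SubmittedTime = as.POSIXct(c("2019-04-01 21:01:00", "2019-04-01 21:13:00",
                               "2019-04-01 22:02:00", NA), tz = "GMT"),
  CompletionStatus = c("Completed", "Completed", "Completed", "Partially Completed"))

# survey duration in minutes (NA for the partially completed case)
toy$surveyDuration <- as.numeric(difftime(toy$SubmittedTime, toy$StartedTime, units = "mins"))

# assigning the sample median duration and the implied submission time
partial <- toy$CompletionStatus == "Partially Completed"
toy$surveyDuration[partial] <- median(toy$surveyDuration, na.rm = TRUE)
toy$SubmittedTime[partial] <- toy$StartedTime[partial] + toy$surveyDuration[partial] * 60
print(toy[partial, ])
```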
# creating and plotting surveyDuration (min)
dailyDiary$surveyDuration <- as.numeric(difftime(dailyDiary$SubmittedTime,dailyDiary$StartedTime,units="min"))
# imputing surveyDuration in cases of Partial Completion (sample median)
dailyDiary[dailyDiary$CompletionStatus=="Partially Completed","surveyDuration"] <- median(dailyDiary$surveyDuration,na.rm=TRUE)
dailyDiary[dailyDiary$CompletionStatus=="Partially Completed","SubmittedTime"] <-
dailyDiary[dailyDiary$CompletionStatus=="Partially Completed","StartedTime"] +
dailyDiary[dailyDiary$CompletionStatus=="Partially Completed","surveyDuration"]*60 # imputing SubmittedTime (POSIXct arithmetic is in seconds)
barplot(prop.table(table(dailyDiary$surveyDuration)),col="black",xlab="",main="Survey Duration (min)")
Comments:
most responses took about 1 min (32%) or less (66%), with only 91 cases (2%) showing a surveyDuration
> 1 min
in 17 cases the surveyDuration
was longer than 15 min, including 6 cases in which the responses were submitted more than 16h after the StartedTime
Here, we exclude these 6 cases because we do not know whether the ratings referred to the current or the following day.
# summary of surveyDuration
summary(as.factor(dailyDiary$surveyDuration))
## 0 1 2 3 4 5 6 7 8 12 14 16 27 40 45 53
## 3368 1631 49 11 6 1 1 3 1 1 1 1 1 1 1 1
## 70 161 253 439 754 758 1297 1362 1424 1432 2286 2740
## 1 1 1 1 1 1 1 1 1 1 1 1
# durations > 15 min (17)
dailyDiary[dailyDiary$surveyDuration > 15,c("ID","StartedTime","SubmittedTime","surveyDuration","CompletionStatus")]
# excluding 6 cases that submitted the responses on the following day
memory <- dailyDiary
dailyDiary <- dailyDiary[dailyDiary$surveyDuration < 1000,]
cat("Excluded",nrow(memory)-nrow(dailyDiary),"cases with surveyDuration > 17h")
## Excluded 6 cases with surveyDuration > 17h
Then, to better inspect timing and duration of the recorded diary responses, we recode the StartedTime
and SubmittedTime
variables to create StartHour
and EndHour
, indicating only the time (and not the date). Note that StartedTime
and SubmittedTime
have a minute resolution (not seconds).
dailyDiary <- StartTime_rec(data=dailyDiary,start="StartedTime",end="SubmittedTime",doPlot=TRUE,returnData=TRUE)
Comments:
StartTime
values derived from the Fitabase sleepLog
data were later than Survey Sparrow StartedTime
, which looks fine
Then, we update the ActivityDate
variable so that it indicates the previous day when the StartTime
is between 00:00 and 20:00 (N = 1,848, 36%). This helps clarify the distinction between consecutive daily reports.
# No. of surveys started between 00 and 20
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h20 <- as.POSIXct(paste(substr(Sys.time(),1,10),"20:00:00"),tz="GMT")
cat(nrow(dailyDiary[dailyDiary$StartHour>=h00 & dailyDiary$StartHour<h20,c("ID","StartedTime","StartHour","ActivityDate")]),
"cases with StartTime between midnight and 8 PM") ## 1848 cases with StartTime between midnight and 8 PM
## 1848 cases with StartTime between midnight and 8 PM
# updating ActivityDate
dailyDiary[dailyDiary$StartHour >= h00 & dailyDiary$StartHour <= h20,"ActivityDate"] <-
dailyDiary[dailyDiary$StartHour >= h00 & dailyDiary$StartHour <= h20,"ActivityDate"] - 1
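The date update above can be sketched on toy data as follows, using the 20:00 cut-off applied in the code above (the `toy` data frame and the `hourOfDay` variable are hypothetical). Surveys started after midnight but before 20:00 are assigned to the previous day, so that each evening report and its late completion share the same ActivityDate.

```r
# Toy diary start times: an evening survey, a survey filled in after midnight,
# and another evening survey on the following day
toy <- data.frame(
  StartedTime = as.POSIXct(c("2019-04-01 21:30:00", "2019-04-02 00:40:00",
                             "2019-04-02 22:10:00"), tz = "GMT"))
toy$ActivityDate <- as.Date(toy$StartedTime, tz = "GMT")

# hour of day as a decimal number (e.g., 0.67 for 00:40)
hourOfDay <- as.numeric(format(toy$StartedTime, "%H")) +
  as.numeric(format(toy$StartedTime, "%M")) / 60

# surveys started before 20:00 are assigned to the previous day
early <- hourOfDay < 20
toy$ActivityDate[early] <- toy$ActivityDate[early] - 1
print(toy)
```

In this toy example, the first two rows end up with the same ActivityDate, which is exactly the kind of double case checked for in the next step.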
We can use the updated ActivityDate
variable to check for double cases with the same ID
and ActivityDate
value.
# No. of duplicated IDday values before updating ActivityDate
nrow(dailyDiary[duplicated(dailyDiary$IDday),]) # 788
## [1] 788
# No. of duplicated IDday values after updating ActivityDate
dailyDiary$IDday <- as.factor(paste(dailyDiary$ID,dailyDiary$ActivityDate,sep="_"))
nrow(dailyDiary[duplicated(dailyDiary$IDday),]) # 139
## [1] 139
# showing duplicates
dailyDiary <- dailyDiary[order(dailyDiary$ID,dailyDiary$StartedTime),] # re-sorting by ID and time
rownames(dailyDiary) <- 1:nrow(dailyDiary)
dupl <- dailyDiary[duplicated(dailyDiary$IDday),c("IDday","ActivityDate")]
cat(nrow(dupl),"duplicated cases from",as.character(min(dupl$ActivityDate)),"to",as.character(max(dupl$ActivityDate)))
## 139 duplicated cases from 2019-02-04 to 2021-04-17
dailyDiary[dailyDiary$IDday%in%levels(as.factor(as.character(dupl$IDday))),c("IDday","StartedTime","SubmittedTime",
colnames(dailyDiary)[c(3,10)])]
Comments:
in 139 cases (3%), there are two (N = 127) or three/four (N = 12) observations with the same ID
and ActivityDate
value, probably due to technical problems with Survey Sparrow
duplicated cases are observed throughout the data collection (i.e., they are not specific to a limited period of time)
in some of these cases (e.g., participant s005 on 2019-04-01), the StartTime
values differ by 2-3 min or even less, but responses are different
Here, for each of these groups of duplicated cases, we only keep the first case (i.e., the one with the earlier StartedTime
), whereas we exclude 139 (3%) double responses.
# excluding double responses (keeping only the first one)
memory <- dailyDiary
dailyDiary <- dailyDiary[!duplicated(dailyDiary$IDday),]
cat("Excluded",nrow(memory)-nrow(dailyDiary),"double responses")
## Excluded 139 double responses
# checking again for double responses (no more cases)
dailyDiary$IDday <- as.factor(paste(dailyDiary$ID,dailyDiary$ActivityDate,sep="_"))
cat("Detected",nrow(dailyDiary)-nlevels(dailyDiary$IDday),"cases of double responses")
## Detected 0 cases of double responses
We can also use the updated ActivityDate
to check the No. of cases with StartedTime
after sleepLog
EndTime
(N = 739) (i.e., surveys answered on the following day), and the differences between the two time points (ranging from 0.5 min to 15.8h).
# checking No. of cases with StartedTime after wake up time (739)
dailyDiary$IDday <- as.factor(paste(dailyDiary$ID,dailyDiary$ActivityDate,sep="_"))
diaryAndsleep <- na.omit(plyr::join(dailyDiary[,c("IDday","StartedTime")],sleepLog[,c("IDday","EndTime")],by="IDday",type="left"))
cat(nrow(diaryAndsleep[diaryAndsleep$StartedTime>diaryAndsleep$EndTime,c("StartedTime","EndTime")]),
"cases in which participants started filling the diary after they woke up \n(",
round(100*nrow(diaryAndsleep[diaryAndsleep$StartedTime>diaryAndsleep$EndTime,])/nrow(diaryAndsleep),1),
"% of cases with matching ID and ActivityDate between dailyDiary and sleepLog)")
## 739 cases in which participants started filling the diary after they woke up
## ( 18 % of cases with matching ID and ActivityDate between dailyDiary and sleepLog)
# summarizing differences between sleepLog EndTime and dailyDiary StartedTime in these 739 cases
summary(as.numeric(difftime(diaryAndsleep[diaryAndsleep$StartedTime>diaryAndsleep$EndTime,"StartedTime"],
diaryAndsleep[diaryAndsleep$StartedTime>diaryAndsleep$EndTime,"EndTime"],units="mins")))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.50 40.75 139.00 197.54 292.50 948.00
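The comparison above combines a left join on the day identifier with a row-wise comparison of two timestamps. A minimal sketch on toy data is shown here, using base R `merge()` instead of `plyr::join` (a swapped-in but equivalent approach for this purpose; the `diary` and `sleep` data frames are hypothetical).

```r
# Toy diary and sleep log sharing the IDday key (not study data)
diary <- data.frame(
  IDday = c("s001_2019-04-01", "s001_2019-04-02", "s002_2019-04-01"),
  StartedTime = as.POSIXct(c("2019-04-01 22:00:00", "2019-04-03 09:30:00",
                             "2019-04-01 21:00:00"), tz = "GMT"))
sleep <- data.frame(
  IDday = c("s001_2019-04-01", "s001_2019-04-02"),
  EndTime = as.POSIXct(c("2019-04-02 07:00:00", "2019-04-03 07:30:00"), tz = "GMT"))

# left join, then dropping diary entries without a matching sleep log
both <- na.omit(merge(diary, sleep, by = "IDday", all.x = TRUE))

# diaries started after wake-up time, and delay in minutes
late <- both$StartedTime > both$EndTime
lateBy <- as.numeric(difftime(both$StartedTime[late], both$EndTime[late], units = "mins"))
cat(sum(late), "late case(s), delay (min):", lateBy, "\n")
```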
In the San Francisco area, the Daylight Saving Time (DST) changed on March 10th (1h forward) and November 3rd, 2019 (1h backward), and again on March 8th (1h forward) and November 1st, 2020 (1h backward), and finally on March 14th, 2021. Here, we inspect the distributions of StartHour
values in the 5 days preceding and the 5 days following each of these dates, in order to check whether time was automatically updated by the survey platform.
# setting DST changing times
DST.changes <- as.Date(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"))
# selecting cases with ActivityDate = DST.changes + or - 5 days
DST <- as.data.frame(matrix(nrow=0,ncol=4))
for(i in 1:length(DST.changes)){
DST <- rbind(DST,dailyDiary[difftime(dailyDiary$ActivityDate,DST.changes[i],units="days")>(-5) &
difftime(dailyDiary$ActivityDate,DST.changes[i],units="days")<5,
c("ID","ActivityDate","StartedTime","StartHour","SubmittedTime","EndHour","surveyDuration")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$ActivityDate%in%DST.changes,"DST"] <- TRUE
# computing time (hours) from midnight
DST$timeFrom00 <- as.POSIXct(paste(lubridate::hour(DST$StartedTime), lubridate::minute(DST$StartedTime)), format="%H %M")
DST$timeFrom00 <- as.numeric(difftime(DST$timeFrom00,
as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT"),units="hours"))
# subtracting 1 day from cases with timeFrom00 > 12
DST[DST$timeFrom00>12,"timeFrom00"] <- DST[DST$timeFrom00>12,"timeFrom00"] - 24
# plotting StartTime trends
for(i in 1:length(DST.changes)){
DSTs <- c(substr(DST.changes[i],1,7),
paste(substr(DST.changes[i],1,6),as.integer(substr(DST.changes[i],7,7))-1,sep=""))
print(ggplot(data=DST[substr(DST$ActivityDate,1,7)%in%DSTs,],aes(x=ActivityDate,y=timeFrom00)) +
geom_line(aes(colour=ID)) + geom_point(aes(colour=ID),size=3) + ggtitle(DST.changes[i]) +
geom_vline(xintercept=DST.changes[i]) +
theme(axis.text.x=element_text(angle=45),legend.position = "none"))}
Comments:
the visual inspection of StartedTime
trends in those participants who filled in the diary during the days around DST changes does not seem to suggest systematic shifts pairing with time changes
the only DST change that shows some substantial shift in StartedTime
is the third one (2019-03-10), with four participants out of five showing an increasing upward trend of five-to-ten hours on the following day
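The timeFrom00 variable plotted above expresses each start time as hours from midnight, with evening times wrapped to negative values so that 23:45 and 00:15 are plotted close to each other. A minimal sketch of this wrapping (the `hoursFromMidnight` helper is hypothetical and computes the hour of day with `format()` rather than `lubridate`):

```r
# Toy start times: two evening surveys and one survey just after midnight
startTimes <- as.POSIXct(c("2019-03-09 21:30:00", "2019-03-10 23:45:00",
                           "2019-03-11 00:15:00"), tz = "GMT")

# Hypothetical helper: hours from midnight, with times after 12:00 wrapped to negative values
hoursFromMidnight <- function(x) {
  h <- as.numeric(format(x, "%H")) + as.numeric(format(x, "%M")) / 60
  ifelse(h > 12, h - 24, h)
}

timeFrom00 <- hoursFromMidnight(startTimes)
print(timeFrom00)
```

On this scale, a 1-hour shift due to an unhandled DST change would appear as a step of about 1 unit between consecutive days.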
Finally, we plot response order against StartedTime
for each participant in order to better inspect the pattern of missing data.
par(mfrow=c(3,3))
for(ID in levels(dailyDiary$ID)){
plot((1:nrow(dailyDiary[dailyDiary$ID==ID,])~dailyDiary[dailyDiary$ID==ID,"StartedTime"]),main=ID,xlab="",ylab="") }
Comments:
in contrast to sleepLog
, sleepEBE
, and classicEBE
data, dailyDiary
data do not show evident cases of missing data clusters
Here, we save the recoded dailyDiary
dataset with the 4,945 included cases.
save(dailyDiary,file="DATA/datasets/dailyDiary_timeProcessed.RData")
Here, we remove the variables not considered for the analysis, and we recode those variables that are kept. Before recoding, we empty the working environment and reload the processed datasets.
rm(list=ls()) # emptying the working environment
# setting system time zone to GMT (for consistent temporal synchronization)
Sys.setenv(TZ="GMT")
# loading processed datasets
load("DATA/datasets/dailyAct_timeProcessed.RData") # dailyAct
load("DATA/datasets/hourlySteps_timeProcessed.RData") # hourlySteps
load("DATA/datasets/sleepLog_combined.RData") # sleepLog
load("DATA/datasets/sleepEBE_timeProcessed2.RData") # sleepEBE
load("DATA/datasets/classicEBE_timeProcessed.RData") # classicEBE
load("DATA/datasets/HR.1min_timeProcessed.RData") # HR.1min
load("DATA/datasets/dailyDiary_timeProcessed.RData") # dailyDiary
demos <- read.csv2("DATA/demographics.csv",header=TRUE) # demos
From dailyAct
, we only keep the TotalSteps
variable, and those counting the No. of minutes in each activity zone (VeryActiveMinutes
, FairlyActiveMinutes
, LightlyActiveMinutes
, and SedentaryMinutes
), whereas we discard RestingHeartRate
(due to unclear computation) and the derived measures such as Calories
.
# plotting all variables
par(mfrow=c(2,4))
for(Var in colnames(dailyAct)[4:ncol(dailyAct)]){ hist(dailyAct[,Var],main=Var,xlab="") }
# removing variables
toRemove <- c("TotalDistance","TrackerDistance","LoggedActivitiesDistance","VeryActiveDistance",
"ModeratelyActiveDistance","LightActiveDistance","SedentaryActiveDistance",
"Calories","Floors","CaloriesBMR","MarginalCalories","RestingHeartRate")
dailyAct[,toRemove] <- NULL
Then, we use the variables counting the No. of minutes in each activity zone for computing the aggregated ModerateVigorousMinutes
, and TotalActivityMinutes
, quantifying the No. of minutes in “very active” or “fairly active” physical activity, and the total No. of any activity minutes, respectively.
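As a minimal worked example of these two aggregates (the `toy` data frame is hypothetical), the sketch below also checks that the total does not exceed the 1,440 minutes in a day, which is a simple plausibility check rather than a step of the original pipeline.

```r
# Toy daily activity minutes for two hypothetical days
toy <- data.frame(VeryActiveMinutes = c(10, 0), FairlyActiveMinutes = c(20, 5),
                  LightlyActiveMinutes = c(200, 150), SedentaryMinutes = c(700, 900))

# minutes in "very active" or "fairly active" physical activity
toy$ModerateVigorousMinutes <- toy$VeryActiveMinutes + toy$FairlyActiveMinutes

# total No. of minutes across all activity zones
toy$TotalActivityMinutes <- toy$ModerateVigorousMinutes +
  toy$LightlyActiveMinutes + toy$SedentaryMinutes

# plausibility check: no more than 24h of recorded minutes per day
print(all(toy$TotalActivityMinutes <= 24 * 60))
```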
# computing ModerateVigorousMinutes
dailyAct$ModerateVigorousMinutes <- dailyAct$VeryActiveMinutes + dailyAct$FairlyActiveMinutes
# removing further unused variables
dailyAct$VeryActiveMinutes <- dailyAct$FairlyActiveMinutes <- NULL
# computing TotalActivityMinutes
dailyAct$TotalActivityMinutes <- dailyAct$ModerateVigorousMinutes + dailyAct$LightlyActiveMinutes + dailyAct$SedentaryMinutes
# plotting all included variables
par(mfrow=c(2,3))
for(Var in colnames(dailyAct)[4:ncol(dailyAct)]){ hist(dailyAct[,Var],main=Var,xlab="") }
# showing dataset (first 3 rows)
dailyAct[1:3,]
Here, we save the recoded dailyAct
dataset.
save(dailyAct,file="DATA/datasets/dailyAct_recoded.RData")
From hourlySteps
, we only remove the IDday
and IDhour
variables created above, whereas we keep the only included variable, namely StepTotal
.
# removing IDday and IDhour variables, and sorting columns
hourlySteps <- hourlySteps[,c("ID","group","ActivityDate","ActivityHour","StepTotal")]
# plotting StepTotal
hist(hourlySteps$StepTotal,xlab="",breaks=100,
main=paste("StepTotal ( min =",min(hourlySteps$StepTotal),", max =",max(hourlySteps$StepTotal),", median =",
median(hourlySteps$StepTotal),")"))
# showing dataset (first 3 rows)
hourlySteps[1:3,]
Here, we save the recoded hourlySteps
dataset.
save(hourlySteps,file="DATA/datasets/hourlySteps_recoded.RData")
From sleepLog
, we only keep the StartTime
and EndTime
variables to be used for computing sleep measures from the sleepEBE
and classicEBE
datasets. The sleep measures automatically recorded in Fitabase (i.e., MinutesAfterWakeUp
, MinutesAsleep
, MinutesToFallAsleep
, and TimeInBed
) are also kept for comparison.
# plotting all variables
par(mfrow=c(2,4))
for(Var in colnames(sleepLog)[5:ncol(sleepLog)]){ if(is.numeric(sleepLog[,Var]) | is.integer(sleepLog[,Var])){
hist(sleepLog[,Var],main=Var,xlab="") }}
# removing variables
toRemove <- c("Duration","Efficiency","ClassicAsleepCount","ClassicAsleepDuration","ClassicAwakeCount","ClassicAwakeDuration",
"ClassicRestlessCount","ClassicRestlessDuration","StagesWakeCount","StagesWakeDuration","StagesWakeThirtyDayAvg",
"StagesLightCount","StagesLightDuration","StagesLightThirtyDayAvg",
"StagesDeepCount","StagesDeepDuration","StagesDeepThirtyDayAvg",
"StagesREMCount","StagesREMDuration","StagesREMThirtyDayAvg",
"IsMainSleep",
"lastTimeChar","StartTime2","IDday","IDhour","StartHour","EndHour", # ad-hoc created variables
"combType","combSeq") # info on combined cases (keeping only combined variable)
sleepLog[,toRemove] <- NULL
Then, we use the TimeInBed
, MinutesAfterWakeUp
, MinutesToFallAsleep
, MinutesAsleep
for computing the fitabaseWASO
variable, and we remove the MinutesAfterWakeUp
variable (not considered). Finally, we sort, plot, and print the included variables.
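As a worked example of this computation, the sketch below derives fitabaseWASO on toy values (the `toy` data frame is hypothetical). It assumes, as the formula does, that time in bed is partitioned into time asleep, time to fall asleep, time in bed after the final awakening, and wake after sleep onset.

```r
# Toy sleep logs for two hypothetical nights (all values in minutes)
toy <- data.frame(TimeInBed = c(480, 455), MinutesAsleep = c(420, 400),
                  MinutesToFallAsleep = c(15, 0), MinutesAfterWakeUp = c(5, 10))

# wake after sleep onset = time in bed not spent asleep, falling asleep,
# or lying in bed after the final awakening
toy$fitabaseWASO <- toy$TimeInBed - toy$MinutesAsleep -
  toy$MinutesAfterWakeUp - toy$MinutesToFallAsleep
print(toy$fitabaseWASO)
```

Negative values would indicate that the four Fitabase measures are not mutually consistent for that log.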
# creating fitabaseWASO
sleepLog$fitabaseWASO <- sleepLog$TimeInBed - sleepLog$MinutesAsleep - sleepLog$MinutesAfterWakeUp - sleepLog$MinutesToFallAsleep
# sorting columns and removing MinutesAfterWakeUp
sleepLog <- sleepLog[,c("ID","group","ActivityDate","LogId","StartTime","EndTime","combined","combinedLogId","nCombined",
"SleepDataType","TimeInBed","MinutesAsleep","fitabaseWASO","MinutesToFallAsleep")]
# plotting all included variables
par(mfrow=c(1,4))
for(Var in colnames(sleepLog)[11:ncol(sleepLog)]){ hist(sleepLog[,Var],main=Var,xlab="") }
# showing dataset (first 3 rows)
sleepLog[1:3,]
Here, we save the recoded sleepLog
dataset.
save(sleepLog,file="DATA/datasets/sleepLog_recoded.RData")
From sleepEBE
, we only keep the SleepStage
variable, accounting for short wakes, whereas we remove the Level
and the ShortWakes
variables. The SleepStage
variable is converted as integer (i.e., 0 = wake, 1 = light, 2 = deep, 3 = REM).
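As an alternative to nested `gsub()` calls, the same stage-to-integer mapping can be sketched with a named lookup vector. This is a swapped-in technique shown for illustration on toy values; the `stageCodes` vector is hypothetical. Unlike string substitution, labels that are not in the lookup are returned as `NA` rather than silently coerced.

```r
# Toy stage labels (assumed lowercase, as in the recoding used in this section)
stage <- c("wake", "light", "deep", "rem", "light")

# named lookup vector: 0 = wake, 1 = light, 2 = deep, 3 = REM
stageCodes <- c(wake = 0L, light = 1L, deep = 2L, rem = 3L)

# as.character() guards against factor input, which would index by integer code
stageInt <- unname(stageCodes[as.character(stage)])
print(stageInt)
```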
# removing variables
toRemove <- c("Level","ShortWakes","IDday","IDhour","dayLog")
sleepEBE[,toRemove] <- NULL
# sorting columns
sleepEBE <- sleepEBE[,c("ID","group","ActivityDate","LogId","Time","SleepStage")]
# plotting SleepStage
sleepEBE$SleepStage <- as.factor(sleepEBE$SleepStage)
plot(sleepEBE$SleepStage)
# converting SleepStage as integer
sleepEBE$SleepStage <- as.integer(gsub("wake","0",gsub("light","1",gsub("deep","2",gsub("rem","3",sleepEBE$SleepStage)))))
# showing dataset (first 3 rows)
sleepEBE[1:3,]
Here, we save the recoded sleepEBE
dataset.
save(sleepEBE,file="DATA/datasets/sleepEBE_recoded.RData")
From classicEBE
, we only keep the value
variable, with possible values 1 = “asleep,” 2 = “restless,” and 3 = “awake.” Here, this variable is recoded with 0 = “wake” (i.e., both awake and restless) and 1 = “sleep”, consistent with what is done by other processing pipelines such as RAPIDS, including the Fitabase pipeline, as highlighted in this thread.
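A minimal sketch of this binary recoding on toy values (the `value` vector is hypothetical), using `ifelse()` as an alternative to string substitution:

```r
# Toy classic sleep values: 1 = asleep, 2 = restless, 3 = awake
value <- c(1, 2, 3, 1, 1)

# 1 = sleep; both restless and awake are recoded as 0 = wake
sleepBinary <- ifelse(value == 1, 1L, 0L)
print(table(sleepBinary))
```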
# removing variables
toRemove <- c("IDday","IDhour","dayLog")
classicEBE[,toRemove] <- NULL
# recoding value (2 and 3 = 0) and plotting
classicEBE$value <- as.factor(gsub("2","0",gsub("3","0",classicEBE$value)))
plot(classicEBE$value)
# converting value as integer
classicEBE$value <- as.integer(as.character(classicEBE$value))
# sorting columns
classicEBE <- classicEBE[,c("ID","group","ActivityDate","LogId","Time","value")]
# showing dataset (first 3 rows)
classicEBE[1:3,]
Here, we save the recoded classicEBE
dataset.
save(classicEBE,file="DATA/datasets/classicEBE_recoded.RData")
From HR.1min
, we only keep the Value
variable, which we rename as HR
.
# removing variables
toRemove <- c("IDday","IDhour")
HR.1min[,toRemove] <- NULL
# renaming HR and sorting variables
colnames(HR.1min)[which(colnames(HR.1min)=="Value")] <- "HR"
HR.1min <- HR.1min[,c("ID","group","ActivityDate","Time","HR")]
# plotting HR
hist(HR.1min$HR,xlab="HR (bpm)")
# showing dataset (first 3 rows)
HR.1min[1:3,]
Here, we save the recoded HR.1min
dataset.
save(HR.1min,file="DATA/datasets/HR.1min_recoded.RData")
From dailyDiary
, we only keep the self-reported variables, whereas we remove all variables describing survey or participants’ details.
# removing variables
toRemove <- c("TotalScore","CompletionStatus","IPAddress","Location","DMSLatLong","ChannelName","ChannelType","DeviceID",
"DeviceName","Browser","OS",
"ContactName","ContactPhone","ContactJobTitle","ContactEmail","ContactMobile",
"IDday","IDhour","StartHour","EndHour")
dailyDiary[,toRemove] <- NULL
# renaming variables
colnames(dailyDiary)[which(colnames(dailyDiary)=="Howstressfulwasyourday"
):which(colnames(dailyDiary)=="OtheregIamworriedaboutsomethingelsehappeningtomorrow")] <-
c("dailyStress","stress_school","stress_family","stress_health","stress_COVID","stress_peers","stress_other",
"eveningMood",
"eveningWorry","worry_school","worry_family","worry_health","worry_peer","worry_COVID","worry_sleep","worry_other")
Then, we recode self-report variables from character to numeric values.
We start with dailyStress
(i.e., “How stressful was your day?”), which is recoded from 1 = “Not at all stressful” to 5 = “Extremely stressful.” Participants were asked to indicate the sources of stress (yes/no) only when they reported dailyStress > 1.
# converting dailyStress score as numeric
dailyDiary$dailyStress <- as.numeric(gsub("Not at all stressful","1",
gsub("Not so stressful","2",
gsub("Somewhat stressful","3",
gsub("Very stressful","4",
gsub("Extremely stressful","5",
dailyDiary$dailyStress))))))
# converting dailyStress sources as binary (0/1)
for(Var in colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,7)=="stress_")]){
colnames(dailyDiary)[which(colnames(dailyDiary)==Var)] <- "var"
dailyDiary[!is.na(dailyDiary$var) & dailyDiary$var=="","var"] <- "0"
dailyDiary[!is.na(dailyDiary$var) & dailyDiary$var!="0","var"] <- "1"
dailyDiary$var <- as.numeric(dailyDiary$var)
colnames(dailyDiary)[which(colnames(dailyDiary)=="var")] <- Var }
# sanity check: 4 cases with dailyStress = 1 but stressor specification (?)
dailyDiary$stress_total <- rowSums(dailyDiary[,which(substr(colnames(dailyDiary),1,7)=="stress_")],na.rm=TRUE) # stress_total
dailyDiary[!is.na(dailyDiary$dailyStress) & dailyDiary$dailyStress==1 & dailyDiary$stress_total!=0,
c(which(colnames(dailyDiary)=="dailyStress"),which(substr(colnames(dailyDiary),1,7)=="stress_"))]
# sanity check: 12 cases with dailyStress > 1 but no stressor specification (?)
dailyDiary[!is.na(dailyDiary$dailyStress) & dailyDiary$dailyStress>1 & dailyDiary$stress_total==0,
c(which(colnames(dailyDiary)=="dailyStress"),which(substr(colnames(dailyDiary),1,7)=="stress_"))]
# plotting distribution of dailyStress scores
hist(dailyDiary$dailyStress,breaks=50,col="black",xlab="",main="dailyStress (quite skewed)")
# plotting frequency of stressor categories
dailyDiary[,which(substr(colnames(dailyDiary),1,7)=="stress_")] <-
lapply(dailyDiary[,which(substr(colnames(dailyDiary),1,7)=="stress_")],as.factor)
par(mfrow=c(2,3))
for(Var in colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,7)=="stress_")]){
colnames(dailyDiary)[which(colnames(dailyDiary)==Var)] <- "var"
barplot(prop.table(table(dailyDiary$var)),col="black",xlab="",main=Var)
colnames(dailyDiary)[which(colnames(dailyDiary)=="var")] <- Var }
# showing summary
summary(dailyDiary[!is.na(dailyDiary$dailyStress) & dailyDiary$dailyStress>1,
which(substr(colnames(dailyDiary),1,7)=="stress_")])
## stress_school stress_family stress_health stress_COVID stress_peers
## 0 :1154 0 :2385 0 :1908 0 : 660 0 :1983
## 1 :2141 1 : 403 1 : 310 1 : 99 1 : 539
## NA's: 97 NA's: 604 NA's:1174 NA's:2633 NA's: 870
##
##
##
##
## stress_other stress_total
## 0 :2081 0: 12
## 1 :1281 1:2286
## NA's: 30 2: 850
## 3: 201
## 4: 35
## 5: 4
## 6: 4
Comments:
dailyStress
ratings show a positively skewed distribution (i.e., with a longer tail towards higher values), with most of the ratings being 1 (32%) or 2 (31%). Still, most ratings were higher than 1 (68%)
the most frequently reported stressors were “school” (65%) and “other” (37%)
note that the number of missing values varies strongly across stressors (from 30 for “Other” to 2,617 for “COVID”)
in 4 cases, stressors were specified even if dailyStress
was rated as 1 (note that stressor items were supposed to be shown only when dailyStress > 1)
in 12 cases, stressors were not specified even if dailyStress
was rated as > 1
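As a side note, the nested gsub() calls used above can also be written as a lookup table. The following minimal sketch, which is not part of the original pipeline, maps each response label to its score by indexing a named vector. The responses and stress_scores objects are toy, hypothetical names introduced only for illustration.

```r
# Toy responses (hypothetical data; the labels are those recoded above)
responses <- c("Not at all stressful", "Somewhat stressful", NA,
               "Extremely stressful", "Not so stressful")

# Named lookup vector: response label -> numeric score
stress_scores <- c("Not at all stressful" = 1,
                   "Not so stressful"     = 2,
                   "Somewhat stressful"   = 3,
                   "Very stressful"       = 4,
                   "Extremely stressful"  = 5)

# Indexing by name maps each label to its score; NA and unknown labels return NA
recoded <- unname(stress_scores[responses])
print(recoded)
```

With this approach, a missing or misspelled label is returned as NA instead of being kept as text, which may make unexpected response options easier to spot.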
The same is done for eveningWorry
(i.e., “How worried do you feel right now?”), which is recorded from 1 = “Not at all worried” to 5 = “Extremely worried.” Only when participants reported eveningWorry > 1 were they asked to indicate the sources of worry (yes/no).
# converting eveningWorry scores to numeric
dailyDiary$eveningWorry <- as.numeric(gsub("Not at all worried","1",
gsub("Not so worried","2",
gsub("Somewhat worried","3",
gsub("Very worried","4",
gsub("Extremely worried","5",
dailyDiary$eveningWorry))))))
# converting eveningWorry sources to binary (0/1)
for(Var in colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,6)=="worry_")]){
colnames(dailyDiary)[which(colnames(dailyDiary)==Var)] <- "var"
dailyDiary[!is.na(dailyDiary$var) & dailyDiary$var=="","var"] <- "0"
dailyDiary[!is.na(dailyDiary$var) & dailyDiary$var!="0","var"] <- "1"
dailyDiary$var <- as.numeric(dailyDiary$var)
colnames(dailyDiary)[which(colnames(dailyDiary)=="var")] <- Var }
# sanity check: 1 case with eveningWorry = 1 but worry specification (?)
dailyDiary$worry_total <- rowSums(dailyDiary[,which(substr(colnames(dailyDiary),1,6)=="worry_")],na.rm=TRUE) # worry_total
dailyDiary[!is.na(dailyDiary$eveningWorry) & dailyDiary$eveningWorry==1 & dailyDiary$worry_total!=0,
c(which(colnames(dailyDiary)=="eveningWorry"),which(substr(colnames(dailyDiary),1,6)=="worry_"))]
# sanity check: 26 cases with eveningWorry > 1 but no worry specification (?)
dailyDiary[!is.na(dailyDiary$eveningWorry) & dailyDiary$eveningWorry>1 & dailyDiary$worry_total==0,
c(which(colnames(dailyDiary)=="eveningWorry"),which(substr(colnames(dailyDiary),1,6)=="worry_"))]
# plotting distribution of eveningWorry scores
hist(dailyDiary$eveningWorry,breaks=50,col="black",xlab="",main="eveningWorry (very skewed)")
# plotting frequency of worry categories
dailyDiary[,which(substr(colnames(dailyDiary),1,6)=="worry_")] <-
lapply(dailyDiary[,which(substr(colnames(dailyDiary),1,6)=="worry_")],as.factor)
dailyDiary$worry_total <- NULL # removing worry_total
par(mfrow=c(3,3))
for(Var in colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,6)=="worry_")]){
colnames(dailyDiary)[which(colnames(dailyDiary)==Var)] <- "var"
barplot(prop.table(table(dailyDiary$var)),col="black",xlab="",main=Var)
colnames(dailyDiary)[which(colnames(dailyDiary)=="var")] <- Var }
# showing summary
summary(dailyDiary[!is.na(dailyDiary$eveningWorry) & dailyDiary$eveningWorry>1,
which(substr(colnames(dailyDiary),1,6)=="worry_")])
## worry_school worry_family worry_health worry_peer worry_COVID worry_sleep
## 0 :1090 0 :2335 0 :1866 0 :1989 0 : 766 0 :2074
## 1 :2243 1 : 295 1 : 354 1 : 556 1 : 92 1 : 732
## NA's: 85 NA's: 788 NA's:1198 NA's: 873 NA's:2560 NA's: 612
## worry_other
## 0 :2010
## 1 :1294
## NA's: 114
Comments:
eveningWorry
ratings showed a positively skewed distribution (i.e., with a longer tail towards higher values), with most of the ratings being 1 (32%) or 2 (29%), similarly to dailyStress
. Most ratings were higher than 1 (68%)
the most frequently reported sources of worry were “school” (49%) and “other” (31%)
note that the number of missing values varies strongly across the sources of worry (from 88 for “School” to 2,625 for “COVID”)
in 1 case, the sources of worry were specified even if eveningWorry
was rated as 1 (note that worry items were supposed to be shown only when eveningWorry > 1)
in 26 cases, sources of worry were not specified even if eveningWorry
was rated as > 1
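The binary recoding and the two consistency checks described above can be illustrated on a toy dataset. This is only a sketch under stated assumptions: the diary data frame and its values are hypothetical, and we assume that an empty string marks an item that was shown but not selected, whereas NA marks an item that was not shown.

```r
# Toy diary (hypothetical values): "" = shown but not selected, text = selected, NA = not shown
diary <- data.frame(eveningWorry = c(1, 3, 2, 1, 4),
                    worry_school = c(NA, "School", "", "School", ""),
                    worry_peer   = c(NA, "", "Peers", NA, ""),
                    stringsAsFactors = FALSE)

# Columns holding the sources of worry, selected by prefix
worry_cols <- colnames(diary)[startsWith(colnames(diary), "worry_")]

# Recoding each column: empty string -> 0, any other entry -> 1, NA kept as NA
diary[worry_cols] <- lapply(diary[worry_cols], function(x) as.numeric(x != ""))

# Number of selected sources per row (missing items are counted as 0)
n_sources <- unname(rowSums(diary[worry_cols], na.rm = TRUE))

# Inconsistent rows: sources selected although rating = 1, or none selected although rating > 1
ticked_at_1  <- which(diary$eveningWorry == 1 & n_sources > 0)
none_above_1 <- which(diary$eveningWorry > 1 & n_sources == 0)
```

Selecting the columns once by prefix avoids temporarily renaming each column inside the loop, while returning the same 0/1 coding.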
Finally, we recode eveningMood
(i.e., “How is your mood right now?”) from 1 = “Very bad” to 5 = “Very good.” Here, we can see that the variable was negatively skewed (i.e., with a longer tail towards lower values), with most of the ratings being 3 (29%) or 4 (36%).
# converting as numeric
dailyDiary$eveningMood <- as.numeric(gsub("Very bad","1",
gsub("Somewhat bad","2",
gsub("Neither bad or good","3",
gsub("Somewhat good","4",
gsub("Very good","5",
dailyDiary$eveningMood))))))
# plotting
hist(dailyDiary$eveningMood,breaks=50,col="black",xlab="",main="eveningMood (slightly skewed)")
Here, we sort the variables, we display the recoded dataset, and we save the recoded dailyDiary
dataset.
# sorting variables
dailyDiary <- dailyDiary[,c("ID","group","ActivityDate","StartedTime","SubmittedTime","surveyDuration",
"dailyStress","eveningWorry","eveningMood",
colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,7)=="stress_")],
colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,6)=="worry_")])]
# showing dataset (first 3 rows)
dailyDiary[1:3,]
# saving dataset
save(dailyDiary,file="DATA/datasets/dailyDiary_recoded.RData")
From demos
, we keep the participants’ identifier, sex, age, BMI, and insomnia group, with sex and insomnia group being recoded as factors.
# recoding participants' identifier
demos$ID <- as.factor(paste("s",substr(demos$id,5,7),sep=""))
# changing variable classes
demos$sex <- gsub("0","F",gsub("1","M",demos$sex))
demos[,c("sex","insomnia","DSMinsomnia","sub_insomnia")] <-
lapply(demos[,c("sex","insomnia","DSMinsomnia","sub_insomnia")],as.factor)
demos[,c("age","BMI")] <- lapply(demos[,c("age","BMI")],as.numeric)
# sorting variables and removing unnecessary columns
demos <- demos[,c("ID","sex","age","BMI","insomnia","DSMinsomnia","sub_insomnia")] # sorting columns
# plotting variables
par(mfrow=c(2,3))
for(Var in c(colnames(demos)[2:7])){
if(is.numeric(demos[,Var])){ hist(demos[,Var],main=Var) }else{ plot(demos[,Var],main=Var) }}
Then, we recode the insomnia group variables in order to compute the insomnia.group
variable accounting for both the DSMinsomnia
(i.e., 1 if the case meets the DSM criteria for insomnia) and the sub_insomnia
variable (i.e., 1 if the case meets all but one of the DSM criteria; note that sub_insomnia
has missing values when DSMinsomnia
= 1).
# creating insomnia.group variable
demos$insomnia.group <- "control" # creating insomnia.group = control vs. sub.insomnia vs. DSM.insomnia
demos[!is.na(demos$sub_insomnia) & demos$sub_insomnia==1,"insomnia.group"] <- "sub.ins" # sub.ins when sub_insomnia = 1
demos[demos$DSMinsomnia==1,"insomnia.group"] <- "DSM.ins" # DSM.ins when DSMinsomnia = 1
demos$DSMinsomnia <- demos$sub_insomnia <- NULL # removing unnecessary variables
demos$insomnia.group <- as.factor(demos$insomnia.group)
# plotting
plot(demos$insomnia.group,main="Insomnia groups")
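The precedence rule used above (DSM cases first, then sub-threshold cases, controls otherwise) can be checked on a few toy rows. The following is a minimal sketch and not the original code: the toy data frame is hypothetical, and the explicit ordering of the factor levels is our own assumption (as.factor() would sort them alphabetically).

```r
# Toy indicators (hypothetical): sub_insomnia is missing whenever DSMinsomnia = 1
toy <- data.frame(DSMinsomnia  = c(0, 0, 1, 1, 0),
                  sub_insomnia = c(0, 1, NA, NA, 0))

# DSM cases take precedence, then sub-threshold cases; all remaining cases are controls
toy$insomnia.group <- ifelse(toy$DSMinsomnia == 1, "DSM.ins",
                      ifelse(!is.na(toy$sub_insomnia) & toy$sub_insomnia == 1,
                             "sub.ins", "control"))

# Explicit level order (an assumption made here for readability of plots and tables)
toy$insomnia.group <- factor(toy$insomnia.group, levels = c("control", "sub.ins", "DSM.ins"))

table(toy$insomnia.group)
```

Because the missing values of sub_insomnia are guarded with !is.na(), they never propagate into the grouping variable.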
Here, we show and save the recoded demos
dataset.
# showing dataset (first 3 rows)
demos[1:3,]
# saving dataset
save(demos,file="DATA/datasets/demos_recoded.RData")
Here, we aggregate the recoded datasets into the final datasets to be used for the analysis. Before aggregating, we empty the working environment and reload the processed datasets.
rm(list=ls()) # emptying the working environment
library(lubridate) # required packages
# setting system time zone to GMT (for consistent temporal synchronization)
Sys.setenv(TZ="GMT") # "TZ" in upper case: environment variable names are case-sensitive on Unix-alike systems
# loading processed datasets
load("DATA/datasets/dailyAct_recoded.RData") # dailyAct
load("DATA/datasets/hourlySteps_recoded.RData") # hourlySteps
load("DATA/datasets/sleepLog_nonComb.RData") # sleepLog_nonComb
load("DATA/datasets/sleepLog_recoded.RData") # sleepLog
load("DATA/datasets/sleepEBE_recoded.RData") # sleepEBE
load("DATA/datasets/classicEBE_recoded.RData") # classicEBE
load("DATA/datasets/LogId_special.RData") # special cases of LogId
load("DATA/datasets/HR.1min_recoded.RData") # HR.1min
load("DATA/datasets/dailyDiary_recoded.RData") # dailyDiary
load("DATA/datasets/demos_recoded.RData") # demos
First, we aggregate the hourlySteps
and dailyAct
datasets by using the former for recomputing the TotalSteps
count in the latter. The two datasets are aggregated by using the IDday
variable (i.e., accounting for participants’ ID
and ActivityDate
).
# creating common ID x day identifier
hourlySteps$IDday <- as.factor(paste(hourlySteps$ID,hourlySteps$ActivityDate,sep="_")) # subject x day identifier
dailyAct$IDday <- as.factor(paste(dailyAct$ID,dailyAct$ActivityDate,sep="_"))
# checking whether all IDday values in hourlySteps are included in dailyAct (TRUE)
cat("Sanity check:",length(levels(hourlySteps$IDday)[!(levels(hourlySteps$IDday)%in%levels(dailyAct$IDday))])==0)
## Sanity check: TRUE
# checking whether all IDday values in dailyAct are included in hourlySteps (FALSE)
cat("Sanity check:",length(levels(dailyAct$IDday)[!(levels(dailyAct$IDday)%in%levels(hourlySteps$IDday))])==0)
## Sanity check: FALSE
# IDday only included in dailyAct but not in hourlySteps (13)
levels(dailyAct$IDday)[!(levels(dailyAct$IDday)%in%levels(hourlySteps$IDday))]
## [1] "s001_2019-04-04" "s041_2019-09-07" "s042_2019-08-14" "s047_2019-10-16"
## [5] "s047_2019-10-17" "s047_2019-10-18" "s047_2019-10-19" "s047_2019-10-20"
## [9] "s047_2019-10-21" "s089_2020-10-29" "s116_2021-04-08" "s119_2021-04-22"
## [13] "s120_2021-04-09"
# marking these cases as hourlySteps=FALSE
dailyAct$hourlySteps = TRUE
dailyAct[!(dailyAct$IDday%in%levels(hourlySteps$IDday)),"hourlySteps"] <- FALSE
# recomputing total steps per day
for(i in 1:nrow(dailyAct)){
if(dailyAct[i,"hourlySteps"]==TRUE){
dailyAct[i,"TotalSteps2"] <- sum(hourlySteps[as.character(hourlySteps$IDday)==as.character(dailyAct[i,"IDday"]),
"StepTotal"])}}
# sanity check (TRUE)
cat("sanitycheck:",nrow(dailyAct[is.na(dailyAct$TotalSteps2),])==13)
## sanitycheck: TRUE
Comments:
all ID
-ActivityDate
combinations included in hourlySteps
are included in dailyAct
, whereas 13 cases (0.2%) are only included in dailyAct
but not in hourlySteps
with the exception of those 13 cases, all TotalSteps
values have been successfully recomputed from hourlySteps
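The same recomputation can also be done without looping over rows. The sketch below is only an illustration on toy data (the hourly and daily data frames and their values are hypothetical): hourly counts are summed within each ID-by-day combination and then matched to the daily dataset, so that days without hourly data are left as NA.

```r
# Toy hourly and daily data (hypothetical values)
hourly <- data.frame(IDday = c("s001_d1", "s001_d1", "s001_d2", "s002_d1"),
                     StepTotal = c(100, 250, 40, 500))
daily <- data.frame(IDday = c("s001_d1", "s001_d2", "s002_d1", "s002_d2"),
                    TotalSteps = c(360, 40, 500, 1200))

# Summing hourly steps within each ID x day combination (named numeric vector)
sums <- vapply(split(hourly$StepTotal, hourly$IDday), sum, numeric(1))

# Matching by name: days that are not included in the hourly data get NA
daily$TotalSteps2 <- unname(sums[as.character(daily$IDday)])

daily
```

In this toy example, the last day has no hourly data and thus keeps a missing TotalSteps2 value, mirroring the 13 cases noted above.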
Here, we inspect the differences between automatically computed and manually recomputed TotalSteps
.
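# sanity check (how many different rows?)
# note: the counts printed below use "<" whereas the percentages use "<=", so the two can differ slightly
dailyAct[is.na(dailyAct$TotalSteps2),"TotalSteps2"] <- dailyAct[is.na(dailyAct$TotalSteps2),"TotalSteps"] # 13 cases without hourly data keep the original TotalSteps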
dailyAct$TotalSteps_diff <- dailyAct$TotalSteps - dailyAct$TotalSteps2 # computing difference between original and recomputed
cat(nrow(dailyAct[dailyAct$TotalSteps!=dailyAct$TotalSteps2,]),"differences (", # 1,308 diff (21.73%)
round(100*nrow(dailyAct[dailyAct$TotalSteps!=dailyAct$TotalSteps2,])/nrow(dailyAct),1),"% ) ranging from",
min(dailyAct$TotalSteps_diff[dailyAct$TotalSteps_diff!=0]),"to",
max(dailyAct$TotalSteps_diff[dailyAct$TotalSteps_diff!=0]),"steps\n-",
nrow(dailyAct[dailyAct$TotalSteps_diff<10,]),"cases with a max difference of 10 or less steps (",
round(100*nrow(dailyAct[dailyAct$TotalSteps_diff<=10,])/nrow(dailyAct),1),"% ) \n-",
nrow(dailyAct[dailyAct$TotalSteps_diff<100,]),"cases with a max difference of 100 or less steps (",
round(100*nrow(dailyAct[dailyAct$TotalSteps_diff<=100,])/nrow(dailyAct),1),"% )")
## 1308 differences ( 21.7 % ) ranging from 1 to 14220 steps
## - 4949 cases with a max difference of 10 or less steps ( 82.7 % )
## - 5606 cases with a max difference of 100 or less steps ( 93.2 % )
# plotting differences
par(mfrow=c(1,3))
hist(dailyAct$TotalSteps,breaks=100,main="Automatically scored \nTotalSteps per day",xlab="steps")
hist(dailyAct$TotalSteps2,breaks=100,main="Manually scored \nTotalSteps per day",xlab="steps")
hist(dailyAct[dailyAct$TotalSteps_diff!=0,"TotalSteps_diff"],breaks=100,
main="differences between automatically\nvs. manually determined daily TotalSteps",xlab="steps")
Comments:
manually and automatically scored total steps per day differ in a substantial minority of cases (i.e., 22%)
all differences are positive, meaning that automatically scored steps are always equal to or higher than manually scored steps
most differences are close to zero (i.e., 80% < 200 steps), whereas a minority of them (6%) is higher than 1000 steps/day
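When summarizing how many differences fall within a given threshold, counts and percentages are easier to keep aligned if both are derived from the same logical vector. The following minimal sketch uses toy values; steps_diff and the share_within() helper are hypothetical names introduced only for illustration.

```r
# Toy differences between automatic and recomputed daily totals (hypothetical values)
steps_diff <- c(0, 0, 0, 4, 10, 57, 100, 1500)

# Helper (hypothetical): count and percentage computed from one logical vector
share_within <- function(x, limit) {
  within <- x <= limit
  c(n = sum(within), percent = round(100 * mean(within), 1))
}

share_within(steps_diff, 10)   # differences of 10 steps or fewer
share_within(steps_diff, 100)  # differences of 100 steps or fewer
```

Since the comparison is written only once, the reported count and percentage necessarily refer to the same set of cases.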
Here, we save the aggregated dataset. Note that only manually recomputed TotalSteps
are kept in the dailyAct
dataset.
# replacing TotalSteps with the recomputed values, and removing TotalSteps2 and TotalSteps_diff
dailyAct$TotalSteps <- dailyAct$TotalSteps2
dailyAct$TotalSteps2 <- dailyAct$TotalSteps_diff <- NULL
# showing dataset (first 3 rows)
dailyAct[1:3,]
# saving dataset
save(dailyAct,file="DATA/datasets/dailyAct_aggregated.RData")
Here, we integrate the 870 cases only included in classicEBE
and sleepLog
, but not in sleepEBE
, from the classicEBE
to the sleepEBE
dataset (saved as ClassicAndSleepLog
cases, see section 2.5.2). This is done to facilitate the sleep measures computation below. For consistency between classicEBE
value and sleepEBE
SleepStage
variables, the former was recoded as 0 = wake and 1 = sleep (corresponding to SleepStage
light) (see section 3.5).
# ClassicAndSleepLog cases (N = 870)
ClassicAndSleepLog <- LogId_special[[4]]
length(ClassicAndSleepLog)
## [1] 870
cat("sanity check:",
length(ClassicAndSleepLog)==length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId) &
!levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))
## sanity check: TRUE
# preparing datasets for aggregation
sleepEBE$SleepDataType <- "stages" # creating SleepDataType column to mark the data origin
classicEBE$SleepDataType <- "classic"
memory <- sleepEBE # saving current dataset for comparison
sleepEBE$LogId <- as.character(sleepEBE$LogId) # LogId back to character
classicEBE$LogId <- as.character(classicEBE$LogId)
Here, the 870 ClassicAndSleepLog
cases are integrated into the sleepEBE
dataset. Since classicEBE
was recorded in 60-sec epochs whereas sleepEBE
was recorded in 30-sec epochs, each epoch of the former is duplicated before merging.
# aggregating data
for(LOG in ClassicAndSleepLog){
# selecting LOG-related epochs
classicLog <- classicEBE[classicEBE$LogId==LOG,]
# duplicating each row, and adding 30 secs to every second row to obtain 30-sec epochs
classicLog_dup <- classicLog[rep(1:nrow(classicLog), rep(2,nrow(classicLog))),]
classicLog_dup$Time[seq(2,nrow(classicLog_dup),by=2)] <- classicLog_dup$Time[seq(2,nrow(classicLog_dup),by=2)] + 30
# changing column name from "value" to "SleepStage"
colnames(classicLog_dup)[which(colnames(classicLog_dup)=="value")] <- "SleepStage"
# merging
sleepEBE <- rbind(sleepEBE,classicLog_dup[,c("ID","group","LogId","Time","SleepStage","ActivityDate","SleepDataType")]) }
# back to LogId as factor, and sorting data by ID, ActivityDate and Time
sleepEBE$LogId <- as.factor(sleepEBE$LogId)
classicEBE$LogId <- as.factor(classicEBE$LogId)
sleepEBE$SleepDataType <- as.factor(sleepEBE$SleepDataType)
sleepEBE <- sleepEBE[order(sleepEBE$ID,sleepEBE$ActivityDate,sleepEBE$Time),]
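The duplication step can be verified on a toy sleep period. This is a minimal sketch on hypothetical data (the classic and dup objects are not part of the original pipeline): each 60-sec epoch is repeated twice, and every second copy is shifted by 30 secs, so that the resulting epochs are equally spaced.

```r
# Toy 60-sec epochs (hypothetical values): 0 = wake, 1 = sleep
classic <- data.frame(Time = as.POSIXct("2019-04-04 23:00:00", tz = "GMT") + 60 * 0:2,
                      value = c(0, 1, 1))

# Repeating each row twice, then shifting every second copy by 30 secs
dup <- classic[rep(seq_len(nrow(classic)), each = 2), ]
second <- seq(2, nrow(dup), by = 2)
dup$Time[second] <- dup$Time[second] + 30
rownames(dup) <- NULL

dup
```

The number of rows doubles, which is the property used in the sanity checks on the aggregated dataset.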
Here, we check whether our procedure effectively integrated ClassicAndSleepLog
cases with sleepEBE
data. This is done by using the same lines of code used in section 2.5.1.
# sanity check
data.frame(NstagesANDclassic_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
Nstages_NOsleepLogORclassic=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
Nclassic_NOsleepLogORstages=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]),
NclassicANDsleepLogNostages=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))
# is the difference between the No. of cases in the original and aggregated datasets equal to the No. of ClassicAndSleepLog cases?
cat("sanity check:",(nrow(sleepEBE)-nrow(memory))==(nrow(classicEBE[classicEBE$LogId%in%ClassicAndSleepLog,])*2))
## sanity check: TRUE
# is the difference between the No. of LogIds in the original and aggregated datasets equal to the No. of ClassicAndSleepLog?
cat("sanity check:",(nlevels(sleepEBE$LogId)-nlevels(memory$LogId))==length(ClassicAndSleepLog))
## sanity check: TRUE
# new No. of sleepEBE LogId
cat("New No. of sleepEBE LogId:",nlevels(sleepEBE$LogId))
## New No. of sleepEBE LogId: 5442
Comments:
now, no more cases are included in classicEBE
and sleepLog
but not in sleepEBE
, suggesting that data aggregation was effective
the new No. of sleepEBE
LogId
values is 5,442
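The logic of this check can be expressed with set operations. The sketch below uses toy identifiers (stages_ids, classic_ids, and log_ids are hypothetical vectors standing for the LogId values of sleepEBE, classicEBE, and sleepLog): the cases to integrate are those included in both classic and log data but not in stages data, and the same query should return an empty set after the integration.

```r
# Toy LogId sets (hypothetical values)
stages_ids  <- c("101", "102", "103")
classic_ids <- c("103", "104", "105")
log_ids     <- c("101", "102", "104")

# LogIds included in classic AND log data but NOT in stages data: cases to integrate
to_integrate <- setdiff(intersect(classic_ids, log_ids), stages_ids)

# After the integration, the same query should return an empty set
stages_ids <- union(stages_ids, to_integrate)
remaining  <- setdiff(intersect(classic_ids, log_ids), stages_ids)
```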
Then, we compare the distributions of sleep and wake values in the original and aggregated cases.
par(mfrow=c(1,3))
plot(as.factor(sleepEBE[sleepEBE$SleepDataType=="stages","SleepStage"]))
plot(as.factor(sleepEBE[sleepEBE$SleepDataType=="classic","SleepStage"]))
# stages cases with deep (2) and REM (3) collapsed into sleep (1), for comparison with the classic cases
plot(as.factor(gsub("2","1",gsub("3","1",sleepEBE[sleepEBE$SleepDataType=="stages","SleepStage"]))))
Comments:
sleepEBE
data and the integrated ClassicAndSleepLog
cases
Here, we save the aggregated dataset.
# saving dataset
save(sleepEBE,file="DATA/datasets/sleepEBEclassic_aggregated.RData")
Here, we recompute summary sleep measures by using EBE data. This will be done separately for:
those cases included in both sleepLog
and sleepEBE
(N = 5,401), whose sleep measures are computed within the StartTime
and EndTime boundaries of the combined sleep periods identified in section 2.3.3
those cases only included in sleepEBE
(and classicEBE
) (N = 41), whose sleep measures are computed using the LogId
variable to identify the epochs belonging to the same sleep period
those cases only included in classicEBE
(N = 336), whose sleep measures are computed using the LogId
variable to identify the epochs belonging to the same sleep period
In each group of cases, sleep measures are computed in line with the definitions reported by Menghini et al. (2021a), using a modified version of the ebe2sleep
R function from the associated public repository.
ebe2sleep <- function(SLEEPdata=NA,EBEdata=NA,epochLength=30,
idBased=FALSE,idCol="LogId", # new arguments added to consider TIB boundaries or LogId values
stagesCol="SleepStage",staging=TRUE,stages=c(wake=0,light=1,deep=2,REM=3),digits=2,
classicDataTypeCol=NA, classicStages=NA, # arguments added to include epochs from classicEBE
sleep.measures=c("EBEDataType","nEpochs","nFinalWake","missing_start","missing_middle",
"TIB","TST","SE","SO","WakeUp","midSleep",
"SOL","WASO","nAwake","fragIndex","light","deep","rem"),nAwake_min=5,
lastWake_exclude=TRUE,missing_asWake=TRUE){
# 1. preparing data
# ..............................................................................................................................
colnames(EBEdata) <- gsub(idCol,"idCol",colnames(EBEdata)) # setting idCol
EBEdata$idCol <- as.factor(as.character(EBEdata$idCol))
if(idBased==FALSE){ SLEEPdata[,sleep.measures] <- NA} # target columns (default NA)
# 2. ebe2sleep function (modified from Menghini et al 2021a)
# ..........................................................................................................................
EBE2SLEEP <- function(data,staging,stages,nAwake_min2){
# renaming variables
colnames(data) <- gsub(stagesCol,"stages",colnames(data))
data$stages <- as.integer(as.character(data$stages))
# setting stages as 0 = wake, 1 = light, 2 = deep, 3 = REM
if(staging==TRUE){
data$stages <- as.integer(as.character(factor(data$stages,levels=as.numeric(stages),labels=c(0,1,2,3))))
} else { # setting stages as 0 = wake, 1 = sleep when staging = FALSE
if(length(stages)>2){ stop("only two elements should be used in the stages argument when staging = FALSE,
e.g., stages = c(wake = 0, sleep = 1)") }
data$stages <- as.integer(as.character(factor(data$stages,levels=as.numeric(stages),labels=c(0,1)))) }
# function to compute sleep measures
sleepMeasures <- function(data,nAwake_min3){
# TIB = number of minutes between lights off and lights on
TIB <- nrow(data)*epochLength/60
# TST = number of minutes scored as sleep
TST <- nrow(data[data$stages!=0,])*epochLength/60
# SE = percentage of sleep time over TIB
SE <- 100*TST/TIB
# SOL = number of minutes scored as wake before the first epoch scored as sleep
SOL = 0
for(i in 1:nrow(data)){ if(data[i,"stages"]==0){ SOL = SOL + 1 } else { break } }
SOL <- SOL*epochLength/60
# WASO = number of minutes scored as wake after the first epoch scored as sleep
WASO <- data[i:nrow(data),]
WASO <- nrow(WASO[WASO$stages==0,])*epochLength/60
# SO = time of the first sleep epoch
SO = as.POSIXct(as.character(data[i,"Time"]),tz="GMT")
# WakeUp = time of the first epoch of final wake
if(tail(data$stages,1)!=0){ # when last epoch = sleep, WakeUp = last epoch + epochLength
WakeUp <- as.POSIXct(as.character(tail(data$Time,1)),tz="GMT") + epochLength
} else{
for(i in nrow(data):1){ if(data[i,"stages"]!=0){ break } }
WakeUp <- as.POSIXct(as.character(data[i,"Time"]),tz="GMT") }
# midSleep = halfway points between SO and WakeUp
midSleep <- SO + difftime(WakeUp,SO,units="secs")/2
# nAwake = number of awakenings longer than nAwake_min3; stageShift = number of sleep stage shiftings (including wake)
nAwake <- stageShift <- 0
for(i in which(data$Time==SO):(nrow(data)-nAwake_min3*60/epochLength+1)){ if(i==1){ i <- i + 1 }
if(data[i-1,"stages"]!=0 & sum(data[i:(i+nAwake_min3*60/epochLength-1),"stages"])==0){
nAwake <- nAwake + 1 }
if(data[i,"stages"]!=data[i-1,"stages"]){ stageShift <- stageShift + 1 }}
# fragIndex = number of sleep stage shifting (including wake) per hour
fragIndex <- stageShift/as.numeric(difftime(WakeUp,SO,units="hours"))
if(staging==TRUE){
# Light = number of minutes scored as Light sleep (N1 + N2)
Light <- nrow(data[data$stages==1,])*epochLength/60
# Deep = number of minutes scored as Deep sleep (N3)
Deep <- nrow(data[data$stages==2,])*epochLength/60
# REM = number of minutes scored as REM sleep
REM <- nrow(data[data$stages==3,])*epochLength/60
# sleep stages metrics when staging = TRUE
return(data.frame(TIB=TIB,TST=TST,SE=SE,
SO=as.POSIXct(as.character(SO),tz="GMT"),WakeUp=as.POSIXct(as.character(WakeUp),tz="GMT"),
midSleep=midSleep,
SOL=SOL,WASO=WASO,nAwake=nAwake,fragIndex=fragIndex,
light=Light,deep=Deep,rem=REM))
# only sleep/wake metrics when staging = FALSE
} else{ return(data.frame(TIB=TIB,TST=TST,SE=SE,
SO=as.POSIXct(as.character(SO),tz="GMT"),WakeUp=as.POSIXct(as.character(WakeUp),tz="GMT"),
midSleep=midSleep,
SOL=SOL,WASO=WASO,nAwake=nAwake,fragIndex=fragIndex,
light=NA,deep=NA,rem=NA)) }}
sleep.metrics <- sleepMeasures(data,nAwake_min3=nAwake_min2)
# rounding values and returning dataset
nums <- vapply(sleep.metrics, is.numeric, FUN.VALUE = logical(1))
sleep.metrics[,nums] <- round(sleep.metrics[,nums], digits = digits)
return(sleep.metrics)}
# 3. iteratively computing sleep measures for each considered sleep period
# ..........................................................................................................................
# 3.1. based on sleepLog TIB boundaries
# ..........................................................................................
if(idBased==FALSE){
require(tcltk)
pb <- tkProgressBar("Computing Sleep metrics:", "%",0, 100, 50) # progress bar
rownames(SLEEPdata) <- 1:nrow(SLEEPdata)
for(i in 1:nrow(SLEEPdata)){ info <- sprintf("%d%% done", round(which(rownames(SLEEPdata)==i)/nrow(SLEEPdata)*100))
setTkProgressBar(pb, round(which(rownames(SLEEPdata)==i)/nrow(SLEEPdata)*100), "Computing sleep metrics:", info)
# 3.1.1. Selecting EBE data within SLEEPdata boundaries
ebe <- EBEdata[EBEdata$ID==SLEEPdata[i,"ID"] & # same ID & bounded between StartTime and EndTime
EBEdata$Time >= SLEEPdata[i,"StartTime"] & EBEdata$Time <= SLEEPdata[i,"EndTime"],]
nEpochs <- nrow(ebe) # number of epochs
if(nEpochs>0){
# 3.1.2. excluding the last group of wake epochs or considering them as wake?
nFinalWake <- 0
if(lastWake_exclude==TRUE){
SLEEPdata[i,"EndTime"] <- tail(ebe$Time,1) # updating EndTime
if(ebe[nrow(ebe),"SleepStage"]==0){ # excluding final wake epochs
for(j in nrow(ebe):1){ if(ebe[j,"SleepStage"]=="0"){ ebe <- ebe[1:(j-1),] } else{ break }}
SLEEPdata[i,"EndTime"] <- ebe[nrow(ebe),"Time"] } # updating EndTime
nFinalWake <- nEpochs - nrow(ebe) # counting excluded nFinalWake
nEpochs <- nrow(ebe) }
# 3.1.3. Counting missing epochs
# (a) missing epochs at the beginning
missing_start <- 0
if(difftime(head(ebe$Time,1),SLEEPdata[i,"StartTime"],units="mins")!=0){
missing_start <- as.numeric(difftime(head(ebe$Time,1),SLEEPdata[i,"StartTime"],units="mins"))*2 }
# (b) missing epochs at the end (only if the last group of wake epochs is included)
missing_end <- 0
if(lastWake_exclude==FALSE){
if(difftime(SLEEPdata[i,"EndTime"],tail(ebe$Time,1),units="mins")!=0){
missing_end <- as.numeric(difftime(SLEEPdata[i,"EndTime"],tail(ebe$Time,1),units="mins"))*2 }}
# (c) missing epochs in the middle (only for combined sleep periods)
missing_middle <- 0
# 2+ Logs ..............................................
ebe$idCol <- as.factor(as.character(ebe$idCol))
LOGs <- levels(ebe$idCol)
if(length(LOGs)>1){
# creating table with LogId and StartTime
LOGtimes <- head(ebe[ebe$idCol==LOGs[1],c("idCol","Time")],1)
for(LOGorder in 2:length(LOGs)){
LOGtimes <- rbind(LOGtimes,head(ebe[ebe$idCol==LOGs[LOGorder],c("idCol","Time")],1)) }
LOGtimes <- LOGtimes[order(LOGtimes$Time),] # sorting sleep periods by StartTime
LOGtimes$newLog <- paste("LOG",1:nrow(LOGtimes),sep="")
# ebe$idCol <- as.character(ebe$idCol)
for(LOGorder in 1:nrow(LOGtimes)){
ebe[as.character(ebe$idCol)==LOGtimes[LOGorder,"idCol"],"newLog"] <- LOGtimes[LOGorder,"newLog"] }
LOGs <- levels(as.factor(ebe$newLog))
# computing missing_middle
d <- 0
for(k in 2:length(LOGs)){
d <- d + as.numeric(difftime(head(ebe[ebe$newLog==LOGs[k],"Time"],1),
tail(ebe[ebe$newLog==LOGs[k-1],"Time"],1),units="mins"))*2 }
missing_middle <- missing_middle + d
# 3.1.4. Computing and adding sleep measures
# .................................................................
# Are both "classic" and "stage" cases included in EBEdata?
if(!is.na(classicDataTypeCol)){
dataTypes <- levels(as.factor(as.character(ebe[,classicDataTypeCol])))
# a) 2+ Logs = "stages" --> staging = TRUE, stages = stages
if(length(dataTypes)==1 & dataTypes[1] == "stages"){ EBEDataType = "stages"
new.data <- EBE2SLEEP(data=ebe,staging=TRUE,stages=stages,nAwake_min2=nAwake_min)
# b) 2+ Logs = "classic" --> staging = FALSE, stages = classicStages
} else if(length(dataTypes)==1 & dataTypes[1] == "classic"){ EBEDataType = "classic"
new.data <- EBE2SLEEP(data=ebe,staging=FALSE,stages=classicStages,nAwake_min2=nAwake_min)
# c) both "stages" and "classic" sleep data types --> separately processing and summing "stages" and "classic"
} else if(length(dataTypes)==2){ EBEDataType = "mixed"
colnames(ebe) <- gsub(classicDataTypeCol,"dataType",colnames(ebe))
ebe$dataType <- as.factor(as.character(ebe$dataType))
new.data <- EBE2SLEEP(data=ebe[ebe$dataType=="stages",],staging=TRUE,stages=stages, # "stages" epochs
nAwake_min2=nAwake_min)
new.data.classic <- EBE2SLEEP(data=ebe[ebe$dataType=="classic",],staging=FALSE,stages=classicStages, # "classic" epochs
nAwake_min2=nAwake_min)
# updating variables
new.data[,c("TIB","TST","WASO")] <- # summing sleep measures
new.data[,c("TIB","TST","WASO")] + new.data.classic[,c("TIB","TST","WASO")]
new.data$SE <- round(100*new.data$TST/new.data$TIB,digits) # recomputing SE
new.data$SOL <- ifelse(new.data$SO < new.data.classic$SO, new.data$SOL, new.data.classic$SOL) # first SOL
new.data$SO <- min(new.data$SO,new.data.classic$SO) # first SO
new.data$WakeUp <- max(new.data$WakeUp,new.data.classic$WakeUp) # last WakeUp
new.data$midSleep <- new.data$SO + difftime(new.data$WakeUp,new.data$SO,units="secs")/2 # recomputing midSleep
new.data$light <- new.data$deep <- new.data$rem <- NA } # invalid sleep stage durations
# d) Unspecified classicDataTypeCol (i.e., only "classic" or "stages" cases) --> staging = staging, stages = stages
} else { EBEDataType = ifelse(staging,"stages","classic")
new.data <- EBE2SLEEP(data=ebe,staging=staging,stages=stages,nAwake_min2=nAwake_min) }
# Only 1 Log ...........................................
} else {
# d) Unspecified classicDataTypeCol (i.e., only "classic" or "stages" cases) --> staging = staging, stages = stages
if(is.na(classicDataTypeCol)){ EBEDataType = ifelse(staging,"stages","classic")
new.data <- EBE2SLEEP(data=ebe,staging=staging,stages=stages,nAwake_min2=nAwake_min)
# e) # Only 1 Log = "stages" --> staging = TRUE, stages = stages
} else { EBEDataType = "stages"
if(ebe[1,classicDataTypeCol]=="stages"){
new.data <- EBE2SLEEP(data=ebe,staging=TRUE, stages=stages,nAwake_min2=nAwake_min)
# f) Only 1 Log = "classic" --> staging = FALSE, stages = classicStages
} else { EBEDataType = "classic"
new.data <- EBE2SLEEP(data=ebe,staging=FALSE,stages=classicStages,nAwake_min2=nAwake_min) }}}
SLEEPdata[i,sleep.measures] <- cbind(data.frame(EBEDataType=EBEDataType,nEpochs=nEpochs,nFinalWake=nFinalWake,
missing_start=missing_start,missing_middle=missing_middle),
new.data)
} else{ # when no epochs are identified in the sleep period -> nEpochs=0, sleep measures = NA
SLEEPdata[i,"nEpochs"] <- 0 }}
close(pb)
} else {
# 3.2. based on LogId
# ........................................................................................
# setting empty data.frame
sleep.measures <- c("ID","group","LogId",sleep.measures)
SLEEPdata <- data.frame(matrix(ncol=length(sleep.measures),nrow=0))
colnames(SLEEPdata) <- sleep.measures
# iteratively computing sleep measures for each level of idCol
for(ID in levels(EBEdata$idCol)){
ebe <- EBEdata[EBEdata$idCol==ID,] # selecting data based on ID
nEpochs <- nrow(ebe)
# 3.2.1. excluding the last group of wake epochs?
nEpochs <- nrow(ebe)
nFinalWake <- 0
if(lastWake_exclude==TRUE){
for(j in nrow(ebe):1){
if(ebe[j,stagesCol]=="0"){ ebe <- ebe[1:j,] } else { break }}
nFinalWake = nEpochs - nrow(ebe)
nEpochs <- nrow(ebe) }
# Unspecified classicDataTypeCol (i.e., only "classic" or "stages" cases) --> staging = staging, stages = stages
if(is.na(classicDataTypeCol)){
EBEDataType = ifelse(staging,"stages","classic")
new.data <- EBE2SLEEP(data=ebe,staging=staging,stages=stages,nAwake_min2=nAwake_min)
# if both "classic" and "stages" cases are included
} else {
EBEDataType = head(ebe[,classicDataTypeCol],1)
# if EBEDataType = "stages" --> staging = TRUE, stages = stages
if(EBEDataType=="stages"){
new.data <- EBE2SLEEP(data=ebe,staging=TRUE,stages=stages,nAwake_min2=nAwake_min)
# if EBEDataType = "classic" --> staging = FALSE, stages = classicStages
} else {
new.data <- EBE2SLEEP(data=ebe,staging=FALSE,stages=classicStages,nAwake_min2=nAwake_min)
}
}
# updating dataset
SLEEPdata <- rbind(SLEEPdata,
cbind(data.frame(ID=head(ebe$ID,1),group=head(ebe$group,1),ActivityDate=head(ebe$ActivityDate,1),LogId=ID,
StartTime=head(ebe$Time,1),EndTime=tail(ebe$Time,1),
EBEDataType=EBEDataType,nEpochs=nEpochs,nFinalWake=nFinalWake,
missing_start=NA,missing_middle=NA),
new.data)) }}
# WakeUp and midSleep as POSIXct
SLEEPdata$WakeUp <- as.POSIXct(SLEEPdata$WakeUp,origin="1970-01-01",tz="GMT")
SLEEPdata$midSleep <- as.POSIXct(SLEEPdata$midSleep,origin="1970-01-01",tz="GMT")
# 4. Recomputing sleep metrics by considering missing epochs as wake? (only when idBased = FALSE)
# ..........................................................................
if(idBased==FALSE & missing_asWake==TRUE){
SLEEPdata$TIB <- SLEEPdata$TIB + SLEEPdata$missing_middle*epochLength/60 +
SLEEPdata$missing_start*epochLength/60 # TIB = tot No.epochs + missing_middle + missing_start
SLEEPdata$EndTime <- SLEEPdata$EndTime - 60*missing_end*epochLength/60 # EndTime is recoded by removing nFinalWake
SLEEPdata$SE <- round(100*SLEEPdata$TST/SLEEPdata$TIB,2) # recomputing SE
SLEEPdata$SOL <- SLEEPdata$SOL + SLEEPdata$missing_start*epochLength/60 # SOL = SOL + No. missing epochs at the beginning
SLEEPdata$SO <- as.POSIXct(SLEEPdata$SO,origin="1970-01-01",tz="GMT") # SO from timestamp code to date and hour
SLEEPdata$WASO <- SLEEPdata$WASO + SLEEPdata$missing_middle*epochLength/60 # WASO = WASO + No. missing_middle
SLEEPdata$nAwake <- SLEEPdata$nAwake + ifelse(SLEEPdata$missing_middle>0,1,0)} # adding 1 nAwake to each row with missing_middle > 0
return(SLEEPdata) }
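The exclusion of the final wake epochs can also be expressed with run-length encoding. The following is a minimal sketch under stated assumptions: `trim_final_wake()` is a hypothetical helper that is not part of `ebe2sleep`, stages are assumed to be coded with 0 = wake (as in the `stages` argument), and the toy vector is invented for illustration. The helper drops the entire final run of wake epochs, so its output should be compared with that of the loop above before substituting one for the other.

```r
# hypothetical helper: drops the trailing run of wake epochs (coded as 0)
trim_final_wake <- function(stage, wake = 0) {
  runs <- rle(stage == wake)            # runs of wake (TRUE) and sleep (FALSE)
  n_runs <- length(runs$lengths)
  # No. of wake epochs in the last run (0 if the record ends with sleep)
  n_final_wake <- if (n_runs > 0 && runs$values[n_runs]) runs$lengths[n_runs] else 0
  list(stage = head(stage, length(stage) - n_final_wake),
       nFinalWake = n_final_wake)
}

# toy record (invented): wake, sleep stages, then three final wake epochs
toy <- c(0, 1, 1, 2, 3, 1, 0, 0, 0)
res <- trim_final_wake(toy)
res$stage       # retained epochs
res$nFinalWake  # No. of excluded final wake epochs
```

Working on the whole vector avoids subsetting the data frame once per epoch, which can matter with thousands of sleep periods.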
Here, we use the boundaries of the combined sleep periods identified in section 2.2.1 to generate a dataset of sleep measures from raw EBE data. The following parameters are set:
missing epochs, either at the beginning, at the end, or in the middle (i.e., 61 cases of combined sleep periods), are considered as wake epochs
the last group of wake epochs is excluded from the computation of TIB and the other sleep measures (i.e., not considered as WASO)
the first group of wake epochs is included and considered as SOL
only sleep/wake measures (and not sleep stages measures) are computed from the 566 cases integrated from classicEBE
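How these settings translate into the adjusted metrics can be sketched with a single invented night. The values below are toy numbers, not study data; the formulas mirror step 4 of the function above, where missing epochs at the beginning are added to TIB and SOL, and missing epochs in the middle are added to TIB and WASO.

```r
epochLength <- 30   # seconds per epoch (as for sleepEBE)

# toy night (invented values, in minutes): metrics based on available epochs only
night <- data.frame(TIB = 480, TST = 420, SOL = 10, WASO = 50,
                    missing_start = 1, missing_middle = 20)

# converting the No. of missing epochs to minutes
start_min  <- night$missing_start  * epochLength / 60
middle_min <- night$missing_middle * epochLength / 60

# counting missing epochs as wake
night$TIB  <- night$TIB  + start_min + middle_min   # longer time in bed
night$SOL  <- night$SOL  + start_min                # initial gap counted as SOL
night$WASO <- night$WASO + middle_min               # middle gap counted as WASO
night$SE   <- round(100 * night$TST / night$TIB, 2) # SE decreases because TIB increases
night
```

In this toy case SOL + WASO + TST still sums to TIB after the adjustment, which is a convenient consistency check.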
# running function
sleepLog <- ebe2sleep(SLEEPdata=sleepLog,EBEdata=sleepEBE, # sleepLog and EBE data
idCol="LogId",idBased=FALSE, # not based on LogId but on sleepLog TIB limits
stagesCol="SleepStage",staging=TRUE,stages=c(wake=0,light=1,deep=2,REM=3),digits=2, # sleep staging info
classicDataTypeCol="SleepDataType",classicStages=c(wake=0,sleep=1), # new arguments to include Classic cases
epochLength=30, # epoch length (secs)
sleep.measures=c("EBEDataType","nEpochs","nFinalWake","missing_start","missing_middle",
"TIB","TST","SE","SO","WakeUp","midSleep",
"SOL","WASO","nAwake","fragIndex","light","deep","rem"), # sleep.measures to be computed
nAwake_min=5, # minimum minutes of wake epochs to count nAwake
lastWake_exclude=TRUE, # excluding last epochs of wake?
missing_asWake=TRUE) # considering missing values as wake
## Loading required package: tcltk
Then, we use the same function to compute sleep metrics from the 41 cases of sleepEBE not included in sleepLog data. For these cases, we use the LogId variable to identify the epochs belonging to the same sleep period.
# selecting cases of uniqueEBElogs
uniqueEBElogs <- LogId_special[[2]]
length(uniqueEBElogs) # 41
## [1] 41
# computing sleep measures
(sleepLog.uniqueEBE <- ebe2sleep(SLEEPdata=sleepLog,EBEdata=sleepEBE[sleepEBE$LogId%in%uniqueEBElogs,], # selecting uniqueEBElogs
idBased=TRUE,idCol="LogId", epochLength=30, # based on LogId rather than sleepLog TIB
stagesCol="SleepStage"))[1:3,] # showing first 3 lines
Then, as done for sleepLog in section 2.3.4, we update the ActivityDate variable so that it indicates the previous day when the StartTime is between 00:00 and 06:00. This makes it easier to distinguish consecutive nocturnal sleep periods.
library(lubridate)
sleepLog.uniqueEBE$StartHour <- as.POSIXct(paste(hour(sleepLog.uniqueEBE$StartTime), # computing StartHour
minute(sleepLog.uniqueEBE$StartTime)), format = "%H %M",tz="GMT")
# setting times
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h06 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")
# updating ActivityDate
sleepLog.uniqueEBE[sleepLog.uniqueEBE$StartHour >= h00 & sleepLog.uniqueEBE$StartHour <= h06,"ActivityDate"] <-
sleepLog.uniqueEBE[sleepLog.uniqueEBE$StartHour >= h00 & sleepLog.uniqueEBE$StartHour <= h06,"ActivityDate"] - 1
# sanity check: no duplicated ID-ActivityDate pairs (0)
sleepLog.uniqueEBE$StartHour <- NULL
which(duplicated(paste(sleepLog.uniqueEBE$ID,sleepLog.uniqueEBE$ActivityDate,sep="_"))==TRUE)
## integer(0)
sleepLog.uniqueEBE[1:3,] # showing first 3 rows
Comments:
the ActivityDate variable has been successfully recoded
in none of the 41 cases are there two sleep periods with the same ID and ActivityDate values
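The recoding rule can be checked on a few invented timestamps. This is a minimal base-R sketch (lubridate is not needed here) that assumes StartTime is a POSIXct vector in GMT and ActivityDate is of class Date; the toy values are not study data.

```r
# toy start times (invented): before midnight, after midnight, at 06:00, after 06:00
StartTime <- as.POSIXct(c("2020-01-02 23:30:00", "2020-01-03 01:15:00",
                          "2020-01-03 06:00:00", "2020-01-03 06:01:00"), tz = "GMT")
ActivityDate <- as.Date(StartTime, tz = "GMT")

# minutes elapsed since midnight (seconds are ignored, as in the code above)
mins <- as.numeric(format(StartTime, "%H")) * 60 + as.numeric(format(StartTime, "%M"))

# sleep periods starting between 00:00 and 06:00 are assigned to the previous day
early <- mins <= 6 * 60
ActivityDate[early] <- ActivityDate[early] - 1
data.frame(StartTime, ActivityDate, shifted = early)
```

Comparing minutes since midnight avoids building reference times from `Sys.time()`, so the result does not depend on the day the script is run.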
Then, we use the same function to compute sleep metrics from the 336 cases of classicEBE not included in sleepLog data. For these cases, we use the LogId variable to identify the epochs belonging to the same sleep period. Note that with classicEBE the epoch length should be set to 60 seconds.
# selecting cases of uniqueClassiclogs
uniqueClassiclogs <- LogId_special[[3]]
length(uniqueClassiclogs) # 336
## [1] 336
# computing sleep measures
(sleepLog.uniqueClassic <- ebe2sleep(SLEEPdata=sleepLog,EBEdata=classicEBE[classicEBE$LogId%in%uniqueClassiclogs,], # selecting uniqueClassiclogs
idBased=TRUE,idCol="LogId",epochLength=60, # epochLength = 60 sec
staging=FALSE,stages=c(wake=0,sleep=1), # staging = FALSE
stagesCol="value"))[1:3,] # showing first 3 lines
Then, as done for sleepLog in section 2.3.4, we update the ActivityDate variable so that it indicates the previous day when the StartTime is between 00:00 and 06:00. This makes it easier to distinguish consecutive nocturnal sleep periods.
sleepLog.uniqueClassic$StartHour <- as.POSIXct(paste(hour(sleepLog.uniqueClassic$StartTime), # computing StartHour
minute(sleepLog.uniqueClassic$StartTime)), format = "%H %M",tz="GMT")
# setting times
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h06 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")
# updating ActivityDate
sleepLog.uniqueClassic[sleepLog.uniqueClassic$StartHour >= h00 & sleepLog.uniqueClassic$StartHour <= h06,"ActivityDate"] <-
sleepLog.uniqueClassic[sleepLog.uniqueClassic$StartHour >= h00 & sleepLog.uniqueClassic$StartHour <= h06,"ActivityDate"] - 1
# sanity check: duplicated ID-ActivityDate pairs (2 cases of diurnal naps)
sleepLog.uniqueClassic$StartHour <- NULL
which(duplicated(paste(sleepLog.uniqueClassic$ID,sleepLog.uniqueClassic$ActivityDate,sep="_"))==TRUE)
## [1] 66 74
sleepLog.uniqueClassic$IDday <- as.factor(paste(sleepLog.uniqueClassic$ID,sleepLog.uniqueClassic$ActivityDate,sep="_"))
sleepLog.uniqueClassic[sleepLog.uniqueClassic$IDday %in%
as.character(sleepLog.uniqueClassic[duplicated(sleepLog.uniqueClassic$IDday),"IDday"]),]
sleepLog.uniqueClassic[1:3,] # showing first 3 rows
Comments:
the ActivityDate variable has been successfully recoded
in two cases there are two sleep periods with the same ID and ActivityDate values, but these cases will be removed later since they are diurnal sleep periods (see the data cleaning section)
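Listing every row that belongs to a duplicated ID-day pair, and not only the second occurrence, can be done by combining two calls to `duplicated()`. The data frame below is an invented toy example, not study data.

```r
# toy sleep log (invented): participant s001 has two periods on the same day
logs <- data.frame(ID = c("s001", "s001", "s001", "s002"),
                   ActivityDate = as.Date(c("2020-01-02", "2020-01-02",
                                            "2020-01-03", "2020-01-02")),
                   TIB = c(450, 45, 480, 430))
key <- paste(logs$ID, logs$ActivityDate, sep = "_")

# duplicated() flags only the later occurrences; adding fromLast = TRUE
# flags every row belonging to a duplicated ID-day pair
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
logs[dup, ]
```

This returns both sleep periods of each duplicated day, so that the main period and the nap can be compared side by side.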
Here, we inspect the amount of missing data in the generated dataset, and we visually compare the distributions of the obtained sleep metrics with those automatically recorded in Fitabase, and with those uniquely included in EBE data (not in sleepLog data).
First, we inspect the sleep metrics obtained by considering all epochs between the StartTime and EndTime recoded from sleepLog data.
# plotting
par(mfrow=c(2,2))
hist(sleepLog$nEpochs,breaks=30,main="No. of included nonmissing epochs",xlab="")
hist(sleepLog$nFinalWake,breaks=30,main="No. of final wake epochs\n(excluded)",xlab="")
hist(sleepLog$missing_start,breaks=30,main="No. of missing epochs at the beginning\n(included as wake)",xlab="")
hist(sleepLog$missing_middle,breaks=30,main="No. of missing epochs in the middle\n(included as wake)",xlab="")
Comments and details:
the No. of nonmissing epochs included between StartTime and EndTime shows a distribution similar in shape to that of sleepLog-based TimeInBed, ranging from 118 epochs (59 min) to 1,888 epochs (15.7 hours). All cases have nEpochs > 0.
# summary of nonmissing epochs
summary(sleepLog$nEpochs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 118.0 809.0 931.0 923.3 1049.0 1888.0
# No. of cases with missing nEpochs (0)
cat(nrow(sleepLog[is.na(sleepLog$nEpochs),]),"cases with missing nEpochs")
## 0 cases with missing nEpochs
# No. of sleepLog data with no corresponding sleepEBE data (0)
cat(nrow(sleepLog[sleepLog$nEpochs==0,]),"cases with zero nEpochs (i.e., sleepLog cases with no corresponding EBE data")
## 0 cases with zero nEpochs (i.e., sleepLog cases with no corresponding EBE data
# No. of cases with NO final wake epochs (1761, 36.2%)
cat(nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$nFinalWake==0,]),"cases (",
round(100*nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$nFinalWake==0,])/nrow(sleepLog[sleepLog$nEpochs!=0,]),1),
"% ) with NO final wake epochs\n\nSummary of nFinalWake:")
## 1761 cases ( 36.2 % ) with NO final wake epochs
##
## Summary of nFinalWake:
summary(sleepLog[sleepLog$nEpochs!=0,"nFinalWake"]) # nFinalWake in epochs (max 123 epochs = 61.5 min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 3.000 8.679 13.000 123.000
the No. of missing epochs in the middle (i.e., included in TIB, and considered as WASO) ranges from 0 (98.8%) to 535 (4.5 h). All cases with missing epochs > 0 (N = 61) are cases of combined sleep periods.
# No. of missing epochs in the middle > 0 (61)
cat(nrow(sleepLog[!is.na(sleepLog$missing_middle) & sleepLog$missing_middle>0,]),"cases (",
round(100*nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_middle>0,])/nrow(sleepLog[sleepLog$nEpochs!=0,]),1),
"% ) with 1 or more missing epochs in the middle (i.e., considered as wake)\n",
"of which",nrow(sleepLog[sleepLog$missing_middle>0 & sleepLog$combined==TRUE,]),"(",
round(100*nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_middle>0 & sleepLog$combined==TRUE,])/
nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_middle>0,]),1),
"% ) are cases of combined sleep periods\n\nSummary of missing_middle:")
## 61 cases ( 1.3 % ) with 1 or more missing epochs in the middle (i.e., considered as wake)
## of which 61 ( 100 % ) are cases of combined sleep periods
##
## Summary of missing_middle:
summary(sleepLog[sleepLog$missing_middle>0,"missing_middle"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 136.0 163.0 188.1 248.0 535.0
the No. of missing epochs at the beginning (i.e., included in TIB and considered as SOL) is either 0 (79.7%) or 1 (20.2%) in all cases, with the exception of a single case with 308 missing epochs (154 min, about 2.6 h) at the beginning (LogId = 24848200932, i.e., a case of combined sleep periods in which the first sleep period was not included in sleepEBE or classicEBE).
# missing epochs at the beginning > 1
miN <- min(sleepLog$missing_start,na.rm=TRUE)
miN2 <- min(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_start>miN,"missing_start"],na.rm=TRUE)
cat(nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_start==miN,]),"cases with",min(sleepLog$missing_start,na.rm=TRUE),
"missing epochs at the beginning","(",round(100*nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_start==miN,])/
nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$nEpochs!=0,]),1),
"% )\nWithout these cases, the min No. of missing epochs at the beginning would be",miN2,
" (",nrow(sleepLog[sleepLog$missing_start==miN2,]),
"cases,",round(100*nrow(sleepLog[sleepLog$missing_start==miN2,])/
nrow(sleepLog[sleepLog$nEpochs>0,]),1),
"% )\nwith",nrow(sleepLog[sleepLog$nEpochs>0 & sleepLog$missing_start>miN2,]),
"case showing more than",miN2,"missing epochs at the beginning (i.e., =",
sleepLog[sleepLog$nEpochs>0 & sleepLog$missing_start>miN2,"missing_start"],"missing epochs)")
## 3879 cases with 0 missing epochs at the beginning ( 79.7 % )
## Without these cases, the min No. of missing epochs at the beginning would be 1 ( 984 cases, 20.2 % )
## with 1 case showing more than 1 missing epochs at the beginning (i.e., = 308 missing epochs)
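The same breakdown can be obtained more compactly with a frequency table. This is a minimal sketch on an invented vector (not the study data) that reproduces the pattern described above: mostly 0 or 1 missing epochs at the beginning, plus a single outlier.

```r
# toy vector (invented): No. of missing epochs at the beginning of each sleep period
missing_start <- c(rep(0, 8), rep(1, 3), 308)

# absolute and relative frequencies of each observed value
tab <- table(missing_start)
round(100 * prop.table(tab), 1)

# cases above the second-smallest observed value
vals <- sort(unique(missing_start))
outliers <- missing_start[missing_start > vals[2]]
outliers
```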
Here, we take a closer look at the case with LogId marked as uniqueLogId (24848200932): a case of combined sleep periods in which only the second period (and thus, the second LogId) was included in sleepEBE data. Thus, only for this specific case, we do not consider the first period, but we recompute the sleep measures based on sleepEBE data only.
# what happened to that case marked as uniqueLogId? -> it was a case of combined sleep periods (the first part was not included)
(uniqueLogId <- LogId_special[[1]]) # case 24848200932
## [1] "24848200932"
sleepLog[sleepLog$LogId==uniqueLogId,c("ID","LogId","StartTime","EndTime","TimeInBed","TIB","combined","missing_start")]
combCase <- levels(as.factor(as.character(sleepEBE[sleepEBE$ID=="s056" & # case 24848200933
sleepEBE$Time>=sleepLog[sleepLog$LogId==uniqueLogId,"StartTime"] &
sleepEBE$Time<=sleepLog[sleepLog$LogId==uniqueLogId,"EndTime"],"LogId"])))
# removing uniqueLogId from sleepLog dataset
sleepLog <- sleepLog[sleepLog$LogId!=uniqueLogId,]
# adding case to sleepLog.uniqueEBE
sleepLog.uniqueEBE <- rbind(sleepLog.uniqueEBE,
ebe2sleep(EBEdata=sleepEBE[sleepEBE$LogId=="24848200933",], # selecting case 24848200933
idBased=TRUE,idCol="LogId", epochLength=30))
# re-plotting missing data after the adjustment
par(mfrow=c(2,2))
hist(sleepLog$nEpochs,breaks=30,main="No. of included nonmissing epochs",xlab="")
hist(sleepLog$nFinalWake,breaks=30,main="No. of final wake epochs\n(excluded)",xlab="")
hist(sleepLog$missing_start,breaks=30,main="No. of missing epochs at the beginning\n(included as wake)",xlab="")
hist(sleepLog$missing_middle,breaks=30,main="No. of missing epochs in the middle\n(included as wake)",xlab="")
Comments:
Finally, we take a closer look at the cases with missing_middle > 0 or nFinalWake > 0.
# plotting the distributions of nFinalWake > 0 and missing_middle > 0
par(mfrow=c(1,2))
hist(sleepLog[sleepLog$nFinalWake>0,"nFinalWake"],breaks=30,
main="No. of excluded final wake epochs\nhigher than 0",xlab="")
hist(sleepLog[sleepLog$missing_middle>0,"missing_middle"],
breaks=30,main="No. of missing epochs in the middle\nhigher than 0",xlab="")
Here, we inspect the differences between TIB values (computed from the No. of available epochs) and the difference between sleepLog EndTime and StartTime. Note that by setting the argument lastWake_exclude = TRUE, the function automatically updated EndTime values by setting them to the Time value of the last sleep epoch. Thus, the two variables should almost perfectly match.
# recomputing TIB as the difference (in min) between EndTime and StartTime values
sleepLog$TIB_r <- as.numeric(difftime(sleepLog$EndTime,sleepLog$StartTime,units="mins"))
# computing differences between TIB_r and TIB
sleepLog$TIB_diff <- sleepLog$TIB_r - sleepLog$TIB
# plotting
hist(sleepLog$TIB_diff,breaks=100,xlab="",main="EndTime-StartTime difference - TIB values")
# summarizing (all negative differences from -1.5 to -0.5)
summary(sleepLog$TIB_diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5000 -0.5000 -0.5000 -0.5064 -0.5000 -0.5000
The same is done for the difference between sleepLog WakeUp and StartTime.
# recomputing TIB as the difference (in min) between WakeUp and StartTime values
sleepLog$TIB_r2 <- as.numeric(difftime(sleepLog$WakeUp,sleepLog$StartTime,units="mins"))
# computing differences between TIB_r2 and TIB
sleepLog$TIB_diff2 <- sleepLog$TIB_r2 - sleepLog$TIB
# plotting
hist(sleepLog$TIB_diff2,breaks=100,xlab="",main="WakeUp-StartTime difference - TIB values")
# summarizing (only a few negative differences from -1 to -0.5)
summary(sleepLog$TIB_diff2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000000 0.000000 0.000000 -0.006375 0.000000 0.000000
# all combined cases
cat("All combined:",nrow(sleepLog[sleepLog$TIB_diff2<0 & sleepLog$combined==FALSE,])==0)
## All combined: TRUE
Finally, we inspect the differences between WakeUp - StartTime and EndTime - StartTime, as well as the differences between WakeUp and EndTime.
# computing differences between TIB_r2 and TIB_r
sleepLog$TIB_diff3 <- sleepLog$TIB_r2 - sleepLog$TIB_r
# note: despite its name, WakeUpMINUSStartTime stores the WakeUp - EndTime difference (in min)
sleepLog$WakeUpMINUSStartTime <- as.numeric(difftime(sleepLog$WakeUp,sleepLog$EndTime,units="mins"))
# plotting
par(mfrow=c(2,1))
hist(sleepLog$TIB_diff3,breaks=100,xlab="",main="WakeUp-StartTime - EndTime-StartTime differences")
hist(sleepLog$WakeUpMINUSStartTime,breaks=100,xlab="",main="WakeUp-EndTime differences")
# summarizing (all differences equal to 0.5 min)
summary(sleepLog$TIB_diff3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5 0.5 0.5 0.5 0.5 0.5
summary(sleepLog$WakeUpMINUSStartTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5 0.5 0.5 0.5 0.5 0.5
Comments:
all differences between EndTime - StartTime and TIB are negative, ranging from -1.5 to -0.5 min
in contrast, only 61 differences between WakeUp - StartTime and TIB are negative (from -1 to -0.5 min); these are all cases of combined sleep periods
all differences between WakeUp - StartTime and EndTime - StartTime are equal to 0.5 min; consistently, WakeUp time is always 30 sec after EndTime, as expected, since WakeUp indicates the first wake epoch, whereas EndTime indicates the last sleep epoch (i.e., the final wake epochs were excluded)
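The origin of the 0.5 min offsets can be reproduced on a toy record. The sketch below assumes that each Time value marks the start of its 30-sec epoch (an assumption, not something stated in the Fitabase export described here); the timestamps and stages are invented.

```r
epochLength <- 30   # seconds
# toy record (invented): 6 epochs, the last 2 scored as wake (0)
Time  <- as.POSIXct("2020-01-02 23:00:00", tz = "GMT") + epochLength * (0:5)
stage <- c(1, 1, 2, 1, 0, 0)

last_sleep <- max(which(stage != 0))   # index of the last sleep epoch
StartTime  <- Time[1]
EndTime    <- Time[last_sleep]         # start of the last sleep epoch
WakeUp     <- Time[last_sleep + 1]     # start of the first final wake epoch

# TIB counts whole epochs, so it also covers the duration of the last sleep epoch
TIB <- last_sleep * epochLength / 60

d_end  <- as.numeric(difftime(EndTime, StartTime, units = "mins")) - TIB
d_wake <- as.numeric(difftime(WakeUp,  StartTime, units = "mins")) - TIB
d_gap  <- as.numeric(difftime(WakeUp,  EndTime,   units = "mins"))
c(EndTime_vs_TIB = d_end, WakeUp_vs_TIB = d_wake, WakeUp_vs_EndTime = d_gap)
```

With n epochs, the first and last timestamps are only (n - 1) epochs apart, which is why EndTime - StartTime falls one epoch short of TIB whereas WakeUp - StartTime matches it.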
Here, we remove these small discrepancies by matching EndTime with WakeUp time (i.e., by adding 30 sec to the former), and by matching TIB values with EndTime - StartTime differences.
# adding 30 sec to each EndTime
sleepLog$EndTime <- sleepLog$EndTime + 30
# doing the same for sleepLog.uniqueEBE and sleepLog.uniqueClassic
sleepLog.uniqueEBE$EndTime <- sleepLog.uniqueEBE$EndTime + 30
sleepLog.uniqueClassic$EndTime <- sleepLog.uniqueClassic$EndTime + 30
# recomputing TIB and SE
sleepLog$TIB <- as.numeric(difftime(sleepLog$EndTime,sleepLog$StartTime,units="mins"))
sleepLog$SE <- 100*sleepLog$TST/sleepLog$TIB
# removing check variables
sleepLog$TIB_r <- sleepLog$TIB_r2 <- sleepLog$TIB_diff <- sleepLog$TIB_diff2 <- sleepLog$TIB_diff3 <-
sleepLog$WakeUpMINUSStartTime <- NULL
# sanity checks (all zero)
cat("-",nrow(sleepLog[difftime(sleepLog$WakeUp,sleepLog$EndTime)!=0,]),"differences between WakeUp and EndTime",
"\n-",nrow(sleepLog[sleepLog$TIB-as.numeric(difftime(sleepLog$EndTime,sleepLog$StartTime,units="mins"))!=0,]),
"differences between TIB and EndTime-StartTime differences",
"\n-",nrow(sleepLog[sleepLog$TIB-as.numeric(difftime(sleepLog$WakeUp,sleepLog$StartTime,units="mins"))!=0,]),
"differences between TIB and WakeUp-StartTime differences")
## - 0 differences between WakeUp and EndTime
## - 0 differences between TIB and EndTime-StartTime differences
## - 0 differences between TIB and WakeUp-StartTime differences
Comments:
now, WakeUp and EndTime perfectly match, both identifying the time of the first wake epoch (“lights-on”)
similarly, TIB values now perfectly match the differences between EndTime and StartTime
Here, we visualize the differences between the distribution of sleep metrics obtained from sleepEBE data and that of sleep measures automatically recorded in Fitabase (sleepLog).
Here, we compare the distribution of TIB values computed from sleepEBE with that of TimeInBed values, computed as sleepLog EndTime - StartTime.
# plotting TIB distributions
hist(sleepLog$TimeInBed/60,col="yellow",breaks=35,xlab="TIB (hours)",
main=paste("SleepLog- (yellow; max TIB =",round(max(sleepLog$TimeInBed/60),1),
"hours) \nand EBE-derived Time in Bed (red; max TIB =",round(max(sleepLog$TIB/60,na.rm=TRUE),1),"hours)"))
hist(sleepLog$TIB/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
Then, we replicate the plot by focusing on cases with no missing data.
# Excluding combined cases
par(mfrow=c(1,3))
hist(sleepLog[sleepLog$combined==FALSE,"TimeInBed"]/60,col="yellow",breaks=35,xlab="TIB (hours)",
main="excluding combined cases")
hist(sleepLog[sleepLog$combined==FALSE,"TIB"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding cases with nFinalWake > 0
hist(sleepLog[sleepLog$nFinalWake==0,"TimeInBed"]/60,col="yellow",breaks=35,xlab="TIB (hours)",
main="excluding cases with nFinalWake > 0")
hist(sleepLog[sleepLog$missing_start<800 & sleepLog$nFinalWake==0,"TIB"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding EBEDataType = "classic" sleep
hist(sleepLog[sleepLog$EBEDataType!="classic","TimeInBed"]/60,col="yellow",breaks=35,xlab="TIB (hours)",
main="excluding cases with \n'classic' EBEDataType")
hist(sleepLog[sleepLog$EBEDataType!="classic","TIB"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# plotting TIB differences
par(mfrow=c(2,2))
diff <- sleepLog$TimeInBed-sleepLog$TIB
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
main=paste("Differences Log-based - EBE-based TIB (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$nFinalWake==0,"TimeInBed"] - sleepLog[sleepLog$nFinalWake==0,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
main=paste("Excluding cases of missing data at the end\nmin=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"TimeInBed"] - sleepLog[sleepLog$combined==FALSE,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TimeInBed"] -
sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
Finally, we inspect the No. of cases with no missing_start, no missing_middle, and no nFinalWake, but with a difference of more than 1 hour from sleepLog TimeInBed. These are just two cases, whose sleep measures are recomputed based on sleepEBE data only.
# selecting cases with missing_start, missing_middle and nFinalWake = 0, but TIB diff > 60
(LogId <- as.character(sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0 &
sleepLog$TimeInBed - sleepLog$TIB > 60,"LogId"]))
## [1] "21318629605" "26400046971"
# removing the two selected cases from the sleepLog dataset
sleepLog <- sleepLog[!(sleepLog$LogId%in%LogId),]
# showing EBEDataType (first stages, second classic)
rbind(head(sleepEBE[sleepEBE$LogId==LogId[1],],1),tail(sleepEBE[sleepEBE$LogId==LogId[1],],1),
head(sleepEBE[sleepEBE$LogId==LogId[2],],1),tail(sleepEBE[sleepEBE$LogId==LogId[2],],1))
# adding cases to sleepLog.uniqueEBE
specialCases <- rbind(ebe2sleep(EBEdata=sleepEBE[sleepEBE$LogId%in%LogId[1],],idBased=TRUE,idCol="LogId",
epochLength=30,stagesCol="SleepStage"),
ebe2sleep(EBEdata=classicEBE[classicEBE$LogId%in%LogId[2],],idBased=TRUE,idCol="LogId",
epochLength=60,staging=FALSE,stages=c(wake=0,sleep=1),stagesCol="value"))
specialCases[2,"ActivityDate"] <- specialCases[2,"ActivityDate"] - 1 # updating ActivityDate when StartTime > midnight
sleepLog.uniqueEBE <- rbind(sleepLog.uniqueEBE,specialCases)
sleepLog.uniqueEBE <- sleepLog.uniqueEBE[order(sleepLog.uniqueEBE$ID,sleepLog.uniqueEBE$ActivityDate,
sleepLog.uniqueEBE$StartTime),]
# re-plotting TIB differences
par(mfrow=c(2,2))
diff <- sleepLog$TimeInBed-sleepLog$TIB
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
main=paste("Differences Log-based - EBE-based TIB (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$nFinalWake==0,"TimeInBed"] - sleepLog[sleepLog$nFinalWake==0,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
main=paste("Excluding cases of missing data at the end\nmin=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"TimeInBed"] - sleepLog[sleepLog$combined==FALSE,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TimeInBed"] -
sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
Comments:
the differences between sleepLog and sleepEBE TIB (i.e., 2,353 differences > 1 min) are due to cases of wake/missing epochs at the end
Here, we compare the sleep onset time computed based on EBE data with the StartTime recorded in Fitabase.
par(mfrow=c(1,3))
diff <- as.numeric(difftime(sleepLog$SO, sleepLog$StartTime, units="mins"))
hist(diff,breaks=30,
main=paste("Differences Sleep Onset time - sleepLog StartTime (min):\nmin =",
min(diff),", mean =",round(mean(diff),1),", median =",median(diff),", max =",max(diff)))
hist(sleepLog$SOL,breaks=30,
main=paste("SOL (min):\nmin =",min(sleepLog$SOL),
", mean =",round(mean(sleepLog$SOL),1),", median =",median(sleepLog$SOL),", max =",max(sleepLog$SOL)))
diffvsSOL <- diff - sleepLog$SOL
hist(diffvsSOL,breaks=30,
main=paste("Differences (SO - StartTime) - SOL (min):\nmin =",
min(diffvsSOL),", mean =",round(mean(diffvsSOL),1),", median =",median(diffvsSOL),", max =",max(diffvsSOL)))
Comments:
differences between EBE-based sleep onset and sleepLog StartTime range from 0 to 96 min, with 13% being = 0, 56% being equal to or less than 5 min, 81% being equal to or less than 10 min, and only 8% being > 15 min
these differences exactly match the corresponding SOL values
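Percentages like those reported above can be computed by averaging logical comparisons. This is a minimal sketch on an invented vector of differences (in minutes); it does not reproduce the study values.

```r
# toy SO - StartTime differences in minutes (invented values)
d <- c(0, 0.5, 3, 5, 7.5, 10, 12, 20, 45, 96)

# percentage of cases at each threshold: the mean of a logical vector is a proportion
pct <- 100 * c("= 0"   = mean(d == 0),
               "<= 5"  = mean(d <= 5),
               "<= 10" = mean(d <= 10),
               "> 15"  = mean(d > 15))
round(pct, 1)
```

Note that the "<= 5" and "<= 10" categories are cumulative, as in the comments above, so the percentages are not expected to sum to 100.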
Here, we inspect the EBE data of those cases with SO - StartTime differences higher than 60 min (N = 5):
# 5 cases with SO - StartTime > 60 min
SOdiffs <- sleepLog[as.numeric(difftime(sleepLog$SO,sleepLog$StartTime,units="mins"))>60,]
nrow(SOdiffs) # 5
## [1] 5
SOdiffs$SOvsStartTime <- as.numeric(difftime(SOdiffs$SO,SOdiffs$StartTime,units="mins"))
SOdiffs$SOvsSOL <- SOdiffs$SOvsStartTime - SOdiffs$SOL
SOdiffs[,c("ID","StartTime","SO","SOL","SOvsStartTime","SOvsSOL","combined","MinutesToFallAsleep")]
# plotting EBE data
par(mfrow=c(2,3))
for(i in 1:nrow(SOdiffs)){
plot(sleepEBE[sleepEBE$LogId==as.character(SOdiffs[i,"LogId"]),"SleepStage"],xlab="epoch",ylab="sleep stage",pch=20,
main=SOdiffs[i,"LogId"])}
Comments:
each of the highlighted cases has more than 100 initial epochs scored as wake (i.e., 50 min)
in conclusion, SO and SOL values appear to be correctly computed
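The visual check above can be complemented by counting the initial wake epochs directly. The sketch below assumes that wake is coded as 0 and that SOL corresponds to the duration of the initial run of wake epochs (an assumption, since the internals of EBE2SLEEP are not shown here); the toy record is invented.

```r
epochLength <- 30   # seconds
# toy record (invented): 4 initial wake epochs (0), then sleep with one brief awakening
stage <- c(0, 0, 0, 0, 1, 1, 2, 0, 1, 3)

runs <- rle(stage == 0)
# No. of initial wake epochs: length of the first run if it is a wake run, otherwise 0
n_initial_wake <- if (runs$values[1]) runs$lengths[1] else 0
SOL <- n_initial_wake * epochLength / 60   # minutes from StartTime to sleep onset
c(n_initial_wake = n_initial_wake, SOL = SOL)
```

Only the first run is considered, so the later awakening in the toy record does not contribute to SOL.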
Here, we inspect the distribution of SOL values compared to the MinutesToFallAsleep variable encoded in Fitabase.
# plotting SOL distributions
hist(sleepLog$SOL,col=rgb(.9,0,0,alpha=.5),breaks=35,xlab="SOL (min)",
main=paste("SleepLog- (yellow; max SOL =",round(max(sleepLog$MinutesToFallAsleep),1),
"min) \nand EBE-derived Sleep Onset Latency (red; max SOL =",round(max(sleepLog$SOL,na.rm=TRUE),1),"min)"))
hist(sleepLog$MinutesToFallAsleep,col=rgb(1,1,0,alpha=.5),breaks=35,add=TRUE)
Then, we replicate the plot by focusing on cases with no missing data.
# Excluding combined cases
par(mfrow=c(2,2))
hist(sleepLog[sleepLog$combined==FALSE,"MinutesToFallAsleep"],col="yellow",breaks=35,xlab="SOL (min)",
main="excluding combined cases")
hist(sleepLog[sleepLog$combined==FALSE,"SOL"],add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding cases with missing_start > 0 or nFinalWake > 0
hist(sleepLog[sleepLog$missing_start<800 & sleepLog$nFinalWake==0,"MinutesToFallAsleep"]/60,col="yellow",breaks=35,
xlab="SOL (hours)",main="excluding cases with \nmissing_start or nFinalWake > 0")
hist(sleepLog[sleepLog$missing_start<800 & sleepLog$nFinalWake==0,"SOL"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding EBEDataType = "classic" sleep
hist(sleepLog[sleepLog$EBEDataType!="classic","MinutesToFallAsleep"]/60,col="yellow",breaks=35,xlab="SOL (hours)",
main="excluding cases with \n'classic' EBEDataType")
hist(sleepLog[sleepLog$EBEDataType!="classic","SOL"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# plotting SOL differences
par(mfrow=c(2,2))
diff <- sleepLog$MinutesToFallAsleep-sleepLog$SOL
hist(diff,breaks=55,xlab="SOL differences (min)",cex.main=.7,
main=paste("Differences Log-based - EBE-based SOL (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesToFallAsleep"] -
sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"SOL"]
hist(diff,breaks=55,xlab="SOL differences (min)",cex.main=.7,
main=paste("Excluding cases of missing data at the beginning/end\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"MinutesToFallAsleep"] -
sleepLog[sleepLog$combined==FALSE,"SOL"]
hist(diff,breaks=55,xlab="SOL differences (min)",cex.main=.7,
main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE &
sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesToFallAsleep"] -
sleepLog[sleepLog$combined==FALSE &
sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"SOL"]
hist(diff,breaks=55,xlab="SOL differences (min)",cex.main=.7,
main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
Comments:
EBE-based SOL shows more variability and higher values than the MinutesToFallAsleep variable in sleepLog data
differences are only partially reduced by excluding cases with missing data at the beginning
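The four-panel difference plots used in these comparisons rebuild the same title for each subset. A small helper could reduce the repetition; the sketch below is illustrative only: `hist_diff()` is a hypothetical function that is not part of the original script, and the toy data are invented.

```r
# hypothetical helper: histogram of x - y with summary statistics in the title
hist_diff <- function(x, y, label, xlab = "differences (min)", plot = TRUE) {
  d <- x - y
  title <- paste0(label, "\nmin = ", min(d, na.rm = TRUE),
                  ", max = ", max(d, na.rm = TRUE),
                  ", mean = ", round(mean(d, na.rm = TRUE), 1),
                  ", median = ", median(d, na.rm = TRUE))
  if (plot) hist(d, breaks = 55, xlab = xlab, cex.main = .7, main = title)
  invisible(title)   # the title is returned so that it can be inspected
}

# toy data (invented values, in minutes)
toy <- data.frame(MinutesToFallAsleep = c(1, 15, 0, 30), SOL = c(0, 0, 0, 0),
                  combined = c(FALSE, FALSE, TRUE, FALSE))
keep <- !toy$combined
ttl <- hist_diff(toy$MinutesToFallAsleep[keep], toy$SOL[keep],
                 "Excluding combined cases", plot = FALSE)
cat(ttl)
```

Each panel then reduces to one call with a different logical filter, which makes it harder for the two subsets being subtracted to get out of sync.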
Here, we compare the total sleep time computed based on EBE data (TST) with the MinutesAsleep variable recorded in Fitabase.
# plotting TST distributions
hist(sleepLog$MinutesAsleep/60,col="yellow",breaks=35,xlab="TST (hours)",
main=paste("SleepLog- (yellow; max TST =",round(max(sleepLog$MinutesAsleep/60),1),
"hours) \nand EBE-derived Total Sleep Time (red; max TST =",round(max(sleepLog$TST/60,na.rm=TRUE),1),"hours)"))
hist(sleepLog$TST/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
Then, we replicate the plot by focusing on cases with no missing data.
# Excluding combined cases
par(mfrow=c(2,2))
hist(sleepLog[sleepLog$combined==FALSE,"MinutesAsleep"]/60,col="yellow",breaks=35,xlab="TST (hours)",
main="excluding combined cases")
hist(sleepLog[sleepLog$combined==FALSE,"TST"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding cases with missing_start > 0 or nFinalWake > 0
hist(sleepLog[sleepLog$missing_start<800 & sleepLog$nFinalWake==0,"MinutesAsleep"]/60,col="yellow",breaks=35,xlab="TST (hours)",
main="excluding cases with \nmissing_start or nFinalWake > 0")
hist(sleepLog[sleepLog$missing_start<800 & sleepLog$nFinalWake==0,"TST"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding EBEDataType = "classic" sleep
hist(sleepLog[sleepLog$EBEDataType!="classic","MinutesAsleep"]/60,col="yellow",breaks=35,xlab="TST (hours)",
main="excluding cases with \n'classic' EBEDataType")
hist(sleepLog[sleepLog$EBEDataType!="classic","TST"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# plotting TST differences
par(mfrow=c(2,2))
diff <- sleepLog$MinutesAsleep-sleepLog$TST
hist(diff,breaks=55,xlab="TST differences (min)",cex.main=.7,
main=paste("Differences Log-based - EBE-based TST (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesAsleep"] -
sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TST"]
hist(diff,breaks=55,xlab="TST differences (min)",cex.main=.7,
main=paste("Excluding cases of missing data at the beginning/end\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"MinutesAsleep"] -
sleepLog[sleepLog$combined==FALSE,"TST"]
hist(diff,breaks=55,xlab="TST differences (min)",cex.main=.7,
main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE &
sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesAsleep"] -
sleepLog[sleepLog$combined==FALSE &
sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TST"]
hist(diff,breaks=55,xlab="TST differences (min)",cex.main=.7,
main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
Comments:
TST shows results similar to those of TIB, but the differences between Fitabase- and EBE-derived TST are unaffected by cases of combined sleep, and only partially affected by cases of missing epochs at the beginning/end
differences range from -151 to 30 min, with most of them (99%) between -10 and 10 min
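The share of differences falling within a given tolerance can be quantified as in the following minimal sketch. It uses made-up values: the vectors logTST and ebeTST are hypothetical stand-ins for sleepLog$MinutesAsleep and sleepLog$TST, and the 10-min tolerance simply mirrors the range mentioned in the comments above.

```r
# hypothetical toy values (minutes); not taken from the study data
logTST <- c(420, 455, 390, 510, NA, 300)
ebeTST <- c(421, 450, 391, 660, 480, 300)
d <- logTST - ebeTST                            # Log-based minus EBE-based TST
range(d, na.rm = TRUE)                          # min and max difference
withinTol <- mean(abs(d) <= 10, na.rm = TRUE)   # proportion of cases within +/- 10 min
round(100 * withinTol)                          # percentage of cases within tolerance
```

Naming the vector d rather than diff avoids masking the base R function diff().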
Here, we compare the distribution of Wake After Sleep Onset values computed from EBE data (WASO) with those computed from sleepLog data as: TimeInBed - MinutesAfterWakeUp (not considered for computing EBE-derived sleep measures) - MinutesToFallAsleep (SOL) - MinutesAsleep (TST).
# plotting WASO distributions
sleepLog$fitabaseWASO <- sleepLog$TimeInBed - sleepLog$MinutesAsleep - sleepLog$MinutesToFallAsleep
hist(sleepLog$fitabaseWASO/60,col="yellow",breaks=35,xlab="WASO (hours)",
main=paste("SleepLog- (yellow; max WASO =",round(max(sleepLog$fitabaseWASO/60),1),
"hours) \nand EBE-derived WASO (red; max WASO =",round(max(sleepLog$WASO/60,na.rm=TRUE),1),"hours)"))
hist(sleepLog$WASO/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding combined cases
par(mfrow=c(2,2))
hist(sleepLog[sleepLog$combined==FALSE,"fitabaseWASO"]/60,col="yellow",breaks=35,xlab="WASO (hours)",
main="excluding combined cases")
hist(sleepLog[sleepLog$combined==FALSE,"WASO"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding cases with missing_start > 0 or nFinalWake > 0
hist(sleepLog[sleepLog$missing_start<800 & sleepLog$nFinalWake==0,"fitabaseWASO"]/60,col="yellow",breaks=35,
xlab="WASO (hours)",main="excluding cases with \nmissing_start or nFinalWake > 0")
hist(sleepLog[sleepLog$missing_start<800 & sleepLog$nFinalWake==0,
"WASO"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding SleepDataType = "classic" sleep
hist(sleepLog[sleepLog$SleepDataType!="classic","fitabaseWASO"]/60,col="yellow",breaks=35,xlab="WASO (hours)",
main="excluding cases with \n'classic' SleepDataType")
hist(sleepLog[sleepLog$SleepDataType!="classic","WASO"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# plotting WASO differences
par(mfrow=c(2,2))
diff <- sleepLog$fitabaseWASO-sleepLog$WASO
hist(diff,breaks=55,xlab="WASO differences (min)",cex.main=.7,
main=paste("Differences Log-based - EBE-based WASO (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"fitabaseWASO"] -
sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"WASO"]
hist(diff,breaks=55,xlab="WASO differences (min)",cex.main=.7,
main=paste("Excluding cases of missing data at the beginning/end\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"fitabaseWASO"] -
sleepLog[sleepLog$combined==FALSE,"WASO"]
hist(diff,breaks=55,xlab="WASO differences (min)",cex.main=.7,
main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE &
sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"fitabaseWASO"] -
sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"WASO"]
hist(diff,breaks=55,xlab="WASO differences (min)",cex.main=.7,
main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
Comments:
the distribution of EBE-derived WASO values has a shape similar to that of Fitabase-derived WASO values, although EBE-derived WASO shows a higher No. of cases from 1 to 2 hours. The distribution of EBE-derived WASO is centered on a slightly lower value (median = 46.5 min) than that of Fitabase-derived WASO (median = 56 min)
differences between sleepLog and sleepEBE WASO range from -26 to 156 min, and are at least partially due to missing/wake epochs at the beginning/end
As a further check, we inspect whether EBE-based WASO corresponds to EBE-based TIB - SOL - TST.
# manually recomputing WASO
sleepLog$WASO_rec <- sleepLog$TIB - sleepLog$SOL - sleepLog$TST
# plotting cases with WASO_rec exceeding WASO by more than 0.5 min
diff <- sleepLog[sleepLog$WASO_rec - sleepLog$WASO > 0.5,"WASO_rec"] -
sleepLog[sleepLog$WASO_rec - sleepLog$WASO > 0.5,"WASO"]
hist(diff,xlab="WASO_rec - WASO",breaks=10,
main=paste(length(diff),"cases with WASO_rec being more than 0.5 minutes higher than WASO\nmin =",
min(diff),", max =",max(diff)))
# all combined cases
sleepLog[sleepLog$WASO_rec-sleepLog$WASO>abs(0.5),c("ID","LogId","StartTime","TIB","SOL","TST","WASO","WASO_rec",
"combined")]
Comments:
in only 24 cases is the difference between EBE-based WASO and the manually recomputed WASO higher than 30 seconds (the tolerance allowed for StartTime rounding), ranging from 1 to 20 min
all of these cases are cases of combined sleep periods
Here, we assign the manually recomputed values (i.e., TIB - SOL - TST) to each of these 24 cases.
# assigning recomputed WASO to these cases
sleepLog[sleepLog$WASO_rec-sleepLog$WASO>abs(0.5),"WASO"] <- sleepLog[sleepLog$WASO_rec-sleepLog$WASO>abs(0.5),"WASO_rec"]
sleepLog$WASO_rec <- NULL
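As an illustration of this kind of tolerance-based replacement, the sketch below uses a symmetric variant of the check (absolute deviation above 0.5 min) and wraps it in which(), so that rows with missing values are never selected. This is an assumption made for illustration: the code above flags only cases where the recomputed value is higher. The toy data frame and its values are hypothetical.

```r
# hypothetical toy values (minutes); not taken from the study data
toy <- data.frame(WASO     = c(30, 45.2, NA, 60, 12),
                  WASO_rec = c(30.3, 45.0, 50, 80, 11))
delta <- toy$WASO_rec - toy$WASO
# which() drops NA comparisons, and abs() flags deviations in both directions
flagged <- which(abs(delta) > 0.5)
toy$WASO[flagged] <- toy$WASO_rec[flagged]   # replacing only the flagged values
toy
```

Indexing with which() rather than with a logical vector containing NA avoids selecting or overwriting rows for which the comparison cannot be evaluated.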
Similarly, we manually recompute sleep efficiency (SE) as 100*TST/TIB.
sleepLog$SE <- round(100*sleepLog$TST/sleepLog$TIB,2)
Here, we visualize the distribution of sleep metrics obtained from the 41 cases of EBE data not included in sleepLog data, to which we added the 3 further cases described above (N = 44).
# showing 44 cases
nrow(sleepLog.uniqueEBE)
## [1] 44
par(mfrow=c(2,2))
hist(sleepLog.uniqueEBE$TIB/60,breaks=35,main="TIB (hours)")
hist(sleepLog.uniqueEBE$TST/60,breaks=35,main="TST (hours)")
hist(sleepLog.uniqueEBE$WASO,breaks=35,main="WASO (min)")
hist(sleepLog.uniqueEBE$SOL,breaks=35,main="SOL (min)")
# plotting StartHour
sleepLog.uniqueEBE$StartHour <- as.POSIXct(paste(lubridate::hour(sleepLog.uniqueEBE$StartTime),
lubridate::minute(sleepLog.uniqueEBE$StartTime)),format="%H %M",tz="GMT")
par(mfrow=c(1,1))
hist(sleepLog.uniqueEBE$StartHour,breaks=35,col="black")
Comments:
no cases with StartTime between 06:00 and 18:00 are included among the 44 uniqueEBElog cases
the distributions of sleep metrics derived from the 44 cases are in line with those of the other cases obtained from EBE data included in sleepLog data
Then, consistent with what was done for sleepLog cases, we match EndTime with WakeUp values (currently separated by -0.5 to 0.5 min due to temporal approximations).
# plotting WakeUp - EndTime differences
hist(as.numeric(difftime(sleepLog.uniqueEBE$WakeUp,sleepLog.uniqueEBE$EndTime,units="mins")),
breaks=30,main="WakeUp - EndTime")
# matching EndTime with WakeUp
sleepLog.uniqueEBE$EndTime <- sleepLog.uniqueEBE$WakeUp
# plotting again WakeUp - EndTime differences
hist(as.numeric(difftime(sleepLog.uniqueEBE$WakeUp,sleepLog.uniqueEBE$EndTime,units="mins")),
breaks=30,main="WakeUp - EndTime")
# recomputing TIB and SE
sleepLog.uniqueEBE$TIB <- as.numeric(difftime(sleepLog.uniqueEBE$EndTime,sleepLog.uniqueEBE$StartTime,units="mins"))
sleepLog.uniqueEBE$SE <- 100*sleepLog.uniqueEBE$TST/sleepLog.uniqueEBE$TIB
Here, we visualize the distribution of sleep metrics obtained from the 336 cases of classic EBE data not included in sleepLog data. Among these cases, those with StartTime between 06:00 and 18:00 and those with a TST < 3 hours are to be excluded.
# showing 336 cases
nrow(sleepLog.uniqueClassic)
## [1] 336
par(mfrow=c(2,2))
hist(sleepLog.uniqueClassic$TIB/60,breaks=35,main="TIB (hours)")
hist(sleepLog.uniqueClassic$TST/60,breaks=35,main="TST (hours)")
hist(sleepLog.uniqueClassic$WASO,breaks=35,main="WASO (min)")
hist(sleepLog.uniqueClassic$SOL,breaks=35,main="SOL (min)")
# plotting StartHour
sleepLog.uniqueClassic$StartHour <- as.POSIXct(paste(lubridate::hour(sleepLog.uniqueClassic$StartTime),
lubridate::minute(sleepLog.uniqueClassic$StartTime)),format="%H %M",tz="GMT")
par(mfrow=c(1,1))
hist(sleepLog.uniqueClassic$StartHour,breaks=35,col="black")
Comments:
a few cases show StartTime between 06:00 and 18:00, or TST < 3h. These cases will be removed in the data cleaning section below
the distributions of sleep metrics derived from the 336 cases are in line with those of the other cases obtained from EBE sleep data included in sleepLog data
Then, consistent with what was done for sleepLog cases, we match EndTime with WakeUp values (currently separated by -1 to 1 min due to temporal approximations).
# plotting WakeUp - EndTime differences
hist(as.numeric(difftime(sleepLog.uniqueClassic$WakeUp,sleepLog.uniqueClassic$EndTime,units="mins")),
breaks=30,main="WakeUp - EndTime")
# matching EndTime with WakeUp
sleepLog.uniqueClassic$EndTime <- sleepLog.uniqueClassic$WakeUp
# plotting again WakeUp - EndTime differences
hist(as.numeric(difftime(sleepLog.uniqueClassic$WakeUp,sleepLog.uniqueClassic$EndTime,units="mins")),
breaks=30,main="WakeUp - EndTime")
# recomputing TIB and SE
sleepLog.uniqueClassic$TIB <- as.numeric(difftime(sleepLog.uniqueClassic$EndTime,
sleepLog.uniqueClassic$StartTime,units="mins"))
sleepLog.uniqueClassic$SE <- 100*sleepLog.uniqueClassic$TST/sleepLog.uniqueClassic$TIB
Here, we join the 44 uniqueEBElog and the 336 uniqueclassicEBE cases to the remaining 4,861 cases included in the sleepLog dataset.
# creating EBEonly logical column
sleepLog$EBEonly <- FALSE
sleepLog.uniqueEBE$EBEonly <- sleepLog.uniqueClassic$EBEonly <- TRUE
memory <- sleepLog
# merging datasets
sleepLog <- plyr::join(plyr::join(sleepLog,sleepLog.uniqueEBE,type="full"),sleepLog.uniqueClassic,type="full")
# sanity check
cat("sanity check:",nrow(sleepLog)-nrow(memory)==nrow(sleepLog.uniqueEBE)+nrow(sleepLog.uniqueClassic))
## sanity check: TRUE
# sorting by ID and date and removing the merged datasets
sleepLog <- sleepLog[order(sleepLog$ID,sleepLog$StartTime),]
rm(sleepLog.uniqueEBE,sleepLog.uniqueClassic)
Once uniqueEBElog and uniqueclassicEBE have been merged with the remaining sleepLog data, we can integrate the uniqueclassicEBE cases from classicEBE into sleepEBE, in order to have a single dataset of EBE data to be used for the analysis. Here, we apply the same procedures used in section 4.2 to integrate the 336 cases of uniqueClassicEBE with the sleepEBE dataset.
# uniqueclassicEBE cases (N = 336)
uniqueclassicEBE <- LogId_special[[3]]
length(uniqueclassicEBE)
## [1] 336
cat("sanity check:", # sanity check
length(uniqueclassicEBE)==length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId) &
!levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))
## sanity check: TRUE
# preparing datasets for aggregation
memory <- sleepEBE # saving current dataset for comparison
classicEBE$SleepDataType <- "classic"
sleepEBE$LogId <- as.character(sleepEBE$LogId) # LogId back to character
classicEBE$LogId <- as.character(classicEBE$LogId)
Here, the 336 uniqueclassicEBE cases are integrated within the sleepEBE dataset. Since classicEBE was recorded in 60-sec epochs whereas sleepEBE was recorded in 30-sec epochs, each epoch of the former is duplicated before merging.
# aggregating data
for(LOG in uniqueclassicEBE){
# selecting LOG-related epochs
classicLog <- classicEBE[classicEBE$LogId==LOG,]
# duplicating each row, and adding 30 secs to every other epoch to obtain 30-sec epochs
classicLog_dup <- classicLog[rep(1:nrow(classicLog), rep(2,nrow(classicLog))),]
classicLog_dup$Time[seq(2,nrow(classicLog_dup),by=2)] <- classicLog_dup$Time[seq(2,nrow(classicLog_dup),by=2)] + 30
# changing column name from "value" to "SleepStage"
colnames(classicLog_dup)[which(colnames(classicLog_dup)=="value")] <- "SleepStage"
# merging
sleepEBE <- rbind(sleepEBE,classicLog_dup[,c("ID","group","LogId","Time","SleepStage","ActivityDate","SleepDataType")]) }
# back to LogId as factor, and sorting data by ID, ActivityDate and Time
sleepEBE$LogId <- as.factor(sleepEBE$LogId)
classicEBE$LogId <- as.factor(classicEBE$LogId)
sleepEBE$SleepDataType <- as.factor(sleepEBE$SleepDataType)
sleepEBE <- sleepEBE[order(sleepEBE$ID,sleepEBE$ActivityDate,sleepEBE$Time),]
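The duplication of 60-sec epochs into 30-sec epochs can be illustrated on a small example. The following is only a sketch on hypothetical toy data (one sleep period with three epochs); it repeats each row twice and shifts every second copy by 30 sec, which is the same idea as in the loop above but written without indexing the even rows explicitly.

```r
# hypothetical toy data: three 60-sec epochs of one sleep period
classicLog <- data.frame(LogId = "A1",
                         Time  = as.POSIXct("2020-01-01 23:00:00", tz = "GMT") + 60 * 0:2,
                         value = c(1, 0, 1))
# repeating each row twice, then shifting every second copy by 30 sec
dup <- classicLog[rep(seq_len(nrow(classicLog)), each = 2), ]
dup$Time <- dup$Time + rep(c(0, 30), times = nrow(classicLog))
rownames(dup) <- NULL
dup
```

Each original epoch now contributes two rows with the same value, so the number of rows doubles, as assumed by the sanity checks on the aggregated dataset.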
Here, we check whether our procedure effectively integrated the uniqueclassicEBE cases with sleepEBE data. This is done by using the same lines of code used in section 2.5.1.
# sanity check
data.frame(NstagesANDclassic_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
Nstages_NOsleepLogORclassic=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
Nclassic_NOsleepLogORstages=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]),
NclassicANDsleepLogNostages=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
& !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))
# is the difference between the No. of rows in the original and aggregated datasets equal to twice the No. of uniqueclassicEBE epochs?
cat("sanity check:",(nrow(sleepEBE)-nrow(memory))==(nrow(classicEBE[classicEBE$LogId%in%uniqueclassicEBE,])*2))
## sanity check: TRUE
# is the difference between the No. of LogIds in the original and aggregated datasets equal to the No. of uniqueclassicEBE cases?
cat("sanity check:",(nlevels(sleepEBE$LogId)-nlevels(memory$LogId))==length(uniqueclassicEBE))
## sanity check: TRUE
# new No. of sleepEBE LogId
cat("New No. of sleepEBE LogId:",nlevels(sleepEBE$LogId))
## New No. of sleepEBE LogId: 5778
Comments:
now, no more cases are included in classicEBE but not in sleepEBE, suggesting that data aggregation was effective
the new No. of sleepEBE LogId values is 5,778
Then, we compare the distributions of sleep and wake values in the original and aggregated cases.
par(mfrow=c(1,3))
plot(as.factor(sleepEBE[sleepEBE$SleepDataType=="stages","SleepStage"]),main="stages")
plot(as.factor(gsub("2","1",gsub("3","1",sleepEBE[sleepEBE$SleepDataType=="stages","SleepStage"]))),main="stages (binary recoded)")
plot(as.factor(sleepEBE[sleepEBE$SleepDataType=="classic","SleepStage"]),main="classic")
Comments:
sleepEBE data and the integrated ClassicAndSleepLog cases
Here, we save the aggregated datasets.
# saving dataset with EBE-based sleep measures
save(sleepLog,file="DATA/datasets/sleepLogEBE_aggregated.RData")
# saving dataset with all EBE data
save(sleepEBE,file="DATA/datasets/sleepEBEclassic_aggregated_full.RData")
Here, we use HR.1min data for computing the average HR associated with sleep periods. Specifically, we compute the mean HR by sleep stage, that is, the average of those HR values associated with pairs of consecutive sleep epochs classified with the same stage: NREM and REM (only for cases with SleepDataType = "stages"). Mean HR of all sleep and wake epochs is also recorded and separately computed for wake epochs before and after SO, also for cases with SleepDataType = "classic".
The HRstage function is used to optimize the computation.
HRstage <- function(SLEEPdata=NA,HRdata=NA,EBEdata=NA,digits=3){ require(tcltk)
# 1. preparing data
# ...................................................................................................................
# preparing SLEEPdata
rownames(SLEEPdata) <- 1:nrow(SLEEPdata)
# initializing the stage-specific mean HR columns
SLEEPdata$stageHR_NREM <- SLEEPdata$stageHR_REM <- NA
# preparing EBEdata (joining HR values only to those couples of consecutive epochs classified with the same stage)
EBEdata <- plyr::join(EBEdata,HRdata[,c("ID","Time","HR")],by=c("ID","Time")) # joining HR values to EBEdata
EBEdata$SleepStage_rec <- EBEdata$SleepStage + 1 # adding 1 to stages for avoiding zero values
EBEdata$LogId <- as.character(EBEdata$LogId) # same LogId in cases with combined sleep periods
for(comb in levels(as.factor(SLEEPdata$combinedLogId))){
if(nchar(comb)==23){ EBEdata[EBEdata$LogId%in%strsplit(comb,split="_")[[1]],"LogId"] <-
as.character(SLEEPdata[!is.na(SLEEPdata$combinedLogId) & SLEEPdata$combinedLogId==comb,"LogId"])
} else { EBEdata[EBEdata$LogId==comb,"LogId"] <-
as.character(SLEEPdata[!is.na(SLEEPdata$combinedLogId) & SLEEPdata$combinedLogId==comb,"LogId"]) }}
EBEdata$LogId <- as.factor(EBEdata$LogId)
require(dplyr)
EBEdata <- EBEdata %>%
group_by(LogId) %>% # creating lagged variable within the same LogId
mutate(SleepStage_rec.LAG = dplyr::lag(SleepStage_rec,n=1,default=NA),
SleepStage_rec.LEAD = dplyr::lead(SleepStage_rec,n=1,default=NA))
EBEdata <- as.data.frame(EBEdata)
detach("package:dplyr", unload=TRUE)
EBEdata$sameStage.LAG <- EBEdata$SleepStage_rec - EBEdata$SleepStage_rec.LAG # sameStage.LAG = difference epochs i and i-1
EBEdata$sameStage.LEAD <- EBEdata$SleepStage_rec - EBEdata$SleepStage_rec.LEAD # sameStage.LEAD = difference epochs i and i+1
# valid HR only when sameStage.LAG or sameStage.LEAD = 0
EBEdata[!is.na(EBEdata$HR) & ((!is.na(EBEdata$sameStage.LAG) & EBEdata$sameStage.LAG == 0) |
(!is.na(EBEdata$sameStage.LEAD) & EBEdata$sameStage.LEAD == 0)),"stageHR"] <-
EBEdata[!is.na(EBEdata$HR) & ((!is.na(EBEdata$sameStage.LAG) & EBEdata$sameStage.LAG == 0) |
(!is.na(EBEdata$sameStage.LEAD) & EBEdata$sameStage.LEAD == 0)),"HR"]
# 2. iteratively computing mean HR by sleep stage
# ........................................................................................................................
pb <- tkProgressBar("Computing mean HR:", "%",0, 100, 50) # progress bar
for(i in 1:nrow(SLEEPdata)){
info <- sprintf("%d%% done", round(i/nrow(SLEEPdata)*100))
setTkProgressBar(pb, round(i/nrow(SLEEPdata)*100), "Computing mean HR:", info)
# data selection (between sleepLog StartTime and EndTime)
HRday <- HRdata[HRdata$ID == SLEEPdata[i,"ID"] &
HRdata$Time >= SLEEPdata[i,"StartTime"] & HRdata$Time <= SLEEPdata[i,"EndTime"],]
ebe <- EBEdata[EBEdata$ID==SLEEPdata[i,"ID"] & # same ID & bounded between StartTime and EndTime
EBEdata$Time >= SLEEPdata[i,"StartTime"] & EBEdata$Time <= SLEEPdata[i,"EndTime"],]
if(nrow(ebe)>0){
# sleep stage HR (only when EBEDataType is not "classic")
if(SLEEPdata[i,"EBEDataType"]!="classic" & !is.na(SLEEPdata[i,"light"])){
SLEEPdata[i,"stageHR_NREM"] <- round(mean(ebe[ebe$SleepStage==1 | ebe$SleepStage==2,"stageHR"], # mean HR NREM sleep
na.rm=TRUE),digits)
SLEEPdata[i,"nHR_NREM"] <- nrow(ebe[(ebe$SleepStage==1 | ebe$SleepStage==2) &
!is.na(ebe$stageHR),]) # No. HR epochs in NREM sleep
SLEEPdata[i,"stageHR_REM"] <- round(mean(ebe[ebe$SleepStage==3,"stageHR"],na.rm=TRUE),digits) # REM
SLEEPdata[i,"nHR_REM"] <- nrow(ebe[ebe$SleepStage==3 & !is.na(ebe$stageHR),]) }}}
close(pb) # closing progress bar
return(SLEEPdata) }
sleepLog <- HRstage(SLEEPdata=sleepLog,HRdata=HR.1min,EBEdata=sleepEBE,digits=3)
# saving dataset (not run, for saving computational time)
save(sleepLog,file="DATA/datasets/sleepLog_HRtimestage.RData")
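The core rule applied by HRstage (HR values are retained only for epochs that share their stage with the preceding or the following epoch) can be sketched in base R as follows. The stage and HR vectors are made up, and the stage coding (0 = wake, 1 and 2 = NREM, 3 = REM) is an assumption inferred from the function above; the sketch ignores the grouping by LogId and the handling of combined sleep periods.

```r
# hypothetical stage sequence for one sleep period (0 = wake, 1-2 = NREM, 3 = REM)
stage <- c(0, 1, 1, 2, 1, 3, 3, 0)
hr    <- c(70, 62, 61, 58, 60, 65, 66, 72)          # made-up HR values (bpm)
n <- length(stage)
sameAsPrev <- c(FALSE, stage[-1] == stage[-n])      # epoch i matches epoch i-1
sameAsNext <- c(stage[-n] == stage[-1], FALSE)      # epoch i matches epoch i+1
stageHR <- ifelse(sameAsPrev | sameAsNext, hr, NA)  # HR kept only within same-stage pairs
meanHR_NREM <- mean(stageHR[stage %in% c(1, 2)], na.rm = TRUE)
meanHR_REM  <- mean(stageHR[stage == 3], na.rm = TRUE)
c(NREM = meanHR_NREM, REM = meanHR_REM)
```

In this toy sequence, the isolated epochs (positions 4 and 5) are discarded, so only the two consecutive light-sleep epochs contribute to the NREM mean.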
Here, we inspect the No. and percentage of nonmissing data for each class of variables included in the sleepLog dataset: "classic" (e.g., TST), "stages" (e.g., light), and "stageHR" (e.g., stageHR_NREM).
# counting No. and % of missing data per class of variable
infoMiss <- data.frame(classic=nrow(sleepLog[!is.na(sleepLog$TST),]),stages=nrow(sleepLog[!is.na(sleepLog$light),]),
stageHR=nrow(sleepLog[!is.na(sleepLog$stageHR_NREM),]),
stagesANDstageHR=nrow(sleepLog[!is.na(sleepLog$light) & !is.na(sleepLog$stageHR_NREM),]))
infoMiss <- rbind(infoMiss,nrow(sleepLog)-infoMiss,round(100*infoMiss/nrow(sleepLog)),round(100-100*infoMiss/nrow(sleepLog)))
row.names(infoMiss) <- c("N","Missing","%","%missing")
infoMiss
Comments:
sleep stages information is missing in 778 (15%) sleepLog cases
sleep-stage HR information is missing in 809 (15%) sleepLog cases
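The structure of the infoMiss table (counts and percentages of nonmissing and missing values per class of variable) can be reproduced compactly with colSums(). The following is a minimal sketch on a hypothetical toy dataset whose column names mirror those used above; the values are made up.

```r
# hypothetical toy dataset; column names mirror those used above
toy <- data.frame(TST          = c(400, 380, NA, 420),
                  light        = c(200, NA, NA, 210),
                  stageHR_NREM = c(58, NA, NA, NA))
nonMissing <- colSums(!is.na(toy))   # No. of nonmissing values per column
summaryMiss <- rbind(N          = nonMissing,
                     Missing    = nrow(toy) - nonMissing,
                     "%"        = round(100 * nonMissing / nrow(toy)),
                     "%missing" = round(100 - 100 * nonMissing / nrow(toy)))
summaryMiss
```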
Then, we inspect the differences between the expected and observed No. of HR epochs in each sleep period. That is, we subtract the total nHR_TST value (i.e., the No. of nonmissing HR epochs in NREM and REM sleep) from the corresponding TST (in minutes).
# computing No. of nonmissing HR epochs recorded during TST
sleepLog$nHR_TST <- apply(sleepLog[,c("nHR_NREM","nHR_REM")],1,sum,na.rm=TRUE)
# computing and plotting differences between expected and observed No. of TST epochs
sleepLog$nHR_TSTdiff <- sleepLog$TST - sleepLog$nHR_TST
hist(sleepLog$nHR_TSTdiff,breaks=100,main="TST - nHR_TST (minutes)",xlab="")
# printing info
nMiss <- c(0,10,20,50,75,100,150,200,400)
for(i in 1:length(nMiss)){ cat("\n-",nrow(sleepLog[sleepLog$nHR_TSTdiff>=nMiss[i],]),
"cases with >=",nMiss[i],"missing sleepHR epochs")}
##
## - 4568 cases with >= 0 missing sleepHR epochs
## - 2273 cases with >= 10 missing sleepHR epochs
## - 953 cases with >= 20 missing sleepHR epochs
## - 824 cases with >= 50 missing sleepHR epochs
## - 806 cases with >= 75 missing sleepHR epochs
## - 797 cases with >= 100 missing sleepHR epochs
## - 777 cases with >= 150 missing sleepHR epochs
## - 765 cases with >= 200 missing sleepHR epochs
## - 556 cases with >= 400 missing sleepHR epochs
# removing columns
sleepLog$nHR_TST <- sleepLog$nHR_TSTdiff <- NULL
Comments:
These cases will be processed in the data cleaning section.
Here, we repeat the procedure above by visualizing the number of epochs and the mean HR values computed for each sleep stage, and for the wake epochs preceding and following sleep onset.
# dataset in a long form (1 row per sleep stage)
library(tidyr)
sleepLog_long <- sleepLog[,c("ID","LogId","EBEDataType",paste("stageHR",c("NREM","REM"),sep="_"))] %>%
pivot_longer(stageHR_NREM:stageHR_REM, names_to = "stage", values_to = "HR") # long-form dataset of HR values
nEpochs <- sleepLog[,c("ID","LogId",paste("nHR",c("NREM","REM"),sep="_"))] %>%
pivot_longer(nHR_NREM:nHR_REM, names_to = "stage", values_to = "nEpochs") # long-form dataset of No. of nonmissing HR epochs
detach("package:tidyr", unload=TRUE)
sleepLog_long$nEpochs <- nEpochs$nEpochs
sleepLog_long$stage <- as.factor(gsub("stageHR_","",sleepLog_long$stage)) # sleep stage as factor (removing the "stageHR_" prefix)
sleepLog_long[is.na(sleepLog_long$nEpochs),"nEpochs"] <- 0 # NA nEpochs are converted as zero
# plotting HR
p1 <- ggplot(sleepLog_long,(aes(x=stage,y=HR))) + geom_violin(fill="salmon") +
stat_summary(fun=mean, geom="point", shape=20, size=5, col="darkred") + ggtitle("Mean HR by sleep stage") +
xlab("Sleep stage") + ylab("HR (bpm)") + geom_boxplot(width=0.4,alpha=0.2)
# plotting No. included epochs by sleep stage
p2 <- ggplot(sleepLog_long,(aes(x=stage,y=nEpochs))) + geom_boxplot() +
stat_summary(fun=mean, geom="point", shape=20, size=2) + ggtitle("No. of nonmissing 1-min HR epochs \nper sleep stage")+
xlab("Sleep stage") + ylab("No. of nonmissing HR epochs")
# showing plots
grid.arrange(p1,p2,nrow=1)
Comments:
mean HR distributions show slightly lower HR for NREM than for REM sleep
no extreme HR values are observed
the number of nonmissing HR epochs is higher for NREM sleep (from 104 to 1,152, mean = 558 when EBEDataType is not "classic") than for REM sleep (from 0 to 236, mean = 85 when EBEDataType is not "classic")
Here, we save the aggregated dataset.
# saving dataset with EBE-based sleep measures
save(sleepLog,file="DATA/datasets/sleepLogEBEHR_aggregated.RData")
Here, we create the variable IDday to be used for merging the sleepLog and the dailyAct datasets. Note that the sleepLog ActivityDate was recoded to refer to the previous day when StartTime is between midnight and 6 AM (see sections 2.3.4, 4.3.2, and 4.3.3).
# creating common variable IDday
dailyAct$IDday <- as.factor(paste(dailyAct$ID,dailyAct$ActivityDate,sep="_"))
sleepLog$IDday <- as.factor(paste(sleepLog$ID,sleepLog$ActivityDate,sep="_"))
# sanity checks
cat("Sanity check:",nlevels(dailyAct$IDday)==nrow(dailyAct)) # dailyAct: no cases with the same IDday value
## Sanity check: TRUE
cat("Sanity check:",nrow(sleepLog)-nlevels(sleepLog$IDday)) # 4 double cases
## Sanity check: 4
sleepLog[sleepLog$IDday%in%sleepLog$IDday[duplicated(sleepLog$IDday)],c("IDday","StartTime","EndTime","TST","stageHR_NREM")]
# correcting N = 1 sleepLog case with StartTime between 00:00 and 06:00 but ActivityDate not adjusted
sleepLog[as.numeric(substr(sleepLog$StartTime,12,13)) >= 0 & as.numeric(substr(sleepLog$StartTime,12,13)) <= 6 &
substr(sleepLog$StartTime,9,10)==substr(sleepLog$ActivityDate,9,10),"ActivityDate"] <-
sleepLog[as.numeric(substr(sleepLog$StartTime,12,13)) >= 0 & as.numeric(substr(sleepLog$StartTime,12,13)) <= 6 &
substr(sleepLog$StartTime,9,10)==substr(sleepLog$ActivityDate,9,10),"ActivityDate"] - 1
# re-creating common variable IDday and sanity check
sleepLog$IDday <- as.factor(paste(sleepLog$ID,sleepLog$ActivityDate,sep="_"))
cat("Sanity check:",nrow(sleepLog)-nlevels(sleepLog$IDday)) # three double cases
## Sanity check: 3
sleepLog[sleepLog$IDday%in%sleepLog$IDday[duplicated(sleepLog$IDday)],
c("IDday","StartTime","EndTime","SleepDataType","EBEDataType","TST","stageHR_NREM")] # showing 3 double cases
# removing N = 3 day-time duplicated cases of sleepLog data
toRemove <- as.character(sleepLog[sleepLog$IDday%in%sleepLog$IDday[duplicated(sleepLog$IDday)],
][seq(1,nrow(sleepLog[sleepLog$IDday%in%sleepLog$IDday[duplicated(sleepLog$IDday)],])-1,
by=2),"LogId"])
sleepLog <- sleepLog[!(sleepLog$LogId %in% toRemove),]
# re-creating common variable IDday and sanity check
sleepLog$IDday <- as.factor(paste(sleepLog$ID,sleepLog$ActivityDate,sep="_"))
cat("Sanity check:",nrow(sleepLog)==nlevels(sleepLog$IDday)) # no more double cases
## Sanity check: TRUE
Comments:
1 sleepLog case was recoded since its StartTime value was between midnight and 6 AM but its ActivityDate did not refer to the previous day
after the recoding of that case, three double cases (with the same ID and ActivityDate values) were included in the sleepLog dataset, and have been removed
no double cases are included in the dailyAct dataset
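The detection of double cases through a composite key can be illustrated with a small example. This is a sketch on hypothetical toy data (the IDs, dates, and TST values are made up); it builds the IDday key as above and lists all rows whose key occurs more than once.

```r
# hypothetical toy data: two sleep periods share the same ID and ActivityDate
toy <- data.frame(ID = c("s001", "s001", "s001", "s002"),
                  ActivityDate = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01")),
                  TST = c(45, 430, 410, 390))
toy$IDday <- paste(toy$ID, toy$ActivityDate, sep = "_")
dupKeys <- unique(toy$IDday[duplicated(toy$IDday)])   # keys occurring more than once
toy[toy$IDday %in% dupKeys, ]                         # inspecting all rows sharing a key
nDouble <- nrow(toy) - length(unique(toy$IDday))      # No. of surplus rows
nDouble
```

Note that duplicated() flags only the second and later occurrences of a key, which is why the rows are selected with %in% in order to show both members of each pair.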
Then, we can join the two datasets by using the common variable IDday to create the fitbit dataset. Note that the type argument is set to “full” in order to include all the cases included in either dataset.
fitbit <- plyr::join(sleepLog,dailyAct,by="IDday",type="full") # joining
fitbit <- fitbit[order(fitbit$ID,fitbit$ActivityDate,fitbit$StartTime),] # sorting by ID and time
row.names(fitbit) <- 1:nrow(fitbit) # renaming rows
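The effect of a full join, as opposed to a left join, can be seen on a small example. The sketch below uses base R merge() as an analogue of plyr::join (all=TRUE for a full join, all.x=TRUE for a left join); this is a substitution made only to keep the example free of dependencies, and the toy datasets are hypothetical.

```r
# hypothetical toy datasets sharing the key IDday
sleepToy <- data.frame(IDday = c("s001_d1", "s001_d2"), TST = c(430, 410))
actToy   <- data.frame(IDday = c("s001_d2", "s001_d3"), TotalSteps = c(8000, 9500))
fullToy <- merge(sleepToy, actToy, by = "IDday", all = TRUE)    # keeps keys from either dataset
leftToy <- merge(sleepToy, actToy, by = "IDday", all.x = TRUE)  # keeps keys from the first dataset only
fullToy
```

In the full join, days present in only one dataset are kept, with NA in the columns coming from the other dataset; these are the cases counted as missing in the tables of nonmissing data.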
Here, we inspect the No. and percentage of nonmissing data for each class of variables included in the fitbit dataset: "classic" (e.g., TST), "stages" (e.g., light), "HR" (e.g., stageHR_NREM), and "Act" (e.g., TotalSteps).
# counting No. and % of missing data per class of variable
Ncases <- nrow(fitbit) # total No. of cases
infoMiss <- data.frame(total=Ncases,Act=nrow(fitbit[!is.na(fitbit$TotalSteps),]),
classic=nrow(fitbit[!is.na(fitbit$TST),]),stages=nrow(fitbit[!is.na(fitbit$light),]),
HR=nrow(fitbit[!is.na(fitbit$stageHR_NREM),]),
classicANDAct=nrow(fitbit[!is.na(fitbit$TST) & !is.na(fitbit$TotalSteps),]),
stagesANDAct=nrow(fitbit[!is.na(fitbit$light) & !is.na(fitbit$TotalSteps),]),
HRANDAct=nrow(fitbit[!is.na(fitbit$stageHR_NREM) & !is.na(fitbit$TotalSteps),]))
infoMiss <- rbind(infoMiss,Ncases-infoMiss,round(100*infoMiss/Ncases),round(100-100*infoMiss/Ncases))
row.names(infoMiss) <- c("N","Missing","%","%missing")
infoMiss
Comments:
Considering the total No. of cases (i.e., N = 6,169 non missing either in sleepLog or in dailyAct):
Act information is missing in 150 cases (2%)
sleep classic information is missing in 931 (15%) cases
sleep stages information is missing in 1,706 (28%) cases
sleep-stage HR information is missing in 1,737 (28%) cases
Here, we save the aggregated fitbit dataset.
# saving dataset with EBE-based sleep measures
save(fitbit,file="DATA/datasets/fitbitSleepAct_aggregated.RData")
Here, we create the variable IDday to be used for merging the dailyDiary and the fitbit datasets. Note that the dailyDiary ActivityDate was recoded to refer to the previous day when StartedTime is between midnight and 8 PM (see section 2.7.5).
# creating common variable IDday
dailyDiary$IDday <- as.factor(paste(dailyDiary$ID,dailyDiary$ActivityDate,sep="_"))
# sanity check
cat("Sanity check:",nrow(dailyDiary)==nlevels(dailyDiary$IDday)) # no double cases
## Sanity check: TRUE
Comments:
no double cases are included in the dailyDiary dataset
Then, we can join the two datasets by using the common variable IDday to create the ema dataset. Note that the type argument is set to “full” in order to include all the cases included in either dataset.
ema <- plyr::join(fitbit,dailyDiary,by="IDday",type="full") # joining
ema <- ema[order(ema$ID,ema$ActivityDate,ema$StartTime),] # sorting by ID and time
row.names(ema) <- 1:nrow(ema) # renaming rows
Here, we inspect the No. and percentage of nonmissing data for each class of variables included in the ema dataset: "classic" (e.g., TST), "stages" (e.g., light), "HR" (e.g., stageHR_NREM), "Act" (e.g., TotalSteps), and "diary" (e.g., dailyStress).
# counting No. and % of missing data per class of variable
Ncases <- nrow(ema) # total No. of cases
Ndiary <- ema[!is.na(ema$dailyStress) & !is.na(ema$eveningMood) & !is.na(ema$eveningWorry),] # Ndiary = non missing focal vars
infoMiss <- data.frame(total=Ncases,diary=nrow(Ndiary),Act=nrow(ema[!is.na(ema$TotalSteps),]),
classic=nrow(ema[!is.na(ema$TST),]),stages=nrow(ema[!is.na(ema$light),]),
HR=nrow(ema[!is.na(ema$stageHR_NREM),]),
classicANDdiary=nrow(Ndiary[!is.na(Ndiary$TST),]),
stagesANDdiary=nrow(Ndiary[!is.na(Ndiary$light),]),
HRANDNdiary=nrow(Ndiary[!is.na(Ndiary$stageHR_NREM),]),
ActANDdiary=nrow(Ndiary[!is.na(Ndiary$TotalSteps),]))
infoMiss <- rbind(infoMiss,Ncases-infoMiss,round(100*infoMiss/Ncases),round(100-100*infoMiss/Ncases))
row.names(infoMiss) <- c("N","Missing","%","%missing")
infoMiss
Comments:
Considering the total No. of cases (i.e., N = 6,219 non missing either in sleepLog, in dailyAct, or in dailyDiary):
diary information is missing in 1,286 cases (21%)
Act information is missing in 200 cases (3%)
sleep classic information is missing in 981 (16%) cases
sleep stages information is missing in 1,756 (28%) cases
sleep-stage HR information is missing in 1,787 (29%) cases
Then, we inspect the time difference between the dailyDiary StartedTime values (i.e., the time at which the survey was answered) and the corresponding sleepLog StartTime values (i.e., lights-off time according to sleepLog data).
ema$diary.timeDiff <- as.numeric(difftime(ema$StartTime,ema$StartedTime,units="hours"))
hist(ema$diary.timeDiff,breaks=100,main="hours between dailyDiary StartedTime and sleepLog StartTime")
# showing 1,000 cases with dailyDiary StartedTime AFTER sleepLog StartTime
ema[!is.na(ema$diary.timeDiff) & ema$diary.timeDiff<0,c("ID","ActivityDate","StartedTime","StartTime","EndTime")]
Comments:
in most cases (N = 3,402, 77%) dailyDiary StartedTime is before sleepLog StartTime
in 1,000 cases (23%) dailyDiary StartedTime is after sleepLog StartTime; the cases listed above show that most of these are cases in which the diary was answered on the following day
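The sign convention of the time difference can be checked on a small example. This is a sketch with hypothetical timestamps: as in the code above, the difference is computed as StartTime minus StartedTime, so that negative values identify diaries answered after lights-off.

```r
# hypothetical timestamps for two days of one participant
StartTime   <- as.POSIXct(c("2020-01-01 23:10:00", "2020-01-02 22:45:00"), tz = "GMT") # lights-off
StartedTime <- as.POSIXct(c("2020-01-01 21:40:00", "2020-01-03 08:15:00"), tz = "GMT") # diary opened
timeDiff <- as.numeric(difftime(StartTime, StartedTime, units = "hours"))
timeDiff                        # positive: diary before lights-off; negative: diary after lights-off
afterSleep <- !is.na(timeDiff) & timeDiff < 0
round(100 * mean(afterSleep))   # percentage of diaries answered after lights-off
```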
Here, we save the aggregated ema dataset.
# saving dataset with EBE-based sleep measures
save(ema,file="DATA/datasets/ema_aggregated.RData")
Finally, we join the ema dataset (with all daily varying measures) with the demos data (including demographic information of each participant) by using ID as the matching variable. In this case, we set the type argument to "left" in order to only include those participants who were involved in the EMA protocol.
ema <- plyr::join(ema,demos,by="ID",type="left") # joining datasets
ema <- ema[,c(1:2,(ncol(ema)-ncol(demos)+2):ncol(ema),3:(ncol(ema)-ncol(demos)+1))] # demographics vars at the beginning
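As an illustration of the left join and of the column reordering, the following is a minimal base-R sketch on hypothetical toy data (`emaToy` and `demosToy` are made-up objects). It uses `merge()` with `all.x = TRUE` rather than `plyr::join`, and it selects columns by name rather than by position. Note that `merge()` sorts rows by the key, whereas `plyr::join` keeps the row order of the first dataset.

```r
# toy data (hypothetical IDs and values)
emaToy <- data.frame(ID = c("s001", "s001", "s002"),
                     ActivityDate = as.Date("2020-01-01") + c(0, 1, 0),
                     TST = c(400, 420, 380))
demosToy <- data.frame(ID = c("s001", "s002", "s003"),
                       sex = c("F", "M", "F"),
                       BMI = c(20.1, 22.4, 19.8))

# left join: all emaToy rows are kept, participants only in demosToy are dropped
joined <- merge(emaToy, demosToy, by = "ID", all.x = TRUE)

# moving demographic variables right after ID and ActivityDate
demoVars <- setdiff(colnames(demosToy), "ID")
otherVars <- setdiff(colnames(joined), c("ID", "ActivityDate", demoVars))
joined <- joined[, c("ID", "ActivityDate", demoVars, otherVars)]
joined
```

Selecting columns by name is less compact than positional indexing, but it does not depend on the number of columns in either dataset.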
Here, we simply check the No. of demos participants that were not included in the ema dataset, and the No. of cases with missing demos values.
# No. demos cases not included (14)
nrow(demos[!(demos$ID %in% levels(as.factor(as.character(ema$ID)))),])
## [1] 14
# no cases with missing demos variables
cat("sanity check:",
nrow(ema[is.na(ema$sex)|is.na(ema$BMI)|is.na(ema$age)|is.na(ema$insomnia)|is.na(ema$insomnia.group),])==0)
## sanity check: TRUE
Comments:
14 participants were not included because they did not participate in the EMA protocol
no missing data in any demos variable are included in the ema dataset
Here, we save the aggregated ema dataset.
# saving dataset with EBE-based sleep measures
save(ema,file="DATA/datasets/emaRetro_aggregated.RData")
Here, we summarize the compliance (No. of non missing data) for each core variable, and we filter the data based on variable-specific criteria.
rm(list=ls()) # emptying the working environment
library(lubridate) # loading required packages
Sys.setenv(TZ="GMT") # setting system time zone to GMT (for consistent temporal synchronization)
# loading processed datasets
load("DATA/datasets/dailyAct_aggregated.RData") # dailyAct
load("DATA/datasets/hourlySteps_recoded.RData") # hourlySteps
load("DATA/datasets/sleepLog_nonComb.RData") # sleepLog_nonComb
load("DATA/datasets/sleepLogEBEHR_aggregated.RData") # sleepLog
load("DATA/datasets/LogId_special.RData") # special LogIds
load("DATA/datasets/HR.1min_recoded.RData") # HR.1min
load("DATA/datasets/emaRetro_aggregated.RData") # ema
load("DATA/datasets/demos_recoded.RData") # demos
The demos data include demographic information describing participants: sex, age, BMI, and insomnia groups.
Exclusion criteria based on demos data were applied in the recruitment phase, and no further criteria need to be applied. Exclusion criteria were a past and/or current history of severe medical (e.g., cancer, epilepsy, heart diseases, diabetes) and/or mental (e.g., major depressive disorder) conditions, current use of medication known to affect sleep and/or cardiovascular function (e.g., hypnotics, antihypertensives), self-reported breathing-related and/or movement-related sleep disorders, travel across time zones in the past month, and current pregnancy or breast-feeding (girls).
No compliance information is needed to describe demos data since, as highlighted above, no missing data are included in the ema dataset.
cat("No. missing data in demos variables =",
nrow(ema[is.na(ema$sex) | is.na(ema$age) | is.na(ema$BMI) | is.na(ema$insomnia) | is.na(ema$insomnia.group),])) # 0
## No. missing data in demos variables = 0
As highlighted above, 14 participants were not included because they did not participate in the EMA protocol. No more participants should be excluded based on demos variables.
# removing participants
memory <- demos
demos <- demos[demos$ID %in% levels(as.factor(as.character(ema$ID))),]
cat("No. excluded participants that did not participate to the EMA protocol =",
nrow(memory)-nrow(demos)) # 14
## No. excluded participants that did not participate to the EMA protocol = 14
save(demos,file="DATA/datasets/demos_clean.RData")
sleepLog data include information describing individual sleep periods detected by the FC3 device, with more than one sleep period possibly being identified within the same day.
Inclusion criteria for sleepLog data were already introduced in section 2.3.3. They consist of the following conditions, which identify our definition of nocturnal sleep period:
- Starting between 6 PM and 6 AM
- Including at least 180 min (3 hours) of Total Sleep Time (TST)
- Possibly interrupted by an indefinite number of wake periods of indefinite duration, but with the last sleep period starting before 11 AM
- Possibly composed of consecutive sleep periods, but those periods between 6 PM and 11 PM, and between 6 AM and 11 AM, are combined only when separated by less than 1.5 hours (otherwise, they are considered as naps)
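The first two conditions listed above can be expressed as a simple logical check. The following is a minimal sketch, assuming a hypothetical helper `isNocturnal` (not part of the original pipeline) and made-up start times; conditions 3 and 4 require combining consecutive sleep periods and are not covered here.

```r
# hypothetical helper: TRUE when a sleep period starts between 6 PM and 6 AM
# and includes at least minTST minutes of Total Sleep Time
isNocturnal <- function(StartTime, TST, minTST = 180){
  # clock time as decimal hours (0-24), based on the time zone of StartTime
  startH <- as.numeric(format(StartTime, "%H")) + as.numeric(format(StartTime, "%M")) / 60
  (startH >= 18 | startH <= 6) & TST >= minTST }

# toy sleep periods (hypothetical values)
st <- as.POSIXct(c("2020-01-10 23:10:00", "2020-01-11 14:00:00", "2020-01-12 02:30:00"), tz = "GMT")
isNocturnal(StartTime = st, TST = c(420, 200, 150))
```

In this toy example, the second period fails condition 1 (it starts at 2 PM) and the third one fails condition 2 (TST < 180 min).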
Here, we compute the original No. of distinct sleep periods (LogId) identified by the FC3 device, and the original No. of distinct days with sleepLog minimal information (e.g., time in bed).
cat("Original No. of sleep logs =", # 5,402 + 41 + 336 = 5,779
nlevels(sleepLog_noncomb$LogId) + length(LogId_special[[2]]) + length(LogId_special[[3]]))
## Original No. of sleep logs = 5779
cat("Original No. of sleepLog days =", # 4,387 + 41 + 336 = 4,764
nlevels(sleepLog_noncomb$IDday) + length(LogId_special[[2]]) + length(LogId_special[[3]]))
## Original No. of sleepLog days = 4764
To estimate compliance, we compute, for each participant, the ratio of the No. of nonmissing days to the length of the recording period originally required by the study, that is, two months (60 days).
# creating compliance dataset
compliance <- demos[demos$ID %in% levels(ema$ID),]
# adding EBEonly cases to sleepLog_noncomb
sleep_noncomb <- rbind(sleepLog_noncomb[,c("ID","StartTime","EndTime","ActivityDate","LogId")],
ema[!is.na(ema$EBEonly) & ema$EBEonly==TRUE,c("ID","StartTime","EndTime","ActivityDate","LogId")])
# updating ActivityDate in sleep_noncomb
sleep_noncomb$StartHour <- as.POSIXct(paste(hour(sleep_noncomb$StartTime),
minute(sleep_noncomb$StartTime)),format="%H %M",tz="GMT")
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h06 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")
sleep_noncomb[sleep_noncomb$StartHour >= h00 & sleep_noncomb$StartHour <= h06,"ActivityDate"] <-
sleep_noncomb[sleep_noncomb$StartHour >= h00 & sleep_noncomb$StartHour <= h06,"ActivityDate"] - 1
sleep_noncomb$IDday <- as.factor(paste(sleep_noncomb$ID,sleep_noncomb$ActivityDate,sep="_"))
# computing compliance (reference: 60 days)
Nsleep <- ema[!is.na(ema$TIB),]
for(i in 1:nrow(compliance)){
compliance[i,"nSleep"] <-
nlevels(as.factor(as.character(sleep_noncomb[as.character(sleep_noncomb$ID)==compliance[i,"ID"],"IDday"])))
compliance[i,"periodSleep"] <-
difftime(max(sleep_noncomb[as.character(sleep_noncomb$ID)==compliance[i,"ID"],"StartTime"]),
min(sleep_noncomb[as.character(sleep_noncomb$ID)==compliance[i,"ID"],"StartTime"]),units="days") }
compliance$compl.sleep <- 100*compliance$nSleep/60 # % of days on two months
# printing info
cat("sleepLog data:\n- Original No. days/participants =",
round(mean(compliance$nSleep),2)," ( SD =",round(sd(compliance$nSleep),2),
") \n- Original sleepLog compliance =",
round(mean(compliance$compl.sleep),2),"% ( SD =",round(sd(compliance$compl.sleep),2),
") \n- No. of missing days =",
round(mean(as.numeric(compliance$periodSleep)-compliance$nSleep),2),
" ( SD =",round(sd(as.numeric(compliance$periodSleep)-compliance$nSleep),2),")")
## sleepLog data:
## - Original No. days/participants = 55.52 ( SD = 13.01 )
## - Original sleepLog compliance = 92.53 % ( SD = 21.69 )
## - No. of missing days = 22.35 ( SD = 41.39 )
Comments:
compared to the criterion of having two months of continuous recordings, sleepLog data show an original compliance of 92.53%
however, within the period of recording, an average of 22 missing days is observed
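The two quantities discussed above (compliance relative to 60 days, and missing days within the recording period) can be sketched without an explicit loop. This is a minimal example on hypothetical toy data; unlike the difftime-based period computed above, the period here counts both the first and the last day, which is an assumption of this sketch.

```r
# toy data (hypothetical): nonmissing sleep days for two participants
toy <- data.frame(ID = rep(c("s001", "s002"), times = c(4, 3)),
                  ActivityDate = as.Date("2020-01-01") + c(0, 1, 2, 5, 0, 1, 2))

# No. of distinct nonmissing days and length of the recording period, by participant
nDays  <- tapply(toy$ActivityDate, toy$ID, function(x) length(unique(x)))
period <- tapply(toy$ActivityDate, toy$ID, function(x) as.numeric(max(x) - min(x)) + 1)

compl   <- 100 * nDays / 60   # % of the 60 required days
missing <- period - nDays     # missing days within the recording period
data.frame(nDays, period, compl, missing)
```

The two indices answer different questions: `compl` penalizes short recording periods, whereas `missing` only reflects gaps between the first and the last recorded day.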
sleepLog data were already filtered multiple times in the sections above:
339 cases were excluded because StartTime was < 6 PM or > 11 AM in section 2.3.3
67 cases were combined with previous or following sleep periods in section 2.3.3
4 cases were manually excluded based on visual inspection (early-evening or late-morning naps) in section 2.3.3
75 cases were removed after sleep combination because StartTime was > 6 AM
57 cases were manually excluded as they were cases of early-evening naps (StartTime < 6 PM) recorded before the subsequent nocturnal sleep periods, in section 2.3.4
3 cases were manually excluded as they were cases of diurnal sleep periods with the same IDday value as the following nocturnal sleep periods, in section 4.5
In summary, a total of 545 cases were excluded, mainly due to condition 1 (i.e., StartTime < 6 PM or > 6 AM). Here, the count is 4 observations lower than expected, probably due to duplicated cases.
# LogId back as factor, and printing info
ema$LogId <- as.factor(as.character(ema$LogId))
cat("No. of excluded cases with StartTime between 6 AM and 6 PM =",
nrow(sleepLog_noncomb) + length(LogId_special[[2]]) + length(LogId_special[[3]]) -
nrow(ema[!is.na(ema$LogId),])) # 541 (4 fewer cases than expected)
## No. of excluded cases with StartTime between 6 AM and 6 PM = 541
Here, we remove three further sleepLog cases with StartTime between 6 AM and 6 PM.
# selecting sleepLogVars
sleepLogVars <- colnames(ema)[which(colnames(ema)=="LogId"):which(colnames(ema)=="nHR_REM")]
# re-computing StartHour
ema$StartHour <- as.POSIXct(paste(hour(ema$StartTime),minute(ema$StartTime)),format="%H %M",tz="GMT")
# removing sleepLog variables from cases with StartTime between 6 AM and 6 PM
memory <- ema
h6 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")
h18 <- as.POSIXct(paste(substr(Sys.time(),1,10),"18:00:00"),tz="GMT")
ema[!is.na(ema$StartHour) & ema$StartHour > h6 & ema$StartHour < h18, sleepLogVars] <- NA
# print info (3 removed cases)
cat("No. excluded cases with StartTime between 6 AM and 6 PM =",
nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))
## No. excluded cases with StartTime between 6 AM and 6 PM = 3
# showing 3 removed cases
memory[!is.na(memory$TIB) & is.na(ema$TIB),c("ID","ActivityDate","StartTime","EndTime","TST","EBEonly","EBEDataType")]
Comments:
only 3 cases were removed due to StartTime between 6 AM and 6 PM
all 3 cases were diurnal sleep periods computed from classicEBE data (i.e., cases of uniqueClassicLog; see section 4.3.3), with only two of them having a TST > 3h
As a further sanity check, we inspect the distribution of nonmissing StartHour times.
# recomputing StartHour
ema$StartHour <- as.POSIXct(paste(hour(ema$StartTime),minute(ema$StartTime)),format="%H %M",tz="GMT")
# sanity check (no more cases with StartTime between 6 AM and 6 PM)
cat("Sanity check:",nrow(ema[!is.na(ema$StartHour) & ema$StartHour > h6 & ema$StartHour < h18,])==0)
## Sanity check: TRUE
# plotting StartHour
hist(ema$StartHour,breaks=100,col="black",xlab="",main="StartHour")
Then, we filter sleepLog cases with a TST < 3 hours (i.e., condition 2 of our definition of nocturnal sleep period).
# removing sleepLog variables from cases with TST < 180 min (3h)
memory <- ema
ema[!is.na(ema$TST) & ema$TST < 180, sleepLogVars] <- NA
# print info (64 removed cases)
cat("No. excluded cases with TST < 3h =",
nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))
## No. excluded cases with TST < 3h = 64
# sanity check
summary(ema$TST)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 180.0 363.2 416.0 414.3 466.5 886.0 1048
Comments:
64 cases were removed because a TST < 3h does not match our definition of nocturnal sleep period
included TST values now range from 180 min (3h) to 886 min (about 14.8h)
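The filtering pattern used above sets the sleep variables to NA instead of dropping rows, so that the other daily measures recorded on the same day are preserved. The following is a minimal sketch on hypothetical toy data; `sleepVars` plays the role of `sleepLogVars`.

```r
# toy data (hypothetical values): two sleep variables plus a diary variable
toy <- data.frame(ID = c("s001", "s001", "s002"),
                  TST = c(420, 150, NA),
                  TIB = c(480, 200, NA),
                  dailyStress = c(2, 4, 3))
sleepVars <- c("TST", "TIB")

# setting sleep variables to NA when TST < 180 min, while keeping the row
# (and thus the diary rating) in the dataset
toy[!is.na(toy$TST) & toy$TST < 180, sleepVars] <- NA
toy
```

The `!is.na()` condition is needed because a logical index containing NA is not allowed in this kind of subscripted assignment.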
Then, we better inspect cases of temporally isolated sleep periods, that is, sleepLog data recorded substantially later than all other sleepLog data previously recorded from the same participant. As shown in section 2.3, cases with an extreme No. of consecutive missing days are mainly due to such isolated sleep periods.
# computing LAG values for ActivityDate
library(dplyr)
ema <- ema[order(ema$ID,ema$ActivityDate,ema$StartTime),]
Nsleep <- ema[!is.na(ema$TIB),]
Nsleep <- Nsleep %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nsleep <- as.data.frame(Nsleep)
detach("package:dplyr", unload=TRUE)
# computing and plotting time lags
Nsleep$lag <- as.numeric(difftime(Nsleep$ActivityDate,Nsleep$AD_lag,units="days"))
hist(Nsleep$lag,main="Consecutive missing days day(i) - day(i-1)",breaks=100)
# printing info
n <- c(10,15,20,30,50)
for(i in 1:length(n)){ cat("\n- No. cases with >",n[i],"consecutive missing days =",
nrow(Nsleep[(!is.na(Nsleep$lag) & Nsleep$lag>n[i]),])) }
##
## - No. cases with > 10 consecutive missing days = 41
## - No. cases with > 15 consecutive missing days = 32
## - No. cases with > 20 consecutive missing days = 25
## - No. cases with > 30 consecutive missing days = 14
## - No. cases with > 50 consecutive missing days = 8
# showing 41 cases with more than 10 missing days
Nsleep.vars <- c("ID","ActivityDate","lag",sleepLogVars)
isolatedSleep <- as.data.frame(matrix(nrow=0,ncol=length(Nsleep.vars)+1))
for(i in 1:nrow(Nsleep)){
if(!is.na(Nsleep[i,"lag"]) & Nsleep[i,"lag"]>10){
isolatedSleep <- rbind(isolatedSleep,Nsleep[(i-2):(i+2),c(Nsleep.vars,"dailyStress")]) } }
isolatedSleep[,c("ID","LogId","ActivityDate","lag","StartTime","EBEonly","dailyStress")]
Comments:
consecutive missing days of sleepLog data (i.e., No. of days between each observation and the preceding one from the same participant) are higher than 10 for only 41 cases (0.7%), with even fewer cases showing more than 15, 20, 30, and 50 consecutive missing days
about half of these cases are due to isolated final sleep periods recorded several days later than the previous sleep period, with no corresponding dailyDiary values
only in two of these cases (LogId 26112469243 and 26112469243) sleepLog data were computed based on EBEonly
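The within-participant lag described above can also be computed in base R, without loading and detaching dplyr. This is a minimal sketch on hypothetical toy data, assuming rows are already sorted by ID and ActivityDate.

```r
# toy data (hypothetical), sorted by ID and ActivityDate
toy <- data.frame(ID = c("s001", "s001", "s001", "s002", "s002"),
                  ActivityDate = as.Date(c("2020-01-01", "2020-01-02", "2020-01-20",
                                           "2020-01-05", "2020-01-06")))

# days since the previous observation from the same participant (NA for the first one)
toy$lag <- ave(as.numeric(toy$ActivityDate), toy$ID, FUN = function(x) c(NA, diff(x)))

# No. of cases with more than 10 days since the previous observation
nLong <- sum(!is.na(toy$lag) & toy$lag > 10)
nLong
```

Here, only the third observation of the first participant is recorded more than 10 days after the preceding one.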
Here, we use the isolatedSleep.rm function to progressively remove cases of isolated sleep periods.
isolatedSleep.rm <- function(SLEEPdata=NA,DayDiff.max=10,DayDiff.nDays=1,printInfo=TRUE,showData=FALSE){
# preparing vector that will include all the filtered LogId values
ISOLogs <- character()
for(h in 1:10000){
require(dplyr)
SLEEPdata <- SLEEPdata %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
SLEEPdata <- as.data.frame(SLEEPdata)
detach("package:dplyr", unload=TRUE)
# computing and plotting time lags
SLEEPdata$lag <- as.numeric(difftime(SLEEPdata$ActivityDate,SLEEPdata$AD_lag,units="days"))
# printing info
if(printInfo==TRUE){ n <- c(10,15,20,30,50)
for(i in 1:length(n)){ cat("\n - No. cases with >",n[i],"consecutive missing days =",
nrow(SLEEPdata[(!is.na(SLEEPdata$lag) & SLEEPdata$lag>n[i]),])) }}
# creating list of cases with each lag value > DayDiff.max and the previous and following DayDiff.nDays + 1 cases
sleep.vars <- c("ID","LogId","ActivityDate","lag","StartTime","EndTime","EBEonly")
isolatedSleep <- list()
for(i in 3:nrow(SLEEPdata)){
if(!is.na(SLEEPdata[i,"lag"]) & SLEEPdata[i,"lag"]>DayDiff.max){
isolatedSleep[[length(isolatedSleep)+1]] <- SLEEPdata[(i-DayDiff.nDays-1):(i+DayDiff.nDays+1),sleep.vars] } }
# creating vector of isolated sleep cases' LogId values
isolatedLogs <- character()
for(k in seq_along(isolatedSleep)){ # seq_along() also handles an empty list
for(i in 1:(nrow(isolatedSleep[[k]])-DayDiff.nDays)){
if(!is.na(isolatedSleep[[k]][i,"lag"]) & isolatedSleep[[k]][i,"lag"]>DayDiff.max &
isolatedSleep[[k]][i,"ID"] != isolatedSleep[[k]][i+DayDiff.nDays,"ID"]){
isolatedLogs <- c(isolatedLogs,as.character(isolatedSleep[[k]][(i-DayDiff.nDays+1):i,"LogId"])) } }}
isolatedLogs <- levels(as.factor(isolatedLogs))
# data filtering and printing info
if(length(isolatedLogs)>0){
ISOLogs <- c(ISOLogs,isolatedLogs)
memory <- SLEEPdata
SLEEPdata <- SLEEPdata[!(SLEEPdata$LogId %in% isolatedLogs),] # removing cases
if(printInfo==TRUE){
cat("\n\nCycle No.",h,": Excluding",nrow(memory[!is.na(memory$TIB),])-nrow(SLEEPdata[!is.na(SLEEPdata$TIB),]),
"cases of isolated sleep periods")}
# when no more cases of isolated sleep periods
} else{
if(printInfo==TRUE){
cat("\n\nNo more cases with >",n[i],"consecutive missing days due to isolated sleep periods",
"\nTotal No. of isolated sleep periods to be filtered =",length(ISOLogs)) }
if(showData==TRUE){
cat("\nshowing all remaining cases with >",n[i],"consecutive missing days:\n")
print(isolatedSleep) }
break }}
return(ISOLogs) }
# running function and selecting cases of isolated sleep periods at the end of participants' recording period
isoSleep <- isolatedSleep.rm(SLEEPdata=ema[!is.na(ema$TIB),],DayDiff.max=10,DayDiff.nDays=1,showData=FALSE)
##
## - No. cases with > 10 consecutive missing days = 41
## - No. cases with > 15 consecutive missing days = 32
## - No. cases with > 20 consecutive missing days = 25
## - No. cases with > 30 consecutive missing days = 14
## - No. cases with > 50 consecutive missing days = 8
##
## Cycle No. 1 : Excluding 20 cases of isolated sleep periods
## - No. cases with > 10 consecutive missing days = 21
## - No. cases with > 15 consecutive missing days = 14
## - No. cases with > 20 consecutive missing days = 9
## - No. cases with > 30 consecutive missing days = 2
## - No. cases with > 50 consecutive missing days = 2
##
## Cycle No. 2 : Excluding 6 cases of isolated sleep periods
## - No. cases with > 10 consecutive missing days = 15
## - No. cases with > 15 consecutive missing days = 10
## - No. cases with > 20 consecutive missing days = 5
## - No. cases with > 30 consecutive missing days = 1
## - No. cases with > 50 consecutive missing days = 1
##
## No more cases with > 30 consecutive missing days due to isolated sleep periods
## Total No. of isolated sleep periods to be filtered = 26
Comments:
Here, we remove these 26 cases from the ema dataset.
# removing sleepLog variables from cases corresponding to isoSleep
memory <- ema
ema[!is.na(ema$TST) & ema$LogId %in% isoSleep, sleepLogVars] <- NA
# print info (26 removed cases)
cat("No. excluded cases of isolated sleep =",
nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))
## No. excluded cases of isolated sleep = 26
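The idea of progressively removing isolated final sleep periods can be illustrated with a simplified sketch. The helper `trailingIsolated` is hypothetical and is not the isolatedSleep.rm function used above: it only handles a participant's last observation, one at a time, and it assumes that data are sorted by participant and date.

```r
# hypothetical helper: repeatedly flags each participant's last observation
# when it was recorded more than maxGap days after the previous one
trailingIsolated <- function(id, date, maxGap = 10){
  keep <- rep(TRUE, length(id))
  repeat{
    flagged <- FALSE
    for(p in unique(id)){
      idx <- which(id == p & keep)   # observations still retained for participant p
      n <- length(idx)
      if(n >= 2 && as.numeric(date[idx[n]] - date[idx[n - 1]]) > maxGap){
        keep[idx[n]] <- FALSE        # the last retained observation is isolated
        flagged <- TRUE } }
    if(!flagged) break }             # stop when no further case is flagged
  !keep }

# toy data (hypothetical), sorted by participant and date
id   <- c("s001", "s001", "s001", "s001", "s002", "s002")
date <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-20", "2020-02-15",
                  "2020-01-05", "2020-01-06"))
isolated <- trailingIsolated(id, date)
isolated
```

In this toy example, two cycles are needed: the observation on 2020-02-15 is flagged first, which then exposes the observation on 2020-01-20 as a new isolated final case.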
Then, we further inspect the remaining cases with 10+ missing consecutive sleepLog days.
# computing LAG values for ActivityDate
library(dplyr)
Nsleep <- ema[!is.na(ema$TIB),]
Nsleep <- Nsleep %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nsleep <- as.data.frame(Nsleep)
detach("package:dplyr", unload=TRUE)
Nsleep$lag <- as.numeric(difftime(Nsleep$ActivityDate,Nsleep$AD_lag,units="days"))
# showing 15 cases with more than 10 missing days
Nsleep.vars <- c("ID","ActivityDate","lag",sleepLogVars)
isolatedSleep <- as.data.frame(matrix(nrow=0,ncol=length(Nsleep.vars)))
for(i in 1:nrow(Nsleep)){
if(!is.na(Nsleep[i,"lag"]) & Nsleep[i,"lag"]>10){
isolatedSleep <- rbind(isolatedSleep,Nsleep[(i-3):(i+3),c(Nsleep.vars,"dailyStress")]) } }
isolatedSleep[,c("ID","LogId","ActivityDate","lag","StartTime","EBEonly","dailyStress")]
Comments:
some of the remaining cases of 10+ consecutive missing days are still due to isolated final sleep periods recorded several days later than the previous sleep period, with most of them having missing values for dailyDiary variables
however, these cases were not filtered above because of one or two cases with lag < 10 between them and the end of the data collection period for a given participant
Here, we use the isolatedSleep.rm function to filter these cases by considering two consecutive rows instead of one (i.e., DayDiff.nDays is set to 2).
# running function and selecting cases of isolated sleep periods at the end of participants' recording period
isoSleep <- isolatedSleep.rm(SLEEPdata=ema[!is.na(ema$TIB),],DayDiff.max=10,DayDiff.nDays=2,showData=FALSE)
##
## - No. cases with > 10 consecutive missing days = 15
## - No. cases with > 15 consecutive missing days = 10
## - No. cases with > 20 consecutive missing days = 5
## - No. cases with > 30 consecutive missing days = 1
## - No. cases with > 50 consecutive missing days = 1
##
## Cycle No. 1 : Excluding 8 cases of isolated sleep periods
## - No. cases with > 10 consecutive missing days = 15
## - No. cases with > 15 consecutive missing days = 10
## - No. cases with > 20 consecutive missing days = 6
## - No. cases with > 30 consecutive missing days = 2
## - No. cases with > 50 consecutive missing days = 1
##
## Cycle No. 2 : Excluding 8 cases of isolated sleep periods
## - No. cases with > 10 consecutive missing days = 11
## - No. cases with > 15 consecutive missing days = 6
## - No. cases with > 20 consecutive missing days = 3
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0
##
## Cycle No. 3 : Excluding 2 cases of isolated sleep periods
## - No. cases with > 10 consecutive missing days = 10
## - No. cases with > 15 consecutive missing days = 6
## - No. cases with > 20 consecutive missing days = 3
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0
##
## Cycle No. 4 : Excluding 2 cases of isolated sleep periods
## - No. cases with > 10 consecutive missing days = 9
## - No. cases with > 15 consecutive missing days = 5
## - No. cases with > 20 consecutive missing days = 3
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0
##
## No more cases with > 50 consecutive missing days due to isolated sleep periods
## Total No. of isolated sleep periods to be filtered = 20
Comments:
the function identified a further 20 cases of isolated sleep periods, accounting for 6 of the 15 cases of consecutive missing days > 10, and for both the remaining two cases with 30+ consecutive missing days
note that the 20 cases also include the sleep periods (N = 10) recorded between each case with lag > 10 and the end of the data collection period for that participant
Here, we remove these 20 cases from the ema dataset.
# removing sleepLog variables from cases corresponding to isoSleep
memory <- ema
ema[!is.na(ema$TST) & ema$LogId %in% isoSleep, sleepLogVars] <- NA
# print info (20 removed cases)
cat("No. excluded cases of isolated sleep =",
nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))
## No. excluded cases of isolated sleep = 20
Then, we further inspect the remaining cases with 10+ missing consecutive sleepLog days.
# computing LAG values for ActivityDate
library(dplyr)
Nsleep <- ema[!is.na(ema$TIB),]
Nsleep <- Nsleep %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nsleep <- as.data.frame(Nsleep)
detach("package:dplyr", unload=TRUE)
Nsleep$lag <- as.numeric(difftime(Nsleep$ActivityDate,Nsleep$AD_lag,units="days"))
# showing 9 cases with more than 10 missing days
Nsleep.vars <- c("ID","ActivityDate","lag",sleepLogVars)
isolatedSleep <- as.data.frame(matrix(nrow=0,ncol=length(Nsleep.vars)))
for(i in 1:nrow(Nsleep)){
if(!is.na(Nsleep[i,"lag"]) & Nsleep[i,"lag"]>10){
isolatedSleep <- rbind(isolatedSleep,Nsleep[(i-4):(i+4),c(Nsleep.vars,"dailyStress")]) } }
isolatedSleep[,c("ID","LogId","ActivityDate","lag","StartTime","EBEonly","dailyStress")]
Comments:
some of the remaining cases of 10+ consecutive missing days are still due to isolated final sleep periods recorded several days later than the previous sleep period, among which 4 cases have missing dailyDiary values
however, these cases were not filtered above because of two cases with lag < 10 between them and the end of the data collection period for a given participant
Here, we manually remove these 4 cases:
# manually removing sleepLog variables from cases corresponding to isoSleep
memory <- ema
isoSleep <- c("23751186378","23751186379","23758815846", # s041 - removing last 3 cases (20 days after)
"23769564082") # s050 - removing first 1 case (21 days before)
ema[!is.na(ema$TIB) & ema$LogId %in% isoSleep, sleepLogVars] <- NA
# print info (4 removed cases)
cat("No. excluded cases of isolated sleep =",
nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))
## No. excluded cases of isolated sleep = 4
In summary, we removed a total of 50 isolated observations recorded 10+ days before or after the remaining observations obtained from the same participant. A total of six cases still show a No. of consecutive missing days between 11 and 26.
# computing LAG values for ActivityDate
library(dplyr)
Nsleep <- ema[!is.na(ema$TIB),]
Nsleep <- Nsleep %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nsleep <- as.data.frame(Nsleep)
detach("package:dplyr", unload=TRUE)
Nsleep$lag <- as.numeric(difftime(Nsleep$ActivityDate,Nsleep$AD_lag,units="days"))
# printing info
cat("No. of cases with 11 to",max(Nsleep$lag,na.rm=TRUE),"consecutive missing days =",
nrow(Nsleep[(!is.na(Nsleep$lag) & Nsleep$lag>10 & Nsleep$lag<max(Nsleep$lag,na.rm=TRUE)),]))
## No. of cases with 11 to 26 consecutive missing days = 6
Finally, we check again for duplicated cases, that is, cases with the same sleepLog values, or with the same ID and ActivityDate values.
# creating IDday variable
ema$IDday <- as.factor(paste(ema$ID,ema$ActivityDate,sep="_"))
Nsleep <- ema[!is.na(ema$TIB),]
# sanity check by LogId (0 cases)
cat("Sanity check:",nrow(Nsleep[duplicated(Nsleep$LogId),])==0)
## Sanity check: TRUE
# sanity check by IDday (0 cases)
cat("Sanity check:",nrow(Nsleep[duplicated(Nsleep$IDday),])==0)
## Sanity check: TRUE
Comments:
no duplicated cases are found in the ema dataset for sleepLog variables
In summary, from the original No. of sleep logs (N = 5,779 sleep periods recorded over 4,764 days), 67 cases (1%) were combined with preceding or following consecutive sleep periods, 478 + 3 = 481 cases (8%) were removed due to StartTime between 6 AM and 6 PM, and 64 cases (1%) were removed due to TST < 3h. In addition, 50 cases (1%) of isolated sleep periods recorded 10 or more days before or after the remaining observations from the same participant were removed.
Thus, data cleaning led to a total No. of excluded sleep periods = 662 cases (13.9%).
# 658 (4 fewer than expected)
cat("Total No. of excluded cases =",nrow(sleep_noncomb) - nrow(ema[!is.na(ema$TIB),]))
## Total No. of excluded cases = 658
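The expected total can be checked with a short tally of the counts reported in the summary above. The vector names are hypothetical labels, and the percentages are computed on the original No. of sleep logs.

```r
nLogs <- 5779   # original No. of sleep logs

# No. of cases combined or removed at each step, as reported in the summary above
removed <- c(combined = 67, diurnalStart = 478 + 3, shortTST = 64, isolated = 50)

totalRemoved <- sum(removed)                 # expected total No. of excluded cases
percRemoved  <- round(100 * removed / nLogs) # % of the original No. of sleep logs
totalRemoved
percRemoved
```

The difference between this expected total and the observed count of 658 corresponds to the 4 cases mentioned above.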
Here, we compute the updated information on the non-missing data related to sleepLog variables.
# updating compliance dataset
Nsleep <- ema[!is.na(ema$TIB),]
# computing compliance_clean (reference: 60 days)
for(i in 1:nrow(compliance)){
compliance[i,"nSleep_clean"] <-
nlevels(as.factor(as.character(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"],"IDday"])))
compliance[i,"periodSleep_clean"] <-
difftime(max(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"],"StartTime"]),
min(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"],"StartTime"]),units="days") }
compliance$compl.sleep_clean <- 100*compliance$nSleep_clean/60 # % of days on two months
# printing compliance info
cat("\n\nsleepLog data:\n- 'Cleaned' No. days/participants =",
round(mean(compliance$nSleep_clean),2)," ( SD =",round(sd(compliance$nSleep_clean),2),
") \n- 'Cleaned' sleepLog compliance =",
round(mean(compliance$compl.sleep_clean),2),"% ( SD =",round(sd(compliance$compl.sleep_clean),2),
") \n- No. of missing days =",
round(mean(as.numeric(compliance$periodSleep_clean)-compliance$nSleep_clean),2),
" ( SD =",round(sd(as.numeric(compliance$periodSleep_clean)-compliance$nSleep_clean),2),")")
##
##
## sleepLog data:
## - 'Cleaned' No. days/participants = 55.06 ( SD = 14.03 )
## - 'Cleaned' sleepLog compliance = 91.77 % ( SD = 23.38 )
## - No. of missing days = 5.82 ( SD = 9.05 )
# plotting No. of cases
hist(compliance$nSleep_clean,breaks=50,main="No. of nonmissing sleep periods per participant",xlab="")
Comments:
Finally, we summarize the No. and % of sleepLog cases with sleep stage data.
# computing No. of cases with nonmissing sleep stage data
for(i in 1:nrow(compliance)){
compliance[i,"nSleepStage_clean"] <-
nlevels(as.factor(as.character(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"] &
!is.na(Nsleep$light),"IDday"]))) }
# printing compliance info
cat("\n\nsleepLog data (nonmissing sleep stages):\n- 'Cleaned' No. days/participants =",
round(mean(compliance$nSleepStage_clean),2)," ( SD =",round(sd(compliance$nSleepStage_clean),2),
") \n- sleep stages compliance =",
round(100*mean(compliance$nSleepStage_clean/60),2),"% ( SD =",
round(sd(100*compliance$nSleepStage_clean/60),2),
") \n- non missing sleep stages/Total No. 'clean' cases =",
round(mean(100*compliance$nSleepStage_clean/compliance$nSleep_clean),2),
"% ( SD =",round(sd(100*compliance$nSleepStage_clean/compliance$nSleep_clean),2),")")
##
##
## sleepLog data (nonmissing sleep stages):
## - 'Cleaned' No. days/participants = 47.32 ( SD = 14.57 )
## - sleep stages compliance = 78.87 % ( SD = 24.28 )
## - non missing sleep stages/Total No. 'clean' cases = 85.98 % ( SD = 15.01 )
# plotting No. of cases
hist(compliance$nSleepStage_clean,breaks=50,main="No. of nonmissing sleep stage values per participant",xlab="")
Comments:
the percentage of missing days increases from sleepLog data to the 21% for sleepStage data
sleepHR data include information describing heart rate (HR) mean values for NREM and REM sleep, as computed in section 4.4.
Here, we summarize the No. and % of sleepLog cases with sleep-related HR data.
# computing No. of cases with nonmissing sleep HR data
for(i in 1:nrow(compliance)){
compliance[i,"nSleepHR"] <-
nlevels(as.factor(as.character(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"] &
!is.na(Nsleep$stageHR_NREM),"IDday"]))) }
# printing compliance info
cat("\n\nsleepLog data (nonmissing TST-related HR):\n- 'Cleaned' No. days/participants =",
round(mean(compliance$nSleepHR),2)," ( SD =",round(sd(compliance$nSleepHR),2),
") \n- sleep HR compliance =",
round(mean(100*compliance$nSleepHR/60),2),"% ( SD =",
round(sd(100*compliance$nSleepHR/60),2),
") \n- non missing stageHR_NREM/Total No. 'clean' cases =",
round(mean(100*compliance$nSleepHR/compliance$nSleep_clean),2),
"% ( SD =",round(sd(100*compliance$nSleepHR/compliance$nSleep_clean),2),")")
##
##
## sleepLog data (nonmissing TST-related HR):
## - 'Cleaned' No. days/participants = 47.03 ( SD = 14.3 )
## - sleep HR compliance = 78.39 % ( SD = 23.83 )
## - non missing stageHR_NREM/Total No. 'clean' cases = 85.54 % ( SD = 14.98 )
# plotting No. of cases
hist(compliance$nSleepHR,breaks=50,main="No. of nonmissing stageHR_NREM values per participant",xlab="")
Comments:
sleepHR data were computed in section 4.4 from all the available epochs recorded within each TIB interval. Thus, sleepHR data can be filtered both by accounting for the No. of missing HR epochs in each measure, and by accounting for the range of HR values.
Here, we inspect the distribution of the difference between the No. of minutes in each sleep period (TST), and the total No. of epochs (i.e., minutes) used for computing sleepHR variables (nHR_TST).
# computing and plotting differences between expected and observed No. of TST epochs
NsleepHR <- ema[!is.na(ema$stageHR_NREM),]
NsleepHR$nHR_TST <- apply(NsleepHR[,c("nHR_NREM","nHR_REM")],1,sum,na.rm=TRUE)
NsleepHR$nHR_TSTdiff <- NsleepHR$TST - NsleepHR$nHR_TST
hist(NsleepHR$nHR_TSTdiff,breaks=100,main="TST - nHR_TST (minutes)",xlab="")
# printing info (No. of missing epochs)
nMiss <- c(0,1,10,20,30,50,75,100,150,200,400)
for(i in 1:length(nMiss)){ cat("\n-",nrow(NsleepHR[NsleepHR$nHR_TSTdiff>=nMiss[i],]),
"cases with",nMiss[i],"or more missing HR epochs")}
##
## - 3711 cases with 0 or more missing HR epochs
## - 3364 cases with 1 or more missing HR epochs
## - 1457 cases with 10 or more missing HR epochs
## - 143 cases with 20 or more missing HR epochs
## - 35 cases with 30 or more missing HR epochs
## - 15 cases with 50 or more missing HR epochs
## - 4 cases with 75 or more missing HR epochs
## - 2 cases with 100 or more missing HR epochs
## - 2 cases with 150 or more missing HR epochs
## - 2 cases with 200 or more missing HR epochs
## - 1 cases with 400 or more missing HR epochs
Comments:
a substantial No. of HR values (N = 1,457, 33%) were computed from recordings with 10+ missing epochs (i.e., 10+ min)
a lower No. of cases (N = 143, 3%) were computed from recordings with 20+ missing epochs (i.e., 20+ min)
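The threshold counts reported above can also be obtained without an explicit loop. The following is a minimal sketch on toy values (the diffs vector is hypothetical and stands in for the TST - nHR_TST differences); note that it ignores missing differences through na.rm=TRUE.

```r
# toy differences between expected and observed No. of epochs (hypothetical values)
diffs <- c(0, 0, 3, 12, 25, 60, NA)
nMiss <- c(0, 1, 10, 20, 30, 50)

# counting the cases at or above each threshold, ignoring missing differences
counts <- sapply(nMiss, function(k) sum(diffs >= k, na.rm = TRUE))
names(counts) <- paste0(nMiss, "+")
print(counts)
```

Each element of counts gives the No. of cases with at least that many missing epochs, so the values can only decrease (or stay equal) as the threshold grows.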
Here, we filter sleepHR
values computed from recordings with 50+ missing epochs (i.e., 15 cases with 50+ minutes with no HR).
# identifying sleepHR variables
sleepHRVars <- colnames(ema)[which(colnames(ema)=="stageHR_REM"):which(colnames(ema)=="stageHR_NREM")]
# identifying cases with 50+ missing epochs
toRemove <- levels(as.factor(as.character(NsleepHR[NsleepHR$nHR_TSTdiff>=50,"LogId"])))
cat("No. of cases to be removed =",length(toRemove)) # 15
## No. of cases to be removed = 15
# removing sleepHR variables from cases with 50+ missing epochs
memory <- ema
ema[!is.na(ema$stageHR_NREM) & ema$LogId %in% toRemove, sleepHRVars] <- NA
# print info (15 removed cases)
cat("No. excluded cases with 50+ missing epochs =",
nrow(memory[!is.na(memory$stageHR_NREM),])-nrow(ema[!is.na(ema$stageHR_NREM),]))
## No. excluded cases with 50+ missing epochs = 15
Comments:
TST
interval
Here, we further inspect the differences between the expected and the observed No. of HR epochs in each sleep period.
# computing and plotting differences between expected and observed No. of TST epochs
NsleepHR <- ema[!is.na(ema$stageHR_NREM),]
NsleepHR$nHR_TST <- apply(NsleepHR[,c("nHR_NREM","nHR_REM")],1,sum,na.rm=TRUE)
NsleepHR$nHR_TSTdiff <- NsleepHR$TST - NsleepHR$nHR_TST
hist(NsleepHR$nHR_TSTdiff,breaks=100,main="TST - nHR_TST (minutes)",xlab="")
# printing info (No. of missing epochs)
nMiss <- c(0,1,10,20,30,50,75,100,150,200,400)
for(i in 1:length(nMiss)){ cat("\n-",nrow(NsleepHR[NsleepHR$nHR_TSTdiff>=nMiss[i],]),
"cases with",nMiss[i],"or more missing HR epochs")}
##
## - 3696 cases with 0 or more missing HR epochs
## - 3349 cases with 1 or more missing HR epochs
## - 1442 cases with 10 or more missing HR epochs
## - 128 cases with 20 or more missing HR epochs
## - 20 cases with 30 or more missing HR epochs
## - 0 cases with 50 or more missing HR epochs
## - 0 cases with 75 or more missing HR epochs
## - 0 cases with 100 or more missing HR epochs
## - 0 cases with 150 or more missing HR epochs
## - 0 cases with 200 or more missing HR epochs
## - 0 cases with 400 or more missing HR epochs
Comments:
no sleepHR
measures are computed from recordings with 50+ minutes of missing data
Then, we can apply the same criterion to separately filter stageHR_NREM
and stageHR_REM
, based on the maximum acceptable No. of missing epochs. The HRdata.filter
function is used to optimize the process.
HRdata.filter <- function(EMAdata=NA,nHRvar=NA,HRvar=NA,logId=NA,SLEEPvar=NA,filter=FALSE,maxDiff=NA){
# preparing dataset
colnames(EMAdata)[which(colnames(EMAdata)==nHRvar)] <- "nHRvar"
colnames(EMAdata)[which(colnames(EMAdata)==logId)] <- "logId"
if(length(HRvar)>1){
for(i in 1:length(HRvar)){ colnames(EMAdata)[which(colnames(EMAdata)==HRvar[i])] <- paste("HRvar",i,sep="") }
colnames(EMAdata)[colnames(EMAdata)=="HRvar1"] <- "HRvar"
} else { colnames(EMAdata)[which(colnames(EMAdata)==HRvar)] <- "HRvar" }
if(is.numeric(SLEEPvar)){ EMAdata$SLEEPvar <- SLEEPvar
} else { colnames(EMAdata)[which(colnames(EMAdata)==SLEEPvar)] <- "SLEEPvar" }
NsleepHR <- EMAdata[!is.na(EMAdata$HRvar),]
NsleepHR$diff <- NsleepHR$SLEEPvar - NsleepHR$nHRvar
# printing info (No. of missing epochs)
cat("\n\nNo. of cases with missing HR epochs:\n")
nMiss <- c(0,1,10,20,30,50,75,100,150,200,400)
for(i in 1:length(nMiss)){ cat("\n-",nrow(NsleepHR[NsleepHR$diff>=nMiss[i],]),
"cases with",nMiss[i],"or more missing",nHRvar,"epochs")
if(nrow(NsleepHR[NsleepHR$diff>=nMiss[i],])==0){ break }}
# filtering HRmeasures data based on maxDiff
if(filter==TRUE & !is.na(maxDiff)){
if(is.numeric(maxDiff)){
toRemove <- levels(as.factor(as.character(NsleepHR[NsleepHR$diff>=maxDiff,"logId"])))
} else if(is.character(maxDiff) & substr(maxDiff,nchar(maxDiff),nchar(maxDiff))=="%"){
maxDiff_perc <- as.numeric(gsub("%","",maxDiff))/100
toRemove <- levels(as.factor(as.character(NsleepHR[NsleepHR$diff >= NsleepHR$SLEEPvar*maxDiff_perc,"logId"]))) }
memory <- EMAdata
EMAdata[!is.na(EMAdata$nHRvar) & EMAdata$logId %in% toRemove, c("nHRvar","HRvar")] <- NA
if(length(HRvar)>1){
EMAdata[!is.na(EMAdata$HRvar2) & EMAdata$logId %in% toRemove, paste("HRvar",2:length(HRvar),sep="")] <- NA }
# printing info
cat("\n\nNo. excluded cases with more than",maxDiff,"missing epochs =",
nrow(memory[!is.na(memory$HRvar),])-nrow(EMAdata[!is.na(EMAdata$HRvar),]))
NsleepHR2 <- EMAdata[!is.na(EMAdata$HRvar),]
NsleepHR2$diff <- NsleepHR2$SLEEPvar - NsleepHR2$nHRvar
cat("\n\nUpdated No. of cases with missing HR epochs:\n")
nMiss <- c(0,1,10,20,30,50,75,100,150,200,400)
for(i in 1:length(nMiss)){ cat("\n-",nrow(NsleepHR2[NsleepHR2$diff>=nMiss[i],]),
"cases with",nMiss[i],"or more missing",nHRvar,"epochs")
if(nrow(NsleepHR2[NsleepHR2$diff>=nMiss[i],])==0){ break }}
# plotting
par(mfrow=c(2,1))
hist(NsleepHR$diff,xlab="",breaks=100,main=paste("differences between",SLEEPvar,"and",nHRvar))
hist(NsleepHR2$diff,xlab="",breaks=100,main=paste("Updated differences between",SLEEPvar,"and",nHRvar)) }
# renaming variables
colnames(EMAdata)[which(colnames(EMAdata)=="nHRvar")] <- nHRvar
if(!is.numeric(SLEEPvar)){ colnames(EMAdata)[which(colnames(EMAdata)=="SLEEPvar")] <- SLEEPvar
} else { EMAdata$SLEEPvar <- NULL }
colnames(EMAdata)[which(colnames(EMAdata)=="logId")] <- logId
if(length(HRvar)>1){ colnames(EMAdata)[which(substr(colnames(EMAdata),1,5)=="HRvar")] <- HRvar
} else { colnames(EMAdata)[which(colnames(EMAdata)=="HRvar")] <- HRvar }
return(EMAdata) }
Here, the procedure is applied to both stage-related HR measures by setting the maxDiff
argument to 20 in order to remove all cases with 20 or more missing epochs.
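The function also accepts maxDiff as a percentage string, in which case the threshold is proportional to the expected No. of epochs. The following is a minimal sketch of the two modes on toy values; flagMissing is a hypothetical helper written for illustration and is not part of the original script.

```r
# hypothetical helper: flags sleep periods whose No. of missing epochs reaches maxDiff,
# with maxDiff given either as a number of epochs or as a percentage string (e.g., "10%")
flagMissing <- function(expected, observed, maxDiff){
  diff <- expected - observed
  if(is.numeric(maxDiff)){
    diff >= maxDiff
  } else if(is.character(maxDiff) && grepl("%$", maxDiff)){
    diff >= expected * as.numeric(sub("%$", "", maxDiff)) / 100
  } else { stop("maxDiff should be numeric or a percentage string such as '10%'") }
}

# toy expected and observed No. of epochs (hypothetical values)
expected <- c(100, 300, 60)
observed <- c(85, 275, 58)
flagMissing(expected, observed, maxDiff = 20)     # absolute threshold
flagMissing(expected, observed, maxDiff = "10%")  # relative threshold
```

With the absolute threshold only the second period is flagged (25 missing epochs), whereas with the relative threshold only the first one is flagged (15 missing epochs out of 100 expected).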
# stageHR_REM: 20 (filtering 5 cases)
ema <- HRdata.filter(EMAdata=ema,nHRvar="nHR_REM",HRvar="stageHR_REM",logId="LogId",SLEEPvar="rem",
filter=TRUE,maxDiff=20)
##
##
## No. of cases with missing HR epochs:
##
## - 3520 cases with 0 or more missing nHR_REM epochs
## - 2580 cases with 1 or more missing nHR_REM epochs
## - 12 cases with 10 or more missing nHR_REM epochs
## - 5 cases with 20 or more missing nHR_REM epochs
## - 0 cases with 30 or more missing nHR_REM epochs
##
## No. excluded cases with more than 20 missing epochs = 5
##
## Updated No. of cases with missing HR epochs:
##
## - 3515 cases with 0 or more missing nHR_REM epochs
## - 2575 cases with 1 or more missing nHR_REM epochs
## - 7 cases with 10 or more missing nHR_REM epochs
## - 0 cases with 20 or more missing nHR_REM epochs
# stageHR_NREM: 20 (filtering 39 cases)
ema$nrem <- ema$light + ema$deep
ema <- HRdata.filter(EMAdata=ema,nHRvar="nHR_NREM",HRvar="stageHR_NREM",logId="LogId",SLEEPvar="nrem",
filter=TRUE,maxDiff=20)
##
##
## No. of cases with missing HR epochs:
##
## - 3647 cases with 0 or more missing nHR_NREM epochs
## - 3228 cases with 1 or more missing nHR_NREM epochs
## - 995 cases with 10 or more missing nHR_NREM epochs
## - 39 cases with 20 or more missing nHR_NREM epochs
## - 10 cases with 30 or more missing nHR_NREM epochs
## - 0 cases with 50 or more missing nHR_NREM epochs
##
## No. excluded cases with more than 20 missing epochs = 39
##
## Updated No. of cases with missing HR epochs:
##
## - 3608 cases with 0 or more missing nHR_NREM epochs
## - 3189 cases with 1 or more missing nHR_NREM epochs
## - 956 cases with 10 or more missing nHR_NREM epochs
## - 0 cases with 20 or more missing nHR_NREM epochs
Comments:
5 (0.1%) stageHR_REM
values were excluded due to 20+ missing HR epochs
39 (0.8%) stageHR_NREM
values were excluded due to 20+ missing HR epochs
Finally, we inspect the range of HR values in each stageHR
variable, and we compare HR values with normative cut-offs (i.e., 1st and 99th centiles for ages 15-18y = 43 and 104 bpm, respectively) from Fleming et al. (2011).
sleepHRVars <- paste("stageHR",c("NREM","REM"),sep="_")
for(i in 1:length(sleepHRVars)){
colnames(ema)[which(colnames(ema)==sleepHRVars[i])] <- "sleepHR"
cat("\n\n",sleepHRVars[i],":\n -",nrow(ema[!is.na(ema$sleepHR) & ema$sleepHR < 43,]),"cases with mean HR < 43 bpm (",
round(100*nrow(ema[!is.na(ema$sleepHR) & ema$sleepHR < 43,])/nrow(ema[!is.na(ema$sleepHR),]),1),
"% )\n -",nrow(ema[!is.na(ema$sleepHR) & ema$sleepHR > 104,]),"cases with mean HR > 104 bpm (",
round(100*nrow(ema[!is.na(ema$sleepHR) & ema$sleepHR > 104,])/nrow(ema[!is.na(ema$sleepHR),]),1),
")\n - min HR =",min(ema[!is.na(ema$sleepHR),"sleepHR"]),"- max HR =",max(ema[!is.na(ema$sleepHR),"sleepHR"]),"bpm")
hist(ema$sleepHR,breaks=100,xlab="",main=sleepHRVars[i]); abline(v=c(43,104),col="red")
colnames(ema)[which(colnames(ema)=="sleepHR")] <- sleepHRVars[i] }
##
##
## stageHR_NREM :
## - 28 cases with mean HR < 43 bpm ( 0.6 % )
## - 0 cases with mean HR > 104 bpm ( 0 )
## - min HR = 39.825 - max HR = 100.93 bpm
##
##
## stageHR_REM :
## - 19 cases with mean HR < 43 bpm ( 0.4 % )
## - 0 cases with mean HR > 104 bpm ( 0 )
## - min HR = 41.011 - max HR = 97.297 bpm
Comments:
a small No. of cases (ranging from 19 for stageHR_REM
to 28 for stageHR_NREM
) show HR lower than 43 bpm
no cases show mean HR higher than 104 bpm
Due to the very low No. of cases with extreme values, we decide not to filter sleepHR
data based on HR values.
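The comparison with the normative cut-offs can be sketched in vectorized form as follows. This is only an illustration on toy values: the hr vector is hypothetical, whereas the 43 and 104 bpm cut-offs are those reported in the text.

```r
# toy mean HR values in bpm (hypothetical); cut-offs as reported in the text
hr <- c(41.0, 55.2, 62.8, NA, 99.5, 105.3)
lower <- 43; upper <- 104

# counting nonmissing cases below and above the cut-offs
below <- sum(hr < lower, na.rm = TRUE)
above <- sum(hr > upper, na.rm = TRUE)
nValid <- sum(!is.na(hr))
cat(below, "cases below", lower, "bpm (", round(100 * below / nValid, 1), "% )\n")
cat(above, "cases above", upper, "bpm (", round(100 * above / nValid, 1), "% )\n")
```

Percentages are computed over the nonmissing values only, as in the loop above.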
In summary, from the original No. of sleep logs with nonmissing HR data (N = 5,037), 15 cases were removed due to values computed from recordings with 50+ missing epochs, and further cases were removed due to 20+ missing epochs in stageHR_REM
(N = 5) and stageHR_NREM
(N = 39).
Thus, data cleaning led to a total No. of excluded sleep periods = 54 cases for stageHR_NREM
and 20 cases for stageHR_REM
.
# stageHR_NREM: 54 cases (as expected)
cat("stageHR_NREM: Total No. of excluded cases =",nrow(memory[!is.na(memory$stageHR_NREM),]) - nrow(ema[!is.na(ema$stageHR_NREM),]))
## stageHR_NREM: Total No. of excluded cases = 54
# stageHR_REM: 19 cases (1 less than expected)
cat("stageHR_REM: Total No. of excluded cases =",nrow(memory[!is.na(memory$stageHR_REM),]) - nrow(ema[!is.na(ema$stageHR_REM),]))
## stageHR_REM: Total No. of excluded cases = 19
Here, we compute the updated information on the nonmissing data related to sleepHR
variables.
for(i in 1:length(sleepHRVars)){
# updating compliance dataset
colnames(ema)[which(colnames(ema)==sleepHRVars[i])] <- "sleepHR"
NsleepHR <- ema[!is.na(ema$sleepHR),]
# computing compliance_clean (reference: 60 days)
for(j in 1:nrow(compliance)){
compliance[j,"sleepHR"] <-
nlevels(as.factor(as.character(NsleepHR[as.character(NsleepHR$ID)==compliance[j,"ID"] &
!is.na(NsleepHR$sleepHR),"IDday"]))) }
# printing compliance info
cat("\n\n",sleepHRVars[i],":\n- 'Cleaned' No. days/participants =",
round(mean(compliance$sleepHR),2)," ( SD =",round(sd(compliance$sleepHR),2),
") \n- compliance =",
round(mean(100*compliance$sleepHR/60),2),"% ( SD =",
round(sd(100*compliance$sleepHR/60),2),
") \n- non missing",sleepHRVars[i],"/Total No. 'clean' sleepLog cases =",
round(mean(100*compliance$sleepHR/compliance$nSleep_clean),2),
"% ( SD =",round(sd(100*compliance$sleepHR/compliance$nSleep_clean),2),")")
# plotting No. of cases
hist(compliance$sleepHR,breaks=50,main=paste("No. of nonmissing",sleepHRVars[i],"values per participant"),xlab="")
# back to original variable name
colnames(ema)[which(colnames(ema)=="sleepHR")] <- sleepHRVars[i]
colnames(compliance)[which(colnames(compliance)=="sleepHR")] <- paste("n",sleepHRVars[i],"_clean",sep="") }
##
##
## stageHR_NREM :
## - 'Cleaned' No. days/participants = 46.45 ( SD = 14.13 )
## - compliance = 77.42 % ( SD = 23.56 )
## - non missing stageHR_NREM /Total No. 'clean' sleepLog cases = 84.54 % ( SD = 14.88 )
##
##
## stageHR_REM :
## - 'Cleaned' No. days/participants = 46.8 ( SD = 14.25 )
## - compliance = 77.99 % ( SD = 23.74 )
## - non missing stageHR_REM /Total No. 'clean' sleepLog cases = 85.15 % ( SD = 15.05 )
Comments:
dailyAct
data include information on daily TotalSteps
and physical activity durations. The TotalSteps
variable was recomputed from hourlySteps
data in section 4.1, with only 13 cases included in the dailyAct
but not in the hourlySteps
dataset.
Here, we compute the original No. of dailyAct
days (IDday
) identified by the FC3 device.
cat(nrow(ema[!is.na(ema$TotalSteps),]),"nonmissing cases of dailyAct data") # 6,019
## 6019 nonmissing cases of dailyAct data
To estimate compliance, we compute the ratio of the No. of nonmissing days for each participant to the length of the recording period originally required by the study, that is two months (60 days).
# computing No. of cases with nonmissing dailyAct data
for(i in 1:nrow(compliance)){
compliance[i,"ndailyAct"] <-
nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] &
!is.na(ema$TotalSteps),"IDday"])))
compliance[i,"ndailyAct_sleep"] <-
nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] &
!is.na(ema$TIB) & !is.na(ema$TotalSteps),"IDday"]))) }
# printing compliance info
cat("\n\ndailyAct data:\n- Original No. days/participants =",
round(mean(compliance$ndailyAct),2)," ( SD =",round(sd(compliance$ndailyAct),2),
") \n- Original dailyAct compliance =",
round(mean(100*compliance$ndailyAct/60),2),"% ( SD =",round(sd(100*compliance$ndailyAct/60),2),
") \n- non missing original dailyAct/Total No. Original sleep cases =",
round(mean(100*compliance$ndailyAct_sleep/compliance$nSleep),2),
"% ( SD =",round(sd(100*compliance$ndailyAct_sleep/compliance$nSleep),2),")")
##
##
## dailyAct data:
## - Original No. days/participants = 64.72 ( SD = 5.25 )
## - Original dailyAct compliance = 107.87 % ( SD = 8.75 )
## - non missing original dailyAct/Total No. Original sleep cases = 96.52 % ( SD = 11.59 )
# plotting No. of cases
hist(compliance$ndailyAct,breaks=50,main="No. of nonmissing TotalSteps values per participant",xlab="")
Comments:
compared to the criterion of having two months of continuous recordings, dailyAct
data show an original compliance of 108%, meaning that most participants recorded more than two months of physical activity
as shown in section 1.1, no missing days occurred during the data collection period
96% of sleepLog
data also include dailyAct
data
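The compliance ratio used here (No. of nonmissing days divided by the 60 required days) can be sketched on toy data as follows. The toy data frame is hypothetical; only the ID, ActivityDate, and TotalSteps column names are taken from the text.

```r
# toy long-format data (hypothetical): one row per participant-day
toy <- data.frame(ID = c("s01","s01","s01","s02","s02"),
                  ActivityDate = as.Date(c("2020-01-01","2020-01-02","2020-01-03",
                                           "2020-01-01","2020-01-02")),
                  TotalSteps = c(5000, NA, 7200, 3100, 8800))

# No. of distinct days with nonmissing TotalSteps per participant
valid <- toy[!is.na(toy$TotalSteps), ]
nDays <- tapply(as.character(valid$ActivityDate), valid$ID, function(x) length(unique(x)))

# compliance as a percentage of the 60 required days
compl <- 100 * nDays / 60
round(compl, 2)
```

Because the denominator is fixed at 60 days, participants who recorded for more than two months obtain compliance values above 100%.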
First, we remove the 13 cases that were not computed from HourlySteps
data.
# printing No. of cases
cat(nrow(ema[!is.na(ema$hourlySteps) & ema$hourlySteps==FALSE,]),"cases with no corresponding hourlySteps data")
## 13 cases with no corresponding hourlySteps data
# removing 13 cases with no corresponding hourlySteps data
dailyActVars <- colnames(ema)[which(colnames(ema)=="TotalSteps"):which(colnames(ema)=="hourlySteps")]
memory <- ema
ema[!is.na(ema$hourlySteps) & ema$hourlySteps==FALSE,dailyActVars] <- NA
cat("Removed",nrow(ema[is.na(ema$TotalSteps),])-nrow(memory[is.na(memory$TotalSteps),]),"cases")
## Removed 13 cases
As suggested by Herrmann et al. (2012), the validity of physical activity data should be inspected based on wear time. Here, we implement this approach by using three indicators of wear time:
one based on the TotalActivityMinutes
variable, computed by summing the durations automatically stored in Fitabase
one based on nonmissing diurnalHR
epochs
one based on the No. of hourlySteps
counts higher than zero
First, we compute the actWearTime
variable, simply expressing the TotalActivityMinutes
in hours, that is the total No. of physical activity minutes recorded by the device, according to Fitabase.
# computing actWearTime
ema$actWearTime <- ema$TotalActivityMinutes/60
# plotting
hist(ema$actWearTime,xlab="",
main=paste("actWearTime (hours) - min =",round(min(ema$actWearTime,na.rm=T),1),
"max =",round(max(ema$actWearTime,na.rm=T),1)),breaks=100); abline(v=13,col="red")
Second, we use the HR.1min
dataset to create the variable diurnalHR
as the average of all nonmissing HR values outside sleepLog
StartTime
and EndTime
. This is done with the dayTimeHR
function.
dayTimeHR <- function(SLEEPACTdata=NA,HRdata=NA){ require(tcltk); require(lubridate)
# HR data preparation (creating Hour as the current date + Time from HR.1min)
HRdata$Hour <- as.POSIXct(paste(hour(HRdata$Time),minute(HRdata$Time)),format="%H %M",tz="GMT")
HRdata$IDday <- as.factor(paste(HRdata$ID,substr(HRdata$Time,1,10),sep="_"))
# SLEEP-ACT data preparation (creating IDday and setting first row)
SLEEPACTdata$IDday <- as.factor(paste(SLEEPACTdata$ID,SLEEPACTdata$ActivityDate,sep="_"))
if(!is.na(SLEEPACTdata[1,"TotalSteps"])){
if(!is.na(SLEEPACTdata[1,"TST"])){
# selecting HR data with the same IDday value
HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[1,"IDday"] & HRdata$Time <= SLEEPACTdata[1,"StartTime"],]
} else {
HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[1,"IDday"] &
HRdata$Hour >= as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT") &
HRdata$Hour <= as.POSIXct(paste(substr(Sys.time(),1,10),"23:00:00"),tz="GMT"),] }
SLEEPACTdata[1,c("nEpochsdiurnalHR","diurnalHR.type","diurnalHR")] <- # computing HR variables
data.frame(nEpochsdiurnalHR=nrow(HRday),diurnalHR.type="6.to.23",diurnalHR=mean(HRday$HR)) }
# iteratively computing dayTimeHR
pb <- tkProgressBar("", "%",0, 100, 50) # progress bar
for(i in 2:nrow(SLEEPACTdata)){ IDday <- as.character(SLEEPACTdata[i,"IDday"])
info <- sprintf("%d%% done", round(which(rownames(SLEEPACTdata)==i)/nrow(SLEEPACTdata)*100))
setTkProgressBar(pb, round(which(rownames(SLEEPACTdata)==i)/nrow(SLEEPACTdata)*100), sprintf("Computing mean HR...", info), info)
diurnalHR.type <- NA
if(!is.na(SLEEPACTdata[i,"TotalSteps"])){
# if sleepLog data was NOT missing on day i
if(!is.na(SLEEPACTdata[i,"TST"])){
# if same subject and nonmissing sleepLog on day i-1 --> diurnal HR from the previous EndTime to the current StartTime
if(SLEEPACTdata[i,"ID"]==SLEEPACTdata[i-1,"ID"] & !is.na(SLEEPACTdata[i-1,"TST"]) &
difftime(SLEEPACTdata[i,"ActivityDate"],SLEEPACTdata[i-1,"ActivityDate"],units="days")<=1){
diurnalHR.type <- "TIB.based"
HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[i,"IDday"] &
HRdata$Time >= SLEEPACTdata[i-1,"EndTime"] & HRdata$Time <= SLEEPACTdata[i,"StartTime"],]
} else { # if different subject or missing sleepLog day --> diurnal HR from 6:00 to the current StartTime
diurnalHR.type <- "6.to.start"
HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[i,"IDday"] &
HRdata$Hour >= as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT") &
HRdata$Time <= SLEEPACTdata[i,"StartTime"],]
}} else { # if sleepLog data was missing on day i --> diurnal HR from 6:00 to 23:00
diurnalHR.type <- "6.to.23"
HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[i,"IDday"] &
HRdata$Hour >= as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT") &
HRdata$Hour <= as.POSIXct(paste(substr(Sys.time(),1,10),"23:00:00"),tz="GMT"),] }
# updating dataset
meanHR <- ifelse(nrow(HRday)>0,mean(HRday$HR),NA)
SLEEPACTdata[i,c("nEpochsdiurnalHR","diurnalHR.type","diurnalHR")] <-
data.frame(nEpochsdiurnalHR=nrow(HRday),diurnalHR.type=diurnalHR.type,diurnalHR=meanHR) }}
close(pb)
return(SLEEPACTdata) }
ema <- ema[order(ema$ID,ema$ActivityDate,ema$StartTime),] # sorting by ID and ActivityDate
ema <- dayTimeHR(SLEEPACTdata=ema,HRdata=HR.1min)
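The choice among the three time windows made inside dayTimeHR can be summarized with a small decision helper. This is a simplified sketch: diurnalType is a hypothetical function written for illustration, and it omits the additional check on consecutive dates performed by the original loop.

```r
# hypothetical helper summarizing the three windows used to average diurnal HR
diurnalType <- function(sleepToday, sleepPrevDay, sameID){
  if(!sleepToday){
    "6.to.23"        # no sleepLog on day i: 6 AM to 11 PM
  } else if(sameID && sleepPrevDay){
    "TIB.based"      # previous EndTime to current StartTime
  } else {
    "6.to.start"     # 6 AM to current StartTime
  }
}

diurnalType(sleepToday = TRUE,  sleepPrevDay = TRUE,  sameID = TRUE)
diurnalType(sleepToday = TRUE,  sleepPrevDay = FALSE, sameID = TRUE)
diurnalType(sleepToday = FALSE, sleepPrevDay = TRUE,  sameID = TRUE)
```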
First, we inspect the No. and percentage of missing data.
Nact <- ema[!is.na(ema$TotalSteps),]
cat("Manually computed",nrow(Nact[!is.na(Nact$diurnalHR),]),"diurnalHR values, of which:\n -",
summary(as.factor(Nact$diurnalHR.type))[1],"from 6 AM to 11 PM (missing sleepLog data)\n -",
summary(as.factor(Nact$diurnalHR.type))[2],"from 6 AM to StartTime (missing previous sleepLog data)\n -",
summary(as.factor(Nact$diurnalHR.type))[3],"based on TIB boundaries \n\n -",
nrow(Nact[is.na(Nact$diurnalHR),]),
"missing values (",round(100*nrow(Nact[is.na(Nact$diurnalHR),])/nrow(Nact),1),"% )")
## Manually computed 5604 diurnalHR values, of which:
## - 996 from 6 AM to 11 PM (missing sleepLog data)
## - 408 from 6 AM to StartTime (missing previous sleepLog data)
## - 4602 based on TIB boundaries
##
## - 402 missing values ( 6.7 % )
Comments:
Considering the total No. of cases (non missing either in sleepLog
or in dailyAct
):
diurnalHR
information is missing in 402 cases (6.7%)
most cases (82%) were computed based on TIB
boundaries
a minority of cases were computed from 6 AM to 11 PM (18%) or from 6 AM to sleepLog
StartTime
(7%) due to missing sleepLog
data
Then, we inspect the No. of cases whose nEpochsdiurnalHR
is lower than the distance between diurnalMinutes
boundaries, computed by accounting for diurnalHR.type
.
# computing differences
ema$StartHour <- as.POSIXct(paste(hour(ema$StartTime),minute(ema$StartTime)),format="%H %M",tz="GMT")
for(i in 1:nrow(ema)){
if(!is.na(ema[i,"diurnalHR"])){
if(ema[i,"diurnalHR.type"]=="6.to.23"){ ema[i,"diurnalMinutes"] <- 17*60 # 6.to.23 -> 17 hours
} else if(ema[i,"diurnalHR.type"]=="6.to.start"){ # 6.to.start -> StartHour - 6 AM
timeDiff <- as.numeric(difftime(ema[i,"StartHour"],as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT"),
units="mins"))
if(timeDiff>0){ ema[i,"diurnalMinutes"] <- timeDiff } else { # when StartHour after midnight -> StartHour - 1 day
ema[i,"diurnalMinutes"] <- as.numeric(difftime(ema[i,"StartHour"],
as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),
tz="GMT")-1*60*60*24,
units="mins"))
}} else if(ema[i,"diurnalHR.type"]=="TIB.based"){ # TIB.based -> StartHour - previous EndHour
ema[i,"diurnalMinutes"] <- as.numeric(difftime(ema[i,"StartTime"],ema[i-1,"EndTime"],units="mins")) }}}
ema$diurnalHR.timeDiff <- ema$diurnalMinutes - ema$nEpochsdiurnalHR
# plotting differences
hist(ema$diurnalHR.timeDiff,breaks=100,
main=paste("Differences between diurnalMinutes and No. of diurnalHR epochs \nmin =",
min(ema$diurnalHR.timeDiff,na.rm=TRUE),"max =",max(ema$diurnalHR.timeDiff,na.rm=TRUE)))
# summarizing and showing 125 diurnalHR.timeDiff < 0
cat(nrow(ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff<0,]),"negative differences")
## 125 negative differences
summary(ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff<0,"diurnalHR.timeDiff"]) # from -1 to -0.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000 -1.000 -1.000 -0.784 -0.500 -0.500
# no cases with diurnalHR.timeDiff > 1 day = 24*60 = 1,440 min
cat(nrow(ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff>1440,]),"differences > 1 day")
## 0 differences > 1 day
# summarizing 120 cases with diurnalHR.timeDiff > 1,000 minutes (16.7h)
cat(nrow(ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff>1000,]),"differences > 16.7h")
## 120 differences > 16.7h
ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff>1000,
c("StartTime","EndTime","diurnalMinutes","diurnalHR.type","nEpochsdiurnalHR","diurnalHR.timeDiff")]
Comments:
differences between diurnalMinutes
and the No. of available diurnalHR
epochs range from -1 to 1,400 min (23.3h)
only 125 differences (2%) are negative, ranging from -1 to -0.5 min, probably due to approximations of timing variables
many differences are substantial, with 120 cases (2%) showing more than 1,000 min of missing diurnalHR
epochs, which is interesting for comparing valid and invalid dailyAct
data based on HRWearTime
values
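For the 6.to.start window, the expected No. of diurnal minutes is the time elapsed from 6 AM to the sleep start, wrapping to the following day when sleep starts after midnight. A minimal sketch of this wrap-around logic, using a hypothetical helper (minsSince6) that works on clock hours and minutes rather than on POSIXct values:

```r
# hypothetical helper: minutes elapsed from 6 AM to a sleep start clock time,
# wrapping to the next day when the start time falls after midnight
minsSince6 <- function(startHour, startMinute){
  (startHour * 60 + startMinute - 6 * 60) %% (24 * 60)
}

minsSince6(23, 30)  # sleep start at 11:30 PM
minsSince6(1, 15)   # sleep start at 1:15 AM on the following day
```

Unlike the loop above, a start time of exactly 6:00 yields 0 rather than 1,440 min with this helper, so the two approaches differ in that edge case.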
Here, we compute the HRWearTime
variable quantifying wear time in terms of diurnal hours of nonmissing HR values.
# computing HRWearTime
ema$HRWearTime <- ema$nEpochsdiurnalHR/60
# plotting
hist(ema$HRWearTime,xlab="",
main=paste("HRWearTime (hours) - min =",round(min(ema$HRWearTime,na.rm=T),1),
"max =",round(max(ema$HRWearTime,na.rm=T),1)),breaks=100); abline(v=13,col="red")
Third, we compute the stepsWearTime variable by counting the non-wear time as the No. of hourlySteps
hours with no StepTotal
data (i.e., zero counts) (e.g., see Aadland et al., 2018; Herrmann et al., 2014).
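The zero-count approach to wear time can also be expressed without an explicit loop. The following is a minimal sketch on toy data, assuming 24 hourly rows per participant-day; the hourly data frame and its values are hypothetical, and only the IDday and StepTotal column names come from the text.

```r
# toy hourly data (hypothetical): 24 rows per participant-day
hourly <- data.frame(IDday = rep(c("s01_2020-01-01","s01_2020-01-02"), each = 24),
                     StepTotal = c(rep(0, 8), rep(150, 16),    # 8 zero-count hours
                                   rep(0, 14), rep(90, 10)))   # 14 zero-count hours

# No. of zero-count hours per IDday, then wear time as 24 - nZeroCounts
nZero <- tapply(hourly$StepTotal == 0, hourly$IDday, sum)
stepsWearTime <- 24 - nZero
stepsWearTime
```

In this toy example the first day reaches the 13-hour criterion (16h of wear time), whereas the second does not (10h).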
# counting No. of zero counts per day (i.e., 24h periods)
ema$IDday <- as.factor(paste(ema$ID,ema$ActivityDate,sep="_"))
hourlySteps$IDday <- as.factor(paste(hourlySteps$ID,hourlySteps$ActivityDate,sep="_"))
for(i in 1:nrow(ema)){
ema[i,"nZeroCounts"] <- nrow(hourlySteps[hourlySteps$IDday==as.character(ema[i,"IDday"])
& hourlySteps$StepTotal==0,]) }
ema[is.na(ema$TotalSteps),"nZeroCounts"] <- NA
# computing stepsWearTime (24 - nZeroCounts)
ema$stepsWearTime <- 24 - ema$nZeroCounts
# plotting
hist(ema$stepsWearTime,xlab="",
main=paste("stepsWearTime (hours) - min =",round(min(ema$stepsWearTime,na.rm=T),1),
"max =",round(max(ema$stepsWearTime,na.rm=T),1)),breaks=100); abline(v=13,col="red")
Here, we use the actWearTime
, HRWearTime
, and stepsWearTime
variables to inspect the validity of dailyAct
data. Specifically, we focus on the 13-hour criterion, as 13h+ of wear time were recommended as a reliable approximation of 14h/day accelerometer data collection, which is the average wear time in large studies using accelerometers (see Herrmann et al., 2013; 2014; Quante et al., 2015).
# plotting
hist(ema$stepsWearTime,xlab="",col=rgb(0,1,0,alpha=0.5),breaks=100,
main=paste("actWearTime (blue), HRWearTime (red), and stepsWearTime (green) in hours"))
hist(ema$actWearTime,add=TRUE,col=rgb(0,0,1,alpha=0.5),breaks=100) ; abline(v=13,col="red")
hist(ema$HRWearTime,add=TRUE,col=rgb(1,0,0,alpha=0.5),breaks=100) ; abline(v=13,col="red")
# printing info
Nact <- ema[!is.na(ema$TotalSteps),]; n = nrow(Nact)
cat("No. of cases with wearTime < 13h: \n 1) based on dailyAct data:",nrow(Nact[Nact$actWearTime<13,]),
"(",round(100*nrow(Nact[Nact$actWearTime<13,])/n,1),"% )\n 2) based on diurnalHR data:",
nrow(Nact[Nact$HRWearTime<13,]),"(",round(100*nrow(Nact[Nact$HRWearTime<13,])/n,1),
"% )\n 3) based on hourlySteps non-zero counts:",nrow(Nact[Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$stepsWearTime<13,])/n,1),
"% )\n 4) based on 1) and 2):", nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13,]),"(",
round(100*nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13,])/n,1),
"% )\n 5) based on 1) and 3):",nrow(Nact[Nact$actWearTime<13 & Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$actWearTime<13 & Nact$stepsWearTime<13,])/n,1),
"% )\n 6) based on 2) and 3):",nrow(Nact[Nact$HRWearTime<13 & Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$HRWearTime<13 & Nact$stepsWearTime<13,])/n,1),
"% )\n 7) based on all criteria:",nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13 & Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13 & Nact$stepsWearTime<13,])/n,1),
"% )\n 8) based on at least one criterion:",
nrow(Nact[Nact$actWearTime<13 | Nact$HRWearTime<13 | Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$actWearTime<13 | Nact$HRWearTime<13 | Nact$stepsWearTime<13,])/n,1),"% )")
## No. of cases with wearTime < 13h:
## 1) based on dailyAct data: 889 ( 14.8 % )
## 2) based on diurnalHR data: 2311 ( 38.5 % )
## 3) based on hourlySteps non-zero counts: 1418 ( 23.6 % )
## 4) based on 1) and 2): 278 ( 4.6 % )
## 5) based on 1) and 3): 109 ( 1.8 % )
## 6) based on 2) and 3): 1342 ( 22.3 % )
## 7) based on all criteria: 97 ( 1.6 % )
## 8) based on at least one criterion: 2986 ( 49.7 % )
Comments:
almost half of the data (49.7%) show less than 13h of wear time based on one or more wear time criteria
the most conservative criterion is diurnalHR
data, with 38.5% of cases showing less than 13h of wear time
the least conservative criterion is actWearTime
, with 14.8% of cases showing less than 13h of wear time
Since we rely more on high-resolution data than on aggregate scores, we apply the stepsWearTime
criterion while accounting for HRWearTime
, that is we keep those cases with stepsWearTime
< 13h but HRWearTime
>= 13h. Thus, we remove the 1,342 cases (22.3%) that fail both criteria.
# identifying dailyAct variables
dailyActVars <- colnames(ema)[which(colnames(ema)=="TotalSteps"):which(colnames(ema)=="hourlySteps")]
# identifying cases with both stepsWearTime and HRWearTime < 13h
toRemove <- levels(as.factor(as.character(Nact[Nact$stepsWearTime<13 & Nact$HRWearTime<13,"IDday"])))
cat("No. of cases to be removed =",length(toRemove))
## No. of cases to be removed = 1342
# removing dailyAct variables from cases with both stepsWearTime and HRWearTime < 13h
memory <- ema
ema[!is.na(ema$TotalSteps) & ema$IDday %in% toRemove, dailyActVars] <- NA
# print info (1,342 removed cases)
cat("No. excluded cases with stepsWearTime AND HRWearTime < 13h =",
nrow(memory[!is.na(memory$TotalSteps),])-nrow(ema[!is.na(ema$TotalSteps),]))
## No. excluded cases with stepsWearTime AND HRWearTime < 13h = 1342
# plotting TotalSteps
hist(memory$TotalSteps,xlab="",breaks=100,col=rgb(0,0,1),main="TotalSteps in original (blue) and filtered data (red)")
hist(ema$TotalSteps,xlab="",breaks=100,col=rgb(1,0,0),add=TRUE)
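The filtering rule applied above drops only the days that fall below 13h on both stepsWearTime and HRWearTime, whereas days failing a single criterion are kept. A minimal sketch on toy data (the wear data frame and its values are hypothetical):

```r
# toy wear time values in hours (hypothetical)
wear <- data.frame(IDday = paste0("d", 1:4),
                   stepsWearTime = c(15, 10, 9, 14),
                   HRWearTime    = c(16, 14, 11, 8))

# days to be removed: less than 13h on both indicators
toDrop <- wear$stepsWearTime < 13 & wear$HRWearTime < 13
wear$IDday[toDrop]
```

Only day d3 is removed, since d2 fails the stepsWearTime criterion alone and d4 fails the HRWearTime criterion alone.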
And we check again the three wear time criteria.
# excluding wearTime variables from filtered cases
ema[is.na(ema$TotalSteps),c("actWearTime","HRWearTime","stepsWearTime")] <- NA
# plotting
hist(ema$stepsWearTime,xlab="",col=rgb(0,1,0,alpha=0.5),breaks=100,
main=paste("actWearTime (blue), HRWearTime (red), and stepsWearTime (green) in hours"))
hist(ema$actWearTime,add=TRUE,col=rgb(0,0,1,alpha=0.5),breaks=100) ; abline(v=13,col="red")
hist(ema$HRWearTime,add=TRUE,col=rgb(1,0,0,alpha=0.5),breaks=100) ; abline(v=13,col="red")
# printing info
Nact <- ema[!is.na(ema$TotalSteps),]; n = nrow(Nact)
cat("No. of cases with wearTime < 13h: \n 1) based on dailyAct data:",nrow(Nact[Nact$actWearTime<13,]),
"(",round(100*nrow(Nact[Nact$actWearTime<13,])/n,1),"% )\n 2) based on diurnalHR data:",
nrow(Nact[Nact$HRWearTime<13,]),"(",round(100*nrow(Nact[Nact$HRWearTime<13,])/n,1),
"% )\n 3) based on hourlySteps non-zero counts:",nrow(Nact[Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$stepsWearTime<13,])/n,1),
"% )\n 4) based on 1) and 2):", nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13,]),"(",
round(100*nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13,])/n,1),
"% )\n 5) based on 1) and 3):",nrow(Nact[Nact$actWearTime<13 & Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$actWearTime<13 & Nact$stepsWearTime<13,])/n,1),
"% )\n 6) based on 2) and 3):",nrow(Nact[Nact$HRWearTime<13 & Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$HRWearTime<13 & Nact$stepsWearTime<13,])/n,1),
"% )\n 7) based on all criteria:",nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13 & Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13 & Nact$stepsWearTime<13,])/n,1),
"% )\n 8) based on at least one criterion:",
nrow(Nact[Nact$actWearTime<13 | Nact$HRWearTime<13 | Nact$stepsWearTime<13,]),"(",
round(100*nrow(Nact[Nact$actWearTime<13 | Nact$HRWearTime<13 | Nact$stepsWearTime<13,])/n,1),"% )")
## No. of cases with wearTime < 13h:
## 1) based on dailyAct data: 792 ( 17 % )
## 2) based on diurnalHR data: 969 ( 20.8 % )
## 3) based on hourlySteps non-zero counts: 76 ( 1.6 % )
## 4) based on 1) and 2): 181 ( 3.9 % )
## 5) based on 1) and 3): 12 ( 0.3 % )
## 6) based on 2) and 3): 0 ( 0 % )
## 7) based on all criteria: 0 ( 0 % )
## 8) based on at least one criterion: 1644 ( 35.2 % )
Comments:
we filtered 1,342 cases (22.3%), leading to a dramatic reduction of the No. of cases with TotalSteps = 0 (from 196 to 7)
no more cases have less than 13h of wear time based on both the stepsWearTime
and the HRWearTime
criteria, suggesting that the data filtering was effective
35.2% of the cases still show less than 13h of wear time based on one or more criteria
Here, as done for sleepLog
data in section 5.2.2.3, we take a closer look at cases of temporally isolated days of dailyAct
measures, that is dailyAct
data recorded substantially later than all other dailyAct
data previously recorded from the same participant.
# computing LAG values for ActivityDate
library(dplyr)
ema <- ema[order(ema$ID,ema$ActivityDate,ema$StartTime),]
Nact <- ema[!is.na(ema$TotalSteps),]
Nact <- Nact %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nact <- as.data.frame(Nact)
detach("package:dplyr", unload=TRUE)
# computing and plotting time lags
Nact$lag <- as.numeric(difftime(Nact$ActivityDate,Nact$AD_lag,units="days"))
hist(Nact$lag,main="Consecutive missing days day(i) - day(i-1)",breaks=100)
# printing info
n <- c(10,15,20,30,50)
for(i in 1:length(n)){ cat("\n- No. cases with >",n[i],"consecutive missing days =",
nrow(Nact[(!is.na(Nact$lag) & Nact$lag>n[i]),])) }
##
## - No. cases with > 10 consecutive missing days = 7
## - No. cases with > 15 consecutive missing days = 3
## - No. cases with > 20 consecutive missing days = 2
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0
# showing the 7 cases with more than 10 missing days
Nact.vars <- c("ID","ActivityDate","lag",dailyActVars)
isolatedDay <- as.data.frame(matrix(nrow=0,ncol=length(Nact.vars)+1))
for(i in 1:nrow(Nact)){
if(!is.na(Nact[i,"lag"]) & Nact[i,"lag"]>10){
isolatedDay <- rbind(isolatedDay,Nact[(i-2):(i+2),c(Nact.vars,"TIB")]) } }
isolatedDay[,c("ID","ActivityDate","lag","TotalSteps","TIB")]
Comments:
contrary to sleepLog
, and consistently with what was observed in section 2.1, dailyAct
data do not show cases of isolated recording days
instead, all cases of consecutive missing days are observed within each participant's data collection period (not at the beginning or at the end)
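The per-participant gap computation can also be sketched in base R, without attaching and detaching dplyr. This is only a minimal illustration: the `toy` data frame and its values are hypothetical (not study data), and it assumes that `ActivityDate` is stored as a `Date`.

```r
# toy data: two hypothetical participants with daily records
toy <- data.frame(
  ID = c("a", "a", "a", "b", "b"),
  ActivityDate = as.Date(c("2020-01-01", "2020-01-02", "2020-01-20",
                           "2020-03-01", "2020-03-02")))

# sorting by participant and date before computing the gaps
toy <- toy[order(toy$ID, toy$ActivityDate), ]

# days elapsed since the previous record of the same participant
# (NA for the first record of each participant)
toy$lag <- ave(as.numeric(toy$ActivityDate), toy$ID,
               FUN = function(x) c(NA, diff(x)))

# records preceded by a gap longer than 10 days
toy[!is.na(toy$lag) & toy$lag > 10, ]
```

Here `ave()` applies the function within each participant and returns the values in the original row order, so the result can be assigned directly as a new column.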
Finally, we check again for duplicated cases, that is cases with the same ID
and ActivityDate
values.
# creating IDday variable
ema$IDday <- as.factor(paste(ema$ID,ema$ActivityDate,sep="_"))
Nact <- ema[!is.na(ema$TotalSteps),]
# sanity check by IDday (0 cases)
cat("Sanity check:",nrow(Nact[duplicated(Nact$IDday),])==0)
## Sanity check: TRUE
Comments:
no duplicated cases are included in the ema
dataset for dailyAct
variables
In summary, from the original No. of dailyAct
measures (N = 6,019 days, corresponding to a compliance higher than 100%), 13 cases (0.2%) were excluded due to no corresponding hourlySteps
data, and 1,342 cases (22%) were removed due to less than 13h of wear time based on hourlySteps
and diurnalHR
data.
Thus, data cleaning led to a total No. of excluded dailyAct cases = 1,355 (22%).
cat(nrow(ema[!is.na(ema$TotalSteps),]),"'cleaned' cases of dailyAct data") # 4,664
## 4664 'cleaned' cases of dailyAct data
Here, we compute the updated information on the nonmissing data related to dailyAct
variables.
# updating compliance dataset
Nact <- ema[!is.na(ema$TotalSteps),]
# computing compliance_clean (reference: 60 days)
for(i in 1:nrow(compliance)){
compliance[i,"ndailyAct_clean"] <-
nlevels(as.factor(as.character(Nact[as.character(Nact$ID)==compliance[i,"ID"],"IDday"])))
compliance[i,"ndailyAct_sleep_clean"] <-
nlevels(as.factor(as.character(Nact[as.character(Nact$ID)==compliance[i,"ID"] &
!is.na(Nact$TIB) & !is.na(Nact$TotalSteps),"IDday"]))) }
compliance$compl.act_clean <- 100*compliance$ndailyAct_clean/60 # % of days on two months
# printing compliance info
cat("\n\ndailyAct data:\n- 'Clean' No. days/participants =",
round(mean(compliance$ndailyAct_clean),2)," ( SD =",round(sd(compliance$ndailyAct_clean),2),
") \n- 'clean' dailyAct compliance =",
round(mean(100*compliance$ndailyAct_clean/60),2),"% ( SD =",
round(sd(100*compliance$ndailyAct_clean/60),2),
") \n- 'clean' non missing sleep stages/Total No. Original sleep cases =",
round(mean(100*compliance$ndailyAct_sleep_clean/compliance$nSleep_clean),2),
"% ( SD =",round(sd(100*compliance$ndailyAct_sleep_clean/compliance$nSleep),2),")")
##
##
## dailyAct data:
## - 'Clean' No. days/participants = 50.15 ( SD = 15.97 )
## - 'clean' dailyAct compliance = 83.58 % ( SD = 26.62 )
## - 'clean' non missing sleep stages/Total No. Original sleep cases = 82.8 % ( SD = 22.75 )
# plotting No. of cases
hist(compliance$ndailyAct_clean,breaks=50,main="No. of nonmissing TotalSteps values per participant",xlab="")
Comments:
data cleaning substantially decreased the No. of available dailyAct
observations
compared to the criterion of having two months of continuous recordings, dailyAct
compliance decreased from 107 to 84%, which is slightly lower than that shown by sleepLog
and dailyDiary
data
the average percentage of sleepLog
data also including dailyAct
data decreased from 98 to 83%
The final variable to be ‘cleaned’ is dailyDiary
, consisting of the day-by-day participants’ self-reports of three core variables, namely dailyStress
, eveningMood
, and eveningWorry
, as recorded with the Survey Sparrow mobile application.
From the original No. of available dailyDiary
cases (N = 5,133) a total of 188 cases were removed because they were duplicated responses or for other reasons (see below), leading to an actual No. of 4,945 nonmissing dailyDiary cases.
Here, we compute the original No. of dailyDiary days (IDday
) recorded with Survey Sparrow.
cat(nrow(ema[!is.na(ema$StartedTime),]),"nonmissing cases of dailyDiary data") # 4,945
## 4945 nonmissing cases of dailyDiary data
To estimate compliance, we compute the ratio of the No. of nonmissing days for each participant to the length of the recording period originally required by the study, that is, two months (60 days).
# computing No. of cases with nonmissing dailyDiary data
Ndiary <- ema[!is.na(ema$StartedTime),]
for(i in 1:nrow(compliance)){
compliance[i,"nDiary"] <- nlevels(as.factor(as.character(Ndiary[as.character(Ndiary$ID)==compliance[i,"ID"],
"IDday"])))
compliance[i,"periodDiary"] <-
difftime(max(Ndiary[as.character(Ndiary$ID)==compliance[i,"ID"],"StartedTime"]),
min(Ndiary[as.character(Ndiary$ID)==compliance[i,"ID"],"StartedTime"]),
units="days")
compliance[i,"nDiary_sleep"] <-
nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] &
!is.na(ema$TIB) & !is.na(ema$StartedTime),"IDday"])))
compliance[i,"nDiary_sleepAct"] <-
nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] &
!is.na(ema$TIB) & !is.na(ema$TotalSteps) & !is.na(ema$StartedTime),"IDday"]))) }
# printing compliance info
cat("\n\ndailyDiary data:\n- Original No. days/participants =",
round(mean(compliance$nDiary),2)," ( SD =",round(sd(compliance$nDiary),2),
") \n- Original dailyDiary compliance =",
round(mean(100*compliance$nDiary/60),2),"% ( SD =",round(sd(100*compliance$nDiary/60),2),
") \n- No. of missing days =",
round(mean(as.numeric(compliance$periodDiary)-compliance$nDiary),2),
" ( SD =",round(sd(as.numeric(compliance$periodDiary)-compliance$nDiary),2),
") \n- non missing original diary cases/Total No. Original sleep cases =",
round(mean(100*compliance$nDiary_sleep/compliance$nSleep),2),
"% ( SD =",round(sd(100*compliance$nDiary_sleep/compliance$nSleep),2),
") \n- non missing original diary cases/Total No. Original sleep AND dailyAct cases =",
round(mean(100*compliance$nDiary_sleepAct/compliance$nSleep),2),
"% ( SD =",round(sd(100*compliance$nDiary_sleepAct/compliance$nSleep),2),")")
##
##
## dailyDiary data:
## - Original No. days/participants = 53.17 ( SD = 10.17 )
## - Original dailyDiary compliance = 88.62 % ( SD = 16.94 )
## - No. of missing days = 9.05 ( SD = 10.78 )
## - non missing original diary cases/Total No. Original sleep cases = 82.94 % ( SD = 17.81 )
## - non missing original diary cases/Total No. Original sleep AND dailyAct cases = 70.36 % ( SD = 23.57 )
# plotting No. of cases
hist(compliance$nDiary,breaks=50,main="No. of nonmissing dailyDiary values per participant",xlab="")
Comments:
compared to the criterion of having two months of continuous recordings, dailyDiary
data show an original mean compliance of 88.6%, slightly lower than that originally shown by sleepLog data (92.3%)
an average of 9 missing days, lower than that shown by sleepLog
data, is observed
83% of sleepLog
data also include dailyDiary
data, which are included in 70.4% of the cases with nonmissing sleepLog
AND dailyAct
data
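The compliance ratio used throughout this section can be sketched as follows. The `toy` data frame is hypothetical (not study data) and only illustrates the computation: the No. of distinct recording days per participant divided by the 60 days required by the protocol.

```r
# hypothetical long-format records: one row per participant-day
toy <- data.frame(
  ID = rep(c("a", "b"), times = c(66, 30)),
  ActivityDate = c(as.Date("2020-01-01") + 0:65,
                   as.Date("2020-01-01") + 0:29))

# No. of distinct recording days per participant
nDays <- tapply(toy$ActivityDate, toy$ID, function(x) length(unique(x)))

# compliance as % of the 60 days required by the study protocol
compl <- 100 * nDays / 60
round(compl, 1)
```

Because the denominator is fixed at 60 days, participants who recorded for longer than two months get values above 100%, which is how a mean compliance such as the 107% originally observed for dailyAct can arise.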
dailyDiary
data were already filtered in section 2.7:
43 double responses with the same StartedTime
values were excluded
a further 139 double responses with the same ID
and ActivityDate
values were excluded
6 cases were excluded because surveyDuration
was longer than 17h (i.e., it was unclear whether the responses referred to the current or the following day)
In summary, a total of 188 cases were excluded mainly due to double responses.
# printing info
cat("No. of excluded cases =",5133-nrow(Ndiary)) # 188
## No. of excluded cases = 188
Here, as done for sleepLog
data in section 5.2.2.3, we take a closer look at cases of temporally isolated days of dailyDiary
measures, that is dailyDiary
data recorded substantially later than all other dailyDiary
data previously recorded from the same participant.
# computing LAG values for ActivityDate
library(dplyr)
Ndiary <- Ndiary[order(Ndiary$ID,Ndiary$ActivityDate,Ndiary$StartedTime),]
Ndiary <- Ndiary %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Ndiary <- as.data.frame(Ndiary)
detach("package:dplyr", unload=TRUE)
# computing and plotting time lags
Ndiary$lag <- as.numeric(difftime(Ndiary$ActivityDate,Ndiary$AD_lag,units="days"))
hist(Ndiary$lag,main="Consecutive missing days day(i) - day(i-1)",breaks=100)
# printing info
n <- c(5,7,10,15,20,30,50)
for(i in 1:length(n)){ cat("\n- No. cases with >",n[i],"consecutive missing days =",
nrow(Ndiary[(!is.na(Ndiary$lag) & Ndiary$lag>n[i]),])) }
##
## - No. cases with > 5 consecutive missing days = 23
## - No. cases with > 7 consecutive missing days = 7
## - No. cases with > 10 consecutive missing days = 2
## - No. cases with > 15 consecutive missing days = 1
## - No. cases with > 20 consecutive missing days = 0
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0
# showing 7 cases with more than 7 missing days
Ndiary.vars <- c("ID","ActivityDate","lag",
colnames(ema)[which(colnames(ema)=="StartedTime"):which(colnames(ema)=="diary.timeDiff")])
isolatedDay <- as.data.frame(matrix(nrow=0,ncol=length(Ndiary.vars)+1))
for(i in 1:nrow(Ndiary)){
if(!is.na(Ndiary[i,"lag"]) & Ndiary[i,"lag"]>7){
isolatedDay <- rbind(isolatedDay,Ndiary[(i-2):(i+2),c(Ndiary.vars,"TIB","TotalSteps")]) } }
isolatedDay[,c("ID","ActivityDate","lag","StartedTime","dailyStress","eveningMood","TIB","TotalSteps")]
Comments:
consecutive missing days of dailyDiary
data (i.e., No. of days between each and the preceding observation from the same participant) are higher than 7 only for 7 cases, with only two cases showing consecutive missing days > 10 days
only one of these cases (ID
s052, StartedTime
2019-12-01 21:09:00) is due to an isolated final day recorded several days later than the previous observation from the same participant, with no corresponding sleepLog
or dailyAct
values
Here, we manually remove this single case:
# selecting dailyDiary variables
diaryVars <- colnames(ema)[which(colnames(ema)=="StartedTime"):which(colnames(ema)=="diary.timeDiff")]
# manually removing dailyDiary variables from 1 case
memory <- ema
ema[!is.na(ema$StartedTime) & ema$ID=="s052" & as.character(ema$StartedTime)=="2019-12-01 21:09:00",
diaryVars] <- NA
# print info (1 removed case)
cat("No. excluded cases of isolated days =",
nrow(memory[!is.na(memory$StartedTime),])-nrow(ema[!is.na(ema$StartedTime),]))
## No. excluded cases of isolated days = 1
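Note that the step above sets the dailyDiary variables of the selected row to NA instead of dropping the row, presumably so that any other class of data stored on the same row of ema is preserved. A minimal sketch of this pattern, using a hypothetical `toy` data frame and a hypothetical `blockVars` vector (not study data):

```r
toy <- data.frame(ID = c("a", "a", "b"),
                  TIB = c(480, NA, 450),
                  StartedTime = c("2020-01-01 21:00:00", "2020-01-19 21:05:00",
                                  "2020-01-01 21:10:00"),
                  dailyStress = c(2, 3, 1),
                  stringsAsFactors = FALSE)
blockVars <- c("StartedTime", "dailyStress")

# rows to blank: which() drops NA comparisons, so rows with a missing
# StartedTime are never selected by accident
sel <- which(toy$ID == "a" & toy$StartedTime == "2020-01-19 21:05:00")
toy[sel, blockVars] <- NA

nrow(toy)                     # the No. of rows is unchanged
sum(!is.na(toy$StartedTime))  # No. of nonmissing diary cases after blanking
```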
In summary, we removed only 1 isolated observation recorded 17 days after the remaining observations obtained from the same participant. After this removal, only one case shows more than 10 consecutive missing days.
library(dplyr)
Ndiary <- ema[!is.na(ema$StartedTime),]
Ndiary <- Ndiary[order(Ndiary$ID,Ndiary$ActivityDate,Ndiary$StartedTime),]
Ndiary <- Ndiary %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Ndiary <- as.data.frame(Ndiary)
detach("package:dplyr", unload=TRUE)
# printing info
Ndiary$lag <- as.numeric(difftime(Ndiary$ActivityDate,Ndiary$AD_lag,units="days"))
for(i in 1:length(n)){ cat("\n- No. cases with >",n[i],"consecutive missing days =",
nrow(Ndiary[(!is.na(Ndiary$lag) & Ndiary$lag>n[i]),])) }
##
## - No. cases with > 5 consecutive missing days = 22
## - No. cases with > 7 consecutive missing days = 6
## - No. cases with > 10 consecutive missing days = 1
## - No. cases with > 15 consecutive missing days = 0
## - No. cases with > 20 consecutive missing days = 0
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0
Then, we check again for duplicated cases, that is cases with the same ID
and ActivityDate
values.
# creating IDday variable
Ndiary <- ema[!is.na(ema$StartedTime),]
Ndiary$IDday <- as.factor(paste(Ndiary$ID,Ndiary$ActivityDate,sep="_"))
# sanity check by IDday (0 cases)
cat("Sanity check:",nrow(Ndiary[duplicated(Ndiary$IDday),])==0)
## Sanity check: TRUE
# No. of duplicated cases in general
cat("Sanity check:",nrow(ema[duplicated(ema$IDday),])==0)
## Sanity check: TRUE
Comments:
no duplicated cases are included in the ema
dataset for dailyDiary
variables
in general, there are no cases of duplicated days
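As a minimal sketch, the same duplicate check can be applied directly to the two key columns, without building the pasted IDday key. The `toy` data frame is hypothetical and only illustrates the logic.

```r
toy <- data.frame(
  ID = c("a", "a", "b"),
  ActivityDate = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01")))
keys <- toy[, c("ID", "ActivityDate")]

# TRUE for the second and later occurrences of an ID-by-day combination
dup <- duplicated(keys)
sum(dup) == 0   # FALSE here: the toy data contain one duplicated day

# all rows involved in a duplication, including the first occurrence
toy[dup | duplicated(keys, fromLast = TRUE), ]
```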
Finally, we check the No. of missing responses at the three core variables, namely dailyStress
, eveningMood
, and eveningWorry
, as well as the remaining dailyDiary
variables.
# printing info
Ndiary <- ema[!is.na(ema$StartedTime),]
n <- nrow(Ndiary)
cat("- No. of cases with missing dailyStress:",nrow(Ndiary[is.na(Ndiary$dailyStress),]),
"(",round(100*nrow(Ndiary[is.na(Ndiary$dailyStress),])/n,1),
"% )\n - No. of cases with missing eveningMood:",nrow(Ndiary[is.na(Ndiary$eveningMood),]),
"(",round(100*nrow(Ndiary[is.na(Ndiary$eveningMood),])/n,1),
"% )\n - No. of cases with missing eveningWorry:",nrow(Ndiary[is.na(Ndiary$eveningWorry),]),
"(",round(100*nrow(Ndiary[is.na(Ndiary$eveningWorry),])/n,1),
"% )\n - all missing:", nrow(Ndiary[is.na(Ndiary$dailyStress) &
is.na(Ndiary$eveningMood) & is.na(Ndiary$eveningWorry),]),
"(",round(100*nrow(Ndiary[is.na(Ndiary$dailyStress) & is.na(Ndiary$eveningMood) &
is.na(Ndiary$eveningWorry),])/n,1),"% )")
## - No. of cases with missing dailyStress: 4 ( 0.1 % )
## - No. of cases with missing eveningMood: 6 ( 0.1 % )
## - No. of cases with missing eveningWorry: 5 ( 0.1 % )
## - all missing: 0 ( 0 % )
# showing cases with 1+ missing in core variables
Ndiary[is.na(Ndiary$dailyStress) | is.na(Ndiary$eveningWorry) | is.na(Ndiary$eveningMood),
c("ID","StartedTime","dailyStress","eveningMood","eveningWorry","TIB","TotalSteps","stageHR_NREM")]
Comments:
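missing responses to each of the three core variables are observed in only 0.1% of cases, and no case has missing responses to all the core dailyDiary
variables
Here, we remove these 12 cases with missing responses to one or more dailyDiary
core variables.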
# removing 12 cases with missing responses to one or more core dailyDiary variables
memory <- ema
ema[!is.na(ema$StartedTime) & (is.na(ema$dailyStress) | is.na(ema$eveningWorry) | is.na(ema$eveningMood)),
diaryVars] <- NA
cat("Removed",nrow(ema[is.na(ema$StartedTime),])-nrow(memory[is.na(memory$StartedTime),]),"cases")
## Removed 12 cases
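The per-variable counts reported above (4 + 6 + 5 = 15 missing responses) add up to more than the 12 removed cases, which suggests that some cases miss more than one core variable. A minimal sketch of how the two kinds of counts relate, on a hypothetical `toy` data frame (not study data):

```r
toy <- data.frame(dailyStress  = c(1, NA, 3, 2),
                  eveningMood  = c(4, 3, NA, 5),
                  eveningWorry = c(2, 2, NA, 1))
coreVars <- c("dailyStress", "eveningMood", "eveningWorry")

# No. of missing core variables in each case
nMissing <- rowSums(is.na(toy[, coreVars]))

sum(nMissing >= 1)                 # cases with 1+ missing core variables
sum(nMissing == length(coreVars))  # cases with all core variables missing
colSums(is.na(toy[, coreVars]))    # missing responses per variable
```

In the toy data, three missing responses are spread over two cases, so counting by variable and counting by case give different totals.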
Finally, we inspect the No. of missing data in the remaining dailyDiary
variables.
summary(ema[!is.na(ema$StartedTime),diaryVars])
## StartedTime SubmittedTime surveyDuration
## Min. :2019-01-07 21:00:00 Min. :2019-01-07 21:01:00 Min. : 0.0000
## 1st Qu.:2019-06-21 21:19:30 1st Qu.:2019-06-21 21:19:30 1st Qu.: 0.0000
## Median :2019-12-24 00:10:30 Median :2019-12-24 00:10:30 Median : 0.0000
## Mean :2020-02-13 20:54:59 Mean :2020-02-13 20:55:53 Mean : 0.8974
## 3rd Qu.:2020-10-17 22:06:45 3rd Qu.:2020-10-17 22:08:15 3rd Qu.: 1.0000
## Max. :2021-04-30 21:00:00 Max. :2021-04-30 21:00:00 Max. :758.0000
##
## dailyStress eveningWorry eveningMood stress_school stress_family
## Min. :1.000 Min. :1.000 Min. :1.000 0 :2617 0 :3440
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1 :2141 1 : 404
## Median :2.000 Median :2.000 Median :4.000 NA's: 174 NA's:1088
## Mean :2.213 Mean :2.245 Mean :3.606
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
## stress_health stress_COVID stress_peers stress_other stress_total worry_school
## 0 :2628 0 : 772 0 :2832 0 :3556 0:1554 0 :2500
## 1 : 309 1 : 99 1 : 537 1 :1277 1:2286 1 :2241
## NA's:1995 NA's:4061 NA's:1563 NA's: 99 2: 849 NA's: 191
## 3: 201
## 4: 34
## 5: 4
## 6: 4
## worry_family worry_health worry_peer worry_COVID worry_sleep worry_other
## 0 :3281 0 :2670 0 :2890 0 : 871 0 :2928 0 :3313
## 1 : 294 1 : 351 1 : 554 1 : 92 1 : 732 1 :1289
## NA's:1357 NA's:1911 NA's:1488 NA's:3969 NA's:1272 NA's: 330
##
##
##
##
## diary.timeDiff
## Min. :-22.1583
## 1st Qu.: 0.0708
## Median : 1.0167
## Mean : -0.5663
## 3rd Qu.: 2.5500
## Max. : 8.7417
## NA's :541
no more missing cases are observed for core variables
as noted in section 3.7, the No. of missing values in the remaining variables varies from 2 to 82%
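The percentage of missing values per variable can be obtained in one line. The sketch below uses a hypothetical `toy` data frame, and then applies the same arithmetic to two NA counts taken from the summary above (stress_other and stress_COVID), out of the 4,932 nonmissing dailyDiary cases.

```r
# share of missing values per variable in a hypothetical data frame
toy <- data.frame(stress_school = factor(c(0, 1, NA, 0)),
                  stress_COVID  = factor(c(NA, NA, NA, 1)))
round(100 * colMeans(is.na(toy)), 1)

# same arithmetic on two NA counts reported in the summary above
pctMissing <- round(100 * c(stress_other = 99, stress_COVID = 4061) / 4932, 1)
pctMissing
```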
In summary, from the original No. of dailyDiary
measures (N = 5,133, corresponding to a compliance of 88.6%), we excluded: 43 + 139 = 182 duplicated cases, 6 cases with surveyDuration
>17h, 1 case of isolated day, and 12 cases of missing responses to one or more core variables.
Thus, data cleaning led to a total No. of excluded dailyDiary
responses = 201 cases (4%).
cat(nrow(ema[!is.na(ema$StartedTime),]),"'cleaned' cases of dailyDiary data") # 4,932
## 4932 'cleaned' cases of dailyDiary data
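The exclusion counts reported in this section can be tallied as a quick arithmetic check. The element names below are hypothetical labels; the counts are those stated in the text.

```r
# exclusion counts reported in this section
excluded <- c(sameStartedTime = 43, sameIDday = 139, longDuration = 6,
              isolatedDay = 1, missingCore = 12)
nOriginal <- 5133

sum(excluded)                              # total No. of excluded cases
nOriginal - sum(excluded)                  # No. of 'cleaned' cases
round(100 * sum(excluded) / nOriginal, 1)  # % of excluded cases
```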
Here, we compute the updated information on the nonmissing data related to dailyDiary
variables.
# updating compliance dataset
Ndiary <- ema[!is.na(ema$StartedTime),]
for(i in 1:nrow(compliance)){
compliance[i,"nDiary_clean"] <-
nlevels(as.factor(as.character(Ndiary[as.character(Ndiary$ID)==compliance[i,"ID"],"IDday"])))
compliance[i,"nDiary_sleep_clean"] <-
nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] &
!is.na(ema$TIB) & !is.na(ema$StartedTime),"IDday"])))
compliance[i,"nDiary_sleepAct_clean"] <-
nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] &
!is.na(ema$TIB) & !is.na(ema$TotalSteps) & !is.na(ema$StartedTime),"IDday"]))) }
# printing compliance info
cat("\n\ndailyDiary data:\n- 'clean' No. days/participants =",
round(mean(compliance$nDiary_clean),2)," ( SD =",round(sd(compliance$nDiary_clean),2),
") \n- 'Clean' dailyDiary compliance =",
round(mean(100*compliance$nDiary_clean/60),2),"% ( SD =",round(sd(100*compliance$nDiary_clean/60),2),
") \n- non missing 'clean' diary cases/Total No. 'clean' sleep cases =",
round(mean(100*compliance$nDiary_sleep_clean/compliance$nSleep),2),
"% ( SD =",round(sd(100*compliance$nDiary_sleep_clean/compliance$nSleep),2),
") \n- non missing 'clean' diary cases/Total No. 'clean' sleep AND dailyAct cases =",
round(mean(100*compliance$nDiary_sleepAct_clean/compliance$nSleep),2),
"% ( SD =",round(sd(100*compliance$nDiary_sleepAct_clean/compliance$nSleep),2),")")
##
##
## dailyDiary data:
## - 'clean' No. days/participants = 53.03 ( SD = 10.18 )
## - 'Clean' dailyDiary compliance = 88.39 % ( SD = 16.96 )
## - non missing 'clean' diary cases/Total No. 'clean' sleep cases = 82.72 % ( SD = 17.74 )
## - non missing 'clean' diary cases/Total No. 'clean' sleep AND dailyAct cases = 70.18 % ( SD = 23.48 )
# plotting No. of cases
hist(compliance$nDiary_clean,breaks=50,main="No. of nonmissing dailyDiary values per participant",xlab="")
Comments:
Finally, we inspect the overall individual compliance by considering all the ‘cleaned’ data in the five classes of variables: sleepLog
, sleepStages
, sleepHR
, dailyDiary
, and dailyAct
, and their combination. For each of them, we also evaluate the No. of participants with extreme missing data, and the temporal continuity for each class of data.
Here we summarize the No. and % of ‘clean’ data for each class of variables.
# printing compliance info
cat("\n\nsleepLog data: total No. 'clean' cases =",nrow(ema[!is.na(ema$TIB),]), # sleepLog
"\n- No. days/participants =",round(mean(compliance$nSleep_clean),2)," ( SD =",
round(sd(compliance$nSleep_clean),2),") \n- 'Clean' sleepLog compliance =",
round(mean(100*compliance$nSleep_clean/60),2),"% ( SD =",round(sd(100*compliance$nSleep_clean/60),2),
")\n\nsleepStages data: total No. 'clean' cases =",nrow(ema[!is.na(ema$light),]), # sleepStages
"\n- No. days/participants =", round(mean(compliance$nSleepStage_clean),2)," ( SD =",
round(sd(compliance$nSleepStage_clean),2),") \n- 'Clean' sleepStages compliance =",
round(mean(100*compliance$nSleepStage_clean/60),2),"% ( SD =",round(sd(100*compliance$nSleepStage_clean/60),2),
")\n\nsleepHR data: total No. 'clean' cases =",nrow(ema[!is.na(ema$stageHR_TST),]), # sleepHR
"\n- No. days/participants =",round(mean(compliance$nstageHR_NREM_clean),2)," ( SD =",
round(sd(compliance$nstageHR_NREM_clean),2),") \n- 'Clean' sleepHR compliance =",
round(mean(100*compliance$nstageHR_NREM_clean/60),2),"% ( SD =",round(sd(100*compliance$nstageHR_NREM_clean/60),2),
")\n\ndailyDiary data: total No. 'clean' cases =",nrow(ema[!is.na(ema$StartedTime),]), # dailyDiary
"\n- No. days/participants =",round(mean(compliance$nDiary_clean),2)," ( SD =",
round(sd(compliance$nDiary_clean),2),") \n- 'Clean' dailyDiary compliance =",
round(mean(100*compliance$nDiary_clean/60),2),"% ( SD =",round(sd(100*compliance$nDiary_clean/60),2),
")\n\ndailyAct data: total No. 'clean' cases =",nrow(ema[!is.na(ema$TotalSteps),]), # dailyAct
"\n- No. days/participants =",round(mean(compliance$ndailyAct_clean),2)," ( SD =",
round(sd(compliance$ndailyAct_clean),2),") \n- 'Clean' dailyAct compliance =",
round(mean(100*compliance$ndailyAct_clean/60),2),"% ( SD =",round(sd(100*compliance$ndailyAct_clean/60),2),")")
##
##
## sleepLog data: total No. 'clean' cases = 5121
## - No. days/participants = 55.06 ( SD = 14.03 )
## - 'Clean' sleepLog compliance = 91.77 % ( SD = 23.38 )
##
## sleepStages data: total No. 'clean' cases = 4401
## - No. days/participants = 47.32 ( SD = 14.57 )
## - 'Clean' sleepStages compliance = 78.87 % ( SD = 24.28 )
##
## sleepHR data: total No. 'clean' cases = 0
## - No. days/participants = 46.45 ( SD = 14.13 )
## - 'Clean' sleepHR compliance = 77.42 % ( SD = 23.56 )
##
## dailyDiary data: total No. 'clean' cases = 4932
## - No. days/participants = 53.03 ( SD = 10.18 )
## - 'Clean' dailyDiary compliance = 88.39 % ( SD = 16.96 )
##
## dailyAct data: total No. 'clean' cases = 4664
## - No. days/participants = 50.15 ( SD = 15.97 )
## - 'Clean' dailyAct compliance = 83.58 % ( SD = 26.62 )
Here, we summarize the No. and % of ‘clean’ data focusing on cases with nonmissing sleepLog data.
# printing compliance info
n <- ema[!is.na(ema$TIB),]
cat("\n\nsleepLog data: total No. 'clean' cases =",nrow(n),", of which:",
"\n- ",nrow(n[!is.na(n$light),]),"cases with nonmissing sleepStages data (",
round(100*nrow(n[!is.na(n$light),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$stageHR_NREM),]),"cases with nonmissing stageHR data (",
round(100*nrow(n[!is.na(n$stageHR_NREM),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$StartedTime),]),"cases with nonmissing dailyDiary data (",
round(100*nrow(n[!is.na(n$StartedTime),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$TotalSteps),]),"cases with nonmissing TotalSteps data (",
round(100*nrow(n[!is.na(n$TotalSteps),])/nrow(n),2),"% )")
##
##
## sleepLog data: total No. 'clean' cases = 5121 , of which:
## - 4401 cases with nonmissing sleepStages data ( 85.94 % )
## - 4320 cases with nonmissing stageHR data ( 84.36 % )
## - 4333 cases with nonmissing dailyDiary data ( 84.61 % )
## - 4310 cases with nonmissing TotalSteps data ( 84.16 % )
Here, we summarize the No. and % of ‘clean’ data focusing on cases with nonmissing sleepStages
data.
# printing compliance info
n <- ema[!is.na(ema$light),]
cat("\n\nsleepStages data: total No. 'clean' cases =",nrow(n),", of which:",
"\n- ",nrow(n[!is.na(n$stageHR_NREM),]),"cases with nonmissing stageHR data (",
round(100*nrow(n[!is.na(n$stageHR_NREM),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$StartedTime),]),"cases with nonmissing dailyDiary data (",
round(100*nrow(n[!is.na(n$StartedTime),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$TotalSteps),]),"cases with nonmissing TotalSteps data (",
round(100*nrow(n[!is.na(n$TotalSteps),])/nrow(n),2),"% )")
##
##
## sleepStages data: total No. 'clean' cases = 4401 , of which:
## - 4320 cases with nonmissing stageHR data ( 98.16 % )
## - 3741 cases with nonmissing dailyDiary data ( 85 % )
## - 3729 cases with nonmissing TotalSteps data ( 84.73 % )
Here, we summarize the No. and % of ‘clean’ data focusing on cases with nonmissing sleepHR
data. Note that the No. of missing values for sleepHR
data varies from variable to variable. Here, we consider stageHR_TST
as one of the variables with fewer missing data.
# printing compliance info
n <- ema[!is.na(ema$stageHR_TST),]
cat("\n\nsleepHR data: total No. 'clean' cases =",nrow(n),", of which:",
"\n- ",nrow(n[!is.na(n$light),]),"cases with nonmissing sleepStages data (",
round(100*nrow(n[!is.na(n$light),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$StartedTime),]),"cases with nonmissing dailyDiary data (",
round(100*nrow(n[!is.na(n$StartedTime),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$TotalSteps),]),"cases with nonmissing TotalSteps data (",
round(100*nrow(n[!is.na(n$TotalSteps),])/nrow(n),2),"% )")
##
##
## sleepHR data: total No. 'clean' cases = 0 , of which:
## - 0 cases with nonmissing sleepStages data ( NaN % )
## - 0 cases with nonmissing dailyDiary data ( NaN % )
## - 0 cases with nonmissing TotalSteps data ( NaN % )
Here, we summarize the No. and % of ‘clean’ data focusing on cases with nonmissing dailyDiary
data.
# printing compliance info
n <- ema[!is.na(ema$StartedTime),]
cat("\n\ndailyDiary data: total No. 'clean' cases =",nrow(n),", of which:",
"\n- ",nrow(n[!is.na(n$TIB),]),"cases with nonmissing sleepLog data (",
round(100*nrow(n[!is.na(n$TIB),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$light),]),"cases with nonmissing sleepStages data (",
round(100*nrow(n[!is.na(n$light),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$stageHR_NREM),]),"cases with nonmissing stageHR data (",
round(100*nrow(n[!is.na(n$stageHR_NREM),])/nrow(n),2),
"% )\n- ",nrow(n[!is.na(n$TotalSteps),]),"cases with nonmissing TotalSteps data (",
round(100*nrow(n[!is.na(n$TotalSteps),])/nrow(n),2),"% )")
##
##
## dailyDiary data: total No. 'clean' cases = 4932 , of which:
## - 4333 cases with nonmissing sleepLog data ( 87.85 % )
## - 3741 cases with nonmissing sleepStages data ( 75.85 % )
## - 3692 cases with nonmissing stageHR data ( 74.86 % )
## - 3949 cases with nonmissing TotalSteps data ( 80.07 % )
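The pairwise summaries printed above can also be obtained in one step from a co-occurrence matrix. This is a minimal sketch on a hypothetical `toy` data frame (not study data), using the same marker variable per class as in the code above: TIB for sleepLog, light for sleepStages, StartedTime for dailyDiary, and TotalSteps for dailyAct.

```r
toy <- data.frame(TIB         = c(480, 450, NA, 430),
                  light       = c(200, NA, NA, 210),
                  StartedTime = c("21:00", "21:10", "21:05", NA),
                  TotalSteps  = c(9000, NA, 7000, 8000))

# numeric matrix: 1 where the marker variable of each class is nonmissing
avail <- ifelse(is.na(toy), 0, 1)
colnames(avail) <- c("sleepLog", "sleepStages", "dailyDiary", "dailyAct")

# counts[i, j] = No. of rows where classes i and j are both nonmissing
# (the diagonal holds the No. of nonmissing cases of each class)
counts <- crossprod(avail)
counts

# row-wise %: share of class i cases that also include class j data
round(100 * counts / diag(counts), 1)
```

Each row of the last matrix corresponds to one of the summaries above, with the class on the row as the reference.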
Here, we inspect the No. of participants with an extreme No. of missing values for each variable and for their combinations. The missingInfo
and the infoParticipant
functions are used to streamline the process.
missingInfo <- function(data=NA,var.name=NA,missingThreshold=NA){
# renaming focus variable
colnames(data)[which(colnames(data)==var.name)] <- "variable"
# selecting cases with focus variable < missingThreshold
highMiss <- data[data$variable<missingThreshold,]
# plotting
hist(data$variable,breaks=100,xlab="",main=paste(var.name,": No. of nonmissing from",
min(data$variable),"to",max(data$variable)))
# when 1+ cases have a No. of nonmissing observations < missingThreshold
if(nrow(highMiss)>0){
abline(v=missingThreshold,col="red")
# printing info
cat("\n\n",nrow(highMiss),"participants with less than",missingThreshold,"nonmissing days of",var.name,":")
for(i in 1:nrow(highMiss)){ cat("\n- ",i,as.character(highMiss[i,"ID"]),"( insomnia =",
as.character(highMiss[i,"insomnia"]),
") :",highMiss[i,"variable"],"observations") }
} else { cat("\n\n","no participants with less than",missingThreshold,"nonmissing days of",var.name) }
}
infoParticipant <- function(data=NA,var.name=NA,participants=NA){
# renaming focus variable
colnames(data)[which(colnames(data)==var.name)] <- "variable"
# selecting the specified participants
highMiss <- data[data$ID%in%participants,]
# printing info on participant
for(i in 1:nrow(highMiss)){ cat("\n- ",i,as.character(highMiss[i,"ID"]),"( insomnia =",
as.character(highMiss[i,"insomnia"]),
") :",highMiss[i,"variable"],"observations") }}
missingInfo(data=compliance,var.name="nSleep_clean",missingThreshold=10)
##
##
## 2 participants with less than 10 nonmissing days of nSleep_clean :
## - 1 s038 ( insomnia = 0 ) : 5 observations
## - 2 s089 ( insomnia = 1 ) : 6 observations
Comments:
only two participants have less than 10 nonmissing observations for sleepLog
variables: s038 and s089
further three participants have from 10 to 20 observations
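Counts such as "from 10 to 20 observations" can be checked without reading them off the histogram. The helper below is a hypothetical sketch (`inRange` is not part of the original code) applied to a hypothetical compliance table; it lists the participants whose No. of nonmissing days falls within a given range.

```r
# hypothetical compliance table (not study data)
toy <- data.frame(ID = c("s01", "s02", "s03"),
                  insomnia = c(0, 1, 1),
                  nSleep_clean = c(5, 58, 17))

# participants whose No. of nonmissing days lies in [lower, upper)
inRange <- function(data, var.name, lower = 0, upper = 10) {
  keep <- data[[var.name]] >= lower & data[[var.name]] < upper
  data[keep, c("ID", "insomnia", var.name)]
}

inRange(toy, "nSleep_clean")                          # fewer than 10 days
inRange(toy, "nSleep_clean", lower = 10, upper = 20)  # from 10 to 19 days
```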
missingInfo(data=compliance,var.name="nSleepStage_clean",missingThreshold=10)
##
##
## 3 participants with less than 10 nonmissing days of nSleepStage_clean :
## - 1 s038 ( insomnia = 0 ) : 5 observations
## - 2 s052 ( insomnia = 1 ) : 8 observations
## - 3 s089 ( insomnia = 1 ) : 4 observations
Comments:
only three participants have less than 10 nonmissing observations for sleepStages
variables: s038, s089, and s052
further four participants have from 10 to 20 observations
missingInfo(data=compliance,var.name="nstageHR_NREM_clean",missingThreshold=10)
##
##
## 3 participants with less than 10 nonmissing days of nstageHR_NREM_clean :
## - 1 s038 ( insomnia = 0 ) : 5 observations
## - 2 s052 ( insomnia = 1 ) : 8 observations
## - 3 s089 ( insomnia = 1 ) : 4 observations
Comments:
missingInfo(data=compliance,var.name="nDiary_clean",missingThreshold=10)
##
##
## no participants with less than 10 nonmissing days of nDiary_clean
Comments:
no participants have less than 10 nonmissing observations for dailyDiary
variables
further four participants have from 10 to 30 observations
Let’s see who these four participants are.
missingInfo(data=compliance,var.name="nDiary_clean",missingThreshold=30)
##
##
## 3 participants with less than 30 nonmissing days of nDiary_clean :
## - 1 s040 ( insomnia = 0 ) : 26 observations
## - 2 s050 ( insomnia = 1 ) : 27 observations
## - 3 s051 ( insomnia = 0 ) : 29 observations
Comments:
Finally, let’s see the No. of nonmissing dailyDiary
observations for those participants with the highest No. of missing data for sleepLog
variables (i.e., s038, s089, and s052). We can see these participants have more than one month of nonmissing dailyDiary
observations.
infoParticipant(data=compliance,var.name="nDiary_clean",participants=c("s038","s089","s052"))
##
## - 1 s038 ( insomnia = 0 ) : 35 observations
## - 2 s052 ( insomnia = 1 ) : 31 observations
## - 3 s089 ( insomnia = 1 ) : 38 observations
missingInfo(data=compliance,var.name="ndailyAct_clean",missingThreshold=10)
##
##
## 4 participants with less than 10 nonmissing days of ndailyAct_clean :
## - 1 s065 ( insomnia = 1 ) : 8 observations
## - 2 s080 ( insomnia = 1 ) : 3 observations
## - 3 s089 ( insomnia = 1 ) : 5 observations
## - 4 s095 ( insomnia = 0 ) : 0 observations
Comments:
four participants have less than 10 nonmissing observations for dailyAct
variables: s089, s065, s080, and s095
importantly, s095 has no data with nonmissing dailyAct
variables
further three participants have from 10 to 20 observations
missingInfo(data=compliance,var.name="ndailyAct_sleep_clean",missingThreshold=10)
##
##
## 6 participants with less than 10 nonmissing days of ndailyAct_sleep_clean :
## - 1 s038 ( insomnia = 0 ) : 4 observations
## - 2 s040 ( insomnia = 0 ) : 7 observations
## - 3 s065 ( insomnia = 1 ) : 8 observations
## - 4 s080 ( insomnia = 1 ) : 2 observations
## - 5 s089 ( insomnia = 1 ) : 4 observations
## - 6 s095 ( insomnia = 0 ) : 0 observations
Comments:
six participants have less than 10 nonmissing observations for both sleepLog
and dailyAct
variables: s038, s089, s040, s065, s080, and s095
importantly, s095 has no data with simultaneously nonmissing sleepLog
and dailyAct
variables
one further participant has from 10 to 20 observations
missingInfo(data=compliance,var.name="nDiary_sleep_clean",missingThreshold=10)
##
##
## 4 participants with less than 10 nonmissing days of nDiary_sleep_clean :
## - 1 s038 ( insomnia = 0 ) : 2 observations
## - 2 s040 ( insomnia = 0 ) : 9 observations
## - 3 s041 ( insomnia = 1 ) : 8 observations
## - 4 s089 ( insomnia = 1 ) : 4 observations
Comments:
four participants have less than 10 nonmissing observations for both sleepLog
and dailyDiary
variables: s038, s089, s040, s041
one further participant has from 10 to 20 observations
missingInfo(data=compliance,var.name="nDiary_sleepAct_clean",missingThreshold=10)
##
##
## 7 participants with less than 10 nonmissing days of nDiary_sleepAct_clean :
## - 1 s038 ( insomnia = 0 ) : 2 observations
## - 2 s040 ( insomnia = 0 ) : 5 observations
## - 3 s041 ( insomnia = 1 ) : 6 observations
## - 4 s065 ( insomnia = 1 ) : 6 observations
## - 5 s080 ( insomnia = 1 ) : 2 observations
## - 6 s089 ( insomnia = 1 ) : 3 observations
## - 7 s095 ( insomnia = 0 ) : 0 observations
Comments:
seven participants have fewer than 10 observations with simultaneously nonmissing sleepLog, dailyDiary, and dailyAct variables: s038, s089, s040, s041, s065, s080, and s095
importantly, s095 has no observations with simultaneously nonmissing sleepLog and dailyAct variables
one further participant has between 10 and 20 observations
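The screening logic used in the four checks above can be illustrated with a compact helper. This is only a minimal sketch: lowCompliance is a hypothetical name (it is not the missingInfo function used in this document), and it assumes a compliance table with one row per participant, an ID column, an insomnia column, and one count column per check. The toy counts are those printed above for nDiary_sleep_clean, whereas s001 is a made-up participant with good compliance.

```r
# hypothetical helper (NOT the missingInfo function used above):
# lists the participants whose count in 'var.name' is below 'missingThreshold'
lowCompliance <- function(data, var.name, missingThreshold = 10){
  keep <- !is.na(data[[var.name]]) & data[[var.name]] < missingThreshold
  low <- data[keep, c("ID", "insomnia", var.name)]
  low <- low[order(as.character(low$ID)), ]
  cat(nrow(low), "participants with fewer than", missingThreshold,
      "nonmissing days of", var.name, "\n")
  low
}

# toy compliance table (s001 is made up; the other counts are those printed above)
toy <- data.frame(ID = c("s001", "s038", "s040", "s041", "s089"),
                  insomnia = c(0, 0, 0, 1, 1),
                  nDiary_sleep_clean = c(30, 2, 9, 8, 4))
res <- lowCompliance(toy, "nDiary_sleep_clean", missingThreshold = 10)
res
```

Raising missingThreshold to 20 in the same call would also return the participants described above as having between 10 and 20 observations.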
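Finally, we use the tempCont function to plot DAY order against time for each participant, in order to better inspect the pattern of missing data and the cases of discontinuous data recording. Yellow is for dailyAct, blue is for sleepLog, purple is for sleepStages, red is for sleepHR, and green is for dailyDiary. To limit overlap, all series but sleepLog are shifted along the time axis (dailyDiary by -20 days, dailyAct by -10 days, sleepStages by +20 days, and sleepHR by +30 days). Thus, the actual data collection period is that shown in blue for sleepLog variables. The horizontal red line shows the No. of observations with nonmissing sleepLog, dailyDiary, and dailyAct data (or with nonmissing sleepLog and dailyDiary data only, for participants without any ‘clean’ dailyAct observation).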
tempCont <- function(data=NA,compliance=NA){
  # keeping only days with nonmissing sleepLog (TIB) or dailyDiary (StartedTime) data
  data <- data[!(is.na(data$TIB) & is.na(data$StartedTime)),]
  par(mfrow=c(3,3)) # 3 x 3 grid with one panel per participant
  for(ID in levels(data$ID)){
    IDdata <- data[data$ID==ID,]
    IDcompliance <- compliance[compliance$ID==ID,]
    # x axis = recording period +/- 30 days, y axis = 0 to the highest No. of 'clean' observations
    Xlim <- c(min(IDdata$ActivityDate)-30,max(IDdata$ActivityDate)+30)
    Ylim <- c(0,
              max(IDcompliance$nSleep_clean,IDcompliance$nSleepStage_clean,IDcompliance$nstageHR_NREM_clean,
                  IDcompliance$nDiary_clean,IDcompliance$ndailyAct_clean))
    n <- IDdata[!is.na(IDdata$TotalSteps),] # yellow = dailyAct (shifted by -10 days)
    if(nrow(n)>0){ n$ActivityDate <- n$ActivityDate - 10
      plot((1:nrow(n)~n$ActivityDate),main=ID,xlab="",ylab="",xlim=Xlim,ylim=Ylim,
           col=rgb(1,1,0,alpha=0.4),cex=3,pch=20)
      n <- IDdata[!is.na(IDdata$TIB),] # blue = sleepLog (not shifted)
      points((1:nrow(n)~n$ActivityDate),col=rgb(0,0,1,alpha=0.4),cex=3,pch=20)
    } else { # no dailyAct data: the panel is opened by sleepLog
      n <- IDdata[!is.na(IDdata$TIB),]
      plot((1:nrow(n)~n$ActivityDate),main=ID,xlab="",ylab="",xlim=Xlim,ylim=Ylim,
           col=rgb(0,0,1,alpha=0.4),cex=3,pch=20) }
    n <- IDdata[!is.na(IDdata$light),]; n$ActivityDate <- n$ActivityDate + 20 # purple = sleepStages
    points((1:nrow(n)~n$ActivityDate),col=rgb(1,0,1,alpha=0.4),cex=3,pch=20)
    n <- IDdata[!is.na(IDdata$stageHR_NREM),]; n$ActivityDate <- n$ActivityDate + 30 # red = sleepHR
    points((1:nrow(n)~n$ActivityDate),col=rgb(1,0,0,alpha=0.4),cex=3,pch=20)
    n <- IDdata[!is.na(IDdata$StartedTime),]; n$ActivityDate <- n$ActivityDate - 20 # green = dailyDiary
    points((1:nrow(n)~n$ActivityDate),col=rgb(0,1,0,alpha=0.4),cex=3,pch=20)
    # red line = No. of days with all data types (dailyAct is ignored when no 'clean' dailyAct is available)
    abline(h=ifelse(IDcompliance$ndailyAct_clean > 0,IDcompliance$nDiary_sleepAct_clean,
                    IDcompliance$nDiary_sleep_clean),col="red") }}
tempCont(data=ema,compliance=compliance)
Comments:
the most critical cases are s038 (control; with only 5 of 14 nSleep_clean, only 10 of 64 ndailyAct_clean, and only 2 observations with simultaneously nonmissing dailyDiary and sleepLog data), s040 (control; with only 15 of 18 nSleep_clean, only 13 of 64 ndailyAct_clean, and only 9 observations with simultaneously nonmissing dailyDiary and sleepLog data), s041 (insomnia; with only 15 of 21 nSleep_clean, only 18 of 100 ndailyAct_clean, and only 8 observations with simultaneously nonmissing dailyDiary and sleepLog data), and s089 (insomnia; with only 6 of 6 nSleep_clean, only 5 of 63 ndailyAct_clean, and only 4 observations with simultaneously nonmissing dailyDiary and sleepLog data)
a number of further participants have minor problems, including s024 (control; with only 15 of 64 ndailyAct_clean), s052 (insomnia; with only 14 of 23 nSleep_clean and only 10 of 64 ndailyAct_clean), s065 (insomnia; with only 15 of 63 ndailyAct_clean), s080 (insomnia; with only 3 of 63 ndailyAct_clean), s095 (control; with 0 of 64 ndailyAct_clean), and s105 (with only 17 of 63 nSleepStage_clean)
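Discontinuous recording can also be screened numerically, by computing the longest gap between consecutive recorded days for each participant. The following is a minimal sketch on made-up data: maxGap is a hypothetical helper (not part of the original pipeline), and it only assumes the ID, ActivityDate, and TIB columns used by tempCont.

```r
# hypothetical helper: longest gap (days) between consecutive dates with nonmissing 'var'
maxGap <- function(data, var){
  obs <- data[!is.na(data[[var]]), ]
  sapply(split(obs$ActivityDate, as.character(obs$ID)), function(d){
    if(length(d) < 2) return(NA_real_) # a gap needs at least two recorded days
    max(as.numeric(diff(sort(d))))
  })
}

# toy data: participant "a" misses one night, participant "b" stops recording for a week
toy <- data.frame(ID = rep(c("a", "b"), each = 4),
                  ActivityDate = as.Date("2020-06-01") + c(0, 1, 2, 3, 0, 1, 9, 10),
                  TIB = c(400, 410, NA, 420, 380, 390, 400, 410))
gaps <- maxGap(toy, "TIB")
gaps
```

A gap of 1 day corresponds to continuous recording, so larger values point to the participants whose panels deserve a closer look.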
Here, we show details on compliance for all of the mentioned participants.
# major problems
majors <- c("s038","s040","s041","s089")
compliance[compliance$ID %in% majors,]
# minor problems
minors <- c("s024","s052","s065","s080","s095","s105")
compliance[compliance$ID %in% minors,]
Comments:
major problems are probably due to technical malfunctions of the FC3 device, resulting in a low No. of ‘clean’ sleepLog observations (ranging from 5 to 15) and ‘clean’ dailyAct observations (ranging from 5 to 18), despite the higher No. of ‘clean’ dailyDiary observations (ranging from 26 to 46)
in these cases, the difference between the No. of original and ‘clean’ dailyAct observations suggests that our filtering of the former was effective (i.e., of 63-100 original observations, only 5-18 were valid, a No. similar to that of ‘clean’ sleepLog cases)
technical malfunctions of the FC3 device are also the likely origin of some minor problems, including a low No. of Fitbit data overall (i.e., s052), a low No. of ‘clean’ dailyAct data (i.e., s024, s065, s080, and s095; No. of ‘clean’ dailyAct observations ranging from 0 to 15), and a low No. of ‘clean’ sleepStages data (i.e., s105)
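As a worked example of the filtering effect described above, the share of dailyAct days retained after cleaning can be computed from the counts reported in the previous comments. This sketch reads “X on Y” as X ‘clean’ days of Y original days, and the column names nClean and nOriginal are hypothetical labels introduced here only for illustration.

```r
# counts reported above for the four major cases ('clean' of original dailyAct days)
majorsAct <- data.frame(ID = c("s038", "s040", "s041", "s089"),
                        nClean = c(10, 13, 18, 5),        # hypothetical column name
                        nOriginal = c(64, 64, 100, 63))   # hypothetical column name

# percent of original dailyAct days retained after cleaning
majorsAct$pctRetained <- round(100 * majorsAct$nClean / majorsAct$nOriginal, 1)
majorsAct
```

Under this reading, only about 8% to 20% of the original dailyAct days were retained for these participants.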
Finally, we mark the cases highlighted in the TEMPORAL CONTINUITY section by specifying the majMiss and the minMiss variables.
# creating majMiss and minMiss variables
ema$majMiss <- ema$minMiss <- 0
ema[ema$ID %in% majors,"majMiss"] <- 1
ema[ema$ID %in% minors,"minMiss"] <- 1
ema[,c("majMiss","minMiss")] <- lapply(ema[,c("majMiss","minMiss")],as.factor)
# summarizing variables
summary(ema[,c("majMiss","minMiss")])
## majMiss minMiss
## 0:5910 0:5829
## 1: 309 1: 390
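The two flags are meant to support sensitivity analyses in which the marked participants are excluded. The sketch below shows one way to do so on made-up data (the IDs are reused only as labels, and the TST values are invented). Because the flags are stored as factors, they are compared with the label "1" rather than converted with as.numeric(), which would return the internal codes 1 and 2.

```r
# toy long-format data: two days for s001, one day for s038 and s024 (made-up values)
toy <- data.frame(ID = factor(c("s001", "s001", "s038", "s024")),
                  TST = c(420, 400, 300, 410))

# same flagging steps as above, applied to the toy data
toy$majMiss <- toy$minMiss <- 0
toy[toy$ID %in% "s038", "majMiss"] <- 1
toy[toy$ID %in% "s024", "minMiss"] <- 1
toy[, c("majMiss", "minMiss")] <- lapply(toy[, c("majMiss", "minMiss")], as.factor)

# sensitivity subsets: comparing factor labels, not internal codes
lenient <- toy[toy$majMiss != "1", ]                       # drops major cases only
strict <- toy[toy$majMiss != "1" & toy$minMiss != "1", ]   # drops major and minor cases
c(all = nrow(toy), lenient = nrow(lenient), strict = nrow(strict))
```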
As the very final step, we sort the ema columns one more time, and we export the final processed dataset.
# sorting final dataset
ema <- ema[,c("ID","sex","age","BMI","insomnia","insomnia.group", # demos at the beginning
"majMiss","minMiss", # response rate criteria
"ActivityDate", # date (day of the year)
# sleepLog
"LogId","StartTime","EndTime","SleepDataType","EBEDataType", # sleepLog info
"TIB","TST","SE","SO","WakeUp","midSleep","SOL","WASO","nAwake","fragIndex", # sleepLog variables
# sleepStages
"light","deep","rem",
# sleepHR
"stageHR_NREM","stageHR_REM",
# dailyAct
"TotalSteps",
# dailyDiary
"StartedTime","SubmittedTime","surveyDuration",
"dailyStress","eveningWorry","eveningMood",
"stress_school","stress_family","stress_health","stress_COVID","stress_peers","stress_other",
"worry_school","worry_family","worry_health","worry_peer","worry_COVID","worry_sleep","worry_other",
# removed variables
# "combined","combinedLogId","nCombined", # info on combined sleep cases
# "EBEonly","nEpochs","nFinalWake","missing_start","missing_middle", # info on EBE data
# "TimeInBed","MinutesAsleep","fitabaseWASO","MinutesToFallAsleep", # original Fitabase sleepLog variables
# "nHRepochs1","nHRepochs2","nHRepochs3", # sleepHR No. of considered epochs
# "nSleepHRepochs1","nSleepHRepochs2","nSleepHRepochs3",
# "nHR_TST","nHR_SOL","nHR_WASO","nHR_NREM","nHR_REM",
# "LightlyActiveMinutes","SedentaryMinutes","ModerateVigorousMinutes","TotalActivityMinutes","hourlySteps",
"group")]
ema$group <- NULL
# saving dataset
save(ema,file="DATA/datasets/ema_finalClean.RData")
# showing first 3 rows
ema[1:3,]
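When the exported file is read back in a later script, it may help to recall that load() restores objects under their original names and returns those names as a character vector. A minimal round-trip sketch, using a temporary file and a made-up data frame instead of the actual dataset:

```r
# made-up data frame standing in for the processed dataset
toy <- data.frame(ID = factor(c("s001", "s002")), TST = c(420, 395))

# saving to a temporary file (the actual path used above is not touched)
f <- tempfile(fileext = ".RData")
save(toy, file = f)
rm(toy)

# load() re-creates 'toy' in the workspace and returns the restored object names
loaded <- load(f)
loaded
str(toy)
```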
INDIVIDUAL-LEVEL VARIABLES
Demographics:
ID
= participants’ identification code
sex
= participants’ sex (“F” = female, “M” = male)
age
= participants’ age (years)
BMI
= participants’ BMI (kg*m^(-2))
insomnia
= participants’ group (1 = insomnia, 0 = control)
insomnia.group
= participants’ insomnia group (“control” = control, “DSM.ins” = DSM insomnia, “sub.ins” = subthreshold insomnia)
majMiss
= participants with extreme missing data (N = 4 with value 1)
minMiss
= participants with substantial missing data (N = 6 with value 1)
DAY-LEVEL VARIABLES
ActivityDate
= day of assessment (date in “mm-dd-yyyy” format)
sleepLog
LogId
= sleep period identification code
StartTime
= sleep period start hour (in “mm-dd-yyyy hh:mm:ss” format)
EndTime
= sleep period end hour (in “mm-dd-yyyy hh:mm:ss” format)
SleepDataType
= sleep data type originally scored by Fitabase
EBEDataType
= type of EBE data used for manually recomputing sleep measures (i.e., updated by excluding the last wake epochs)
TIB
= time in bed (min) computed as the number of minutes between StartTime (i.e., considering missing epochs at the beginning as wake epochs) and the last epoch included in EBE data (i.e., excluding the last wake epochs)
TST
= total sleep time (min)
SE
= sleep efficiency (%) as the percent of TST over TIB
SO
= sleep onset hour (in “mm-dd-yyyy hh:mm:ss” format) corresponding to the time of the first epoch classified as sleep
WakeUp
= wake-up time (in “mm-dd-yyyy hh:mm:ss” format) corresponding to the time of the last epoch classified as sleep
midSleep
= mid-sleep time (in “mm-dd-yyyy hh:mm:ss” format) calculated as the halfway point between sleep onset and sleep offset
SOL
= sleep onset latency (min), only for cases with wake epochs between StartTime and SO
WASO
= wake after sleep onset (min)
nAwake
= No. of awakenings longer than 5 minutes after SO
fragIndex
= No. of sleep stage shifts (including shifts to wake) per hour
sleepStages
light
= No. of minutes classified as “light” sleep
deep
= No. of minutes classified as “deep” sleep
rem
= No. of minutes classified as REM sleep
sleepHR
stageHR_NREM
= mean HR value (bpm) computed over NREM sleep epochs
stageHR_REM
= mean HR value (bpm) computed over REM sleep epochs
dailyAct
TotalSteps
= sum of the No. of steps recorded in each day (recomputed from hourly steps data)
dailyDiary
StartedTime
= initiation hour of the diary form (in “mm-dd-yyyy hh:mm:ss” format)
SubmittedTime
= submission hour of the diary form (in “mm-dd-yyyy hh:mm:ss” format)
surveyDuration
= duration of the survey (min)
dailyStress
= score on the daily stress item (1-5)
eveningWorry
= score on the evening worry item (1-5)
eveningMood
= score on the evening mood item (1-5)
stress_school, ..., stress_other
= stressor categories (0 or 1)
worry_school, ..., worry_other
= sources of worry (0 or 1)
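To make the sleepLog and sleepStages definitions above more concrete, the following sketch derives the main measures from a made-up sequence of 30-sec epochs. It is a simplified illustration under stated assumptions, not the code used to process the data: the stage vector and the StartTime value are invented, missing epochs are not handled, and WakeUp is taken as the start time of the last sleep epoch.

```r
# toy 30-sec epoch sequence (made-up data): "wake", "light", "deep", or "rem"
stage <- c(rep("wake", 20), rep("light", 60), rep("deep", 40), rep("wake", 12),
           rep("light", 30), rep("rem", 40), rep("wake", 8))
epochMin <- 0.5 # epoch length (min)

# excluding the last wake epochs, as described for TIB
lastSleep <- max(which(stage != "wake"))
stage <- stage[1:lastSleep]
sleep <- stage != "wake"
firstSleep <- min(which(sleep)) # first epoch classified as sleep (SO)
afterSO <- sleep[firstSleep:length(sleep)]

TIB <- length(stage) * epochMin        # time in bed (min)
TST <- sum(sleep) * epochMin           # total sleep time (min)
SE <- 100 * TST / TIB                  # sleep efficiency (%)
SOL <- (firstSleep - 1) * epochMin     # wake time before SO (min)
WASO <- sum(!afterSO) * epochMin       # wake time after SO (min)
light <- sum(stage == "light") * epochMin
deep <- sum(stage == "deep") * epochMin
rem <- sum(stage == "rem") * epochMin

# No. of awakenings longer than 5 min after SO (runs of consecutive wake epochs)
runs <- rle(afterSO)
nAwake <- sum(!runs$values & runs$lengths * epochMin > 5)

# clock times, assuming a hypothetical StartTime
StartTime <- as.POSIXct("2020-06-01 23:00:00", tz = "UTC")
SO <- StartTime + (firstSleep - 1) * 30
WakeUp <- StartTime + (lastSleep - 1) * 30
midSleep <- SO + as.numeric(difftime(WakeUp, SO, units = "secs")) / 2

round(c(TIB = TIB, TST = TST, SE = SE, SOL = SOL, WASO = WASO, nAwake = nAwake), 1)
```

In this toy night TIB equals the sum of SOL, TST, and WASO, and the minutes of light, deep, and REM sleep add up to TST, which can serve as a quick consistency check on recomputed measures.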
Aadland, E., Andersen, L. B., Anderssen, S. A., & Resaland, G. K. (2018). A comparison of 10 accelerometer non-wear time criteria and logbooks in children. BMC public health, 18(1), 1-9. https://doi.org/10.1186/s12889-018-5212-4
Fleming, S., Thompson, M., Stevens, R., Heneghan, C., Plüddemann, A., Maconochie, I., … & Mant, D. (2011). Normal ranges of heart rate and respiratory rate in children from birth to 18 years of age: a systematic review of observational studies. The Lancet, 377(9770), 1011-1018. https://doi.org/10.1016/S0140-6736(10)62226-X
Herrmann, S. D., Barreira, T. V., Kang, M., & Ainsworth, B. E. (2014). Impact of accelerometer wear time on physical activity data: a NHANES semisimulation data approach. British journal of sports medicine, 48(3), 278-282. https://doi.org/10.1136/bjsports-2012-091410
Herrmann, S. D., Barreira, T. V., Kang, M., & Ainsworth, B. E. (2013). How many hours are enough? Accelerometer wear time may provide bias in daily activity estimates. Journal of Physical Activity and Health, 10(5), 742-749. https://doi.org/10.1123/jpah.10.5.742
Menghini, L., Cellini, N., Goldstone, A., Baker, F. C., & de Zambotti, M. (2021a). A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code. Sleep, 44(2), zsaa170. https://doi.org/10.1093/sleep/zsaa170
Quante, M., Kaplan, E. R., Rueschman, M., Cailler, M., Buxton, O. M., & Redline, S. (2015). Practical considerations in using accelerometers to assess physical activity, sedentary behavior, and sleep. Sleep health, 1(4), 275-284. https://doi.org/10.1016/j.sleh.2015.09.002