Aims and content

The present document describes the pre-processing steps used to read the intensive longitudinal data collected with the Fitbit Charge 3 (FC3) and Survey Sparrow from a sample of 93 adolescents with and without insomnia.

The types of data collected in this study were the following:

dailyAct = diurnal daily activity data including the day-by-day total No. of steps and the resting heart rate recorded with the FC3, and the associated metrics automatically averaged by the Fitabase system, for each participant
hourlySteps = hourly step count data including the hour-by-hour No. of steps recorded with the FC3 device, automatically organized by the Fitabase system (the same information is included at a lower temporal resolution in the dailyAct dataset), for each participant
sleepLog = nocturnal daily sleep data including the log-by-log sleep measures recorded with the FC3 device, automatically organized by the Fitabase system with one row per identified sleep period (sleepLog), for each participant
sleepEBE = nocturnal sleep data including the 30-sec epoch-by-epoch categorization of sleep periods as either ‘wake’ or ‘light,’ ‘deep,’ or ‘REM’ sleep, for each participant
classicEBE = nocturnal sleep data including the 60-sec epoch-by-epoch categorization of sleep periods as either ‘wake’ or ‘sleep,’ automatically organized by the Fitabase system with one row per epoch, for each participant
HR.1min = diurnal and nocturnal heart rate (HR) data including 60-sec epoch-by-epoch HR values recorded by the Fitbit device, automatically organized by the Fitabase system with one row per epoch, for each participant
dailyDiary = daily diary reports including day-by-day appraisal of psychological distress recorded and automatically stored by Survey Sparrow (SurveySparrow Inc.) for each participant
demos = demographic data (i.e., group, sex, and BMI) manually stored for each participant

Here, we remove all objects from the R global environment, and we set the system time zone.

# removing all objects from the workspace
rm(list=ls())

# setting system time zone to GMT (for consistent temporal synchronization)
Sys.setenv(TZ="GMT") # the variable name is 'TZ' (upper case; names are case-sensitive on Unix-like systems)
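
A quick way to confirm that the session time zone is the intended one is to inspect the `TZ` environment variable after setting it. The sketch below is only illustrative and is not part of the original pipeline. Note that R consults the upper-case `TZ` variable, and that environment variable names are case-sensitive on Unix-like systems.

```r
# Set the session time zone through the TZ environment variable
Sys.setenv(TZ = "GMT")

# Confirm that the variable is set as intended
Sys.getenv("TZ")

# With an explicit tz argument, the Unix epoch corresponds to 0 seconds
epoch <- as.POSIXct("1970-01-01 00:00:00", tz = "GMT")
as.numeric(epoch)
```

Passing `tz="GMT"` explicitly to `as.POSIXct`, as done later in this document, makes the conversion independent of the session setting.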

The following R packages are used in this document (see References section):

# required packages
packages <- c("ggplot2","gridExtra","lubridate","tidyr","dplyr","tcltk")

# generate packages references
knitr::write_bib(c(.packages(), packages),"packagesProc.bib")

# # run to install missing packages
# xfun::pkg_attach2(packages, message = FALSE); rm(list=ls())

1. Data reading

Here, we use the multidata.read function to read the data downloaded from the Fitabase and the Survey Sparrow clouds. The data were downloaded separately for each participant and stored in separate folders.


#' @title Reading data from multiple files (one per subject)
#' @param data.path = character string indicating the path to the folder including the data files (one per participant).
#' @param idChar = numeric vector of length 2 indicating the positions of the first and the last character in the file names to be used as participants' identification code
#' @param groupChar = numeric vector of length 2 indicating the positions of the first and the last character in the file names to be used as group identification code (default: NA)
#' @param nSubj = integer indicating the number of participants that should be included in the data files (for sanity check)
#' @param surveySparrow = ad-hoc argument to process the data files collected with Survey Sparrow
multidata.read <- function(data.path,idChar,groupChar=NA,nSubj=NA,surveySparrow=FALSE){ 
  
  # reading files from data.path
  paths <- list.files(data.path)
  
  # iteratively reading each file and adding it to the same data.frame
  for(path in paths){ 
    new.data <- read.csv(paste(data.path,path,sep="/"))
    new.data$ID <- as.factor(paste("s",substr(path,idChar[1],idChar[2]),sep="")) # participant's identifier from file name
    if(!is.na(groupChar[1]) & !is.na(groupChar[2])){
      new.data$group <- as.factor(substr(path,groupChar[1],groupChar[2])) # group's identifier from file name
      new.data <- new.data[,c(ncol(new.data)-1,ncol(new.data),1:(ncol(new.data)-2))] # sorting columns
    } else { new.data <- new.data[,c(ncol(new.data),1:(ncol(new.data)-1))] }
    
    # ad-hoc code for surveySparrow data (shortening column names, adding missing columns)
    if(surveySparrow==TRUE){
      colnames(new.data) <- 
        gsub("PleaseindicateifthestresswasrelatedtooneormoreofthefollowingfactorsafteryoumadeyourselectionsclickOK","",
             gsub("AreyouworriedaboutanyormoreofthefollowingafteryoumadeyourselectionsclickOK","",
                  gsub("\\.","",colnames(new.data))))
      if(ncol(new.data)==40){ new.data[,c("SubmissionId","TimeZone","DeviceType","BrowserLanguage")] <- NULL }
      if(ncol(new.data)==34){ # adding COVID-related columns (empty) in first subjects' data
        new.data <- cbind(new.data[,1:6],Coronavirusrelatednegmydailyactivitieswererestrictedduetocoronavirus=NA,
                          new.data[,7:14],Coronavirusrelatednegmydailyactivitieswererestrictedduetocoronavirus=NA,
                          new.data[,15:ncol(new.data)]) }
      if(new.data[1,"ID"]=="s095"){ colnames(new.data)[32:33] <- colnames(data)[32:33] } # fixing wrongly encoded col names
      colnames(new.data)[which(colnames(new.data)=="Coronavirusrelatednegmydailyactivitieswererestrictedduetocoronavirus")] <-
        c("COVIDrestrictions_stress","COVIDrestrictions_worry") } 
    
    # adding the new participant to the dataset
    if(path == paths[1]){ data <- new.data } else { data <- rbind(data,new.data) }}
  
  # sanity check based on No. of subjects
  if(!is.na(nSubj)){ cat("sanity check:",nlevels(as.factor(data$ID)) == nSubj) }
  
  return(data) }
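
To make the role of idChar and groupChar concrete, the following minimal sketch applies the same `substr` calls used in multidata.read to a single file name. The file name is hypothetical: only its `INSA001FBSR` prefix mirrors the `id` format of the demos dataset, and the remainder is an assumption for illustration.

```r
# Hypothetical file name; only the "INSA001FBSR" prefix follows the id format in demos
path <- "INSA001FBSR_dailyActivity.csv"

idChar <- c(5, 7)      # positions of the participant code within the file name
groupChar <- c(8, 11)  # positions of the group code within the file name

# Same extraction logic as in multidata.read
ID <- paste("s", substr(path, idChar[1], idChar[2]), sep = "")
group <- substr(path, groupChar[1], groupChar[2])

ID     # "s001"
group  # "FBSR"
```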

# Fitbit - dailyAct (6,019 days from 93 participants)
dailyAct <- multidata.read(data.path="DATA/Fitbit All Daily Activity",idChar=c(5,7),groupChar=c(8,11),nSubj=93)

## sanity check: TRUE

# Fitbit - hourlySteps (143,961 hours from 93 participants)
hourlySteps <- multidata.read(data.path="DATA/Fitbit Hourly Steps",idChar=c(5,7),groupChar=c(8,11),nSubj=93)

## sanity check: TRUE

# Fitbit - sleepLog (5,403 nights from 93 participants)
sleepLog <- multidata.read(data.path="DATA/Fitbit Sleep Log",idChar=c(5,7),groupChar=c(8,11),nSubj=93)

## sanity check: TRUE

# Fitbit - sleepEBE (4,243,037 epochs from 93 participants)
sleepEBE <- multidata.read(data.path="DATA/Fitbit Sleep Stages (30sec)",idChar=c(5,7),groupChar=c(8,11),nSubj=93)

## sanity check: TRUE

# Fitbit - classicEBE (2,505,034 epochs from 93 participants)
classicEBE <- multidata.read(data.path="DATA/Fitbit Sleep Classic (1min)",idChar=c(5,7),groupChar=c(8,11),nSubj=93)

## sanity check: TRUE

# Fitbit - HR.1min (6,986,307 epochs from 93 participants)
HR.1min <- multidata.read(data.path="DATA/Fitbit HR (1min)",idChar=c(5,7),groupChar=c(8,11),nSubj=93)

## sanity check: TRUE

# Survey Sparrow - dailyDiary (5,133 daily responses from 93 participants)
dailyDiary <- multidata.read(data.path="DATA/Survey Sparrows",idChar=c(5,7),groupChar=c(8,11),nSubj=93,surveySparrow=TRUE)

## sanity check: TRUE
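
As a side note, multidata.read grows the output with `rbind` inside a loop, which can become slow with millions of rows. The sketch below shows one possible alternative, not the approach used in this document: reading all files with `lapply` and binding them once with `do.call(rbind, ...)`. The toy folder, the file names, and the object names (`toyDir`, `toyList`, `toyData`) are assumptions made for illustration.

```r
# Toy folder with two hypothetical per-participant files
toyDir <- file.path(tempdir(), "toyFitbit")
dir.create(toyDir, showWarnings = FALSE)
write.csv(data.frame(ActivityDate = c("1/7/2019", "1/8/2019"), TotalSteps = c(11023, 16524)),
          file.path(toyDir, "INSA001FBSR_toy.csv"), row.names = FALSE)
write.csv(data.frame(ActivityDate = "1/9/2019", TotalSteps = 8000),
          file.path(toyDir, "INSA002FSRB_toy.csv"), row.names = FALSE)

# Read each file and tag it with ID and group taken from the file name
paths <- list.files(toyDir, pattern = "^INSA.*_toy\\.csv$")
toyList <- lapply(paths, function(path){
  new.data <- read.csv(file.path(toyDir, path))
  data.frame(ID = paste0("s", substr(path, 5, 7)),
             group = substr(path, 8, 11),
             new.data) })

# Bind all participants in a single step
toyData <- do.call(rbind, toyList)
toyData$ID <- as.factor(toyData$ID)
toyData$group <- as.factor(toyData$group)

# Same sanity check as in multidata.read
nlevels(toyData$ID) == 2
```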

Then, we read the demos dataset, which includes the demographic variables of 107 participants recorded in the demographics.csv data file.

# Demographics - demos (107 participants)
demos <- read.csv2("DATA/demographics.csv",header=TRUE) # csv2 because saved with ; as column separator

Here, we inspect the structure of each dataset. Note that each FC3-derived dataset includes all variables available from the Fitabase platform.

# dailyAct
str(dailyAct)

## 'data.frame':    6019 obs. of  20 variables:
##  $ ID                      : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group                   : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ActivityDate            : chr  "1/7/2019" "1/8/2019" "1/9/2019" "1/10/2019" ...
##  $ TotalSteps              : int  11023 16524 14904 8000 0 0 0 0 5273 6718 ...
##  $ TotalDistance           : num  9.66 13.17 11.97 6.62 0 ...
##  $ TrackerDistance         : num  9.66 13.17 11.97 6.62 0 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  2.84 0 1.97 1.24 0 ...
##  $ ModeratelyActiveDistance: num  4.55 3.54 2.28 0.21 0 ...
##  $ LightActiveDistance     : num  2.27 9.47 7.71 5.06 0 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  16 0 21 11 0 0 0 0 9 7 ...
##  $ FairlyActiveMinutes     : int  54 39 34 3 0 0 0 0 3 3 ...
##  $ LightlyActiveMinutes    : int  113 414 266 153 0 0 0 0 147 222 ...
##  $ SedentaryMinutes        : int  1257 266 453 735 1440 1440 1440 1440 1281 1208 ...
##  $ Calories                : int  2138 2669 2519 2013 1505 1505 1505 1505 1945 2085 ...
##  $ Floors                  : int  4 9 13 10 0 0 0 0 0 0 ...
##  $ CaloriesBMR             : int  1505 1505 1505 1505 1505 1505 1505 1505 1505 1505 ...
##  $ MarginalCalories        : int  479 790 716 356 0 0 0 0 308 360 ...
##  $ RestingHeartRate        : int  60 64 64 65 NA NA NA NA NA NA ...

# hourlySteps
str(hourlySteps)

## 'data.frame':    143961 obs. of  4 variables:
##  $ ID          : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group       : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ActivityHour: chr  "1/7/2019 0:00" "1/7/2019 1:00" "1/7/2019 2:00" "1/7/2019 3:00" ...
##  $ StepTotal   : int  0 0 0 0 0 0 0 0 0 0 ...

# sleepLog
str(sleepLog)

## 'data.frame':    5403 obs. of  30 variables:
##  $ ID                     : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group                  : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LogId                  : num  2.08e+10 2.08e+10 2.08e+10 2.11e+10 2.11e+10 ...
##  $ StartTime              : chr  "1/7/2019 10:06:30 PM" "1/8/2019 9:59:30 PM" "1/9/2019 10:45:00 PM" "1/25/2019 9:53:00 PM" ...
##  $ Duration               : int  36060000 35460000 32280000 36780000 34800000 34560000 31620000 31560000 29040000 27240000 ...
##  $ Efficiency             : int  89 87 90 90 92 91 91 93 93 93 ...
##  $ IsMainSleep            : chr  "True" "True" "True" "True" ...
##  $ SleepDataType          : chr  "stages" "stages" "stages" "stages" ...
##  $ MinutesAfterWakeUp     : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ MinutesAsleep          : int  507 502 469 511 515 496 448 451 423 387 ...
##  $ MinutesToFallAsleep    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TimeInBed              : int  601 591 538 613 580 576 527 526 484 454 ...
##  $ ClassicAsleepCount     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ClassicAsleepDuration  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ClassicAwakeCount      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ClassicAwakeDuration   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ClassicRestlessCount   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ClassicRestlessDuration: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ StagesWakeCount        : int  53 43 43 53 45 46 36 34 42 24 ...
##  $ StagesWakeDuration     : int  94 89 69 102 65 80 79 75 61 67 ...
##  $ StagesWakeThirtyDayAvg : int  0 94 92 84 89 84 83 83 82 79 ...
##  $ StagesLightCount       : int  45 32 31 46 39 42 22 28 29 24 ...
##  $ StagesLightDuration    : int  353 297 283 330 272 287 258 269 243 178 ...
##  $ StagesLightThirtyDayAvg: int  0 353 325 311 316 307 304 297 294 288 ...
##  $ StagesDeepCount        : int  2 2 3 3 6 4 3 4 4 4 ...
##  $ StagesDeepDuration     : int  66 91 97 92 119 104 75 65 88 103 ...
##  $ StagesDeepThirtyDayAvg : int  0 66 79 85 87 93 95 92 89 89 ...
##  $ StagesREMCount         : int  17 20 17 16 18 15 20 16 19 8 ...
##  $ StagesREMDuration      : int  88 114 89 89 124 105 115 117 92 106 ...
##  $ StagesREMThirtyDayAvg  : int  0 88 101 97 95 101 102 103 105 104 ...

# sleepEBE
str(sleepEBE)

## 'data.frame':    4243037 obs. of  7 variables:
##  $ ID        : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group     : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LogId     : num  2.08e+10 2.08e+10 2.08e+10 2.08e+10 2.08e+10 ...
##  $ Time      : chr  "1/7/2019 22:06" "1/7/2019 22:07" "1/7/2019 22:07" "1/7/2019 22:08" ...
##  $ Level     : chr  "light" "light" "light" "light" ...
##  $ ShortWakes: chr  "" "" "" "" ...
##  $ SleepStage: chr  "light" "light" "light" "light" ...

# classicEBE
str(classicEBE)

## 'data.frame':    2505034 obs. of  5 variables:
##  $ ID   : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group: Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ date : chr  "1/7/2019 22:06" "1/7/2019 22:07" "1/7/2019 22:08" "1/7/2019 22:09" ...
##  $ value: int  1 1 1 2 1 1 2 2 1 1 ...
##  $ logId: num  2.08e+10 2.08e+10 2.08e+10 2.08e+10 2.08e+10 ...

# HR.1min
str(HR.1min)

## 'data.frame':    6986307 obs. of  4 variables:
##  $ ID   : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group: Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time : chr  "1/7/2019 3:50:00 PM" "1/7/2019 3:51:00 PM" "1/7/2019 3:52:00 PM" "1/7/2019 3:53:00 PM" ...
##  $ Value: int  84 74 69 69 68 68 70 65 70 70 ...

# dailyDiary
str(dailyDiary)

## 'data.frame':    5133 obs. of  36 variables:
##  $ ID                                                                     : Factor w/ 93 levels "s001","s002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group                                                                  : Factor w/ 6 levels "FBSR","FSRB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Howstressfulwasyourday                                                 : chr  "Not so stressful" "Not so stressful" "Not at all stressful" "Somewhat stressful" ...
##  $ SchoolegIhadanexam                                                     : chr  "School (e.g., I had an exam)" "" "" "" ...
##  $ FamilyegIhadanargumentwithmyparents                                    : chr  "" "" "" "" ...
##  $ HealthegIhadanaccident                                                 : chr  NA NA NA NA ...
##  $ COVIDrestrictions_stress                                               : chr  NA NA NA NA ...
##  $ RelationswithyourpeersegIhadafightwithmyfriend                         : chr  "" "" "" "" ...
##  $ Other                                                                  : chr  "Other" "Other" "" "Other" ...
##  $ Howisyourmoodrightnow                                                  : chr  "Somewhat good" "Very good" "Very good" "Very good" ...
##  $ Howworrieddoyoufeelrightnow                                            : chr  "Not so worried" "Not so worried" "Not so worried" "Not at all worried" ...
##  $ SchoolegtomorrowIhaveanexam                                            : chr  "School (e.g., tomorrow I have an exam)" "" "" "" ...
##  $ FamilyegtomorrowIneedtodosomethingimportantwithmyparents               : chr  "" "" "" "" ...
##  $ HealthegtomorrowIhaveanimportantvisittothedoctor                       : chr  NA NA NA NA ...
##  $ RelationswithyourpeersegmyfriendaskedmetotalkandIdonotknowwhatitisabout: chr  "" "" "" "" ...
##  $ COVIDrestrictions_worry                                                : chr  NA NA NA NA ...
##  $ SleepegIamworriedthatIamnotgoingtosleepwelltonight                     : chr  NA NA NA NA ...
##  $ OtheregIamworriedaboutsomethingelsehappeningtomorrow                   : chr  "Other (e.g., I am worried about something else happening tomorrow)" "Other (e.g., I am worried about something else happening tomorrow)" "Other (e.g., I am worried about something else happening tomorrow)" "" ...
##  $ TotalScore                                                             : int  2 1 NA 2 1 NA NA 2 6 3 ...
##  $ StartedTime                                                            : chr  "4/9/2019 21:10" "4/8/2019 21:08" "4/7/2019 21:29" "4/6/2019 21:01" ...
##  $ SubmittedTime                                                          : chr  "4/9/2019 21:10" "4/8/2019 21:08" "4/7/2019 21:29" "4/6/2019 21:01" ...
##  $ CompletionStatus                                                       : chr  "Completed" "Completed" "Completed" "Completed" ...
##  $ IPAddress                                                              : chr  "66.180.182.236" "66.180.182.232" "172.56.34.144" "172.58.231.24" ...
##  $ Location                                                               : logi  NA NA NA NA NA NA ...
##  $ DMSLatLong                                                             : logi  NA NA NA NA NA NA ...
##  $ ChannelType                                                            : chr  "EMAIL" "EMAIL" "EMAIL" "EMAIL" ...
##  $ ChannelName                                                            : chr  NA NA NA NA ...
##  $ DeviceID                                                               : logi  NA NA NA NA NA NA ...
##  $ DeviceName                                                             : logi  NA NA NA NA NA NA ...
##  $ Browser                                                                : chr  NA NA NA NA ...
##  $ OS                                                                     : chr  NA NA NA NA ...
##  $ ContactName                                                            : chr  NA NA NA NA ...
##  $ ContactEmail                                                           : chr  NA NA NA NA ...
##  $ ContactMobile                                                          : logi  NA NA NA NA NA NA ...
##  $ ContactPhone                                                           : logi  NA NA NA NA NA NA ...
##  $ ContactJobTitle                                                        : logi  NA NA NA NA NA NA ...

# demos
str(demos)

## 'data.frame':    107 obs. of  13 variables:
##  $ id          : chr  "INSA001FBSR" "INSA002FSRB" "INSA004MBSR" "INSA005FSRB" ...
##  $ sex         : int  0 0 1 0 1 0 0 1 1 0 ...
##  $ age         : chr  "19.3322473" "19.28570288" "18.80930706" "18.754663" ...
##  $ BMI         : chr  "20.59570313" "18.54801038" "31.09001041" "19.93383743" ...
##  $ insomnia    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ DSMinsomnia : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ sub_insomnia: int  0 0 0 0 NA 0 0 0 0 0 ...
##  $ X           : logi  NA NA NA NA NA NA ...
##  $ X.1         : logi  NA NA NA NA NA NA ...
##  $ X.2         : logi  NA NA NA NA NA NA ...
##  $ X.3         : logi  NA NA NA NA NA NA ...
##  $ X.4         : logi  NA NA NA NA NA NA ...
##  $ X.5         : logi  NA NA NA NA NA NA ...
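
In the str(demos) output above, age and BMI are stored as character vectors although their values look numeric, and six columns (X to X.5) contain only missing values. One plausible explanation, stated here as an assumption, is that `read.csv2` expects "," as the decimal mark whereas the file uses ".". The sketch below uses a toy text (ids and rounded values taken from the output above) to show how such columns could be converted and how all-NA columns could be dropped. It is not necessarily how the data are handled in later steps.

```r
# Toy semicolon-separated text mimicking the layout of demographics.csv
toyText <- "id;sex;age;BMI;X\nINSA001FBSR;0;19.33;20.60;\nINSA002FSRB;0;19.29;18.55;"
toyDemos <- read.csv2(text = toyText, header = TRUE)

# Converting the decimal-point strings to numeric
toyDemos$age <- as.numeric(as.character(toyDemos$age))
toyDemos$BMI <- as.numeric(as.character(toyDemos$BMI))

# Dropping columns that contain only missing values (e.g., X)
toyDemos <- toyDemos[, colSums(!is.na(toyDemos)) > 0]
str(toyDemos)
```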

2. Temporal synchronization

Here, we recode the time format of each dataset to consistently synchronize the temporal coordinates across measurement modalities. For each dataset, depending on its temporal resolution, we recode or create the ActivityDate variable (i.e., the day of the year in the ‘yyyy-mm-dd’ format) and the corresponding time variable (i.e., the time within each day in the ‘hh:mm:ss’ format).

The timeCheck function is used to check whether the temporal coordinates of the data points match the data collection interval (i.e., from January 2019 to April 2021), and whether missing data points are present within the data.
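
The recoding described above can be illustrated on two raw values taken from the str() outputs in section 1. This is a minimal sketch of the format conversion only. The exact interval boundaries used below (1 January 2019 and 30 April 2021) are an assumption, since the text only states the months.

```r
# Example raw values in the formats shown by str() in section 1
rawDay  <- "1/7/2019"        # month/day/year, as in dailyAct$ActivityDate
rawHour <- "1/7/2019 22:06"  # month/day/year hour:minute, as in sleepEBE$Time

# Converting to Date and POSIXct
day  <- as.Date(rawDay, format = "%m/%d/%Y")
hour <- as.POSIXct(rawHour, format = "%m/%d/%Y %H:%M", tz = "GMT")

format(day, "%Y-%m-%d")   # "2019-01-07"
format(hour, "%H:%M:%S")  # "22:06:00"

# Checking that the day falls within the (assumed) data collection interval
inInterval <- day >= as.Date("2019-01-01") & day <= as.Date("2021-04-30")
inInterval
```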


#' @title Recoding and checking data temporal synchronization
#' @param data = data.frame.
#' @param day = character string indicating the name of the variable including the day of the year
#' @param hour = character string indicating the name of the variable including the time within the day (optional, default: NA)
#' @param returnInfo = logical indicating whether the participants' compliance dataset should be returned instead of the recoded dataset (default: FALSE)
#' @param printInfo = logical indicating whether the summary information on participants' compliance should be printed (default: TRUE)
#' @param input.dayFormat = character string indicating the current format of the 'day' variable (default: "%m/%d/%Y"). See ?strptime for details
#' @param output.dayFormat = character string indicating the desired format of the 'day' variable (default: "%Y-%m-%d"). See ?strptime for details
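#' @param output.hourFormat = character string indicating the format of the 'hour' variable used when converting it to POSIXct (default: "%m/%d/%Y %H:%M"). See ?strptime for details
#' @param ID = character string indicating the name of the variable including the participants' identification code (default: "ID")
#' @param add30 = logical indicating whether 30 sec should be added to every other epoch within each sleep log (for 30-sec EBE data stored with minute resolution; requires LogId; default: FALSE)
#' @param LogId = character string indicating the name of the variable identifying each sleep log (optional, default: NA)
#' @param day.withinNight = logical indicating whether all epochs of a sleep log should be assigned the date of its first epoch (default: FALSE)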
timeCheck <- function(data,ID="ID",day="ActivityDate",hour=NA,returnInfo=FALSE,printInfo=TRUE,
                      input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M",
                      add30=FALSE,LogId=NA,day.withinNight=FALSE){ 
  require(ggplot2); require(gridExtra) # required packages for data visualization
 
  # setting column names
  if(!is.na(day)){ colnames(data)[which(colnames(data)==day)] <- "day" }
  colnames(data)[which(colnames(data)==ID)] <- "ID"
  if(!is.na(hour)){ colnames(data)[which(colnames(data)==hour)] <- "hour" 
    if(!is.na(LogId)){ colnames(data)[which(colnames(data)==LogId)] <- "LogId" 
      data$LogId <- as.factor(as.character(data$LogId))}}
  
  # setting time format
  if(!is.null(data$day[1])){ # setting day format
    data$day <- as.Date(format(as.POSIXct(data$day,format=input.dayFormat),format=output.dayFormat)) } 
  if(!is.na(hour)){ # setting hour format
    data$hour <- as.POSIXct(data$hour,format=output.hourFormat,tz="GMT")
    # resolution from minute to seconds (i.e., for 30-sec EBE data: adds 30 sec to each other epoch)
    if(add30==TRUE & !is.na(LogId)){
      for(LOG in levels(data$LogId)){ 
        data[data$LogId==LOG,"hour"][seq(from=2,to=nrow(data[data$LogId==LOG,]),by=2)] <- 
          data[data$LogId==LOG,"hour"][seq(from=2,to=nrow(data[data$LogId==LOG,]),by=2)] + 30 
        if(difftime(head(data[data$LogId==LOG,"hour"],2)[2],head(data[data$LogId==LOG,"hour"],1),units="secs")>30){
          data[data$LogId==LOG,"hour"][1] <- data[data$LogId==LOG,"hour"][1] + 30 # adds 30 sec when sleep starts at :30
        }}}
    if(is.na(day) | is.null(data$day[1])){ # when hour but not day is specified, day is computed from hour
      data$day <- as.Date( format( as.POSIXct(data$hour,format=input.dayFormat), output.dayFormat)) 
      if(!is.na(hour) & !is.na(LogId) & day.withinNight==TRUE){ # date as the first epoch's date
        for(LOG in levels(data$LogId)){ data[data$LogId==LOG,"day"] <- data[data$LogId==LOG,"day"][1]}}}}
  
  # sorting data by participant, day, and time
  if(is.na(hour)){ data <- data[order(data$ID,data$day),]
    }else{ data <- data[order(data$ID,data$day,data$hour),] }
  
  # checking number of data and temporal interval for each participant
  info <- data.frame(ID=levels(data$ID))
  if(!is.na(hour)){
    data$IDday <- as.factor(paste(data$ID,data$day,sep="_")) # participant X day identifier
    dataDay <- data[!duplicated(data$IDday),] # taking only the first row for each participant X day combination
  } else { dataDay <- data }
  for(i in 1:nrow(info)){
    IDdata <- dataDay[dataDay$ID==info[i,"ID"],]
    info[i,"nDays"] <- nrow(IDdata) # nDays = No. of data points (days) per participant
    info[i,"Tint"] <- as.integer(difftime(tail(IDdata$day,1),head(IDdata$day,1),units="days")) + 1
    
    # counting No. of missing days
    nMissingDays <- maxdayDiff <- 0
    if(nrow(IDdata)>1){
      for(j in 2:nrow(IDdata)){ dayDiff <- as.integer(difftime(IDdata[j,"day"],IDdata[j-1,"day"],units="days"))
      if(dayDiff>1){ nMissingDays <- nMissingDays + as.integer(dayDiff) - 1 
        if(dayDiff>maxdayDiff){ maxdayDiff <- dayDiff }}}
      info[i,"nMissingDays"] <- nMissingDays 
      info[i,"maxdayDiff"] <- maxdayDiff }
    }
    
  
  # counting No. of duplicates
  if(!is.na(hour)){
    data$IDhour <- as.factor(paste(data$ID,data$hour,sep=""))
    dupHour <- nrow(data[duplicated(data$IDhour),]) 
  } else {  
    dataDay$IDday <- as.factor(paste(dataDay$ID,dataDay$day,sep="_"))
    dupDay <- nrow(dataDay[duplicated(dataDay$IDday),]) }
  
  # plotting the temporal distribution of ActivityDate
  if(printInfo==TRUE){
    grid.arrange(ggplot(data[,which(duplicated(colnames(data))==FALSE)],aes(day)) + geom_histogram(bins=30) + 
                   ggtitle(paste(day,"distribution ( example format:",data$day[1]," )")),
                 ggplot(info,aes(nDays)) + geom_histogram(bins=30) + ggtitle("No. of non-missing days per participant"),nrow=2)
  
    # printing information
    if(!is.na(hour)){ cat(nrow(data),"observations in",nrow(dataDay),"days from",nlevels(data$ID),"participants:") 
      } else { cat(nrow(data),"days from",nlevels(data$ID),"participants:") }
    cat("\n\n- mean No. of days/participant =",
        round(mean(info$nDays),2)," SD =",round(sd(info$nDays),2)," min =",min(info$nDays)," max =",max(info$nDays),
        "\n- mean data collection duration (days) =",
        round(mean(info$Tint),2),"- SD =",round(sd(info$Tint),2)," min =",min(info$Tint)," max =",max(info$Tint),
        "\n\n- mean No. of missing days per participant =",round(mean(info$nMissingDays),2),
        " SD =",round(sd(info$nMissingDays),2)," min =",min(info$nMissingDays)," max =",max(info$nMissingDays),
        "\n- mean No. of consecutive missing days per participant =",round(mean(info$maxdayDiff),2),
        " SD =",round(sd(info$maxdayDiff),2)," min =",min(info$maxdayDiff)," max =",max(info$maxdayDiff))
  if(!is.na(hour)){ cat("\n\n- No. of duplicated cases by hour (same ID and hour) =",dupHour) 
    }else{ cat("\n\n- No. of duplicated cases by day (same ID and day) =",dupDay) }}
  
  
  # resetting column names
  if(!is.na(day)) { colnames(data)[which(colnames(data)=="day")] <- day }
  colnames(data)[which(colnames(data)=="ID")] <- ID
  if(!is.na(hour)){ colnames(data)[which(colnames(data)=="hour")] <- hour }
  
  # data output
  if(returnInfo==TRUE){ return(info) }else{ return(data) } 
  }
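
The compliance metrics computed by timeCheck for each participant (nDays, Tint, nMissingDays, and maxdayDiff) can be reproduced on a toy example. The sketch below is a vectorized re-expression using `diff`, written for illustration under the assumption of one hypothetical participant with four recording days. It is not the code used by the function itself.

```r
# Toy vector of recording days for one hypothetical participant
days <- as.Date(c("2019-01-07", "2019-01-08", "2019-01-10", "2019-01-15"))

nDays <- length(days)                            # No. of recorded days
Tint  <- as.integer(max(days) - min(days)) + 1   # duration of the collection interval (days)
dayDiff <- as.integer(diff(days))                # differences between consecutive recording days
nMissingDays <- sum(dayDiff[dayDiff > 1] - 1)    # days skipped within the interval
maxdayDiff <- max(c(0, dayDiff[dayDiff > 1]))    # largest difference between consecutive recording days

c(nDays = nDays, Tint = Tint, nMissingDays = nMissingDays, maxdayDiff = maxdayDiff)
```

In this example, nDays + nMissingDays equals Tint (4 + 5 = 9). Note that, as in timeCheck, maxdayDiff is the difference between two consecutive recording dates, so a value of 5 corresponds to 4 consecutive missing days.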

2.1. dailyAct

dailyAct data are stored in a dataset with one row per day. Thus, in this dataset we only need to recode the ActivityDate variable.

# recoding day and hour, and checking time and missing data points
dailyAct <-  timeCheck(data=dailyAct,ID="ID",day="ActivityDate",hour=NA,
                       input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d")

## Warning: package 'ggplot2' was built under R version 4.0.5

## 6019 days from 93 participants:
## 
## - mean No. of days/participant = 64.72  SD = 5.25  min = 49  max = 100 
## - mean data collection duration (days) = 64.72 - SD = 5.25  min = 49  max = 100 
## 
## - mean No. of missing days per participant = 0  SD = 0  min = 0  max = 0 
## - mean No. of consecutive missing days per participant = 0  SD = 0  min = 0  max = 0
## 
## - No. of duplicated cases by day (same ID and day) = 0

Comments:

the ActivityDate variable has been successfully recoded with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 63 non-missing days of data, with only one participant (s039) having fewer than 50 days (i.e., 49), and only two participants having more than 75 days (i.e., s001: 88 days; s041: 100 days)
no missing days are observed
no duplicates are included in the dataset

2.1.1. Temporal continuity

Finally, we plot the order of the data points (days) against ActivityDate for each participant to better inspect the pattern of missing data.

par(mfrow=c(3,3))
for(ID in levels(dailyAct$ID)){ 
  plot((1:nrow(dailyAct[dailyAct$ID==ID,])~dailyAct[dailyAct$ID==ID,"ActivityDate"]),main=ID,xlab="",ylab="") }

Comments:

consistent with what was observed above, no temporal interruptions are included in dailyAct data

2.1.2. Saving dataset

Here, we save the processed dailyAct dataset to be used in the following steps.

save(dailyAct,file="DATA/datasets/dailyAct_timeProcessed.RData")
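
For readers less familiar with the .RData format, the sketch below shows a save/load round trip on a toy data frame written to a temporary file. The object and file names are hypothetical. The point is that `load` restores the object under its original name and that the Date class of the recoded variable is preserved.

```r
# Toy dataset with a recoded date variable
toy <- data.frame(ID = "s001", ActivityDate = as.Date("2019-01-07"))

# Saving to a temporary file and removing the object from the workspace
f <- file.path(tempdir(), "toy_timeProcessed.RData")
save(toy, file = f)
rm(toy)

# load() restores the object under its original name and returns that name
loaded <- load(f)
loaded                    # "toy"
class(toy$ActivityDate)   # "Date"
```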

2.2. hourlySteps

hourlySteps data are stored in a dataset with one row per hour. Thus, in this dataset we need to recode the ActivityHour variable, based on which the ActivityDate variable is computed.
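
As an illustration of how a day variable can be derived from an hourly time stamp, the sketch below builds a toy series of 48 consecutive hours for one hypothetical participant, computes the date from the hour, and counts the rows per day. The object names are assumptions. A complete day is expected to contribute 24 rows.

```r
# Toy hourly series: 48 consecutive hours for one hypothetical participant
hours <- seq(from = as.POSIXct("2019-01-07 00:00:00", tz = "GMT"), by = "hour", length.out = 48)
toyHourly <- data.frame(ID = "s001", ActivityHour = hours)

# Deriving the day from the hourly time stamp (time zone stated explicitly)
toyHourly$ActivityDate <- as.Date(toyHourly$ActivityHour, tz = "GMT")

# Counting the No. of hourly rows per day
hoursPerDay <- table(toyHourly$ActivityDate)
hoursPerDay
```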

# recoding day and hour, and checking time and missing data points
hourlySteps <-  timeCheck(data=hourlySteps,ID="ID",day="ActivityDate",hour="ActivityHour",
                          input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")

## 143961 observations in 6006 days from 93 participants:
## 
## - mean No. of days/participant = 64.58  SD = 5.21  min = 49  max = 99 
## - mean data collection duration (days) = 64.58 - SD = 5.21  min = 49  max = 99 
## 
## - mean No. of missing days per participant = 0  SD = 0  min = 0  max = 0 
## - mean No. of consecutive missing days per participant = 0  SD = 0  min = 0  max = 0
## 
## - No. of duplicated cases by hour (same ID and hour) = 0

Comments:

the ActivityDate variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 63 non-missing days of data, with only one participant (s039) having fewer than 50 days (i.e., 49), and only two participants having more than 75 days (i.e., s001: 88 days; s041: 100 days)
no missing days are observed

2.2.1. Temporal continuity

Finally, we plot the order of the data points (hours) against ActivityHour for each participant to better inspect the pattern of missing data.

par(mfrow=c(3,3))
for(ID in levels(hourlySteps$ID)){ 
  plot((1:nrow(hourlySteps[hourlySteps$ID==ID,])~hourlySteps[hourlySteps$ID==ID,"ActivityHour"]),main=ID,xlab="",ylab="",
       cex=0.5) }

Comments:

consistent with what was observed above, no temporal interruptions are included in hourlySteps data

2.2.2. Saving dataset

Here, we save the processed hourlySteps dataset to be used in the following steps.

save(hourlySteps,file="DATA/datasets/hourlySteps_timeProcessed.RData")

2.3. sleepLog

sleepLog data exported from Fitabase are stored in a dataset with one row per sleep period, and multiple sleep periods can be found in the same participant × day combination. Thus, in this dataset we need to recode the StartTime variable (i.e., the “date and time the sleep record started”), based on which the ActivityDate variable is computed.

Here, before applying the timeCheck function, we need to standardize the format of this variable (i.e., sometimes encoded with the “PM”/“AM” specification, and sometimes not).
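
The idea behind this standardization can be sketched on a toy vector mixing the two encodings. The first value is taken from the str(sleepLog) output, whereas the second is a hypothetical 24-hour entry. The sketch assumes an English or C locale, since the `%p` specification is locale-dependent.

```r
raw <- c("1/7/2019 10:06:30 PM",  # 12-hour clock with AM/PM, as in str(sleepLog)
         "1/25/2019 21:53")       # hypothetical 24-hour entry without AM/PM

# Flagging the entries that end with AM or PM
hasAMPM <- grepl("[AP]M$", raw)

# Parsing everything as 24-hour first, then overwriting the AM/PM entries
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M", tz = "GMT")
parsed[hasAMPM] <- as.POSIXct(raw[hasAMPM], format = "%m/%d/%Y %I:%M:%S %p", tz = "GMT")

format(parsed, "%Y-%m-%d %H:%M:%S")
```

Entries that match neither format are returned as NA, which is why a final check with `is.na` is useful before proceeding.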

# standardizing StartTime format and converting as POSIXct
sleepLog$lastTimeChar <- substr(sleepLog$StartTime,nchar(sleepLog$StartTime),nchar(sleepLog$StartTime))
sleepLog[sleepLog$lastTimeChar=="M","StartTime2"] <-
  as.POSIXct(sleepLog[sleepLog$lastTimeChar=="M","StartTime"],format="%m/%d/%Y %I:%M:%S %p",tz="GMT") # AM/PM specification
sleepLog[sleepLog$lastTimeChar!="M","StartTime2"] <-
  as.POSIXct(sleepLog[sleepLog$lastTimeChar!="M","StartTime"],format="%m/%d/%Y %H:%M",tz="GMT") # no AM/PM specification
sleepLog[is.na(sleepLog$StartTime2),"StartTime2"] <- 
  as.POSIXct(sleepLog[is.na(sleepLog$StartTime2),"StartTime"],format="%d/%m/%Y %H.%M",tz="GMT") # no AM/PM + '.' instead of ':'
sleepLog$StartTime <- sleepLog$StartTime2 # updating StartTime

# recoding day and hour, and checking time and missing data points
p <-  timeCheck(data=sleepLog,ID="ID",day="ActivityDate",hour="StartTime",
                input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")

## 5403 observations in 4387 days from 93 participants:
## 
## - mean No. of days/participant = 47.17  SD = 12.57  min = 6  max = 86 
## - mean data collection duration (days) = 76.82 - SD = 38.93  min = 7  max = 331 
## 
## - mean No. of missing days per participant = 29.65  SD = 40.59  min = 0  max = 275 
## - mean No. of consecutive missing days per participant = 16.95  SD = 37.07  min = 0  max = 268
## 
## - No. of duplicated cases by hour (same ID and hour) = 1

Comments:

the ActivityDate variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 40-60 non-missing days of data, with only one participant showing fewer than 10 days (s089: 6 days), and three further participants showing fewer than 20 days (s038, s041, and s105)
two cases have the same participant’s identifier and temporal coordinate (see section 2.3.1)
a substantial No. of missing days is observed, with 28% of the sample showing 10+ consecutive missing days, and 38% showing 20+ consecutive missing days. Here, we can see that at least some of the long missing data periods are due to one or two isolated data points (nights) that were recorded some months later than the other data points. These cases are discussed further in the data cleaning section.

(sleepLog_compliance <-  
  timeCheck(data=sleepLog,ID="ID",day="ActivityDate",hour="StartTime",returnInfo=TRUE,printInfo=FALSE,
            input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M"))

# plotting missing days
par(mfrow=c(1,2))
hist(sleepLog_compliance$nMissingDays,main="No. of missing days",breaks=30)
hist(sleepLog_compliance$maxdayDiff,main="Max No. of consecutive missing days",breaks=30)

# showing the last three records of participants with > 50 consecutive missing days
for(ID in sleepLog_compliance[sleepLog_compliance$maxdayDiff>50,"ID"]){
  print(tail(sleepLog[sleepLog$ID==ID,c("ID","LogId","StartTime")],3)) }

##       ID       LogId           StartTime
## 257 s006 21983522520 2019-04-15 23:17:00
## 258 s006 21996799829 2019-04-16 23:17:00
## 259 s006 25381321835 2020-01-09 23:28:30
##        ID       LogId           StartTime
## 1191 s025 22737210199 2019-06-16 02:02:30
## 1192 s025 22750117624 2019-06-17 00:54:00
## 1193 s025 24233083393 2019-10-10 22:52:30
##        ID       LogId           StartTime
## 1511 s029 22924261808 2019-07-01 01:12:00
## 1512 s029 23768589455 2019-09-06 22:18:30
## 1513 s029 24046024418 2019-09-27 22:57:00
##        ID       LogId           StartTime
## 1582 s030 23099294456 2019-07-15 01:09:00
## 1583 s030 23110901994 2019-07-16 02:02:00
## 1584 s030 23834537413 2019-09-10 22:45:00
##        ID       LogId           StartTime
## 1913 s038 23951489399 2019-09-19 22:14:30
## 1914 s038 24914189245 2019-12-05 22:03:00
## 1915 s038 24997629259 2019-12-12 22:47:30
##        ID       LogId           StartTime
## 1981 s040 23056415228 2019-07-08 08:59:30
## 1982 s040 23056415229 2019-07-09 00:50:30
## 1983 s040 25340851080 2020-01-07 22:28:00
##        ID       LogId           StartTime
## 2470 s053 24978955724 2019-12-09 00:27:00
## 2471 s053 24987038952 2019-12-12 00:29:00
## 2472 s053 25885992780 2020-02-12 23:20:30
##        ID       LogId           StartTime
## 2711 s058 25000689911 2019-12-11 00:06:30
## 2712 s058 25000689912 2019-12-12 01:04:00
## 2713 s058 25886123469 2020-02-12 23:40:00
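To make the origin of these long gaps explicit, the largest interval between consecutive recorded nights can be computed directly from the dates. The following is a minimal base-R sketch: max_day_gap is a hypothetical helper (it is not part of timeCheck, whose internals are not shown here), the s006 dates are copied from the rows printed above, and the s999 participant is made up.

```r
# Hypothetical helper: largest difference (in days) between consecutive recorded dates
max_day_gap <- function(dates) {
  d <- sort(unique(as.Date(dates)))
  if (length(d) < 2) return(0)
  max(as.numeric(diff(d)))
}

# dates of the last three s006 records shown above
s006_days <- c("2019-04-15", "2019-04-16", "2020-01-09")
max_day_gap(s006_days) # 268: a single isolated night recorded months later

# the same computation by participant on a toy dataset
toy <- data.frame(ID = c("s006", "s006", "s006", "s999", "s999"),
                  day = c(s006_days, "2019-05-01", "2019-05-02"),
                  stringsAsFactors = FALSE)
gaps <- tapply(toy$day, toy$ID, max_day_gap)
gaps
```

In this sketch, a single late record is enough to produce a very large gap, which is the pattern visible in the output above.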

2.3.1. LogId & SleepDataType

Here, we recode further important variables, as defined by Fitabase (accessed on March 10th, 2021):

LogId: identifies separate sleep periods for each participant. Here, we remove one case (i.e., the one highlighted above by the timeCheck function) that shares its LogId with another case but shows a less realistic sleep duration (i.e., 1.3h) than the other case (i.e., sleep duration = 11.3h).

# LogId as factor
sleepLog <- p # dataset processed with timeCheck
sleepLog$LogId <- as.factor(sleepLog$LogId)

# sanity check
nrow(sleepLog) - nlevels(sleepLog$LogId) # only one LogId is included twice

## [1] 1

sleepLog[sleepLog$LogId==names(summary(sleepLog$LogId)[which(summary(sleepLog$LogId)==2)]),] # showing LogId

# removing duplicated LogId with SleepDataType = "stages"
sleepLog <- sleepLog[!(sleepLog$LogId==names(summary(sleepLog$LogId)[which(summary(sleepLog$LogId)==2)]) &
                         sleepLog$SleepDataType=="stages"),]
cat("sanity check:",nlevels(as.factor(as.character(sleepLog$LogId)))==nrow(sleepLog)) # now each row has its own identifier

## sanity check: TRUE
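As an alternative to summarizing the factor levels, duplicated identifiers can also be located with duplicated(). The following is a minimal sketch on toy data; the toy values are hypothetical, and only the column names LogId and SleepDataType are taken from the dataset.

```r
# Toy data (hypothetical values): LogId "b" appears twice
toy <- data.frame(LogId = c("a", "b", "b", "c"),
                  SleepDataType = c("stages", "stages", "classic", "classic"),
                  stringsAsFactors = FALSE)

# LogId values occurring more than once
dup_ids <- unique(toy$LogId[duplicated(toy$LogId)])
toy[toy$LogId %in% dup_ids, ] # inspecting the duplicated records

# dropping the "stages" record among the duplicates, as done above
cleaned <- toy[!(toy$LogId %in% dup_ids & toy$SleepDataType == "stages"), ]
anyDuplicated(cleaned$LogId) == 0 # TRUE when each row has its own identifier
```

Unlike the comparison based on names(summary(...)), this version also works when more than one LogId is duplicated.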

SleepDataType: indicates the sleep algorithm used for the sleep record, i.e., “stages” (84%) or “classic” (16%). Note that information on sleep stage duration is reported only in the former cases.

# SleepDataType as factor
sleepLog$SleepDataType <- as.factor(sleepLog$SleepDataType)
summary(sleepLog$SleepDataType) # "classic" in 16% of cases

## classic  stages 
##     857    4545

IsMainSleep: indicates whether the sleep record is the main sleep record for that day (TRUE; i.e., nocturnal sleep period) or not (FALSE; i.e., considered as nocturnal or diurnal nap).

# standardizing IsMainSleep encoding
sleepLog$IsMainSleep <- as.logical(toupper(gsub("FALSO","FALSE",gsub("VERO","TRUE",sleepLog$IsMainSleep))))
summary(sleepLog$IsMainSleep) # only 9% with IsMainSleep = FALSE

##    Mode   FALSE    TRUE 
## logical     500    4902
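The recoding above can also be sketched with exact matching instead of gsub, so that any unexpected label becomes NA rather than being partially substituted. This is a minimal sketch: recode_logical and the example vector are hypothetical, and it assumes that "VERO"/"FALSO" are the only non-English labels, as suggested by the code above.

```r
# Hypothetical helper: recode mixed English/Italian logical labels
recode_logical <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x[x %in% "VERO"] <- "TRUE"
  x[x %in% "FALSO"] <- "FALSE"
  as.logical(x) # any other label is returned as NA
}

raw <- c("TRUE", "False", "VERO", "FALSO", "maybe") # made-up labels
recoded <- recode_logical(raw)
table(recoded, useNA = "ifany") # the NA count flags unexpected labels
```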

2.3.2. StartHour & EndHour

Then, to better inspect timing and duration of the recorded sleep periods, we recode the StartTime variable to create EndTime (i.e., StartTime + TimeInBed), as well as StartHour and EndHour, indicating only the time (and not the date). This is done with the StartTime_rec function.


#' @title Recoding and plotting StartTime and EndTime
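#' @param data = data.frame including at least the start column.
#' @param start = character string indicating the name of the variable including the starting time
#' @param end = character string indicating the name of the variable including the ending time, which is created when both end and duration are specified (optional, default: NA)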
#' @param duration = character string indicating the name of the variable including the recording duration (optional, default: NA)
#' @param duration.unit = character string indicating the measurement unit of the duration variable: either "mins" (default) or "secs"
#' @param doPlot = logical indicating whether StartTime and EndTime should be plotted (default: TRUE)
#' @param returnData =  logical indicating whether the recoded dataset should be returned (default: TRUE)
StartTime_rec <- function(data,start="StartTime",end=NA,duration=NA,duration.unit="mins",doPlot=TRUE,returnData=TRUE){ 
  require(lubridate) # required package
 
  # setting columns names
  colnames(data)[colnames(data)==start] <- "start"
  if(!is.na(end)){ colnames(data)[colnames(data)==end] <- "end" }
  if(!is.na(duration)) { 
    colnames(data)[colnames(data)==duration] <- "duration" 
    if(duration.unit=="mins"){ TIB = data$duration*60 }else if(duration.unit=="secs"){ TIB = data$duration 
    }else{ stop("duration.unit can only be 'mins' or 'secs'") }}
  
  # creating EndTime (if both end and duration are specified)
  if(!is.na(end) & !is.na(duration)){ data$end <- data$start + TIB }
  
  # creating and plotting StartHour and EndHour
  data$StartHour <- as.POSIXct(paste(hour(data$start),minute(data$start)),format="%H %M",tz="GMT") 
  if(!is.na(end)){ data$EndHour <- as.POSIXct(paste(hour(data$end),minute(data$end)),format="%H %M",tz="GMT") }
  if(doPlot==TRUE){ require(ggplot2); require(gridExtra) # required packages for data visualization
    p <- ggplot(data,aes(StartHour))+geom_histogram(bins=30)+ggtitle(start)+xlab("")+scale_x_datetime(date_labels="%H:%M")
    if(!is.na(end)){
      grid.arrange(p,ggplot(data,aes(EndHour))+geom_histogram(bins=100)+ggtitle(end)+xlab("") +
                     scale_x_datetime(date_labels="%H:%M"))
    }else{ print(p) }}
  
  # updating EndHour so that it indicates the following days if after 00:00
  if(!is.na(end)){
    h18 <- as.POSIXct(paste(substr(Sys.time(),1,10),"18:00:00"),tz="GMT")
    h23 <- as.POSIXct(paste(substr(Sys.time(),1,10),"23:59:59"),tz="GMT")
    h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
    data[data$StartHour > h18 & data$StartHour < h23 & data$EndHour >= h00 & data$EndHour < h18,"EndHour"] <-
      data[data$StartHour > h18 & data$StartHour < h23 & data$EndHour >= h00 & data$EndHour < h18,"EndHour"] + 1*24*60*60 }
  
  # resetting columns names
  colnames(data)[colnames(data)=="start"] <- start
  if(!is.na(end)){ colnames(data)[colnames(data)=="end"] <- end }
  if(!is.na(duration)) { colnames(data)[colnames(data)=="duration"] <- duration }
  
  # returning recoded data
  if(returnData==TRUE){ return(data) }}

# recoding and plotting StartTime and EndTime
sleepLog <- StartTime_rec(data=sleepLog,start="StartTime",end="EndTime",duration="TimeInBed",duration.unit="mins",
                          doPlot=TRUE,returnData=TRUE)
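To illustrate what the recoded variables contain, the following minimal sketch computes the end time and the clock time of a single record. The StartTime and TimeInBed values are made up, and clock_hours is a hypothetical helper returning decimal hours rather than the POSIXct values created by StartTime_rec.

```r
# one made-up record: StartTime (GMT) and TimeInBed (minutes)
start_time <- as.POSIXct("2019-04-15 23:17:00", tz = "GMT")
time_in_bed <- 450

# EndTime = StartTime + TimeInBed (POSIXct arithmetic is in seconds)
end_time <- start_time + time_in_bed * 60
format(end_time, "%Y-%m-%d %H:%M:%S") # "2019-04-16 06:47:00"

# clock time only (decimal hours), discarding the date
clock_hours <- function(t) as.numeric(format(t, "%H")) + as.numeric(format(t, "%M")) / 60
clock_hours(start_time) # about 23.28
clock_hours(end_time)   # about 6.78

# a sleep period crosses midnight when its end clock time precedes its start clock time
crosses_midnight <- clock_hours(end_time) < clock_hours(start_time)
crosses_midnight
```

This is why StartTime_rec shifts EndHour by one day for periods starting in the evening and ending after 00:00.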

Comments:

most StartTime are between 10:00 PM and 2:00 AM
only a minority of cases has a StartTime in diurnal hours. Most of these cases are diurnal naps, as suggested by the following graph, showing the distribution of StartTime values in cases with IsMainSleep = TRUE (above) and IsMainSleep = FALSE (below)

# showing StartTime by IsMainSleep
StartTime_rec(data=sleepLog[sleepLog$IsMainSleep==TRUE,],returnData=FALSE) # IsMainSleep = TRUE

StartTime_rec(data=sleepLog[sleepLog$IsMainSleep==FALSE,],returnData=FALSE) # IsMainSleep = FALSE

Comments:

as expected, StartTime values in diurnal hours are more likely to be observed when IsMainSleep = FALSE
in some cases with IsMainSleep = FALSE, the StartTime is between 10:00 PM and 3:00 AM, probably due to bouts of sleep epochs incorrectly encoded as “naps”
there is still a (very low) number of cases with IsMainSleep = TRUE and StartTime in diurnal hours

2.3.3. Combining sleep

Here, we want to temporally recode sleepLog data based on our definition of nocturnal sleep periods, that is a period of inactivity (as detected by the FC3 device) characterized by the following conditions:

1. Starting between 6 PM and 6 AM

2. At least 180 min (3 hours) of Total Sleep Time

3. Possibly being interrupted by an indefinite number of wake periods of indefinite duration, but with the last sleep period starting before 11 AM

4. Consecutive sleep periods between 6 PM and 11 PM, and between 6 AM and 11 AM, are combined only when separated by less than 1.5 hours

Here, we use the sleepPeriodRecode function to filter and recode the data based on Condition #1 (i.e., by excluding sleep periods with StartHour before 6 PM or after 6 AM), whereas Condition #2 will be applied in the data cleaning section below.
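Since the 6 PM to 6 AM interval wraps around midnight, Condition #1 cannot be checked with a single "between" comparison. The following is a minimal sketch of such a check on "hh:mm" strings; in_night_window is a hypothetical helper and not the implementation used by sleepPeriodRecode, which works on POSIXct values.

```r
# Hypothetical helper: is a clock time ("hh:mm") within a window that may wrap midnight?
in_night_window <- function(hhmm, from = "18:00", to = "06:00") {
  to_min <- function(s) { # "hh:mm" -> minutes since 00:00
    vapply(strsplit(s, ":", fixed = TRUE),
           function(z) as.numeric(z[1]) * 60 + as.numeric(z[2]), numeric(1))
  }
  x <- to_min(hhmm); a <- to_min(from); b <- to_min(to)
  if (a <= b) {
    x >= a & x <= b # window within the same day
  } else {
    x >= a | x <= b # window wrapping midnight (e.g., 18:00 - 06:00)
  }
}

in_night_window(c("22:30", "02:15", "14:00", "06:00", "18:00"))
# TRUE TRUE FALSE TRUE TRUE
```

Both boundaries are treated as inclusive in this sketch.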

We also apply Condition #3 by combining consecutive sleep periods with StartTime up to 11 AM. Indeed, Fitbit sometimes encodes short bouts of morning sleep (or early evening sleep) as separate sleep periods. Usually, but not always, these short bouts are encoded with IsMainSleep = FALSE and SleepDataType = "classic". For combined sleep periods, TIB is recomputed as the No. of minutes between sleep1’s StartTime and sleep2’s EndTime, whereas TST is recomputed as the sum of sleep1’s and sleep2’s TST (the time between sleep1 and sleep2 is considered as wake).

Note that Condition #3 is applied conditionally on Condition #4: short sleep periods ending before 11 PM or starting after 6 AM, and separated from the adjacent sleep period by 1.5 or more hours of wake, are not combined with it, but are instead considered as naps rather than nocturnal sleep periods.
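The combination rule and the recomputation of TIB and TST can be sketched on two toy records. This is a simplified illustration and not the sleepPeriodRecode implementation: night_hours, should_combine, and combine_periods are hypothetical helpers, all values are made up, and times are placed on a "night axis" (hours elapsed since 18:00), so that 23:00 corresponds to 5 and 06:00 to 12.

```r
# position of a clock time on the night axis: hours elapsed since 18:00
night_hours <- function(t) {
  h <- as.numeric(format(t, "%H")) + as.numeric(format(t, "%M")) / 60
  (h - 18) %% 24
}

# Condition #4 (simplified): the max wake duration applies only when
# sleep1 ends before 23:00 or sleep2 starts after 06:00
should_combine <- function(s1, s2, max_wake = 1.5) {
  wake <- as.numeric(difftime(s2$StartTime, s1$EndTime, units = "hours"))
  limit_applies <- night_hours(s1$EndTime) < 5 || night_hours(s2$StartTime) > 12
  !(wake > max_wake && limit_applies)
}

# TIB = sleep2 EndTime - sleep1 StartTime; TST = sum of the two TST values
combine_periods <- function(s1, s2) {
  list(StartTime = s1$StartTime, EndTime = s2$EndTime,
       TimeInBed = as.numeric(difftime(s2$EndTime, s1$StartTime, units = "mins")),
       MinutesAsleep = s1$MinutesAsleep + s2$MinutesAsleep)
}

gmt <- function(x) as.POSIXct(x, tz = "GMT")

# main sleep followed by a short morning bout after 40 min of wake
main  <- list(StartTime = gmt("2021-03-01 22:00:00"), EndTime = gmt("2021-03-02 05:00:00"),
              MinutesAsleep = 380)
short <- list(StartTime = gmt("2021-03-02 05:40:00"), EndTime = gmt("2021-03-02 07:10:00"),
              MinutesAsleep = 80)
should_combine(main, short) # TRUE
night <- combine_periods(main, short)
night$TimeInBed     # 550 min (22:00 to 07:10)
night$MinutesAsleep # 460 min (the 40 min in between count as wake)

# early evening bout followed by 2.5 hours of wake: considered as a nap
evening_nap <- list(StartTime = gmt("2021-03-01 19:30:00"), EndTime = gmt("2021-03-01 21:00:00"),
                    MinutesAsleep = 70)
late_main   <- list(StartTime = gmt("2021-03-01 23:30:00"), EndTime = gmt("2021-03-02 07:00:00"),
                    MinutesAsleep = 400)
should_combine(evening_nap, late_main) # FALSE
```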


#' @title Recoding Fitabase-derived sleep periods
#' @param data = data.frame of Fitabase-derived sleepLog data (one row per sleep period).
#' @param sleep_limits = character vector indicating the minimum and maximum starting time ("hh:mm") of sleep periods.
#' @param combine.sleep = logical value (default: TRUE) indicating whether consecutive sleep periods should be combined in a single sleep period. If FALSE, the function deletes all cases with IsMainSleep = FALSE. 
#' @param lastSleep_startTime = character string indicating the maximum StartHour ("hh:mm") of the last combined sleep period
#' @param max_wakeNumber = integer indicating the maximum number of wake periods separating consecutive sleep periods (used only when combine.sleep = TRUE).
#' @param max_wakeDuration = numeric indicating the maximum duration (in hours) of wake periods separating consecutive sleep periods (used only when combine.sleep = TRUE; default: 1.5). If NA, it is computed as the hour difference between the first element of sleep_limits and the lastSleep_startTime value
#' @param max_wakeDuration_exclude = character vector indicating the minimum and maximum times ("hh:mm") between which the max_wakeDuration parameter is NOT applied.
#' @param notCombined_LogId = character vector indicating the LogId of sleep periods that should not be combined with preceding/subsequent sleep periods
#' @param doPlot = logical value (default: FALSE) indicating whether combined sleep periods should be plotted
sleepPeriodRecode <- function(data,
                              sleep_limits = c("18:00","06:00"),
                              combine.sleep=TRUE,
                              lastSleep_startTime = "11:00",
                              max_wakeNumber=Inf,
                              max_wakeDuration=1.5,
                              max_wakeDuration_exclude = c("23:00","06:00"),
                              notCombined_LogId = NA,
                              doPlot=FALSE){ 
  
  N.original = nrow(data)
  cat("\n\n---\nRECODING SLEEP PERIODS (combine = ",combine.sleep,")\nOriginal number of cases = ",N.original,sep="")
  data <- data[order(data$ID,data$StartTime),] # sorting by ID and time
  
  if(combine.sleep==FALSE){
    # .................................................
    # Not combining sleep periods
    # .................................................

    data <- data[data$IsMainSleep!=FALSE,]
    N.IsMainSleep=nrow(data)
    cat("\n\n - Removing",N.original-N.IsMainSleep,"cases with IsMainSleep = FALSE")
    data <- data[!(data$StartHour < as.POSIXct(paste(substr(Sys.time(),1,10),
                                                       paste(sleep_limits[1],"00",sep=":")),tz="GMT") &
                     data$StartHour > as.POSIXct(paste(substr(Sys.time(),1,10),
                                                       paste(sleep_limits[2],"00",sep=":")),tz="GMT")),]
    N.sleep_limits <- nrow(data)
    cat("\n\n - Removing",N.IsMainSleep-N.sleep_limits,"cases with StartHour outside the",sleep_limits[1],
        "-",sleep_limits[2],"interval")
    
  } else {
    # .................................................
    # Combining consecutive sleep periods
    # .................................................
    cat("\n\nCombining consecutive sleep periods...")
    
    # filtering data based on sleep_limits[1] and lastSleep_startTime
    data <- data[!(data$StartHour < as.POSIXct(paste(substr(Sys.time(),1,10),
                                                       paste(sleep_limits[1],"00",sep=":")),tz="GMT") &
                     data$StartHour > as.POSIXct(paste(substr(Sys.time(),1,10),
                                                       paste(lastSleep_startTime,"00",sep=":")),tz="GMT")),]
    N.sleep_limits <- nrow(data)
    cat("\n\n - Removing",N.original-N.sleep_limits,"cases with StartHour outside the",sleep_limits[1],
        "-",lastSleep_startTime,"interval")
    
    # defining constraints on nocturnal wake duration (default: sleep_limits[1] - lastSleep_startTime)
    if(is.na(max_wakeDuration)){ 
      max_wakeDuration <- difftime(as.POSIXct(paste(substr(Sys.time(),1,10),
                                              paste(sleep_limits[1],"00",sep=":")),tz="GMT"),
                                   as.POSIXct(paste(substr(Sys.time(),1,10),
                                              paste(lastSleep_startTime,"00",sep=":")),tz="GMT")) } 
    
    # defining first row of new.data (now just taking the first row, or the second when the first is a nap) --> to fix later (!)
    data$combined <- FALSE
    combinedLogId <- NA
    if(!(data[1,"StartHour"] < as.POSIXct(paste(substr(Sys.time(),1,10),paste(sleep_limits[1],"00",sep=":")),tz="GMT") &
         data[1,"StartHour"] > as.POSIXct(paste(substr(Sys.time(),1,10), paste(sleep_limits[2],"00",sep=":")),tz="GMT"))){
      new.data <- cbind(data[1,],nCombined=0,combinedLogId=combinedLogId,combType=as.character(data[1,"IsMainSleep"]),
                        combSeq=paste(round(data[1,"MinutesAsleep"]/60,2),"S",sep="")) 
    } else { new.data <- cbind(data[2,],nCombined=0,combinedLogId=combinedLogId,combType=as.character(data[2,"IsMainSleep"]),
                               combSeq=paste(round(data[2,"MinutesAsleep"]/60,2),"S",sep="")) } 
    
    # iteratively adding sleep periods to new.data OR combining them when meeting the criteria identifying them as consecutive
    for(i in 2:nrow(data)){
      # updating cases to be compared
      sleep1 <- new.data[nrow(new.data),]
      sleep2 <- data[i,]
      
      # identification of consecutive sleep periods: (1) within the same subject 
      if(sleep2$ID == sleep1$ID & 
         # (2) AND sleep1 StartHour between 18:00 and 00:00 AND sleep2 StartHour before 11:00 of the following day
         (
           (sleep1$StartHour <= as.POSIXct(paste(substr(Sys.time(),1,10),"23:59:59"),tz="GMT") &
           sleep1$StartHour >= as.POSIXct(paste(substr(Sys.time(),1,10),"18:00:00"),tz="GMT") &
           sleep2$StartTime <= as.POSIXct(paste(as.character(sleep1$ActivityDate + 1),"11:00:00"),tz="GMT")) 
           |
          # OR sleep1 StartHour between 00:00 and 11:00 AND sleep2 StartHour before 11:00 of the same day
          (sleep1$StartHour >= as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT") &
              sleep1$StartHour <= as.POSIXct(paste(substr(Sys.time(),1,10),"11:00:00"),tz="GMT") &
              sleep2$StartTime <= as.POSIXct(paste(as.character(sleep1$ActivityDate),"11:00:00"),tz="GMT"))
          ) &
         # (3) AND the No. of combined sleep periods in this night is lower than max_wakeNumber (OLD TO BE CHECKED (!))
         sleep1$nCombined - 1 < max_wakeNumber & # (2) within the defined max No. of nocturnal wake periods
         
         # (4) AND unspecified max_wakeDuration_exclude AND sleep2 StartTime - sleep1 EndTime being < than max_wakeDuration
         (
           (is.na(max_wakeDuration_exclude[1]) & difftime(sleep2$StartTime,sleep1$EndTime,units="hours") < max_wakeDuration)
           |
           # OR specified max_wakeDuration_exclude AND wake between sleep1 and sleep2 is < than max_wakeDuration ...
           (!is.na(max_wakeDuration_exclude[1]) & 
            !(
              difftime(sleep2$StartTime,sleep1$EndTime,units="hours") > max_wakeDuration &
              (
                # ... to be applied ONLY when sleep1 ends before 23:00 
                sleep1$EndHour < as.POSIXct(paste(substr(Sys.time(),1,10),"23:00:00"),tz="GMT") 
                |
                # OR when sleep2 starts after 6:00
                sleep2$StartHour > as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT") ) )) )){
        
        # excluding those cases reported in the notCombined_LogId argument (taking only the longest one)
        if(!is.na(notCombined_LogId)[1] & (sleep1$LogId%in%notCombined_LogId | sleep2$LogId%in%notCombined_LogId)){
          sleep1_TIB <- difftime(sleep1$EndTime,sleep1$StartTime)
          sleep2_TIB <- difftime(sleep2$EndTime,sleep2$StartTime)
          if(sleep2_TIB>sleep1_TIB){ # replacing sleep1 with sleep2 when sleep2 is longer (otherwise ignoring sleep2)
            combinedLogId <- NA
            new.data[nrow(new.data),] <- cbind(data[i,],
                                               nCombined=0,combinedLogId=combinedLogId,
                                               combType=as.character(data[i,"IsMainSleep"]),
                                               combSeq=paste(round(data[i,"MinutesAsleep"]/60,2),"S",sep="")) }
          
        } else {  # updating the information of the first sleep period by integrating the consecutive one
          new.data[nrow(new.data),c("EndTime","EndHour")] <- data[i,c("EndTime","EndHour")] # updating EndTime = last period
          new.data[nrow(new.data),"MinutesAsleep"] <- new.data[nrow(new.data),"MinutesAsleep"] + data[i,"MinutesAsleep"] # summing TST
          new.data[nrow(new.data),"TimeInBed"] <- difftime(as.POSIXct(as.character(new.data[nrow(new.data),"EndTime"]),
                                                                      tz="GMT"), # TIB as EndTime - StartTime in GMT (like in EBE data)
                                                           as.POSIXct(as.character(new.data[nrow(new.data),"StartTime"]),
                                                                      tz="GMT"),units="min")
          new.data[nrow(new.data),"combined"] <- data[c(i,i-1),"combined"] <- TRUE # marking as combined
          new.data[nrow(new.data),"nCombined"] <- new.data[nrow(new.data),"nCombined"] + 1 # updating nCombined 
          combinedLogId <- ifelse(is.na(combinedLogId),
                                  as.character(data[i,"LogId"]),
                                  paste(combinedLogId,as.character(data[i,"LogId"]),sep="_"))
          new.data[nrow(new.data),"combinedLogId"] <- combinedLogId
          new.data[nrow(new.data),"combType"] <- paste(new.data[nrow(new.data),"combType"],data[i,"IsMainSleep"],sep="-")
          new.data[nrow(new.data),"combSeq"] <- paste(new.data[nrow(new.data),"combSeq"],
                                                      "-",round(difftime(data[i,"StartTime"],
                                                                         new.data[nrow(new.data),"EndTime"],
                                                                         units="hours"),2),"W",
                                                      "-",round(data[i,"MinutesAsleep"]/60,2),"S",sep="")
          
          # when not identified as consecutive sleep periods
        } } else { combinedLogId <- NA
          new.data <- rbind(new.data,cbind(data[i,],
                                           nCombined=0,combinedLogId=combinedLogId,combType=as.character(data[i,"IsMainSleep"]),
                                           combSeq=paste(round(data[i,"MinutesAsleep"]/60,2),"S",sep=""))) }}
    
    # recoding and printing information on combined sleep periods
    new.data$combType <- as.factor(gsub("TRUE","Main",gsub("FALSE","Short",new.data$combType)))
    new.data[,c("nCombined","combSeq")] <- lapply(new.data[,c("nCombined","combSeq")],as.factor)
    cat("\n\n - ",nrow(new.data[new.data$combined==TRUE,]),"identified groups of consecutive sleep periods:")
    N.combined <- nrow(new.data)
    cat("\n   Removing",N.sleep_limits-N.combined,"cases (integrated with previous sleep periods)")
    
    # filtering non-combined sleep periods starting between sleep_limits[2] and lastSleep_startTime
    new.data <- new.data[!(new.data$StartHour > as.POSIXct(paste(substr(Sys.time(),1,10),
                                                                 paste(sleep_limits[2],"00",sep=":")),tz="GMT") &
                             new.data$StartHour <= as.POSIXct(paste(substr(Sys.time(),1,10),
                                                                   paste(lastSleep_startTime,"00",sep=":")),tz="GMT")),]
    cat("\n   Removing further",N.combined-nrow(new.data),"cases of non-combined sleep starting between",
        sleep_limits[2],"and",lastSleep_startTime,
        "\n\n\nUpdated number of cases =",nrow(new.data))
    
    if(doPlot==TRUE){ require(ggplot2); require(gridExtra)
      p <- new.data[new.data$combined==TRUE,]
      cat("\n\nPlotting",nrow(p),"cases of combined sleep periods...")
      for(i in 1:nrow(p)){
        p.data <- data[data$ID==p[i,"ID"] & data$combined==TRUE,] # selecting data within the same two days
        p.data <- p.data[difftime(p.data$ActivityDate,p[i,"ActivityDate"],units="days")>=0 &
                           difftime(p.data$ActivityDate,p[i,"ActivityDate"],units="days")<=1,]
        
        # removing first night if it is the same night plotted in the previous case
        if(i > 1) { if(p.data[1,"EndTime"] == p[i-1,"EndTime"]){ p.data <- p.data[2:nrow(p.data),]  }}
        
        # removing nights recorded in the following day
        if(p.data[nrow(p.data),"StartHour"] >= as.POSIXct(paste(substr(Sys.time(),1,10),
                                                                paste(sleep_limits[1],"00",sep=":")),tz="CET") &
           p.data[nrow(p.data),"StartHour"] <= as.POSIXct(paste(substr(Sys.time(),1,10),"23:59:59"),tz="CET") &
           p.data[nrow(p.data),"ActivityDate"] == p.data[1,"ActivityDate"] + 1){
          p.data <- p.data[1:(nrow(p.data)-1),] }
        
        # updating EndHour when StartHour is after EndHour
        p.data[p.data$EndHour<p.data$StartHour,"EndHour"] <- p.data[p.data$EndHour<p.data$StartHour,"EndHour"] + 1*24*60*60
        if(p[i,"EndHour"]<p[i,"StartHour"]){ p[i,"EndHour"] <- p[i,"EndHour"] + 1*24*60*60 }
        
        # updating StartHour and EndHour when StartHour is after the previous EndHour
        for(j in 2:nrow(p.data)){
          if(p.data[j,"StartHour"] < p.data[j-1,"EndHour"]){
            p.data[j,c("StartHour","EndHour")] <- p.data[j,c("StartHour","EndHour")] + 1*24*60*60 }}
        
        # updating all times to align with the current day
        if(p.data[1,"StartHour"] < as.POSIXct(paste(substr(Sys.time(),1,10),"18:00:00"),tz="CET")){
          p.data[,c("StartHour","EndHour")] <- p.data[,c("StartHour","EndHour")] + 1*24*60*60
          p[i,c("StartHour","EndHour")] <- p[i,c("StartHour","EndHour")] + 1*24*60*60
        }
        
        cat("\n\nCase",i,": Subject",as.character(p[i,"ID"]),"on",as.character(p[i,"ActivityDate"]))
        p1 <- qplot(data=p.data,
                    ymin=StartHour,ymax=EndHour,x=ID,geom="linerange",size=I(1)) +
          coord_flip() + theme_bw() + theme(panel.grid = element_blank()) + xlab("") +  ylab("") +
          scale_y_datetime(labels = function(x) format(x, format = "%H:%M"),
                           limits = c(as.POSIXct(paste(substr(Sys.time(),1,10),
                                              paste(sleep_limits[1],"00",sep=":")),tz="CET"),
                                      p.data[nrow(p.data),"EndHour"])) +
          geom_text(aes(label=format(StartHour,format = "%H:%M"),y=StartHour),hjust = 0.5, nudge_x = 0.1) +
          geom_text(aes(label=format(EndHour,format = "%H:%M"),y=EndHour),hjust = 0.5, nudge_x = -0.1) +
          ggtitle(paste("Original TIB =", paste(round(p.data$TimeInBed/60,2), collapse = ', '),"hours - TST =",
                        paste(round(p.data$MinutesAsleep/60,1), collapse = ', '),"hours"))
        p2 <- qplot(data=p[i,],ymin=StartHour,ymax=EndHour,x=ID,geom="linerange",size=I(1)) +
          coord_flip() + theme_bw() + theme(panel.grid = element_blank()) + xlab("") +  ylab("") +
          scale_y_datetime(labels = function(x) format(x, format = "%H:%M"),
                           limits = c(as.POSIXct(paste(substr(Sys.time(),1,10),
                                              paste(sleep_limits[1],"00",sep=":")),tz="CET"),
                                      p[i,"EndHour"])) +
          geom_text(aes(label=format(StartHour,format = "%H:%M"),y=StartHour),hjust = 0.5, nudge_x = 0.1) +
          geom_text(aes(label=format(EndHour,format = "%H:%M"),y=EndHour),hjust = 0.5, nudge_x = 0.1) +
          ggtitle(paste("Combined TIB =",round(p[i,"TimeInBed"]/60,1),
                        "hours - TST =",round(p[i,"MinutesAsleep"]/60,1),"hours"))
        grid.arrange(p1,p2,nrow=2)  }}
    
    # correcting EndHour when EndHour < StartHour (i.e., adding one day)
    new.data[difftime(new.data$EndHour,new.data$StartHour,units="min")<0,"EndHour"] <-
      new.data[difftime(new.data$EndHour,new.data$StartHour,units="min")<0,"EndHour"] + 1*24*60*60
    
    # updating data
    data <- new.data } 
  return(data) }

DATA PROCESSING

Here, we use the sleepPeriodRecode function to filter and combine sleep periods according to the criteria specified above.

Four cases of consecutive sleep periods are not combined based on visual inspection (see PLOTTING): s025 on 2019-05-20 (TIB = 11.7h, LogId = 22420732071), s026 on 2019-05-10 (TIB = 14h, LogId = 22287155636), s026 on 2019-05-17 (TIB = 12.7h, LogId = 22381565258), and s093 on 2020-11-04 (TIB = 14.9h, LogId = 29568099991).

sleepLog.new <- sleepPeriodRecode(data=sleepLog, # data to be processed
                                  sleep_limits=c("18:00","06:00"), # minimum and maximum StartHour
                                  combine.sleep=TRUE, # should consecutive sleep periods be combined?
                                  max_wakeNumber=Inf, # max No. of wake periods separating consecutive sleep periods
                                  max_wakeDuration=1.5, # max hours of wake periods between consecutive sleep periods
                                  max_wakeDuration_exclude=c("23:00","06:00"), # time limits between which max_wakeDur is NOT applied
                                  lastSleep_startTime="11:00", # max StartHour of the last combined sleep period
                                  notCombined_LogId=c("22420732071","22287155636","22381565258","29568099991")) # to not combine

## 
## 
## ---
## RECODING SLEEP PERIODS (combine = TRUE)
## Original number of cases = 5402
## 
## Combining consecutive sleep periods...
## 
##  - Removing 339 cases with StartHour outside the 18:00 - 11:00 interval
## 
##  -  62 identified groups of consecutive sleep periods:
##    Removing 67 cases (integrated with previous sleep periods)
##    Removing further 75 cases of non-combined sleep starting between 06:00 and 11:00 
## 
## 
## Updated number of cases = 4921

Here, we can see the patterns of sequences of combined Main and Short sleep periods, and the associated duration (in hours) of sleep and wake periods for the combined cases.

In other words, the following table shows the number of non-combined Main (IsMainSleep = TRUE) and Short (IsMainSleep = FALSE) sleep periods, and the number of cases recoded by combining a Main and the following Short sleep period (Main-Short), a Short and the following Main sleep period (Short-Main), or more specific sequences (e.g., Short-Short-Main).

# sequences of Main and non-Main sleep periods
as.data.frame(summary(sleepLog.new$combType))

PLOTTING

Here, we visualize all cases of automatically combined consecutive sleep periods. From this visual inspection, we decided to not combine four cases of automatically combined consecutive sleep periods: s025 on 2019-05-20 (TIB = 11.7h, LogId = 22420732071), s026 on 2019-05-10 (TIB = 14h, LogId = 22287155636), s026 on 2019-05-17 (TIB = 12.7h, LogId = 22381565258), and s093 on 2020-11-04 (TIB = 14.9h, LogId = 29568099991).

p <- sleepPeriodRecode(data = sleepLog, # data to be processed
                       sleep_limits = c("18:00","06:00"), # minimum and maximum StartHour
                       combine.sleep = TRUE, # should consecutive sleep periods be combined?
                       max_wakeNumber = Inf, # max No. of wake periods separating consecutive sleep periods
                       max_wakeDuration = 1.5, # max hours of wake periods between consecutive sleep periods
                       max_wakeDuration_exclude = c("23:00","06:00"), # time limits between which max_wakeDur is NOT applied
                       lastSleep_startTime = "11:00", # max StartHour of the last combined sleep period
                       doPlot=TRUE)

## 
## 
## ---
## RECODING SLEEP PERIODS (combine = TRUE)
## Original number of cases = 5402
## 
## Combining consecutive sleep periods...
## 
##  - Removing 339 cases with StartHour outside the 18:00 - 11:00 interval
## 
##  -  66 identified groups of consecutive sleep periods:
##    Removing 67 cases (integrated with previous sleep periods)
##    Removing further 75 cases of non-combined sleep starting between 06:00 and 11:00 
## 
## 
## Updated number of cases = 4921
## 
## Plotting 66 cases of combined sleep periods...
## 
## Case 1 : Subject s007 on 2019-05-18

## 
## 
## Case 2 : Subject s019 on 2019-04-30

## 
## 
## Case 3 : Subject s021 on 2019-05-05

## 
## 
## Case 4 : Subject s023 on 2019-04-23

## 
## 
## Case 5 : Subject s025 on 2019-05-09

## 
## 
## Case 6 : Subject s025 on 2019-05-20

## 
## 
## Case 7 : Subject s026 on 2019-05-10

## 
## 
## Case 8 : Subject s026 on 2019-05-17

## 
## 
## Case 9 : Subject s026 on 2019-06-23

## 
## 
## Case 10 : Subject s026 on 2019-08-14

## 
## 
## Case 11 : Subject s027 on 2019-07-03

## 
## 
## Case 12 : Subject s028 on 2019-06-11

## 
## 
## Case 13 : Subject s029 on 2019-05-04

## 
## 
## Case 14 : Subject s030 on 2019-06-30

## 
## 
## Case 15 : Subject s031 on 2019-06-15

## 
## 
## Case 16 : Subject s033 on 2019-07-28

## 
## 
## Case 17 : Subject s034 on 2019-07-07

## 
## 
## Case 18 : Subject s034 on 2019-07-29

## 
## 
## Case 19 : Subject s050 on 2019-10-11

## 
## 
## Case 20 : Subject s050 on 2019-11-08

## 
## 
## Case 21 : Subject s050 on 2019-11-09

## 
## 
## Case 22 : Subject s050 on 2019-11-14

## 
## 
## Case 23 : Subject s050 on 2019-11-17

## 
## 
## Case 24 : Subject s050 on 2019-11-18

## 
## 
## Case 25 : Subject s050 on 2019-11-20

## 
## 
## Case 26 : Subject s050 on 2019-11-21

## 
## 
## Case 27 : Subject s050 on 2019-11-22

## 
## 
## Case 28 : Subject s050 on 2019-11-23

## 
## 
## Case 29 : Subject s050 on 2019-11-24

## 
## 
## Case 30 : Subject s052 on 2019-10-12

## 
## 
## Case 31 : Subject s055 on 2019-10-19

## 
## 
## Case 32 : Subject s055 on 2019-11-12

## 
## 
## Case 33 : Subject s055 on 2019-11-19

## 
## 
## Case 34 : Subject s056 on 2019-11-28

## 
## 
## Case 35 : Subject s059 on 2019-11-22

## 
## 
## Case 36 : Subject s065 on 2020-01-27

## 
## 
## Case 37 : Subject s065 on 2020-02-07

## 
## 
## Case 38 : Subject s065 on 2020-02-10

## 
## 
## Case 39 : Subject s065 on 2020-02-25

## 
## 
## Case 40 : Subject s071 on 2020-02-08

## 
## 
## Case 41 : Subject s072 on 2019-12-22

## 
## 
## Case 42 : Subject s072 on 2020-01-04

## 
## 
## Case 43 : Subject s072 on 2020-01-28

## 
## 
## Case 44 : Subject s072 on 2020-02-05

## 
## 
## Case 45 : Subject s074 on 2020-02-10

## 
## 
## Case 46 : Subject s080 on 2020-03-21

## 
## 
## Case 47 : Subject s080 on 2020-04-09

## 
## 
## Case 48 : Subject s081 on 2020-03-08

## 
## 
## Case 49 : Subject s083 on 2020-04-02

## 
## 
## Case 50 : Subject s083 on 2020-04-07

## 
## 
## Case 51 : Subject s085 on 2020-09-21

## 
## 
## Case 52 : Subject s091 on 2020-09-10

## 
## 
## Case 53 : Subject s091 on 2020-09-26

## 
## 
## Case 54 : Subject s092 on 2020-10-14

## 
## 
## Case 55 : Subject s093 on 2020-11-04

## 
## 
## Case 56 : Subject s102 on 2020-11-09

## 
## 
## Case 57 : Subject s103 on 2020-10-27

## 
## 
## Case 58 : Subject s104 on 2020-11-07

## 
## 
## Case 59 : Subject s108 on 2020-12-17

## 
## 
## Case 60 : Subject s109 on 2020-11-13

## 
## 
## Case 61 : Subject s112 on 2020-12-20

## 
## 
## Case 62 : Subject s114 on 2021-03-24

## 
## 
## Case 63 : Subject s115 on 2021-03-12

## 
## 
## Case 64 : Subject s120 on 2021-03-27

## 
## 
## Case 65 : Subject s120 on 2021-04-01

## 
## 
## Case 66 : Subject s120 on 2021-04-03

ORIGINAL vs. COMBINED DATA

Here, we visualize the differences in sleep timing and duration between the original data and those including combined cases.

ORIGINAL TIMING

Here, we visualize the distribution of StartTime (red dots) and TIB (black lines from StartTime to EndTime) in the original data by plotting TIB intervals for each subject (note that more than one TIB is plotted for each subject). We can see that in the original data several cases start after 6 AM and before 6 PM.

In the first plot, all sleep periods detected for a given participant are shown in the same line.

# plotting one line per subject
qplot(data=sleepLog,
  ymin=StartHour,ymax=EndHour,x=ID,geom="linerange") +
    geom_point(aes(y=StartHour),col="red") +
  geom_hline(yintercept = as.POSIXct(paste(substr(Sys.time(),1,10),paste(c("06","18"),"00:00",sep=":")),tz="GMT"))+
  coord_flip() + theme_bw() + theme(panel.grid = element_blank(),axis.text.y=element_text(size=8)) + xlab("") +  ylab("") +
  scale_y_datetime(position="right",labels = function(x) format(x, format = "%H:%M"),
                   breaks = c(as.POSIXct(paste(substr(Sys.time(),1,10),
                                               paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT"),
                              as.POSIXct(paste(substr(Sys.time(),1,10),
                                               paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT")+1*24*60*60)) +
          ggtitle("Original TIBs")
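The reference lines and axis breaks in these plots are built by repeating the same as.POSIXct(paste(...)) expression several times. As a possible simplification, the sketch below wraps that expression in a small helper; `clockTimes`, `refLines`, and `axisBreaks` are hypothetical names that do not appear in the original code.

```r
# hypothetical helper: clock times ("HH:MM") on today's date, optionally shifted by whole days
clockTimes <- function(hhmm, dayOffset = 0, tz = "GMT") {
  today <- substr(Sys.time(), 1, 10) # same reference date used in the plots
  as.POSIXct(paste(today, paste0(hhmm, ":00")), tz = tz) + dayOffset * 24 * 60 * 60
}

# horizontal reference lines at 06:00 and 18:00
refLines <- clockTimes(c("06:00", "18:00"))

# axis breaks on the reference day and on the following day
axisBreaks <- c(clockTimes(c("00:00", "06:00", "11:00", "18:00")),
                clockTimes(c("00:00", "06:00", "11:00", "18:00"), dayOffset = 1))
format(axisBreaks, "%H:%M", tz = "GMT")
```

With such a helper, `refLines` could be passed to geom_hline(yintercept = ...) and `axisBreaks` to scale_y_datetime(breaks = ...) in each of the four timing plots.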

In this second plot, each sleep period is visualized on a different line.

# plotting one line per sleep period
qplot(data=sleepLog,
  ymin=StartHour,ymax=EndHour,x=as.factor(paste(ID,ActivityDate,sep=".")),geom="linerange") +
    geom_point(aes(y=StartHour),col="red") +
  geom_hline(yintercept = as.POSIXct(paste(substr(Sys.time(),1,10),paste(c("06","18"),"00:00",sep=":")),tz="GMT"))+
  coord_flip() + theme_bw() + theme(panel.grid = element_blank(),axis.text.y=element_text(size=8)) + xlab("") +  ylab("") +
  scale_y_datetime(position="right",labels = function(x) format(x, format = "%H:%M"),
                   breaks = c(as.POSIXct(paste(substr(Sys.time(),1,10),
                                               paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT"),
                              as.POSIXct(paste(substr(Sys.time(),1,10),
                                               paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT")+1*24*60*60)) +
          ggtitle("Original TIBs")

COMBINED TIMING

Here, we visualize the distribution of StartTime (red dots) and TIB (black lines from StartTime to EndTime) in the processed data (combined sleep periods are shown in light blue). We can see that now no cases start between 6 AM and 6 PM.

In the first plot, all sleep periods detected for a given participant are shown in the same line.

qplot(data=sleepLog.new,
  ymin=StartHour,ymax=EndHour,x=ID,geom="linerange",size=I(1),col=combined) + geom_point(aes(x = ID,y=StartHour),col="red") +
  geom_hline(yintercept = as.POSIXct(paste(substr(Sys.time(),1,10),paste(c("06","18"),"00:00",sep=":")),tz="GMT"))+
  coord_flip() + theme_bw() + theme(panel.grid = element_blank(),axis.text.y=element_text(size=8)) + xlab("") +  ylab("") +
  scale_color_manual(values=c("black","lightblue")) +
  scale_y_datetime(position="right",labels = function(x) format(x, format = "%H:%M"),
                   breaks = c(as.POSIXct(paste(substr(Sys.time(),1,10),
                                               paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT"),
                              as.POSIXct(paste(substr(Sys.time(),1,10),
                                               paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT")+1*24*60*60)) +
          ggtitle("Combined TIBs")

In this second plot, each sleep period is visualized on a different line.

# plotting one line per sleep period
qplot(data=sleepLog.new,
  ymin=StartHour,ymax=EndHour,x=as.factor(paste(ID,ActivityDate,sep=".")),geom="linerange",size=I(1),col=combined) + 
  geom_point(aes(y=StartHour),col="red") +
  geom_hline(yintercept = as.POSIXct(paste(substr(Sys.time(),1,10),paste(c("06","18"),"00:00",sep=":")),tz="GMT"))+
  coord_flip() + theme_bw() + theme(panel.grid = element_blank(),axis.text.y=element_text(size=8)) + xlab("") +  ylab("") +
  scale_color_manual(values=c("black","lightblue")) +
  scale_y_datetime(position="right",labels = function(x) format(x, format = "%H:%M"),
                   breaks = c(as.POSIXct(paste(substr(Sys.time(),1,10),
                                               paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT"),
                              as.POSIXct(paste(substr(Sys.time(),1,10),
                                               paste(c("00","06","11","18"),"00:00",sep=":")),tz="GMT")+1*24*60*60)) +
          ggtitle("Combined TIBs")

TIB & TST

Here, we visually compare the distribution of TIB and TST between the original (in yellow) and the recoded data (in red).

# TIB
ggplot(sleepLog,aes(TimeInBed/60)) + geom_histogram(fill=rgb(1,1,0,alpha=.5),col="black") + 
  geom_histogram(data=sleepLog.new,fill=rgb(1,0,0,alpha=.5),col="black") + 
  ggtitle("TIB (hours) in the original (yellow) and combined (red) sleep periods")

# TST
ggplot(sleepLog,aes(MinutesAsleep/60)) + geom_histogram(fill=rgb(1,1,0,alpha=.5),col="black") + 
  geom_histogram(data=sleepLog.new,fill=rgb(1,0,0,alpha=.5)) + 
  ggtitle("TST (hours) in the original (yellow) and combined (red) sleep periods")

Comments:

we can notice a decrease in the number of cases with TIB < 5h, which have been combined into longer sleep periods. We also note that our procedure produced some outliers with extremely long TIB (> 11h)
similar results can be observed for TST

2.3.4. Updating ActivityDate

Then, we update the ActivityDate variable so that it indicates the previous day when the StartTime is between 00:00 and 06:00. This makes it easier to distinguish between consecutive nocturnal sleep periods.

# setting times
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h06 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")

# updating ActivityDate
sleepLog.new[sleepLog.new$StartHour >= h00 & sleepLog.new$StartHour <= h06,"ActivityDate"] <-
  sleepLog.new[sleepLog.new$StartHour >= h00 & sleepLog.new$StartHour <= h06,"ActivityDate"] - 1
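To make the rule concrete, here is a minimal sketch on toy data. The data frame `toy`, its values, and the fixed reference date `refDay` are invented for illustration; the comparison is the same as the one used above, with the logical index computed only once.

```r
# hypothetical toy data: StartHour holds the clock time on a common reference date
refDay <- "2000-01-01" # arbitrary reference date (assumption)
toy <- data.frame(
  ActivityDate = as.Date(c("2019-05-02", "2019-05-02", "2019-05-03")),
  StartHour = as.POSIXct(paste(refDay, c("23:15:00", "01:30:00", "05:59:00")), tz = "GMT"))

# setting times
h00 <- as.POSIXct(paste(refDay, "00:00:00"), tz = "GMT")
h06 <- as.POSIXct(paste(refDay, "06:00:00"), tz = "GMT")

# sleep periods starting between 00:00 and 06:00 are assigned to the previous day
afterMidnight <- toy$StartHour >= h00 & toy$StartHour <= h06
toy$ActivityDate[afterMidnight] <- toy$ActivityDate[afterMidnight] - 1
toy$ActivityDate
```

Only the second and third toy cases start after midnight, so only their ActivityDate values are moved back by one day.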

We can use the updated ActivityDate variable to check for duplicate cases with the same ID and ActivityDate value.

# No. of duplicated IDday values before updating ActivityDate
nrow(sleepLog.new[duplicated(sleepLog.new$IDday),]) # 630

## [1] 630

# No. of duplicated IDday values after updating ActivityDate
sleepLog.new$IDday <- as.factor(paste(sleepLog.new$ID,sleepLog.new$ActivityDate,sep="_"))
nrow(sleepLog.new[duplicated(sleepLog.new$IDday),]) # 57

## [1] 57

# showing duplicates
dupl <- sleepLog.new[duplicated(sleepLog.new$IDday),"IDday"]
sleepLog.new[sleepLog.new$IDday%in%levels(as.factor(as.character(dupl))),
             c("ID","LogId","ActivityDate","StartTime","EndTime","SleepDataType")]

Comments:

in 57 cases (1%), there are two observations with the same ID and ActivityDate value
the inspection of these cases suggests that all of them are short early-evening naps (mostly SleepDataType = “classic”) followed by longer nocturnal sleep periods (mostly SleepDataType = “stages”); thus, only the latter are kept for the analyses.

dupl <- sleepLog.new[sleepLog.new$IDday%in%levels(as.factor(as.character(dupl))),]

# shortNaps (mostly "classic")
(shortNaps <- dupl[seq(1,nrow(dupl)-1,2),])

summary(as.numeric(difftime(shortNaps$EndTime,shortNaps$StartTime,units="hours")))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.050   1.450   1.767   1.931   2.267   4.167

# longSleeps (mostly "stages")
(longSleeps <- dupl[seq(2,nrow(dupl),2),])

summary(as.numeric(difftime(longSleeps$EndTime,longSleeps$StartTime,units="hours")))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.133   5.767   6.550   6.701   7.883  10.700

# plotting
par(mfrow=c(2,2))
hist(as.numeric(difftime(shortNaps$EndTime,shortNaps$StartTime,units="hours")),main="TIB shortNaps (hours)",xlab="")
hist(shortNaps$StartHour,main="StartHour shortNaps",breaks=30,col="gray",xlab="")
hist(as.numeric(difftime(longSleeps$EndTime,longSleeps$StartTime,units="hours")),main="TIB longSleeps (hours)",xlab="")
hist(longSleeps$StartHour,main="StartHour longSleeps",breaks=40,col="gray",xlab="")

Here, we remove the 57 cases of early-evening naps recorded before the subsequent nocturnal sleep periods. Thus, now there are no further cases with the same ID and ActivityDate value.

# removing 57 cases of early-evening naps
memory_sleepLog.new <- sleepLog.new
sleepLog.new <- sleepLog.new[!(sleepLog.new$LogId%in%levels(as.factor(as.character(shortNaps$LogId)))),]

# printing info
cat("excluded",nrow(memory_sleepLog.new)-nrow(sleepLog.new),"cases of early-evening naps preceding nocturnal sleep")

## excluded 57 cases of early-evening naps preceding nocturnal sleep
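The selection above relies on row order (odd rows of `dupl` are treated as naps, even rows as nocturnal sleep). As an alternative sketch, not the procedure used in this document, the longest sleep period could be kept for each IDday value regardless of row order. The toy data frame `toy` and the `toy.clean` name are hypothetical.

```r
# hypothetical toy data: two sleep periods share the same IDday value
toy <- data.frame(
  IDday = c("s001_2019-05-01", "s001_2019-05-01", "s002_2019-05-01"),
  LogId = c("a", "b", "c"),
  TimeInBed = c(95, 410, 430), # minutes
  stringsAsFactors = FALSE)

# keeping, for each IDday, the case with the longest TimeInBed
isLongest <- toy$TimeInBed == ave(toy$TimeInBed, toy$IDday, FUN = max)
toy.clean <- toy[isLongest, ]
toy.clean
```

Note that ties in TimeInBed within the same IDday would all be retained by this sketch, so the result should still be checked for residual duplicates.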

2.3.5. Daylight Saving Time

In the San Francisco area, Daylight Saving Time (DST) changes occurred on March 10th (1h forward) and November 3rd, 2019 (1h backward), again on March 8th (1h forward) and November 1st, 2020 (1h backward), and finally on March 14th, 2021 (1h forward). Here, we inspect the distributions of StartHour values in the 5 days preceding and the 5 days following each of these dates, in order to check whether time was automatically updated by the wristband.

# setting DST changing times
DST.changes <- as.Date(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"))

# selecting cases with ActivityDate = DST.changes + or - 5 days
DST <- as.data.frame(matrix(nrow=0,ncol=4))
for(i in 1:length(DST.changes)){
  DST <- rbind(DST,sleepLog.new[difftime(sleepLog.new$ActivityDate,DST.changes[i],units="days")>(-5) &
                          difftime(sleepLog.new$ActivityDate,DST.changes[i],units="days")<5,
                        c("ID","ActivityDate","StartTime","StartHour","EndTime","TimeInBed","Duration")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$ActivityDate%in%DST.changes,"DST"] <- TRUE

# computing time (hours) from midnight
DST$timeFrom00 <- as.POSIXct(paste(lubridate::hour(DST$StartTime), lubridate::minute(DST$StartTime)), format="%H %M") 
DST$timeFrom00 <- as.numeric(difftime(DST$timeFrom00,
                                      as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT"),units="hours"))
# subtracting 1 day to cases with timeFrom00 > 12
DST[DST$timeFrom00>12,"timeFrom00"] <- DST[DST$timeFrom00>12,"timeFrom00"] - 24

# plotting StartTime trends
for(i in 1:length(DST.changes)){
  DSTs <- c(substr(DST.changes[i],1,7),
            paste(substr(DST.changes[i],1,6),as.integer(substr(DST.changes[i],7,7))-1,sep=""))
  print(ggplot(data=DST[substr(DST$ActivityDate,1,7)%in%DSTs,],aes(x=ActivityDate,y=timeFrom00)) + 
          geom_line(aes(colour=ID)) + geom_point(aes(colour=ID),size=3) + ggtitle(DST.changes[i]) +
          geom_vline(xintercept=DST.changes[i]) +
          theme(axis.text.x=element_text(angle=45),legend.position = "none"))}

Comments:

the visual inspection of StartTime trends in those participants who recorded their sleep around DST changes does not suggest systematic shifts coinciding with time changes
the only DST change that shows a substantial shift in StartTime is the first one (2019-03-10), with four participants out of eight showing an upward trend of one to four hours
likewise, comparing the TimeInBed (in minutes) or Duration (in ms) automatically computed by the Fitbit with the difference between EndTime and StartTime does not highlight systematic biases associated with DST changes

# TimeInBed vs. EndTime-StartTime
DST$End_minus_Start <- as.numeric(difftime(DST$EndTime,DST$StartTime,units="mins"))
DST$timeDiff <- DST$TimeInBed - DST$End_minus_Start
DST$timeDiff2 <- DST$Duration/1000/60 - DST$End_minus_Start
DST[,c("StartTime","EndTime","TimeInBed","Duration","End_minus_Start","timeDiff","timeDiff2")]
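Two steps of this section can be checked on toy data: the selection of the days around a DST change and the conversion of StartTime into hours from midnight. Because ActivityDate is a Date, the strict inequalities used above (> -5 and < 5) retain the 4 days before and after each change; the sketch below contrasts this with an inclusive 5-day window. The toy data frame and the `dayDiff`, `inWindow`, and `inWindowStrict` names are hypothetical.

```r
# hypothetical toy data around the DST change of 2019-03-10
toy <- data.frame(
  ActivityDate = as.Date(c("2019-03-04", "2019-03-05", "2019-03-10", "2019-03-15", "2019-03-16")),
  StartTime = as.POSIXct(c("2019-03-04 23:30", "2019-03-06 00:45", "2019-03-10 22:10",
                           "2019-03-16 01:20", "2019-03-16 23:00"), tz = "GMT"))
change <- as.Date("2019-03-10")

# No. of days from the DST change
dayDiff <- as.numeric(toy$ActivityDate - change)
inWindow <- abs(dayDiff) <= 5              # inclusive window: 5 days on each side
inWindowStrict <- dayDiff > -5 & dayDiff < 5 # strict inequalities: 4 days on each side

# hours from midnight, with evening start times expressed as negative values
h <- as.numeric(format(toy$StartTime, "%H")) + as.numeric(format(toy$StartTime, "%M")) / 60
toy$timeFrom00 <- ifelse(h > 12, h - 24, h)
toy
```

Whether the window should include the fifth day on each side is a design choice; the sketch only makes the difference between the two selections explicit.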

2.3.6. Temporal continuity

Finally, we plot the order of sleep periods against StartTime for each participant in order to better inspect the pattern of missing data.

par(mfrow=c(3,3))
for(ID in levels(sleepLog.new$ID)){ 
  plot((1:nrow(sleepLog.new[sleepLog.new$ID==ID,])~sleepLog.new[sleepLog.new$ID==ID,"StartTime"]),main=ID,xlab="",ylab="") }

Comments:

most participants show several clusters of missing data during the period of participation, with participants s038, s041, s048, s052, s063, s064, s086, s089, s090, s105, s109, s119, and s120 showing the longest and most frequent periods of missing data
participants s038 and s089 show very few nonmissing days (13 and six, respectively)
a final cluster of sleep periods recorded several hours/days after the previous ones is observed in a considerable number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s028, s029, s030, s031, s033, s040, s042, s053, s055, s058, s060, s062, s063, s064, and s094), consistent with what was observed at the beginning of section 2.3.
these cases will be better discussed in the data cleaning section.
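The clusters of missing data described above can also be quantified rather than only inspected visually. The following sketch, on hypothetical toy data, counts for each participant the number of missing days and the longest run of consecutive missing days between recorded sleep periods; `missingRuns` and its column names are illustrative and do not come from the original code.

```r
# hypothetical toy data: one row per recorded sleep period
toy <- data.frame(
  ID = c("s001", "s001", "s001", "s001", "s002", "s002"),
  ActivityDate = as.Date(c("2019-05-01", "2019-05-02", "2019-05-06", "2019-05-07",
                           "2019-05-01", "2019-05-02")))

missingRuns <- do.call(rbind, lapply(split(toy, toy$ID), function(d) {
  days <- sort(unique(d$ActivityDate))
  gaps <- as.numeric(diff(days)) - 1 # missing days between consecutive recorded days
  data.frame(ID = d$ID[1], nDays = length(days),
             nMissing = sum(gaps), longestRun = max(c(0, gaps)))
}))
missingRuns
```

In the toy data, s001 has three missing days in a single run (May 3rd to 5th), whereas s002 has none.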

2.3.7. Saving dataset

Here, we update and save the recoded sleepLog dataset with the 4,921 included cases.

# updating and saving sleepLog dataset with combined TIBs
sleepLog_noncomb <- sleepLog
save(sleepLog_noncomb,file="DATA/datasets/sleepLog_nonComb.RData") # saving original dataset
sleepLog <- sleepLog.new 
save(sleepLog,file="DATA/datasets/sleepLog_combined.RData") # saving combined dataset

2.4. sleepEBE

sleepEBE data exported from Fitabase consist of 30-sec epoch-by-epoch categorization of sleep periods as either ‘wake’ or ‘light,’ ‘deep,’ or ‘REM’ sleep. Specifically, we focus on the SleepStage data column (i.e., the one accounting for short detected awakenings/arousals to adjust for wake misdetection).

Here, we recode the Time variable (i.e., the “Date and time within a defined sleep period in mm/dd/yy hh:mm:ss format”), based on which the ActivityDate variable is computed. To change the resolution from minutes (as exported from Fitabase) to seconds, we use the add30 argument of the timeCheck function, which adds 30 seconds to every other epoch within the same LogId. Moreover, the day.withinNight argument is used to keep the same ActivityDate value for those epochs recorded before and after midnight within the same LogId.

p <- timeCheck(data=sleepEBE,ID="ID",day="ActivityDate",hour="Time",
               input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M",
               add30=TRUE,LogId="LogId",day.withinNight=TRUE)

## 4243037 observations in 4049 days from 93 participants:
## 
## - mean No. of days/participant = 43.54  SD = 12.55  min = 4  max = 62 
## - mean data collection duration (days) = 73.82 - SD = 36.14  min = 5  max = 331 
## 
## - mean No. of missing days per participant = 30.28  SD = 36.17  min = 1  max = 278 
## - mean No. of consecutive missing days per participant = 15.3  SD = 32.61  min = 2  max = 268
## 
## - No. of duplicated cases by hour (same ID and hour) = 0

Comments:

the ActivityDate variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 40-60 nonmissing days of data, with only one participant showing fewer than 10 days, and six more participants showing fewer than 20 days
no cases have the same participant’s identifier and temporal coordinate
the (substantial) No. of missing days and consecutive missing days is in line with that shown for sleepLog data
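The timeCheck function is not reproduced in this section. As a rough sketch of what the add30 and day.withinNight arguments are described as doing, the example below assumes that every second epoch within a LogId is shifted by 30 seconds and that all epochs of a LogId receive the date of its first epoch. The toy data frame `ebe` and the intermediate names are hypothetical, and the actual implementation may differ.

```r
# hypothetical toy data: two 30-sec epochs share each minute-resolution timestamp
ebe <- data.frame(
  LogId = c("A", "A", "A", "A", "B", "B"),
  Time = as.POSIXct(c("2019-05-01 23:59", "2019-05-01 23:59",
                      "2019-05-02 00:00", "2019-05-02 00:00",
                      "2019-05-02 22:30", "2019-05-02 22:30"), tz = "GMT"))

# adding 30 sec to every second epoch within each LogId (assumed behavior of add30)
pos <- ave(seq_len(nrow(ebe)), ebe$LogId, FUN = seq_along)
ebe$Time <- ebe$Time + ifelse(pos %% 2 == 0, 30, 0)

# assigning to all epochs the date of the first epoch of their LogId
# (assumed behavior of day.withinNight)
firstTime <- ave(as.numeric(ebe$Time), ebe$LogId, FUN = min)
ebe$ActivityDate <- as.Date(as.POSIXct(firstTime, origin = "1970-01-01", tz = "GMT"), tz = "GMT")
ebe
```

In the toy data, the epochs of LogId A recorded after midnight keep the ActivityDate of the evening in which the sleep period started.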

2.4.1. LogId

As in the sleepLog data, sleepEBE data are associated with specific LogId identifying separate sleep periods for each participant.

# LogId as factor
sleepEBE <- p # dataset processed with timeCheck
sleepEBE$LogId <- as.factor(sleepEBE$LogId)

# sanity check: no cases with double logId and the same day
sleepEBE$dayLog <- as.factor(paste(sleepEBE$LogId,sleepEBE$ActivityDate,sep="_"))
dayLog <- sleepEBE[!duplicated(sleepEBE$dayLog),]
cat("sanity check:",length(which(summary(dayLog$LogId)==2))==0)

## sanity check: TRUE

We can notice that the number of LogIds in sleepEBE data (N = 4,573) is lower than that shown by sleepLog data (N = 5,402). This difference is partially accounted for by cases with SleepDataType = “classic” (N = 857), which are not included in sleepEBE data, whereas some cases (N = 41) are only included in sleepEBE data.

# sleepLog (870 sleepLog only)
data.frame(NsleepLog=nrow(sleepLog_noncomb),NsleepLog.Stages=nrow(sleepLog_noncomb[sleepLog_noncomb$SleepDataType!="classic",]),
           NsleepLog.Classic=nrow(sleepLog_noncomb[sleepLog_noncomb$SleepDataType=="classic",]),
           NsleepLog_NOsleepEBE=length(levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(sleepEBE$LogId)]))

# sleepEBE (41 sleepEBE only)
stagesLogs <- levels(as.factor(as.character(sleepLog_noncomb[sleepLog_noncomb$SleepDataType=="stages","LogId"])))
classicLogs <- levels(as.factor(as.character(sleepLog_noncomb[sleepLog_noncomb$SleepDataType=="classic","LogId"])))
data.frame(NsleepEBE=nlevels(sleepEBE$LogId),
           NsleepEBE_INsleepLog=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]),
           NsleepEBE_INsleepLog.Stages=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%stagesLogs]),
           NsleepEBE_INsleepLog.Classic=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%classicLogs]),
           NsleepEBE_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]))

Comments:

note that the only LogId value that is included in both sleepEBE and sleepLog “classic” is the same duplicate case highlighted in section 2.3.1. Consistent with the retained case in sleepLog data, this case shows a TIB of 11.3h.

# duration (hours) of the single case included in both sleepEBE and sleepLog classic LogIds
nrow(sleepEBE[sleepEBE$LogId=="24433907842",])/2/60

## [1] 11.30833
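The LogId comparisons in this section are written with nested levels() and %in% calls. The same counts can be obtained with base R set operations, as in the sketch below; the LogId values are invented for illustration.

```r
# hypothetical LogId values
logIds_sleepLog <- c("101", "102", "103", "104")
logIds_sleepEBE <- c("102", "103", "105")

onlySleepLog <- setdiff(logIds_sleepLog, logIds_sleepEBE) # in sleepLog but not in sleepEBE
onlySleepEBE <- setdiff(logIds_sleepEBE, logIds_sleepLog) # in sleepEBE but not in sleepLog
inBoth <- intersect(logIds_sleepLog, logIds_sleepEBE)     # in both datasets

c(onlySleepLog = length(onlySleepLog), onlySleepEBE = length(onlySleepEBE),
  inBoth = length(inBoth))
```

With the real data, levels(sleepLog_noncomb$LogId) and levels(sleepEBE$LogId) would take the place of the two toy vectors.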

2.4.2. Daylight Saving Time

Here, we inspect EBE Time values in those epochs immediately preceding or following the DST changes highlighted in section 2.3.5. Note that DST changes in the San Francisco area always occurred at 2 AM.

# setting DST changing times
DST.changes <- as.POSIXct(paste(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"),
                                "02:00:00",sep=" "))

# selecting epochs with Time within 1 minute before or after DST.changes
DST <- as.data.frame(matrix(nrow=0,ncol=2))
for(i in 1:length(DST.changes)){
  DST <- rbind(DST,sleepEBE[difftime(sleepEBE$Time,DST.changes[i],units="mins")>(-1) &
                              difftime(sleepEBE$Time,DST.changes[i],units="mins")<1,c("ID","Time")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$Time%in%DST.changes,"DST"] <- TRUE
DST[order(DST$ID,DST$Time),]

Comments:

the inspection of the temporal coordinates of the epochs immediately preceding or following DST changes suggests that time is not updated by the Fitbit device during DST changes, consistent with what was concluded for sleepLog data in section 2.3.5
indeed, there are no 1-hour ‘holes’ in the sleepEBE data corresponding to DST changes; instead, Time advances continuously by 30 sec from one epoch to the next
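For comparison, the sketch below shows what a 1-hour ‘hole’ would look like if timestamps were expressed in local wall-clock time. It assumes that the time zone name "America/Los_Angeles" (which covers the San Francisco area) is available in the local time zone database; the `epochs`, `localClock`, and `gmtClock` names are hypothetical.

```r
# three epochs around the DST change of 2020-03-08 (2 AM local time)
t0 <- as.POSIXct("2020-03-08 01:59:00", tz = "America/Los_Angeles")
epochs <- t0 + c(0, 30, 60) # 30-sec steps in elapsed time

# wall-clock times in local time: the clock jumps from 01:59:30 to 03:00:00
localClock <- format(epochs, "%H:%M:%S", tz = "America/Los_Angeles")

# the same instants expressed in GMT: no jump
gmtClock <- format(epochs, "%H:%M:%S", tz = "GMT")

data.frame(localClock, gmtClock)
```

Timestamps that advance by 30 sec across the change, as observed in the sleepEBE data, therefore behave like the GMT column rather than the local one.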

2.4.3. Temporal continuity

The temporal continuity discussed above can also be inspected throughout the whole sleepEBE dataset, by counting the cases in which consecutive epochs within the same LogId have Time values differing by more than 60 sec (i.e., the expected 30 sec plus a further 30 sec to allow for time rounding). This is done with the checkTimeContinuity function.


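checkTimeContinuity <- function(data,temporalDiff,doPlot=FALSE){ 
  # data = epoch-by-epoch dataset with the ID, LogId, and Time columns (sorted by Time within LogId)
  # temporalDiff = max No. of seconds allowed between consecutive epochs
  nHoles <- 0 # No. of LogIds with at least one 'hole' longer than temporalDiff
  for(LOG in levels(as.factor(as.character(data$LogId)))){
    LogData <- data[data$LogId==LOG,c("ID","Time")]
    # time difference (sec) between each epoch and the preceding one
    LogData$Time.LAG <- dplyr::lag(LogData$Time, n = 1, default = NA)
    LogData$diffTime <- as.numeric(difftime(LogData$Time,LogData$Time.LAG,units="secs"))
    LogData$IDtime <- as.factor(paste(LogData$ID,LogData$Time))
    diffs <- na.omit(LogData[LogData$diffTime>temporalDiff,])
    if(nrow(diffs)>0){
      # printing the epochs recorded within 2 min before or after the identified 'hole'
      diffs <- LogData[LogData$ID%in%levels(as.factor(as.character(diffs$ID))) &
                         LogData$Time>=na.omit(LogData[LogData$diffTime>temporalDiff,"Time"])-120 &
                         LogData$Time<=na.omit(LogData[LogData$diffTime>temporalDiff,"Time"])+120,]
      print(diffs)
      nHoles <- nHoles + 1 }}
  # note: the message always reports "60 secs", regardless of the temporalDiff value
  cat(nHoles,"consecutive epochs separated by more than 60 secs")
  
  # optionally plotting epoch order against Time for each participant
  if(doPlot==TRUE){
    par(mfrow=c(3,3))
    for(ID in levels(data$ID)){ plot((1:nrow(data[data$ID==ID,])~data[data$ID==ID,"Time"]),main=ID) }}
  }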

# sorting sleepEBE by ID, ActivityDate, and Time
sleepEBE <- sleepEBE[order(sleepEBE$ID,sleepEBE$ActivityDate,sleepEBE$Time),]

# checking temporal continuity setting temporalDiff to 60 sec
checkTimeContinuity(data=sleepEBE,temporalDiff=60)

##           ID                Time            Time.LAG diffTime
## 2714657 s081 2020-03-08 09:00:00 2020-03-08 07:59:30     3630
## 2714658 s081 2020-03-08 09:00:30 2020-03-08 09:00:00       30
## 2714659 s081 2020-03-08 09:01:00 2020-03-08 09:00:30       30
## 2714660 s081 2020-03-08 09:01:30 2020-03-08 09:01:00       30
## 2714661 s081 2020-03-08 09:02:00 2020-03-08 09:01:30       30
##                           IDtime
## 2714657 s081 2020-03-08 09:00:00
## 2714658 s081 2020-03-08 09:00:30
## 2714659 s081 2020-03-08 09:01:00
## 2714660 s081 2020-03-08 09:01:30
## 2714661 s081 2020-03-08 09:02:00
## 1 consecutive epochs separated by more than 60 secs

# one case with diffTime > 60 sec
sleepEBE[which(rownames(sleepEBE)%in%as.character(2714654:2714660)),c("ID","LogId","ActivityDate","Time")]

Comments:

only in one case (LogId = 26201445747) there are two consecutive epochs separated by more than 60 seconds, namely 1 hour, and this case is observed precisely on March 8th, 2020 (DST change)
nevertheless, no other ‘holes’ are observed in correspondence with DST changes, consistent with our conclusions in the section above
none of the other cases show time shifts of more than 60 secs
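Looping over several thousand LogId values can be slow on a dataset with millions of epochs. As an alternative sketch (not the function used in this document), the same kind of check can be vectorized with ave(); `findGaps` is a hypothetical name, and the toy data only mimic the structure of sleepEBE.

```r
# hypothetical vectorized check: returns the epochs preceded by a gap longer than temporalDiff
findGaps <- function(data, temporalDiff) {
  data <- data[order(data$LogId, data$Time), ]
  secs <- as.numeric(data$Time)
  # time (sec) of the preceding epoch within the same LogId (NA for the first epoch)
  prevSecs <- ave(secs, data$LogId, FUN = function(x) c(NA, head(x, -1)))
  data$diffTime <- secs - prevSecs
  data[!is.na(data$diffTime) & data$diffTime > temporalDiff, ]
}

# toy data: LogId A has a 1-hour gap before its third epoch, LogId B is continuous
toy <- data.frame(
  LogId = c("A", "A", "A", "B", "B"),
  Time = as.POSIXct("2020-03-08 07:59:00", tz = "GMT") + c(0, 30, 3660, 0, 30))
gaps <- findGaps(toy, temporalDiff = 60)
gaps
```

Unlike checkTimeContinuity, this sketch returns one row per gap rather than printing the neighboring epochs, so the number of affected LogIds would be obtained with length(unique(gaps$LogId)).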

In a number of cases (28%), there is a time shift of 60 secs between the last and the preceding epoch. Here, we correct these cases by subtracting 30 seconds from the last epoch.

# counting and correcting cases with 60 secs between the last and the preceding epoch
n60 <- 0
for(LOG in levels(as.factor(as.character(sleepEBE$LogId)))){
  LogData <- sleepEBE[sleepEBE$LogId==LOG,c("ID","Time")]
  if(as.numeric(difftime(tail(LogData$Time,1),tail(LogData$Time,2)[1],units="secs"))==60){
    n60 <- n60 + 1
    sleepEBE[sleepEBE$LogId==LOG,"Time"][nrow(LogData)] <- sleepEBE[sleepEBE$LogId==LOG,"Time"][nrow(LogData)] - 30 }}
n60 # No. of corrected cases

## [1] 1289
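The same correction can be sketched without an explicit loop. The example below assumes that rows are sorted by Time within each LogId; `fixLastEpoch` and the toy data are hypothetical and are only meant to illustrate the logic of the step above.

```r
# hypothetical vectorized correction of the last epoch of each LogId
fixLastEpoch <- function(data) {
  secs <- as.numeric(data$Time)
  # time (sec) of the preceding epoch within the same LogId
  prevSecs <- ave(secs, data$LogId, FUN = function(x) c(NA, head(x, -1)))
  isLast <- !duplicated(data$LogId, fromLast = TRUE) # last epoch of each LogId
  toShift <- isLast & !is.na(prevSecs) & (secs - prevSecs) == 60
  data$Time[toShift] <- data$Time[toShift] - 30 # subtracting 30 sec
  list(data = data, n60 = sum(toShift))
}

# toy data: the last epoch of LogId A is 60 sec after the preceding one
toy <- data.frame(
  LogId = c("A", "A", "A", "B", "B", "B"),
  Time = as.POSIXct("2019-05-01 23:00:00", tz = "GMT") + c(0, 30, 90, 0, 30, 60))
res <- fixLastEpoch(toy)
res$n60
```

After the correction, all consecutive epochs of LogId A are 30 sec apart, and LogId B is left unchanged.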

# re-checking temporal continuity setting temporalDiff to 30 sec
checkTimeContinuity(data=sleepEBE,temporalDiff=30)

##           ID                Time            Time.LAG diffTime
## 2714657 s081 2020-03-08 09:00:00 2020-03-08 07:59:30     3630
## 2714658 s081 2020-03-08 09:00:30 2020-03-08 09:00:00       30
## 2714659 s081 2020-03-08 09:01:00 2020-03-08 09:00:30       30
## 2714660 s081 2020-03-08 09:01:30 2020-03-08 09:01:00       30
## 2714661 s081 2020-03-08 09:02:00 2020-03-08 09:01:30       30
##                           IDtime
## 2714657 s081 2020-03-08 09:00:00
## 2714658 s081 2020-03-08 09:00:30
## 2714659 s081 2020-03-08 09:01:00
## 2714660 s081 2020-03-08 09:01:30
## 2714661 s081 2020-03-08 09:02:00
## 1 consecutive epochs separated by more than 60 secs

Comments:

all cases were successfully corrected
now, no LogId includes pairs of consecutive epochs differing by more than 30 sec, with the only exception of LogId 26201445747 (see section 2.5.3); note that the printed message still mentions 60 secs because that text is hard-coded in checkTimeContinuity

Finally, we plot epoch order against Time for each participant in order to better inspect the pattern of missing data.

par(mfrow=c(3,3))
for(ID in levels(sleepEBE$ID)){ 
  plot((1:nrow(sleepEBE[sleepEBE$ID==ID,])~sleepEBE[sleepEBE$ID==ID,"Time"]),main=ID,xlab="",ylab="",cex=0.5) }

Comments:

most participants show several clusters of missing data during the period of participation, with participants s038, s040, s041, s048, s052, s063, s064, s086, s089, s090, s105, s109, s119, and s120 showing the longest and most frequent periods of missing data, partially consistent with what was reported in section 2.3.6 for sleepLog data
participants s038 and s089 show very few nonmissing days (13 and six, respectively)
a final cluster of epochs recorded several hours/days after the previous ones is observed in a considerable number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s028, s029, s030, s031, s033, s042, s053, s055, s058, s062, s063, s064, s074, and s094), partially consistent with what was reported in section 2.3.6 for sleepLog data
these cases will be better discussed in the data cleaning section.

2.4.4. Saving dataset

Here, we save the processed sleepEBE dataset to be used in the following steps.

save(sleepEBE,file="DATA/datasets/sleepEBE_timeProcessed.RData")

2.5. classicEBE

classicEBE data exported from Fitabase consist of 60-sec epoch-by-epoch categorization of sleep periods as either ‘wake’ or ‘sleep.’ classicEBE data are processed to integrate sleepEBE with those cases (currently not included) with SleepDataType = “classic”.

Here, we recode the date variable, here renamed as Time (i.e., the “Date and minute of that day within a defined sleep period in mm/dd/yy hh:mm:ss format”), based on which the ActivityDate variable is computed.

colnames(classicEBE)[which(colnames(classicEBE)=="date")] <- "Time"
colnames(classicEBE)[colnames(classicEBE)=="logId"] <- "LogId"
p <- timeCheck(data=classicEBE,ID="ID",day="ActivityDate",hour="Time",
               input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M",
               day.withinNight=TRUE,LogId="LogId")

## 2505034 observations in 4663 days from 93 participants:
## 
## - mean No. of days/participant = 50.14  SD = 11.95  min = 6  max = 86 
## - mean data collection duration (days) = 77.67 - SD = 38.65  min = 7  max = 331 
## 
## - mean No. of missing days per participant = 27.53  SD = 41.14  min = 0  max = 275 
## - mean No. of consecutive missing days per participant = 15.84  SD = 37.03  min = 0  max = 268
## 
## - No. of duplicated cases by hour (same ID and hour) = 0

Comments:

the ActivityDate variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 55-65 nonmissing days of data, consistent with sleepEBE and sleepLog data
no cases have the same participant’s identifier and temporal coordinate
the (substantial) No. of missing days is slightly lower than that shown by sleepEBE and sleepLog data, whereas the No. of consecutive missing days is similar across the three datasets

2.5.1. LogId

As in the sleepLog data, classicEBE data are associated with specific LogId identifying separate sleep periods for each participant.

# LogId as factor
classicEBE <- p # dataset processed with timeCheck
classicEBE$LogId <- as.factor(classicEBE$LogId)

# sanity check: no cases with double logId and the same day
classicEBE$dayLog <- as.factor(paste(classicEBE$LogId,classicEBE$ActivityDate,sep="_"))
dayLog <- classicEBE[!duplicated(classicEBE$dayLog),]
cat("sanity check:",length(which(summary(dayLog$LogId)==2))==0)

## sanity check: TRUE

We can notice that the number of LogIds in classicEBE data (N = 5,759) is higher than that shown by both sleepLog data (N = 5,402) and sleepEBE data (N = 4,573).

# sleepLog (1 sleepLog only)
data.frame(NsleepLog=nrow(sleepLog_noncomb),NStages=nrow(sleepLog_noncomb[sleepLog_noncomb$SleepDataType!="classic",]),
           NsClassic=nrow(sleepLog_noncomb[sleepLog_noncomb$SleepDataType=="classic",]),
           NsleepLog_NOsleepEBE=length(levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(sleepEBE$LogId)]),
           NsleepLog_NOclassicEBE=length(levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(classicEBE$LogId)]),
           NsleepLog_NOstageORclassic=length(levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(sleepEBE$LogId)
                                                                            & !levels(sleepLog_noncomb$LogId
                                                                                      )%in%levels(classicEBE$LogId)]))

# sleepEBE (41 sleepEBE only)
data.frame(Nstages=nlevels(sleepEBE$LogId),
           Nstages_INsleepLog=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]),
           Nstages_INsleepLog.Stages=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%stagesLogs]),
           Nstages_INsleepLog.Classic=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%classicLogs]),
           Nstages_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]))

# classicEBE (377 classicEBE only)
data.frame(Nclassic=nlevels(classicEBE$LogId),
           Nclassic_INsleepLog=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]),
           Nclassic_INsleepLog.Stages=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%stagesLogs]),
           Nclassic_INsleepLog.Classic=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%classicLogs]),
           NclassicE_NOsleepLog=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]))

# classicEBE vs. sleepEBE 
data.frame(Nstages_INclassic=length(levels(sleepEBE$LogId)[levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
           Nstages_NOclassic=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
           Nclassic_INsleepEBE=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]),
           Nclassic_NOstages=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))

# only sleepEBE and/or classicEBE
data.frame(NstagesANDclassic_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
           Nstages_NOsleepLogORclassic=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
           Nclassic_NOsleepLogORstages=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]),
           NclassicANDsleepLogNostages=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))

Comments:

only one case is included in sleepLog but not in sleepEBE or classicEBE data (N = 1)
41 cases are included in sleepEBE but not in sleepLog
336 cases are only included in classicEBE but not in sleepEBE or sleepLog
869 cases are included in classicEBE and sleepLog but not in sleepEBE
in contrast, no cases are only included in sleepEBE or in both sleepEBE and classicEBE but not in sleepLog
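The overlap counts above can also be read from a single membership table. The following is a minimal sketch on toy LogId vectors (the values and the object names `ids_*`, `membership`, and `overlap` are hypothetical and not part of the original pipeline); it only illustrates the set logic, not the study results.

```r
# Toy LogId sets (hypothetical values, not the study data)
ids_sleepLog   <- c("a", "b", "c", "d")
ids_sleepEBE   <- c("a", "b", "e")
ids_classicEBE <- c("a", "c", "e", "f")

# membership table: one row per LogId, one logical column per dataset
all_ids <- sort(unique(c(ids_sleepLog, ids_sleepEBE, ids_classicEBE)))
membership <- data.frame(LogId      = all_ids,
                         sleepLog   = all_ids %in% ids_sleepLog,
                         sleepEBE   = all_ids %in% ids_sleepEBE,
                         classicEBE = all_ids %in% ids_classicEBE)

# counting each observed combination of presence/absence across datasets
overlap <- as.data.frame(table(membership[, -1]))
overlap <- overlap[overlap$Freq > 0, ]
overlap

# LogIds found only in sleepLog (the analogue of uniqueLogId)
only_sleepLog <- setdiff(ids_sleepLog, union(ids_sleepEBE, ids_classicEBE))
only_sleepLog
```

With real data, the three `ids_*` vectors would presumably be replaced by the levels of the corresponding LogId factors.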

2.5.2. Summary of sleep data

From the above, we can identify four main groups of cases which are included in some datasets but not in the others:

a) uniqueLogId

Those cases only included in sleepLog but not in sleepEBE or classicEBE (N = 1) will be removed from the analyses (see data cleaning)

# identifying and showing 1 case only included in sleepLog data
uniqueLogId <- levels(sleepLog_noncomb$LogId)[!levels(sleepLog_noncomb$LogId)%in%levels(sleepEBE$LogId) &
                                                  !levels(sleepLog_noncomb$LogId)%in%levels(classicEBE$LogId)]
length(uniqueLogId)

## [1] 1

sleepLog[sleepLog$LogId%in%uniqueLogId,c("ID","LogId","ActivityDate","StartTime","EndTime","Duration")]

b) uniqueEBElogs

Those cases only included in sleepEBE but not in sleepLog (N = 41) will be processed separately based on the number of epochs included in sleepEBE. Note that these are all cases of nocturnal sleep with TIB between 5.7 and 11.25 hours.

# summarizing TIB and StartTime of 41 cases only included in sleepEBE data
uniqueEBElogs <- levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)]
length(uniqueEBElogs)

## [1] 41

n <- nrow(sleepEBE[sleepEBE$LogId==uniqueEBElogs[1],])
start <- head(sleepEBE[sleepEBE$LogId==uniqueEBElogs[1],"Time"],1)
for(i in 2:length(uniqueEBElogs)){ 
  n <- c(n,nrow(sleepEBE[sleepEBE$LogId==uniqueEBElogs[i],]))
  start <- c(start,head(sleepEBE[sleepEBE$LogId==uniqueEBElogs[i],"Time"],1))}
StartHour <- as.POSIXct(paste(lubridate::hour(start),lubridate::minute(start)),format="%H %M",tz="GMT")

# plotting
par(mfrow=c(2,1))
hist(n/2/60,breaks=35,col="black",main="TIB (hours) in uniqueEBElogs")
hist(StartHour,breaks=35,col="black",xlab="",main="StartTime in uniqueEBElogs")

c) uniqueClassiclogs

Those cases only included in classicEBE but not in sleepLog or sleepEBE (N = 336) will also be processed separately, based on the number of epochs included in classicEBE. Most of these cases seem to be cases of nocturnal sleep, with TIB from 2 to 13h.

# summarizing TIB and StartTime of 336 cases only included in classicEBE
uniqueClassiclogs <- levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId) &
                                                !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]
length(uniqueClassiclogs)

## [1] 336

n <- nrow(classicEBE[classicEBE$LogId==uniqueClassiclogs[1],])
start <- head(classicEBE[classicEBE$LogId==uniqueClassiclogs[1],"Time"],1)
for(i in 2:length(uniqueClassiclogs)){ 
  n <- c(n,nrow(classicEBE[classicEBE$LogId==uniqueClassiclogs[i],]))
  start <- c(start,head(classicEBE[classicEBE$LogId==uniqueClassiclogs[i],"Time"],1))}
StartHour <- as.POSIXct(paste(lubridate::hour(start),lubridate::minute(start)),format="%H %M",tz="GMT") 

par(mfrow=c(2,1))
hist(n/60,breaks=35,col="black",main="TIB (hours) in uniqueClassiclogs")
hist(StartHour,breaks=35,col="black",xlab="",main="StartTime in uniqueClassiclogs")
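The per-LogId loops used above for uniqueEBElogs and uniqueClassiclogs grow the `n` and `start` vectors one element at a time. As a hedged alternative, the sketch below derives the same two quantities (No. of epochs and first timestamp per LogId) without a loop. The toy data frame `ebe`, the `target` vector, and `summaryTIB` are hypothetical names used only for illustration.

```r
# Toy epoch-by-epoch data (hypothetical values): 60-sec epochs for three LogIds
ebe <- data.frame(
  LogId = factor(rep(c("101", "102", "103"), times = c(4, 2, 3))),
  Time  = as.POSIXct("2019-03-01 22:00:00", tz = "GMT") +
    60 * c(0:3, 600:601, 1500:1502))
target <- c("101", "103") # LogIds of interest (analogue of uniqueClassiclogs)

# keeping target LogIds and sorting so that the first row per LogId is its first epoch
sel <- ebe[ebe$LogId %in% target, ]
sel <- sel[order(sel$LogId, sel$Time), ]
first <- sel[!duplicated(sel$LogId), ]

# one row per LogId: start time, No. of epochs, and TIB in hours (60-sec epochs)
summaryTIB <- data.frame(
  LogId   = as.character(first$LogId),
  start   = first$Time,
  nEpochs = as.vector(table(as.character(sel$LogId))[as.character(first$LogId)]),
  stringsAsFactors = FALSE)
summaryTIB$TIB_hours <- summaryTIB$nEpochs / 60
summaryTIB
```

For 30-sec epochs (as in sleepEBE), the last step would divide by 120 instead of 60, matching the `n/2/60` used above.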

d) ClassicAndSleepLog

Those cases included in both classicEBE and sleepLog but not in sleepEBE (N = 869) will also be processed separately, based on the number of epochs included in classicEBE. Only a minority of these cases seem to be cases of nocturnal sleep, with TIB from 5 to 13h, whereas most cases are naps (TIB < 5h).

# summarizing TIB and StartTime of 869 cases only included in classicEBE and sleepLog
ClassicAndSleepLog <- levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId) &
                                                !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]
length(ClassicAndSleepLog)

## [1] 869

par(mfrow=c(2,1))
hist(sleepLog_noncomb[sleepLog_noncomb$LogId%in%ClassicAndSleepLog,"TimeInBed"]/60,breaks=35,col="black",
     main="TIB (hours) in ClassicAndSleepLog")
hist(sleepLog_noncomb[sleepLog_noncomb$LogId%in%ClassicAndSleepLog,"StartHour"],breaks=35,col="black",xlab="",
     main="StartTime in ClassicAndSleepLog")

2.5.3. Daylight Saving Time

# setting DST changing times
DST.changes <- as.POSIXct(paste(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"),
                                "02:00:00",sep=" "),tz="GMT")

# selecting cases with ActivityDate = DST.changes + or - 2 minutes
DST <- as.data.frame(matrix(nrow=0,ncol=2))
for(i in 1:length(DST.changes)){
  DST <- rbind(DST,classicEBE[difftime(classicEBE$Time,DST.changes[i],units="mins")>(-2) &
                              difftime(classicEBE$Time,DST.changes[i],units="mins")<2,c("ID","Time","LogId")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$Time%in%DST.changes,"DST"] <- TRUE
DST[order(DST$ID,DST$Time),]

Comments:

the inspection of the temporal coordinates of the epochs immediately preceding or following DST changes suggests that time is not updated by the Fitbit device during DST changes, consistent with what was concluded for sleepLog data in section 2.3.5
indeed, there are no 1-hour ‘holes’ in the classicEBE data corresponding to DST changes, but the Time is continuously updated by adding 60 sec from one epoch to the following one
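To make the reasoning explicit, the following toy sketch contrasts what a series of 60-sec epochs would look like around a March DST change if the device ignored the change versus if it moved the clock 1h forward. The vectors `t_fixed` and `t_shift` are simulated, hypothetical values, not study data.

```r
dst <- as.POSIXct("2019-03-10 02:00:00", tz = "GMT")

# device ignoring DST: 60-sec steps straight through 02:00
t_fixed <- seq(dst - 120, dst + 120, by = 60)

# device springing forward: the epoch after 01:59 is stamped 03:00
t_shift <- c(seq(dst - 120, dst - 60, by = 60),
             seq(dst + 3600, dst + 3720, by = 60))

# largest step between consecutive epochs (secs)
max(diff(as.numeric(t_fixed))) # 60
max(diff(as.numeric(t_shift))) # 3660

# No. of epochs falling within 2 min of the DST change (same window as above)
inWindow <- function(x) sum(abs(as.numeric(difftime(x, dst, units = "mins"))) < 2)
inWindow(t_fixed) # 3 (01:59, 02:00, 02:01)
inWindow(t_shift) # 1 (01:59 only)
```

Under this reading, finding epochs stamped exactly at 02:00 with 60-sec steps on both sides is what supports the conclusion that time was not updated.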

2.5.4. Temporal continuity

# sorting classicEBE by ID, ActivityDate, and Time
classicEBE <- classicEBE[order(classicEBE$ID,classicEBE$ActivityDate,classicEBE$Time),]

# checking temporal continuity
checkTimeContinuity(data=classicEBE,temporalDiff=60)

## 0 consecutive epochs separated by more than 60 secs

Comments:

none of the cases shows time shifts longer than 60 secs, including case LogId = 26201445747, which showed a shift of 1 hour between 7:59 and 9:00 in sleepEBE data
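The checkTimeContinuity function is defined elsewhere and its implementation is not shown in this section. As a rough illustration of the kind of check involved, the sketch below defines a hypothetical helper, `countTimeGaps`, that counts consecutive epochs separated by more than a given No. of seconds within each LogId. Grouping by LogId is an assumption made here; the original function may group differently.

```r
# Hypothetical helper (NOT the original checkTimeContinuity): counts pairs of
# consecutive epochs, within the same LogId, separated by more than maxGap secs
countTimeGaps <- function(data, maxGap = 60, id = "LogId", time = "Time") {
  data <- data[order(data[[id]], data[[time]]), ]
  gaps <- tapply(as.numeric(data[[time]]), as.character(data[[id]]),
                 function(x) sum(diff(x) > maxGap))
  sum(gaps)
}

# toy data: LogId "A" jumps from 07:59 to 09:00, LogId "B" is continuous
toy <- data.frame(
  LogId = rep(c("A", "B"), times = c(4, 2)),
  Time  = as.POSIXct("2019-05-01 07:57:00", tz = "GMT") +
    c(0, 60, 120, 3780, 0, 60))

countTimeGaps(toy)                # 1 gap longer than 60 secs
countTimeGaps(toy, maxGap = 3660) # 0 gaps longer than 3660 secs
```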

Here, we better inspect sleepLog, sleepEBE, and classicEBE times for this case.

# computing and showing sleep times for LogId 26201445747
times <- data.frame(dataType=c("sleepLog","sleepEBE","classicEBE"),
           start=c(sleepLog[sleepLog$LogId=="26201445747","StartTime"],head(sleepEBE[sleepEBE$LogId=="26201445747","Time"],1),
                   head(classicEBE[classicEBE$LogId=="26201445747","Time"],1)),
           end=c(sleepLog[sleepLog$LogId=="26201445747","EndTime"],tail(sleepEBE[sleepEBE$LogId=="26201445747","Time"],1),
                 tail(classicEBE[classicEBE$LogId=="26201445747","Time"],1)),
           duration=c(sleepLog[sleepLog$LogId=="26201445747","TimeInBed"],nrow(sleepEBE[sleepEBE$LogId=="26201445747",])/2,
                      nrow(classicEBE[classicEBE$LogId=="26201445747",])))
times$timeDiff <- difftime(times$end,times$start,units="mins")
times

Comments:

both sleepEBE and classicEBE show shorter TIB than sleepLog
since classicEBE times are closer to sleepLog times, with no missing epochs (in contrast to sleepEBE data), we keep only these

Here, we discard LogId 26201445747 epochs from sleepEBE.

sleepEBE <- sleepEBE[sleepEBE$LogId!="26201445747",] # removing case from sleepEBE
sleepEBE$LogId <- as.factor(as.character(sleepEBE$LogId)) # resetting LogIds
ClassicAndSleepLog <- c(ClassicAndSleepLog,"26201445747")

Finally, we plot epochs order against Time for each participant in order to better inspect the pattern of missing data.

par(mfrow=c(3,3))
for(ID in levels(classicEBE$ID)){ 
  plot((1:nrow(classicEBE[classicEBE$ID==ID,])~classicEBE[classicEBE$ID==ID,"Time"]),main=ID,xlab="",ylab="",cex=0.5) }

Comments:

most participants show several clusters of missing data during the period of participation, with participants s038, s040, s041, s048, s052, s063, s064, s089, and s090 showing the longest and most frequent periods of missing data, partially consistent with what was reported in section 2.3.6 for sleepLog data
participants s038 and s089 show very few nonmissing days (13 and 6, respectively)
a final cluster of epochs recorded several hours/days after the previous ones is observed in a substantial number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s028, s029, s030, s031, s033, s040, s042, s052, s053, s055, s060, s062, s063, and s064), partially consistent with what was reported in section 2.3.6 for sleepLog data
these cases will be discussed further in the data cleaning section.

2.5.5. Saving dataset

Here, we save the processed classicEBE dataset to be used in the following steps. We also update the sleepEBE dataset, and we save the special LogId cases.

# saving updated classicEBE and sleepEBE datasets
save(classicEBE,file="DATA/datasets/classicEBE_timeProcessed.RData")
save(sleepEBE,file="DATA/datasets/sleepEBE_timeProcessed2.RData")

# saving LogId special cases
LogId_special <- list(uniqueLogId,uniqueEBElogs,uniqueClassiclogs,ClassicAndSleepLog)
save(LogId_special,file="DATA/datasets/LogId_special.RData")

2.6. HR.1min

HR.1min data exported from Fitabase consist of 60-sec epoch-by-epoch heart rate data recorded by the Fitbit device. This variable will be used for recomputing both diurnal and nocturnal HR values.

Here, we recode the Time variable (i.e., the “Date and hour value in mm/dd/yyyy hh:mm:ss format”), based on which the ActivityDate variable is computed.

# standardizing Time format and converting to POSIXct
HR.1min[substr(HR.1min$Time,nchar(HR.1min$Time),nchar(HR.1min$Time))=="M","time2"] <- # timestamps with AM/PM specification
  as.POSIXct(HR.1min[substr(HR.1min$Time,nchar(HR.1min$Time),nchar(HR.1min$Time))=="M","Time"],format="%m/%d/%Y %I:%M:%S %p",tz="GMT") 
HR.1min[is.na(HR.1min$time2),"time2"] <-
  as.POSIXct(HR.1min[is.na(HR.1min$time2),"Time"],format="%m/%d/%Y %I:%M:%S %p",tz="GMT") # cases requiring time zone
HR.1min[substr(HR.1min$Time,nchar(HR.1min$Time),nchar(HR.1min$Time))!="M","time2"] <-
  as.POSIXct(HR.1min[substr(HR.1min$Time,nchar(HR.1min$Time),nchar(HR.1min$Time))!="M","Time"],
             format="%m/%d/%Y %H:%M",tz="GMT") # timestamps without AM/PM specification
HR.1min[is.na(HR.1min$time2),"time2"] <-
  as.POSIXct(HR.1min[is.na(HR.1min$time2),"Time"],format="%m/%d/%Y %H:%M",tz="GMT") # timestamps requiring time zone specification
HR.1min$Time <- HR.1min$time2 # keeping only the corrected timestamps
HR.1min$time2 <- NULL
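The two-format logic above (12-hour timestamps with AM/PM, then 24-hour timestamps without seconds) can be sketched more compactly as a parse-then-fallback step. This is a minimal sketch on a toy character vector `raw` (hypothetical values); it assumes an English or C locale, since `%p` is locale-dependent.

```r
# toy timestamps (hypothetical): two with AM/PM, one 24-hour, one unparseable
raw <- c("3/10/2019 1:59:00 AM", "3/10/2019 11:05:00 PM",
         "03/11/2019 14:30", "3/12/2019")

# first pass: 12-hour format with AM/PM specification
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "GMT")

# second pass: 24-hour format, only for timestamps not parsed in the first pass
miss <- is.na(parsed)
parsed[miss] <- as.POSIXct(raw[miss], format = "%m/%d/%Y %H:%M", tz = "GMT")

# timestamps still missing after both passes should be inspected manually
raw[is.na(parsed)]
parsed
```

Checking `is.na(parsed)` after both passes gives an explicit list of timestamps that matched neither format, which is easy to miss when the two passes are written as separate assignments.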

# recoding day and hour, and checking time and missing data points
p <- timeCheck(data=HR.1min,ID="ID",day="ActivityDate",hour="Time",
               input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")

## 6986307 observations in 5800 days from 93 participants:
## 
## - mean No. of days/participant = 62.37  SD = 9.32  min = 12  max = 80 
## - mean data collection duration (days) = 78.39 - SD = 38.42  min = 25  max = 332 
## 
## - mean No. of missing days per participant = 16.02  SD = 40.16  min = 0  max = 266 
## - mean No. of consecutive missing days per participant = 12.9  SD = 36.76  min = 0  max = 267
## 
## - No. of duplicated cases by hour (same ID and hour) = 0

Comments:

the ActivityDate variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 60-70 nonmissing days of data, with only a few participants showing fewer than 20 days
no cases have the same participant identifier and temporal coordinate
the No. of missing days and consecutive missing days, although substantial, is considerably lower than that shown for sleepLog, sleepEBE and classicEBE

2.6.1. Daylight Saving Time

HR.1min <- p[order(p$ID,p$Time),] # updating dataset and sorting by ID and Time
# setting DST changing times
DST.changes <- as.POSIXct(paste(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"),
                                "02:00:00",sep=" "),tz="GMT")

# selecting cases with ActivityDate = DST.changes + or - 3 minutes
DST <- as.data.frame(matrix(nrow=0,ncol=2))
for(i in 1:length(DST.changes)){
  DST <- rbind(DST,HR.1min[difftime(HR.1min$Time,DST.changes[i],units="mins")>(-3) &
                             difftime(HR.1min$Time,DST.changes[i],units="mins")<3,c("ID","Time")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$Time%in%DST.changes,"DST"] <- TRUE
DST[order(DST$ID,DST$Time),]

Comments:

the inspection of the temporal coordinates of the epochs immediately preceding or following DST changes suggests that time is sometimes updated by the Fitbit device during DST changes, contrary to what was concluded for sleepLog, sleepEBE and classicEBE data
specifically, time is updated (i.e., 1 hour forward) only in March but not in November DST changes

Here, we adjust those cases that show 1-hour shifts forward in March 2019-2021 by subtracting 1h from all Time values after DST changes in March.

# subtracting 1h from cases with 1h shift forward in March 2019-2021
for(ID in levels(as.factor(as.character(DST$ID)))){
  if(nrow(DST[DST$ID==ID,])==2){ # selecting only cases with time shifts associated with DST changes 
    if(substr(DST[DST$ID==ID,"Time"][2],1,10)%in%substr(DST.changes,1,10)[seq(1,5,2)]){ # selecting only DST changes in March
      HR.1min[HR.1min$ID==ID & HR.1min$Time>DST[DST$ID==ID,"Time"][2],"Time"] <-
        HR.1min[HR.1min$ID==ID & HR.1min$Time>DST[DST$ID==ID,"Time"][2],"Time"] - 1*60*60 }}}

# sanity check
DST <- as.data.frame(matrix(nrow=0,ncol=2))
for(i in 1:length(DST.changes)){
  DST <- rbind(DST,HR.1min[difftime(HR.1min$Time,DST.changes[i],units="mins")>(-3) &
                             difftime(HR.1min$Time,DST.changes[i],units="mins")<3,c("ID","Time")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$Time%in%DST.changes,"DST"] <- TRUE
DST[order(DST$ID,DST$Time),]

Comments:

now, no more cases have Time shifts of 1 hour corresponding to DST changes in March
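The effect of the correction can be illustrated on a single simulated participant. The sketch below is a simplified variant, not the procedure used above: it locates the shift from the 3660-sec step between consecutive epochs rather than from the window around DST.changes. The data frame `hr` and its values are hypothetical.

```r
# toy HR epochs (hypothetical): 01:58, 01:59, then 03:00, 03:01 after a 1h shift
hr <- data.frame(ID = "s001",
                 Time = as.POSIXct("2019-03-10 01:58:00", tz = "GMT") +
                   60 * c(0, 1, 62, 63))

# step (secs) from the previous epoch; a 1h forward shift shows as 3600 + 60
step <- c(0, diff(as.numeric(hr$Time)))
jump <- which(step == 3660)

# subtracting 1h from all epochs from the shifted one onwards
if (length(jump) == 1) {
  after <- seq_len(nrow(hr)) >= jump
  hr$Time[after] <- hr$Time[after] - 3600
}
hr$Time # 01:58, 01:59, 02:00, 02:01
```

After the correction, all consecutive epochs are again 60 sec apart, which is the property verified by the sanity check above.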

2.6.2. Temporal continuity

Finally, we plot epochs order against Time for each participant in order to better inspect the pattern of missing data.

par(mfrow=c(3,3))
for(ID in levels(HR.1min$ID)){ 
  plot((1:nrow(HR.1min[HR.1min$ID==ID,])~HR.1min[HR.1min$ID==ID,"Time"]),main=ID,xlab="",ylab="",cex=0.5) }

Comments:

most participants show several clusters of missing data during the period of participation, with participants s038, s040, s041, s052, s060, s063, s089, and s090 showing the longest and most frequent periods of missing data
a final cluster of epochs recorded several hours/days after the previous ones is observed in a substantial number of participants (i.e., s002, s006, s007, s013, s022, s023, s025, s026, s028, s029, s030, s031, s033, s038, s040, s042, s053, s055, s060, s063, and s064)
participants s038 and s089 show very few nonmissing days (13 and 6, respectively)
these cases will be better discussed in the data cleaning section.

2.6.3. Saving dataset

Here, we save the processed HR.1min dataset to be used in the following steps.

save(HR.1min,file="DATA/datasets/HR.1min_timeProcessed.RData")

2.7. dailyDiary

dailyDiary data were recorded with Survey Sparrow (SurveySparrow Inc.), and include the daily diary reports on psychological distress and other psychosocial variables self-reported each evening by participants. dailyDiary data are stored in a dataset with one row per day, with the StartedTime and SubmittedTime variables indicating the survey start and submission time, respectively. Thus, in this dataset we only need to recode the StartedTime variable, based on which the ActivityDate variable is computed.

# recoding day and hour, and checking time and missing data points
dailyDiary <-  timeCheck(data=dailyDiary,ID="ID",day="ActivityDate",hour="StartedTime",
                         input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")

## 5133 observations in 4302 days from 93 participants:
## 
## - mean No. of days/participant = 46.26  SD = 9.01  min = 23  max = 72 
## - mean data collection duration (days) = 63.28 - SD = 8.47  min = 38  max = 111 
## 
## - mean No. of missing days per participant = 17.02  SD = 10.27  min = 0  max = 73 
## - mean No. of consecutive missing days per participant = 4.14  SD = 2.59  min = 0  max = 17
## 
## - No. of duplicated cases by hour (same ID and hour) = 43

# updating SubmittedTime
subTime <- timeCheck(data=dailyDiary,ID="ID",day="ActivityDate",hour="SubmittedTime",printInfo = FALSE,
                     input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M")
subTime <- subTime[order(subTime$ID,subTime$StartedTime),] # sorting by StartedTime (now sorted by SubmittedTime)
dailyDiary$SubmittedTime <- subTime$SubmittedTime

Comments:

the ActivityDate variable has been successfully created with the ‘yyyy-mm-dd’ format
the earliest and the latest time points match with the temporal boundaries of data collection
most participants have ~ 40-60 nonmissing days of data, with only a few participants showing fewer than 20 days
43 cases have the same participant identifier and temporal coordinate
the overall No. of days is lower than that shown by sleepLog data, although the (substantial) No. of missing days and consecutive missing days is also lower than that shown for sleepLog. Specifically, despite the substantial No. of missing days (i.e., 78% with 10+ missing days vs. 76% of sleepLog), the number of consecutive missing days is lower, with only three participants (s052) showing 10+ consecutive missing days. This and other cases will be discussed further in the data cleaning section.

(dailyDiary_compliance <-  
  timeCheck(data=dailyDiary,ID="ID",day="ActivityDate",hour="StartedTime",returnInfo=TRUE,printInfo=FALSE,
            input.dayFormat="%m/%d/%Y",output.dayFormat="%Y-%m-%d",output.hourFormat="%m/%d/%Y %H:%M"))

# plotting missing days
par(mfrow=c(1,2))
hist(dailyDiary_compliance$nMissingDays,main="No. of missing days",breaks=30)
hist(dailyDiary_compliance$maxdayDiff,main="Max No. of consecutive missing days",breaks=30)
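The compliance metrics returned by timeCheck are computed by a function whose code is not shown in this section. As a rough illustration of how such per-participant metrics can be derived from a vector of dates, the sketch below uses toy data and hypothetical column names (`span`, `maxConsecMissing`); the exact definitions used by timeCheck (e.g., for maxdayDiff) may differ.

```r
# toy data (hypothetical): one row per participant-day with a diary response
days <- data.frame(
  ID = c(rep("s001", 4), rep("s002", 3)),
  ActivityDate = as.Date(c("2019-04-01", "2019-04-02", "2019-04-05", "2019-04-06",
                           "2019-04-01", "2019-04-02", "2019-04-03")))

# per-participant summary: No. of days, data collection span, and missing days
compliance <- do.call(rbind, lapply(split(days, days$ID), function(x) {
  d <- sort(unique(x$ActivityDate))
  span <- as.numeric(max(d) - min(d)) + 1 # days from first to last response
  gaps <- as.numeric(diff(d)) - 1         # missing days between consecutive responses
  data.frame(ID = x$ID[1], nDays = length(d), span = span,
             nMissingDays = span - length(d),
             maxConsecMissing = if (length(gaps) > 0) max(gaps) else 0)
}))
compliance
```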

2.7.1. Duplicated responses

First, we better inspect the 43 duplicated responses observed above.

dailyDiary$IDhour <- as.factor(paste(dailyDiary$ID,dailyDiary$StartedTime,sep="_"))
dup <- dailyDiary[duplicated(dailyDiary$IDhour),c("ID","ActivityDate","StartedTime","IDhour")]
cat("Detected",nrow(dup),"cases of double responses recorded between",
    as.character(dup[dup$StartedTime==min(dup$StartedTime),"StartedTime"]),"and",
    as.character(dup[dup$StartedTime==max(dup$StartedTime),"StartedTime"]))

## Detected 43 cases of double responses recorded between 2019-03-28 21:56:00 and 2021-03-30 13:46:00

dailyDiary[dailyDiary$IDhour%in%levels(as.factor(as.character(dup$IDhour))),
           c("ID","StartedTime","SubmittedTime",colnames(dailyDiary)[c(3,10,11)])]

Comments:

each duplicated case consists of two responses with the same StartedTime and SubmittedTime values
critically, although duplicated cases do not systematically show missing responses, the responses differ within the same pair of duplicated cases

Here, we remove all duplicated cases by keeping only the first one.

# excluding double responses (keeping only the first one)
new.data <- dailyDiary[!duplicated(dailyDiary$IDhour),]
cat("Excluded",nrow(dailyDiary)-nrow(new.data),"double responses")

## Excluded 43 double responses

# checking again for double responses (no more cases)
new.data$IDhour <- as.factor(paste(new.data$ID,new.data$StartedTime,sep="_"))
cat("Detected",nrow(new.data)-nlevels(new.data$IDhour),"cases of double responses")

## Detected 0 cases of double responses

# updating dataset
dailyDiary <- new.data
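Because the responses differ within each duplicated pair, it matters which row `duplicated()` retains: it flags the second and later occurrences, so negating it keeps the first row in the current sort order. A minimal sketch on toy data (hypothetical values and column `stress`):

```r
# toy diary (hypothetical): the first two rows share ID and StartedTime
diary <- data.frame(
  ID = c("s001", "s001", "s002", "s002"),
  StartedTime = as.POSIXct(c("2019-03-28 21:56:00", "2019-03-28 21:56:00",
                             "2019-03-28 22:10:00", "2019-03-29 21:30:00"),
                           tz = "GMT"),
  stress = c(2, 4, 1, 3))

# key identifying a response: participant and start time
key <- paste(diary$ID, diary$StartedTime, sep = "_")

# showing both members of each duplicated pair (not just the later one)
dupKeys <- unique(key[duplicated(key)])
diary[key %in% dupKeys, ]

# keeping only the first occurrence of each key
dedup <- diary[!duplicated(key), ]
dedup
```

Here the retained row for s001 is the one with stress = 2, i.e., the row that appears first in the data frame.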

2.7.2. CompletionStatus

Then, we inspect the CompletionStatus variable looking for cases of "Partial Completion".

# printing info
cat(nrow(dailyDiary[is.na(dailyDiary$SubmittedTime),]),"cases with missing SubmittedTime and",
    nrow(dailyDiary[dailyDiary$CompletionStatus=="Partially Completed",]),"cases of Partial Completion")

## 13 cases with missing SubmittedTime and 13 cases of Partial Completion

# showing 13 cases of Partial Completion
dailyDiary[dailyDiary$CompletionStatus=="Partially Completed",
           c("ID","StartedTime","SubmittedTime","CompletionStatus",colnames(dailyDiary)[c(3,10,11)])]

Comments:

in 13 cases the CompletionStatus is "Partially Completed", and the SubmittedTime is missing
however, none of these cases shows missing values in the focal variables (stress, negative mood, and worry), and thus we keep them by assigning the median surveyDuration in the sample (see below)

2.7.3. surveyDuration

# creating and plotting surveyDuration (min)
dailyDiary$surveyDuration <- as.numeric(difftime(dailyDiary$SubmittedTime,dailyDiary$StartedTime,units="min"))

# interpolating surveyDuration in cases of Partial Completion
dailyDiary[dailyDiary$CompletionStatus=="Partially Completed","surveyDuration"] <- median(dailyDiary$surveyDuration,na.rm=TRUE)
dailyDiary[dailyDiary$CompletionStatus=="Partially Completed","SubmittedTime"] <-
  dailyDiary[dailyDiary$CompletionStatus=="Partially Completed","StartedTime"] +
  60*dailyDiary[dailyDiary$CompletionStatus=="Partially Completed","surveyDuration"] # interpolating SubmittedTime (min converted to sec)

barplot(prop.table(table(dailyDiary$surveyDuration)),col="black",xlab="",main="Survey Duration (min)")

Comments:

most responses took about 1 min (32%) or less (66%), with only 91 cases (2%) showing a surveyDuration > 1 min
in 17 cases the surveyDuration was longer than 15 min, among which 6 cases submitted the responses more than 16h after the StartedTime

Here, we exclude these 6 cases because we don’t know whether the ratings referred to the current or the following day.

# summary of surveyDuration
summary(as.factor(dailyDiary$surveyDuration))

##    0    1    2    3    4    5    6    7    8   12   14   16   27   40   45   53 
## 3368 1631   49   11    6    1    1    3    1    1    1    1    1    1    1    1 
##   70  161  253  439  754  758 1297 1362 1424 1432 2286 2740 
##    1    1    1    1    1    1    1    1    1    1    1    1

# durations > 15 min (17)
dailyDiary[dailyDiary$surveyDuration > 15,c("ID","StartedTime","SubmittedTime","surveyDuration","CompletionStatus")]

# excluding 6 cases that submitted the responses on the following day
memory <- dailyDiary
dailyDiary <- dailyDiary[dailyDiary$surveyDuration < 1000,]
cat("Excluded",nrow(memory)-nrow(dailyDiary),"cases with surveyDuration > 17h")

## Excluded 6 cases with surveyDuration > 17h
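One detail worth keeping in mind in this step is that surveyDuration is expressed in minutes, whereas adding a number to a POSIXct value adds seconds. The sketch below illustrates the median-based interpolation on toy data (hypothetical values; a missing SubmittedTime stands in for the "Partially Completed" cases).

```r
# toy diary (hypothetical): the second response has no SubmittedTime
d <- data.frame(
  StartedTime   = as.POSIXct(c("2019-04-01 21:00:00", "2019-04-01 21:30:00",
                               "2019-04-01 22:00:00"), tz = "GMT"),
  SubmittedTime = as.POSIXct(c("2019-04-01 21:02:00", NA,
                               "2019-04-01 22:04:00"), tz = "GMT"))

# survey duration in minutes (NA when SubmittedTime is missing)
d$surveyDuration <- as.numeric(difftime(d$SubmittedTime, d$StartedTime, units = "mins"))

# assigning the sample median to missing durations
miss <- is.na(d$SubmittedTime)
d$surveyDuration[miss] <- median(d$surveyDuration, na.rm = TRUE)

# POSIXct arithmetic is in seconds: minutes are multiplied by 60
d$SubmittedTime[miss] <- d$StartedTime[miss] + 60 * d$surveyDuration[miss]
d
```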

2.7.4. StartedHour

Then, to better inspect the timing of diary responses, we recode the StartedTime and SubmittedTime variables to create StartHour and EndHour, indicating only the time (and not the date). Note that StartedTime and SubmittedTime have a minute resolution (not seconds).

dailyDiary <- StartTime_rec(data=dailyDiary,start="StartedTime",end="SubmittedTime",doPlot=TRUE,returnData=TRUE)

Comments:

most diaries were completed between 9 PM (i.e., 1h after they were received) and midnight. Note that most StartTime values derived from the Fitabase sleepLog data were later than the Survey Sparrow StartedTime values, as expected

2.7.5. Updating ActivityDate

Then, we update the ActivityDate variable so that it indicates the previous day when the StartHour is between 00:00 and 20:00 (N = 1,848, 36%). This clarifies the distinction between consecutive daily reports.

# No. of surveys started between 00 and 20
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h20 <- as.POSIXct(paste(substr(Sys.time(),1,10),"20:00:00"),tz="GMT")
cat(nrow(dailyDiary[dailyDiary$StartHour>=h00 & dailyDiary$StartHour<h20,c("ID","StartedTime","StartHour","ActivityDate")]),
    "cases with StartTime between midnight and 8 PM") ## 1848 cases with StartTime between midnight and 8 PM

## 1848 cases with StartTime between midnight and 8 PM

# updating ActivityDate
dailyDiary[dailyDiary$StartHour >= h00 & dailyDiary$StartHour <= h20,"ActivityDate"] <-
  dailyDiary[dailyDiary$StartHour >= h00 & dailyDiary$StartHour <= h20,"ActivityDate"] - 1
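The date shift can be illustrated on a few toy responses. The sketch below is a simplified variant that compares the decimal hour of StartedTime with the 20:00 cut-off, instead of using the POSIXct StartHour variable as above. The data and the treatment of responses started exactly at 20:00 (assigned here to the current day) are assumptions made for illustration.

```r
# toy diary (hypothetical): evening, after-midnight, and 8 PM responses
diary <- data.frame(
  StartedTime = as.POSIXct(c("2019-04-01 21:15:00", "2019-04-02 00:40:00",
                             "2019-04-02 20:00:00", "2019-04-02 23:10:00"),
                           tz = "GMT"))
diary$ActivityDate <- as.Date(diary$StartedTime, tz = "GMT")

# decimal hour of the day (e.g., 00:40 -> 0.67)
hourOfDay <- as.numeric(format(diary$StartedTime, "%H")) +
  as.numeric(format(diary$StartedTime, "%M")) / 60

# responses started before 8 PM are attributed to the previous day
early <- hourOfDay < 20
diary$ActivityDate[early] <- diary$ActivityDate[early] - 1
diary
```

In this toy case the first two responses end up with the same ActivityDate, which is the kind of duplicate that a check on ID and ActivityDate would then detect.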

We can use the updated ActivityDate variable to check for double cases with the same ID and ActivityDate value.

# No. of duplicated IDday values before updating ActivityDate
nrow(dailyDiary[duplicated(dailyDiary$IDday),]) # 788

## [1] 788

# No. of duplicated IDday values after updating ActivityDate
dailyDiary$IDday <- as.factor(paste(dailyDiary$ID,dailyDiary$ActivityDate,sep="_"))
nrow(dailyDiary[duplicated(dailyDiary$IDday),]) # 139

## [1] 139

# showing duplicates
dailyDiary <- dailyDiary[order(dailyDiary$ID,dailyDiary$StartedTime),] # re-sorting by ID and time
rownames(dailyDiary) <- 1:nrow(dailyDiary)
dupl <- dailyDiary[duplicated(dailyDiary$IDday),c("IDday","ActivityDate")]
cat(nrow(dupl),"duplicated cases from",as.character(min(dupl$ActivityDate)),"to",as.character(max(dupl$ActivityDate)))

## 139 duplicated cases from 2019-02-04 to 2021-04-17

dailyDiary[dailyDiary$IDday%in%levels(as.factor(as.character(dupl$IDday))),c("IDday","StartedTime","SubmittedTime",
                                                                             colnames(dailyDiary)[c(3,10)])]

Comments:

in 139 cases (3%), there are two (N = 127) or three/four observations (N = 12) with the same ID and ActivityDate value, probably due to technical problems with Survey Sparrow
duplicated cases are observed throughout the data collection (i.e., they are not specific to a limited period of time)
in some of these cases (e.g., participant s005 on 2019-04-01), the StartTime values differ by 2-3 min or even less, but responses are different

Here, for each of these groups of duplicated cases, we only keep the first case (i.e., the one with the earlier StartedTime), whereas we exclude 139 (3%) double responses.

# excluding double responses (keeping only the first one)
memory <- dailyDiary
dailyDiary <- dailyDiary[!duplicated(dailyDiary$IDday),]
cat("Excluded",nrow(memory)-nrow(dailyDiary),"double responses")

## Excluded 139 double responses

# checking again for double responses (no more cases)
dailyDiary$IDday <- as.factor(paste(dailyDiary$ID,dailyDiary$ActivityDate,sep="_"))
cat("Detected",nrow(dailyDiary)-nlevels(dailyDiary$IDday),"cases of double responses")

## Detected 0 cases of double responses

We can also use the updated ActivityDate to check the No. of cases with StartedTime after sleepLog EndTime (N = 739) (i.e., surveys answered on the following day), and the differences between the two time points (ranging from 0.5 min to 15.8h).

# checking No. of cases with StartedTime after wake up time (739)
dailyDiary$IDday <- as.factor(paste(dailyDiary$ID,dailyDiary$ActivityDate,sep="_"))
diaryAndsleep <- na.omit(plyr::join(dailyDiary[,c("IDday","StartedTime")],sleepLog[,c("IDday","EndTime")],by="IDday",type="left"))
cat(nrow(diaryAndsleep[diaryAndsleep$StartedTime>diaryAndsleep$EndTime,c("StartedTime","EndTime")]),
    "cases in which participants started filling the diary after they woke up \n(",
    round(100*nrow(diaryAndsleep[diaryAndsleep$StartedTime>diaryAndsleep$EndTime,])/nrow(diaryAndsleep),1),
    "% of cases with matching ID and ActivityDate between dailyDiary and sleepLog)")

## 739 cases in which participants started filling the diary after they woke up 
## ( 18 % of cases with matching ID and ActivityDate between dailyDiary and sleepLog)

# summarizing differences between sleepLog EndTime and dailyDiary StartedTime in these 739 cases
summary(as.numeric(difftime(diaryAndsleep[diaryAndsleep$StartedTime>diaryAndsleep$EndTime,"StartedTime"],
                            diaryAndsleep[diaryAndsleep$StartedTime>diaryAndsleep$EndTime,"EndTime"],units="mins")))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50   40.75  139.00  197.54  292.50  948.00
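The same comparison can be sketched with base R only. The example below uses toy data (hypothetical values) and `merge()`, whose default inner join keeps only the IDday values present in both datasets; for this purpose that should be comparable to the left join followed by `na.omit()` used above, assuming no other missing values in the selected columns.

```r
# toy diary and sleep data (hypothetical), keyed by participant-day
diary <- data.frame(
  IDday = c("s001_2019-04-01", "s001_2019-04-02", "s002_2019-04-01"),
  StartedTime = as.POSIXct(c("2019-04-01 21:30:00", "2019-04-03 09:00:00",
                             "2019-04-01 22:00:00"), tz = "GMT"))
sleep <- data.frame(
  IDday = c("s001_2019-04-01", "s001_2019-04-02"),
  EndTime = as.POSIXct(c("2019-04-02 07:00:00", "2019-04-03 07:30:00"), tz = "GMT"))

# inner join: only participant-days with both a diary and a sleep period
both <- merge(diary, sleep, by = "IDday")

# diaries started after the wake-up time of the following sleep period
late <- both$StartedTime > both$EndTime
lagMin <- as.numeric(difftime(both$StartedTime[late], both$EndTime[late], units = "mins"))

sum(late) # No. of late diaries
lagMin    # delay (min) from wake-up time to diary start
```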

2.7.6. Daylight Saving Time

In the San Francisco area, Daylight Saving Time (DST) changed on March 10th (1h forward) and November 3rd, 2019 (1h backward), again on March 8th (1h forward) and November 1st, 2020 (1h backward), and finally on March 14th, 2021 (1h forward). Here, we inspect the distributions of StartHour values in the 5 days preceding and the 5 days following each of these dates, in order to check whether time was automatically updated in the Survey Sparrow timestamps.

# setting DST changing times
DST.changes <- as.Date(c("2019-03-10","2019-11-03","2020-03-08","2020-11-01","2021-03-14"))

# selecting cases with ActivityDate = DST.changes + or - 5 days
DST <- as.data.frame(matrix(nrow=0,ncol=4))
for(i in 1:length(DST.changes)){
  DST <- rbind(DST,dailyDiary[difftime(dailyDiary$ActivityDate,DST.changes[i],units="days")>(-5) &
                          difftime(dailyDiary$ActivityDate,DST.changes[i],units="days")<5,
                        c("ID","ActivityDate","StartedTime","StartHour","SubmittedTime","EndHour","surveyDuration")])}
DST$DST <- FALSE # marking DST time changes with DST = TRUE
DST[DST$ActivityDate%in%DST.changes,"DST"] <- TRUE

# computing time (hours) from midnight
DST$timeFrom00 <- as.POSIXct(paste(lubridate::hour(DST$StartedTime), lubridate::minute(DST$StartedTime)), format="%H %M", tz="GMT") 
DST$timeFrom00 <- as.numeric(difftime(DST$timeFrom00,
                                      as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT"),units="hours"))
# subtracting 1 day to cases with timeFrom00 > 12
DST[DST$timeFrom00>12,"timeFrom00"] <- DST[DST$timeFrom00>12,"timeFrom00"] - 24

# plotting StartTime trends
for(i in 1:length(DST.changes)){
  DSTs <- c(substr(DST.changes[i],1,7),
            paste(substr(DST.changes[i],1,6),as.integer(substr(DST.changes[i],7,7))-1,sep=""))
  print(ggplot(data=DST[substr(DST$ActivityDate,1,7)%in%DSTs,],aes(x=ActivityDate,y=timeFrom00)) + 
          geom_line(aes(colour=ID)) + geom_point(aes(colour=ID),size=3) + ggtitle(DST.changes[i]) +
          geom_vline(xintercept=DST.changes[i]) +
          theme(axis.text.x=element_text(angle=45),legend.position = "none"))}

Comments:

the visual inspection of StartedTime trends in those participants who recorded their sleep during the days around DST changes does not suggest systematic shifts coinciding with time changes
the only DST change that shows some substantial shift in StartedTime is the third one (2019-03-10), with four participants out of five showing an upward shift of five to ten hours on the following day
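
The "time from midnight" transformation used for the plots above can be sketched on toy data as follows. This is a minimal illustration in base R rather than the code used in the pipeline: the `hours_from_midnight` helper and the example timestamps are hypothetical.

```r
# Hypothetical helper (not part of the original pipeline): hours elapsed since
# midnight, with evening times expressed as negative hours before midnight
hours_from_midnight <- function(x, tz = "GMT") {
  lt <- as.POSIXlt(x, tz = tz)
  h <- lt$hour + lt$min / 60   # decimal hours since 00:00
  late <- which(h > 12)        # times after noon are counted backward from midnight
  h[late] <- h[late] - 24      # e.g., 22:30 becomes -1.5
  h
}

# toy timestamps (invented for illustration)
started <- as.POSIXct(c("2019-03-09 22:30:00", "2019-03-10 00:15:00",
                        "2019-03-11 01:45:00"), tz = "GMT")
hours_from_midnight(started)
```

Centering on midnight in this way keeps diary entries started shortly before and shortly after 00:00 close to each other on the y-axis, so that a one-hour DST shift would be visible as a small vertical step rather than a 23-hour jump.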

2.7.7. Temporal continuity

Finally, we plot the order of diary entries against StartedTime for each participant in order to better inspect the pattern of missing data.

par(mfrow=c(3,3))
for(ID in levels(dailyDiary$ID)){ 
  plot((1:nrow(dailyDiary[dailyDiary$ID==ID,])~dailyDiary[dailyDiary$ID==ID,"StartedTime"]),main=ID,xlab="",ylab="") }

Comments:

in contrast to what was shown for sleepLog, sleepEBE, and classicEBE data, dailyDiary data do not show evident clusters of missing data
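
As a numeric complement to the visual inspection, gaps between consecutive entries can also be flagged directly. The following is a minimal sketch on toy data, assuming one row per diary entry; the `toy` data frame and the `gapDays` column are hypothetical names introduced only for illustration.

```r
# toy data: one hypothetical participant with a 3-day gap between entries
toy <- data.frame(ID = "s001",
                  ActivityDate = as.Date(c("2020-01-01", "2020-01-02",
                                           "2020-01-05", "2020-01-06")))
toy <- toy[order(toy$ID, toy$ActivityDate), ]

# days elapsed since the previous entry (NA for the first entry of each participant)
toy$gapDays <- ave(as.numeric(toy$ActivityDate), toy$ID,
                   FUN = function(d) c(NA, diff(d)))

# entries preceded by more than one day without data
toy[!is.na(toy$gapDays) & toy$gapDays > 1, ]
```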

2.7.8. Saving dataset

Here, we save the recoded dailyDiary dataset with the 4,945 included cases.

save(dailyDiary,file="DATA/datasets/dailyDiary_timeProcessed.RData")

3. Data recoding

Here, we remove the variables not considered for the analysis, and we recode those variables that are kept. Before recoding, we empty the working environment and reload the processed datasets.

rm(list=ls()) # emptying the working environment

# setting system time zone to GMT (for consistent temporal synchronization)
Sys.setenv(TZ="GMT") # R reads the upper-case TZ variable (names are case-sensitive on Unix-alikes)

# loading processed datasets
load("DATA/datasets/dailyAct_timeProcessed.RData") # dailyAct
load("DATA/datasets/hourlySteps_timeProcessed.RData") # hourlySteps
load("DATA/datasets/sleepLog_combined.RData") # sleepLog
load("DATA/datasets/sleepEBE_timeProcessed2.RData") # sleepEBE
load("DATA/datasets/classicEBE_timeProcessed.RData") # classicEBE
load("DATA/datasets/HR.1min_timeProcessed.RData") # HR.1min
load("DATA/datasets/dailyDiary_timeProcessed.RData") # dailyDiary
demos <- read.csv2("DATA/demographics.csv",header=TRUE) # demos

3.1. dailyAct

From dailyAct, we only keep the TotalSteps variable, and those counting the No. of minutes in each activity zone (VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, and SedentaryMinutes), whereas we discard RestingHeartRate (due to unclear computation) and the derived measures such as Calories.

# plotting all variables
par(mfrow=c(2,4))
for(Var in colnames(dailyAct)[4:ncol(dailyAct)]){ hist(dailyAct[,Var],main=Var,xlab="") }

# removing variables
toRemove <- c("TotalDistance","TrackerDistance","LoggedActivitiesDistance","VeryActiveDistance",
              "ModeratelyActiveDistance","LightActiveDistance","SedentaryActiveDistance",
              "Calories","Floors","CaloriesBMR","MarginalCalories","RestingHeartRate")
dailyAct[,toRemove] <- NULL

Then, we use the variables counting the No. of minutes in each activity zone for computing the aggregated ModerateVigorousMinutes, and TotalActivityMinutes, quantifying the No. of minutes in “very active” or “fairly active” physical activity, and the total No. of any activity minutes, respectively.

# computing ModerateVigorousMinutes
dailyAct$ModerateVigorousMinutes <- dailyAct$VeryActiveMinutes + dailyAct$FairlyActiveMinutes

# removing further unused variables
dailyAct$VeryActiveMinutes <- dailyAct$FairlyActiveMinutes <- NULL

# computing TotalActivityMinutes
dailyAct$TotalActivityMinutes <- dailyAct$ModerateVigorousMinutes + dailyAct$LightlyActiveMinutes + dailyAct$SedentaryMinutes

# plotting all included variables
par(mfrow=c(2,3))
for(Var in colnames(dailyAct)[4:ncol(dailyAct)]){ hist(dailyAct[,Var],main=Var,xlab="") }

# showing dataset (first 3 rows)
dailyAct[1:3,]

3.1.1. Saving dataset

Here, we save the recoded dailyAct dataset.

save(dailyAct,file="DATA/datasets/dailyAct_recoded.RData")

3.2. hourlySteps

From hourlySteps, we only remove the IDday and IDhour variables created above, whereas we keep the only included variable, namely StepTotal.

# removing IDday and IDhour variables, and sorting columns
hourlySteps <- hourlySteps[,c("ID","group","ActivityDate","ActivityHour","StepTotal")]

# plotting StepTotal
hist(hourlySteps$StepTotal,xlab="",breaks=100,
     main=paste("StepTotal ( min =",min(hourlySteps$StepTotal),", max =",max(hourlySteps$StepTotal),", median =",
                median(hourlySteps$StepTotal),")"))

# showing dataset (first 3 rows)
hourlySteps[1:3,]

3.2.1. Saving dataset

Here, we save the recoded hourlySteps dataset.

save(hourlySteps,file="DATA/datasets/hourlySteps_recoded.RData")

3.3. sleepLog

From sleepLog, we only keep the StartTime and EndTime variables to be used for computing sleep measures from the sleepEBE and classicEBE datasets. The sleep measures automatically recorded in Fitabase (i.e., MinutesAfterWakeUp, MinutesAsleep, MinutesToFallAsleep, and TimeInBed) are also kept for comparison.

# plotting all variables
par(mfrow=c(2,4))
for(Var in colnames(sleepLog)[5:ncol(sleepLog)]){ if(is.numeric(sleepLog[,Var]) | is.integer(sleepLog[,Var])){
  hist(sleepLog[,Var],main=Var,xlab="") }}

# removing variables
toRemove <- c("Duration","Efficiency","ClassicAsleepCount","ClassicAsleepDuration","ClassicAwakeCount","ClassicAwakeDuration",
              "ClassicRestlessCount","ClassicRestlessDuration","StagesWakeCount","StagesWakeDuration","StagesWakeThirtyDayAvg",
              "StagesLightCount","StagesLightDuration","StagesLightThirtyDayAvg",
              "StagesDeepCount","StagesDeepDuration","StagesDeepThirtyDayAvg",
              "StagesREMCount","StagesREMDuration","StagesREMThirtyDayAvg",
              "IsMainSleep",
              "lastTimeChar","StartTime2","IDday","IDhour","StartHour","EndHour", # ad-hoc created variables
              "combType","combSeq") # info on combined cases (keeping only combined variable)
sleepLog[,toRemove] <- NULL

Then, we use TimeInBed, MinutesAfterWakeUp, MinutesToFallAsleep, and MinutesAsleep to compute the fitabaseWASO variable (i.e., the time in bed that remains after subtracting the other three measures), and we remove the MinutesAfterWakeUp variable (not considered). Finally, we sort, plot, and print the included variables.

# creating fitabaseWASO
sleepLog$fitabaseWASO <- sleepLog$TimeInBed - sleepLog$MinutesAsleep - sleepLog$MinutesAfterWakeUp - sleepLog$MinutesToFallAsleep

# sorting columns and removing MinutesAfterWakeUp
sleepLog <- sleepLog[,c("ID","group","ActivityDate","LogId","StartTime","EndTime","combined","combinedLogId","nCombined",
                        "SleepDataType","TimeInBed","MinutesAsleep","fitabaseWASO","MinutesToFallAsleep")]

# plotting all included variables
par(mfrow=c(1,4))
for(Var in colnames(sleepLog)[11:ncol(sleepLog)]){ hist(sleepLog[,Var],main=Var,xlab="") }

# showing dataset (first 3 rows)
sleepLog[1:3,]
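
The fitabaseWASO computation above can be checked by hand on a couple of invented nights. This is only a worked example: the `toyLog` values are hypothetical and do not come from the study data.

```r
# toy sleep logs (all values in minutes, invented for illustration)
toyLog <- data.frame(TimeInBed = c(480, 400),
                     MinutesAsleep = c(420, 390),
                     MinutesToFallAsleep = c(10, 0),
                     MinutesAfterWakeUp = c(5, 0))

# same formula as used for sleepLog$fitabaseWASO
toyLog$fitabaseWASO <- with(toyLog,
  TimeInBed - MinutesAsleep - MinutesAfterWakeUp - MinutesToFallAsleep)

# by construction, the four components add back up to TimeInBed
with(toyLog, all(MinutesAsleep + MinutesToFallAsleep + MinutesAfterWakeUp +
                   fitabaseWASO == TimeInBed))

toyLog$fitabaseWASO # 45 and 10 minutes
```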

3.3.1. Saving dataset

Here, we save the recoded sleepLog dataset.

save(sleepLog,file="DATA/datasets/sleepLog_recoded.RData")

3.4. sleepEBE

From sleepEBE, we only keep the SleepStage variable, which accounts for short wakes, whereas we remove the Level and the ShortWakes variables. The SleepStage variable is converted to integer (i.e., 0 = wake, 1 = light, 2 = deep, 3 = REM).

# removing variables
toRemove <- c("Level","ShortWakes","IDday","IDhour","dayLog")
sleepEBE[,toRemove] <- NULL

# sorting columns
sleepEBE <- sleepEBE[,c("ID","group","ActivityDate","LogId","Time","SleepStage")]

# plotting SleepStage
sleepEBE$SleepStage <- as.factor(sleepEBE$SleepStage)
plot(sleepEBE$SleepStage)

# converting SleepStage as integer
sleepEBE$SleepStage <- as.integer(gsub("wake","0",gsub("light","1",gsub("deep","2",gsub("rem","3",sleepEBE$SleepStage)))))

# showing dataset (first 3 rows)
sleepEBE[1:3,]
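
The same label-to-integer mapping can also be written as a lookup in a named vector, which makes unexpected labels visible as NA instead of relying on nested gsub() calls. The sketch below uses toy labels; `stageCodes`, `toyStages`, and the "unknown" label are hypothetical and only serve the illustration.

```r
# mapping stated in the text: 0 = wake, 1 = light, 2 = deep, 3 = REM
stageCodes <- c(wake = 0L, light = 1L, deep = 2L, rem = 3L)

# toy labels, including one value that is not a valid stage
toyStages <- c("wake", "light", "rem", "deep", "light", "unknown")

# lookup by name: labels without a match are returned as NA
toyCoded <- unname(stageCodes[toyStages])
toyCoded
```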

3.4.1. Saving dataset

Here, we save the recoded sleepEBE dataset.

save(sleepEBE,file="DATA/datasets/sleepEBE_recoded.RData")

3.5. classicEBE

From classicEBE, we only keep the value variable, with possible values 1 = “asleep,” 2 = “restless,” and 3 = “awake.” Here, this variable is recoded as 0 = “wake” (i.e., both awake and restless) and 1 = “sleep,” consistent with what is done by other processing pipelines such as RAPIDS, and by the Fitabase pipeline itself, as highlighted in this thread.

# removing variables
toRemove <- c("IDday","IDhour","dayLog")
classicEBE[,toRemove] <- NULL

# plotting value
classicEBE$value <- as.factor(gsub("2","0",gsub("3","0",classicEBE$value)))
plot(classicEBE$value)

# converting value as integer
classicEBE$value <- as.integer(as.character(classicEBE$value))

# sorting columns
classicEBE <- classicEBE[,c("ID","group","ActivityDate","LogId","Time","value")]

# showing dataset (first 3 rows)
classicEBE[1:3,]
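
The wake/sleep recoding described above amounts to a single logical comparison, as sketched below on toy values. The `toyValue` and `toySleep` names are hypothetical; the original 1/2/3 coding is the one reported in the text.

```r
# toy classic values: 1 = asleep, 2 = restless, 3 = awake (NA = missing epoch)
toyValue <- c(1, 1, 2, 3, 1, NA)

# 1 = sleep only when the original value is "asleep"; restless and awake become 0
toySleep <- as.integer(toyValue == 1)
toySleep
```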

3.5.1. Saving dataset

Here, we save the recoded classicEBE dataset.

save(classicEBE,file="DATA/datasets/classicEBE_recoded.RData")

3.6. HR.1min

From HR.1min, we only keep the Value variable, which we rename as HR.

# removing variables
toRemove <- c("IDday","IDhour")
HR.1min[,toRemove] <- NULL

# renaming HR and sorting variables
colnames(HR.1min)[which(colnames(HR.1min)=="Value")] <- "HR"
HR.1min <- HR.1min[,c("ID","group","ActivityDate","Time","HR")]

# plotting HR distribution
hist(HR.1min$HR,xlab="HR (bpm)")

# showing dataset (first 3 rows)
HR.1min[1:3,]

3.6.1. Saving dataset

Here, we save the recoded HR.1min dataset.

save(HR.1min,file="DATA/datasets/HR.1min_recoded.RData")

3.7. dailyDiary

From dailyDiary, we only keep the self-reported variables, whereas we remove all variables describing survey or participants’ details.

# removing variables
toRemove <- c("TotalScore","CompletionStatus","IPAddress","Location","DMSLatLong","ChannelName","ChannelType","DeviceID",
              "DeviceName","Browser","OS",
              "ContactName","ContactPhone","ContactJobTitle","ContactEmail","ContactMobile",
              "IDday","IDhour","StartHour","EndHour")
dailyDiary[,toRemove] <- NULL

# renaming variables
colnames(dailyDiary)[which(colnames(dailyDiary)=="Howstressfulwasyourday"
                           ):which(colnames(dailyDiary)=="OtheregIamworriedaboutsomethingelsehappeningtomorrow")] <-
  c("dailyStress","stress_school","stress_family","stress_health","stress_COVID","stress_peers","stress_other",
    "eveningMood",
    "eveningWorry","worry_school","worry_family","worry_health","worry_peer","worry_COVID","worry_sleep","worry_other")

Then, we recode self-report variables from character to numeric values.

3.7.1. dailyStress

We start with dailyStress (i.e., “How stressful was your day?”), rated from 1 = “Not at all stressful” to 5 = “Extremely stressful.” Participants were asked to indicate the sources of stress (yes/no) only when they reported dailyStress > 1.

# converting dailyStress score as numeric
dailyDiary$dailyStress <- as.numeric(gsub("Not at all stressful","1",
                                          gsub("Not so stressful","2",
                                               gsub("Somewhat stressful","3",
                                                    gsub("Very stressful","4",
                                                         gsub("Extremely stressful","5",
                                                              dailyDiary$dailyStress))))))
# converting dailyStress sources as binary (0/1)
for(Var in colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,7)=="stress_")]){
  colnames(dailyDiary)[which(colnames(dailyDiary)==Var)] <- "var"
  dailyDiary[!is.na(dailyDiary$var) & dailyDiary$var=="","var"] <- "0"
  dailyDiary[!is.na(dailyDiary$var) & dailyDiary$var!="0","var"] <- "1"
  dailyDiary$var <- as.numeric(dailyDiary$var)
  colnames(dailyDiary)[which(colnames(dailyDiary)=="var")] <- Var }

# sanity check: 4 cases with dailyStress = 1 but stressor specification (?)
dailyDiary$stress_total <- rowSums(dailyDiary[,which(substr(colnames(dailyDiary),1,7)=="stress_")],na.rm=TRUE) # stress_total
dailyDiary[!is.na(dailyDiary$dailyStress) & dailyDiary$dailyStress==1 & dailyDiary$stress_total!=0,
           c(which(colnames(dailyDiary)=="dailyStress"),which(substr(colnames(dailyDiary),1,7)=="stress_"))]

# sanity check: 12 cases with dailyStress > 1 but no stressor specification (?)
dailyDiary[!is.na(dailyDiary$dailyStress) & dailyDiary$dailyStress>1 & dailyDiary$stress_total==0,
           c(which(colnames(dailyDiary)=="dailyStress"),which(substr(colnames(dailyDiary),1,7)=="stress_"))]

# plotting distribution of dailyStress scores
hist(dailyDiary$dailyStress,breaks=50,col="black",xlab="",main="dailyStress (quite skewed)")

# plotting frequency of stressor categories
dailyDiary[,which(substr(colnames(dailyDiary),1,7)=="stress_")] <- 
  lapply(dailyDiary[,which(substr(colnames(dailyDiary),1,7)=="stress_")],as.factor)
par(mfrow=c(2,3))
for(Var in colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,7)=="stress_")]){
  colnames(dailyDiary)[which(colnames(dailyDiary)==Var)] <- "var"
  barplot(prop.table(table(dailyDiary$var)),col="black",xlab="",main=Var)
  colnames(dailyDiary)[which(colnames(dailyDiary)=="var")] <- Var }

# showing summary
summary(dailyDiary[!is.na(dailyDiary$dailyStress) & dailyDiary$dailyStress>1,
                   which(substr(colnames(dailyDiary),1,7)=="stress_")])

##  stress_school stress_family stress_health stress_COVID stress_peers
##  0   :1154     0   :2385     0   :1908     0   : 660    0   :1983   
##  1   :2141     1   : 403     1   : 310     1   :  99    1   : 539   
##  NA's:  97     NA's: 604     NA's:1174     NA's:2633    NA's: 870   
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  stress_other stress_total
##  0   :2081    0:  12      
##  1   :1281    1:2286      
##  NA's:  30    2: 850      
##               3: 201      
##               4:  35      
##               5:   4      
##               6:   4

Comments:

dailyStress ratings show a positively skewed distribution, with most of the ratings being 1 (32%) or 2 (31%). Most ratings were higher than 1 (68%)
the most frequently reported stressors were “school” (65%) and “other” (37%)
note that the number of missing values varies strongly across stressors (from 30 for “Other” to 2,617 for “COVID”)
in 4 cases, stressors were specified even if dailyStress was rated as 1 (note that stressor items were supposed to be shown only when dailyStress > 1)
in 12 cases, stressors were not specified even if dailyStress was rated as > 1
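
The recoding logic used in this section (rating labels to 1-5 scores, and checkbox answers to 0/1) can be sketched compactly with match() and a logical comparison. The example below runs on toy answers: the rating labels are those used in the code above, whereas `toyAnswers`, `toyBox`, and the "School" checkbox text are hypothetical.

```r
# rating labels in increasing order, so that the position equals the score
stressLabels <- c("Not at all stressful", "Not so stressful", "Somewhat stressful",
                  "Very stressful", "Extremely stressful")

# toy ratings (NA = item not answered)
toyAnswers <- c("Somewhat stressful", "Not at all stressful", NA, "Extremely stressful")
toyScores <- match(toyAnswers, stressLabels) # position of each label = 1-5 score

# toy checkbox item: empty string = not selected, any text = selected, NA = not shown
toyBox <- c("", "School", NA, "School")
toyBinary <- as.integer(toyBox != "")

data.frame(toyScores, toyBinary)
```

With match(), a label that is misspelled or missing from `stressLabels` yields NA, which makes unexpected survey exports easier to spot.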

3.7.2. eveningWorry

The same is done for eveningWorry (i.e., “How worried do you feel right now?”), which is recorded from 1 = “Not at all worried” to 5 = “Extremely worried.” Participants were asked to indicate the sources of worry (yes/no) only when they reported eveningWorry > 1.

# converting eveningWorry score as numeric
dailyDiary$eveningWorry <- as.numeric(gsub("Not at all worried","1",
                                          gsub("Not so worried","2",
                                               gsub("Somewhat worried","3",
                                                    gsub("Very worried","4",
                                                         gsub("Extremely worried","5",
                                                              dailyDiary$eveningWorry))))))
# converting eveningWorry sources as binary (0/1)
for(Var in colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,6)=="worry_")]){
  colnames(dailyDiary)[which(colnames(dailyDiary)==Var)] <- "var"
  dailyDiary[!is.na(dailyDiary$var) & dailyDiary$var=="","var"] <- "0"
  dailyDiary[!is.na(dailyDiary$var) & dailyDiary$var!="0","var"] <- "1"
  dailyDiary$var <- as.numeric(dailyDiary$var)
  colnames(dailyDiary)[which(colnames(dailyDiary)=="var")] <- Var }

# sanity check: 1 case with eveningWorry = 1 but worry specification (?)
dailyDiary$worry_total <- rowSums(dailyDiary[,which(substr(colnames(dailyDiary),1,6)=="worry_")],na.rm=TRUE) # worry_total
dailyDiary[!is.na(dailyDiary$eveningWorry) & dailyDiary$eveningWorry==1 & dailyDiary$worry_total!=0,
           c(which(colnames(dailyDiary)=="eveningWorry"),which(substr(colnames(dailyDiary),1,6)=="worry_"))]

# sanity check: 26 cases with eveningWorry > 1 but no worry specification (?)
dailyDiary[!is.na(dailyDiary$eveningWorry) & dailyDiary$eveningWorry>1 & dailyDiary$worry_total==0,
           c(which(colnames(dailyDiary)=="eveningWorry"),which(substr(colnames(dailyDiary),1,6)=="worry_"))]

# plotting distribution of eveningWorry scores
hist(dailyDiary$eveningWorry,breaks=50,col="black",xlab="",main="eveningWorry (very skewed)")

# plotting frequency of worry categories
dailyDiary[,which(substr(colnames(dailyDiary),1,6)=="worry_")] <- 
  lapply(dailyDiary[,which(substr(colnames(dailyDiary),1,6)=="worry_")],as.factor)
dailyDiary$worry_total <- NULL # removing worry_total
par(mfrow=c(3,3))
for(Var in colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,6)=="worry_")]){
  colnames(dailyDiary)[which(colnames(dailyDiary)==Var)] <- "var"
  barplot(prop.table(table(dailyDiary$var)),col="black",xlab="",main=Var)
  colnames(dailyDiary)[which(colnames(dailyDiary)=="var")] <- Var }

# showing summary
summary(dailyDiary[!is.na(dailyDiary$eveningWorry) & dailyDiary$eveningWorry>1,
                   which(substr(colnames(dailyDiary),1,6)=="worry_")])

##  worry_school worry_family worry_health worry_peer  worry_COVID worry_sleep
##  0   :1090    0   :2335    0   :1866    0   :1989   0   : 766   0   :2074  
##  1   :2243    1   : 295    1   : 354    1   : 556   1   :  92   1   : 732  
##  NA's:  85    NA's: 788    NA's:1198    NA's: 873   NA's:2560   NA's: 612  
##  worry_other
##  0   :2010  
##  1   :1294  
##  NA's: 114

Comments:

eveningWorry ratings showed a positively skewed distribution, with most of the ratings being 1 (32%) or 2 (29%), similarly to dailyStress. Most ratings were higher than 1 (68%)
the most frequently reported sources of worry were “school” (49%) and “other” (31%)
note that the number of missing values varies strongly across the sources of worry (from 88 for “School” to 2625 for “COVID”)
in 1 case, the sources of worry were specified even if eveningWorry was rated as 1 (note that worry items were supposed to be shown only when eveningWorry > 1)
in 26 cases, sources of worry were not specified even if eveningWorry was rated as > 1

3.7.3. eveningMood

Finally, we recode eveningMood (i.e., “How is your mood right now?”) from 1 = “Very bad” to 5 = “Very good.” Here, we can see that the variable was negatively skewed, with most of the ratings being 3 (29%) or 4 (36%).

# converting as numeric
dailyDiary$eveningMood <- as.numeric(gsub("Very bad","1",
                                          gsub("Somewhat bad","2",
                                               gsub("Neither bad or good","3",
                                                    gsub("Somewhat good","4",
                                                         gsub("Very good","5",
                                                              dailyDiary$eveningMood))))))
# plotting
hist(dailyDiary$eveningMood,breaks=50,col="black",xlab="",main="eveningMood (slightly skewed)")

3.7.4. Saving dataset

Here, we sort the variables, we display the recoded dataset, and we save the recoded dailyDiary dataset.

# sorting variables
dailyDiary <- dailyDiary[,c("ID","group","ActivityDate","StartedTime","SubmittedTime","surveyDuration",
                            "dailyStress","eveningWorry","eveningMood",
                            colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,7)=="stress_")],
                            colnames(dailyDiary)[which(substr(colnames(dailyDiary),1,6)=="worry_")])]

# showing dataset (first 3 rows)
dailyDiary[1:3,]

# saving dataset
save(dailyDiary,file="DATA/datasets/dailyDiary_recoded.RData")

3.8. demos

From demos, we keep the participant’s identifier, sex, age, BMI, and insomnia group, with sex and insomnia group being recoded as factor.

# recoding participants' identifier
demos$ID <- as.factor(paste("s",substr(demos$id,5,7),sep=""))

# changing variable classes
demos$sex <- gsub("0","F",gsub("1","M",demos$sex))
demos[,c("sex","insomnia","DSMinsomnia","sub_insomnia")] <- 
  lapply(demos[,c("sex","insomnia","DSMinsomnia","sub_insomnia")],as.factor)
demos[,c("age","BMI")] <- lapply(demos[,c("age","BMI")],as.numeric) 

# sorting variables and removing unused columns
demos <- demos[,c("ID","sex","age","BMI","insomnia","DSMinsomnia","sub_insomnia")] # sorting columns

# plotting variables
par(mfrow=c(2,3)) 
for(Var in c(colnames(demos)[2:7])){ 
  if(is.numeric(demos[,Var])){ hist(demos[,Var],main=Var) }else{ plot(demos[,Var],main=Var) }}

Then, we recode the insomnia group variables in order to compute the insomnia.group variable, which accounts for both the DSMinsomnia variable (i.e., 1 if case of insomnia based on DSM criteria) and the sub_insomnia variable (i.e., 1 if case meeting all but one DSM criteria; note that sub_insomnia has missing values when DSMinsomnia = 1).

# creating insomnia.group variable
demos$insomnia.group <- "control" # creating insomnia.group = control vs. sub.insomnia vs. DSM.insomnia
demos[!is.na(demos$sub_insomnia) & demos$sub_insomnia==1,"insomnia.group"] <- "sub.ins" # sub.ins when sub_insomnia = 1
demos[demos$DSMinsomnia==1,"insomnia.group"] <- "DSM.ins" # DSM.ins when DSMinsomnia = 1
demos$DSMinsomnia <- demos$sub_insomnia <- NULL # removing unnecessary variables
demos$insomnia.group <- as.factor(demos$insomnia.group)

# plotting
plot(demos$insomnia.group,main="Insomnia groups")
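
The precedence between the two indicators (DSM insomnia overrides sub-threshold insomnia, which overrides control) can be illustrated on three invented participants. The `toyDemos` data frame is hypothetical and only mirrors the column names used above.

```r
# toy participants: control, sub-threshold insomnia, and DSM insomnia
# (sub_insomnia is missing when DSMinsomnia = 1, as noted in the text)
toyDemos <- data.frame(DSMinsomnia = c(0, 0, 1),
                       sub_insomnia = c(0, 1, NA))

toyDemos$insomnia.group <- "control"
isSub <- !is.na(toyDemos$sub_insomnia) & toyDemos$sub_insomnia == 1
isDSM <- !is.na(toyDemos$DSMinsomnia) & toyDemos$DSMinsomnia == 1
toyDemos$insomnia.group[isSub] <- "sub.ins"
toyDemos$insomnia.group[isDSM] <- "DSM.ins" # assigned last, so it takes precedence

toyDemos$insomnia.group <- factor(toyDemos$insomnia.group,
                                  levels = c("control", "sub.ins", "DSM.ins"))
toyDemos
```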

3.8.1. Saving dataset

Here, we show and save the recoded demos dataset.

# showing dataset (first 3 rows)
demos[1:3,]

# saving dataset
save(demos,file="DATA/datasets/demos_recoded.RData")

4. Data aggregation

Here, we aggregate the recoded datasets into the final datasets to be used for the analysis. Before aggregating, we empty the working environment and reload the recoded datasets.

rm(list=ls()) # emptying the working environment

library(lubridate) # required packages

# setting system time zone to GMT (for consistent temporal synchronization)
Sys.setenv(TZ="GMT") # R reads the upper-case TZ variable (names are case-sensitive on Unix-alikes)

# loading processed datasets
load("DATA/datasets/dailyAct_recoded.RData") # dailyAct
load("DATA/datasets/hourlySteps_recoded.RData") # hourlySteps
load("DATA/datasets/sleepLog_nonComb.RData") # sleepLog_nonComb
load("DATA/datasets/sleepLog_recoded.RData") # sleepLog
load("DATA/datasets/sleepEBE_recoded.RData") # sleepEBE
load("DATA/datasets/classicEBE_recoded.RData") # classicEBE
load("DATA/datasets/LogId_special.RData") # special cases of LogId
load("DATA/datasets/HR.1min_recoded.RData") # HR.1min
load("DATA/datasets/dailyDiary_recoded.RData")  # dailyDiary
load("DATA/datasets/demos_recoded.RData") # demos

4.1. dailyAct & hourlySteps

First, we aggregate the hourlySteps and dailyAct datasets by using the former for recomputing the TotalSteps count in the latter. The two datasets are aggregated by using the IDday variable (i.e., accounting for participants’ ID and ActivityDate).

# creating common ID x day identifier
hourlySteps$IDday <- as.factor(paste(hourlySteps$ID,hourlySteps$ActivityDate,sep="_")) # subject x day identifier
dailyAct$IDday <- as.factor(paste(dailyAct$ID,dailyAct$ActivityDate,sep="_"))

# checking whether all IDday values in hourlySteps are included in dailyAct (TRUE)
cat("Sanity check:",length(levels(hourlySteps$IDday)[!(levels(hourlySteps$IDday)%in%levels(dailyAct$IDday))])==0)

## Sanity check: TRUE

# checking whether all IDday values in dailyAct are included in hourlySteps (FALSE)
cat("Sanity check:",length(levels(dailyAct$IDday)[!(levels(dailyAct$IDday)%in%levels(hourlySteps$IDday))])==0)

## Sanity check: FALSE

# IDday only included in dailyAct but not in hourlySteps (13)
levels(dailyAct$IDday)[!(levels(dailyAct$IDday)%in%levels(hourlySteps$IDday))]

##  [1] "s001_2019-04-04" "s041_2019-09-07" "s042_2019-08-14" "s047_2019-10-16"
##  [5] "s047_2019-10-17" "s047_2019-10-18" "s047_2019-10-19" "s047_2019-10-20"
##  [9] "s047_2019-10-21" "s089_2020-10-29" "s116_2021-04-08" "s119_2021-04-22"
## [13] "s120_2021-04-09"

# marking these cases as hourlySteps=FALSE
dailyAct$hourlySteps = TRUE
dailyAct[!(dailyAct$IDday%in%levels(hourlySteps$IDday)),"hourlySteps"] <- FALSE

# recomputing total steps per day
for(i in 1:nrow(dailyAct)){
  if(dailyAct[i,"hourlySteps"]==TRUE){
    dailyAct[i,"TotalSteps2"] <- sum(hourlySteps[as.character(hourlySteps$IDday)==as.character(dailyAct[i,"IDday"]),
                                                 "StepTotal"])}}
# sanity check (TRUE)
cat("sanitycheck:",nrow(dailyAct[is.na(dailyAct$TotalSteps2),])==13)

## sanitycheck: TRUE

Comments:

all ID-ActivityDate combinations included in hourlySteps are included in dailyAct, whereas 13 cases (0.2%) are only included in dailyAct but not in hourlySteps
with the exception of those 13 cases, all TotalSteps values have been successfully recomputed from hourlySteps
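
The recomputation of daily totals from hourly counts can also be expressed without an explicit loop over rows. The following is a minimal sketch on toy data, not the code used above: `toyHourly`, `toyDaily`, and their values are hypothetical.

```r
# toy hourly and daily datasets sharing the IDday identifier (values invented)
toyHourly <- data.frame(IDday = c("s001_2019-04-01", "s001_2019-04-01", "s001_2019-04-02"),
                        StepTotal = c(100, 250, 80))
toyDaily <- data.frame(IDday = c("s001_2019-04-01", "s001_2019-04-02", "s001_2019-04-03"),
                       TotalSteps = c(350, 90, 500))

# sum of hourly steps per IDday, returned as a named numeric vector
sums <- vapply(split(toyHourly$StepTotal, toyHourly$IDday), sum, numeric(1))

# lookup by name: days without hourly data are returned as NA
toyDaily$TotalSteps2 <- unname(sums[as.character(toyDaily$IDday)])
toyDaily
```

In this toy case, the third day has no hourly data and gets NA, which corresponds to the cases marked with hourlySteps = FALSE above.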

4.1.1. Sanity checks

Here, we inspect the differences between automatically computed and manually recomputed TotalSteps.

# sanity check (how many different rows?)
dailyAct[is.na(dailyAct$TotalSteps2),"TotalSteps2"] <- dailyAct[is.na(dailyAct$TotalSteps2),"TotalSteps"] # interp. 13 cases
dailyAct$TotalSteps_diff <- dailyAct$TotalSteps - dailyAct$TotalSteps2 # computing difference between original and recomputed 
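# note: the counts below use a strict "<" whereas the percentages use "<=", so the two may not match exactly
cat(nrow(dailyAct[dailyAct$TotalSteps!=dailyAct$TotalSteps2,]),"differences (", # 1,308 diff (21.73%)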
    round(100*nrow(dailyAct[dailyAct$TotalSteps!=dailyAct$TotalSteps2,])/nrow(dailyAct),1),"% ) ranging from",
    min(dailyAct$TotalSteps_diff[dailyAct$TotalSteps_diff!=0]),"to",
    max(dailyAct$TotalSteps_diff[dailyAct$TotalSteps_diff!=0]),"steps\n-",
    nrow(dailyAct[dailyAct$TotalSteps_diff<10,]),"cases with a max difference of 10 or less steps (",
    round(100*nrow(dailyAct[dailyAct$TotalSteps_diff<=10,])/nrow(dailyAct),1),"% ) \n-",
    nrow(dailyAct[dailyAct$TotalSteps_diff<100,]),"cases with a max difference of 100 or less steps (",
    round(100*nrow(dailyAct[dailyAct$TotalSteps_diff<=100,])/nrow(dailyAct),1),"% )")

## 1308 differences ( 21.7 % ) ranging from 1 to 14220 steps
## - 4949 cases with a max difference of 10 or less steps ( 82.7 % ) 
## - 5606 cases with a max difference of 100 or less steps ( 93.2 % )

# plotting differences
par(mfrow=c(1,3))
hist(dailyAct$TotalSteps,breaks=100,main="Automatically scored \nTotalSteps per day",xlab="steps")
hist(dailyAct$TotalSteps2,breaks=100,main="Manually scored \nTotalSteps per day",xlab="steps")
hist(dailyAct[dailyAct$TotalSteps_diff!=0,"TotalSteps_diff"],breaks=100,
     main="differences between automatically\nvs. manually determined daily TotalSteps",xlab="steps")

Comments:

manually and automatically scored total steps per day differ in a substantial minority of cases (i.e., 22%)
all differences are positive, meaning that automatically scored steps are always equal to or higher than manually scored steps
most differences are close to zero (i.e., 80% < 200 steps), whereas a minority of them (6%) is higher than 1000 steps/day
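
When summarizing such differences, using the same comparison operator for both the count and the percentage keeps the two figures aligned. The sketch below shows one way to do this on toy values; the `summarizeDiff` helper and the `toyDiff` vector are hypothetical.

```r
# toy differences between automatically and manually scored steps (invented)
toyDiff <- c(0, 0, 3, 10, 50, 100, 1500)

# hypothetical helper: No. and percentage of cases with a difference <= threshold
summarizeDiff <- function(d, threshold) {
  n <- sum(d <= threshold)
  c(n = n, pct = round(100 * n / length(d), 1))
}

summarizeDiff(toyDiff, 10)  # n = 4, pct = 57.1
summarizeDiff(toyDiff, 100) # n = 6, pct = 85.7
```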

4.1.2. Saving dataset

Here, we save the aggregated dataset. Note that only manually recomputed TotalSteps are kept in the dailyAct dataset.

# replacing TotalSteps with the recomputed values, and removing TotalSteps2 and TotalSteps_diff
dailyAct$TotalSteps <- dailyAct$TotalSteps2
dailyAct$TotalSteps2 <- dailyAct$TotalSteps_diff <- NULL

# showing dataset (first 3 rows)
dailyAct[1:3,]

# saving dataset
save(dailyAct,file="DATA/datasets/dailyAct_aggregated.RData")

4.2. sleepEBE & classicEBE

Here, we integrate the 870 cases included in classicEBE and sleepLog, but not in sleepEBE, from the classicEBE into the sleepEBE dataset (saved as ClassicAndSleepLog cases, see section 2.5.2). This is done to facilitate the computation of sleep measures below. For consistency between the classicEBE value and the sleepEBE SleepStage variables, the former was recoded as 0 = wake and 1 = sleep (corresponding to SleepStage = light) (see section 3.5).

# ClassicAndSleepLog cases (N = 870)
ClassicAndSleepLog <- LogId_special[[4]]
length(ClassicAndSleepLog)

## [1] 870

cat("sanity check:",
    length(ClassicAndSleepLog)==length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId) &
                                                                  !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))

## sanity check: TRUE

# preparing datasets for aggregation
sleepEBE$SleepDataType <- "stages" # creating SleepDataType column to mark the data origin
classicEBE$SleepDataType <- "classic"
memory <- sleepEBE # saving current dataset for comparison
sleepEBE$LogId <- as.character(sleepEBE$LogId) # LogId back to character
classicEBE$LogId <- as.character(classicEBE$LogId)

Here, the 870 ClassicAndSleepLog cases are integrated within the sleepEBE dataset. Since classicEBE was recorded in 60-sec epochs whereas sleepEBE was recorded in 30-sec epochs, each classicEBE epoch is duplicated into two consecutive 30-sec epochs before merging.

# aggregating data
for(LOG in ClassicAndSleepLog){
  
  # selecting LOG-related epochs
  classicLog <- classicEBE[classicEBE$LogId==LOG,] 
  
  # duplicating each row, and adding 30 secs to each other epoch to have 30-sec epochs
  classicLog_dup <- classicLog[rep(1:nrow(classicLog), rep(2,nrow(classicLog))),]
  classicLog_dup$Time[seq(2,nrow(classicLog_dup),by=2)] <- classicLog_dup$Time[seq(2,nrow(classicLog_dup),by=2)] + 30 
  
  # changing column name from "value" to "SleepStage"
  colnames(classicLog_dup)[which(colnames(classicLog_dup)=="value")] <- "SleepStage"
  
  # merging
  sleepEBE <- rbind(sleepEBE,classicLog_dup[,c("ID","group","LogId","Time","SleepStage","ActivityDate","SleepDataType")]) }

# back to LogId as factor, and sorting data by ID, ActivityDate and Time
sleepEBE$LogId <- as.factor(sleepEBE$LogId)
classicEBE$LogId <- as.factor(classicEBE$LogId)
sleepEBE$SleepDataType <- as.factor(sleepEBE$SleepDataType)
sleepEBE <- sleepEBE[order(sleepEBE$ID,sleepEBE$ActivityDate,sleepEBE$Time),]
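
The duplication of 60-sec epochs into 30-sec epochs can be illustrated on a two-epoch toy log. This is a minimal sketch of the same row-duplication idea: the `toyClassic` data frame and its values are hypothetical.

```r
# toy classic log with two 60-sec epochs (values invented)
toyClassic <- data.frame(LogId = "A",
                         Time = as.POSIXct(c("2020-01-01 23:00:00",
                                             "2020-01-01 23:01:00"), tz = "GMT"),
                         value = c(0L, 1L))

# duplicating each row, then shifting every second copy by 30 secs
toyDup <- toyClassic[rep(seq_len(nrow(toyClassic)), each = 2), ]
second <- seq(2, nrow(toyDup), by = 2)
toyDup$Time[second] <- toyDup$Time[second] + 30
rownames(toyDup) <- NULL

toyDup # four 30-sec epochs: 23:00:00, 23:00:30, 23:01:00, 23:01:30
```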

4.2.1. Sanity checks

Here, we check whether our procedure effectively integrated ClassicAndSleepLog cases with sleepEBE data. This is done by using the same lines of code used in section 2.5.1.

# sanity check
data.frame(NstagesANDclassic_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
           Nstages_NOsleepLogORclassic=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
           Nclassic_NOsleepLogORstages=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]),
           NclassicANDsleepLogNostages=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))

# is the difference between the No. of cases in the original and aggregated datasets equal to the No. of ClassicAndSleepLog cases?
cat("sanity check:",(nrow(sleepEBE)-nrow(memory))==(nrow(classicEBE[classicEBE$LogId%in%ClassicAndSleepLog,])*2))

## sanity check: TRUE

# is the difference between the No. of LogIds in the original and aggregated datasets equal to the No. of ClassicAndSleepLog?
cat("sanity check:",(nlevels(sleepEBE$LogId)-nlevels(memory$LogId))==length(ClassicAndSleepLog))

## sanity check: TRUE

# new No. of sleepEBE LogId
cat("New No. of sleepEBE LogId:",nlevels(sleepEBE$LogId))

## New No. of sleepEBE LogId: 5442

Comments:

now, no more cases are included in classicEBE and sleepLog but not in sleepEBE, suggesting that data aggregation was effective
the new No. of sleepEBE LogId values is 5,442

Then, we compare the distributions of sleep and wake values in the original and aggregated cases.

par(mfrow=c(1,3))
plot(as.factor(sleepEBE[sleepEBE$SleepDataType=="stages","SleepStage"]))
plot(as.factor(sleepEBE[sleepEBE$SleepDataType=="classic","SleepStage"]))
plot(as.factor(gsub("2","1",gsub("3","1",sleepEBE[sleepEBE$SleepDataType=="classic","SleepStage"]))))

Comments:

the proportion of epochs characterized as wake (0) and sleep (1) is similar between the original sleepEBE data and the integrated ClassicAndSleepLog cases
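
The visual comparison above can also be expressed numerically. The sketch below uses made-up epoch codes (stagesEpochs and classicEpochs are hypothetical vectors, and toSleepWake is a helper introduced only for this illustration). It assumes that any nonzero SleepStage code denotes sleep.

```r
# toy SleepStage codes (hypothetical values): 0 = wake, 1 = light, 2 = deep, 3 = REM
stagesEpochs  <- c(0,1,1,2,2,3,1,0,1,3)
classicEpochs <- c(0,1,1,1,1,1,1,0,1,1)   # assumed coding: 0 = wake, 1 = sleep

# hypothetical helper: collapsing any sleep stage to 1 (sleep)
toSleepWake <- function(x){ as.integer(x != 0) }

# proportions of wake (0) and sleep (1) epochs in each data type
propStages  <- prop.table(table(toSleepWake(stagesEpochs)))
propClassic <- prop.table(table(toSleepWake(classicEpochs)))
round(rbind(stages=propStages,classic=propClassic),2)
```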

4.2.2. Saving dataset

Here, we save the aggregated dataset.

# saving dataset
save(sleepEBE,file="DATA/datasets/sleepEBEclassic_aggregated.RData")

4.3. sleepEBE & sleepLog

Here, we recompute summary sleep measures by using EBE data. This will be done separately for:

those cases included in both sleepLog and sleepEBE (N = 5,401), whose sleep measures are computed within the StartTime and EndTime boundaries of the combined sleep periods identified in section 2.3.3
those cases only included in sleepEBE (and classicEBE) (N = 41), whose sleep measures are computed using the LogId variable to identify the epochs belonging to the same sleep period
those cases only included in classicEBE (N = 336), whose sleep measures are computed using the LogId variable to identify the epochs belonging to the same sleep period

In each group of cases, sleep measures are computed in line with the definitions reported by Menghini et al. (2021a), using a modified version of the ebe2sleep R function from the associated public repository.


ebe2sleep <- function(SLEEPdata=NA,EBEdata=NA,epochLength=30,
                      idBased=FALSE,idCol="LogId", # new arguments added to consider TIB boundaries or LogId values
                      stagesCol="SleepStage",staging=TRUE,stages=c(wake=0,light=1,deep=2,REM=3),digits=2,
                      classicDataTypeCol=NA, classicStages=NA, # arguments added to include epochs from classicEBE
                      sleep.measures=c("EBEDataType","nEpochs","nFinalWake","missing_start","missing_middle",
                                       "TIB","TST","SE","SO","WakeUp","midSleep",
                                       "SOL","WASO","nAwake","fragIndex","light","deep","rem"),nAwake_min=5,
                      lastWake_exclude=TRUE,missing_asWake=TRUE){
  
  # 1. preparing data
  # ..............................................................................................................................
  colnames(EBEdata) <- gsub(idCol,"idCol",colnames(EBEdata)) # setting idCol
  EBEdata$idCol <- as.factor(as.character(EBEdata$idCol))
  if(idBased==FALSE){ SLEEPdata[,sleep.measures] <- NA} # target columns (default NA)  
  
  
  # 2. ebe2sleep function (modified from Menghini et al 2021a)
  # ..........................................................................................................................
  EBE2SLEEP <- function(data,staging,stages,nAwake_min2){
    # renaming variables
    colnames(data) <- gsub(stagesCol,"stages",colnames(data))
    data$stages <- as.integer(as.character(data$stages))
    
    # setting stages as 0 = wake, 1 = light, 2 = deep, 3 = REM
    if(staging==TRUE){ 
      data$stages <- as.integer(as.character(factor(data$stages,levels=as.numeric(stages),labels=c(0,1,2,3))))

      } else { # setting stages as 0 = wake, 1 = sleep when staging = FALSE
        if(length(stages)>2){ stop("only two elements should be used in the stages argument when staging = FALSE,
                                   e.g., stages = c(wake = 0, sleep = 1)") }
        data$stages <- as.integer(as.character(factor(data$stages,levels=as.numeric(stages),labels=c(0,1)))) }
  
    # function to compute sleep measures
    sleepMeasures <- function(data,nAwake_min3){
    
      # TIB = number of minutes between lights off and lights on
      TIB <- nrow(data)*epochLength/60
      # TST = number of minutes scored as sleep
      TST <- nrow(data[data$stages!=0,])*epochLength/60
      # SE = percentage of sleep time over TIB
      SE <- 100*TST/TIB
      # SOL = number of minutes scored as wake before the first epoch scored as sleep
      SOL = 0
      for(i in 1:nrow(data)){ if(data[i,"stages"]==0){ SOL = SOL + 1 } else { break } }
      SOL <- SOL*epochLength/60
      # WASO = number of minutes scored as wake after the first epoch scored as sleep
      WASO <- data[i:nrow(data),]
      WASO <- nrow(WASO[WASO$stages==0,])*epochLength/60
      # SO = time of the first sleep epoch
      SO = as.POSIXct(as.character(data[i,"Time"]),tz="GMT")
      # WakeUp = time of the first epoch of final wake
      if(tail(data$stages,1)!=0){ # when last epoch = sleep, WakeUp = last epoch + epochLength
        WakeUp <- as.POSIXct(as.character(tail(data$Time,1)),tz="GMT") + epochLength
      } else{
        for(i in nrow(data):1){ if(data[i,"stages"]!=0){ break } }
        WakeUp <- as.POSIXct(as.character(data[i,"Time"]),tz="GMT") }
      # midSleep = halfway points between SO and WakeUp
      midSleep <- SO + difftime(WakeUp,SO,units="secs")/2
      # nAwake = number of awakenings longer than nAwake_min3; stageShift = number of sleep stage shiftings (including wake)
      nAwake <- stageShift <- 0
      for(i in which(data$Time==SO):(nrow(data)-nAwake_min3*60/epochLength+1)){ if(i==1){ i <- i + 1 } 
        if(data[i-1,"stages"]!=0 & sum(data[i:(i+nAwake_min3*60/epochLength-1),"stages"])==0){ 
          nAwake <- nAwake + 1 }
        if(data[i,"stages"]!=data[i-1,"stages"]){ stageShift <- stageShift + 1 }} 
      # fragIndex = number of sleep stage shifting (including wake) per hour
      fragIndex <- stageShift/as.numeric(difftime(WakeUp,SO,units="hours"))
    
      if(staging==TRUE){
        
        # Light = number of minutes scored as Light sleep (N1 + N2)
        Light <- nrow(data[data$stages==1,])*epochLength/60
        # Deep = number of minutes scored as Deep sleep (N3)
        Deep <- nrow(data[data$stages==2,])*epochLength/60
        # REM = number of minutes scored as REM sleep
        REM <- nrow(data[data$stages==3,])*epochLength/60
        
        c("EBEDataType","nEpochs","nFinalWake","missing_start","missing_middle",
                                       "TIB","TST","SE","SO","WakeUp","midSleep",
                                       "SOL","WASO","nAwake","fragIndex","light","deep","rem")
        
        # sleep stages metrics when staging = TRUE
        return(data.frame(TIB=TIB,TST=TST,SE=SE,
                          SO=as.POSIXct(as.character(SO),tz="GMT"),WakeUp=as.POSIXct(as.character(WakeUp),tz="GMT"),
                          midSleep=midSleep,
                          SOL=SOL,WASO=WASO,nAwake=nAwake,fragIndex=fragIndex,
                          light=Light,deep=Deep,rem=REM))
      
        # only sleep/wake metrics when staging = FALSE
      } else{ return(data.frame(TIB=TIB,TST=TST,SE=SE,
                                SO=as.POSIXct(as.character(SO),tz="GMT"),WakeUp=as.POSIXct(as.character(WakeUp),tz="GMT"),
                                midSleep=midSleep,
                                SOL=SOL,WASO=WASO,nAwake=nAwake,fragIndex=fragIndex,
                                light=NA,deep=NA,rem=NA)) }}
    
      sleep.metrics <- sleepMeasures(data,nAwake_min3=nAwake_min2)
  
    # rounding values and returning dataset
    nums <- vapply(sleep.metrics, is.numeric, FUN.VALUE = logical(1))
    sleep.metrics[,nums] <- round(sleep.metrics[,nums], digits = digits)
    return(sleep.metrics)}
  
  # 3. # iteratively computing sleep measures for each considered sleep period
  # ..........................................................................................................................
  
  # 3.1. based on sleepLog TIB boundaries
  # ..........................................................................................
  if(idBased==FALSE){
    require(tcltk)
    pb <- tkProgressBar("Computing Sleep metrics:", "%",0, 100, 50) # progress bar
    rownames(SLEEPdata) <- 1:nrow(SLEEPdata)
    for(i in 1:nrow(SLEEPdata)){ info <- sprintf("%d%% done", round(which(rownames(SLEEPdata)==i)/nrow(SLEEPdata)*100))
      setTkProgressBar(pb, round(which(rownames(SLEEPdata)==i)/nrow(SLEEPdata)*100), sprintf("Computing sleep metrics: %s", info), info)
  
      # 3.1.1. Selecting EBE data within SLEEPdata boundaries
      ebe <- EBEdata[EBEdata$ID==SLEEPdata[i,"ID"] & # same ID & bounded between StartTime and EndTime
                       EBEdata$Time >= SLEEPdata[i,"StartTime"] & EBEdata$Time <= SLEEPdata[i,"EndTime"],]
      nEpochs <- nrow(ebe) # number of epochs
      if(nEpochs>0){
    
        # 3.1.2. excluding the last group of wake epochs or considering them as wake?
        nFinalWake <- 0
        if(lastWake_exclude==TRUE){
          SLEEPdata[i,"EndTime"] <- tail(ebe$Time,1) # updating EndTime
          if(ebe[nrow(ebe),"SleepStage"]==0){ # excluding final wake epochs
            for(j in nrow(ebe):1){ if(ebe[j,"SleepStage"]=="0"){ ebe <- ebe[1:(j-1),] } else{ break }}
            SLEEPdata[i,"EndTime"] <- ebe[nrow(ebe),"Time"] } # updating EndTime
          nFinalWake <- nEpochs - nrow(ebe) # counting excluded nFinalWake
          nEpochs <- nrow(ebe) }
      
        # 3.1.3. Counting missing epochs
        # (a) missing epochs at the beginning
        missing_start <- 0
        if(difftime(head(ebe$Time,1),SLEEPdata[i,"StartTime"],units="mins")!=0){ 
          missing_start <- as.numeric(difftime(head(ebe$Time,1),SLEEPdata[i,"StartTime"],units="mins"))*2 }
        # (b) missing epochs at the end (only if the last group of wake epochs is included)
        missing_end <- 0
        if(lastWake_exclude==FALSE){
          if(difftime(SLEEPdata[i,"EndTime"],tail(ebe$Time,1),units="mins")!=0){
            missing_end <- as.numeric(difftime(SLEEPdata[i,"EndTime"],tail(ebe$Time,1),units="mins"))*2 }}
        # (c) missing epochs in the middle (only for combined sleep periods)
        missing_middle <- 0
        
        # 2+ Logs ..............................................
        ebe$idCol <- as.factor(as.character(ebe$idCol))
        LOGs <- levels(ebe$idCol)
        if(length(LOGs)>1){ 
          # creating table with LogId and StartTime
          LOGtimes <- head(ebe[ebe$idCol==LOGs[1],c("idCol","Time")],1)
          for(LOGorder in 2:length(LOGs)){ 
            LOGtimes <- rbind(LOGtimes,head(ebe[ebe$idCol==LOGs[LOGorder],c("idCol","Time")],1)) }
          LOGtimes <- LOGtimes[order(LOGtimes$Time),] # sorting sleep periods by StartTime
          LOGtimes$newLog <- paste("LOG",1:nrow(LOGtimes),sep="") 
          # ebe$idCol <- as.character(ebe$idCol)
          for(LOGorder in 1:nrow(LOGtimes)){ 
            ebe[as.character(ebe$idCol)==LOGtimes[LOGorder,"idCol"],"newLog"] <- LOGtimes[LOGorder,"newLog"] }
          LOGs <- levels(as.factor(ebe$newLog))
          # computing missing_middle
          d <- 0
          for(k in 2:length(LOGs)){
            d <- d + as.numeric(difftime(head(ebe[ebe$newLog==LOGs[k],"Time"],1),
                                         tail(ebe[ebe$newLog==LOGs[k-1],"Time"],1),units="mins"))*2 }
          missing_middle <- missing_middle + d 
          
          # 3.1.4. Computing and adding sleep measures
          # .................................................................
        
            # Are both "classic" and "stage" cases included in EBEdata?
          if(!is.na(classicDataTypeCol)){
            dataTypes <- levels(as.factor(as.character(ebe[,classicDataTypeCol])))
            
              # a) 2+ Logs = "stages" --> staging = TRUE, stages = stages
            if(length(dataTypes)==1 & dataTypes[1] == "stages"){ EBEDataType = "stages"
              new.data <- EBE2SLEEP(data=ebe,staging=TRUE,stages=stages,nAwake_min2=nAwake_min)
              
              # b) 2+ Logs = "classic" --> staging = FALSE, stages = classicStages
            } else if(length(dataTypes)==1 & dataTypes[1] == "classic"){ EBEDataType = "classic"
              new.data <- EBE2SLEEP(data=ebe,staging=FALSE,stages=classicStages,nAwake_min2=nAwake_min)
              
              # c) both "stages" and "classic" sleep data types --> separately processing and summing "stages" and "classic"
            } else if(length(dataTypes)==2){ EBEDataType = "mixed"
              colnames(ebe) <- gsub(classicDataTypeCol,"dataType",colnames(ebe))
              ebe$dataType <- as.factor(as.character(ebe$dataType))
              new.data <- EBE2SLEEP(data=ebe[ebe$dataType=="stages",],staging=TRUE,stages=stages, # "stages" epochs
                                    nAwake_min2=nAwake_min) 
              new.data.classic <- EBE2SLEEP(data=ebe[ebe$dataType=="classic",],staging=FALSE,stages=classicStages, # "classic" epochs
                                            nAwake_min2=nAwake_min) 
              # updating variables
              new.data[,c("TIB","TST","WASO")] <- # summing sleep measures
                new.data[,c("TIB","TST","WASO")] + new.data.classic[,c("TIB","TST","WASO")] 
              new.data$SE <- round(100*new.data$TST/new.data$TIB,digits) # recomputing SE
              new.data$SOL <- ifelse(new.data$SO < new.data.classic$SO, new.data$SOL, new.data.classic$SOL) # first SOL
              new.data$SO <- min(new.data$SO,new.data.classic$SO) # first SO
              new.data$WakeUp <- max(new.data$WakeUp,new.data.classic$WakeUp) # last WakeUp
              new.data$midSleep <- new.data$SO + difftime(new.data$WakeUp,new.data$SO,units="secs")/2 # recomputing midSleep
              new.data$light <- new.data$deep <- new.data$rem <- NA } # invalid sleep stage durations
            
            # d) Unspecified classicDataTypeCol (i.e., only "classic" or "stages" cases) --> staging = staging, stages = stages
            } else { EBEDataType = ifelse(staging,"stages","classic")
              new.data <- EBE2SLEEP(data=ebe,staging=staging,stages=stages,nAwake_min2=nAwake_min) }
          
          # Only 1 Log ...........................................
        } else {
          
          # d) Unspecified classicDataTypeCol (i.e., only "classic" or "stages" cases) --> staging = staging, stages = stages
          if(is.na(classicDataTypeCol)){ EBEDataType = ifelse(staging,"stages","classic")
            new.data <- EBE2SLEEP(data=ebe,staging=staging,stages=stages,nAwake_min2=nAwake_min)
            
            # e) # Only 1 Log = "stages" --> staging = TRUE, stages = stages
          } else { EBEDataType = "stages"
            if(ebe[1,classicDataTypeCol]=="stages"){
              new.data <- EBE2SLEEP(data=ebe,staging=TRUE, stages=stages,nAwake_min2=nAwake_min)
              
              # f) Only 1 Log = "classic" --> staging = FALSE, stages = classicStages
            } else { EBEDataType = "classic"
              new.data <- EBE2SLEEP(data=ebe,staging=FALSE,stages=classicStages,nAwake_min2=nAwake_min) }}}
        
        SLEEPdata[i,sleep.measures] <- cbind(data.frame(EBEDataType=EBEDataType,nEpochs=nEpochs,nFinalWake=nFinalWake,
                                                        missing_start=missing_start,missing_middle=missing_middle),
                                             new.data)
      
        } else{ # when no epochs are identified in the sleep period -> nEpochs=0, sleep measures = NA
          SLEEPdata[i,"nEpochs"] <- 0 }}
    close(pb)
    
  } else { 
    # 3.2. based on LogId
    # ........................................................................................
    
    # setting empty data.frame
    sleep.measures <- c("ID","group","LogId",sleep.measures)
    SLEEPdata <- data.frame(matrix(ncol=length(sleep.measures),nrow=0))
    colnames(SLEEPdata) <- sleep.measures
    
    # iteratively computing sleep measures for each level of idCol
    for(ID in levels(EBEdata$idCol)){ 
      ebe <- EBEdata[EBEdata$idCol==ID,] # selecting data based on ID
      nEpochs <- nrow(ebe)
      
      # 3.2.1. excluding the last group of wake epochs?
      nEpochs <- nrow(ebe)
      nFinalWake <- 0
      if(lastWake_exclude==TRUE){
        for(j in nrow(ebe):1){ 
          if(ebe[j,stagesCol]=="0"){ ebe <- ebe[1:j,] } else { break }}
        nFinalWake = nEpochs - nrow(ebe)
        nEpochs <- nrow(ebe) }
      
      # Unspecified classicDataTypeCol (i.e., only "classic" or "stages" cases) --> staging = staging, stages = stages
      if(is.na(classicDataTypeCol)){ 
      EBEDataType = ifelse(staging,"stages","classic")
      new.data <- EBE2SLEEP(data=ebe,staging=staging,stages=stages,nAwake_min2=nAwake_min)
      
      # if both "classic" and "stages" cases are included
      } else {
      EBEDataType = head(ebe[,classicDataTypeCol],1)
      
      # if EBEDataType = "stages" --> staging = TRUE, stages = stages
      if(EBEDataType=="stages"){
        new.data <- EBE2SLEEP(data=ebe,staging=TRUE,stages=stages,nAwake_min2=nAwake_min)
        
        # if EBEDataType = "classic" --> staging = FALSE, stages = classicStages
      } else {
        new.data <- EBE2SLEEP(data=ebe,staging=FALSE,stages=classicStages,nAwake_min2=nAwake_min)
      }
    }
      # updating dataset
      SLEEPdata <- rbind(SLEEPdata,
                         cbind(data.frame(ID=head(ebe$ID,1),group=head(ebe$group,1),ActivityDate=head(ebe$ActivityDate,1),LogId=ID,
                                          StartTime=head(ebe$Time,1),EndTime=tail(ebe$Time,1),
                                          EBEDataType=EBEDataType,nEpochs=nEpochs,nFinalWake=nFinalWake,
                                          missing_start=NA,missing_middle=NA),
                                          new.data)) }}
  
  # WakeUp and midSleep as POSIXct
  SLEEPdata$WakeUp <- as.POSIXct(SLEEPdata$WakeUp,origin="1970-01-01",tz="GMT")
  SLEEPdata$midSleep <- as.POSIXct(SLEEPdata$midSleep,origin="1970-01-01",tz="GMT")
  
  # 4. Recomputing sleep metrics by considering missing epochs as wake? (only when idBased = FALSE)
  # ..........................................................................
  if(idBased==FALSE & missing_asWake==TRUE){
    SLEEPdata$TIB <- SLEEPdata$TIB +  SLEEPdata$missing_middle*epochLength/60 + 
      SLEEPdata$missing_start*epochLength/60 # TIB = tot No.epochs + missing_middle + missing_start
    SLEEPdata$EndTime <- SLEEPdata$EndTime - 60*missing_end*epochLength/60 # EndTime is recoded by removing nFinalWake
    SLEEPdata$SE <- round(100*SLEEPdata$TST/SLEEPdata$TIB,2) # recomputing SE
    SLEEPdata$SOL <- SLEEPdata$SOL + SLEEPdata$missing_start*epochLength/60 # SOL = SOL + No. missing epochs at the beginning
    SLEEPdata$SO <- as.POSIXct(SLEEPdata$SO,origin="1970-01-01",tz="GMT") # SO from timestamp code to date and hour
    SLEEPdata$WASO <- SLEEPdata$WASO + SLEEPdata$missing_middle*epochLength/60 # WASO = WASO + No. missing_middle
    SLEEPdata$nAwake <- SLEEPdata$nAwake + ifelse(SLEEPdata$missing_middle>0,1,0)} # adding 1 nAwake to each row with missing_middle > 0
  
  return(SLEEPdata) }
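
To make the definitions used by the function concrete, here is a minimal sketch that computes the core sleep/wake measures on a made-up vector of ten 30-sec epochs. It is a vectorized illustration under the same coding (0 = wake, any other value = sleep) and does not reproduce the loop-based implementation above. The object names (stages, firstSleep) are introduced only for this example.

```r
epochLength <- 30                    # epoch length (secs)
stages <- c(0,0,1,1,2,0,0,3,1,1)     # toy epochs: 0 = wake, 1 = light, 2 = deep, 3 = REM

firstSleep <- which(stages != 0)[1]  # index of the first epoch scored as sleep

TIB  <- length(stages)*epochLength/60       # time in bed (min)
TST  <- sum(stages != 0)*epochLength/60     # total sleep time (min)
SE   <- 100*TST/TIB                         # sleep efficiency (%)
SOL  <- (firstSleep - 1)*epochLength/60     # wake before the first sleep epoch (min)
WASO <- sum(stages[firstSleep:length(stages)] == 0)*epochLength/60 # wake after sleep onset (min)

c(TIB=TIB,TST=TST,SE=SE,SOL=SOL,WASO=WASO)
```

In this example, SOL, TST, and WASO add up to TIB (1 + 3 + 1 = 5 min), which is a quick consistency check on the definitions.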

4.3.1. sleepLog & sleepEBE

Here, we use the boundaries of the combined sleep periods identified in section 2.2.1 to generate a dataset of sleep measures from raw EBE data. The following parameters are set:

missing epochs, either at the beginning, at the end, or in the middle (i.e., 61 cases of combined sleep periods), are considered as wake epochs
the last group of wake epochs is excluded from the computation of TIB and the other sleep measures (i.e., not considered as WASO)
the first group of wake epochs is included and considered as SOL
only sleep/wake measures (and not sleep stages measures) are computed from the 566 cases integrated from classicEBE
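
The effect of considering missing epochs as wake can be sketched with simple arithmetic. The numbers below are made up, and the adjustments mirror step 4 of the ebe2sleep function: missing epochs are added to TIB, those at the beginning extend SOL, and those in the middle extend WASO.

```r
# toy summary values for one sleep period (hypothetical numbers, in minutes unless noted)
epochLength    <- 30   # epoch length (secs)
TIB            <- 400  # time in bed from the available epochs
TST            <- 360  # total sleep time
SOL            <- 10   # sleep onset latency
WASO           <- 30   # wake after sleep onset
missing_start  <- 1    # No. of missing epochs at the beginning
missing_middle <- 20   # No. of missing epochs between two combined sleep periods

# missing epochs are added to TIB and counted as wake
TIB_adj  <- TIB + (missing_start + missing_middle)*epochLength/60
SOL_adj  <- SOL + missing_start*epochLength/60    # missing epochs at the beginning extend SOL
WASO_adj <- WASO + missing_middle*epochLength/60  # missing epochs in the middle extend WASO
SE_adj   <- round(100*TST/TIB_adj,2)              # SE is recomputed on the longer TIB

c(TIB=TIB_adj,SOL=SOL_adj,WASO=WASO_adj,SE=SE_adj)
```

TST is left unchanged, so SE can only decrease when missing epochs are counted as wake.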

# running function
sleepLog <- ebe2sleep(SLEEPdata=sleepLog,EBEdata=sleepEBE, # sleepLog and EBE data
                      idCol="LogId",idBased=FALSE, # not based on LogId but on sleepLog TIB limits
                      stagesCol="SleepStage",staging=TRUE,stages=c(wake=0,light=1,deep=2,REM=3),digits=2, # sleep staging info
                      classicDataTypeCol="SleepDataType",classicStages=c(wake=0,sleep=1), # new arguments to include Classic cases
                      epochLength=30, # epoch length (secs)
                      sleep.measures=c("EBEDataType","nEpochs","nFinalWake","missing_start","missing_middle",
                                       "TIB","TST","SE","SO","WakeUp","midSleep",
                                       "SOL","WASO","nAwake","fragIndex","light","deep","rem"), # sleep.measures to be computed
                      nAwake_min=5, # minimum minutes of wake epochs to count nAwake
                      lastWake_exclude=TRUE, # excluding last epochs of wake?
                      missing_asWake=TRUE) # considering missing values as wake

## Loading required package: tcltk

4.3.2. uniqueEBElogs

Then, we use the same function to compute sleep metrics from the 41 cases of sleepEBE not included in sleepLog data. For these cases, we use the LogId variable to identify the epochs belonging to the same sleep period.

# selecting cases of uniqueEBElogs
uniqueEBElogs <- LogId_special[[2]]
length(uniqueEBElogs) # 41

## [1] 41

# computing sleep measures
(sleepLog.uniqueEBE <- ebe2sleep(SLEEPdata=sleepLog,EBEdata=sleepEBE[sleepEBE$LogId%in%uniqueEBElogs,], # selecting uniqueEBElogs
                                 idBased=TRUE,idCol="LogId", epochLength=30, # based on LogId rather than sleepLog TIB
                                 stagesCol="SleepStage"))[1:3,] # showing first 3 lines

Then, as done for sleepLog in section 2.3.4, we update the ActivityDate variable so that it indicates the previous day when the StartTime is between 00:00 and 06:00. This makes it easier to distinguish between consecutive nocturnal sleep periods.

library(lubridate)
sleepLog.uniqueEBE$StartHour <- as.POSIXct(paste(hour(sleepLog.uniqueEBE$StartTime), # computing StartHour
                                                 minute(sleepLog.uniqueEBE$StartTime)), format = "%H %M",tz="GMT") 

# setting times
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h06 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")

# updating ActivityDate
sleepLog.uniqueEBE[sleepLog.uniqueEBE$StartHour >= h00 & sleepLog.uniqueEBE$StartHour <= h06,"ActivityDate"] <-
  sleepLog.uniqueEBE[sleepLog.uniqueEBE$StartHour >= h00 & sleepLog.uniqueEBE$StartHour <= h06,"ActivityDate"] - 1

# sanity check: no duplicated ID-ActivityDate combinations (0)
sleepLog.uniqueEBE$StartHour <- NULL
which(duplicated(paste(sleepLog.uniqueEBE$ID,sleepLog.uniqueEBE$ActivityDate,sep="_"))==TRUE)

## integer(0)

sleepLog.uniqueEBE[1:3,] # showing first 3 rows

Comments:

the ActivityDate variable has been effectively recoded
none of the 41 cases shares the same ID and ActivityDate values with another sleep period
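
The ActivityDate recoding can be illustrated on toy data with base R only. This is a sketch under stated assumptions: the StartTime values are made up, ActivityDate is assumed to be of class Date, and the condition is written on the hour of StartTime (hour < 6), which, unlike the chunk above, does not include a StartTime of exactly 06:00:00.

```r
# toy StartTime values (hypothetical)
StartTime <- as.POSIXct(c("2019-03-02 23:40:00","2019-03-03 00:30:00","2019-03-04 05:59:30"),tz="GMT")
ActivityDate <- as.Date(StartTime,tz="GMT") # calendar date of StartTime

# sleep periods starting between 00:00 and 05:59 are assigned to the previous day
startHour <- as.integer(format(StartTime,"%H"))
early <- startHour < 6
ActivityDate[early] <- ActivityDate[early] - 1

data.frame(StartTime,ActivityDate)
```

Here, the second and third sleep periods are moved back by one day, while the first one keeps its original date.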

4.3.3. uniqueClassiclogs

Then, we use the same function to compute sleep metrics from the 336 cases of classicEBE not included in sleepLog data. For these cases, we use the LogId variable to identify the epochs belonging to the same sleep period. Note that with classicEBE the epoch length should be set at 60 seconds.

# selecting cases of uniqueClassiclogs
uniqueClassiclogs <- LogId_special[[3]]
length(uniqueClassiclogs) # 336

## [1] 336

# computing sleep measures
(sleepLog.uniqueClassic  <- ebe2sleep(SLEEPdata=sleepLog,EBEdata=classicEBE[classicEBE$LogId%in%uniqueClassiclogs,], # uniqueClassiclogs
                                      idBased=TRUE,idCol="LogId",epochLength=60, # epochLength = 60 sec
                                      staging=FALSE,stages=c(wake=0,sleep=1), # staging = FALSE
                                      stagesCol="value"))[1:3,] # showing first 3 lines

sleepLog.uniqueClassic$StartHour <- as.POSIXct(paste(hour(sleepLog.uniqueClassic$StartTime), # computing StartHour
                                                 minute(sleepLog.uniqueClassic$StartTime)), format = "%H %M",tz="GMT") 

# setting times
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h06 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")

# updating ActivityDate
sleepLog.uniqueClassic[sleepLog.uniqueClassic$StartHour >= h00 & sleepLog.uniqueClassic$StartHour <= h06,"ActivityDate"] <-
  sleepLog.uniqueClassic[sleepLog.uniqueClassic$StartHour >= h00 & sleepLog.uniqueClassic$StartHour <= h06,"ActivityDate"] - 1

# sanity check: duplicated ID-ActivityDate combinations (2 cases of diurnal naps)
sleepLog.uniqueClassic$StartHour <- NULL
which(duplicated(paste(sleepLog.uniqueClassic$ID,sleepLog.uniqueClassic$ActivityDate,sep="_"))==TRUE)

## [1] 66 74

sleepLog.uniqueClassic$IDday <- as.factor(paste(sleepLog.uniqueClassic$ID,sleepLog.uniqueClassic$ActivityDate,sep="_"))
sleepLog.uniqueClassic[sleepLog.uniqueClassic$IDday %in%
                         as.character(sleepLog.uniqueClassic[duplicated(sleepLog.uniqueClassic$IDday),"IDday"]),]

sleepLog.uniqueClassic[1:3,] # showing first 3 rows

Comments:

the ActivityDate variable has been effectively recoded
in two cases, two sleep periods share the same ID and ActivityDate values, but these cases will be removed later since they are diurnal sleep periods (see the data cleaning section)
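
Note that duplicated() only flags the second and later occurrences of a value. A minimal sketch on made-up IDday values (hypothetical identifiers) shows one way to flag every member of a duplicated pair by combining it with the fromLast argument.

```r
# toy IDday values (hypothetical)
IDday <- c("s001_2019-03-02","s001_2019-03-03","s001_2019-03-03","s002_2019-03-02")

# duplicated() alone only marks the later occurrence
which(duplicated(IDday))

# combining both directions marks all sleep periods sharing the same ID and ActivityDate
dupAll <- duplicated(IDday) | duplicated(IDday,fromLast=TRUE)
which(dupAll)
```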

4.3.4. Sanity checks

Here, we inspect the amount of missing data in the generated dataset, and we visually compare the distributions of the obtained sleep metrics with those automatically recorded in Fitabase, and with those uniquely included in EBE data (not in sleepLog data).

MISSING DATA

First, we inspect the sleep metrics obtained by considering all epochs between the StartTime and EndTime recoded from sleepLog data.

# plotting
par(mfrow=c(2,2))
hist(sleepLog$nEpochs,breaks=30,main="No. of included nonmissing epochs",xlab="")
hist(sleepLog$nFinalWake,breaks=30,main="No. of final wake epochs\n(excluded)",xlab="")
hist(sleepLog$missing_start,breaks=30,main="No. of missing epochs at the beginning\n(included as wake)",xlab="")
hist(sleepLog$missing_middle,breaks=30,main="No. of missing epochs in the middle\n(included as wake)",xlab="")

Comments and details:

the distribution of the No. of nonmissing epochs between sleepLog StartTime and EndTime has a shape similar to that of the sleepLog-based TimeInBed, ranging from 118 epochs (59 min) to 1,888 epochs (15.7 hours). All cases have nEpochs > 0.

# summary of nonmissing epochs
summary(sleepLog$nEpochs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   118.0   809.0   931.0   923.3  1049.0  1888.0

# No. of cases with missing nEpochs (0)
cat(nrow(sleepLog[is.na(sleepLog$nEpochs),]),"cases with missing nEpochs")

## 0 cases with missing nEpochs

# No. of sleepLog data with no corresponding sleepEBE data (0)
cat(nrow(sleepLog[sleepLog$nEpochs==0,]),"cases with zero nEpochs (i.e., sleepLog cases with no corresponding EBE data)")

## 0 cases with zero nEpochs (i.e., sleepLog cases with no corresponding EBE data)

the No. of wake/missing epochs at the end (excluded from TIB) ranges from 0 (36.2%) to 123 (61.5 min with 30-sec epochs), with a mean of 8.7 excluded final wake epochs (about 4.3 min)

# No. of cases with NO final wake epochs (1761, 36.2%)
cat(nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$nFinalWake==0,]),"cases (",
    round(100*nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$nFinalWake==0,])/nrow(sleepLog[sleepLog$nEpochs!=0,]),1),
    "% ) with NO final wake epochs\n\nSummary of nFinalWake:")

## 1761 cases ( 36.2 % ) with NO final wake epochs
## 
## Summary of nFinalWake:

summary(sleepLog[sleepLog$nEpochs!=0,"nFinalWake"]) # nFinalWake (max 123 epochs, i.e., 61.5 min)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   3.000   8.679  13.000 123.000

the No. of missing epochs in the middle (included in TIB, and considered as WASO) ranges from 0 (98.7%) to 535 (4.5h). All cases with missing epochs > 0 (N = 61) are cases of combined sleep periods

# No. of missing epochs in the middle > 0 (61)
cat(nrow(sleepLog[!is.na(sleepLog$missing_middle) & sleepLog$missing_middle>0,]),"cases (",
    round(100*nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_middle>0,])/nrow(sleepLog[sleepLog$nEpochs!=0,]),1),
    "% ) with 1 or more missing epochs in the middle (i.e., considered as wake)\n",
    "of which",nrow(sleepLog[sleepLog$missing_middle>0 & sleepLog$combined==TRUE,]),"(",
    round(100*nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_middle>0 & sleepLog$combined==TRUE,])/
            nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_middle>0,]),1),
    "% ) are cases of combined sleep periods\n\nSummary of missing_middle:")

## 61 cases ( 1.3 % ) with 1 or more missing epochs in the middle (i.e., considered as wake)
##  of which 61 ( 100 % ) are cases of combined sleep periods
## 
## Summary of missing_middle:

summary(sleepLog[sleepLog$missing_middle>0,"missing_middle"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   136.0   163.0   188.1   248.0   535.0

the No. of missing epochs at the beginning (included in TIB and considered as SOL) is either 0 (79.7%) or 1 (20.2%) in all cases, with the exception of a single case with 308 missing epochs (154 min) at the beginning (LogId = 24848200932, i.e., a case of combined sleep periods in which the first sleep period was not included in sleepEBE or classicEBE).

# missing epochs at the beginning > 1 
miN <- min(sleepLog$missing_start,na.rm=TRUE)
miN2 <- min(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_start>miN,"missing_start"],na.rm=TRUE)
cat(nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_start==miN,]),"cases with",min(sleepLog$missing_start,na.rm=TRUE), 
    "missing epochs at the beginning","(",round(100*nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$missing_start==miN,])/
                                                  nrow(sleepLog[sleepLog$nEpochs!=0 & sleepLog$nEpochs!=0,]),1),
    "% )\nWithout these cases, the min No. of missing epochs at the beginning would be",miN2,
    " (",nrow(sleepLog[sleepLog$missing_start==miN2,]),
    "cases,",round(100*nrow(sleepLog[sleepLog$missing_start==miN2,])/
                    nrow(sleepLog[sleepLog$nEpochs>0,]),1),
    "% )\nwith",nrow(sleepLog[sleepLog$nEpochs>0 & sleepLog$missing_start>miN2,]),
    "case showing more than",miN2,"missing epochs at the beginning (i.e., =",
    sleepLog[sleepLog$nEpochs>0 & sleepLog$missing_start>miN2,"missing_start"],"missing epochs)")

## 3879 cases with 0 missing epochs at the beginning ( 79.7 % )
## Without these cases, the min No. of missing epochs at the beginning would be 1  ( 984 cases, 20.2 % )
## with 1 case showing more than 1 missing epochs at the beginning (i.e., = 308 missing epochs)

Here, we take a closer look at the case with LogId marked as uniqueLogId (24848200932): a case of combined sleep periods in which only the second period (and thus, the second LogId) was included in sleepEBE data. Thus, only for this specific case, we do not consider the first period, and we recompute the sleep measures based on sleepEBE data only.

# what happened to that case marked as uniqueLogId? -> it was a case of combined sleep periods (the first part was not included)
(uniqueLogId <- LogId_special[[1]]) # case 24848200932

## [1] "24848200932"

sleepLog[sleepLog$LogId==uniqueLogId,c("ID","LogId","StartTime","EndTime","TimeInBed","TIB","combined","missing_start")]

combCase <- levels(as.factor(as.character(sleepEBE[sleepEBE$ID=="s056" & # case 24848200933
                                                     sleepEBE$Time>=sleepLog[sleepLog$LogId==uniqueLogId,"StartTime"] &
                                         sleepEBE$Time<=sleepLog[sleepLog$LogId==uniqueLogId,"EndTime"],"LogId"])))

# removing uniqueLogId from sleepLog dataset
sleepLog <- sleepLog[sleepLog$LogId!=uniqueLogId,]

# adding case to sleepLog.uniqueEBE
sleepLog.uniqueEBE <- rbind(sleepLog.uniqueEBE,
                            ebe2sleep(EBEdata=sleepEBE[sleepEBE$LogId=="24848200933",], # selecting case 24848200933
                                      idBased=TRUE,idCol="LogId", epochLength=30))

# re-plotting missing data after the adjustment
par(mfrow=c(2,2))
hist(sleepLog$nEpochs,breaks=30,main="No. of included nonmissing epochs",xlab="")
hist(sleepLog$nFinalWake,breaks=30,main="No. of final wake epochs\n(excluded)",xlab="")
hist(sleepLog$missing_start,breaks=30,main="No. of missing epochs at the beginning\n(included as wake)",xlab="")
hist(sleepLog$missing_middle,breaks=30,main="No. of missing epochs in the middle\n(included as wake)",xlab="")

Comments:

now, the maximum No. of missing epochs at the beginning is 1

Finally, we take a closer look at the cases with missing_middle > 0 or nFinalWake > 0.

# plotting the distributions of nFinalWake > 0 and missing_middle > 0
par(mfrow=c(1,2))
hist(sleepLog[sleepLog$nFinalWake>0,"nFinalWake"],breaks=30,
     main="No. of excluded final wake epochs\nhigher than 0",xlab="")
hist(sleepLog[sleepLog$missing_middle>0,"missing_middle"],
     breaks=30,main="No. of missing epochs in the middle\nhigher than 0",xlab="")

TIB check

Here, we inspect the differences between TIB values (computed from the No. of available epochs) and the difference between sleepLog EndTime and StartTime. Note that, by setting the argument lastWake_exclude = TRUE, the function automatically updated the EndTime values by setting them to the Time value of the last sleep epoch. Thus, the two variables should match almost perfectly.

# recomputing TIB as the difference (in min) between EndTime and StartTime values
sleepLog$TIB_r <- as.numeric(difftime(sleepLog$EndTime,sleepLog$StartTime,units="mins"))

# computing differences between TIB_r and TIB
sleepLog$TIB_diff <- sleepLog$TIB_r - sleepLog$TIB

# plotting
hist(sleepLog$TIB_diff,breaks=100,xlab="",main="EndTime-StartTime difference - TIB values")

# summarizing (all negative differences from -1.5 to -0.5)
summary(sleepLog$TIB_diff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5000 -0.5000 -0.5000 -0.5064 -0.5000 -0.5000

The same is done for the difference between sleepLog WakeUp and StartTime.

# recomputing TIB as the difference (in min) between WakeUp and StartTime values
sleepLog$TIB_r2 <- as.numeric(difftime(sleepLog$WakeUp,sleepLog$StartTime,units="mins"))

# computing differences between TIB_r2 and TIB
sleepLog$TIB_diff2 <- sleepLog$TIB_r2 - sleepLog$TIB

# plotting
hist(sleepLog$TIB_diff2,breaks=100,xlab="",main="WakeUp-StartTime difference - TIB values")

# summarizing (only a few negative differences from -1 to -0.5)
summary(sleepLog$TIB_diff2)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.000000  0.000000  0.000000 -0.006375  0.000000  0.000000

# checking that all negative differences are cases of combined sleep periods
cat("All combined:",nrow(sleepLog[sleepLog$TIB_diff2<0 & sleepLog$combined==FALSE,])==0)

## All combined: TRUE

Finally, we inspect the differences between WakeUp - StartTime and EndTime - StartTime, as well as the differences between WakeUp and EndTime.

# computing differences between TIB_r2 and TIB_r
sleepLog$TIB_diff3 <- sleepLog$TIB_r2 - sleepLog$TIB_r
# note: despite its name, WakeUpMINUSStartTime stores the WakeUp - EndTime difference (in min)
sleepLog$WakeUpMINUSStartTime <- as.numeric(difftime(sleepLog$WakeUp,sleepLog$EndTime,units="mins"))

# plotting
par(mfrow=c(2,1))
hist(sleepLog$TIB_diff3,breaks=100,xlab="",main="WakeUp-StartTime - EndTime-StartTime differences")
hist(sleepLog$WakeUpMINUSStartTime,breaks=100,xlab="",main="WakeUp-EndTime differences")

# summarizing (all differences are equal to 0.5 min)
summary(sleepLog$TIB_diff3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.5     0.5     0.5     0.5     0.5     0.5

summary(sleepLog$WakeUpMINUSStartTime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.5     0.5     0.5     0.5     0.5     0.5

Comments:

all differences between EndTime - StartTime and TIB are negative, ranging from -1.5 to -0.5 min
in contrast, only 61 differences between WakeUp - StartTime and TIB are negative (from -1 to -0.5 min); these are all cases of combined sleep periods
all differences between WakeUp - StartTime and EndTime - StartTime are equal to 0.5 min; consistently, WakeUp is always 30 sec after EndTime, as expected, since WakeUp indicates the first wake epoch whereas EndTime indicates the last sleep epoch (i.e., the final wake epochs were excluded)
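
The 0.5-min offsets can be reproduced on a toy example. The sketch below assumes that TIB is obtained by multiplying the No. of epochs by the epoch length, and that EndTime is the start time of the last included epoch; both are assumptions about how ebe2sleep works, and the timestamps are made up.

```r
# toy sleep period made of five 30-sec epochs (hypothetical values, not study data)
StartTime <- as.POSIXct("2020-01-01 23:00:00", tz = "UTC")
epochTimes <- StartTime + 30 * (0:4)     # start time of each epoch

# TIB from the No. of epochs (assumed logic): 5 epochs x 0.5 min
TIB <- length(epochTimes) * 30 / 60

# EndTime taken as the start time of the last epoch
EndTime <- max(epochTimes)
TIB_r <- as.numeric(difftime(EndTime, StartTime, units = "mins"))

# the clock-based value is one epoch (0.5 min) shorter than the epoch-based one
TIB_diff <- TIB_r - TIB

# adding one epoch length (30 sec) to EndTime removes the offset
TIB_fixed <- as.numeric(difftime(EndTime + 30, StartTime, units = "mins")) - TIB
```

With these assumptions, TIB_diff is -0.5 and TIB_fixed is 0, which mirrors the adjustment applied to EndTime below.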

Here, we remove these small discrepancies by matching EndTime with WakeUp time (i.e., by adding 30 sec to the former), and by matching TIB values with EndTime - StartTime differences.

# adding 30 sec to each EndTime
sleepLog$EndTime <- sleepLog$EndTime + 30

# doing the same for sleepLog.uniqueEBE and sleepLog.uniqueClassic
sleepLog.uniqueEBE$EndTime <- sleepLog.uniqueEBE$EndTime + 30
sleepLog.uniqueClassic$EndTime <- sleepLog.uniqueClassic$EndTime + 30

# recomputing TIB and SE
sleepLog$TIB <- as.numeric(difftime(sleepLog$EndTime,sleepLog$StartTime,units="mins"))
sleepLog$SE <- 100*sleepLog$TST/sleepLog$TIB

# removing check variables
sleepLog$TIB_r <- sleepLog$TIB_r2 <- sleepLog$TIB_diff <- sleepLog$TIB_diff2 <- sleepLog$TIB_diff3 <-
  sleepLog$WakeUpMINUSStartTime <- NULL

# sanity checks (all zero)
cat("-",nrow(sleepLog[difftime(sleepLog$WakeUp,sleepLog$EndTime)!=0,]),"differences between WakeUp and EndTime",
    "\n-",nrow(sleepLog[sleepLog$TIB-as.numeric(difftime(sleepLog$EndTime,sleepLog$StartTime,units="mins"))!=0,]),
    "differences between TIB and EndTime-StartTime differences",
    "\n-",nrow(sleepLog[sleepLog$TIB-as.numeric(difftime(sleepLog$WakeUp,sleepLog$StartTime,units="mins"))!=0,]),
    "differences between TIB and WakeUp-StartTime differences")

## - 0 differences between WakeUp and EndTime 
## - 0 differences between TIB and EndTime-StartTime differences 
## - 0 differences between TIB and WakeUp-StartTime differences

Comments:

now, WakeUp and EndTime perfectly match, both identifying the time of the first wake epoch (“lights-on”)
similarly, TIB values now perfectly match with the differences between EndTime and StartTime

EBE vs SleepLog

Here, we visualize the differences between the distribution of sleep metrics obtained from sleepEBE data and that of sleep measures automatically recorded in Fitabase (sleepLog).

TIB

Here, we compare the distribution of TIB values computed from sleepEBE with that of TimeInBed values, computed as sleepLog EndTime - StartTime.

# plotting TIB distributions
hist(sleepLog$TimeInBed/60,col="yellow",breaks=35,xlab="TIB (hours)",
     main=paste("SleepLog- (yellow; max TIB =",round(max(sleepLog$TimeInBed/60),1),
                "hours) \nand EBE-derived Time in Bed (red; max TIB =",round(max(sleepLog$TIB/60,na.rm=TRUE),1),"hours)"))
hist(sleepLog$TIB/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)

Then, we replicate the plot by focusing on cases with no missing data.

# Excluding combined cases
par(mfrow=c(1,3))
hist(sleepLog[sleepLog$combined==FALSE,"TimeInBed"]/60,col="yellow",breaks=35,xlab="TIB (hours)",
     main="excluding combined cases")
hist(sleepLog[sleepLog$combined==FALSE,"TIB"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding cases with nFinalWake > 0
hist(sleepLog[sleepLog$nFinalWake==0,"TimeInBed"]/60,col="yellow",breaks=35,xlab="TIB (hours)",
     main="excluding cases with nFinalWake > 0")
hist(sleepLog[sleepLog$nFinalWake==0,"TIB"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding EBEDataType = "classic" sleep
hist(sleepLog[sleepLog$EBEDataType!="classic","TimeInBed"]/60,col="yellow",breaks=35,xlab="TIB (hours)",
     main="excluding cases with \n'classic' EBEDataType")
hist(sleepLog[sleepLog$EBEDataType!="classic","TIB"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)

# plotting TIB differences
par(mfrow=c(2,2))
diff <- sleepLog$TimeInBed-sleepLog$TIB
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
     main=paste("Differences Log-based - EBE-based TIB (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$nFinalWake==0,"TimeInBed"] - sleepLog[sleepLog$nFinalWake==0,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
     main=paste("Excluding cases of missing data at the end\nmin=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"TimeInBed"] - sleepLog[sleepLog$combined==FALSE,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
     main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TimeInBed"] -
  sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
     main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))

Finally, we inspect the non-combined cases with no missing_start and no nFinalWake epochs whose TIB nonetheless differs by more than 1 hour from sleepLog TimeInBed. There are just two such cases, whose sleep measures are recomputed based on EBE data only.

# selecting non-combined cases with missing_start = 0 and nFinalWake = 0, but TIB diff > 60 min
(LogId <- as.character(sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0 &
                                  sleepLog$TimeInBed - sleepLog$TIB > 60,"LogId"]))

## [1] "21318629605" "26400046971"

# removing the two selected cases from the sleepLog dataset
sleepLog <- sleepLog[!(sleepLog$LogId%in%LogId),]

# showing EBEDataType (first case: stages, second case: classic)
rbind(head(sleepEBE[sleepEBE$LogId==LogId[1],],1),tail(sleepEBE[sleepEBE$LogId==LogId[1],],1),
      head(sleepEBE[sleepEBE$LogId==LogId[2],],1),tail(sleepEBE[sleepEBE$LogId==LogId[2],],1))

# adding cases to sleepLog.uniqueEBE
specialCases <- rbind(ebe2sleep(EBEdata=sleepEBE[sleepEBE$LogId%in%LogId[1],],idBased=TRUE,idCol="LogId",
                                epochLength=30,stagesCol="SleepStage"),
                      ebe2sleep(EBEdata=classicEBE[classicEBE$LogId%in%LogId[2],],idBased=TRUE,idCol="LogId",
                                epochLength=60,staging=FALSE,stages=c(wake=0,sleep=1),stagesCol="value"))
specialCases[2,"ActivityDate"] <- specialCases[2,"ActivityDate"] - 1 # updating ActivityDate when StartTime > midnight 
sleepLog.uniqueEBE <- rbind(sleepLog.uniqueEBE,specialCases)
sleepLog.uniqueEBE <- sleepLog.uniqueEBE[order(sleepLog.uniqueEBE$ID,sleepLog.uniqueEBE$ActivityDate,
                                               sleepLog.uniqueEBE$StartTime),]

# re-plotting TIB differences
par(mfrow=c(2,2))
diff <- sleepLog$TimeInBed-sleepLog$TIB
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
     main=paste("Differences Log-based - EBE-based TIB (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$nFinalWake==0,"TimeInBed"] - sleepLog[sleepLog$nFinalWake==0,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
     main=paste("Excluding cases of missing data at the end\nmin=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"TimeInBed"] - sleepLog[sleepLog$combined==FALSE,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
     main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TimeInBed"] -
  sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TIB"]
hist(diff,breaks=55,xlab="TIB differences (min)",cex.main=.7,
     main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))

Comments:

now, all substantial differences between sleepLog and sleepEBE TIB (i.e., 2,353 differences > 1 min) are due to cases of wake/missing epochs at the end
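
The difference plots in this section rebuild the same title (min, max, mean, median) for every panel. As a possible simplification, the sketch below wraps that logic in two helpers; diffTitle and plotDiff are hypothetical names that do not appear in the original script, and the input vector is made up.

```r
# hypothetical helper: builds the panel title with the summary statistics
diffTitle <- function(x, label) {
  paste0(label,
         "\nmin = ", min(x, na.rm = TRUE),
         ", max = ", max(x, na.rm = TRUE),
         ", mean = ", round(mean(x, na.rm = TRUE), 1),
         ", median = ", median(x, na.rm = TRUE))
}

# hypothetical helper: histogram of the differences with the title above
plotDiff <- function(x, label, xlab = "differences (min)") {
  hist(x, breaks = 55, xlab = xlab, cex.main = .7, main = diffTitle(x, label))
}

# toy differences in minutes (not study data)
d <- c(-1, 0, 0.5, 2, NA)
toyTitle <- diffTitle(d, "toy")
# plotDiff(d, "toy") would draw the corresponding histogram
```

Each panel could then be drawn with a single call, e.g., plotDiff(sleepLog$TimeInBed - sleepLog$TIB, "Differences Log-based - EBE-based TIB (min)").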

SO and SOL

Here, we compare the sleep onset time computed based on EBE data with the StartTime recorded in Fitabase.

par(mfrow=c(1,3))
diff <- as.numeric(difftime(sleepLog$SO, sleepLog$StartTime, units="mins"))
hist(diff,breaks=30,
     main=paste("Differences Sleep Onset time - sleepLog StartTime (min):\nmin =",
                min(diff),", mean =",round(mean(diff),1),", median =",median(diff),", max =",max(diff)))
hist(sleepLog$SOL,breaks=30,
     main=paste("SOL (min):\nmin =",min(sleepLog$SOL),
                ", mean =",round(mean(sleepLog$SOL),1),", median =",median(sleepLog$SOL),", max =",max(sleepLog$SOL)))
diffvsSOL <- diff - sleepLog$SOL
hist(diffvsSOL,breaks=30,
     main=paste("Differences (SO - StartTime) - SOL (min):\nmin =",
                min(diffvsSOL),", mean =",round(mean(diffvsSOL),1),", median =",median(diffvsSOL),", max =",max(diffvsSOL)))

Comments:

differences between EBE-based sleep onset and sleepLog StartTime range from 0 to 96 min, with 13% being = 0, 56% being equal to or less than 5 min, 81% being equal to or less than 10 min, and only 8% being > 15 min
these differences exactly match the corresponding SOL values
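
The match between SO - StartTime and SOL follows from how the two are defined. The sketch below uses a deliberately simple definition (sleep onset = first epoch not scored as wake); this is an assumption, since ebe2sleep may apply a different criterion, and the epochs are made up.

```r
# toy night of eight 30-sec epochs (hypothetical values, not study data)
StartTime <- as.POSIXct("2020-01-01 23:00:00", tz = "UTC")
ebe <- data.frame(Time = StartTime + 30 * (0:7),
                  SleepStage = c("wake", "wake", "wake", "light",
                                 "light", "wake", "deep", "deep"))

# sleep onset (assumed definition): time of the first non-wake epoch
isSleep <- ebe$SleepStage != "wake"
SO <- min(ebe$Time[isSleep])

# SOL: No. of epochs before the first sleep epoch, converted to minutes
SOL <- sum(cumsum(isSleep) == 0) * 30 / 60

# clock-based difference between sleep onset and StartTime (in min)
SOvsStartTime <- as.numeric(difftime(SO, StartTime, units = "mins"))
```

Here three wake epochs precede the first sleep epoch, so both SOL and SOvsStartTime equal 1.5 min.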

Here, we inspect the EBE data of those cases with SO-StartTime differences higher than 60 min (N = 5):

# 5 cases with SO - StartTime > 60 min
SOdiffs <- sleepLog[as.numeric(difftime(sleepLog$SO,sleepLog$StartTime,units="mins"))>60,]
nrow(SOdiffs) # 5

## [1] 5

SOdiffs$SOvsStartTime <- as.numeric(difftime(SOdiffs$SO,SOdiffs$StartTime,units="mins"))
SOdiffs$SOvsSOL <- SOdiffs$SOvsStartTime - SOdiffs$SOL
SOdiffs[,c("ID","StartTime","SO","SOL","SOvsStartTime","SOvsSOL","combined","MinutesToFallAsleep")]

# plotting EBE data
par(mfrow=c(2,3))
for(i in 1:nrow(SOdiffs)){
  plot(sleepEBE[sleepEBE$LogId==as.character(SOdiffs[i,"LogId"]),"SleepStage"],xlab="epoch",ylab="sleep stage",pch=20,
       main=SOdiffs[i,"LogId"])}

Comments:

each of the highlighted cases is actually associated with a No. of initial epochs scored as wake higher than 100 (i.e., 50 min)
in conclusion, SO and SOL values seem to be correctly computed

Here, we inspect the distribution of SOL values compared to the MinutesToFallAsleep variable encoded in Fitabase.

# plotting SOL distributions
hist(sleepLog$SOL,col=rgb(.9,0,0,alpha=.5),breaks=35,xlab="SOL (min)",
     main=paste("SleepLog- (yellow; max SOL =",round(max(sleepLog$MinutesToFallAsleep),1),
                "min) \nand EBE-derived Sleep Onset Latency (red; max SOL =",round(max(sleepLog$SOL,na.rm=TRUE),1),"min)"))
hist(sleepLog$MinutesToFallAsleep,col=rgb(1,1,0,alpha=.5),breaks=35,add=TRUE)

Then, we replicate the plot by focusing on cases with no missing data.

# Excluding combined cases
par(mfrow=c(2,2))
hist(sleepLog[sleepLog$combined==FALSE,"MinutesToFallAsleep"],col="yellow",breaks=35,xlab="SOL (min)",
     main="excluding combined cases")
hist(sleepLog[sleepLog$combined==FALSE,"SOL"],add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding cases with missing_start > 0 or nFinalWake > 0
hist(sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesToFallAsleep"]/60,col="yellow",breaks=35,
     xlab="SOL (hours)",main="excluding cases with \nmissing_start or nFinalWake > 0")
hist(sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"SOL"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding EBEDataType = "classic" sleep
hist(sleepLog[sleepLog$EBEDataType!="classic","MinutesToFallAsleep"]/60,col="yellow",breaks=35,xlab="SOL (hours)",
     main="excluding cases with \n'classic' EBEDataType")
hist(sleepLog[sleepLog$EBEDataType!="classic","SOL"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)

# plotting SOL differences
par(mfrow=c(2,2))

diff <- sleepLog$MinutesToFallAsleep-sleepLog$SOL
hist(diff,breaks=55,xlab="SOL differences (min)",cex.main=.7,
     main=paste("Differences Log-based - EBE-based SOL (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesToFallAsleep"] -
  sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"SOL"]
hist(diff,breaks=55,xlab="SOL differences (min)",cex.main=.7,
     main=paste("Excluding cases of missing data at the beginning/end\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"MinutesToFallAsleep"] -
  sleepLog[sleepLog$combined==FALSE,"SOL"]
hist(diff,breaks=55,xlab="SOL differences (min)",cex.main=.7,
     main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE & 
                   sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesToFallAsleep"] -
  sleepLog[sleepLog$combined==FALSE &
             sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"SOL"]
hist(diff,breaks=55,xlab="SOL differences (min)",cex.main=.7,
     main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))

Comments:

EBE-based SOL shows more variability and higher values than the MinutesToFallAsleep variable in sleepLog data
differences are only partially reduced by excluding cases with missing data at the beginning

TST

Here, we compare the total sleep time computed based on EBE data (TST) with the MinutesAsleep variable recorded in Fitabase.

# plotting TST distributions
hist(sleepLog$MinutesAsleep/60,col="yellow",breaks=35,xlab="TST (hours)",
     main=paste("SleepLog- (yellow; max TST =",round(max(sleepLog$MinutesAsleep/60),1),
                "hours) \nand EBE-derived Total Sleep Time (red; max TST =",round(max(sleepLog$TST/60,na.rm=TRUE),1),"hours)"))
hist(sleepLog$TST/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)

Then, we replicate the plot by focusing on cases with no missing data.

# Excluding combined cases
par(mfrow=c(2,2))
hist(sleepLog[sleepLog$combined==FALSE,"MinutesAsleep"]/60,col="yellow",breaks=35,xlab="TST (hours)",
     main="excluding combined cases")
hist(sleepLog[sleepLog$combined==FALSE,"TST"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)

# Excluding cases with missing_start > 0 or nFinalWake > 0
hist(sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesAsleep"]/60,col="yellow",breaks=35,xlab="TST (hours)",
     main="excluding cases with \nmissing_start or nFinalWake > 0")
hist(sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TST"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding EBEDataType = "classic" sleep
hist(sleepLog[sleepLog$EBEDataType!="classic","MinutesAsleep"]/60,col="yellow",breaks=35,xlab="TST (hours)",
     main="excluding cases with \n'classic' EBEDataType")
hist(sleepLog[sleepLog$EBEDataType!="classic","TST"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)

# plotting TST differences
par(mfrow=c(2,2))

diff <- sleepLog$MinutesAsleep-sleepLog$TST
hist(diff,breaks=55,xlab="TST differences (min)",cex.main=.7,
     main=paste("Differences Log-based - EBE-based TST (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesAsleep"] -
  sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TST"]
hist(diff,breaks=55,xlab="TST differences (min)",cex.main=.7,
     main=paste("Excluding cases of missing data at the beginning/end\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"MinutesAsleep"] -
  sleepLog[sleepLog$combined==FALSE,"TST"]
hist(diff,breaks=55,xlab="TST differences (min)",cex.main=.7,
     main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE & 
                   sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"MinutesAsleep"] -
  sleepLog[sleepLog$combined==FALSE & 
             sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"TST"]
hist(diff,breaks=55,xlab="TST differences (min)",cex.main=.7,
     main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))

Comments:

TST shows results similar to those for TIB, but the differences between Fitabase- and EBE-derived TST are unaffected by cases of combined sleep, and only partially affected by cases of missing epochs at the beginning/end
differences range from -151 to 30 min, with most of them (99%) between -10 and 10 min

WASO

Here, we compare the distribution of Wake After Sleep Onset values computed from EBE data (WASO) with those computed from sleepLog data as: TimeInBed - MinutesAfterWakeUp (not considered for computing EBE-derived sleep measures) - MinutesToFallAsleep (SOL) - MinutesAsleep (TST).

# computing Fitabase-derived WASO (note that MinutesAfterWakeUp is not subtracted in this line)
sleepLog$fitabaseWASO <- sleepLog$TimeInBed - sleepLog$MinutesAsleep - sleepLog$MinutesToFallAsleep
# plotting WASO distributions
hist(sleepLog$fitabaseWASO/60,col="yellow",breaks=35,xlab="WASO (hours)",
     main=paste("SleepLog- (yellow; max WASO =",round(max(sleepLog$fitabaseWASO/60),1),
                "hours) \nand EBE-derived WASO (red; max WASO =",round(max(sleepLog$WASO/60,na.rm=TRUE),1),"hours)"))
hist(sleepLog$WASO/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)

# Excluding combined cases
par(mfrow=c(2,2))
hist(sleepLog[sleepLog$combined==FALSE,"fitabaseWASO"]/60,col="yellow",breaks=35,xlab="WASO (hours)",
     main="excluding combined cases")
hist(sleepLog[sleepLog$combined==FALSE,"WASO"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding cases with missing_start > 0 or nFinalWake > 0
hist(sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"fitabaseWASO"]/60,col="yellow",breaks=35,
     xlab="WASO (hours)",main="excluding cases with \nmissing_start or nFinalWake > 0")
hist(sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,
              "WASO"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)
# Excluding SleepDataType = "classic" sleep
hist(sleepLog[sleepLog$SleepDataType!="classic","fitabaseWASO"]/60,col="yellow",breaks=35,xlab="WASO (hours)",
     main="excluding cases with \n'classic' SleepDataType")
hist(sleepLog[sleepLog$SleepDataType!="classic","WASO"]/60,add=TRUE,col=rgb(.9,0,0,alpha=.5),breaks=35)

# plotting WASO differences
par(mfrow=c(2,2))

diff <- sleepLog$fitabaseWASO-sleepLog$WASO
hist(diff,breaks=55,xlab="WASO differences (min)",cex.main=.7,
     main=paste("Differences Log-based - EBE-based WASO (min)\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"fitabaseWASO"] -
  sleepLog[sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"WASO"]
hist(diff,breaks=55,xlab="WASO differences (min)",cex.main=.7,
     main=paste("Excluding cases of missing data at the beginning/end\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE,"fitabaseWASO"] -
  sleepLog[sleepLog$combined==FALSE,"WASO"]
hist(diff,breaks=55,xlab="WASO differences (min)",cex.main=.7,
     main=paste("Excluding cases of combined sleep periods\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))
diff <- sleepLog[sleepLog$combined==FALSE & 
                   sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"fitabaseWASO"] -
  sleepLog[sleepLog$combined==FALSE & sleepLog$missing_start==0 & sleepLog$nFinalWake==0,"WASO"]
hist(diff,breaks=55,xlab="WASO differences (min)",cex.main=.7,
     main=paste("Excluding both\n min=",min(diff,na.rm=TRUE),", max =",max(diff,na.rm=TRUE),
                ", mean =",round(mean(diff,na.rm=TRUE),1),", median =",median(diff,na.rm = TRUE)))

Comments:

the distribution of EBE-derived WASO values has a shape similar to that of Fitabase-derived WASO values, although EBE-derived WASO shows a higher No. of cases from 1 to 2 hours. The distribution of EBE-derived WASO is centered on a slightly lower value (median = 46.5 min) than that of Fitabase-derived WASO (median = 56 min)
differences between sleepLog and sleepEBE WASO range from -26 to 156 min, and are at least partially due to missing/wake epochs at the beginning/end

As a further check, we inspect whether EBE-based WASO corresponds to EBE-based TIB - SOL - TST.

# manually recomputing WASO
sleepLog$WASO_rec <- sleepLog$TIB - sleepLog$SOL - sleepLog$TST

# plotting differences > 0.5 min (only positive differences, i.e., WASO_rec > WASO, are selected)
diff <- sleepLog[sleepLog$WASO_rec - sleepLog$WASO > 0.5,"WASO_rec"] - 
  sleepLog[sleepLog$WASO_rec - sleepLog$WASO > 0.5,"WASO"]
hist(diff,xlab="WASO_rec - WASO",breaks=10,
     main=paste(length(diff),"cases with WASO_rec being more than 0.5 min higher than WASO\nmin =",
                min(diff),", max =",max(diff)))

# showing the selected cases (all are combined sleep periods)
sleepLog[sleepLog$WASO_rec-sleepLog$WASO>0.5,c("ID","LogId","StartTime","TIB","SOL","TST","WASO","WASO_rec",
                                               "combined")]

Comments:

in only 24 cases is the manually recomputed WASO more than 30 sec higher than the EBE-based WASO (the 30-sec tolerance accounting for StartTime rounding), with differences ranging from 1 to 20 min
all of these are cases of combined sleep periods
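
The check above relies on the identity WASO = TIB - SOL - TST. The sketch below replays it on made-up values and also shows that a one-sided condition such as d > 0.5 (which is what d > abs(0.5) evaluates to) flags only positive discrepancies, whereas abs(d) > 0.5 flags discrepancies in both directions. The data frame and its values are hypothetical.

```r
# hypothetical sleep measures in minutes (not study data)
toy <- data.frame(TIB  = c(480, 450, 500),
                  SOL  = c(10, 20, 5),
                  TST  = c(420, 400, 440),
                  WASO = c(50, 29, 56))

# manually recomputed WASO
toy$WASO_rec <- toy$TIB - toy$SOL - toy$TST

# discrepancies between recomputed and original WASO: 0, 1, -1
d <- toy$WASO_rec - toy$WASO

oneSided <- d > 0.5        # TRUE only when WASO_rec exceeds WASO
twoSided <- abs(d) > 0.5   # TRUE for discrepancies in either direction
```

In this toy example, the third row is flagged only by the two-sided condition. Whether negative discrepancies occur in the actual data is not shown in this section.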

Here, we assign the manually recomputed values (i.e., TIB - SOL - TST) to each of these 24 cases.

# assigning recomputed WASO to these cases
sleepLog[sleepLog$WASO_rec-sleepLog$WASO>0.5,"WASO"] <- sleepLog[sleepLog$WASO_rec-sleepLog$WASO>0.5,"WASO_rec"]
sleepLog$WASO_rec <- NULL

Similarly, we manually recompute sleep efficiency (SE) as 100*TST/TIB.

sleepLog$SE <- round(100*sleepLog$TST/sleepLog$TIB,2)

uniqueEBElog

Here, we visualize the distribution of sleep metrics obtained from the 41 cases of EBE data not included in sleepLog data, to which we added the 3 further cases described above (N = 44).

# showing 44 cases
nrow(sleepLog.uniqueEBE)

## [1] 44

par(mfrow=c(2,2))
hist(sleepLog.uniqueEBE$TIB/60,breaks=35,main="TIB (hours)")
hist(sleepLog.uniqueEBE$TST/60,breaks=35,main="TST (hours)")
hist(sleepLog.uniqueEBE$WASO,breaks=35,main="WASO (min)")
hist(sleepLog.uniqueEBE$SOL,breaks=35,main="SOL (min)")

# plotting StartHour
sleepLog.uniqueEBE$StartHour <- as.POSIXct(paste(lubridate::hour(sleepLog.uniqueEBE$StartTime), 
                                                 lubridate::minute(sleepLog.uniqueEBE$StartTime)),format="%H %M",tz="GMT")
par(mfrow=c(1,1))
hist(sleepLog.uniqueEBE$StartHour,breaks=35,col="black")

Comments:

no cases with StartTime between 06:00 and 18:00 are included among the 44 uniqueEBElog cases
the distributions of sleep metrics derived from the 44 cases are in line with the other cases obtained from EBE data included in sleepLog data

Then, consistently with what was done for sleepLog cases, we match EndTime with WakeUp values (currently separated by -0.5 to 0.5 min due to temporal approximations).

# plotting WakeUp - EndTime differences
hist(as.numeric(difftime(sleepLog.uniqueEBE$WakeUp,sleepLog.uniqueEBE$EndTime,units="mins")),
     breaks=30,main="WakeUp - EndTime")

# matching EndTime with WakeUp
sleepLog.uniqueEBE$EndTime <- sleepLog.uniqueEBE$WakeUp

# plotting again WakeUp - EndTime differences
hist(as.numeric(difftime(sleepLog.uniqueEBE$WakeUp,sleepLog.uniqueEBE$EndTime,units="mins")),
     breaks=30,main="WakeUp - EndTime")

# recomputing TIB and SE
sleepLog.uniqueEBE$TIB <- as.numeric(difftime(sleepLog.uniqueEBE$EndTime,sleepLog.uniqueEBE$StartTime,units="mins"))
sleepLog.uniqueEBE$SE <- 100*sleepLog.uniqueEBE$TST/sleepLog.uniqueEBE$TIB

uniqueclassicEBE

Here, we visualize the distribution of sleep metrics obtained from the 336 cases of classic EBE data not included in sleepLog data. From these cases, we will later exclude all cases with StartTime between 06:00 and 18:00, and all cases with TST < 3 hours.

# showing 336 cases
nrow(sleepLog.uniqueClassic)

## [1] 336

par(mfrow=c(2,2))
hist(sleepLog.uniqueClassic$TIB/60,breaks=35,main="TIB (hours)")
hist(sleepLog.uniqueClassic$TST/60,breaks=35,main="TST (hours)")
hist(sleepLog.uniqueClassic$WASO,breaks=35,main="WASO (min)")
hist(sleepLog.uniqueClassic$SOL,breaks=35,main="SOL (min)")

# plotting StartHour
sleepLog.uniqueClassic$StartHour <- as.POSIXct(paste(lubridate::hour(sleepLog.uniqueClassic$StartTime), 
                                                 lubridate::minute(sleepLog.uniqueClassic$StartTime)),format="%H %M",tz="GMT")
par(mfrow=c(1,1))
hist(sleepLog.uniqueClassic$StartHour,breaks=35,col="black")

Comments:

a few cases show StartTime between 06:00 and 18:00, or TST < 3 hours. These cases will be removed in the data cleaning section below
the distributions of sleep metrics derived from the 336 cases are in line with the other cases obtained from EBE sleep data included in sleepLog data
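
The two exclusion criteria mentioned above can be sketched as follows. This is only an illustration on made-up values: the actual cleaning code appears in a later section and may differ, and the column names daytime and shortSleep are hypothetical.

```r
# toy sleep periods (hypothetical values, not study data); TST in minutes
toy <- data.frame(
  StartTime = as.POSIXct(c("2020-01-01 22:30:00", "2020-01-02 14:10:00",
                           "2020-01-03 01:15:00", "2020-01-03 23:00:00"), tz = "UTC"),
  TST = c(450, 60, 400, 120))

# decimal clock hour of StartTime (e.g., 22.5 for 22:30)
startHour <- as.numeric(format(toy$StartTime, "%H")) +
  as.numeric(format(toy$StartTime, "%M")) / 60

# criterion 1: sleep periods starting between 06:00 and 18:00
toy$daytime <- startHour > 6 & startHour < 18

# criterion 2: sleep periods with less than 3 hours of total sleep time
toy$shortSleep <- toy$TST < 180

# keeping only the cases that meet neither exclusion criterion
kept <- toy[!toy$daytime & !toy$shortSleep, ]
```

In this toy example, the second case fails both criteria and the fourth fails the TST criterion, so only the first and third cases are kept.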

Then, consistently with what was done for sleepLog cases, we match EndTime with WakeUp values (currently separated by -1 to 1 min due to temporal approximations).

# plotting WakeUp - EndTime differences
hist(as.numeric(difftime(sleepLog.uniqueClassic$WakeUp,sleepLog.uniqueClassic$EndTime,units="mins")),
     breaks=30,main="WakeUp - EndTime")

# matching EndTime with WakeUp
sleepLog.uniqueClassic$EndTime <- sleepLog.uniqueClassic$WakeUp

# plotting again WakeUp - EndTime differences
hist(as.numeric(difftime(sleepLog.uniqueClassic$WakeUp,sleepLog.uniqueClassic$EndTime,units="mins")),
     breaks=30,main="WakeUp - EndTime")

# recomputing TIB and SE
sleepLog.uniqueClassic$TIB <- as.numeric(difftime(sleepLog.uniqueClassic$EndTime,
                                                  sleepLog.uniqueClassic$StartTime,units="mins"))
sleepLog.uniqueClassic$SE <- 100*sleepLog.uniqueClassic$TST/sleepLog.uniqueClassic$TIB

4.3.5. Data merging

Here, we join the 44 uniqueEBElog and the 336 uniqueclassicEBE cases to the remaining 4,861 cases included in the sleepLog dataset.

# creating EBEonly logical column
sleepLog$EBEonly <- FALSE
sleepLog.uniqueEBE$EBEonly <- sleepLog.uniqueClassic$EBEonly <- TRUE
memory <- sleepLog

# merging datasets
sleepLog <- plyr::join(plyr::join(sleepLog,sleepLog.uniqueEBE,type="full"),sleepLog.uniqueClassic,type="full")

# sanity check
cat("sanity check:",nrow(sleepLog)-nrow(memory)==nrow(sleepLog.uniqueEBE)+nrow(sleepLog.uniqueClassic))

## sanity check: TRUE
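
The logic of this row-count check can be illustrated on toy data. To keep the sketch free of package dependencies, plyr::join(type="full") is replaced here by base R merge(all = TRUE), which likewise joins on all shared columns by default; this substitution and all values are assumptions made for illustration only.

```r
# toy datasets sharing the same columns (hypothetical values, not study data)
a <- data.frame(LogId = c("1", "2"), TIB = c(480, 450), EBEonly = FALSE)
b <- data.frame(LogId = "3", TIB = 430, EBEonly = TRUE)

# full join on all shared columns: no row of b matches a row of a
m <- merge(a, b, all = TRUE)
checkOK <- nrow(m) - nrow(a) == nrow(b)      # all rows of b were appended

# counter-example: a row identical to one already in a is matched rather than appended
b2 <- data.frame(LogId = "2", TIB = 450, EBEonly = FALSE)
m2 <- merge(a, b2, all = TRUE)
checkDup <- nrow(m2) - nrow(a) == nrow(b2)   # the check fails
```

The check is TRUE only when every joined row is new, which is why it works as a guard against cases that were already included in sleepLog.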

# sorting by ID and date and removing the merged datasets
sleepLog <- sleepLog[order(sleepLog$ID,sleepLog$StartTime),]
rm(sleepLog.uniqueEBE,sleepLog.uniqueClassic)

4.3.5.1. Merging EBE

Once uniqueEBElog and uniqueclassicEBE have been merged with the remaining sleepLog data, we can move the uniqueclassicEBE cases from classicEBE into sleepEBE, in order to have a single dataset of EBE data to be used for the analysis. Here, we apply the same procedures used in section 4.2 to integrate the 336 uniqueclassicEBE cases with the sleepEBE dataset.

# uniqueclassicEBE cases (N = 336)
uniqueclassicEBE <- LogId_special[[3]]
length(uniqueclassicEBE)

## [1] 336

cat("sanity check:", # sanity check
    length(uniqueclassicEBE)==length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId) &
                                                                  !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))

## sanity check: TRUE

# preparing datasets for aggregation
memory <- sleepEBE # saving current dataset for comparison
classicEBE$SleepDataType <- "classic"
sleepEBE$LogId <- as.character(sleepEBE$LogId) # LogId back to character
classicEBE$LogId <- as.character(classicEBE$LogId)

Here, the 336 uniqueclassicEBE are integrated within the sleepEBE dataset. Since classicEBE was recorded in 60-sec epochs whereas sleepEBE was recorded in 30-sec epochs, the former are duplicated before merging.

# aggregating data
for(LOG in uniqueclassicEBE){
  
  # selecting LOG-related epochs
  classicLog <- classicEBE[classicEBE$LogId==LOG,] 
  
  # duplicating each row, and adding 30 secs to each other epoch to have 30-sec epochs
  classicLog_dup <- classicLog[rep(1:nrow(classicLog), rep(2,nrow(classicLog))),]
  classicLog_dup$Time[seq(2,nrow(classicLog_dup),by=2)] <- classicLog_dup$Time[seq(2,nrow(classicLog_dup),by=2)] + 30 
  
  # changing column name from "value" to "SleepStage"
  colnames(classicLog_dup)[which(colnames(classicLog_dup)=="value")] <- "SleepStage"
  
  # merging
  sleepEBE <- rbind(sleepEBE,classicLog_dup[,c("ID","group","LogId","Time","SleepStage","ActivityDate","SleepDataType")]) }

# back to LogId as factor, and sorting data by ID, ActivityDate and Time
sleepEBE$LogId <- as.factor(sleepEBE$LogId)
classicEBE$LogId <- as.factor(classicEBE$LogId)
sleepEBE$SleepDataType <- as.factor(sleepEBE$SleepDataType)
sleepEBE <- sleepEBE[order(sleepEBE$ID,sleepEBE$ActivityDate,sleepEBE$Time),]
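
The duplication step can be sketched on a toy sleep period of three 60-sec epochs (the LogId, times, and values below are invented for illustration). Each row is repeated twice, and 30 sec are added to every second copy, so that the resulting epochs are evenly spaced by 30 sec.

```r
# toy classicEBE log: three hypothetical 60-sec epochs (0 = wake, 1 = sleep)
classicLog <- data.frame(LogId="A",
                         Time=as.POSIXct("2020-01-01 23:00:00",tz="GMT") + 60*(0:2),
                         value=c(0,1,1))

# duplicating each row
classicLog_dup <- classicLog[rep(seq_len(nrow(classicLog)),each=2),]

# adding 30 secs to every second copy to obtain 30-sec epochs
even <- seq(2,nrow(classicLog_dup),by=2)
classicLog_dup$Time[even] <- classicLog_dup$Time[even] + 30
rownames(classicLog_dup) <- NULL
classicLog_dup
```

This is also why the sanity check on the No. of rows multiplies the No. of classicEBE epochs by 2.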

Here, we check whether our procedure effectively integrated the uniqueclassicEBE cases with sleepEBE data. This is done by using the same lines of code used in section 2.5.1.

# sanity check
data.frame(NstagesANDclassic_NOsleepLog=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
           Nstages_NOsleepLogORclassic=length(levels(sleepEBE$LogId)[!levels(sleepEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(sleepEBE$LogId)%in%levels(classicEBE$LogId)]),
           Nclassic_NOsleepLogORstages=length(levels(classicEBE$LogId)[!levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]),
           NclassicANDsleepLogNostages=length(levels(classicEBE$LogId)[levels(classicEBE$LogId)%in%levels(sleepLog_noncomb$LogId)
                                                                       & !levels(classicEBE$LogId)%in%levels(sleepEBE$LogId)]))

# is the difference between the No. of rows in the original and aggregated datasets equal to twice the No. of uniqueclassicEBE epochs?
cat("sanity check:",(nrow(sleepEBE)-nrow(memory))==(nrow(classicEBE[classicEBE$LogId%in%uniqueclassicEBE,])*2))

## sanity check: TRUE

# is the difference between the No. of LogIds in the original and aggregated datasets equal to the No. of uniqueclassicEBE cases?
cat("sanity check:",(nlevels(sleepEBE$LogId)-nlevels(memory$LogId))==length(uniqueclassicEBE))

## sanity check: TRUE

# new No. of sleepEBE LogId
cat("New No. of sleepEBE LogId:",nlevels(sleepEBE$LogId))

## New No. of sleepEBE LogId: 5778

Comments:

now, no more cases are included in classicEBE but not in sleepEBE, suggesting that data aggregation was effective
the new No. of sleepEBE LogId values is 5,778

Then, we compare the distributions of sleep and wake values in the original and aggregated cases.

par(mfrow=c(1,3))
plot(as.factor(sleepEBE[sleepEBE$SleepDataType=="stages","SleepStage"]),main="stages")
plot(as.factor(gsub("2","1",gsub("3","1",sleepEBE[sleepEBE$SleepDataType=="stages","SleepStage"]))),main="stages (binary recoded)")
plot(as.factor(sleepEBE[sleepEBE$SleepDataType=="classic","SleepStage"]),main="classic")

Comments:

the proportion of epochs characterized as wake (0) and sleep (1) is similar between the original sleepEBE data and the integrated uniqueclassicEBE cases

4.3.6. Saving datasets

Here, we save the aggregated datasets.

# saving dataset with EBE-based sleep measures
save(sleepLog,file="DATA/datasets/sleepLogEBE_aggregated.RData")

# saving dataset with all EBE data
save(sleepEBE,file="DATA/datasets/sleepEBEclassic_aggregated_full.RData")

4.4. HR.1min & sleepLog

Here, we use HR.1min data to compute the average HR associated with sleep periods. Specifically, we compute the mean HR by sleep stage, that is, the average of those HR values associated with pairs of consecutive sleep epochs classified with the same stage: NREM and REM (only for cases with SleepDataType = "stages"). Mean HR of all sleep and wake epochs is also recorded, and it is separately computed for wake epochs before and after sleep onset (SO), also for cases with SleepDataType = "classic".

The HRstage function is used to optimize the computation.


HRstage <- function(SLEEPdata=NA,HRdata=NA,EBEdata=NA,digits=3){ require(tcltk)
  
  # 1. preparing data
  # ...................................................................................................................
  # preparing SLEEPdata
  rownames(SLEEPdata) <- 1:nrow(SLEEPdata)
  # HR-by-time column
  SLEEPdata$stageHR_NREM <- SLEEPdata$stageHR_REM <- NA
  
  # preparing EBEdata (joining HR values only to those couples of consecutive epochs classified with the same stage)
  EBEdata <- plyr::join(EBEdata,HRdata[,c("ID","Time","HR")],by=c("ID","Time")) # joining HR values to EBEdata
  EBEdata$SleepStage_rec <- EBEdata$SleepStage + 1 # adding 1 to stages for avoiding zero values
  EBEdata$LogId <- as.character(EBEdata$LogId) # same LogId in cases with combined sleep periods
  for(comb in levels(as.factor(SLEEPdata$combinedLogId))){
    if(nchar(comb)==23){ EBEdata[EBEdata$LogId%in%strsplit(comb,split="_")[[1]],"LogId"] <-
      as.character(SLEEPdata[!is.na(SLEEPdata$combinedLogId) & SLEEPdata$combinedLogId==comb,"LogId"])
    } else { EBEdata[EBEdata$LogId==comb,"LogId"] <-
      as.character(SLEEPdata[!is.na(SLEEPdata$combinedLogId) & SLEEPdata$combinedLogId==comb,"LogId"]) }}
  EBEdata$LogId <- as.factor(EBEdata$LogId) 
  require(dplyr)
  EBEdata <- EBEdata %>%
    group_by(LogId) %>% # creating lagged variable within the same LogId
    mutate(SleepStage_rec.LAG = dplyr::lag(SleepStage_rec,n=1,default=NA),
           SleepStage_rec.LEAD = dplyr::lead(SleepStage_rec,n=1,default=NA))
  EBEdata <- as.data.frame(EBEdata)
  detach("package:dplyr", unload=TRUE)
  EBEdata$sameStage.LAG <- EBEdata$SleepStage_rec - EBEdata$SleepStage_rec.LAG # sameStage.LAG = difference epochs i and i-1
  EBEdata$sameStage.LEAD <- EBEdata$SleepStage_rec - EBEdata$SleepStage_rec.LEAD # sameStage.LEAD = difference epochs i and i+1
  # valid HR only when sameStage.LAG or sameStage.LEAD = 0
  EBEdata[!is.na(EBEdata$HR) & ((!is.na(EBEdata$sameStage.LAG) & EBEdata$sameStage.LAG == 0) |
               (!is.na(EBEdata$sameStage.LEAD) & EBEdata$sameStage.LEAD == 0)),"stageHR"] <- 
    EBEdata[!is.na(EBEdata$HR) & ((!is.na(EBEdata$sameStage.LAG) & EBEdata$sameStage.LAG == 0) |
                 (!is.na(EBEdata$sameStage.LEAD) & EBEdata$sameStage.LEAD == 0)),"HR"] 
  
  # 2. iteratively computing mean HR by sleep stage
  # ........................................................................................................................
  pb <- tkProgressBar("Computing mean HR:", "%",0, 100, 50) # progress bar
  for(i in 1:nrow(SLEEPdata)){ 
    info <- sprintf("%d%% done", round(i/nrow(SLEEPdata)*100)) # row names were reset to 1:nrow above, so i is the row position
    setTkProgressBar(pb, round(i/nrow(SLEEPdata)*100), "Computing mean HR:", info)
    
    # data selection (between sleepLog StartTime and EndTime)
    HRday <- HRdata[HRdata$ID == SLEEPdata[i,"ID"] &
                      HRdata$Time >= SLEEPdata[i,"StartTime"] & HRdata$Time <= SLEEPdata[i,"EndTime"],]
    ebe <- EBEdata[EBEdata$ID==SLEEPdata[i,"ID"] & # same ID & bounded between StartTime and EndTime
                       EBEdata$Time >= SLEEPdata[i,"StartTime"] & EBEdata$Time <= SLEEPdata[i,"EndTime"],]
    if(nrow(ebe)>0){
      # sleep stage HR (only when EBEDataType is not "classic")
      if(SLEEPdata[i,"EBEDataType"]!="classic" & !is.na(SLEEPdata[i,"light"])){
        SLEEPdata[i,"stageHR_NREM"] <- round(mean(ebe[ebe$SleepStage==1 | ebe$SleepStage==2,"stageHR"], # mean HR NREM sleep
                                                  na.rm=TRUE),digits) 
        SLEEPdata[i,"nHR_NREM"] <- nrow(ebe[(ebe$SleepStage==1 | ebe$SleepStage==2) & 
                                              !is.na(ebe$stageHR),]) # No. HR epochs in NREM sleep
        SLEEPdata[i,"stageHR_REM"] <- round(mean(ebe[ebe$SleepStage==3,"stageHR"],na.rm=TRUE),digits) # REM
        SLEEPdata[i,"nHR_REM"] <- nrow(ebe[ebe$SleepStage==3 & !is.na(ebe$stageHR),]) }}}
  close(pb) # closing progress bar
  
  return(SLEEPdata) }

sleepLog <- HRstage(SLEEPdata=sleepLog,HRdata=HR.1min,EBEdata=sleepEBE,digits=3)

# saving dataset (not run, for saving computational time)
save(sleepLog,file="DATA/datasets/sleepLog_HRtimestage.RData")
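
The core rule used by HRstage (keeping an HR value only when the previous or the next epoch has the same stage) can be sketched in base R on a toy night. The stage sequence and HR values below are invented; stage codes 1 and 2 are treated as NREM and 3 as REM, as in the function above, and 0 is assumed to code wake. For simplicity, the sketch ignores the grouping by LogId and the join with HR.1min.

```r
# toy night: hypothetical stage codes (0 = wake, 1 and 2 = NREM, 3 = REM) and HR values
stage <- c(0,1,1,2,3,3,1)
HR <- c(70,62,61,58,66,65,60)

# previous (lag) and next (lead) stage for each epoch
stage.LAG <- c(NA,head(stage,-1))
stage.LEAD <- c(tail(stage,-1),NA)

# an HR value is kept only when the previous or the next epoch has the same stage
sameStage <- (!is.na(stage.LAG) & stage==stage.LAG) | (!is.na(stage.LEAD) & stage==stage.LEAD)
stageHR <- ifelse(sameStage,HR,NA)

# mean HR by stage
stageHR_NREM <- mean(stageHR[stage %in% c(1,2)],na.rm=TRUE)
stageHR_REM <- mean(stageHR[stage==3],na.rm=TRUE)
```

In this toy night, isolated epochs (e.g., the single epoch coded 2) do not contribute to the stage means, which is the intended effect of the rule.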

4.4.1. Sanity checks

4.4.1.1. Missing data

Here, we inspect the No. and percentage of nonmissing data for each class of variables included in the sleepLog dataset: "classic" (e.g., TST), "stages" (e.g., light), and "stageHR" (e.g., stageHR_NREM).

# counting No. and % of missing data per class of variable
infoMiss <- data.frame(classic=nrow(sleepLog[!is.na(sleepLog$TST),]),stages=nrow(sleepLog[!is.na(sleepLog$light),]),
                       stageHR=nrow(sleepLog[!is.na(sleepLog$stageHR_NREM),]),
                       stagesANDstageHR=nrow(sleepLog[!is.na(sleepLog$light) & !is.na(sleepLog$stageHR_NREM),]))
infoMiss <- rbind(infoMiss,nrow(sleepLog)-infoMiss,round(100*infoMiss/nrow(sleepLog)),round(100-100*infoMiss/nrow(sleepLog)))
row.names(infoMiss) <- c("N","Missing","%","%missing")
infoMiss

Comments:

sleep stages information is missing in 778 (15%) sleepLog cases
sleep-stage HR information is missing in 809 (15%) sleepLog cases
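
The structure of the infoMiss table can be illustrated with a compact sketch on toy data (the values are invented, and the use of sapply is only one possible way to obtain the same four rows).

```r
# toy data: four hypothetical sleep periods with some missing values
toy <- data.frame(TST=c(400,NA,380,420),light=c(200,NA,NA,210))

# No. of nonmissing values per variable
n <- sapply(toy,function(x) sum(!is.na(x)))

# No. and % of nonmissing and missing values
info <- rbind(N=n,Missing=nrow(toy)-n,
              pct=round(100*n/nrow(toy)),pctMissing=round(100-100*n/nrow(toy)))
info
```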

Then, we inspect the differences between the expected and observed No. of HR epochs in each sleep period. That is, we subtract the total No. of nonmissing sleep HR epochs (nHR_TST, the sum of nHR_NREM and nHR_REM) from the corresponding TST (in minutes).

# computing No. of HR epochs recorded during TST
sleepLog$nHR_TST <- apply(sleepLog[,c("nHR_NREM","nHR_REM")],1,sum,na.rm=TRUE)

# computing and plotting differences between expected and observed No. of TST epochs
sleepLog$nHR_TSTdiff <- sleepLog$TST - sleepLog$nHR_TST
hist(sleepLog$nHR_TSTdiff,breaks=100,main="TST - nHR_TST (minutes)",xlab="")

# printing info
nMiss <- c(0,10,20,50,75,100,150,200,400)
for(i in 1:length(nMiss)){ cat("\n-",nrow(sleepLog[sleepLog$nHR_TSTdiff>=nMiss[i],]),
                               "cases with >=",nMiss[i],"missing sleepHR epochs")}

## 
## - 4568 cases with >= 0 missing sleepHR epochs
## - 2273 cases with >= 10 missing sleepHR epochs
## - 953 cases with >= 20 missing sleepHR epochs
## - 824 cases with >= 50 missing sleepHR epochs
## - 806 cases with >= 75 missing sleepHR epochs
## - 797 cases with >= 100 missing sleepHR epochs
## - 777 cases with >= 150 missing sleepHR epochs
## - 765 cases with >= 200 missing sleepHR epochs
## - 556 cases with >= 400 missing sleepHR epochs

# removing columns
sleepLog$nHR_TST <- sleepLog$nHR_TSTdiff <- NULL

Comments:

a substantial No. of HR values (21%) were computed from recordings with 20+ missing epochs (i.e., 20+ min)

These cases will be processed in the data cleaning section.
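
Note that the counts printed above are cumulative: each line includes all cases at or above the given threshold. A toy sketch (with invented differences between expected and observed epochs) makes this explicit.

```r
# toy differences between expected and observed No. of HR epochs (hypothetical values)
nHR_diff <- c(0,5,12,25,60,410)

# No. of cases at or above each threshold
nMiss <- c(0,10,20,50)
counts <- sapply(nMiss,function(threshold) sum(nHR_diff >= threshold))
data.frame(threshold=nMiss,cases=counts)
```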

4.4.1.2. HR by sleep stage

Here, we repeat the procedure above by visualizing the number of epochs and the mean HR values computed for each sleep stage, and for the wake epochs preceding and following sleep onset.

# dataset in a long form (1 row per sleep stage)
library(tidyr)
sleepLog_long <- sleepLog[,c("ID","LogId","EBEDataType",paste("stageHR",c("NREM","REM"),sep="_"))] %>%
  pivot_longer(stageHR_NREM:stageHR_REM, names_to = "stage", values_to = "HR")  # long-form dataset of HR values
nEpochs <- sleepLog[,c("ID","LogId",paste("nHR",c("NREM","REM"),sep="_"))] %>%
  pivot_longer(nHR_NREM:nHR_REM, names_to = "stage", values_to = "nEpochs")  # long-form dataset of No. of nonmissing HR epochs
detach("package:tidyr", unload=TRUE)
sleepLog_long$nEpochs <- nEpochs$nEpochs
sleepLog_long$stage <- as.factor(gsub("stageHR_","",sleepLog_long$stage)) # sleep stage as factor (prefix removed)
sleepLog_long[is.na(sleepLog_long$nEpochs),"nEpochs"] <- 0 # NA nEpochs are converted as zero

# plotting HR
p1 <- ggplot(sleepLog_long,(aes(x=stage,y=HR))) + geom_violin(fill="salmon") + 
  stat_summary(fun.y=mean, geom="point", shape=20, size=5, col="darkred") + ggtitle("Mean HR by sleep stage") +
  xlab("Sleep stage") + ylab("HR (bpm)") + geom_boxplot(width=0.4,alpha=0.2)

# plotting No. included epochs by sleep stage
p2 <- ggplot(sleepLog_long,(aes(x=stage,y=nEpochs))) + geom_boxplot() + 
  stat_summary(fun.y=mean, geom="point", shape=20, size=2) + ggtitle("No. of nonmissing 1-min HR epochs \nper sleep stage")+
  xlab("Sleep stage") + ylab("No. of nonmissing HR epochs")

# showing plots
grid.arrange(p1,p2,nrow=1)

Comments:

mean HR distributions show slightly lower HR for NREM than REM sleep
no extreme HR values are observed
the number of nonmissing HR epochs is higher for NREM sleep (from 104 to 1,152, mean = 558 when EBEDataType is not "classic") than for REM sleep (from 0 to 236, mean = 85 when EBEDataType is not "classic")

4.4.2. Saving datasets

Here, we save the aggregated dataset.

# saving dataset with EBE-based sleep measures
save(sleepLog,file="DATA/datasets/sleepLogEBEHR_aggregated.RData")

4.5. sleepLog & dailyAct

Here, we create the variable IDday to be used for merging the sleepLog and the dailyAct datasets. Note that sleepLog ActivityDate was recoded to refer to the previous day when StartTime is between midnight and 6 AM (see sections 2.3.4, 4.3.2, and 4.3.3).

# creating common variable IDday
dailyAct$IDday <- as.factor(paste(dailyAct$ID,dailyAct$ActivityDate,sep="_"))
sleepLog$IDday <- as.factor(paste(sleepLog$ID,sleepLog$ActivityDate,sep="_"))

# sanity checks
cat("Sanity check:",nlevels(dailyAct$IDday)==nrow(dailyAct)) # dailyAct: no cases with the same IDday value

## Sanity check: TRUE

cat("Sanity check:",nrow(sleepLog)-nlevels(sleepLog$IDday)) # 4 double cases

## Sanity check: 4

sleepLog[sleepLog$IDday%in%sleepLog$IDday[duplicated(sleepLog$IDday)],c("IDday","StartTime","EndTime","TST","stageHR_NREM")]

# correcting N = 1 sleepLog case with StartTime between 00:00 and 06:00 but ActivityDate not adjusted
sleepLog[as.numeric(substr(sleepLog$StartTime,12,13)) >= 0 & as.numeric(substr(sleepLog$StartTime,12,13)) <= 6 &
           substr(sleepLog$StartTime,9,10)==substr(sleepLog$ActivityDate,9,10),"ActivityDate"] <-
  sleepLog[as.numeric(substr(sleepLog$StartTime,12,13)) >= 0 & as.numeric(substr(sleepLog$StartTime,12,13)) <= 6 &
           substr(sleepLog$StartTime,9,10)==substr(sleepLog$ActivityDate,9,10),"ActivityDate"] - 1

# re-creating common variable IDday and sanity check
sleepLog$IDday <- as.factor(paste(sleepLog$ID,sleepLog$ActivityDate,sep="_"))
cat("Sanity check:",nrow(sleepLog)-nlevels(sleepLog$IDday)) # three double cases

## Sanity check: 3

sleepLog[sleepLog$IDday%in%sleepLog$IDday[duplicated(sleepLog$IDday)],
         c("IDday","StartTime","EndTime","SleepDataType","EBEDataType","TST","stageHR_NREM")] # showing 3 double cases

# removing N = 3 day-time duplicated cases of sleepLog data
toRemove <- as.character(sleepLog[sleepLog$IDday%in%sleepLog$IDday[duplicated(sleepLog$IDday)],
                                  ][seq(1,nrow(sleepLog[sleepLog$IDday%in%sleepLog$IDday[duplicated(sleepLog$IDday)],])-1,
                                        by=2),"LogId"])
sleepLog <- sleepLog[!(sleepLog$LogId %in% toRemove),]

# re-creating common variable IDday and sanity check
sleepLog$IDday <- as.factor(paste(sleepLog$ID,sleepLog$ActivityDate,sep="_"))
cat("Sanity check:",nrow(sleepLog)==nlevels(sleepLog$IDday)) # no more double cases

## Sanity check: TRUE

Comments:

1 sleepLog case was recoded since its StartTime value was between midnight and 6 AM but its ActivityDate did not refer to the previous day
after the recoding of that case, three double cases (i.e., cases with the same ID and ActivityDate values) remained in the sleepLog dataset, and have been removed
no double cases are included in the dailyAct dataset
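
The logic of the recoding and of the check for double cases can be sketched on toy data (IDs and times are invented). The sketch assumes that the intended rule is a start hour strictly before 6 AM, and it extracts the hour with format() rather than substr().

```r
# toy sleep logs (hypothetical IDs and times)
toy <- data.frame(ID=c("s01","s01","s02"),
                  StartTime=as.POSIXct(c("2020-01-01 23:30:00","2020-01-02 01:15:00",
                                         "2020-01-02 00:40:00"),tz="GMT"),
                  stringsAsFactors=FALSE)
toy$ActivityDate <- as.Date(toy$StartTime,tz="GMT")

# ActivityDate refers to the previous day when StartTime is after midnight and before 6 AM
startHour <- as.numeric(format(toy$StartTime,"%H",tz="GMT"))
toy[startHour < 6,"ActivityDate"] <- toy[startHour < 6,"ActivityDate"] - 1

# common variable IDday and No. of double cases
toy$IDday <- paste(toy$ID,toy$ActivityDate,sep="_")
nDouble <- nrow(toy) - length(unique(toy$IDday))
toy[toy$IDday %in% toy$IDday[duplicated(toy$IDday)],]
```

Here, the two periods of participant s01 end up with the same IDday value after the recoding, which is the kind of double case that needs to be inspected before joining by IDday.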

Then, we can join the two datasets by using the common variable IDday to create the fitbit dataset. Note that the type argument is set to “full” in order to include all the cases included in either one or the other dataset.

fitbit <- plyr::join(sleepLog,dailyAct,by="IDday",type="full") # joining
fitbit <- fitbit[order(fitbit$ID,fitbit$ActivityDate,fitbit$StartTime),] # sorting by ID and time
row.names(fitbit) <- 1:nrow(fitbit) # renaming rows
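
The effect of a full join can be illustrated with a toy example. To keep the sketch free of package dependencies, it uses base R merge() with all=TRUE as a stand-in for plyr::join() with type="full"; the two differ in row ordering, but both keep the cases found in either dataset. All values are invented.

```r
# toy datasets sharing the IDday key (hypothetical values)
sleep.toy <- data.frame(IDday=c("s01_2020-01-01","s01_2020-01-02"),TST=c(420,390))
act.toy <- data.frame(IDday=c("s01_2020-01-02","s01_2020-01-03"),TotalSteps=c(8000,9500))

# full join: cases found in either dataset are kept, with NA where information is missing
fitbit.toy <- merge(sleep.toy,act.toy,by="IDday",all=TRUE)
fitbit.toy
```

Only one of the three resulting days has both sleep and activity information, which is why the missing data inspection is repeated after each join.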

4.5.1. Sanity checks

4.5.1.1. Missing data

Here, we inspect the No. and percentage of nonmissing data for each class of variables included in the fitbit dataset: "classic" (e.g., TST), "stages" (e.g., light), "HR" (e.g., stageHR_NREM), and "Act" (e.g., TotalSteps).

# counting No. and % of missing data per class of variable
Ncases <- nrow(fitbit) # total No. of cases
infoMiss <- data.frame(total=Ncases,Act=nrow(fitbit[!is.na(fitbit$TotalSteps),]),
                       classic=nrow(fitbit[!is.na(fitbit$TST),]),stages=nrow(fitbit[!is.na(fitbit$light),]),
                       HR=nrow(fitbit[!is.na(fitbit$stageHR_NREM),]),
                       classicANDAct=nrow(fitbit[!is.na(fitbit$TST) & !is.na(fitbit$TotalSteps),]),
                       stagesANDAct=nrow(fitbit[!is.na(fitbit$light) & !is.na(fitbit$TotalSteps),]),
                       HRANDAct=nrow(fitbit[!is.na(fitbit$stageHR_NREM) & !is.na(fitbit$TotalSteps),]))
infoMiss <- rbind(infoMiss,Ncases-infoMiss,round(100*infoMiss/Ncases),round(100-100*infoMiss/Ncases))
row.names(infoMiss) <- c("N","Missing","%","%missing")
infoMiss

Comments:

Considering the total No. of cases (i.e., N = 6,169 non missing either in sleepLog or in dailyAct):

Act information is missing in 150 cases (2%)
sleep classic information is missing in 931 (15%) cases
sleep stages information is missing in 1,706 (28%) cases
sleep-stage HR information is missing in 1,737 (28%) cases

4.5.2. Saving datasets

Here, we save the aggregated fitbit dataset.

# saving the aggregated fitbit dataset
save(fitbit,file="DATA/datasets/fitbitSleepAct_aggregated.RData")

4.6. fitbit & dailyDiary

Here, we create the variable IDday to be used for merging the dailyDiary and the fitbit datasets. Note that dailyDiary ActivityDate was recoded to refer to the previous day when StartedTime is between midnight and 8 PM (see section 2.7.5).

# creating common variable IDday
dailyDiary$IDday <- as.factor(paste(dailyDiary$ID,dailyDiary$ActivityDate,sep="_"))

# sanity check
cat("Sanity check:",nrow(dailyDiary)==nlevels(dailyDiary$IDday)) # no double cases

## Sanity check: TRUE

Comments:

no double cases are included in the dailyDiary dataset

Then, we can join the two datasets by using the common variable IDday to create the ema dataset. Note that the type argument is set to “full” in order to include all the cases included in either one or the other dataset.

ema <- plyr::join(fitbit,dailyDiary,by="IDday",type="full") # joining
ema <- ema[order(ema$ID,ema$ActivityDate,ema$StartTime),] # sorting by ID and time
row.names(ema) <- 1:nrow(ema) # renaming rows

4.6.1. Sanity checks

4.6.1.1. Missing data

# counting No. and % of missing data per class of variable
Ncases <- nrow(ema) # total No. of cases
Ndiary <- ema[!is.na(ema$dailyStress) & !is.na(ema$eveningMood) & !is.na(ema$eveningWorry),] # Ndiary = non missing focal vars
infoMiss <- data.frame(total=Ncases,diary=nrow(Ndiary),Act=nrow(ema[!is.na(ema$TotalSteps),]),
                       classic=nrow(ema[!is.na(ema$TST),]),stages=nrow(ema[!is.na(ema$light),]),
                       HR=nrow(ema[!is.na(ema$stageHR_NREM),]),
                       classicANDdiary=nrow(Ndiary[!is.na(Ndiary$TST),]),
                       stagesANDdiary=nrow(Ndiary[!is.na(Ndiary$light),]),
                       HRANDNdiary=nrow(Ndiary[!is.na(Ndiary$stageHR_NREM),]),
                       ActANDdiary=nrow(Ndiary[!is.na(Ndiary$TotalSteps),]))
infoMiss <- rbind(infoMiss,Ncases-infoMiss,round(100*infoMiss/Ncases),round(100-100*infoMiss/Ncases))
row.names(infoMiss) <- c("N","Missing","%","%missing")
infoMiss

Comments:

Considering the total No. of cases (i.e., N = 6,219 non missing either in sleepLog, in dailyAct, or in dailyDiary):

diary information is missing in 1,286 cases (21%)
Act information is missing in 200 cases (3%)
sleep classic information is missing in 981 (16%) cases
sleep stages information is missing in 1,756 (28%) cases
sleep-stage HR information is missing in 1,787 (29%) cases

4.6.1.2. StartedTime

Then, we compare the dailyDiary StartedTime values (i.e., the time at which the survey was answered) with the corresponding sleepLog StartTime values (i.e., lights-off time according to sleepLog data).

ema$diary.timeDiff <- as.numeric(difftime(ema$StartTime,ema$StartedTime,units="hours"))
hist(ema$diary.timeDiff,breaks=100,main="hours between dailyDiary StartedTime and sleepLog StartTime")

# showing 1,000 cases with dailyDiary StartedTime AFTER sleepLog StartTime
ema[!is.na(ema$diary.timeDiff) & ema$diary.timeDiff<0,c("ID","ActivityDate","StartedTime","StartTime","EndTime")]

Comments:

in most cases (N = 3,402, 77%), dailyDiary StartedTime is before sleepLog StartTime
in 1,000 cases (23%), dailyDiary StartedTime is after sleepLog StartTime; the cases printed above show that most of these are cases in which the diary was answered on the following day
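
The sign of diary.timeDiff can be read as follows: positive values indicate diaries answered before lights-off, negative values indicate diaries answered after it. A toy sketch with two invented cases:

```r
# toy cases (hypothetical times): diary answered 2 h before, and 10 h after, lights-off
StartTime <- as.POSIXct(c("2020-01-01 23:00:00","2020-01-02 22:30:00"),tz="GMT")   # sleepLog
StartedTime <- as.POSIXct(c("2020-01-01 21:00:00","2020-01-03 08:30:00"),tz="GMT") # dailyDiary

# hours between diary response and lights-off
timeDiff <- as.numeric(difftime(StartTime,StartedTime,units="hours"))
timeDiff

# % of diaries answered after lights-off
pctAfter <- 100*mean(timeDiff < 0)
```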

4.6.2. Saving datasets

Here, we save the aggregated ema dataset.

# saving the aggregated ema dataset
save(ema,file="DATA/datasets/ema_aggregated.RData")

4.7. ema & demos

Finally, we join the ema dataset (with all daily varying measures) with the demos data (including demographic information of each participant) by using ID as the matching variable. In this case, we set the type argument as "left" in order to only include those participants that were involved in the EMA protocol.

ema <- plyr::join(ema,demos,by="ID",type="left") # joining datasets
ema <- ema[,c(1:2,(ncol(ema)-ncol(demos)+2):ncol(ema),3:(ncol(ema)-ncol(demos)+1))] # demographics vars at the beginning
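
The index arithmetic used to move the demographic variables to the beginning can be checked on a toy example. The sketch uses base R merge() with all.x=TRUE as a stand-in for plyr::join() with type="left" (both place the variables of the second dataset after those of the first one); all names with the .toy suffix and all values are invented.

```r
# toy datasets (hypothetical values)
ema.toy <- data.frame(ID=c("s01","s01","s02"),
                      ActivityDate=as.Date(c("2020-01-01","2020-01-02","2020-01-01")),
                      TST=c(420,390,450))
demos.toy <- data.frame(ID=c("s01","s02","s03"),sex=c("F","M","F"),BMI=c(21.5,19.8,23.1))

# left join: participant s03 (not in ema.toy) is not included
joined <- merge(ema.toy,demos.toy,by="ID",all.x=TRUE)

# first 2 columns, then the ncol(demos.toy)-1 demographic columns, then the remaining columns
joined <- joined[,c(1:2,(ncol(joined)-ncol(demos.toy)+2):ncol(joined),
                    3:(ncol(joined)-ncol(demos.toy)+1))]
colnames(joined)
```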

4.7.1. Sanity checks

Here, we simply check the No. of demos participants that were not included in the ema dataset, and the No. of cases with missing demos values.

# No. demos cases not included (14)
nrow(demos[!(demos$ID %in% levels(as.factor(as.character(ema$ID)))),])

## [1] 14

# no cases with missing demos variables
cat("sanity check:",
    nrow(ema[is.na(ema$sex)|is.na(ema$BMI)|is.na(ema$age)|is.na(ema$insomnia)|is.na(ema$insomnia.group),])==0)

## sanity check: TRUE

Comments:

14 participants were not included because they did not participate in the EMA protocol
no missing data in any demos variable are included in the ema dataset

4.7.2. Saving datasets

Here, we save the aggregated ema dataset.

# saving the aggregated ema dataset (including demographic variables)
save(ema,file="DATA/datasets/emaRetro_aggregated.RData")

5. Data cleaning

Here, we summarize the compliance (No. of non missing data) for each core variable, and we filter the data based on variable-specific criteria.

rm(list=ls()) # emptying the working environment

library(lubridate) # loading required packages

Sys.setenv(TZ="GMT") # setting system time zone to GMT (for consistent temporal synchronization)

# loading processed datasets
load("DATA/datasets/dailyAct_aggregated.RData") # dailyAct
load("DATA/datasets/hourlySteps_recoded.RData") # hourlySteps
load("DATA/datasets/sleepLog_nonComb.RData") # sleepLog_noncomb
load("DATA/datasets/sleepLogEBEHR_aggregated.RData") # sleepLog
load("DATA/datasets/LogId_special.RData") # special LogIds
load("DATA/datasets/HR.1min_recoded.RData") # HR.1min
load("DATA/datasets/emaRetro_aggregated.RData")  # ema
load("DATA/datasets/demos_recoded.RData") # demos

5.1. demos

The demos data include demographic information describing participants: sex, age, BMI, and insomnia groups.

Exclusion criteria based on demos data were applied in the recruitment phase, and no further criteria need to be applied. Exclusion criteria were a past history of and/or current severe medical (e.g., cancer, epilepsy, heart diseases, diabetes) and/or mental (e.g., major depressive disorder) conditions, taking current medication known to affect sleep and/or cardiovascular function (e.g., hypnotics, antihypertensives), self-reporting having breathing-related and/or movement-related sleep disorders, traveling across time zones in the past month, current pregnancy, or breast-feeding (girls).

5.1.1. Compliance

No compliance information is needed to describe demos data since, as highlighted above, no missing data is included in the ema dataset.

cat("No. missing data in demos variables =",
    nrow(ema[is.na(ema$sex) | is.na(ema$age) | is.na(ema$BMI) | is.na(ema$insomnia) | is.na(ema$insomnia.group),])) # 0

## No. missing data in demos variables = 0

5.1.2. Data filtering

As highlighted above, 14 participants were not included because they did not participate in the EMA protocol. No more participants should be excluded based on demos variables.

# removing participants
memory <- demos
demos <- demos[demos$ID %in% levels(as.factor(as.character(ema$ID))),]

cat("No. excluded participants that did not participate in the EMA protocol =",
    nrow(memory)-nrow(demos)) # 14

## No. excluded participants that did not participate in the EMA protocol = 14

5.1.3. Saving dataset

Here, we save the filtered demos dataset.

save(demos,file="DATA/datasets/demos_clean.RData")

5.2. sleepLog

sleepLog data include information describing individual sleep periods detected by the FC3 device, with more than one sleep period being possibly identified within the same day.

Inclusion criteria for sleepLog data were already introduced in section 2.3.3, consisting of the following conditions, identifying our definition of nocturnal sleep period:

Starting between 6 PM and 6 AM

Including at least 180 min (3 hours) of Total Sleep Time (TST)

Possibly interrupted by any number of wake periods of any duration, but with the last sleep period starting before 11 AM

Possibly composed of consecutive sleep periods, but those periods between 6 PM and 11 PM, and between 6 AM and 11 AM are combined only when separated by less than 1.5 hours (otherwise considered as naps)
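
As an illustration, the first two conditions can be expressed as a simple logical rule. The isNocturnal helper below is hypothetical (it is not part of the original pipeline), it ignores conditions 3 and 4 on interruptions and combined periods, and it is applied to invented cases.

```r
# hypothetical helper: checks only condition 1 (start between 6 PM and 6 AM) and condition 2 (TST >= 180 min)
isNocturnal <- function(StartTime,TST){
  startHour <- as.numeric(format(StartTime,"%H",tz="GMT")) +
    as.numeric(format(StartTime,"%M",tz="GMT"))/60 # decimal hours
  (startHour >= 18 | startHour <= 6) & TST >= 180 }

# toy cases (invented): evening start, afternoon start, short sleep, start at exactly 6 AM
StartTime <- as.POSIXct(c("2020-01-01 23:00:00","2020-01-02 14:00:00",
                          "2020-01-03 01:00:00","2020-01-04 06:00:00"),tz="GMT")
TST <- c(420,200,120,300)
nocturnal <- isNocturnal(StartTime,TST)
nocturnal
```

The boundaries are treated as inclusive here, in line with the filtering code in section 5.2.2.1, which only removes cases starting strictly after 6 AM and strictly before 6 PM.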

5.2.1. Compliance

Here, we compute the original No. of distinct sleep periods (LogId) identified by the FC3 device, and the original No. of distinct days with sleepLog minimal information (e.g., time in bed).

cat("Original No. of sleep logs =", # 5,402  +  41  +  336  =  5,779
    nlevels(sleepLog_noncomb$LogId) + length(LogId_special[[2]]) + length(LogId_special[[3]]))

## Original No. of sleep logs = 5779

cat("Original No. of sleepLog days =", # 4,387  +  41  + 336  =  4,764
    nlevels(sleepLog_noncomb$IDday) + length(LogId_special[[2]]) + length(LogId_special[[3]]))

## Original No. of sleepLog days = 4764

For estimating compliance, we need to compute the ratio of the No. of nonmissing days for each participant to the length of the recording period originally required by the study, that is, two months (60 days).

# creating compliance dataset
compliance <- demos[demos$ID %in% levels(ema$ID),]

# adding EBEonly cases to sleepLog_noncomb
sleep_noncomb <- rbind(sleepLog_noncomb[,c("ID","StartTime","EndTime","ActivityDate","LogId")],
                       ema[!is.na(ema$EBEonly) & ema$EBEonly==TRUE,c("ID","StartTime","EndTime","ActivityDate","LogId")])

# updating ActivityDate in sleep_noncomb
sleep_noncomb$StartHour <- as.POSIXct(paste(hour(sleep_noncomb$StartTime),
                                            minute(sleep_noncomb$StartTime)),format="%H %M",tz="GMT")
h00 <- as.POSIXct(paste(substr(Sys.time(),1,10),"00:00:00"),tz="GMT")
h06 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")
sleep_noncomb[sleep_noncomb$StartHour >= h00 & sleep_noncomb$StartHour <= h06,"ActivityDate"] <-
  sleep_noncomb[sleep_noncomb$StartHour >= h00 & sleep_noncomb$StartHour <= h06,"ActivityDate"] - 1
sleep_noncomb$IDday <- as.factor(paste(sleep_noncomb$ID,sleep_noncomb$ActivityDate,sep="_"))

# computing compliance (reference: 60 days)
Nsleep <- ema[!is.na(ema$TIB),]
for(i in 1:nrow(compliance)){
  compliance[i,"nSleep"] <- 
    nlevels(as.factor(as.character(sleep_noncomb[as.character(sleep_noncomb$ID)==compliance[i,"ID"],"IDday"]))) 
  compliance[i,"periodSleep"] <- 
    difftime(max(sleep_noncomb[as.character(sleep_noncomb$ID)==compliance[i,"ID"],"StartTime"]),
             min(sleep_noncomb[as.character(sleep_noncomb$ID)==compliance[i,"ID"],"StartTime"]),units="days") }
compliance$compl.sleep <- 100*compliance$nSleep/60 # % of days on two months

# printing info
cat("sleepLog data:\n- Original No. days/participants =",
    round(mean(compliance$nSleep),2)," ( SD =",round(sd(compliance$nSleep),2),
    ") \n- Original sleepLog compliance =",
    round(mean(compliance$compl.sleep),2),"% ( SD =",round(sd(compliance$compl.sleep),2),
    ") \n- No. of missing days =",
    round(mean(as.numeric(compliance$periodSleep)-compliance$nSleep),2),
    " ( SD =",round(sd(as.numeric(compliance$periodSleep)-compliance$nSleep),2),")")

## sleepLog data:
## - Original No. days/participants = 55.52  ( SD = 13.01 ) 
## - Original sleepLog compliance = 92.53 % ( SD = 21.69 ) 
## - No. of missing days = 22.35  ( SD = 41.39 )

Comments:

compared to the criterion of having two months of continuous recordings, sleepLog data show an original compliance of 92.53%
however, within the period of recording, an average of 22 missing days are observed
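
The compliance index can be sketched on toy data as the No. of distinct days with sleep data per participant, expressed as a percentage of the 60 required days. The IDs and days below are invented, and tapply is only one possible alternative to the loop used above.

```r
# toy sleep logs: participant s01 has two logs on the same day (hypothetical values)
toy <- data.frame(ID=c("s01","s01","s01","s02"),
                  IDday=c("s01_2020-01-01","s01_2020-01-01","s01_2020-01-02","s02_2020-01-01"),
                  stringsAsFactors=FALSE)

# No. of distinct days per participant
nSleep <- tapply(toy$IDday,toy$ID,function(x) length(unique(x)))

# compliance: % of days over the two required months (60 days)
compl.sleep <- 100*nSleep/60
round(compl.sleep,2)
```

Note that this index counts days with at least one sleep log regardless of when they occurred, whereas the No. of missing days reported above is computed relative to each participant's actual recording period.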

5.2.2. Data filtering

sleepLog data was already filtered multiple times in the sections above:

339 cases were excluded because StartTime was between 11 AM and 6 PM, in section 2.3.3
67 cases were combined to previous or following sleep periods in section 2.3.3
4 cases were manually excluded based on visual inspection (early-evening or late-morning naps) in section 2.3.3
75 cases were removed after sleep combination because StartTime was > 6 AM
57 cases were manually excluded as they were cases of early-evening naps (StartTime < 6 PM) recorded before the subsequent nocturnal sleep periods, in section 2.3.4.
3 cases were manually excluded as they were cases of diurnal sleep periods with the same IDday value of following nocturnal sleep periods, in section 4.5.

In summary, a total of 545 cases were excluded, mainly due to condition 1 (i.e., StartTime between 6 AM and 6 PM). Here, the count returned by the code below (541) is 4 observations lower than expected, probably due to double cases.

# LogId back as factor, and printing info
ema$LogId <- as.factor(as.character(ema$LogId))
cat("No. of excluded cases with StartTime between 6 AM and 6 PM =",
    nrow(sleepLog_noncomb) + length(LogId_special[[2]]) + length(LogId_special[[3]]) - 
      nrow(ema[!is.na(ema$LogId),])) # 541 (4 fewer cases than expected (?))

## No. of excluded cases with StartTime between 6 AM and 6 PM = 541

5.2.2.1. StartTime 6:00

Here, we remove three further sleepLog cases with StartTime between 6 AM and 6 PM.

# selecting sleepLogVars
sleepLogVars <- colnames(ema)[which(colnames(ema)=="LogId"):which(colnames(ema)=="nHR_REM")]

# re-computing StartHour
ema$StartHour <- as.POSIXct(paste(hour(ema$StartTime),minute(ema$StartTime)),format="%H %M",tz="GMT")

# removing sleepLog variables from cases with StartTime between 6 AM and 6 PM
memory <- ema
h6 <- as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT")
h18 <- as.POSIXct(paste(substr(Sys.time(),1,10),"18:00:00"),tz="GMT")
ema[!is.na(ema$StartHour) & ema$StartHour > h6 & ema$StartHour < h18, sleepLogVars] <- NA

# print info (3 removed cases)
cat("No. excluded cases with StartTime between 6 AM and 6 PM =",
    nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))

## No. excluded cases with StartTime between 6 AM and 6 PM = 3

# showing 3 removed cases
memory[!is.na(memory$TIB) & is.na(ema$TIB),c("ID","ActivityDate","StartTime","EndTime","TST","EBEonly","EBEDataType")]

Comments:

only 3 cases were removed due to StartTime between 6 AM and 6 PM
all the 3 cases were cases of diurnal sleep periods computed from classicEBE data (i.e., cases of uniqueClassicLog; see section 4.3.3), with only two of these cases having a TST > 3h
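As an alternative sketch of the same time-of-day filter, clock times can be compared as minutes since midnight, which avoids building a reference date from Sys.time(). The toy data frame and its values are hypothetical, and this is not the procedure used above.

```r
# toy data: three sleep periods with different start times (hypothetical values)
toy <- data.frame(LogId = c("a","b","c"),
                  StartTime = as.POSIXct(c("2019-03-01 22:45:00",
                                           "2019-03-02 13:10:00",
                                           "2019-03-03 05:30:00"), tz = "GMT"),
                  TST = c(420, 200, 95))

# minutes elapsed since midnight, independent of the calendar date
startMin <- as.numeric(format(toy$StartTime, "%H", tz = "GMT")) * 60 +
  as.numeric(format(toy$StartTime, "%M", tz = "GMT"))

# flagging start times strictly between 06:00 and 18:00
diurnal <- !is.na(startMin) & startMin > 6*60 & startMin < 18*60

# setting sleep variables to NA instead of dropping rows
toy[diurnal, c("LogId","TST")] <- NA
print(toy)
```

Rows are kept and only the sleep variables are blanked, so that other measures recorded on the same day remain available.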

As a further sanity check, we inspect the distribution of nonmissing StartHour times.

# recomputing StartHour
ema$StartHour <- as.POSIXct(paste(hour(ema$StartTime),minute(ema$StartTime)),format="%H %M",tz="GMT")

# sanity check (no more cases )
cat("Sanity check:",nrow(ema[!is.na(ema$StartHour) & ema$StartHour > h6 & ema$StartHour < h18,])==0)

## Sanity check: TRUE

# plotting StartHour
hist(ema$StartHour,breaks=100,col="black",xlab="",main="StartHour")

5.2.2.2. TST < 3h

Then, we filter sleepLog cases with a TST < 3 hours (i.e., condition 2 of our definition of nocturnal sleep period).

# removing sleepLog variables from cases with TST < 180 min (3h)
memory <- ema
ema[!is.na(ema$TST) & ema$TST < 180, sleepLogVars] <- NA

# print info (64 removed cases)
cat("No. excluded cases with TST < 3h =",
    nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))

## No. excluded cases with TST < 3h = 64

# sanity check
summary(ema$TST)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   180.0   363.2   416.0   414.3   466.5   886.0    1048

Comments:

64 cases were removed because a TST < 3h does not match our definition of nocturnal sleep period
included TST values now range from 180 min (3h) to 886 min (about 14.8h)
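The before/after counting pattern used throughout this section (copying the dataset, blanking the sleep variables, and comparing the No. of nonmissing TIB values) can be illustrated on a toy example. All values and the sleepVars vector below are hypothetical.

```r
# toy dataset with hypothetical values (TIB and TST in minutes)
toy <- data.frame(LogId = 1:5,
                  TIB = c(480, 200, 510, 150, NA),
                  TST = c(430, 170, 455, 120, NA))
sleepVars <- c("LogId","TIB","TST")

# keeping a copy, then blanking sleep variables for TST < 180 min
memory <- toy
toy[!is.na(toy$TST) & toy$TST < 180, sleepVars] <- NA

# No. of removed cases = difference in nonmissing TIB values
nRemoved <- sum(!is.na(memory$TIB)) - sum(!is.na(toy$TIB))
cat("No. excluded cases with TST < 3h =", nRemoved, "\n")
```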

5.2.2.3. Isolated sleepLogs

Then, we inspect more closely the cases of temporally isolated sleep periods, that is, sleepLog data recorded substantially later than all other sleepLog data previously recorded from the same participant. As shown in section 2.3, extreme numbers of consecutive missing days are mainly due to such isolated sleep periods.

# computing LAG values for ActivityDate
library(dplyr)
ema <- ema[order(ema$ID,ema$ActivityDate,ema$StartTime),]
Nsleep <- ema[!is.na(ema$TIB),]
Nsleep <- Nsleep %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nsleep <- as.data.frame(Nsleep)
detach("package:dplyr", unload=TRUE)

# computing and plotting time lags
Nsleep$lag <- as.numeric(difftime(Nsleep$ActivityDate,Nsleep$AD_lag,units="days"))
hist(Nsleep$lag,main="Consecutive missing days day(i) - day(i-1)",breaks=100)

# printing info
n <- c(10,15,20,30,50)
for(i in 1:length(n)){ cat("\n- No. cases with >",n[i],"consecutive missing days =",
                           nrow(Nsleep[(!is.na(Nsleep$lag) & Nsleep$lag>n[i]),])) }

## 
## - No. cases with > 10 consecutive missing days = 41
## - No. cases with > 15 consecutive missing days = 32
## - No. cases with > 20 consecutive missing days = 25
## - No. cases with > 30 consecutive missing days = 14
## - No. cases with > 50 consecutive missing days = 8

# showing 41 cases with more than 10 missing days
Nsleep.vars <- c("ID","ActivityDate","lag",sleepLogVars)
isolatedSleep <- as.data.frame(matrix(nrow=0,ncol=length(Nsleep.vars)+1))
for(i in 1:nrow(Nsleep)){
  if(!is.na(Nsleep[i,"lag"]) & Nsleep[i,"lag"]>10){ 
    isolatedSleep <- rbind(isolatedSleep,Nsleep[(i-2):(i+2),c(Nsleep.vars,"dailyStress")]) } }
isolatedSleep[,c("ID","LogId","ActivityDate","lag","StartTime","EBEonly","dailyStress")]

Comments:

consecutive missing days of sleepLog data (i.e., No. of days between each observation and the preceding one from the same participant) are higher than 10 for only 41 cases (0.7%), with even fewer cases showing more than 15, 20, 30, and 50 consecutive missing days
about half of these cases are due to isolated final sleep periods recorded several days later than the previous sleep period, with no corresponding dailyDiary values
only in two of these cases (LogId 26112469243 and 26112469243) sleepLog data was computed based on EBEonly
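The lag computation can also be sketched in base R, without attaching and detaching dplyr. The example below uses ave() on a hypothetical toy dataset and is not the code used in this report. Note that the lag is the distance in days between consecutive observations, so the No. of days actually missing in between is lag - 1.

```r
# toy data: two hypothetical participants
toy <- data.frame(ID = c("s001","s001","s001","s002","s002"),
                  ActivityDate = as.Date(c("2019-03-01","2019-03-02","2019-03-20",
                                           "2019-04-01","2019-04-03")))
toy <- toy[order(toy$ID, toy$ActivityDate), ]

# lag (days) from the previous observation of the same participant;
# the first observation of each participant has no predecessor (NA)
toy$lag <- ave(as.numeric(toy$ActivityDate), toy$ID,
               FUN = function(x) c(NA, diff(x)))

# flagging gaps longer than 10 days
toy$longGap <- !is.na(toy$lag) & toy$lag > 10
print(toy)
```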

Here, we use the isolatedSleep.rm function to progressively remove cases of isolated sleep periods.


isolatedSleep.rm <- function(SLEEPdata=NA,DayDiff.max=10,DayDiff.nDays=1,printInfo=TRUE,showData=FALSE){ 
  
  # preparing vector that will include all the filtered LogId values
  ISOLogs <- character()
  
  for(h in 1:10000){
    require(dplyr)
    SLEEPdata <- SLEEPdata %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
    SLEEPdata <- as.data.frame(SLEEPdata)
    detach("package:dplyr", unload=TRUE)
    
    # computing and plotting time lags
    SLEEPdata$lag <- as.numeric(difftime(SLEEPdata$ActivityDate,SLEEPdata$AD_lag,units="days"))
  
    # printing info
    if(printInfo==TRUE){ n <- c(10,15,20,30,50)
      for(i in 1:length(n)){ cat("\n - No. cases with >",n[i],"consecutive missing days =",
                                 nrow(SLEEPdata[(!is.na(SLEEPdata$lag) & SLEEPdata$lag>n[i]),])) }}
  
    # creating list of cases with each lag value > DayDiff.max and the previous and following DayDiff.nDays + 1 cases 
    sleep.vars <- c("ID","LogId","ActivityDate","lag","StartTime","EndTime","EBEonly")
    isolatedSleep <- list()
    for(i in 3:nrow(SLEEPdata)){ 
      if(!is.na(SLEEPdata[i,"lag"]) & SLEEPdata[i,"lag"]>DayDiff.max){ 
        isolatedSleep[[length(isolatedSleep)+1]] <- SLEEPdata[(i-DayDiff.nDays-1):(i+DayDiff.nDays+1),sleep.vars] } }
  
    # creating vector of isolated sleep cases' LogId values
    isolatedLogs <- character()
    for(k in seq_along(isolatedSleep)){ # seq_along avoids iterating over an empty list
      for(i in 1:(nrow(isolatedSleep[[k]])-DayDiff.nDays)){ 
        if(!is.na(isolatedSleep[[k]][i,"lag"]) & isolatedSleep[[k]][i,"lag"]>DayDiff.max &
           isolatedSleep[[k]][i,"ID"] != isolatedSleep[[k]][i+DayDiff.nDays,"ID"]){
      isolatedLogs <- c(isolatedLogs,as.character(isolatedSleep[[k]][(i-DayDiff.nDays+1):i,"LogId"])) } }}
    isolatedLogs <- levels(as.factor(isolatedLogs))
    
    # data filtering and printing info
    if(length(isolatedLogs)>0){
      ISOLogs <- c(ISOLogs,isolatedLogs)
      memory <- SLEEPdata
      SLEEPdata <- SLEEPdata[!(SLEEPdata$LogId %in% isolatedLogs),] # removing cases
      if(printInfo==TRUE){
        cat("\n\nCycle No.",h,": Excluding",nrow(memory[!is.na(memory$TIB),])-nrow(SLEEPdata[!is.na(SLEEPdata$TIB),]),
          "cases of isolated sleep periods")}
      
      # when no more cases of isolated sleep periods
    } else{
      if(printInfo==TRUE){
        cat("\n\nNo more cases with >",DayDiff.max,"consecutive missing days due to isolated sleep periods",
            "\nTotal No. of isolated sleep periods to be filtered =",length(ISOLogs)) }
      if(showData==TRUE){
        cat("\nshowing all remaining cases with >",DayDiff.max,"consecutive missing days:\n")
        print(isolatedSleep) }
      break }}
   
  return(ISOLogs) }

# running function and selecting cases of isolated sleep periods at the end of participants' recording period
isoSleep <- isolatedSleep.rm(SLEEPdata=ema[!is.na(ema$TIB),],DayDiff.max=10,DayDiff.nDays=1,showData=FALSE)

## 
##  - No. cases with > 10 consecutive missing days = 41
##  - No. cases with > 15 consecutive missing days = 32
##  - No. cases with > 20 consecutive missing days = 25
##  - No. cases with > 30 consecutive missing days = 14
##  - No. cases with > 50 consecutive missing days = 8
## 
## Cycle No. 1 : Excluding 20 cases of isolated sleep periods
##  - No. cases with > 10 consecutive missing days = 21
##  - No. cases with > 15 consecutive missing days = 14
##  - No. cases with > 20 consecutive missing days = 9
##  - No. cases with > 30 consecutive missing days = 2
##  - No. cases with > 50 consecutive missing days = 2
## 
## Cycle No. 2 : Excluding 6 cases of isolated sleep periods
##  - No. cases with > 10 consecutive missing days = 15
##  - No. cases with > 15 consecutive missing days = 10
##  - No. cases with > 20 consecutive missing days = 5
##  - No. cases with > 30 consecutive missing days = 1
##  - No. cases with > 50 consecutive missing days = 1
## 
## No more cases with > 10 consecutive missing days due to isolated sleep periods 
## Total No. of isolated sleep periods to be filtered = 26

Comments:

the function identified 26 cases of isolated sleep periods accounting for 26 of the 41 cases of consecutive missing days > 10
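The idea of progressively removing isolated final sleep periods can be sketched with a much simpler routine. The helper below (dropTrailingIsolated, a hypothetical name) only drops a participant's last observation when it follows a gap longer than maxGap days, and repeats until no such case remains. It is a simplified illustration on toy data, not an equivalent of isolatedSleep.rm, which additionally considers a window of DayDiff.nDays rows.

```r
# hypothetical helper: repeatedly drops a participant's LAST observation
# when it was recorded more than maxGap days after the previous one
dropTrailingIsolated <- function(data, maxGap = 10) {
  repeat {
    data <- data[order(data$ID, data$ActivityDate), ]
    # gap (days) from the previous observation of the same participant
    gap <- ave(as.numeric(data$ActivityDate), data$ID,
               FUN = function(x) c(NA, diff(x)))
    # TRUE for the last row of each participant
    isLast <- !duplicated(data$ID, fromLast = TRUE)
    isolated <- isLast & !is.na(gap) & gap > maxGap
    if (!any(isolated)) break
    data <- data[!isolated, ]
  }
  data
}

# toy data (hypothetical): s001 ends with two isolated observations
toy <- data.frame(ID = c(rep("s001", 4), rep("s002", 3)),
                  ActivityDate = as.Date(c("2019-03-01","2019-03-02","2019-03-20","2019-04-15",
                                           "2019-04-01","2019-04-02","2019-04-03")))
clean <- dropTrailingIsolated(toy, maxGap = 10)
print(clean)
```

More than one cycle is needed because removing a final observation can expose a new final observation that is itself isolated.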

Here, we remove these 26 cases from the ema dataset.

# removing sleepLog variables from cases corresponding to isoSleep
memory <- ema
ema[!is.na(ema$TST) & ema$LogId %in% isoSleep, sleepLogVars] <- NA

# print info (26 removed cases)
cat("No. excluded cases of isolated sleep =",
    nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))

## No. excluded cases of isolated sleep = 26

Then, we further inspect the remaining cases with 10+ missing consecutive sleepLog days.

# computing LAG values for ActivityDate
library(dplyr)
Nsleep <- ema[!is.na(ema$TIB),]
Nsleep <- Nsleep %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nsleep <- as.data.frame(Nsleep)
detach("package:dplyr", unload=TRUE)
Nsleep$lag <- as.numeric(difftime(Nsleep$ActivityDate,Nsleep$AD_lag,units="days"))

# showing 15 cases with more than 10 missing days
Nsleep.vars <- c("ID","ActivityDate","lag",sleepLogVars)
isolatedSleep <- as.data.frame(matrix(nrow=0,ncol=length(Nsleep.vars)))
for(i in 1:nrow(Nsleep)){
  if(!is.na(Nsleep[i,"lag"]) & Nsleep[i,"lag"]>10){ 
    isolatedSleep <- rbind(isolatedSleep,Nsleep[(i-3):(i+3),c(Nsleep.vars,"dailyStress")]) } }
isolatedSleep[,c("ID","LogId","ActivityDate","lag","StartTime","EBEonly","dailyStress")]

Comments:

some of the remaining cases of 10+ consecutive missing days are still due to isolated final sleep periods recorded several days later than the previous sleep period, with most of them having missing values for dailyDiary variables
however, these cases were not filtered above because of one or two cases with lag < 10 between them and the end of the data collection period for a given participant

Here, we use the isolatedSleep.rm function to filter these cases by considering two consecutive rows instead of one (i.e., DayDiff.nDays is set to 2).

# running function and selecting cases of isolated sleep periods at the end of participants' recording period
isoSleep <- isolatedSleep.rm(SLEEPdata=ema[!is.na(ema$TIB),],DayDiff.max=10,DayDiff.nDays=2,showData=FALSE)

## 
##  - No. cases with > 10 consecutive missing days = 15
##  - No. cases with > 15 consecutive missing days = 10
##  - No. cases with > 20 consecutive missing days = 5
##  - No. cases with > 30 consecutive missing days = 1
##  - No. cases with > 50 consecutive missing days = 1
## 
## Cycle No. 1 : Excluding 8 cases of isolated sleep periods
##  - No. cases with > 10 consecutive missing days = 15
##  - No. cases with > 15 consecutive missing days = 10
##  - No. cases with > 20 consecutive missing days = 6
##  - No. cases with > 30 consecutive missing days = 2
##  - No. cases with > 50 consecutive missing days = 1
## 
## Cycle No. 2 : Excluding 8 cases of isolated sleep periods
##  - No. cases with > 10 consecutive missing days = 11
##  - No. cases with > 15 consecutive missing days = 6
##  - No. cases with > 20 consecutive missing days = 3
##  - No. cases with > 30 consecutive missing days = 0
##  - No. cases with > 50 consecutive missing days = 0
## 
## Cycle No. 3 : Excluding 2 cases of isolated sleep periods
##  - No. cases with > 10 consecutive missing days = 10
##  - No. cases with > 15 consecutive missing days = 6
##  - No. cases with > 20 consecutive missing days = 3
##  - No. cases with > 30 consecutive missing days = 0
##  - No. cases with > 50 consecutive missing days = 0
## 
## Cycle No. 4 : Excluding 2 cases of isolated sleep periods
##  - No. cases with > 10 consecutive missing days = 9
##  - No. cases with > 15 consecutive missing days = 5
##  - No. cases with > 20 consecutive missing days = 3
##  - No. cases with > 30 consecutive missing days = 0
##  - No. cases with > 50 consecutive missing days = 0
## 
## No more cases with > 10 consecutive missing days due to isolated sleep periods 
## Total No. of isolated sleep periods to be filtered = 20

Comments:

the function identified a further 20 cases of isolated sleep periods, accounting for 6 of the 15 cases of consecutive missing days > 10, and for both the remaining two cases with 30+ consecutive missing days
note that the 20 cases also include the sleep periods (N = 10) recorded between each case with lag > 10 and the end of the data collection period for that participant

Here, we remove these 20 cases from the ema dataset.

# removing sleepLog variables from cases corresponding to isoSleep
memory <- ema
ema[!is.na(ema$TST) & ema$LogId %in% isoSleep, sleepLogVars] <- NA

# print info (20 removed cases)
cat("No. excluded cases of isolated sleep =",
    nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))

## No. excluded cases of isolated sleep = 20

Then, we further inspect the remaining cases with 10+ missing consecutive sleepLog days.

# computing LAG values for ActivityDate
library(dplyr)
Nsleep <- ema[!is.na(ema$TIB),]
Nsleep <- Nsleep %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nsleep <- as.data.frame(Nsleep)
detach("package:dplyr", unload=TRUE)
Nsleep$lag <- as.numeric(difftime(Nsleep$ActivityDate,Nsleep$AD_lag,units="days"))

# showing 9 cases with more than 10 missing days
Nsleep.vars <- c("ID","ActivityDate","lag",sleepLogVars)
isolatedSleep <- as.data.frame(matrix(nrow=0,ncol=length(Nsleep.vars)))
for(i in 1:nrow(Nsleep)){
  if(!is.na(Nsleep[i,"lag"]) & Nsleep[i,"lag"]>10){ 
    isolatedSleep <- rbind(isolatedSleep,Nsleep[(i-4):(i+4),c(Nsleep.vars,"dailyStress")]) } }
isolatedSleep[,c("ID","LogId","ActivityDate","lag","StartTime","EBEonly","dailyStress")]

Comments:

some of the remaining cases of 10+ consecutive missing days are still due to isolated final sleep periods recorded several days later than the previous sleep period, 4 of which have missing dailyDiary values
however, these cases were not filtered above because of two cases with lag < 10 between them and the end of the data collection period for a given participant

Here, we manually remove these 4 cases:

# manually removing sleepLog variables from cases corresponding to isoSleep
memory <- ema
isoSleep <- c("23751186378","23751186379","23758815846", # s041 - removing last 3 cases (20 days after)
              "23769564082") # s050 - removing first 1 case (21 days before)
ema[!is.na(ema$TIB) & ema$LogId %in% isoSleep, sleepLogVars] <- NA

# print info (4 removed cases)
cat("No. excluded cases of isolated sleep =",
    nrow(memory[!is.na(memory$TIB),])-nrow(ema[!is.na(ema$TIB),]))

## No. excluded cases of isolated sleep = 4

In summary, we removed a total of 50 isolated observations recorded 10+ days before or after the remaining observations obtained from the same participant. A total of six cases still show a No. of consecutive missing days between 11 and 26.

# computing LAG values for ActivityDate
library(dplyr)
Nsleep <- ema[!is.na(ema$TIB),]
Nsleep <- Nsleep %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Nsleep <- as.data.frame(Nsleep)
detach("package:dplyr", unload=TRUE)
Nsleep$lag <- as.numeric(difftime(Nsleep$ActivityDate,Nsleep$AD_lag,units="days"))

# printing info
cat("No. of cases with 11 to",max(Nsleep$lag,na.rm=TRUE),"consecutive missing days =",
    nrow(Nsleep[(!is.na(Nsleep$lag) & Nsleep$lag>10 & Nsleep$lag<max(Nsleep$lag,na.rm=TRUE)),]))

## No. of cases with 11 to 26 consecutive missing days = 6

5.2.2.4. Duplicated

Finally, we check again for duplicated cases, that is cases with the same sleepLog values, or with the same ID and ActivityDate values.

# creating IDday variable
ema$IDday <- as.factor(paste(ema$ID,ema$ActivityDate,sep="_"))
Nsleep <- ema[!is.na(ema$TIB),]

# sanity check by LogId (0 cases)
cat("Sanity check:",nrow(Nsleep[duplicated(Nsleep$LogId),])==0)

## Sanity check: TRUE

# sanity check by IDday (0 cases)
cat("Sanity check:",nrow(Nsleep[duplicated(Nsleep$IDday),])==0)

## Sanity check: TRUE

Comments:

no duplicated cases are included in the ema dataset for sleepLog variables
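A toy example may help in reading these checks: duplicated() flags only the second and later occurrences of a value, so a count of zero means that each LogId and each ID-by-day combination appears once. The data below are hypothetical.

```r
# toy data with one repeated ID-by-day combination (hypothetical values)
toy <- data.frame(ID = c("s001","s001","s002"),
                  ActivityDate = as.Date(c("2019-03-01","2019-03-01","2019-03-01")),
                  LogId = c("101","102","103"))
toy$IDday <- paste(toy$ID, toy$ActivityDate, sep = "_")

# duplicated() flags the second and later occurrences only
dupLogId <- toy[duplicated(toy$LogId), ]
dupIDday <- toy[duplicated(toy$IDday), ]
cat("Duplicated LogId:", nrow(dupLogId), "- duplicated IDday:", nrow(dupIDday), "\n")

# showing all rows involved in a duplicate, including the first occurrence
toy[toy$IDday %in% dupIDday$IDday, ]
```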

5.2.3. Summary of data

In summary, from the original No. of sleep logs (N = 5,779 sleep periods recorded over 4,764 days), 67 cases (1%) were combined with preceding or following consecutive sleep periods, 478 + 3 = 481 cases (8%) were removed due to StartTime between 6 AM and 6 PM, and 64 cases (1%) were removed due to TST < 3h. 50 cases (1%) of isolated sleep periods recorded 10 or more days before or after the remaining observations from the same participant were also removed.

Thus, data cleaning led to a total No. of excluded sleep periods = 662 cases (13.9%).
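The total can be reproduced from the counts reported in this summary. Since the text does not state which denominator the percentage refers to, the sketch below shows both candidates (No. of days and No. of sleep periods); the variable names are illustrative.

```r
# counts reported in the summary (variable names are illustrative)
combined  <- 67       # combined with adjacent sleep periods
startTime <- 478 + 3  # StartTime between 6 AM and 6 PM
shortTST  <- 64       # TST < 3h
isolated  <- 50       # isolated sleep periods

total <- combined + startTime + shortTST + isolated
cat("Total =", total,
    "\n% of 4,764 days =", round(100 * total / 4764, 1),
    "\n% of 5,779 sleep periods =", round(100 * total / 5779, 1), "\n")
```

Under these figures, the reported 13.9% appears to correspond to the No. of days as the denominator.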

# 658 (4 fewer than expected)
cat("Total No. of excluded cases =",nrow(sleep_noncomb) - nrow(ema[!is.na(ema$TIB),]))

## Total No. of excluded cases = 658

Here, we compute the updated information on the non-missing data related to sleepLog variables.

# updating compliance dataset
Nsleep <- ema[!is.na(ema$TIB),]

# computing compliance_clean (reference: 60 days)
for(i in 1:nrow(compliance)){
  compliance[i,"nSleep_clean"] <- 
    nlevels(as.factor(as.character(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"],"IDday"]))) 
  compliance[i,"periodSleep_clean"] <- 
    difftime(max(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"],"StartTime"]),
             min(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"],"StartTime"]),units="days") }
compliance$compl.sleep_clean <- 100*compliance$nSleep_clean/60 # % of days on two months

# printing compliance info
cat("\n\nsleepLog data:\n- 'Cleaned' No. days/participants =",
    round(mean(compliance$nSleep_clean),2)," ( SD =",round(sd(compliance$nSleep_clean),2),
    ") \n- 'Cleaned' sleepLog compliance =",
    round(mean(compliance$compl.sleep_clean),2),"% ( SD =",round(sd(compliance$compl.sleep_clean),2),
    ") \n- No. of missing days =",
    round(mean(as.numeric(compliance$periodSleep_clean)-compliance$nSleep_clean),2),
    " ( SD =",round(sd(as.numeric(compliance$periodSleep_clean)-compliance$nSleep_clean),2),")")

## 
## 
## sleepLog data:
## - 'Cleaned' No. days/participants = 55.06  ( SD = 14.03 ) 
## - 'Cleaned' sleepLog compliance = 91.77 % ( SD = 23.38 ) 
## - No. of missing days = 5.82  ( SD = 9.05 )

# plotting No. of cases
hist(compliance$nSleep_clean,breaks=50,main="No. of nonmissing sleep periods per participant",xlab="")

Comments:

compliance is almost identical to that reported for the full dataset
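For illustration, the same compliance metrics can be computed per participant with tapply() on a toy dataset. The values are hypothetical, and the recording period is counted here inclusively (last day - first day + 1), which is an assumption of this sketch and differs from the raw time difference used above.

```r
# toy data: nonmissing sleep periods for two hypothetical participants
toy <- data.frame(ID = c(rep("s001", 3), rep("s002", 2)),
                  ActivityDate = as.Date(c("2019-03-01","2019-03-02","2019-03-05",
                                           "2019-04-01","2019-04-02")))
day <- as.numeric(toy$ActivityDate)  # days as numbers, for simple arithmetic

# No. of distinct days with sleep data per participant
nSleep <- tapply(day, toy$ID, function(x) length(unique(x)))

# recording period in days, counting both the first and the last day
period <- tapply(day, toy$ID, function(x) max(x) - min(x) + 1)

# compliance relative to the 60-day reference, and missing days within the period
compl <- 100 * nSleep / 60
nMissing <- period - nSleep
print(data.frame(nSleep, period, compl = round(compl, 2), nMissing))
```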

5.2.4. Sleep stages

Finally, we summarize the No. and % of sleepLog cases with sleep stage data.

# computing No. of cases with nonmissing sleep stage data
for(i in 1:nrow(compliance)){
  compliance[i,"nSleepStage_clean"] <- 
    nlevels(as.factor(as.character(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"] & 
                                            !is.na(Nsleep$light),"IDday"])))  }

# printing compliance info
cat("\n\nsleepLog data (nonmissing sleep stages):\n- 'Cleaned' No. days/participants =",
    round(mean(compliance$nSleepStage_clean),2)," ( SD =",round(sd(compliance$nSleepStage_clean),2),
    ") \n- sleep stages compliance =",
    round(100*mean(compliance$nSleepStage_clean/60),2),"% ( SD =",
    round(sd(100*compliance$nSleepStage_clean/60),2),
    ") \n- non missing sleep stages/Total No. 'clean' cases =",
    round(mean(100*compliance$nSleepStage_clean/compliance$nSleep_clean),2),
    "% ( SD =",round(sd(100*compliance$nSleepStage_clean/compliance$nSleep_clean),2),")")

## 
## 
## sleepLog data (nonmissing sleep stages):
## - 'Cleaned' No. days/participants = 47.32  ( SD = 14.57 ) 
## - sleep stages compliance = 78.87 % ( SD = 24.28 ) 
## - non missing sleep stages/Total No. 'clean' cases = 85.98 % ( SD = 15.01 )

# plotting No. of cases
hist(compliance$nSleepStage_clean,breaks=50,main="No. of nonmissing sleep stage values per participant",xlab="")

Comments:

the percentage of missing data (relative to the 60-day reference) increased from 8.23% for classic sleepLog data to 21.13% for sleep stage data

5.3. sleepHR

sleepHR data include information describing heart rate (HR) mean values for NREM and REM sleep, as computed in section 4.4.

5.3.1. Compliance

Here, we summarize the No. and % of sleepLog cases with sleep-related HR data.

# computing No. of cases with nonmissing sleep HR data
for(i in 1:nrow(compliance)){
  compliance[i,"nSleepHR"] <- 
    nlevels(as.factor(as.character(Nsleep[as.character(Nsleep$ID)==compliance[i,"ID"] & 
                                            !is.na(Nsleep$stageHR_NREM),"IDday"])))  }

# printing compliance info
cat("\n\nsleepLog data (nonmissing TST-related HR):\n- 'Cleaned' No. days/participants =",
    round(mean(compliance$nSleepHR),2)," ( SD =",round(sd(compliance$nSleepHR),2),
    ") \n- sleep HR compliance =",
    round(mean(100*compliance$nSleepHR/60),2),"% ( SD =",
    round(sd(100*compliance$nSleepHR/60),2),
    ") \n- non missing stageHR_NREM/Total No. 'clean' cases =",
    round(mean(100*compliance$nSleepHR/compliance$nSleep_clean),2),
    "% ( SD =",round(sd(100*compliance$nSleepHR/compliance$nSleep_clean),2),")")

## 
## 
## sleepLog data (nonmissing TST-related HR):
## - 'Cleaned' No. days/participants = 47.03  ( SD = 14.3 ) 
## - sleep HR compliance = 78.39 % ( SD = 23.83 ) 
## - non missing stageHR_NREM/Total No. 'clean' cases = 85.54 % ( SD = 14.98 )

# plotting No. of cases
hist(compliance$nSleepHR,breaks=50,main="No. of nonmissing stageHR_NREM values per participant",xlab="")

Comments:

the percentage of 'clean' cases with missing HR data (14.46%) is similar to that observed for sleep staging information (14.02%)

5.3.2. Data filtering

sleepHR data was computed in section 4.4 from all the available epochs recorded within each TIB interval. Thus, sleepHR data can be filtered both by accounting for the No. of missing HR epochs used for computing each measure, and by accounting for the range of HR values.

5.3.2.1. Missing epochs

Here, we inspect the distribution of the difference between the No. of minutes in each sleep period (TST), and the total No. of epochs (i.e., minutes) used for computing sleepHR variables (nHR_TST)

# computing and plotting differences between expected and observed No. of TST epochs
NsleepHR <- ema[!is.na(ema$stageHR_NREM),]
NsleepHR$nHR_TST <- apply(NsleepHR[,c("nHR_NREM","nHR_REM")],1,sum,na.rm=TRUE)
NsleepHR$nHR_TSTdiff <- NsleepHR$TST - NsleepHR$nHR_TST
hist(NsleepHR$nHR_TSTdiff,breaks=100,main="TST - nHR_TST (minutes)",xlab="")

# printing info (No. of missing epochs)
nMiss <- c(0,1,10,20,30,50,75,100,150,200,400)
for(i in 1:length(nMiss)){ cat("\n-",nrow(NsleepHR[NsleepHR$nHR_TSTdiff>=nMiss[i],]),
                               "cases with",nMiss[i],"or more missing HR epochs")}

## 
## - 3711 cases with 0 or more missing HR epochs
## - 3364 cases with 1 or more missing HR epochs
## - 1457 cases with 10 or more missing HR epochs
## - 143 cases with 20 or more missing HR epochs
## - 35 cases with 30 or more missing HR epochs
## - 15 cases with 50 or more missing HR epochs
## - 4 cases with 75 or more missing HR epochs
## - 2 cases with 100 or more missing HR epochs
## - 2 cases with 150 or more missing HR epochs
## - 2 cases with 200 or more missing HR epochs
## - 1 cases with 400 or more missing HR epochs

Comments:

a substantial No. of HR values (N = 1,457, 33%) were computed from recordings with 10+ missing epochs (i.e., 10+ min)
a lower No. of cases (N = 143, 3%) were computed from recordings with 20+ missing epochs (i.e., 20+ min)
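The computation of missing epochs and the threshold counts can be reproduced on a toy example. The values below are hypothetical; as in the code above, a missing stage count is treated as zero when summing, so a case with no NREM epochs (case "d") shows a large No. of missing epochs.

```r
# toy data (hypothetical): TST in minutes and No. of 1-min HR epochs per stage
toy <- data.frame(LogId = c("a","b","c","d"),
                  TST = c(420, 400, 380, 450),
                  nHR_NREM = c(330, 300, 250, NA),
                  nHR_REM = c(88, 85, 70, 90))

# total No. of HR epochs (NA treated as 0) and missing epochs per sleep period
toy$nHR_TST <- rowSums(toy[, c("nHR_NREM","nHR_REM")], na.rm = TRUE)
toy$nHR_TSTdiff <- toy$TST - toy$nHR_TST

# No. of cases at or above each threshold of missing epochs
nMiss <- c(1, 10, 20, 50)
counts <- sapply(nMiss, function(m) sum(toy$nHR_TSTdiff >= m))
names(counts) <- paste0(nMiss, "+")
print(counts)
```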

Here, we filter sleepHR values computed from recordings with 50+ missing epochs (i.e., 15 cases with 50+ minutes with no HR).

# identifying sleepHR variables
sleepHRVars <- colnames(ema)[which(colnames(ema)=="stageHR_REM"):which(colnames(ema)=="stageHR_NREM")]

# identifying cases with 50+ missing epochs
toRemove <- levels(as.factor(as.character(NsleepHR[NsleepHR$nHR_TSTdiff>=50,"LogId"])))
cat("No. of cases to be removed =",length(toRemove)) # 15

## No. of cases to be removed = 15

# removing sleepHR variables from cases with 50+ missing epochs
memory <- ema
ema[!is.na(ema$stageHR_NREM) & ema$LogId %in% toRemove, sleepHRVars] <- NA

# print info (15 removed cases)
cat("No. excluded cases with 50+ missing epochs =",
    nrow(memory[!is.na(memory$stageHR_NREM),])-nrow(ema[!is.na(ema$stageHR_NREM),]))

## No. excluded cases with 50+ missing epochs = 15

Comments:

15 cases were removed due to 50+ minutes of missing epochs within a given TST interval

Here, we further inspect the differences between the expected and the observed No. of HR epochs in each sleep period.

# computing and plotting differences between expected and observed No. of TST epochs
NsleepHR <- ema[!is.na(ema$stageHR_NREM),]
NsleepHR$nHR_TST <- apply(NsleepHR[,c("nHR_NREM","nHR_REM")],1,sum,na.rm=TRUE)
NsleepHR$nHR_TSTdiff <- NsleepHR$TST - NsleepHR$nHR_TST
hist(NsleepHR$nHR_TSTdiff,breaks=100,main="TST - nHR_TST (minutes)",xlab="")

# printing info (No. of missing epochs)
nMiss <- c(0,1,10,20,30,50,75,100,150,200,400)
for(i in 1:length(nMiss)){ cat("\n-",nrow(NsleepHR[NsleepHR$nHR_TSTdiff>=nMiss[i],]),
                               "cases with",nMiss[i],"or more missing HR epochs")}

## 
## - 3696 cases with 0 or more missing HR epochs
## - 3349 cases with 1 or more missing HR epochs
## - 1442 cases with 10 or more missing HR epochs
## - 128 cases with 20 or more missing HR epochs
## - 20 cases with 30 or more missing HR epochs
## - 0 cases with 50 or more missing HR epochs
## - 0 cases with 75 or more missing HR epochs
## - 0 cases with 100 or more missing HR epochs
## - 0 cases with 150 or more missing HR epochs
## - 0 cases with 200 or more missing HR epochs
## - 0 cases with 400 or more missing HR epochs

Comments:

now, none of the included sleepHR measures are computed from recordings with 50+ minutes of missing data

Then, we can apply the same criterion to separately filter stageHR_NREM and stageHR_REM, based on maximum acceptable No. of missing data. The HRdata.filter function is used to optimize the process.


HRdata.filter <- function(EMAdata=NA,nHRvar=NA,HRvar=NA,logId=NA,SLEEPvar=NA,filter=FALSE,maxDiff=NA){ 
  
  # preparing dataset
  colnames(EMAdata)[which(colnames(EMAdata)==nHRvar)] <- "nHRvar"
  colnames(EMAdata)[which(colnames(EMAdata)==logId)] <- "logId"
  if(length(HRvar)>1){
    for(i in 1:length(HRvar)){ colnames(EMAdata)[which(colnames(EMAdata)==HRvar[i])] <- paste("HRvar",i,sep="") }
    colnames(EMAdata)[colnames(EMAdata)=="HRvar1"] <- "HRvar"
    } else { colnames(EMAdata)[which(colnames(EMAdata)==HRvar)] <- "HRvar" }
  if(is.numeric(SLEEPvar)){ EMAdata$SLEEPvar <- SLEEPvar 
    } else { colnames(EMAdata)[which(colnames(EMAdata)==SLEEPvar)] <- "SLEEPvar" }
  NsleepHR <- EMAdata[!is.na(EMAdata$HRvar),]
  NsleepHR$diff <- NsleepHR$SLEEPvar - NsleepHR$nHRvar
  
  # printing info (No. of missing epochs)
  cat("\n\nNo. of cases with missing HR epochs:\n")
  nMiss <- c(0,1,10,20,30,50,75,100,150,200,400)
  for(i in 1:length(nMiss)){ cat("\n-",nrow(NsleepHR[NsleepHR$diff>=nMiss[i],]),
                                 "cases with",nMiss[i],"or more missing",nHRvar,"epochs")
    if(nrow(NsleepHR[NsleepHR$diff>=nMiss[i],])==0){ break }}
  
  # filtering HRmeasures data based on maxDiff
  if(filter==TRUE & !is.na(maxDiff)){
    if(is.numeric(maxDiff)){
      toRemove <- levels(as.factor(as.character(NsleepHR[NsleepHR$diff>=maxDiff,"logId"])))
    } else if(is.character(maxDiff) & substr(maxDiff,nchar(maxDiff),nchar(maxDiff))=="%"){
      maxDiff_perc <- as.numeric(gsub("%","",maxDiff))/100
      toRemove <- levels(as.factor(as.character(NsleepHR[NsleepHR$diff >= NsleepHR$SLEEPvar*maxDiff_perc,"logId"]))) }
    
    memory <- EMAdata
    EMAdata[!is.na(EMAdata$nHRvar) & EMAdata$logId %in% toRemove, c("nHRvar","HRvar")] <- NA
    if(length(HRvar)>1){ 
      EMAdata[!is.na(EMAdata$HRvar2) & EMAdata$logId %in% toRemove, paste("HRvar",2:length(HRvar),sep="")] <- NA } 
    
    # printing info
    cat("\n\nNo. excluded cases with more than",maxDiff,"missing epochs =",
        nrow(memory[!is.na(memory$HRvar),])-nrow(EMAdata[!is.na(EMAdata$HRvar),]))
    NsleepHR2 <- EMAdata[!is.na(EMAdata$HRvar),]
    NsleepHR2$diff <- NsleepHR2$SLEEPvar - NsleepHR2$nHRvar
    cat("\n\nUpdated No. of cases with missing HR epochs:\n")
    nMiss <- c(0,1,10,20,30,50,75,100,150,200,400)
    for(i in 1:length(nMiss)){ cat("\n-",nrow(NsleepHR2[NsleepHR2$diff>=nMiss[i],]),
                                   "cases with",nMiss[i],"or more missing",nHRvar,"epochs")
    if(nrow(NsleepHR2[NsleepHR2$diff>=nMiss[i],])==0){ break }} # checking the updated dataset
    
    # plotting
    par(mfrow=c(2,1))
    hist(NsleepHR$diff,xlab="",breaks=100,main=paste("differences between",SLEEPvar,"and",nHRvar))
    hist(NsleepHR2$diff,xlab="",breaks=100,main=paste("Updated differences between",SLEEPvar,"and",nHRvar)) }
  
  # renaming variables
  colnames(EMAdata)[which(colnames(EMAdata)=="nHRvar")] <- nHRvar
  if(!is.numeric(SLEEPvar)){ colnames(EMAdata)[which(colnames(EMAdata)=="SLEEPvar")] <- SLEEPvar  
  } else { EMAdata$SLEEPvar <- NULL } 
  colnames(EMAdata)[which(colnames(EMAdata)=="logId")] <- logId
  if(length(HRvar)>1){ colnames(EMAdata)[which(substr(colnames(EMAdata),1,5)=="HRvar")] <- HRvar
  } else { colnames(EMAdata)[which(colnames(EMAdata)=="HRvar")] <- HRvar }
  
  return(EMAdata) }

Here, the procedure is applied to both stage-related HR measures by setting the maxDiff argument to 20 in order to remove all cases with 20 or more missing epochs.

# stageHR_REM: 20 (filtering 5 cases)
ema <- HRdata.filter(EMAdata=ema,nHRvar="nHR_REM",HRvar="stageHR_REM",logId="LogId",SLEEPvar="rem",
                     filter=TRUE,maxDiff=20)

## 
## 
## No. of cases with missing HR epochs:
## 
## - 3520 cases with 0 or more missing nHR_REM epochs
## - 2580 cases with 1 or more missing nHR_REM epochs
## - 12 cases with 10 or more missing nHR_REM epochs
## - 5 cases with 20 or more missing nHR_REM epochs
## - 0 cases with 30 or more missing nHR_REM epochs
## 
## No. excluded cases with more than 20 missing epochs = 5
## 
## Updated No. of cases with missing HR epochs:
## 
## - 3515 cases with 0 or more missing nHR_REM epochs
## - 2575 cases with 1 or more missing nHR_REM epochs
## - 7 cases with 10 or more missing nHR_REM epochs
## - 0 cases with 20 or more missing nHR_REM epochs

# stageHR_NREM: 20 (filtering 39 cases)
ema$nrem <- ema$light + ema$deep
ema <- HRdata.filter(EMAdata=ema,nHRvar="nHR_NREM",HRvar="stageHR_NREM",logId="LogId",SLEEPvar="nrem",
                     filter=TRUE,maxDiff=20)

## 
## 
## No. of cases with missing HR epochs:
## 
## - 3647 cases with 0 or more missing nHR_NREM epochs
## - 3228 cases with 1 or more missing nHR_NREM epochs
## - 995 cases with 10 or more missing nHR_NREM epochs
## - 39 cases with 20 or more missing nHR_NREM epochs
## - 10 cases with 30 or more missing nHR_NREM epochs
## - 0 cases with 50 or more missing nHR_NREM epochs
## 
## No. excluded cases with more than 20 missing epochs = 39
## 
## Updated No. of cases with missing HR epochs:
## 
## - 3608 cases with 0 or more missing nHR_NREM epochs
## - 3189 cases with 1 or more missing nHR_NREM epochs
## - 956 cases with 10 or more missing nHR_NREM epochs
## - 0 cases with 20 or more missing nHR_NREM epochs

Comments:

5 (0.1%) stageHR_REM values were excluded due to 20+ missing HR epochs
39 (0.8%) stageHR_NREM values were excluded due to 20+ missing HR epochs
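
As a minimal sketch of the filtering rule applied above (not the HRdata.filter function itself), the following toy example sets an HR summary to NA when the No. of missing HR epochs reaches the chosen threshold. The toy data frame, its column names, and the helper filterByMissingEpochs are hypothetical, and we assume that missing epochs are simply the expected minus the observed No. of HR epochs.

```r
# hypothetical toy data: expected vs. observed No. of HR epochs per sleep log
toy <- data.frame(LogId = 1:5,
                  nExpected = c(100, 120, 90, 80, 110),
                  nObserved = c(100, 95, 70, 79, 60),
                  meanHR = c(58.2, 61.0, 55.4, 63.1, 59.9))

# hypothetical helper: sets meanHR to NA when maxDiff or more epochs are missing
filterByMissingEpochs <- function(data, maxDiff = 20) {
  nMissing <- data$nExpected - data$nObserved
  data$meanHR[nMissing >= maxDiff] <- NA
  data
}

filtered <- filterByMissingEpochs(toy, maxDiff = 20)
nExcluded <- sum(is.na(filtered$meanHR)) - sum(is.na(toy$meanHR))
cat("No. of excluded cases =", nExcluded, "\n")
```

Setting the value to NA rather than dropping the row keeps the other measures recorded for the same sleep log available.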

5.3.2.2. HR values

Finally, we inspect the range of HR values in each stageHR variable, and we compare HR values with normative cut-offs (i.e., 1st and 99th centiles for ages 15-18y = 43 and 104 bpm, respectively) from Fleming et al. (2011).

sleepHRVars <- paste("stageHR",c("NREM","REM"),sep="_")
for(i in 1:length(sleepHRVars)){ 
  colnames(ema)[which(colnames(ema)==sleepHRVars[i])] <- "sleepHR"
  cat("\n\n",sleepHRVars[i],":\n -",nrow(ema[!is.na(ema$sleepHR) & ema$sleepHR < 43,]),"cases with mean HR < 43 bpm (",
      round(100*nrow(ema[!is.na(ema$sleepHR) & ema$sleepHR < 43,])/nrow(ema[!is.na(ema$sleepHR),]),1),
      "% )\n -",nrow(ema[!is.na(ema$sleepHR) & ema$sleepHR > 104,]),"cases with mean HR > 104 bpm (",
      round(100*nrow(ema[!is.na(ema$sleepHR) & ema$sleepHR > 104,])/nrow(ema[!is.na(ema$sleepHR),]),1),
      ")\n - min HR =",min(ema[!is.na(ema$sleepHR),"sleepHR"]),"- max HR =",max(ema[!is.na(ema$sleepHR),"sleepHR"]),"bpm")
  hist(ema$sleepHR,breaks=100,xlab="",main=sleepHRVars[i]); abline(v=c(43,104),col="red")
  colnames(ema)[which(colnames(ema)=="sleepHR")] <- sleepHRVars[i] }

## 
## 
##  stageHR_NREM :
##  - 28 cases with mean HR < 43 bpm ( 0.6 % )
##  - 0 cases with mean HR > 104 bpm ( 0 )
##  - min HR = 39.825 - max HR = 100.93 bpm

## 
## 
##  stageHR_REM :
##  - 19 cases with mean HR < 43 bpm ( 0.4 % )
##  - 0 cases with mean HR > 104 bpm ( 0 )
##  - min HR = 41.011 - max HR = 97.297 bpm

Comments:

a small No. of cases (ranging from 19 for stageHR_REM to 28 for stageHR_NREM) show HR lower than 43 bpm
no cases show mean HR higher than 104 bpm

Due to the very low No. of cases with extreme values, we decided not to filter sleepHR data based on HR values.
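
The comparison with normative cut-offs can also be sketched without temporarily renaming columns. The helper countOutOfRange and the toy values below are hypothetical; only the 43 and 104 bpm cut-offs come from the text above.

```r
# hypothetical helper: counts nonmissing values outside the normative range
countOutOfRange <- function(x, lower = 43, upper = 104) {
  x <- x[!is.na(x)]
  c(n = length(x),
    nBelow = sum(x < lower), percBelow = round(100 * mean(x < lower), 1),
    nAbove = sum(x > upper), percAbove = round(100 * mean(x > upper), 1))
}

# toy mean HR values (bpm), not taken from the study data
toyHR <- c(41.5, 55, 62.3, NA, 70.1, 48, 105.2, 66, 59.4, 80, 52.7)
res <- countOutOfRange(toyHR)
print(res)
```

With real data, the same helper could be applied to each stageHR variable in turn, for instance with sapply over the column names.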

5.3.3. Summary of data

In summary, from the original No. of sleep logs with nonmissing HR data (N = 5,037), 15 cases were removed due to values computed from recordings with 50+ missing epochs, and further cases were removed due to 20+ missing epochs in stageHR_REM (N = 5) and stageHR_NREM (N = 39).

Thus, data cleaning led to a total No. of excluded sleep periods = 54 cases for stageHR_NREM and 20 cases for stageHR_REM.

# stageHR_NREM: 54 cases (as expected)
cat("stageHR_NREM: Total No. of excluded cases =",nrow(memory[!is.na(memory$stageHR_NREM),]) - nrow(ema[!is.na(ema$stageHR_NREM),]))

## stageHR_NREM: Total No. of excluded cases = 54

# stageHR_REM: 19 cases (1 less than expected)
cat("stageHR_REM: Total No. of excluded cases =",nrow(memory[!is.na(memory$stageHR_REM),]) - nrow(ema[!is.na(ema$stageHR_REM),]))

## stageHR_REM: Total No. of excluded cases = 19

Here, we compute the updated information on the nonmissing data related to sleepHR variables.

for(i in 1:length(sleepHRVars)){ 
  
  # updating compliance dataset
  colnames(ema)[which(colnames(ema)==sleepHRVars[i])] <- "sleepHR"
  NsleepHR <- ema[!is.na(ema$sleepHR),]
  
  # computing compliance_clean (reference: 60 days)
  for(j in 1:nrow(compliance)){
    compliance[j,"sleepHR"] <- 
      nlevels(as.factor(as.character(NsleepHR[as.character(NsleepHR$ID)==compliance[j,"ID"] &
                                                !is.na(NsleepHR$sleepHR),"IDday"])))  }
  
  # printing compliance info
  cat("\n\n",sleepHRVars[i],":\n- 'Cleaned' No. days/participants =",
    round(mean(compliance$sleepHR),2)," ( SD =",round(sd(compliance$sleepHR),2),
    ") \n- compliance =",
    round(mean(100*compliance$sleepHR/60),2),"% ( SD =",
    round(sd(100*compliance$sleepHR/60),2),
    ") \n- non missing",sleepHRVars[i],"/Total No. 'clean' sleepLog cases =",
    round(mean(100*compliance$sleepHR/compliance$nSleep_clean),2),
    "% ( SD =",round(sd(100*compliance$sleepHR/compliance$nSleep_clean),2),")")
  
  # plotting No. of cases
  hist(compliance$sleepHR,breaks=50,main=paste("No. of nonmissing",sleepHRVars[i],"values per participant"),xlab="")
  
  # back to original variable name
  colnames(ema)[which(colnames(ema)=="sleepHR")] <- sleepHRVars[i]
  colnames(compliance)[which(colnames(compliance)=="sleepHR")] <- paste("n",sleepHRVars[i],"_clean",sep="") }

## 
## 
##  stageHR_NREM :
## - 'Cleaned' No. days/participants = 46.45  ( SD = 14.13 ) 
## - compliance = 77.42 % ( SD = 23.56 ) 
## - non missing stageHR_NREM /Total No. 'clean' sleepLog cases = 84.54 % ( SD = 14.88 )

## 
## 
##  stageHR_REM :
## - 'Cleaned' No. days/participants = 46.8  ( SD = 14.25 ) 
## - compliance = 77.99 % ( SD = 23.74 ) 
## - non missing stageHR_REM /Total No. 'clean' sleepLog cases = 85.15 % ( SD = 15.05 )

Comments:

compliance is similar to that reported for the full dataset

5.4. dailyAct

dailyAct data include information on daily TotalSteps and physical activity durations. The TotalSteps variable was recomputed from hourlySteps data in section 4.1, with only 13 cases included in the dailyAct but not in the hourlySteps dataset.

5.4.1. Compliance

Here, we compute the original No. of dailyAct days (IDday) identified by the FC3 device.

cat(nrow(ema[!is.na(ema$TotalSteps),]),"nonmissing cases of dailyAct data") # 6,019

## 6019 nonmissing cases of dailyAct data

To estimate compliance, we compute the ratio between the No. of nonmissing days for each participant and the length of the recording period originally required by the study, that is, two months (60 days).

# computing No. of cases with nonmissing dailyAct data
for(i in 1:nrow(compliance)){
  compliance[i,"ndailyAct"] <- 
    nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] & 
                                         !is.na(ema$TotalSteps),"IDday"])))
   compliance[i,"ndailyAct_sleep"] <- 
    nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] & 
                                         !is.na(ema$TIB) & !is.na(ema$TotalSteps),"IDday"]))) }

# printing compliance info
cat("\n\ndailyAct data:\n- Original No. days/participants =",
    round(mean(compliance$ndailyAct),2)," ( SD =",round(sd(compliance$ndailyAct),2),
    ") \n- Original dailyAct compliance =",
    round(mean(100*compliance$ndailyAct/60),2),"% ( SD =",round(sd(100*compliance$ndailyAct/60),2),
    ") \n- non missing original dailyAct/Total No. Original sleep cases =",
    round(mean(100*compliance$ndailyAct_sleep/compliance$nSleep),2),
    "% ( SD =",round(sd(100*compliance$ndailyAct_sleep/compliance$nSleep),2),")")

## 
## 
## dailyAct data:
## - Original No. days/participants = 64.72  ( SD = 5.25 ) 
## - Original dailyAct compliance = 107.87 % ( SD = 8.75 ) 
## - non missing original dailyAct/Total No. Original sleep cases = 96.52 % ( SD = 11.59 )

# plotting No. of cases
hist(compliance$ndailyAct,breaks=50,main="No. of nonmissing TotalSteps values per participant",xlab="")

Comments:

compared to the criterion of having two months of continuous recordings, dailyAct data show an original compliance of 108%, meaning that most participants recorded more than two months of physical activity
as shown in section 1.1, no missing days occurred during the data collection period
on average, 96.5% of sleepLog cases also include dailyAct data
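
The compliance ratio described above can be sketched in vectorized form on toy data. The data frame and its values are hypothetical; the 60-day reference period is the one stated in the text.

```r
# toy long-format data: one row per participant-day (hypothetical values)
toy <- data.frame(ID = c("s01","s01","s01","s02","s02","s02","s02"),
                  ActivityDate = as.Date(c("2020-01-01","2020-01-02","2020-01-03",
                                           "2020-01-01","2020-01-02","2020-01-02","2020-01-04")),
                  TotalSteps = c(5000, NA, 7200, 3100, 8000, 8000, 6400))

# No. of distinct days with nonmissing TotalSteps per participant
valid <- toy[!is.na(toy$TotalSteps), ]
nDays <- tapply(valid$ActivityDate, valid$ID, function(d) length(unique(d)))

# compliance as a percentage of the required recording period (60 days)
requiredDays <- 60
complianceToy <- 100 * nDays / requiredDays
print(round(complianceToy, 2))
```

Because the denominator is fixed at 60 days, participants who recorded for longer than two months can exceed 100%.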

5.4.2. Data filtering

5.4.2.1. Missing epochs

First, we remove the 13 cases that were not computed from hourlySteps data.

# printing No. of cases
cat(nrow(ema[!is.na(ema$hourlySteps) & ema$hourlySteps==FALSE,]),"cases with no corresponding hourlySteps data")

## 13 cases with no corresponding hourlySteps data

# removing 13 cases with no corresponding hourlySteps data
dailyActVars <- colnames(ema)[which(colnames(ema)=="TotalSteps"):which(colnames(ema)=="hourlySteps")] 
memory <- ema
ema[!is.na(ema$hourlySteps) & ema$hourlySteps==FALSE,dailyActVars] <- NA
cat("Removed",nrow(ema[is.na(ema$TotalSteps),])-nrow(memory[is.na(memory$TotalSteps),]),"cases")

## Removed 13 cases

5.4.2.2. Wear time

As suggested by Herrmann et al. (2012), the validity of physical activity data should be inspected based on wear time. Here, we implement this approach by using three indicators of wear time:

one based on the TotalActivityMinutes variable, computed by summing the durations automatically stored in Fitabase
one based on nonmissing diurnalHR epochs
one based on the No. of hourlySteps counts higher than zero

5.4.2.2.1. actWearTime

First, we compute the actWearTime variable, which simply expresses TotalActivityMinutes in hours, that is, the total No. of physical activity minutes recorded by the device according to Fitabase.

# computing actWearTime
ema$actWearTime <- ema$TotalActivityMinutes/60

# plotting
hist(ema$actWearTime,xlab="",
     main=paste("actWearTime (hours) - min =",round(min(ema$actWearTime,na.rm=T),1),
                "max =",round(max(ema$actWearTime,na.rm=T),1)),breaks=100); abline(v=13,col="red")

5.4.2.2.2. HRWearTime

Second, we use the HR.1min dataset to create the variable diurnalHR as the average of all nonmissing HR values outside sleepLog StartTime and EndTime. This is done with the dayTimeHR function.

FUNCTION

dayTimeHR <- function(SLEEPACTdata=NA,HRdata=NA){ require(tcltk); require(lubridate)
  
  # HR data preparation (creating Hour as the current date + Time from HR.1min)
  HRdata$Hour <- as.POSIXct(paste(hour(HRdata$Time),minute(HRdata$Time)),format="%H %M",tz="GMT")
  HRdata$IDday <- as.factor(paste(HRdata$ID,substr(HRdata$Time,1,10),sep="_"))
  
  # SLEEP-ACT data preparation (creating IDday and setting first row)
  SLEEPACTdata$IDday <- as.factor(paste(SLEEPACTdata$ID,SLEEPACTdata$ActivityDate,sep="_"))
  if(!is.na(SLEEPACTdata[1,"TotalSteps"])){
    if(!is.na(SLEEPACTdata[1,"TST"])){
      # selecting HR data with the same IDday value
      HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[1,"IDday"] & HRdata$Time <= SLEEPACTdata[1,"StartTime"],]
    } else {
      HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[1,"IDday"] & # first row (i is not yet defined here)
                         HRdata$Hour >= as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT") & 
                         HRdata$Hour <= as.POSIXct(paste(substr(Sys.time(),1,10),"23:00:00"),tz="GMT"),] }
    SLEEPACTdata[1,c("nEpochsdiurnalHR","diurnalHR.type","diurnalHR")] <- # computing HR variables
        data.frame(nEpochsdiurnalHR=nrow(HRday),diurnalHR.type="6.to.23",diurnalHR=mean(HRday$HR)) }
  
  # iteratively computing dayTimeHR
  pb <- tkProgressBar("", "%",0, 100, 50) # progress bar
  for(i in 2:nrow(SLEEPACTdata)){ IDday <- as.character(SLEEPACTdata[i,"IDday"])
    info <- sprintf("%d%% done", round(100*i/nrow(SLEEPACTdata))) # progress based on the current row position
    setTkProgressBar(pb, round(100*i/nrow(SLEEPACTdata)), "Computing mean HR...", info)
    diurnalHR.type <- NA
    if(!is.na(SLEEPACTdata[i,"TotalSteps"])){
      
      # if sleepLog data was NOT missing on day i
      if(!is.na(SLEEPACTdata[i,"TST"])){
        
        # if same subject and nonmissing sleepLog on day i-1 --> diurnal HR from the previous EndTime to the current StartTime
        if(SLEEPACTdata[i,"ID"]==SLEEPACTdata[i-1,"ID"] & !is.na(SLEEPACTdata[i-1,"TST"]) &
           difftime(SLEEPACTdata[i,"ActivityDate"],SLEEPACTdata[i-1,"ActivityDate"],units="days")<=1){ 
          diurnalHR.type <- "TIB.based"
          HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[i,"IDday"] & 
                            HRdata$Time >= SLEEPACTdata[i-1,"EndTime"] & HRdata$Time <= SLEEPACTdata[i,"StartTime"],]
          
          } else { # if different subject or missing sleepLog day --> diurnal HR from 6:00 to the current StartTime
            diurnalHR.type <- "6.to.start"
            HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[i,"IDday"] & 
                              HRdata$Hour >= as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT") & 
                              HRdata$Time <= SLEEPACTdata[i,"StartTime"],] 
            
            }} else { # if sleepLog data was missing on day i --> diurnal HR from 6:00 to 23:00
              diurnalHR.type <- "6.to.23"
              HRday <- HRdata[as.character(HRdata$IDday)==SLEEPACTdata[i,"IDday"] & 
                                HRdata$Hour >= as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT") & 
                                HRdata$Hour <= as.POSIXct(paste(substr(Sys.time(),1,10),"23:00:00"),tz="GMT"),] }
      
      # updating dataset
      meanHR <- ifelse(nrow(HRday)>0,mean(HRday$HR),NA)
      SLEEPACTdata[i,c("nEpochsdiurnalHR","diurnalHR.type","diurnalHR")] <-
        data.frame(nEpochsdiurnalHR=nrow(HRday),diurnalHR.type=diurnalHR.type,diurnalHR=meanHR) }}
  close(pb)
  
  return(SLEEPACTdata) }

ema <- ema[order(ema$ID,ema$ActivityDate,ema$StartTime),] # sorting by ID, ActivityDate, and StartTime
ema <- dayTimeHR(SLEEPACTdata=ema,HRdata=HR.1min)
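
As a simplified illustration of the fallback window used when sleepLog data are missing (06:00 to 23:00), the sketch below selects epochs by minutes since midnight instead of pasting the current date onto the clock time. It is not the dayTimeHR function; the toy HR series and all object names are hypothetical.

```r
# toy 1-min HR series for one participant-day (hypothetical values)
toyHR <- data.frame(
  Time = seq(as.POSIXct("2020-01-01 00:00:00", tz = "GMT"),
             as.POSIXct("2020-01-01 23:59:00", tz = "GMT"), by = "1 min"))
toyHR$HR <- 60 + 10 * sin(seq_len(nrow(toyHR)) / 100)

# minutes since midnight, independent of the calendar date
minOfDay <- as.numeric(format(toyHR$Time, "%H", tz = "GMT")) * 60 +
  as.numeric(format(toyHR$Time, "%M", tz = "GMT"))

# selecting epochs from 06:00 to 23:00 (both boundaries included)
inWindow <- minOfDay >= 6 * 60 & minOfDay <= 23 * 60
nEpochs <- sum(inWindow)
meanHR <- ifelse(nEpochs > 0, mean(toyHR$HR[inWindow]), NA)
cat(nEpochs, "epochs, mean HR =", round(meanHR, 1), "bpm\n")
```

Note that including both boundaries yields one epoch more than the 17 x 60 minutes separating them.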

SANITY CHECKS

First, we inspect the No. and percentage of missing data.

Nact <- ema[!is.na(ema$TotalSteps),]
cat("Manually computed",nrow(Nact[!is.na(Nact$diurnalHR),]),"diurnalHR values, of which:\n -",
    summary(as.factor(Nact$diurnalHR.type))[1],"from 6 AM to 11 PM (missing sleepLog data)\n -",
    summary(as.factor(Nact$diurnalHR.type))[2],"from 6 AM to StartTime (missing previous sleepLog data)\n -",
    summary(as.factor(Nact$diurnalHR.type))[3],"based on TIB boundaries \n\n -",
    nrow(Nact[is.na(Nact$diurnalHR),]),
    "missing values (",round(100*nrow(Nact[is.na(Nact$diurnalHR),])/nrow(Nact),1),"% )")

## Manually computed 5604 diurnalHR values, of which:
##  - 996 from 6 AM to 11 PM (missing sleepLog data)
##  - 408 from 6 AM to StartTime (missing previous sleepLog data)
##  - 4602 based on TIB boundaries 
## 
##  - 402 missing values ( 6.7 % )

Comments:

Considering the total No. of cases (non missing either in sleepLog or in dailyAct):

diurnalHR information is missing in 402 cases (6.7%)
most cases (77%) were computed based on TIB boundaries
a minority of cases were computed from 6 AM to 11 PM (17%) or from 6 AM to sleepLog StartTime (7%) due to missing sleepLog data

Then, we inspect the No. of cases whose nEpochsdiurnalHR is lower than the distance between diurnalMinutes boundaries, computed by accounting for diurnalHR.type.

# computing differences
ema$StartHour <- as.POSIXct(paste(hour(ema$StartTime),minute(ema$StartTime)),format="%H %M",tz="GMT")
for(i in 1:nrow(ema)){
  if(!is.na(ema[i,"diurnalHR"])){
    if(ema[i,"diurnalHR.type"]=="6.to.23"){ ema[i,"diurnalMinutes"] <- 17*60 # 6.to.23 -> 17 hours
    } else if(ema[i,"diurnalHR.type"]=="6.to.start"){ # 6.to.start -> StartHour - 6 AM
      timeDiff <- as.numeric(difftime(ema[i,"StartHour"],as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),tz="GMT"),
                                      units="mins"))
      if(timeDiff>0){ ema[i,"diurnalMinutes"] <- timeDiff } else { # when StartHour after midnight -> StartHour - 1 day
        ema[i,"diurnalMinutes"] <- as.numeric(difftime(ema[i,"StartHour"],
                                                        as.POSIXct(paste(substr(Sys.time(),1,10),"06:00:00"),
                                                                   tz="GMT")-1*60*60*24,
                                                       units="mins"))
      }} else if(ema[i,"diurnalHR.type"]=="TIB.based"){ # TIB.based -> StartHour - previous EndHour
        ema[i,"diurnalMinutes"] <- as.numeric(difftime(ema[i,"StartTime"],ema[i-1,"EndTime"],units="mins")) }}}
ema$diurnalHR.timeDiff <- ema$diurnalMinutes - ema$nEpochsdiurnalHR

# plotting differences
hist(ema$diurnalHR.timeDiff,breaks=100,
     main=paste("Differences between diurnalMinutes and No. of diurnalHR epochs \nmin =",
                min(ema$diurnalHR.timeDiff,na.rm=TRUE),"max =",max(ema$diurnalHR.timeDiff,na.rm=TRUE)))

# summarizing and showing 125 diurnalHR.timeDiff < 0
cat(nrow(ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff<0,]),"negative differences")

## 125 negative differences

summary(ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff<0,"diurnalHR.timeDiff"]) # from -1 to -0.5

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.000  -1.000  -1.000  -0.784  -0.500  -0.500

# no cases with diurnalHR.timeDiff > 1 day = 24*60 = 1,440 min
cat(nrow(ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff>1440,]),"differences > 1 day")

## 0 differences > 1 day

# summarizing 120 cases with diurnalHR.timeDiff > 1,000 minutes (16.7h)
cat(nrow(ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff>1000,]),"differences > 16.7h")

## 120 differences > 16.7h

ema[!is.na(ema$diurnalHR) & ema$diurnalHR.timeDiff>1000,
       c("StartTime","EndTime","diurnalMinutes","diurnalHR.type","nEpochsdiurnalHR","diurnalHR.timeDiff")]

Comments:

differences between diurnalMinutes and the No. of available diurnalHR epochs range from -1 to 1,400 min (23.3h)
only 125 differences (2%) are negative, ranging from -1 to -0.5 min, probably due to approximations of timing variables
many differences are substantial, with 120 cases (2%) showing more than 1,000 min of missing diurnalHR epochs, which is useful for comparing valid and invalid dailyAct data based on HRWearTime values
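
The handling of sleep start times after midnight in the '6.to.start' case can be sketched with plain minutes since midnight. The helper minutesFromSix is hypothetical and only mirrors the rule used above: when the start time is not later than 06:00 on the same clock day, one day is added.

```r
# hypothetical helper: minutes elapsed from 06:00 to a sleep start time expressed
# as minutes since midnight; start times not later than 06:00 are assumed to
# fall on the following calendar day
minutesFromSix <- function(startMinOfDay) {
  d <- startMinOfDay - 6 * 60
  ifelse(d > 0, d, d + 24 * 60)
}

# sleep starting at 23:00, 00:30, and 22:15
expectedMin <- minutesFromSix(c(23 * 60, 30, 22 * 60 + 15))
print(expectedMin)
```

Subtracting the No. of available HR epochs from these expected minutes gives the kind of difference inspected above.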

HRWearTime

Here, we compute the HRWearTime variable quantifying wear time in terms of diurnal hours of nonmissing HR values.

# computing HRWearTime
ema$HRWearTime <- ema$nEpochsdiurnalHR/60

# plotting
hist(ema$HRWearTime,xlab="",
     main=paste("HRWearTime (hours) - min =",round(min(ema$HRWearTime,na.rm=T),1),
                "max =",round(max(ema$HRWearTime,na.rm=T),1)),breaks=100); abline(v=13,col="red")

5.4.2.2.3. stepsWearTime

Third, we compute the stepsWearTime variable by counting non-wear time as the No. of hourlySteps hours with zero StepTotal counts, and by subtracting it from 24h (e.g., see Aadland et al., 2018; Herrmann et al., 2014).
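
The counting rule can first be illustrated on toy data. The sketch below is a vectorized variant under the assumption that each participant-day has exactly 24 hourly rows; the toy data frame and its values are hypothetical.

```r
# toy hourly step counts for two participant-days (hypothetical values)
toyHourly <- data.frame(IDday = rep(c("s01_2020-01-01", "s02_2020-01-01"), each = 24),
                        StepTotal = c(rep(0, 8), rep(250, 16),    # 8 zero-count hours
                                      rep(0, 14), rep(400, 10)))  # 14 zero-count hours

# No. of zero-count hours per participant-day
nZeroCountsToy <- tapply(toyHourly$StepTotal == 0, toyHourly$IDday, sum)

# wear time as 24h minus the No. of zero-count hours
stepsWearTimeToy <- 24 - nZeroCountsToy
print(stepsWearTimeToy)
```

Days with fewer than 24 hourly rows would be overestimated by this rule, so the assumption should be checked before relying on it.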

# counting No. of zero counts per day (i.e., 24h periods)
ema$IDday <- as.factor(paste(ema$ID,ema$ActivityDate,sep="_"))
hourlySteps$IDday <- as.factor(paste(hourlySteps$ID,hourlySteps$ActivityDate,sep="_"))
for(i in 1:nrow(ema)){
  ema[i,"nZeroCounts"] <- nrow(hourlySteps[hourlySteps$IDday==as.character(ema[i,"IDday"]) 
                                           & hourlySteps$StepTotal==0,]) }
ema[is.na(ema$TotalSteps),"nZeroCounts"] <- NA

# computing stepsWearTime (24 - nZeroCounts)
ema$stepsWearTime <- 24 - ema$nZeroCounts

# plotting
hist(ema$stepsWearTime,xlab="",
     main=paste("stepsWearTime (hours) - min =",round(min(ema$stepsWearTime,na.rm=T),1),
                "max =",round(max(ema$stepsWearTime,na.rm=T),1)),breaks=100); abline(v=13,col="red")

Data filtering

Here, we use the actWearTime, HRWearTime, and stepsWearTime variables to inspect the validity of dailyAct data. Specifically, we focus on the 13-hour criterion, as 13h+ of wear time were recommended as a reliable approximation of 14h/day of accelerometer data collection, which is the average wear time in large studies using accelerometers (see Herrmann et al., 2013; 2014; Quante et al., 2015).

# plotting
hist(ema$stepsWearTime,xlab="",col=rgb(0,1,0,alpha=0.5),breaks=100,
     main=paste("actWearTime (blue), HRWearTime (red), and stepsWearTime (green) in hours"))
hist(ema$actWearTime,add=TRUE,col=rgb(0,0,1,alpha=0.5),breaks=100) ; abline(v=13,col="red")
hist(ema$HRWearTime,add=TRUE,col=rgb(1,0,0,alpha=0.5),breaks=100) ; abline(v=13,col="red")

# printing info
Nact <- ema[!is.na(ema$TotalSteps),]; n = nrow(Nact)
cat("No. of cases with wearTime < 13h: \n 1) based on dailyAct data:",nrow(Nact[Nact$actWearTime<13,]),
    "(",round(100*nrow(Nact[Nact$actWearTime<13,])/n,1),"% )\n 2) based on diurnalHR data:",
    nrow(Nact[Nact$HRWearTime<13,]),"(",round(100*nrow(Nact[Nact$HRWearTime<13,])/n,1),
    "% )\n 3) based on hourlySteps non-zero counts:",nrow(Nact[Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$stepsWearTime<13,])/n,1),
    "% )\n 4) based on 1) and 2):", nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13,])/n,1),
    "% )\n 5) based on 1) and 3):",nrow(Nact[Nact$actWearTime<13 & Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$actWearTime<13 & Nact$stepsWearTime<13,])/n,1),
    "% )\n 6) based on 2) and 3):",nrow(Nact[Nact$HRWearTime<13 & Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$HRWearTime<13 & Nact$stepsWearTime<13,])/n,1),
    "% )\n 7) based on all criteria:",nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13 & Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13 & Nact$stepsWearTime<13,])/n,1),
    "% )\n 8) based on at least one criterion:",
    nrow(Nact[Nact$actWearTime<13 | Nact$HRWearTime<13 | Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$actWearTime<13 | Nact$HRWearTime<13 | Nact$stepsWearTime<13,])/n,1),"% )")

## No. of cases with wearTime < 13h: 
##  1) based on dailyAct data: 889 ( 14.8 % )
##  2) based on diurnalHR data: 2311 ( 38.5 % )
##  3) based on hourlySteps non-zero counts: 1418 ( 23.6 % )
##  4) based on 1) and 2): 278 ( 4.6 % )
##  5) based on 1) and 3): 109 ( 1.8 % )
##  6) based on 2) and 3): 1342 ( 22.3 % )
##  7) based on all criteria: 97 ( 1.6 % )
##  8) based on at least one criterion: 2986 ( 49.7 % )

Comments:

almost half of the data (49.7%) show less than 13h of wear time based on one or more wear time criteria
the most conservative criterion is diurnalHR data, with 38.5% of cases showing less than 13h of wear time
the least conservative criterion is actWearTime, with 14.8% of cases showing less than 13h of wear time

Since we rely more on high-resolution data than on aggregate scores, we apply the stepsWearTime criterion while accounting for HRWearTime, that is, we keep those cases with stepsWearTime < 13h but HRWearTime >= 13h. Thus, we remove the 1,342 cases (22.3%) that show less than 13h of wear time on both criteria.
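
The logic of this rule can be sketched on toy data: a day is set to missing only when both indicators fall below 13h. The toy data frame and its values are hypothetical.

```r
# toy participant-days with two wear time indicators (hypothetical values)
toy <- data.frame(IDday = paste0("d", 1:5),
                  stepsWearTime = c(15, 10, 12, 16, 8),
                  HRWearTime = c(14, 15, 9, 11, 7),
                  TotalSteps = c(9000, 4000, 1200, 7000, 300))

# a day is removed only when BOTH indicators are below 13h
toRemoveToy <- toy$stepsWearTime < 13 & toy$HRWearTime < 13
toy$TotalSteps[toRemoveToy] <- NA
print(toy)
```

In this toy example, day d2 is kept despite its low stepsWearTime, because its HRWearTime suggests that the device was worn.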

# identifying dailyAct variables
dailyActVars <- colnames(ema)[which(colnames(ema)=="TotalSteps"):which(colnames(ema)=="hourlySteps")]

# identifying cases with both stepsWearTime and HRWearTime < 13h
toRemove <- levels(as.factor(as.character(Nact[Nact$stepsWearTime<13 & Nact$HRWearTime<13,"IDday"])))
cat("No. of cases to be removed =",length(toRemove))

## No. of cases to be removed = 1342

# removing dailyAct variables from cases with both stepsWearTime and HRWearTime < 13h
memory <- ema
ema[!is.na(ema$TotalSteps) & ema$IDday %in% toRemove, dailyActVars] <- NA

# printing info (1,342 removed cases)
cat("No. excluded cases with stepsWearTime AND HRWearTime < 13h =",
    nrow(memory[!is.na(memory$TotalSteps),])-nrow(ema[!is.na(ema$TotalSteps),]))

## No. excluded cases with stepsWearTime AND HRWearTime < 13h = 1342

# plotting TotalSteps
hist(memory$TotalSteps,xlab="",breaks=100,col=rgb(0,0,1),main="TotalSteps in original (blue) and filtered data (red)")
hist(ema$TotalSteps,xlab="",breaks=100,col=rgb(1,0,0),add=TRUE)

Then, we check the three wear time criteria again.

# excluding wearTime variables from filtered cases
ema[is.na(ema$TotalSteps),c("actWearTime","HRWearTime","stepsWearTime")] <- NA

# plotting
hist(ema$stepsWearTime,xlab="",col=rgb(0,1,0,alpha=0.5),breaks=100,
     main=paste("actWearTime (blue), HRWearTime (red), and stepsWearTime (green) in hours"))
hist(ema$actWearTime,add=TRUE,col=rgb(0,0,1,alpha=0.5),breaks=100) ; abline(v=13,col="red")
hist(ema$HRWearTime,add=TRUE,col=rgb(1,0,0,alpha=0.5),breaks=100) ; abline(v=13,col="red")

# printing info
Nact <- ema[!is.na(ema$TotalSteps),]; n = nrow(Nact)
cat("No. of cases with wearTime < 13h: \n 1) based on dailyAct data:",nrow(Nact[Nact$actWearTime<13,]),
    "(",round(100*nrow(Nact[Nact$actWearTime<13,])/n,1),"% )\n 2) based on diurnalHR data:",
    nrow(Nact[Nact$HRWearTime<13,]),"(",round(100*nrow(Nact[Nact$HRWearTime<13,])/n,1),
    "% )\n 3) based on hourlySteps non-zero counts:",nrow(Nact[Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$stepsWearTime<13,])/n,1),
    "% )\n 4) based on 1) and 2):", nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13,])/n,1),
    "% )\n 5) based on 1) and 3):",nrow(Nact[Nact$actWearTime<13 & Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$actWearTime<13 & Nact$stepsWearTime<13,])/n,1),
    "% )\n 6) based on 2) and 3):",nrow(Nact[Nact$HRWearTime<13 & Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$HRWearTime<13 & Nact$stepsWearTime<13,])/n,1),
    "% )\n 7) based on all criteria:",nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13 & Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$actWearTime<13 & Nact$HRWearTime<13 & Nact$stepsWearTime<13,])/n,1),
    "% )\n 8) based on at least one criterion:",
    nrow(Nact[Nact$actWearTime<13 | Nact$HRWearTime<13 | Nact$stepsWearTime<13,]),"(",
    round(100*nrow(Nact[Nact$actWearTime<13 | Nact$HRWearTime<13 | Nact$stepsWearTime<13,])/n,1),"% )")

## No. of cases with wearTime < 13h: 
##  1) based on dailyAct data: 792 ( 17 % )
##  2) based on diurnalHR data: 969 ( 20.8 % )
##  3) based on hourlySteps non-zero counts: 76 ( 1.6 % )
##  4) based on 1) and 2): 181 ( 3.9 % )
##  5) based on 1) and 3): 12 ( 0.3 % )
##  6) based on 2) and 3): 0 ( 0 % )
##  7) based on all criteria: 0 ( 0 % )
##  8) based on at least one criterion: 1644 ( 35.2 % )

Comments:

we filtered 1,342 cases (22.3%), leading to a dramatic reduction of the No. of cases with TotalSteps = 0 (from 196 to 7)
no more cases have less than 13h of wear time based on both the stepsWearTime and the HRWearTime criteria, suggesting that the data filtering was effective
35.2% of the cases still show less than 13h of wear time based on one or more criteria

5.4.2.3. Isolated days

Here, as done for sleepLog data in section 5.2.2.3, we take a closer look at temporally isolated days of dailyAct measures, that is, dailyAct data recorded substantially later than all other dailyAct data previously recorded from the same participant.

# computing LAG values for ActivityDate
library(dplyr)
ema <- ema[order(ema$ID,ema$ActivityDate,ema$StartTime),]
Nact <- ema[!is.na(ema$TotalSteps),]
Nact <- Nact %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA)) # lagging dailyAct (not sleepLog) days
Nact <- as.data.frame(Nact)
detach("package:dplyr", unload=TRUE)

# computing and plotting time lags
Nact$lag <- as.numeric(difftime(Nact$ActivityDate,Nact$AD_lag,units="days"))
hist(Nact$lag,main="Consecutive missing days day(i) - day(i-1)",breaks=100)

# printing info
n <- c(10,15,20,30,50)
for(i in 1:length(n)){ cat("\n- No. cases with >",n[i],"consecutive missing days =",
                           nrow(Nact[(!is.na(Nact$lag) & Nact$lag>n[i]),])) }

## 
## - No. cases with > 10 consecutive missing days = 7
## - No. cases with > 15 consecutive missing days = 3
## - No. cases with > 20 consecutive missing days = 2
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0

# showing the 7 cases with more than 10 missing days
Nact.vars <- c("ID","ActivityDate","lag",dailyActVars)
isolatedDay <- as.data.frame(matrix(nrow=0,ncol=length(Nact.vars)+1))
for(i in 1:nrow(Nact)){
  if(!is.na(Nact[i,"lag"]) & Nact[i,"lag"]>10){ 
    isolatedDay <- rbind(isolatedDay,Nact[(i-2):(i+2),c(Nact.vars,"TIB")]) } }
isolatedDay[,c("ID","ActivityDate","lag","TotalSteps","TIB")]

Comments:

contrary to sleepLog, and consistent with what was observed in section 2.1, dailyAct data do not show cases of isolated recording days
instead, all cases of consecutive missing days are observed within each participant's data collection period (not at the beginning or at the end)
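
The day-to-day lag used to detect isolated days can also be sketched in base R, without attaching dplyr. The toy data and the longGap flag below are hypothetical; the 10-day threshold is one of those inspected above.

```r
# toy recording days for two participants (hypothetical values)
toy <- data.frame(ID = c("s01","s01","s01","s02","s02","s02"),
                  ActivityDate = as.Date(c("2020-01-01","2020-01-02","2020-01-20",
                                           "2020-02-01","2020-02-03","2020-02-04")))
toy <- toy[order(toy$ID, toy$ActivityDate), ]

# days elapsed since the previous recorded day of the same participant
toy$lag <- ave(as.numeric(toy$ActivityDate), toy$ID,
               FUN = function(d) c(NA, diff(d)))

# flagging gaps longer than 10 days
toy$longGap <- !is.na(toy$lag) & toy$lag > 10
print(toy)
```

The first day of each participant has no previous day, so its lag is NA and it is never flagged.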

5.4.2.4. Duplicated

Finally, we check again for duplicated cases, that is cases with the same ID and ActivityDate values.

# creating IDday variable
ema$IDday <- as.factor(paste(ema$ID,ema$ActivityDate,sep="_"))
Nact <- ema[!is.na(ema$TotalSteps),]

# sanity check by IDday (0 cases)
cat("Sanity check:",nrow(Nact[duplicated(Nact$IDday),])==0)

## Sanity check: TRUE

Comments:

no duplicated cases are included in the ema dataset for dailyAct variables

5.4.3. Summary of data

In summary, from the original No. of dailyAct measures (N = 6,019 days, corresponding to a compliance higher than 100%), 13 cases (0.2%) were excluded due to no corresponding hourlySteps data, and 1,342 cases (22%) were removed due to less than 13h of wear time based on both hourlySteps and diurnalHR data.

Thus, data cleaning led to a total No. of excluded dailyAct days = 1,355 cases (22.5%).

cat(nrow(ema[!is.na(ema$TotalSteps),]),"'cleaned' cases of dailyAct data") # 4,664

## 4664 'cleaned' cases of dailyAct data
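
The counts reported in this summary can be cross-checked with simple arithmetic. The object names below are hypothetical; the values are those reported above.

```r
nOriginal <- 6019   # original No. of nonmissing dailyAct cases
nNoHourly <- 13     # cases with no corresponding hourlySteps data
nLowWear  <- 1342   # cases with stepsWearTime AND HRWearTime < 13h

nExcluded <- nNoHourly + nLowWear
nClean <- nOriginal - nExcluded
percExcluded <- round(100 * nExcluded / nOriginal, 1)
cat("Excluded:", nExcluded, "(", percExcluded, "% ) - remaining:", nClean, "\n")
```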

Here, we compute the updated information on the nonmissing data related to dailyAct variables.

# updating compliance dataset
Nact <- ema[!is.na(ema$TotalSteps),]

# computing compliance_clean (reference: 60 days)
for(i in 1:nrow(compliance)){
  compliance[i,"ndailyAct_clean"] <- 
    nlevels(as.factor(as.character(Nact[as.character(Nact$ID)==compliance[i,"ID"],"IDday"]))) 
  compliance[i,"ndailyAct_sleep_clean"] <- 
    nlevels(as.factor(as.character(Nact[as.character(Nact$ID)==compliance[i,"ID"] & 
                                          !is.na(Nact$TIB) & !is.na(Nact$TotalSteps),"IDday"]))) }
compliance$compl.act_clean <- 100*compliance$ndailyAct_clean/60 # % of days on two months

# printing compliance info
cat("\n\ndailyAct data:\n- 'Clean' No. days/participants =",
    round(mean(compliance$ndailyAct_clean),2)," ( SD =",round(sd(compliance$ndailyAct_clean),2),
    ") \n- 'clean' dailyAct compliance =",
    round(mean(100*compliance$ndailyAct_clean/60),2),"% ( SD =",
    round(sd(100*compliance$ndailyAct_clean/60),2),
    ") \n- 'clean' non missing sleep stages/Total No. Original sleep cases =",
    round(mean(100*compliance$ndailyAct_sleep_clean/compliance$nSleep_clean),2),
    "% ( SD =",round(sd(100*compliance$ndailyAct_sleep_clean/compliance$nSleep),2),")")

## 
## 
## dailyAct data:
## - 'Clean' No. days/participants = 50.15  ( SD = 15.97 ) 
## - 'clean' dailyAct compliance = 83.58 % ( SD = 26.62 ) 
## - 'clean' non missing sleep stages/Total No. Original sleep cases = 82.8 % ( SD = 22.75 )

# plotting No. of cases
hist(compliance$ndailyAct_clean,breaks=50,main="No. of nonmissing TotalSteps values per participant",xlab="")

Comments:

data cleaning substantially decreased the No. of available dailyAct observations
compared to the criterion of having two months of continuous recordings, dailyAct compliance decreased from 108 to 84%, which is slightly lower than that shown by sleepLog and dailyDiary data
the average percentage of sleepLog data also including dailyAct data decreased from 96.5 to 82.8%

5.5. dailyDiary

The final variable to be ‘cleaned’ is dailyDiary, consisting of the day-by-day participants’ self-reports of three core variables, namely dailyStress, eveningMood, and eveningWorry, as recorded with the Survey Sparrow mobile application.

5.5.1. Compliance

From the original No. of available dailyDiary cases (N = 5,133) a total of 188 cases were removed because they were duplicated responses or for other reasons (see below), leading to an actual No. of 4,945 nonmissing dailyDiary cases.

Here, we compute the original No. of dailyDiary days (IDday) recorded with Survey Sparrow.

cat(nrow(ema[!is.na(ema$StartedTime),]),"nonmissing cases of dailyDiary data") # 4,945

## 4945 nonmissing cases of dailyDiary data

# computing No. of cases with nonmissing dailyDiary data
Ndiary <- ema[!is.na(ema$StartedTime),]
for(i in 1:nrow(compliance)){
  compliance[i,"nDiary"] <- nlevels(as.factor(as.character(Ndiary[as.character(Ndiary$ID)==compliance[i,"ID"],
                                                                  "IDday"])))
  compliance[i,"periodDiary"] <- 
    difftime(max(Ndiary[as.character(Ndiary$ID)==compliance[i,"ID"],"StartedTime"]),
             min(Ndiary[as.character(Ndiary$ID)==compliance[i,"ID"],"StartedTime"]),
             units="days")
  compliance[i,"nDiary_sleep"] <- 
    nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] & 
                                         !is.na(ema$TIB) & !is.na(ema$StartedTime),"IDday"])))
  compliance[i,"nDiary_sleepAct"] <- 
    nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] & 
               !is.na(ema$TIB) & !is.na(ema$TotalSteps) & !is.na(ema$StartedTime),"IDday"]))) }

# printing compliance info
cat("\n\ndailyDiary data:\n- Original No. days/participants =",
    round(mean(compliance$nDiary),2)," ( SD =",round(sd(compliance$nDiary),2),
    ") \n- Original dailyDiary compliance =",
    round(mean(100*compliance$nDiary/60),2),"% ( SD =",round(sd(100*compliance$nDiary/60),2),
    ") \n- No. of missing days =",
    round(mean(as.numeric(compliance$periodDiary)-compliance$nDiary),2),
    " ( SD =",round(sd(as.numeric(compliance$periodDiary)-compliance$nDiary),2),
    ") \n- non missing original diary cases/Total No. Original sleep cases =",
    round(mean(100*compliance$nDiary_sleep/compliance$nSleep),2),
    "% ( SD =",round(sd(100*compliance$nDiary_sleep/compliance$nSleep),2),
    ") \n- non missing original diary cases/Total No. Original sleep AND dailyAct cases =",
    round(mean(100*compliance$nDiary_sleepAct/compliance$nSleep),2),
    "% ( SD =",round(sd(100*compliance$nDiary_sleepAct/compliance$nSleep),2),")")

## 
## 
## dailyDiary data:
## - Original No. days/participants = 53.17  ( SD = 10.17 ) 
## - Original dailyDiary compliance = 88.62 % ( SD = 16.94 ) 
## - No. of missing days = 9.05  ( SD = 10.78 ) 
## - non missing original diary cases/Total No. Original sleep cases = 82.94 % ( SD = 17.81 ) 
## - non missing original diary cases/Total No. Original sleep AND dailyAct cases = 70.36 % ( SD = 23.57 )

# plotting No. of cases
hist(compliance$nDiary,breaks=50,main="No. of nonmissing dailyDiary values per participant",xlab="")

Comments:

compared to the criterion of having two months of continuous recordings, dailyDiary data show an original mean compliance of 88.6%, slightly lower than that originally shown by sleepLog data (92.3%)
an average of 9 missing days per participant is observed, lower than that shown by sleepLog data
on average, 83% of sleepLog cases also include dailyDiary data, and 70.4% of sleepLog cases include both dailyDiary and dailyAct data
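The per-participant counts computed by the loop above can also be obtained without an explicit loop. The following is a minimal sketch on toy data (all values are hypothetical and the countDays helper is not part of the original pipeline), assuming the same ID, IDday, StartedTime, and TIB columns as in ema:

```r
# toy long-format data (hypothetical values, not from the study)
toy <- data.frame(
  ID = c("s001","s001","s001","s002","s002"),
  IDday = c("s001_2019-01-07","s001_2019-01-08","s001_2019-01-09",
            "s002_2019-01-07","s002_2019-01-08"),
  StartedTime = as.POSIXct(c("2019-01-07 21:00:00","2019-01-08 21:05:00",NA,
                             "2019-01-07 22:00:00","2019-01-08 21:30:00"),tz="UTC"),
  TIB = c(480,NA,450,500,510),
  stringsAsFactors=FALSE)

# hypothetical helper: No. of distinct days per participant among the rows flagged in 'keep'
countDays <- function(data,keep){
  sel <- data[keep,]
  # using all IDs as factor levels keeps participants with zero selected rows
  sapply(split(sel$IDday,factor(sel$ID,levels=unique(data$ID))),
         function(x) length(unique(x))) }

nDiary <- countDays(toy,!is.na(toy$StartedTime))                         # diary days
nDiary_sleep <- countDays(toy,!is.na(toy$StartedTime) & !is.na(toy$TIB)) # diary AND sleepLog days

# compliance relative to the 60-day criterion
round(100*nDiary/60,2)
```

Splitting on a factor that lists every participant ensures that participants without any selected row get a count of zero rather than being dropped.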

5.5.2. Data filtering

dailyDiary data were already filtered in section 2.7:

43 double responses with the same StartedTime values were excluded
further 139 double responses with the same ID and ActivityDate values were excluded
6 cases were excluded because surveyDuration was longer than 17h (i.e., it was unclear whether the responses referred to the current or the following day)

In summary, a total of 188 cases were excluded mainly due to double responses.

# printing info
cat("No. of excluded cases =",5133-nrow(Ndiary)) # 188

## No. of excluded cases = 188
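As a reminder of how the three filtering rules combine, here is a minimal sketch on toy data. The values are hypothetical, surveyDuration is assumed to be expressed in hours, and keeping the first response of each duplicated set is an assumption rather than the rule documented in section 2.7:

```r
# toy diary responses (hypothetical values); surveyDuration assumed to be in hours
raw <- data.frame(
  ID = c("s001","s001","s001","s001","s002","s002"),
  ActivityDate = as.Date(c("2019-01-07","2019-01-07","2019-01-08","2019-01-08",
                           "2019-01-07","2019-01-08")),
  StartedTime = as.POSIXct(c("2019-01-07 21:00:00","2019-01-07 21:00:00","2019-01-08 21:10:00",
                             "2019-01-08 23:30:00","2019-01-07 22:00:00","2019-01-08 21:00:00"),
                           tz="UTC"),
  surveyDuration = c(0.1,0.1,0.2,0.1,0.1,18),
  stringsAsFactors=FALSE)

# rule 1: double responses with the same StartedTime (first response kept, by assumption)
key1 <- paste(raw$ID,format(raw$StartedTime,"%Y-%m-%d %H:%M:%S"))
step1 <- raw[!duplicated(key1),]

# rule 2: double responses with the same ID and ActivityDate
key2 <- paste(step1$ID,step1$ActivityDate)
step2 <- step1[!duplicated(key2),]

# rule 3: responses with surveyDuration longer than 17h
step3 <- step2[step2$surveyDuration<=17,]

# No. of cases excluded at each step
c(rule1=nrow(raw)-nrow(step1),rule2=nrow(step1)-nrow(step2),rule3=nrow(step2)-nrow(step3))
```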

5.5.2.1. Isolated days

Here, as done for sleepLog data in section 5.2.2.3, we better inspect cases of temporally isolated days of dailyDiary measures, that is, dailyDiary data recorded substantially later than all other dailyDiary data previously recorded from the same participant.

# computing LAG values for ActivityDate
library(dplyr)
Ndiary <- Ndiary[order(Ndiary$ID,Ndiary$ActivityDate,Ndiary$StartedTime),]
Ndiary <- Ndiary %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Ndiary <- as.data.frame(Ndiary)
detach("package:dplyr", unload=TRUE)

# computing and plotting time lags
Ndiary$lag <- as.numeric(difftime(Ndiary$ActivityDate,Ndiary$AD_lag,units="days"))
hist(Ndiary$lag,main="Consecutive missing days day(i) - day(i-1)",breaks=100)

# printing info
n <- c(5,7,10,15,20,30,50)
for(i in 1:length(n)){ cat("\n- No. cases with >",n[i],"consecutive missing days =",
                           nrow(Ndiary[(!is.na(Ndiary$lag) & Ndiary$lag>n[i]),])) }

## 
## - No. cases with > 5 consecutive missing days = 23
## - No. cases with > 7 consecutive missing days = 7
## - No. cases with > 10 consecutive missing days = 2
## - No. cases with > 15 consecutive missing days = 1
## - No. cases with > 20 consecutive missing days = 0
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0

# showing 7 cases with more than 7 missing days
Ndiary.vars <- c("ID","ActivityDate","lag",
                 colnames(ema)[which(colnames(ema)=="StartedTime"):which(colnames(ema)=="diary.timeDiff")])
isolatedDay <- as.data.frame(matrix(nrow=0,ncol=length(Ndiary.vars)+1))
for(i in 1:nrow(Ndiary)){
  if(!is.na(Ndiary[i,"lag"]) & Ndiary[i,"lag"]>7){ 
    isolatedDay <- rbind(isolatedDay,Ndiary[(i-2):(i+2),c(Ndiary.vars,"TIB","TotalSteps")]) } }
isolatedDay[,c("ID","ActivityDate","lag","StartedTime","dailyStress","eveningMood","TIB","TotalSteps")]

Comments:

consecutive missing days of dailyDiary data (i.e., No. of days between each and the preceding observation from the same participant) are higher than 7 only for 7 cases, with only two cases showing consecutive missing days > 10 days
only one of these cases (ID s052, StartedTime 2019-12-01 21:09:00) is an isolated final day recorded several days later than the previous observation from the same participant, with no corresponding sleepLog or dailyAct values

Here, we manually remove this single case:

# selecting dailyDiary variables
diaryVars <- colnames(ema)[which(colnames(ema)=="StartedTime"):which(colnames(ema)=="diary.timeDiff")]

# manually removing dailyDiary variables from 1 case
memory <- ema
ema[!is.na(ema$StartedTime) & ema$ID=="s052" & as.character(ema$StartedTime)=="2019-12-01 21:09:00", 
    diaryVars] <- NA

# print info (1 removed case)
cat("No. excluded cases of isolated days =",
    nrow(memory[!is.na(memory$StartedTime),])-nrow(ema[!is.na(ema$StartedTime),]))

## No. excluded cases of isolated days = 1

In summary, we removed only 1 isolated observation, recorded 17 days after the preceding observation obtained from the same participant. After this removal, only one case shows more than 10 consecutive missing days.

library(dplyr)
Ndiary <- ema[!is.na(ema$StartedTime),]
Ndiary <- Ndiary[order(Ndiary$ID,Ndiary$ActivityDate,Ndiary$StartedTime),]
Ndiary <- Ndiary %>% group_by(ID) %>% mutate(AD_lag = dplyr::lag(ActivityDate,n=1,default=NA))
Ndiary <- as.data.frame(Ndiary)
detach("package:dplyr", unload=TRUE)

# printing info
Ndiary$lag <- as.numeric(difftime(Ndiary$ActivityDate,Ndiary$AD_lag,units="days"))
for(i in 1:length(n)){ cat("\n- No. cases with >",n[i],"consecutive missing days =",
                           nrow(Ndiary[(!is.na(Ndiary$lag) & Ndiary$lag>n[i]),])) }

## 
## - No. cases with > 5 consecutive missing days = 22
## - No. cases with > 7 consecutive missing days = 6
## - No. cases with > 10 consecutive missing days = 1
## - No. cases with > 15 consecutive missing days = 0
## - No. cases with > 20 consecutive missing days = 0
## - No. cases with > 30 consecutive missing days = 0
## - No. cases with > 50 consecutive missing days = 0
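The lag between consecutive observations can also be computed in base R, without loading and detaching dplyr. The following is a minimal sketch on toy data (hypothetical values), assuming ActivityDate is stored as a Date:

```r
# toy data (hypothetical values)
toy <- data.frame(
  ID = c("s001","s001","s001","s002","s002"),
  ActivityDate = as.Date(c("2019-01-07","2019-01-08","2019-01-25","2019-02-01","2019-02-03")),
  stringsAsFactors=FALSE)
toy <- toy[order(toy$ID,toy$ActivityDate),]

# days since the previous observation of the same participant (NA for the first one)
toy$lag <- ave(as.numeric(toy$ActivityDate),toy$ID,FUN=function(x) c(NA,diff(x)))
toy$lag

# No. of cases recorded more than 7 days after the preceding one
sum(!is.na(toy$lag) & toy$lag>7)
```

Note that a lag of 1 corresponds to two consecutive days, so a lag of 17 means that 16 days without observations separate the two responses.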

5.5.2.2. Duplicated

Then, we check again for duplicated cases, that is cases with the same ID and ActivityDate values.

# creating IDday variable
Ndiary <- ema[!is.na(ema$StartedTime),]
Ndiary$IDday <- as.factor(paste(Ndiary$ID,Ndiary$ActivityDate,sep="_"))

# sanity check by IDday (0 cases)
cat("Sanity check:",nrow(Ndiary[duplicated(Ndiary$IDday),])==0)

## Sanity check: TRUE

# No. of duplicated cases in general
cat("Sanity check:",nrow(ema[duplicated(ema$IDday),])==0)

## Sanity check: TRUE

Comments:

no duplicated cases are included in the ema dataset for dailyDiary variables
in general, there are no cases of duplicated days

5.5.2.3. Missing responses

Finally, we check the No. of missing responses at the three core variables, namely dailyStress, eveningMood, and eveningWorry, as well as the remaining dailyDiary variables.

# printing info
Ndiary <- ema[!is.na(ema$StartedTime),]
n <- nrow(Ndiary)
cat("- No. of cases with missing dailyStress:",nrow(Ndiary[is.na(Ndiary$dailyStress),]),
    "(",round(100*nrow(Ndiary[is.na(Ndiary$dailyStress),])/n,1),
    "% )\n - No. of cases with missing eveningMood:",nrow(Ndiary[is.na(Ndiary$eveningMood),]),
    "(",round(100*nrow(Ndiary[is.na(Ndiary$eveningMood),])/n,1),
    "% )\n - No. of cases with missing eveningWorry:",nrow(Ndiary[is.na(Ndiary$eveningWorry),]),
    "(",round(100*nrow(Ndiary[is.na(Ndiary$eveningWorry),])/n,1),
    "% )\n - all missing:", nrow(Ndiary[is.na(Ndiary$dailyStress) & 
                                           is.na(Ndiary$eveningMood) & is.na(Ndiary$eveningWorry),]),
    "(",round(100*nrow(Ndiary[is.na(Ndiary$dailyStress) & is.na(Ndiary$eveningMood) & 
                                is.na(Ndiary$eveningWorry),])/n,1),"% )")

## - No. of cases with missing dailyStress: 4 ( 0.1 % )
##  - No. of cases with missing eveningMood: 6 ( 0.1 % )
##  - No. of cases with missing eveningWorry: 5 ( 0.1 % )
##  - all missing: 0 ( 0 % )

# showing cases with 1+ missing in core variables
Ndiary[is.na(Ndiary$dailyStress) | is.na(Ndiary$eveningWorry) | is.na(Ndiary$eveningMood),
       c("ID","StartedTime","dailyStress","eveningMood","eveningWorry","TIB","TotalSteps","stageHR_NREM")]

Comments:

a total of 12 cases (0.2%) show a missing response to one or more core dailyDiary variables

Here, we remove these 12 cases with missing responses to one or more dailyDiary core variables.

# removing 12 cases with missing responses to one or more core dailyDiary variables
memory <- ema
ema[!is.na(ema$StartedTime) & (is.na(ema$dailyStress) | is.na(ema$eveningWorry) | is.na(ema$eveningMood)),
    diaryVars] <- NA
cat("Removed",nrow(ema[is.na(ema$StartedTime),])-nrow(memory[is.na(memory$StartedTime),]),"cases")

## Removed 12 cases
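The same selection can be expressed with complete.cases. This is a minimal sketch on toy data (hypothetical values; diaryCols is a shortened stand-in for diaryVars) showing that only the diary columns are blanked, while the sleep data recorded on the same day are retained:

```r
# toy data (hypothetical values)
toy <- data.frame(
  ID = c("s001","s001","s002"),
  StartedTime = as.POSIXct(c("2019-01-07 21:00:00","2019-01-08 21:00:00",
                             "2019-01-07 22:00:00"),tz="UTC"),
  dailyStress = c(2,NA,1),
  eveningMood = c(4,3,5),
  eveningWorry = c(1,2,3),
  TIB = c(480,450,500),
  stringsAsFactors=FALSE)

coreVars <- c("dailyStress","eveningMood","eveningWorry")
diaryCols <- c("StartedTime",coreVars) # stand-in for diaryVars

# diary responses with 1+ missing core variables
incomplete <- !is.na(toy$StartedTime) & !complete.cases(toy[,coreVars])

# blanking the diary columns only: the row and its TIB value are kept
toy[incomplete,diaryCols] <- NA
toy
```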

Finally, we inspect the No. of missing data in the remaining dailyDiary variables.

summary(ema[!is.na(ema$StartedTime),diaryVars])

##   StartedTime                  SubmittedTime                 surveyDuration    
##  Min.   :2019-01-07 21:00:00   Min.   :2019-01-07 21:01:00   Min.   :  0.0000  
##  1st Qu.:2019-06-21 21:19:30   1st Qu.:2019-06-21 21:19:30   1st Qu.:  0.0000  
##  Median :2019-12-24 00:10:30   Median :2019-12-24 00:10:30   Median :  0.0000  
##  Mean   :2020-02-13 20:54:59   Mean   :2020-02-13 20:55:53   Mean   :  0.8974  
##  3rd Qu.:2020-10-17 22:06:45   3rd Qu.:2020-10-17 22:08:15   3rd Qu.:  1.0000  
##  Max.   :2021-04-30 21:00:00   Max.   :2021-04-30 21:00:00   Max.   :758.0000  
##                                                                                
##   dailyStress     eveningWorry    eveningMood    stress_school stress_family
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   0   :2617     0   :3440    
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1   :2141     1   : 404    
##  Median :2.000   Median :2.000   Median :4.000   NA's: 174     NA's:1088    
##  Mean   :2.213   Mean   :2.245   Mean   :3.606                              
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000                              
##  Max.   :5.000   Max.   :5.000   Max.   :5.000                              
##                                                                             
##  stress_health stress_COVID stress_peers stress_other stress_total worry_school
##  0   :2628     0   : 772    0   :2832    0   :3556    0:1554       0   :2500   
##  1   : 309     1   :  99    1   : 537    1   :1277    1:2286       1   :2241   
##  NA's:1995     NA's:4061    NA's:1563    NA's:  99    2: 849       NA's: 191   
##                                                       3: 201                   
##                                                       4:  34                   
##                                                       5:   4                   
##                                                       6:   4                   
##  worry_family worry_health worry_peer  worry_COVID worry_sleep worry_other
##  0   :3281    0   :2670    0   :2890   0   : 871   0   :2928   0   :3313  
##  1   : 294    1   : 351    1   : 554   1   :  92   1   : 732   1   :1289  
##  NA's:1357    NA's:1911    NA's:1488   NA's:3969   NA's:1272   NA's: 330  
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  diary.timeDiff    
##  Min.   :-22.1583  
##  1st Qu.:  0.0708  
##  Median :  1.0167  
##  Mean   : -0.5663  
##  3rd Qu.:  2.5500  
##  Max.   :  8.7417  
##  NA's   :541

Comments:

no more missing cases are observed for core variables
as noted in section 3.7, the percentage of missing values in the remaining variables varies from 2% to 82%
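The percentage of missing values per variable can be read off directly with colMeans. Below is a minimal sketch on toy data (hypothetical values), followed by a check of the two extremes using the counts printed in the summary above (99 and 4,061 missing values out of 4,932 cases):

```r
# toy data (hypothetical values)
toy <- data.frame(
  stress_school = factor(c(0,1,NA,0,1)),
  stress_COVID = factor(c(NA,NA,NA,0,NA)),
  worry_sleep = factor(c(0,0,1,NA,NA)))

# % of missing values per variable
pctMissing <- round(100*colMeans(is.na(toy)),1)
sort(pctMissing)

# extremes from the summary above: stress_other (99 NA) and stress_COVID (4,061 NA)
round(100*c(stress_other=99,stress_COVID=4061)/4932,1)
```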

5.5.3. Summary of data

In summary, from the original No. of dailyDiary measures (N = 5,133, corresponding to a compliance of 88.6%), we excluded: 43 + 139 = 182 duplicated cases, 6 cases with surveyDuration >17h, 1 case of isolated day, and 12 cases of missing responses to one or more core variables.

Thus, data cleaning led to a total No. of excluded dailyDiary responses = 201 cases (4%).

cat(nrow(ema[!is.na(ema$StartedTime),]),"'cleaned' cases of dailyDiary data") # 4,932

## 4932 'cleaned' cases of dailyDiary data
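The bookkeeping above can be verified with a few lines of arithmetic. The counts are those reported in this section, whereas the labels in the vector are only illustrative names:

```r
# No. of excluded cases by reason (counts from this section; names are illustrative)
original <- 5133
excluded <- c(sameStartedTime=43,sameIDdate=139,longSurvey=6,isolatedDay=1,missingCore=12)

sum(excluded)                        # total No. of excluded cases
original-sum(excluded)               # No. of 'cleaned' cases
round(100*sum(excluded)/original,1)  # % of excluded cases
```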

Here, we compute the updated information on the nonmissing data related to dailyDiary variables.

# updating compliance dataset
Ndiary <- ema[!is.na(ema$StartedTime),]
for(i in 1:nrow(compliance)){
  compliance[i,"nDiary_clean"] <- 
    nlevels(as.factor(as.character(Ndiary[as.character(Ndiary$ID)==compliance[i,"ID"],"IDday"])))
  compliance[i,"nDiary_sleep_clean"] <- 
    nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] & 
                                         !is.na(ema$TIB) & !is.na(ema$StartedTime),"IDday"])))
  compliance[i,"nDiary_sleepAct_clean"] <- 
    nlevels(as.factor(as.character(ema[as.character(ema$ID)==compliance[i,"ID"] & 
               !is.na(ema$TIB) & !is.na(ema$TotalSteps) & !is.na(ema$StartedTime),"IDday"]))) }

# printing compliance info
cat("\n\ndailyDiary data:\n- 'clean' No. days/participants =",
    round(mean(compliance$nDiary_clean),2)," ( SD =",round(sd(compliance$nDiary_clean),2),
    ") \n- 'Clean' dailyDiary compliance =",
    round(mean(100*compliance$nDiary_clean/60),2),"% ( SD =",round(sd(100*compliance$nDiary_clean/60),2),
    ") \n- non missing 'clean' diary cases/Total No. 'clean' sleep cases =",
    round(mean(100*compliance$nDiary_sleep_clean/compliance$nSleep),2),
    "% ( SD =",round(sd(100*compliance$nDiary_sleep_clean/compliance$nSleep),2),
    ") \n- non missing 'clean' diary cases/Total No. 'clean' sleep AND dailyAct cases =",
    round(mean(100*compliance$nDiary_sleepAct_clean/compliance$nSleep),2),
    "% ( SD =",round(sd(100*compliance$nDiary_sleepAct_clean/compliance$nSleep),2),")")

## 
## 
## dailyDiary data:
## - 'clean' No. days/participants = 53.03  ( SD = 10.18 ) 
## - 'Clean' dailyDiary compliance = 88.39 % ( SD = 16.96 ) 
## - non missing 'clean' diary cases/Total No. 'clean' sleep cases = 82.72 % ( SD = 17.74 ) 
## - non missing 'clean' diary cases/Total No. 'clean' sleep AND dailyAct cases = 70.18 % ( SD = 23.48 )

# plotting No. of cases
hist(compliance$nDiary_clean,breaks=50,main="No. of nonmissing dailyDiary values per participant",xlab="")

Comments:

compliance is almost identical to that of the original dataset

5.6. Individual compliance

Finally, we inspect the overall individual compliance by considering all the ‘cleaned’ data in the five classes of variables: sleepLog, sleepStages, sleepHR, dailyDiary, and dailyAct, and their combination. For each of them, we also evaluate the No. of participants with extreme missing data, and the temporal continuity for each class of data.

SUMMARY OF DATA

OVERALL

Here we summarize the No. and % of ‘clean’ data for each class of variables.

# printing compliance info
cat("\n\nsleepLog data: total No. 'clean' cases =",nrow(ema[!is.na(ema$TIB),]), # sleepLog
    "\n- No. days/participants =",round(mean(compliance$nSleep_clean),2)," ( SD =",
    round(sd(compliance$nSleep_clean),2),") \n- 'Clean' sleepLog compliance =",
    round(mean(100*compliance$nSleep_clean/60),2),"% ( SD =",round(sd(100*compliance$nSleep_clean/60),2),
    
    ")\n\nsleepStages data: total No. 'clean' cases =",nrow(ema[!is.na(ema$light),]), # sleepStages
    "\n- No. days/participants =", round(mean(compliance$nSleepStage_clean),2)," ( SD =",
    round(sd(compliance$nSleepStage_clean),2),") \n- 'Clean' sleepStages compliance =",
    round(mean(100*compliance$nSleepStage_clean/60),2),"% ( SD =",round(sd(100*compliance$nSleepStage_clean/60),2),
    
    ")\n\nsleepHR data: total No. 'clean' cases =",nrow(ema[!is.na(ema$stageHR_TST),]), # sleepHR
    "\n- No. days/participants =",round(mean(compliance$nstageHR_NREM_clean),2)," ( SD =",
    round(sd(compliance$nstageHR_NREM_clean),2),") \n- 'Clean' sleepHR compliance =",
    round(mean(100*compliance$nstageHR_NREM_clean/60),2),"% ( SD =",round(sd(100*compliance$nstageHR_NREM_clean/60),2),
    
    ")\n\ndailyDiary data: total No. 'clean' cases =",nrow(ema[!is.na(ema$StartedTime),]), # dailyDiary
    "\n- No. days/participants =",round(mean(compliance$nDiary_clean),2)," ( SD =",
    round(sd(compliance$nDiary_clean),2),") \n- 'Clean' dailyDiary compliance =",
    round(mean(100*compliance$nDiary_clean/60),2),"% ( SD =",round(sd(100*compliance$nDiary_clean/60),2),
    
    ")\n\ndailyAct data: total No. 'clean' cases =",nrow(ema[!is.na(ema$TotalSteps),]), # dailyAct
    "\n- No. days/participants =",round(mean(compliance$ndailyAct_clean),2)," ( SD =",
    round(sd(compliance$ndailyAct_clean),2),") \n- 'Clean' dailyAct compliance =",
    round(mean(100*compliance$ndailyAct_clean/60),2),"% ( SD =",round(sd(100*compliance$ndailyAct_clean/60),2),")")

## 
## 
## sleepLog data: total No. 'clean' cases = 5121 
## - No. days/participants = 55.06  ( SD = 14.03 ) 
## - 'Clean' sleepLog compliance = 91.77 % ( SD = 23.38 )
## 
## sleepStages data: total No. 'clean' cases = 4401 
## - No. days/participants = 47.32  ( SD = 14.57 ) 
## - 'Clean' sleepStages compliance = 78.87 % ( SD = 24.28 )
## 
## sleepHR data: total No. 'clean' cases = 0 
## - No. days/participants = 46.45  ( SD = 14.13 ) 
## - 'Clean' sleepHR compliance = 77.42 % ( SD = 23.56 )
## 
## dailyDiary data: total No. 'clean' cases = 4932 
## - No. days/participants = 53.03  ( SD = 10.18 ) 
## - 'Clean' dailyDiary compliance = 88.39 % ( SD = 16.96 )
## 
## dailyAct data: total No. 'clean' cases = 4664 
## - No. days/participants = 50.15  ( SD = 15.97 ) 
## - 'Clean' dailyAct compliance = 83.58 % ( SD = 26.62 )

RESPECTIVE TO sleepLog

Here, we summarize the No. and % of ‘clean’ data focusing on cases with nonmissing sleepLog data.

# printing compliance info
n <- ema[!is.na(ema$TIB),]
cat("\n\nsleepLog data: total No. 'clean' cases =",nrow(n),", of which:",
    "\n- ",nrow(n[!is.na(n$light),]),"cases with nonmissing sleepStages data (",
    round(100*nrow(n[!is.na(n$light),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$stageHR_NREM),]),"cases with nonmissing stageHR data (",
    round(100*nrow(n[!is.na(n$stageHR_NREM),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$StartedTime),]),"cases with nonmissing dailyDiary data (",
    round(100*nrow(n[!is.na(n$StartedTime),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$TotalSteps),]),"cases with nonmissing TotalSteps data (",
    round(100*nrow(n[!is.na(n$TotalSteps),])/nrow(n),2),"% )")

## 
## 
## sleepLog data: total No. 'clean' cases = 5121 , of which: 
## -  4401 cases with nonmissing sleepStages data ( 85.94 % )
## -  4320 cases with nonmissing stageHR data ( 84.36 % )
## -  4333 cases with nonmissing dailyDiary data ( 84.61 % )
## -  4310 cases with nonmissing TotalSteps data ( 84.16 % )

RESPECTIVE TO sleepStages

Here, we summarize the No. and % of ‘clean’ data focusing on cases with nonmissing sleepStages data.

# printing compliance info
n <- ema[!is.na(ema$light),]
cat("\n\nsleepStages data: total No. 'clean' cases =",nrow(n),", of which:",
    "\n- ",nrow(n[!is.na(n$stageHR_NREM),]),"cases with nonmissing stageHR data (",
    round(100*nrow(n[!is.na(n$stageHR_NREM),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$StartedTime),]),"cases with nonmissing dailyDiary data (",
    round(100*nrow(n[!is.na(n$StartedTime),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$TotalSteps),]),"cases with nonmissing TotalSteps data (",
    round(100*nrow(n[!is.na(n$TotalSteps),])/nrow(n),2),"% )")

## 
## 
## sleepStages data: total No. 'clean' cases = 4401 , of which: 
## -  4320 cases with nonmissing stageHR data ( 98.16 % )
## -  3741 cases with nonmissing dailyDiary data ( 85 % )
## -  3729 cases with nonmissing TotalSteps data ( 84.73 % )

RESPECTIVE TO sleepHR

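Here, we summarize the No. and % of ‘clean’ data focusing on cases with nonmissing sleepHR data. Note that the No. of missing values for sleepHR cases varies from variable to variable. Here, we consider stageHR_TST as one of the variables with the fewest missing data. Note, however, that the output below reports 0 nonmissing stageHR_TST cases, whereas stageHR_NREM (used in the other summaries) is nonmissing in 4,320 cases.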

# printing compliance info
n <- ema[!is.na(ema$stageHR_TST),]
cat("\n\nsleepHR data: total No. 'clean' cases =",nrow(n),", of which:",
    "\n- ",nrow(n[!is.na(n$light),]),"cases with nonmissing sleepStages data (",
    round(100*nrow(n[!is.na(n$light),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$StartedTime),]),"cases with nonmissing dailyDiary data (",
    round(100*nrow(n[!is.na(n$StartedTime),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$TotalSteps),]),"cases with nonmissing TotalSteps data (",
    round(100*nrow(n[!is.na(n$TotalSteps),])/nrow(n),2),"% )")

## 
## 
## sleepHR data: total No. 'clean' cases = 0 , of which: 
## -  0 cases with nonmissing sleepStages data ( NaN % )
## -  0 cases with nonmissing dailyDiary data ( NaN % )
## -  0 cases with nonmissing TotalSteps data ( NaN % )

RESPECTIVE TO dailyDiary

Here, we summarize the No. and % of ‘clean’ data focusing on cases with nonmissing dailyDiary data.

# printing compliance info
n <- ema[!is.na(ema$StartedTime),]
cat("\n\ndailyDiary data: total No. 'clean' cases =",nrow(n),", of which:",
    "\n- ",nrow(n[!is.na(n$TIB),]),"cases with nonmissing sleepLog data (",
    round(100*nrow(n[!is.na(n$TIB),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$light),]),"cases with nonmissing sleepStages data (",
    round(100*nrow(n[!is.na(n$light),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$stageHR_NREM),]),"cases with nonmissing stageHR data (",
    round(100*nrow(n[!is.na(n$stageHR_NREM),])/nrow(n),2),
    "% )\n- ",nrow(n[!is.na(n$TotalSteps),]),"cases with nonmissing TotalSteps data (",
    round(100*nrow(n[!is.na(n$TotalSteps),])/nrow(n),2),"% )")

## 
## 
## dailyDiary data: total No. 'clean' cases = 4932 , of which: 
## -  4333 cases with nonmissing sleepLog data ( 87.85 % )
## -  3741 cases with nonmissing sleepStages data ( 75.85 % )
## -  3692 cases with nonmissing stageHR data ( 74.86 % )
## -  3949 cases with nonmissing TotalSteps data ( 80.07 % )
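The four summaries above report the same pairwise counts with different denominators (e.g., the 4,333 cases with both sleepLog and dailyDiary data are 84.61% of the 5,121 sleepLog cases but 87.85% of the 4,932 dailyDiary cases). A compact way to obtain all pairs at once is a co-availability matrix. This is a minimal sketch on toy data (hypothetical values), using one indicator variable per class of data as done above:

```r
# toy data (hypothetical values): one indicator variable per class of data
toy <- data.frame(
  TIB = c(480,450,NA,500),          # sleepLog
  light = c(200,NA,NA,210),         # sleepStages
  StartedTime = c(1,1,1,NA),        # dailyDiary (any nonmissing value)
  TotalSteps = c(9000,NA,7000,8000)) # dailyAct

# 1 = nonmissing, 0 = missing
avail <- (!is.na(toy))+0

# counts of cases where both classes are nonmissing (diagonal = No. of cases per class)
both <- crossprod(avail)

# row i, column j: % of class-i cases that also include class-j data
pct <- round(100*both/diag(both),2)
both
pct
```

The pct matrix is not symmetric, since each row is divided by the No. of nonmissing cases of its own class.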

MISSING DATA

Here, we inspect the No. of participants with extreme No. of missing values for each variable and their combination. The missingInfo and the infoParticipant functions are used to optimize the process.

show missingInfo

missingInfo <- function(data=NA,var.name=NA,missingThreshold=NA){ 
  
  # renaming focus variable
  colnames(data)[which(colnames(data)==var.name)] <- "variable"
  
  # selecting cases with focus variable < missingThreshold
  highMiss <- data[data$variable<missingThreshold,]
  
  # plotting
  hist(data$variable,breaks=100,xlab="",main=paste(var.name,": No. of nonmissing from",
                                                   min(data$variable),"to",max(data$variable)))
  
  # when 1+ cases have a No. of nonmissing observations < missingThreshold
  if(nrow(highMiss)>0){
    abline(v=missingThreshold,col="red")
    
    # printing info
    cat("\n\n",nrow(highMiss),"participants with less than",missingThreshold,"nonmissing days of",var.name,":")
    for(i in 1:nrow(highMiss)){ cat("\n- ",i,as.character(highMiss[i,"ID"]),"( insomnia =",
                                    as.character(highMiss[i,"insomnia"]),
                                    ") :",highMiss[i,"variable"],"observations") }
    } else { cat("\n\n","no participants with less than",missingThreshold,"nonmissing days of",var.name) }
  }

show infoParticipant

infoParticipant <- function(data=NA,var.name=NA,participants=NA){ 
  
  # renaming focus variable
  colnames(data)[which(colnames(data)==var.name)] <- "variable"
  
  # selecting the rows of the requested participants
  highMiss <- data[data$ID%in%participants,]
  
  # printing info on participant
  for(i in 1:nrow(highMiss)){ cat("\n- ",i,as.character(highMiss[i,"ID"]),"( insomnia =",
                                    as.character(highMiss[i,"insomnia"]),
                                    ") :",highMiss[i,"variable"],"observations") }}

sleepLog

missingInfo(data=compliance,var.name="nSleep_clean",missingThreshold=10)

## 
## 
##  2 participants with less than 10 nonmissing days of nSleep_clean :
## -  1 s038 ( insomnia = 0 ) : 5 observations
## -  2 s089 ( insomnia = 1 ) : 6 observations

Comments:

only two participants have less than 10 nonmissing observations for sleepLog variables: s038 and s089
further three participants have from 10 to 20 observations

sleepStages

missingInfo(data=compliance,var.name="nSleepStage_clean",missingThreshold=10)

## 
## 
##  3 participants with less than 10 nonmissing days of nSleepStage_clean :
## -  1 s038 ( insomnia = 0 ) : 5 observations
## -  2 s052 ( insomnia = 1 ) : 8 observations
## -  3 s089 ( insomnia = 1 ) : 4 observations

Comments:

only three participants have less than 10 nonmissing observations for sleepStages variables: s038, s089, and s052
further four participants have from 10 to 20 observations

sleepHR

missingInfo(data=compliance,var.name="nstageHR_NREM_clean",missingThreshold=10)

## 
## 
##  3 participants with less than 10 nonmissing days of nstageHR_NREM_clean :
## -  1 s038 ( insomnia = 0 ) : 5 observations
## -  2 s052 ( insomnia = 1 ) : 8 observations
## -  3 s089 ( insomnia = 1 ) : 4 observations

Comments:

only three participants have less than 10 nonmissing observations for the stageHR_NREM variable: s038, s089, and s052

dailyDiary

missingInfo(data=compliance,var.name="nDiary_clean",missingThreshold=10)

## 
## 
##  no participants with less than 10 nonmissing days of nDiary_clean

Comments:

no participants have less than 10 nonmissing observations for dailyDiary variables
three participants have from 10 to 30 observations

Let’s see who these three participants are.

missingInfo(data=compliance,var.name="nDiary_clean",missingThreshold=30)

## 
## 
##  3 participants with less than 30 nonmissing days of nDiary_clean :
## -  1 s040 ( insomnia = 0 ) : 26 observations
## -  2 s050 ( insomnia = 1 ) : 27 observations
## -  3 s051 ( insomnia = 0 ) : 29 observations

Comments:

the three participants with less than 30 observations are s040, s050, and s051

Finally, let’s see the No. of nonmissing dailyDiary observations for those participants with the highest No. of missing data for sleepLog variables (i.e., s038, s089, and s052). We can see these participants have more than one month of nonmissing dailyDiary observations.

infoParticipant(data=compliance,var.name="nDiary_clean",participants=c("s038","s089","s052"))

## 
## -  1 s038 ( insomnia = 0 ) : 35 observations
## -  2 s052 ( insomnia = 1 ) : 31 observations
## -  3 s089 ( insomnia = 1 ) : 38 observations

dailyAct

missingInfo(data=compliance,var.name="ndailyAct_clean",missingThreshold=10)

## 
## 
##  4 participants with less than 10 nonmissing days of ndailyAct_clean :
## -  1 s065 ( insomnia = 1 ) : 8 observations
## -  2 s080 ( insomnia = 1 ) : 3 observations
## -  3 s089 ( insomnia = 1 ) : 5 observations
## -  4 s095 ( insomnia = 0 ) : 0 observations

Comments:

four participants have less than 10 nonmissing observations for dailyAct variables: s089, s065, s080, and s095
importantly, s095 has no data with nonmissing dailyAct variables
further three participants have from 10 to 20 observations

sleepLog & dailyAct

missingInfo(data=compliance,var.name="ndailyAct_sleep_clean",missingThreshold=10)

## 
## 
##  6 participants with less than 10 nonmissing days of ndailyAct_sleep_clean :
## -  1 s038 ( insomnia = 0 ) : 4 observations
## -  2 s040 ( insomnia = 0 ) : 7 observations
## -  3 s065 ( insomnia = 1 ) : 8 observations
## -  4 s080 ( insomnia = 1 ) : 2 observations
## -  5 s089 ( insomnia = 1 ) : 4 observations
## -  6 s095 ( insomnia = 0 ) : 0 observations

Comments:

six participants have less than 10 nonmissing observations for both sleepLog and dailyAct variables: s038, s089, s040, s065, s080, and s095
importantly, s095 has no data with simultaneously nonmissing sleepLog and dailyAct variables
one further participant has from 10 to 20 observations

sleepLog & dailyDiary

missingInfo(data=compliance,var.name="nDiary_sleep_clean",missingThreshold=10)

## 
## 
##  4 participants with less than 10 nonmissing days of nDiary_sleep_clean :
## -  1 s038 ( insomnia = 0 ) : 2 observations
## -  2 s040 ( insomnia = 0 ) : 9 observations
## -  3 s041 ( insomnia = 1 ) : 8 observations
## -  4 s089 ( insomnia = 1 ) : 4 observations

Comments:

four participants have less than 10 nonmissing observations for both sleepLog and dailyDiary variables: s038, s089, s040, s041
one further participant has from 10 to 20 observations

sleepLog & dailyDiary & dailyAct

missingInfo(data=compliance,var.name="nDiary_sleepAct_clean",missingThreshold=10)

## 
## 
##  7 participants with less than 10 nonmissing days of nDiary_sleepAct_clean :
## -  1 s038 ( insomnia = 0 ) : 2 observations
## -  2 s040 ( insomnia = 0 ) : 5 observations
## -  3 s041 ( insomnia = 1 ) : 6 observations
## -  4 s065 ( insomnia = 1 ) : 6 observations
## -  5 s080 ( insomnia = 1 ) : 2 observations
## -  6 s089 ( insomnia = 1 ) : 3 observations
## -  7 s095 ( insomnia = 0 ) : 0 observations

Comments:

seven participants have less than 10 observations with simultaneously nonmissing sleepLog, dailyDiary, and dailyAct variables: s038, s089, s040, s041, s065, s080, and s095
importantly, s095 has no data with simultaneously nonmissing sleepLog, dailyDiary, and dailyAct variables
one further participant has from 10 to 20 observations
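The participants flagged across the different classes of variables can also be listed in a single step. The following is a minimal sketch on a toy compliance table (hypothetical values), applying the same threshold of 10 nonmissing days to several count variables at once:

```r
# toy compliance table (hypothetical values)
toy <- data.frame(
  ID = c("s001","s002","s003"),
  nSleep_clean = c(55,5,40),
  nDiary_clean = c(58,35,26),
  ndailyAct_clean = c(50,8,0),
  stringsAsFactors=FALSE)

# IDs with less than 10 nonmissing days, for each count variable
vars <- c("nSleep_clean","nDiary_clean","ndailyAct_clean")
lowData <- lapply(setNames(vars,vars),function(v) toy$ID[toy[[v]]<10])
lowData

# No. of variables for which each participant is flagged
table(unlist(lowData))
```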

TEMPORAL CONTINUITY

Finally, we use the tempCont function to plot DAY order against time for each participant in order to better inspect the pattern of missing data and the cases of discontinuous data recording. Yellow is for dailyAct, blue is for sleepLog, purple is for sleepStages, red is for sleepHR, and green is for dailyDiary. To reduce overplotting, the points of each class are shifted horizontally by a fixed No. of days (dailyDiary -20, dailyAct -10, sleepStages +20, sleepHR +30), so that the actual data collection period is that shown in blue for sleepLog variables. The horizontal red line shows the No. of observations with simultaneously nonmissing sleepLog, dailyDiary, and dailyAct data (or sleepLog and dailyDiary data only, for participants without dailyAct data).

show tempCont

tempCont <- function(data=NA,compliance=NA){ 
  
  # keeping only days with sleepLog (TIB) and/or dailyDiary (StartedTime) data
  data <- data[!(is.na(data$TIB) & is.na(data$StartedTime)),]
  
  par(mfrow=c(3,3)) # nine participants per page
  for(ID in levels(data$ID)){ 
    IDdata <- data[data$ID==ID,]
    IDcompliance <- compliance[compliance$ID==ID,]
    
    # x-axis = recording period +/- 30 days, y-axis = from 0 to the highest No. of 'clean' observations
    Xlim <- c(min(IDdata$ActivityDate)-30,max(IDdata$ActivityDate)+30)
    Ylim <- c(0,
              max(IDcompliance$nSleep_clean,IDcompliance$nSleepStage_clean,IDcompliance$nstageHR_NREM_clean,
                  IDcompliance$nDiary_clean,IDcompliance$ndailyAct_clean))
    
    # yellow = dailyAct (shifted by -10 days), which opens the plot when available
    n <- IDdata[!is.na(IDdata$TotalSteps),]
    if(nrow(n)>0){ n$ActivityDate <- n$ActivityDate - 10
      plot((1:nrow(n)~n$ActivityDate),main=ID,xlab="",ylab="",xlim=Xlim,ylim=Ylim,
           col=rgb(1,1,0,alpha=0.4),cex=3,pch=20)
      n <- IDdata[!is.na(IDdata$TIB),] # blue = sleepLog (not shifted)
      points((1:nrow(n)~n$ActivityDate),col=rgb(0,0,1,alpha=0.4),cex=3,pch=20) 
    } else { # otherwise, sleepLog (blue) opens the plot
      n <- IDdata[!is.na(IDdata$TIB),]
      plot((1:nrow(n)~n$ActivityDate),main=ID,xlab="",ylab="",xlim=Xlim,ylim=Ylim,
           col=rgb(0,0,1,alpha=0.4),cex=3,pch=20) }
    
    # purple = sleepStages (shifted by +20 days)
    n <- IDdata[!is.na(IDdata$light),]; n$ActivityDate <- n$ActivityDate + 20
    points((1:nrow(n)~n$ActivityDate),col=rgb(1,0,1,alpha=0.4),cex=3,pch=20)
    
    # red = sleepHR (shifted by +30 days)
    n <- IDdata[!is.na(IDdata$stageHR_NREM),]; n$ActivityDate <- n$ActivityDate + 30
    points((1:nrow(n)~n$ActivityDate),col=rgb(1,0,0,alpha=0.4),cex=3,pch=20)
    
    # green = dailyDiary (shifted by -20 days)
    n <- IDdata[!is.na(IDdata$StartedTime),]; n$ActivityDate <- n$ActivityDate - 20
    points((1:nrow(n)~n$ActivityDate),col=rgb(0,1,0,alpha=0.4),cex=3,pch=20) 
    
    # red line = No. of days with nonmissing sleepLog, dailyDiary, and (when available) dailyAct data
    abline(h=ifelse(IDcompliance$ndailyAct_clean > 0,IDcompliance$nDiary_sleepAct_clean,
                    IDcompliance$nDiary_sleep_clean),col="red") }}

tempCont(data=ema,compliance=compliance)
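
As a numerical complement to the visual inspection, discontinuous recording can also be summarized by the largest gap (in days) between consecutive observations of each participant. The following is only a minimal sketch on hypothetical dates: the `toy` data frame, the `gapDays` helper, and the 2-day threshold are assumptions introduced for illustration and are not part of the pre-processing pipeline.

```r
# toy data: s001 recorded on consecutive days, s002 with interruptions (hypothetical dates)
toy <- data.frame(
  ID = factor(rep(c("s001", "s002"), each = 5)),
  ActivityDate = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05",
                           "2020-03-01", "2020-03-02", "2020-03-06", "2020-03-07", "2020-03-20")))

# hypothetical helper: largest gap (days) between consecutive recorded days
gapDays <- function(dates) {
  dates <- sort(unique(dates))
  if (length(dates) < 2) return(NA_real_) # a gap needs at least two recorded days
  max(as.numeric(diff(dates)))
}

# largest gap per participant
maxGap <- sapply(split(toy$ActivityDate, toy$ID), gapDays)

# participants with at least one gap longer than 2 days (arbitrary threshold)
discontinuous <- names(maxGap)[!is.na(maxGap) & maxGap > 2]
```

Participants flagged in this way could then be cross-checked against the corresponding panels of the tempCont plot.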

Comments:

the most critical cases are s038 (control; nSleep_clean = 5 of 14, ndailyAct_clean = 10 of 64, and only 2 observations with simultaneously nonmissing dailyDiary and sleepLog data), s040 (control; nSleep_clean = 15 of 18, ndailyAct_clean = 13 of 64, and only 9 observations with simultaneously nonmissing dailyDiary and sleepLog data), s041 (insomnia; nSleep_clean = 15 of 21, ndailyAct_clean = 18 of 100, and only 8 observations with simultaneously nonmissing dailyDiary and sleepLog data), and s089 (insomnia; nSleep_clean = 6 of 6, ndailyAct_clean = 5 of 63, and only 4 observations with simultaneously nonmissing dailyDiary and sleepLog data)
a number of further participants have minor problems, including s024 (control; ndailyAct_clean = 15 of 64), s052 (insomnia; nSleep_clean = 14 of 23, and ndailyAct_clean = 10 of 64), s065 (insomnia; ndailyAct_clean = 15 of 63), s080 (insomnia; ndailyAct_clean = 3 of 63), s095 (control; ndailyAct_clean = 0 of 64), and s105 (nSleepStage_clean = 17 of 63)

Here, we show details on compliance for all of the mentioned participants.

# major problems
majors <- c("s038","s040","s041","s089")
compliance[compliance$ID %in% majors,]

# minor problems
minors <- c("s024","s052","s065","s080","s095","s105")
compliance[compliance$ID %in% minors,]

Comments:

major problems are probably due to technical malfunctions of the FC3 device, resulting in a low No. of ‘clean’ sleepLog observations (ranging from 5 to 15) and ‘clean’ dailyAct observations (ranging from 5 to 18), despite the higher No. of ‘clean’ dailyDiary observations (ranging from 26 to 46)
in these cases, the difference between the No. of original and ‘clean’ dailyAct observations suggests that our filtering of the former was effective (i.e., out of 63-100 original observations, only 5-18 were valid, a No. similar to that of ‘clean’ sleepLog cases)
technical malfunctions of the FC3 device are also the likely origin of some minor problems, including a low No. of Fitbit data overall (i.e., s052), a low No. of ‘clean’ dailyAct data (i.e., s024, s065, s080, and s095; No. of ‘clean’ dailyAct observations ranging from 0 to 15), and a low No. of ‘clean’ sleepStages data (i.e., s105)

MARKING DATA

Finally, we mark the cases highlighted in the TEMPORAL CONTINUITY section by specifying the majMiss and the minMiss variables.

# creating majMiss and minMiss variables
ema$majMiss <- ema$minMiss <- 0
ema[ema$ID %in% majors,"majMiss"] <- 1
ema[ema$ID %in% minors,"minMiss"] <- 1
ema[,c("majMiss","minMiss")] <- lapply(ema[,c("majMiss","minMiss")],as.factor)

# summarizing variables
summary(ema[,c("majMiss","minMiss")])

##  majMiss  minMiss 
##  0:5910   0:5829  
##  1: 309   1: 390
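
Note that the summary above counts rows (i.e., days) rather than participants. The following is a minimal sketch of how the No. of flagged participants could be checked; it uses a hypothetical toy data frame in place of ema, and hypothetical `majors` and `minors` vectors.

```r
# toy data: 3 participants with 3, 2, and 4 days of data (hypothetical)
toy <- data.frame(ID = factor(rep(c("s001", "s002", "s003"), times = c(3, 2, 4))))
majors <- c("s002") # hypothetical participant with major problems
minors <- c("s003") # hypothetical participant with minor problems

# marking rows as done for ema
toy$majMiss <- toy$minMiss <- 0
toy[toy$ID %in% majors, "majMiss"] <- 1
toy[toy$ID %in% minors, "minMiss"] <- 1

# No. of flagged rows (what summary() reports after conversion to factor)
nRowsMaj <- sum(toy$majMiss == 1)
nRowsMin <- sum(toy$minMiss == 1)

# No. of flagged participants
nMaj <- length(unique(toy$ID[toy$majMiss == 1]))
nMin <- length(unique(toy$ID[toy$minMiss == 1]))

# no participant should be flagged in both variables
noOverlap <- !any(toy$majMiss == 1 & toy$minMiss == 1)
```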

6. Data export

As the very final step, we sort the ema columns one more time, and we export the final processed dataset.

# sorting final dataset
ema <- ema[,c("ID","sex","age","BMI","insomnia","insomnia.group", # demos at the beginning
              "majMiss","minMiss", # response rate criteria
              
              "ActivityDate", # date (day of the year)
              
              # sleepLog
              "LogId","StartTime","EndTime","SleepDataType","EBEDataType", # sleepLog info
              "TIB","TST","SE","SO","WakeUp","midSleep","SOL","WASO","nAwake","fragIndex", # sleepLog variables
              
              # sleepStages
              "light","deep","rem",
              
              # sleepHR
              "stageHR_NREM","stageHR_REM",
              
              # dailyAct
              "TotalSteps",
              
              # dailyDiary
              "StartedTime","SubmittedTime","surveyDuration",
              "dailyStress","eveningWorry","eveningMood",
              "stress_school","stress_family","stress_health","stress_COVID","stress_peers","stress_other",
              "worry_school","worry_family","worry_health","worry_peer","worry_COVID","worry_sleep","worry_other",
              
              # removed variables
              # "combined","combinedLogId","nCombined", # info on combined sleep cases
              # "EBEonly","nEpochs","nFinalWake","missing_start","missing_middle", # info on EBE data 
              # "TimeInBed","MinutesAsleep","fitabaseWASO","MinutesToFallAsleep", # original Fitabase sleepLog variables
              # "nHRepochs1","nHRepochs2","nHRepochs3", # sleepHR No. of considered epochs
              # "nSleepHRepochs1","nSleepHRepochs2","nSleepHRepochs3",
              # "nHR_TST","nHR_SOL","nHR_WASO","nHR_NREM","nHR_REM",
              # "LightlyActiveMinutes","SedentaryMinutes","ModerateVigorousMinutes","TotalActivityMinutes","hourlySteps",
              
              "group")]
ema$group <- NULL

# saving dataset
save(ema,file="DATA/datasets/ema_finalClean.RData")

# showing first 3 rows
ema[1:3,]
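
Since selecting columns by name stops with an error as soon as one name is misspelled or missing, it can help to compare the requested names against the available ones before sorting, and to reload the exported file as a final check. The following is only a minimal sketch under stated assumptions: the `toy` data frame and the `keep` vector are hypothetical, and the file is written to a temporary path (via `tempfile()`) rather than to the DATA/datasets folder.

```r
# toy dataset with one column that is not meant to be exported (hypothetical)
toy <- data.frame(ID = c("s001", "s002"), TIB = c(480, 450), extra = c(1, 2))
keep <- c("ID", "TIB") # columns to be retained, in the desired order

# names requested but not available (should be empty), and names that will be dropped
missingCols <- setdiff(keep, colnames(toy))
droppedCols <- setdiff(colnames(toy), keep)
stopifnot(length(missingCols) == 0)

# sorting and selecting columns
toy <- toy[, keep]

# saving to a temporary file and reloading into a separate environment
f <- tempfile(fileext = ".RData")
save(toy, file = f)
e <- new.env()
load(f, envir = e)

# TRUE if the reloaded object matches the one in memory
sameData <- identical(e$toy, toy)
```

Loading into a separate environment avoids overwriting the object in the global environment while the two copies are compared.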

6.1. Data dictionary

INDIVIDUAL-LEVEL VARIABLES

Demographics:

ID = participants’ identification code
sex = participants’ sex (“F” = female, “M” = male)
age = participants’ age (years)
BMI = participants’ BMI (kg*m^(-2))
insomnia = participants’ group (1 = insomnia, 0 = control)
insomnia.group = participants’ insomnia group (“control” = control, “DSM.ins” = DSM insomnia, “sub.ins” = insomnia subthreshold)
majMiss = participants with extreme missing data (N = 4 with value 1)
minMiss = participants with substantial missing data (N = 6 with value 1)

DAY-LEVEL VARIABLES

ActivityDate = day of assessment (date in “mm-dd-yyyy” format)

sleepLog

LogId = sleep period identification code
StartTime = sleep period start hour (in “mm-dd-yyyy hh:mm:ss” format)
EndTime = sleep period end hour (in “mm-dd-yyyy hh:mm:ss” format)
SleepDataType = sleep data type originally scored by Fitabase
EBEDataType = type of EBE data used for manually recomputing sleep measures (i.e., updated by excluding the last wake epochs)
TIB = time in bed (min) computed as the number of minutes between StartTime (i.e., considering missing epochs at the beginning as wake epochs) and the last epoch included in EBE data (i.e., excluding the last wake epochs)
TST = total sleep time (min)
SE = sleep efficiency as the percent of TST over TIB (%)
SO = sleep onset hour (in “mm-dd-yyyy hh:mm:ss” format) corresponding to the time of the first epoch classified as sleep
WakeUp = wake-up time (in “mm-dd-yyyy hh:mm:ss” format) corresponding to the time of the last epoch classified as sleep
midSleep = mid-sleep time (in “mm-dd-yyyy hh:mm:ss” format) calculated as the halfway point between sleep onset and sleep offset
SOL = sleep onset latency (min), only for cases with wake epochs between StartTime and SO
WASO = wake after sleep onset (min)
nAwake = No. of awakenings longer than 5 minutes after SO
fragIndex = No. of sleep stage shifts (including shifts to wake) per hour

sleepStages

light = No. of minutes classified as “light” sleep
deep = No. of minutes classified as “deep” sleep
rem = No. of minutes classified as REM sleep
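
As a worked example of the definitions above, the following minimal sketch computes SE and midSleep from hypothetical values. The numbers, the UTC time zone, and the use of WakeUp as the sleep offset are assumptions made for illustration only.

```r
# hypothetical sleep period
TIB <- 480 # time in bed (min)
TST <- 432 # total sleep time (min)
SO <- as.POSIXct("2020-03-01 23:00:00", tz = "UTC")     # sleep onset
WakeUp <- as.POSIXct("2020-03-02 07:00:00", tz = "UTC") # last epoch classified as sleep

# sleep efficiency as the percent of TST over TIB
SE <- 100 * TST / TIB

# mid-sleep time as the halfway point between sleep onset and wake-up time
midSleep <- SO + as.numeric(difftime(WakeUp, SO, units = "secs")) / 2

# printing in the "mm-dd-yyyy hh:mm:ss" format used in this data dictionary
midSleepLabel <- format(midSleep, "%m-%d-%Y %H:%M:%S")
```

Here, SE is 90% and midSleepLabel is "03-02-2020 03:00:00", that is, four hours after a sleep onset at 23:00.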

sleepHR

stageHR_NREM = mean HR value (bpm) computed over NREM sleep epochs
stageHR_REM = mean HR value (bpm) computed over REM sleep epochs

dailyAct

TotalSteps = sum of the No. of steps recorded in each day (recomputed from hourly steps data)

dailyDiary

StartedTime = initiation hour of the diary form (in “mm-dd-yyyy hh:mm:ss” format)
SubmittedTime = submission hour of the diary form (in “mm-dd-yyyy hh:mm:ss” format)
surveyDuration = duration of the survey (min)
dailyStress = score on the daily stress item (1-5)
eveningWorry = score on the evening worry item (1-5)
eveningMood = score on the evening mood item (1-5)
stress_school, ..., stress_other = stressor categories (0 or 1)
worry_school, ..., worry_other = sources of worry (0 or 1)
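
For illustration, a survey duration in minutes can be derived from the two diary timestamps as sketched below. This is only a minimal example with hypothetical timestamps, and it assumes that surveyDuration corresponds to the difference between SubmittedTime and StartedTime.

```r
# hypothetical diary timestamps
StartedTime <- as.POSIXct("2020-03-01 21:05:00", tz = "UTC")
SubmittedTime <- as.POSIXct("2020-03-01 21:08:30", tz = "UTC")

# duration of the survey (min)
surveyDuration <- as.numeric(difftime(SubmittedTime, StartedTime, units = "mins"))
```

In this example, surveyDuration is 3.5 minutes.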

References

Aadland, E., Andersen, L. B., Anderssen, S. A., & Resaland, G. K. (2018). A comparison of 10 accelerometer non-wear time criteria and logbooks in children. BMC public health, 18(1), 1-9. https://doi.org/10.1186/s12889-018-5212-4
Fleming, S., Thompson, M., Stevens, R., Heneghan, C., Plüddemann, A., Maconochie, I., … & Mant, D. (2011). Normal ranges of heart rate and respiratory rate in children from birth to 18 years of age: a systematic review of observational studies. The Lancet, 377(9770), 1011-1018. https://doi.org/10.1016/S0140-6736(10)62226-X
Herrmann, S. D., Barreira, T. V., Kang, M., & Ainsworth, B. E. (2014). Impact of accelerometer wear time on physical activity data: a NHANES semisimulation data approach. British journal of sports medicine, 48(3), 278-282. https://doi.org/10.1136/bjsports-2012-091410
Herrmann, S. D., Barreira, T. V., Kang, M., & Ainsworth, B. E. (2013). How many hours are enough? Accelerometer wear time may provide bias in daily activity estimates. Journal of Physical Activity and Health, 10(5), 742-749. https://doi.org/10.1123/jpah.10.5.742
Menghini, L., Cellini, N., Goldstone, A., Baker, F. C., & de Zambotti, M. (2021a). A standardized framework for testing the performance of sleep-tracking technology: step-by-step guidelines and open-source code. Sleep, 44(2), zsaa170. https://doi.org/10.1093/sleep/zsaa170
Quante, M., Kaplan, E. R., Rueschman, M., Cailler, M., Buxton, O. M., & Redline, S. (2015). Practical considerations in using accelerometers to assess physical activity, sedentary behavior, and sleep. Sleep health, 1(4), 275-284. https://doi.org/10.1016/j.sleh.2015.09.002

R packages

Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2021. Lubridate: Make Dealing with Dates a Little Easier. https://CRAN.R-project.org/package=lubridate.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2021. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2021. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2021. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Wearable and mobile technology to characterize daily patterns of sleep, stress, pre-sleep worry and mood in adolescent insomnia - Appendix B: Data pre-processing code and output

Luca Menghini, PhD, Dilara Yuksel, PhD, Devin Prouty, PhD, Fiona C Baker PhD, Christopher King, PhD, Massimiliano de Zambotti, PhD

2022-03-20

Aims and content

1. Data reading

2. Temporal synchronization

2.1. dailyAct

2.1.1. Temporal continuity

2.1.2. Saving dataset

2.2. hourlySteps

2.2.1. Temporal continuity

2.2.2. Saving dataset

2.3. sleepLog

2.3.1. LogId & SleepDataType

2.3.2. StartHour & EndHour

2.3.3. Combining sleep

DATA PROCESSING

PLOTTING

ORIGINAL vs. COMBINED DATA

ORIGINAL TIMING

COMBINED TIMING

TIB & TST

2.3.4. Updating ActivityDate

2.3.5. Daylight Saving Time

2.3.6. Temporal continuity

2.3.7. Saving dataset

2.4. sleepEBE

2.4.1. LogId

2.4.2. Daylight Saving Time

2.4.3. Temporal continuity

2.4.3. Saving dataset

2.5. classicEBE

2.5.1. LogId

2.5.2. Summary of sleep data

a) uniqueLogId

b) uniqueEBElogs

c) uniqueClassiclogs

d) ClassicAndSleepLog

2.5.3. Daylight Saving Time

2.5.4. Temporal continuity

2.5.5. Saving dataset

2.6. HR.1min

2.6.1. Daylight Saving Time

2.6.2. Temporal continuity

2.6.3. Saving dataset

2.7. dailyDiary

2.7.1. Duplicated responses

2.7.2. CompletionStatus

2.7.3. surveyDuration

2.7.4. StartedHour

2.7.5. Updating ActivityDate

2.7.6. Daylight Saving Time

2.7.7. Temporal continuity

2.7.8. Saving dataset

3. Data recoding

3.1. dailyAct

3.1.1. Saving dataset

3.2. hourlySteps

3.2.1. Saving dataset

3.3. sleepLog

3.3.1. Saving dataset

3.4. sleepEBE

3.4.1. Saving dataset

3.5. classicEBE

3.5.1. Saving dataset

3.6. HR.1min

3.6.1. Saving dataset

3.7. dailyDiary

3.7.1. dailyStress

3.7.2. eveningWorry

3.7.3. eveningMood

3.7.4. Saving dataset

3.8. demos

3.8.1. Saving dataset

4. Data aggregation

4.1. dailyAct & hourlySteps

4.1.1. Sanity checks

4.1.2. Saving dataset

4.2. sleep & classicEBE

4.2.1. Sanity checks