Laryngeal analysis (gam)

Published

June 20, 2025

1 Notes

  • The analysis goal is to extract GAM measures and correlate GAM-based measures with behavioral data.
  • The model to fit each individual’s data is: bam(uV ~ s(time, condition, bs = ‘fs’, m = 1) + s(time, stim, bs = ‘fs’, m = 1), data = df, discrete = TRUE)
    • condition is the interaction term of poa (dorsal vs. glottal), direction (high_to_low vs. low_to_high) x stim_role (standard vs. deviant)

2 Libraries

library(tidyverse)
library(mgcv)
library(itsadug)

3 Load data and cleaning

# F0 data
raw <- read_csv("~/OneDrive - University of Toronto/Projects/Laryngeal/data/gam_F0.csv")
# data cleaning and organization
df_f0 <- raw %>%
  pivot_longer(cols = 4:ncol(raw), names_to = "time", values_to = "uV") %>%
  separate(col = condition, into = c("poa", "blc_stan", "blc_devi", "stim_role", "item")) %>%
  unite(col = "block", c("blc_stan", "blc_devi"), sep = "_") %>%
  mutate(group = as.factor(group),
         participant = as.factor(participant),
         poa = as.factor(poa),
         block = as.factor(block),
         stim_role = as.factor(stim_role),
         item = as.factor(item),
         time = as.numeric(time)) %>%
  mutate(
    immn_direction = as.factor(case_when(
      block=="highStan_lowDevi" & stim_role=="devi" ~ "high_to_low",
      block=="lowStan_highDevi" & stim_role=="stan" ~ "high_to_low",
      block=="lowStan_highDevi" & stim_role=="devi" ~ "low_to_high",
      block=="highStan_lowDevi" & stim_role=="stan" ~ "low_to_high",
    ))
  ) %>%
  unite(col = "condition", c("poa", "immn_direction", "stim_role"), sep = "/", remove = FALSE) %>%
  mutate(condition = as.factor(condition)) %>%
  droplevels()


# VOT data
raw <- read_csv("~/OneDrive - University of Toronto/Projects/Laryngeal/data/gam_vot.csv")
# data cleaning and organization
df_vot <- raw %>%
  pivot_longer(cols = 4:ncol(raw), names_to = "time", values_to = "uV") %>%
  separate(col = condition, into = c("poa", "blc_stan", "blc_devi", "stim_role", "item")) %>%
  unite(col = "block", c("blc_stan", "blc_devi"), sep = "_") %>%
  mutate(group = as.factor(group),
         participant = as.factor(participant),
         poa = as.factor(poa),
         block = as.factor(block),
         stim_role = as.factor(stim_role),
         item = as.factor(item),
         time = as.numeric(time)) %>%
  mutate(
    immn_direction = as.factor(case_when(
      block=="highStan_lowDevi" & stim_role=="devi" ~ "long_to_short",
      block=="lowStan_highDevi" & stim_role=="stan" ~ "long_to_short",
      block=="lowStan_highDevi" & stim_role=="devi" ~ "short_to_long",
      block=="highStan_lowDevi" & stim_role=="stan" ~ "short_to_long",
    ))
  ) %>%
  unite(col = "condition", c("poa", "immn_direction", "stim_role"), sep = "/", remove = FALSE) %>%
  mutate(condition = as.factor(condition)) %>%
  droplevels()

The dataset consists of the following columns:
- group = participant group (vot vs. f0)
- participant = participant number
- time = time in milliseconds (ranges from -200 to 800, in 4 ms bins)
- poa = place of articulation (dorsal vs. glottal)
- stim_role = stimulus role (standard vs. deviant)
- uV = ERP amplitude in microvoltages
- immn_direction = direction for computing iMMN

4 GAM modeling and GAM-based individual measures

We extract the following GAM-based individual measures:

  • trad_erp: average amplitude of observed data in specified time window
  • model_area: Modelled Area = geometric area (amplitude * time) under the GAM curve. This measures the area under the peak (or maximum if there is no peak); only looking for positive area’s (or negative areas)
  • peak_height: Height Modelled Peak = height of the peak of the GAM smooth, or the highest point if no peak
  • NMP: Normalized Modelled Peak = a measure of robustness of the peak in units of SDs. I.e., how reliably does this subject show the peak? If value is above 1, then the 95% confidence bands do not overlap, and we can be certain the peak is there. If value is between 0 and 1, then there is a lot of variation between the items.
  • half_area_latency: Modelled Area Median Latency = fractional area latency, i.e. latency at 50% of the area (midpoint)
  • model_peak_time: Modelled Peak Latency = latency of the modelled peak

Additionally we compute the following measures, for extra information:

  • trad_norm_erp: normalized traditional average
  • hasPeak: TRUE/FALSE; is there a peak in the modeled signal in the search window?
  • gam_erp: average of the GAM smooth in the time window

For all these measures holds that we look at a specified time window or search window. The traditional measure requires a narrower time window (e.g. 150-300 ms post stimulus onset), the GAM measures require a search window which can be wider (here we use 0 to 800 ms post stimulus onset). We only look for negative peaks.

The next section of code runs the GAMs for one single participant, extracts the measures and creates the plots (in a separate pdf document).

The model to fit each individual’s data is:

bam(uV ~ s(time, condition, bs = ‘fs’, m = 1) + s(time, stim, bs = ‘fs’, m = 1), discrete = TRUE)

# define search window 
search_min = 0; search_max = 800;
# define classic erp window (this can be from permutation test)
trad_min <- 150; trad_max <- 300

# loop over groups
for (group in c("f0", "vot")) {
  
  # get data
  df <- get(paste0("df", "_", group))
  # define poa levels
  poas <- levels(df$poa)
  # define direction levels
  directions <- levels(df$immn_direction)
  
  # initialize the full dataframe
  df_gam <- data.frame()
  
  # get participant list
  participants <- levels(df$participant)
  
  # loop over participants for modeling
  for (participant in participants) {
    
    # get single participant data
    df_tmp <- df %>%
      filter(participant == participant) %>%
      droplevels()
    
    # modeling
    model <- bam(uV ~ 
                   # poa * stim_role +        # fixed effects
                   s(time, by = condition) +  # smooth for each poa x direction x stim_role
                   s(time, item, bs = "fs", m = 1), # random smooth by item
                 data = df_tmp, 
                 discrete = TRUE)  # for large data for speed
    
    # summary(model)
    
    # %%%%% extract peak height, peak time, and NMP %%%%%%
    
    # get values and SE for every individual time point
    min_time <- min(model$model[, "time"])
    max_time <- max(model$model[, "time"])
    nval = length(seq(min_time, max_time))
    
    # initialize stim index
    stim_ind <- 1
    
    # loop over POAs
    for (poa_ind in 1:length(poas)) {
      # loop over directions
      for (direction_ind in 1:length(directions)) {
        
        # get poa and direction labels
        poa <- poas[poa_ind]
        direction <- directions[direction_ind]
        
        # extract modeled standard and deviant data
        devi <- itsadug::get_modelterm(model, select=stim_ind, n.grid = nval, as.data.frame = TRUE)
        stim_ind <- stim_ind+1
        stan <- itsadug::get_modelterm(model, select=stim_ind, n.grid = nval, as.data.frame = TRUE)
        stim_ind <- stim_ind+1
        
        # get difference for MMN
        dat <- devi %>%
          mutate(condition = "mmn",
                 fit = devi$fit - stan$fit,
                 se.fit = sqrt(devi$se.fit^2 + stan$se.fit^2))
        
        # subset search data
        sdat <- dat[dat$time>=search_min & dat$time<=search_max, ]
        
        # get derivative and search for peak (derivative=0) of a negativity (previous derivative value < 0, which means the actual ERP waveform is decreasing before this point)
        drv <- data.frame(diff(dat$fit)/diff(dat$time))  # derivative
        colnames(drv) <- 'dYdX'
        drv$time <- rowMeans(embed(dat$time,2)) # center the X values for plotting
        drv$dYdX.next <- c(drv$dYdX[2:nrow(drv)],NA)
        drv$dYdX.prev <- c(NA,drv$dYdX[1:(nrow(drv)-1)])
        
        # MMN peak: going down (<0) then going up (>0)
        drv$local_peak <- ((drv$dYdX < 0 & drv$dYdX.next > 0) | (drv$dYdX.next > 0 & drv$dYdX == 0 & drv$dYdX.prev < 0))
        
        # if at least one local peak in the search time window
        if (sum(drv[drv$time>=search_min & drv$time<=search_max, ]$local_peak, na.rm=TRUE) >= 1) {
          hasPeak = TRUE
          # get all peak times
          all_peak_times <- drv[which(drv$local_peak & drv$time>=search_min & drv$time<=search_max), "time"]
          # initialize peak height with some larger value
          peak_height <- Inf
          # loop over local peak times
          for (peak_ind in 1:length(all_peak_times)) {
            # get the two fitted data points centering the local peak
            peakdat = dat[dat$time >= floor(all_peak_times[peak_ind]) & dat$time <= ceiling(all_peak_times[peak_ind]), ]
            # if the current height is smaller than the original peak height
            if ( min(peakdat$fit) < peak_height) {
              # update peak height
              peak_height <- min(peakdat$fit)
              # update peak time
              peak_time <- all_peak_times[peak_ind]
              # update se
              peak_se <- peakdat[which.min(peakdat$fit),]$se.fit
              # update NMP (original code doesn't have 1.96 factor)
              NMP <- peak_height / (1.96*peak_se) # relative peak measure (if < 1 then 95%CI overlaps with 0 at point of peak)
            }
          }
        } else { # if no local peak
          hasPeak <- FALSE
          # get general peak in search span
          subdat <- dat[dat$time>=search_min & dat$time<=search_max, ] # subset data
          # find peak
          peak_height <- min(subdat$fit)
          peak_index <- which.min(subdat$fit)
          # get time
          peak_time <- subdat[peak_index, "time"] # first time value with peak value
          peak_se <- subdat[peak_index, ]$se.fit
          NMP <- peak_height / (1.96*peak_se)
        }
        
        # if we are looking for a valley but the value is positive, then no correct positivity/negativity
        if (peak_height >= 0) {
          peak_height <- NA
          peak_time <- NA
          peak_se <- NA
          NMP <- NA
        }
        
        # if peak time is the first point, there is so no real minimum
        if (peak_time == min(sdat$time)) {
          peak_height <- NA
          peak_time <- NA
          peak_se <- NA
          NMP <- NA
        }
        
        # get average of the GAM smooth in the tradition erp time window
        gam_erp = mean(dat[dat$time>=trad_min & dat$time<=trad_max, ]$fit)
        
        # get area and fractional are latency
        if (is.na(peak_time)) {
          area <- NA
          half_area_latency <- NA
        } else {
          # initialize are
          area <- 0
          start = round(peak_time) # start time to take integral from
          # firsttime = search_min
          # lasttime = search_max
          # area to the right from the peak
          for (i in start:search_max) {
            val = sdat[sdat$time == i,]$fit
            
            # if derivative <=0
            if (val <= 0) {
              area = area + abs(val)
            } else { # end of peak, so stop going in this direction
              break
            }
            lasttime = i
          }
          # area to the left from the peak
          beforestart = start-1
          if (beforestart >= search_min) {
            for (j in beforestart:search_min) { # to the left from the peak
              val = sdat[sdat$time == j,]$fit
              # if derivative <=0
              if (val <= 0) {
                area = area + abs(val)
              } else { # end of peak, so stop going in this direction
                break
              }
              firsttime = j
            }
          }
          
          # get half area latency
          halfarea <- 0
          for (k in firsttime:lasttime) {
            val = sdat[sdat$time == k, ]$fit
            halfarea = halfarea + abs(val)
            if (halfarea >= 0.5 * area) {
              half_area_latency = k
              break
            }
          }
        } # get area and fractional are latency end
        
        # get measures for the current condition
        tmp_row <- data.frame(participant, poa, direction, hasPeak, area, peak_height, peak_se, NMP, peak_time, half_area_latency, gam_erp)
        # add to the extracted data
        df_gam <- rbind(df_gam, tmp_row)
        
        #%%%%% plot for each condition start %%%%%
        # get data for plotting
        dat$fit.plus.se = dat$fit + dat$se.fit
        dat$fit.minus.se = dat$fit - dat$se.fit
        
        # raw ERP data for plotting
        erp_df <- model$model %>%
          group_by(condition, time) %>%
          summarize(mean_uV = mean(uV)) %>%
          pivot_wider(names_from = condition, values_from = mean_uV) %>%
          mutate(
            uV_diff = .data[[paste0(poas[poa_ind], "/", directions[direction_ind], "/", "devi")]] - .data[[paste0(poas[poa_ind], "/", directions[direction_ind], "/", "stan")]]
          )
        
        # df for shades
        segments_df <- sdat %>%
          filter(time>=firsttime & time<=lasttime) %>%
          mutate(x = time, xend = time,
                 y = 0, yend = fit)
        
        # plotting
        fig <- ggplot(dat, aes(x = time, y = fit)) +
          # ribbon
          geom_ribbon(aes(ymin = fit - se.fit, ymax = fit + se.fit), fill = "skyblue", alpha = 0.3) +
          # gam modelled mmn
          geom_line(color = "blue", linewidth = 1) +
          # shade the search window
          annotate("rect", xmin = search_min, xmax = search_max, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.1) +
          # add horizontal and vertical zero lines
          geom_vline(xintercept = 0, linetype = "solid", alpha = 0.2) +
          geom_hline(yintercept = 0, linetype = "solid", alpha = 0.2) +
          # mark peak time
          geom_vline(xintercept=peak_time, linetype = "dashed", linewidth=1) +
          # mark fractional area latency
          geom_vline(xintercept=half_area_latency, linetype = "dotted", linewidth=1) +
          # add derivative (scaled for visualization)
          geom_line(data = drv, aes(x = time, y = dYdX*100), color = 'red', linetype = "dashed") +
          # add shades to the modeled are
          geom_segment(data = segments_df, aes(x=x, xend=xend, y=y, yend=yend), color = "blue", alpha = 0.1) +
          # raw difference erp
          geom_line(data = erp_df, aes(time, uV_diff), linetype = "dotted", color="black", linewidth=0.5) +
          theme_bw() +
          labs(title = paste0(participant, "_", poa, "_", direction, "_identityMMN"),
               subtitle = paste0("NMP = ",round(NMP,digits=2)," (uV: ",round(peak_height,digits=2),", SE: ", round(peak_se,digits=2), ")"),
               x = "Time (ms)", y = "uV")
        # print(fig)
        ggsave(plot = fig, width = 8, height = 5, units = "in", dpi = 300, filename = paste0("~/OneDrive - University of Toronto/Projects/Laryngeal/figures/gam/GAM_iMMN_", participant, "_", poa, "_", direction, ".png"))
        #### plot for each condition end ####
        
      } # loop over directions end
    } # loop over POAs end
  } # loop over participants end
} # loop over groups end

5 Brain-behavioral correlation