Overview

This notebook generates the data for Supplementary Tables S1 (Inv4m delimitation and breakpoints) and S2 (breakpoint knob repeats) from:

  1. Manually curated coordinates — breakpoint boundaries and bounding genes determined by inspecting AnchorWave CDS-anchored whole-genome alignments and identifying the genes flanking the strand-transition zones.
  2. BLAST megablast results — knob180 (M32524.1) and TR-1 (AY083942.1) repeat hits against chromosome 4, generated by scripts/02_genomics_foundation/run_repeat_annotation_4genomes.sh.

Genomes

Label Assembly Chr4 name
B73 Zm-B73-REFERENCE-NAM-5.0 chr4
PT Zm-PT-REFERENCE-HiLo-1.0 PT04
TIL18 Zx-TIL18-REFERENCE-PanAnd-1.0 chr4
Mi21 NIL Mi21 NIL HiFi (hifiasm + RagTag) scf00000010_RagTag

Curation method

Breakpoint coordinates were defined by the AnchorWave alignment endpoints where reference-orientation synteny was lost. Bounding genes were identified by inspecting gene annotations (B73 NAM5 GFF3 for B73, PT HiLo-1.0 GFF3 for PT, TIL18 PanAnd-1.0 GFF3 for TIL18, and Liftoff B73-to-Mi21 GFF3 for Mi21 NIL) flanking the breakpoint segments.

Table S1: Inv4m delimitation and breakpoints

Inv4m limits

Curated inversion boundaries — the first and last genes within the inverted region, as identified by AnchorWave strand analysis.

inv4m_limits <- tribble(
  ~genome,     ~start,      ~end,        ~start_gene,           ~end_gene,
  "B73",       172883881,   188131462,   "Zm00001eb190470",     "Zm00001eb194800",
  "PT",        186925484,   173486186,   "Zm00109aa017629",     "Zm00109aa018009",
  "TIL18",     180365316,   193570652,   "Zx00002aa015554",     "Zx00002aa015905",
  "Mi21",      156752807,   167620500,   "Zm00001eb194800",     "Zm00001eb190470"
) %>%
  mutate(length = abs(end - start))

kable(inv4m_limits, format.args = list(big.mark = ","))
genome start end start_gene end_gene length
B73 172,883,881 188,131,462 Zm00001eb190470 Zm00001eb194800 15,247,581
PT 186,925,484 173,486,186 Zm00109aa017629 Zm00109aa018009 13,439,298
TIL18 180,365,316 193,570,652 Zx00002aa015554 Zx00002aa015905 13,205,336
Mi21 156,752,807 167,620,500 Zm00001eb194800 Zm00001eb190470 10,867,693

Breakpoint segments

Gene-free segments flanking the inversion, defined as the gap between the last gene on the collinear side and the first gene on the inverted side.

“Upstream” and “Downstream” are defined in B73-sense orientation: upstream is the breakpoint at lower B73 coordinates (~172.9 Mb), downstream at higher (~188.1 Mb). For highland genomes carrying the inversion, the physical positions are swapped but the labels follow the B73 convention.

breakpoint_segments <- tribble(
  ~genome,     ~up_start,   ~up_end,     ~down_start, ~down_end,
  # Upstream/downstream follow B73-sense orientation (~172.9 Mb / ~188.1 Mb).
  # For highland genomes the inversion swaps the physical positions, but
  # the labels are defined by which B73 breakpoint each segment corresponds to.
  "B73",       172561330,   172883881,   188131461,   188220418,
  "PT",        173369064,   173486186,   186925483,   187092654,
  "TIL18",     180269950,   180365316,   193570651,   193734606,
  # Mi21: B73-sense upstream breakpoint maps to Mi21 ~167.6 Mb (higher coord),
  #        B73-sense downstream maps to Mi21 ~156.6 Mb (lower coord).
  "Mi21",      167621241,   167838764,   156560061,   156750619
) %>%
  mutate(
    up_span   = up_end - up_start,
    down_span = down_end - down_start
  )

kable(breakpoint_segments, format.args = list(big.mark = ","))
genome up_start up_end down_start down_end up_span down_span
B73 172,561,330 172,883,881 188,131,461 188,220,418 322,551 88,957
PT 173,369,064 173,486,186 186,925,483 187,092,654 117,122 167,171
TIL18 180,269,950 180,365,316 193,570,651 193,734,606 95,366 163,955
Mi21 167,621,241 167,838,764 156,560,061 156,750,619 217,523 190,558

Bounding genes

Genes flanking each breakpoint segment (outside → inside the inversion).

bounding_genes <- tribble(
  ~genome,     ~up_outer,             ~up_inner,             ~down_inner,           ~down_outer,
  "B73",       "Zm00001eb190450",     "Zm00001eb190470",     "Zm00109aa017627",     "Zm00001eb194820",
  "PT",        "Zm00109aa017627",     "Zm00109aa017629",     "Zx00002aa015905",     "Zm00109aa017629",
  "TIL18",     "Zx00002aa015905",     "Zx00002aa015906",     "Zx00002aa015906",     "Zx00002aa015554",
  "Mi21",      "Zm00001eb190450",     "Zm00001eb194800",     "Zm00001eb190470",     "Zm00001eb194820"
)

kable(bounding_genes)
genome up_outer up_inner down_inner down_outer
B73 Zm00001eb190450 Zm00001eb190470 Zm00109aa017627 Zm00001eb194820
PT Zm00109aa017627 Zm00109aa017629 Zx00002aa015905 Zm00109aa017629
TIL18 Zx00002aa015905 Zx00002aa015906 Zx00002aa015906 Zx00002aa015554
Mi21 Zm00001eb190450 Zm00001eb194800 Zm00001eb190470 Zm00001eb194820

Export Table S1

apply_label <- function(df) {
  df %>% mutate(genome = genome_labels[genome])
}

s1_limits <- inv4m_limits %>%
  select(genome, start, end, length, start_gene, end_gene) %>%
  apply_label()

s1_segments <- breakpoint_segments %>%
  select(genome, up_start, up_end, up_span, down_start, down_end, down_span) %>%
  apply_label()

s1_genes <- bounding_genes %>% apply_label()

write_csv(s1_limits,   file.path(paths$intermediate, "table_S1_inv4m_limits.csv"))
write_csv(s1_segments, file.path(paths$intermediate, "table_S1_breakpoint_segments.csv"))
write_csv(s1_genes,    file.path(paths$intermediate, "table_S1_bounding_genes.csv"))

Table S2: Breakpoint knob repeats

Load BLAST results

BLAST+ megablast (outfmt 6) results from run_repeat_annotation_4genomes.sh. Columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.

blast_cols <- c("qseqid", "sseqid", "pident", "length", "mismatch",
                "gapopen", "qstart", "qend", "sstart", "send",
                "evalue", "bitscore")

genomes <- c("B73", "PT", "TIL18", "Mi21")

read_blast <- function(repeat_type, genome) {
  f <- file.path(paths$intermediate, paste0(repeat_type, "_", genome, ".blast"))
  read_tsv(f, col_names = blast_cols, col_types = cols(.default = "c")) %>%
    mutate(
      across(c(length, sstart, send, bitscore), as.numeric),
      genome = genome,
      repeat_type = repeat_type,
      # Normalize subject coordinates so smin < smax
      smin = pmin(sstart, send),
      smax = pmax(sstart, send)
    )
}

all_blast <- bind_rows(
  lapply(genomes, function(g) read_blast("knob180", g)),
  lapply(genomes, function(g) read_blast("TR-1", g))
)

cat(sprintf("Loaded %s BLAST hits across %d genomes\n",
            format(nrow(all_blast), big.mark = ","), length(genomes)))
## Loaded 210,358 BLAST hits across 4 genomes

Count repeats in breakpoint segments

# Breakpoint segment coordinates per genome (B73-sense upstream/downstream)
segments <- breakpoint_segments %>%
  select(genome, up_start, up_end, down_start, down_end)

# Internal knob region: between inv4m boundaries
internal_regions <- inv4m_limits %>%
  mutate(
    int_start = pmin(start, end),
    int_end   = pmax(start, end)
  ) %>%
  select(genome, int_start, int_end)

count_in_region <- function(blast_df, genome_id, region_start, region_end) {
  blast_df %>%
    filter(
      genome == genome_id,
      repeat_type == "knob180",
      (smin >= region_start & smin <= region_end) |
      (smax >= region_start & smax <= region_end)
    ) %>%
    summarise(
      count       = n(),
      match_len   = sum(length),
      hit_start   = ifelse(n() > 0, min(smin), NA_real_),
      hit_end     = ifelse(n() > 0, max(smax), NA_real_),
      span        = ifelse(n() > 0, max(smax) - min(smin), NA_real_),
      .groups     = "drop"
    )
}

# Upstream breakpoint repeats
up_repeats <- lapply(seq_len(nrow(segments)), function(i) {
  s <- segments[i, ]
  count_in_region(all_blast, s$genome, s$up_start, s$up_end) %>%
    mutate(genome = s$genome, section = "upstream",
           seg_span = s$up_end - s$up_start)
}) %>% bind_rows()

# Downstream breakpoint repeats
down_repeats <- lapply(seq_len(nrow(segments)), function(i) {
  s <- segments[i, ]
  count_in_region(all_blast, s$genome, s$down_start, s$down_end) %>%
    mutate(genome = s$genome, section = "downstream",
           seg_span = s$down_end - s$down_start)
}) %>% bind_rows()

# Internal repeats
int_repeats <- lapply(seq_len(nrow(internal_regions)), function(i) {
  r <- internal_regions[i, ]
  count_in_region(all_blast, r$genome, r$int_start, r$int_end) %>%
    mutate(genome = r$genome, section = "internal",
           seg_span = r$int_end - r$int_start)
}) %>% bind_rows()

knob_repeats <- bind_rows(up_repeats, down_repeats, int_repeats) %>%
  mutate(
    pct_knob_span = ifelse(!is.na(span), round(span / seg_span * 100), NA),
    pct_knob      = ifelse(!is.na(span) & span > 0,
                           round(match_len / span * 100), NA),
    genome_label  = genome_labels[genome]
  ) %>%
  select(genome = genome_label, section, count, match_len,
         hit_start, hit_end, span, pct_knob_span, pct_knob)

Results

Upstream breakpoint segment

knob_repeats %>% filter(section == "upstream") %>%
  kable(format.args = list(big.mark = ","))
genome section count match_len hit_start hit_end span pct_knob_span pct_knob
B73 upstream 353 60,552 172,561,781 172,882,488 320,707 99 19
PT upstream 45 7,938 173,370,041 173,392,244 22,203 19 36
TIL18 upstream 45 7,938 180,270,927 180,293,129 22,202 23 36
Mi21 NIL upstream 210 36,316 167,622,041 167,760,234 138,193 64 26

Internal knob repeat

knob_repeats %>% filter(section == "internal") %>%
  select(-pct_knob_span, -pct_knob) %>%
  kable(format.args = list(big.mark = ","))
genome section count match_len hit_start hit_end span
B73 internal 651 113,511 184,152,381 184,525,209 372,828
PT internal 355 61,995 177,250,062 177,516,479 266,417
TIL18 internal 407 70,957 183,816,884 184,072,480 255,596
Mi21 NIL internal 330 57,545 160,494,941 160,755,992 261,051

Downstream breakpoint segment

knob_repeats %>% filter(section == "downstream") %>%
  kable(format.args = list(big.mark = ","))
genome section count match_len hit_start hit_end span pct_knob_span pct_knob
B73 downstream 27 4,624 188,148,945 188,220,088 71,143 80 6
PT downstream 131 21,945 186,926,848 187,051,496 124,648 75 18
TIL18 downstream 155 26,350 193,572,019 193,671,226 99,207 61 27
Mi21 NIL downstream 57 9,865 156,560,208 156,615,508 55,300 29 18

Export Table S2

write_csv(knob_repeats, file.path(paths$intermediate, "table_S2_knob_repeats.csv"))

Validation

Compare computed values against the hardcoded values in main.tex.

# Spot-check B73 upstream: expect count=352, match_len=59580
b73_up <- knob_repeats %>%
  filter(genome == "B73", section == "upstream")
cat(sprintf("B73 upstream: count=%d (expect 352), match_len=%d (expect 59580)\n",
            b73_up$count, b73_up$match_len))
## B73 upstream: count=353 (expect 352), match_len=60552 (expect 59580)
# Spot-check Mi21 downstream: expect count=57, match_len=9865
mi21_down <- knob_repeats %>%
  filter(genome == "Mi21 NIL", section == "downstream")
cat(sprintf("Mi21 downstream: count=%d (expect 57), match_len=%d (expect 9865)\n",
            mi21_down$count, mi21_down$match_len))
## Mi21 downstream: count=57 (expect 57), match_len=9865 (expect 9865)

Session info

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3.1
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.51  readr_2.1.6 tidyr_1.3.2 dplyr_1.1.4 here_1.0.2 
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.6.0        jsonlite_2.0.0   compiler_4.5.2   crayon_1.5.3    
##  [5] tidyselect_1.2.1 parallel_4.5.2   jquerylib_0.1.4  yaml_2.3.12     
##  [9] fastmap_1.2.0    R6_2.6.1         generics_0.1.4   tibble_3.3.0    
## [13] rprojroot_2.1.1  bslib_0.9.0      pillar_1.11.1    tzdb_0.5.0      
## [17] rlang_1.1.6      cachem_1.1.0     xfun_0.55        sass_0.4.10     
## [21] bit64_4.6.0-1    otel_0.2.0       cli_3.6.5        withr_3.0.2     
## [25] magrittr_2.0.4   digest_0.6.39    vroom_1.6.7      hms_1.1.4       
## [29] lifecycle_1.0.4  vctrs_0.6.5      evaluate_1.0.5   glue_1.8.0      
## [33] rmarkdown_2.30   purrr_1.2.0      tools_4.5.2      pkgconfig_2.0.3 
## [37] htmltools_0.5.9