glymotif is most useful when combined with other tools
in the glycoverse ecosystem. If you’re already using
glyread to import your glycoproteomics results and
glyexp to manage your experimental data, you can add
motif-level information directly to your experiment.
glymotif provides functions for adding motif information to
experiment objects.
Important note: This vignette assumes you’re
familiar with the glyexp package. If you haven’t used it
yet, start with its introduction.
Let’s work with a real-world dataset that shows how
glymotif can be used with experimental data. We will use
real_experiment from the glyexp package, a
serum N-glycoproteomics study with 12 samples. First, let’s use
glyclean::auto_clean() to preprocess the data.
library(glyclean)
#>
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#>
#> aggregate
exp <- auto_clean(real_experiment)
#>
#> ── Normalizing data ──
#>
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> ✔ Normalization completed.
#>
#> ── Removing variables with too many missing values ──
#>
#> ℹ Applying preset "discovery"...
#> ℹ Total removed: 24 (0.56%) variables.
#> ✔ Variable removal completed.
#>
#> ── Imputing missing values ──
#>
#> ℹ Imputation method: `impute_min_prob()`
#> ℹ Reason: default for "glycoproteomics" with n_samples < 30.
#> ✔ Imputation completed.
#>
#> ── Aggregating data ──
#>
#> ℹ Aggregating to "gfs" level
#> ✔ Aggregation completed.
#>
#> ── Normalizing data again ──
#>
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> ✔ Normalization completed.
#>
#> ── Correcting batch effects ──
#>
#> ℹ Batch column batch not found in sample_info. Skipping batch correction.
#> ✔ Batch correction completed.
exp
#>
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 3979 variables
#> ℹ Sample information fields: group <fct>
#> ℹ Variable information fields: protein <chr>, glycan_composition <comp>, glycan_structure <struct>, protein_site <int>, gene <chr>Now, let’s inspect the variable and sample information.
get_var_info(exp)
#> # A tibble: 3,979 × 6
#> variable protein glycan_composition glycan_structure protein_site gene
#> <chr> <chr> <comp> <struct> <int> <chr>
#> 1 P08185-176-He… P08185 Hex(5)HexNAc(4)Ne… NeuAc(??-?)Hex(… 176 SERP…
#> 2 P04196-344-He… P04196 Hex(5)HexNAc(4)Ne… NeuAc(??-?)Hex(… 344 HRG
#> 3 P04196-344-He… P04196 Hex(5)HexNAc(4) Hex(??-?)HexNAc… 344 HRG
#> 4 P04196-344-He… P04196 Hex(5)HexNAc(4)Ne… NeuAc(??-?)Hex(… 344 HRG
#> 5 P10909-291-He… P10909 Hex(6)HexNAc(5) Hex(??-?)HexNAc… 291 CLU
#> 6 P04196-344-He… P04196 Hex(5)HexNAc(4)Ne… NeuAc(??-?)Hex(… 344 HRG
#> 7 P04196-345-He… P04196 Hex(5)HexNAc(4) Hex(??-?)HexNAc… 345 HRG
#> 8 P04196-344-He… P04196 Hex(5)HexNAc(4)dH… dHex(??-?)Hex(?… 344 HRG
#> 9 P04196-344-He… P04196 Hex(4)HexNAc(3) Hex(??-?)HexNAc… 344 HRG
#> 10 P04196-344-He… P04196 Hex(4)HexNAc(4)Ne… NeuAc(??-?)Hex(… 344 HRG
#> # ℹ 3,969 more rowsget_sample_info(exp)
#> # A tibble: 12 × 2
#> sample group
#> <chr> <fct>
#> 1 C1 C
#> 2 C2 C
#> 3 C3 C
#> 4 H1 H
#> 5 H2 H
#> 6 H3 H
#> 7 M1 M
#> 8 M2 M
#> 9 M3 M
#> 10 Y1 Y
#> 11 Y2 Y
#> 12 Y3 YThis is an N-glycoproteomics dataset with 500 PSMs (Peptide Spectrum Matches) across 12 samples, which is a useful setting for motif analysis.
Tip: In real-world data analysis, you will usually
want to use glyclean to perform data preprocessing before
diving into any analysis.
The variable information tibble contains details about each glycoform - the proteins, sites, and glycan structures. We can add columns that describe whether specific motifs are present in each glycan.
The simple approach: One motif at a time
exp |>
mutate_var(n_hex = have_motif(glycan_structure, "Hex(a1-")) |>
get_var_info() |>
select(variable, protein, glycan_structure, n_hex)
#> # A tibble: 3,979 × 4
#> variable protein glycan_structure n_hex
#> <chr> <chr> <struct> <lgl>
#> 1 P08185-176-Hex(5)HexNAc(4)NeuAc(2) P08185 NeuAc(??-?)Hex(??-?)HexNA… FALSE
#> 2 P04196-344-Hex(5)HexNAc(4)NeuAc(1)-1 P04196 NeuAc(??-?)Hex(??-?)HexNA… FALSE
#> 3 P04196-344-Hex(5)HexNAc(4) P04196 Hex(??-?)HexNAc(??-?)Hex(… FALSE
#> 4 P04196-344-Hex(5)HexNAc(4)NeuAc(1)-2 P04196 NeuAc(??-?)Hex(??-?)HexNA… FALSE
#> 5 P10909-291-Hex(6)HexNAc(5)-1 P10909 Hex(??-?)HexNAc(??-?)Hex(… FALSE
#> 6 P04196-344-Hex(5)HexNAc(4)NeuAc(2) P04196 NeuAc(??-?)Hex(??-?)HexNA… FALSE
#> 7 P04196-345-Hex(5)HexNAc(4) P04196 Hex(??-?)HexNAc(??-?)Hex(… FALSE
#> 8 P04196-344-Hex(5)HexNAc(4)dHex(2) P04196 dHex(??-?)Hex(??-?)HexNAc… FALSE
#> 9 P04196-344-Hex(4)HexNAc(3)-1 P04196 Hex(??-?)HexNAc(??-?)Hex(… FALSE
#> 10 P04196-344-Hex(4)HexNAc(4)NeuAc(1) P04196 NeuAc(??-?)Hex(??-?)HexNA… FALSE
#> # ℹ 3,969 more rowsA repetitive approach: Multiple motifs one call at a time
# Don't do this
exp |>
mutate_var(
n_hex = have_motif(glycan_structure, "Hex(a1-"),
n_hexna = have_motif(glycan_structure, "HexNAc(a1-"),
n_dhex = have_motif(glycan_structure, "dHex(a1-")
)While this approach works, it’s not very efficient. Those three
separate calls to have_motif() all perform time-consuming
validations and conversions on the same set of glycan structures. Plus,
there’s a lot of repetitive typing. You have to type
have_motif and glycan_structure three
times.
The preferred approach: Use the
add_motifs_lgl() or add_motifs_int() functions
instead. They might look like simple syntactic sugar, but they are
optimized for this scenario:
The motif annotations are now integrated into your variable information tibble.
exp2 |>
get_var_info() |>
select(variable, motif1, motif2, motif3)
#> # A tibble: 3,979 × 4
#> variable motif1 motif2 motif3
#> <chr> <lgl> <lgl> <lgl>
#> 1 P08185-176-Hex(5)HexNAc(4)NeuAc(2) TRUE TRUE FALSE
#> 2 P04196-344-Hex(5)HexNAc(4)NeuAc(1)-1 TRUE TRUE FALSE
#> 3 P04196-344-Hex(5)HexNAc(4) TRUE TRUE FALSE
#> 4 P04196-344-Hex(5)HexNAc(4)NeuAc(1)-2 TRUE TRUE FALSE
#> 5 P10909-291-Hex(6)HexNAc(5)-1 TRUE TRUE FALSE
#> 6 P04196-344-Hex(5)HexNAc(4)NeuAc(2) TRUE TRUE FALSE
#> 7 P04196-345-Hex(5)HexNAc(4) TRUE TRUE FALSE
#> 8 P04196-344-Hex(5)HexNAc(4)dHex(2) TRUE TRUE TRUE
#> 9 P04196-344-Hex(4)HexNAc(3)-1 TRUE TRUE FALSE
#> 10 P04196-344-Hex(4)HexNAc(4)NeuAc(1) TRUE TRUE FALSE
#> # ℹ 3,969 more rowsYou can use these new columns for downstream filtering and analysis. For example:
Want to quantify the motifs in your experiment? Try the
glydet package. It provides the
quantify_motifs() function to perform relative and absolute
motif quantification.