When we talk about omics data
analysis, we’re usually diving into exciting territories like
differential expression analysis, PCA, survival analysis, and much more!
🧬 glystats brings all these powerful analyses together
under one unified, user-friendly interface.
🎯 Key Feature: glystats offers two
levels of interfaces to fit your workflow:
gly_xxx(): seamlessly works with
[glyexp::experiment()], the beating heart of the glycoverse
ecosystem 💓gly_xxx_(): flexible enough to work with general inputs
like matrices or data framesThis vignette focuses on the gly_xxx() interface. New to
[glyexp::experiment()]? No worries! Check out its introduction
first to get up to speed.
library(glystats)
library(glyexp)
library(glyread)
library(glyclean)
#>
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#>
#> aggregate
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, unionLet’s start by exploring our demo dataset:
# Here we use `glyread` to read in the data and use `glyclean` to perform preprocessing
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv") |> auto_clean()
#> ℹ Reading data
#> ℹ Finding leader proteins
#> ✔ Finding leader proteins [181ms]
#>
#> ℹ Reading dataColumn group converted to <factor>.ℹ Parsing glycan compositions and structures
#> Column group converted to <factor>.✔ Parsing glycan compositions and structures [283ms]
#>
#> ℹ Reading data✔ Reading data [869ms]
#>
#>
#> ── Normalizing data ──
#>
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> ✔ Normalization completed.
#>
#> ── Removing variables with too many missing values ──
#>
#> ℹ Applying preset "discovery"...
#> ℹ Total removed: 2 (0.67%) variables.
#> ✔ Variable removal completed.
#>
#> ── Imputing missing values ──
#>
#> ℹ Imputation method: `impute_min_prob()`
#> ℹ Reason: default for "glycoproteomics" with n_samples < 30.
#> ✔ Imputation completed.
#>
#> ── Aggregating data ──
#>
#> ℹ Aggregating to "gf" level
#> ✔ Aggregation completed.
#>
#> ── Normalizing data again ──
#>
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> ✔ Normalization completed.
#>
#> ── Correcting batch effects ──
#>
#> ℹ Batch column batch not found in sample_info. Skipping batch correction.
#> ✔ Batch correction completed.
exp
#>
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 225 variables
#> ℹ Sample information fields: group <fct>
#> ℹ Variable information fields: protein <chr>, glycan_composition <comp>, protein_site <int>, gene <chr>Look at that! 🎉 We’ve got a [glyexp::experiment()] packed with 12 samples and 263 glycoforms. That’s plenty of data to work with!
get_var_info(exp)
#> # A tibble: 225 × 5
#> variable protein glycan_composition protein_site gene
#> <glue> <chr> <comp> <int> <chr>
#> 1 P08185-176-Hex(5)HexNAc(4)NeuA… P08185 Hex(5)HexNAc(4)Ne… 176 SERP…
#> 2 P04196-344-Hex(5)HexNAc(4)NeuA… P04196 Hex(5)HexNAc(4)Ne… 344 HRG
#> 3 P04196-344-Hex(5)HexNAc(4) P04196 Hex(5)HexNAc(4) 344 HRG
#> 4 P10909-291-Hex(6)HexNAc(5) P10909 Hex(6)HexNAc(5) 291 CLU
#> 5 P04196-344-Hex(5)HexNAc(4)NeuA… P04196 Hex(5)HexNAc(4)Ne… 344 HRG
#> 6 P04196-345-Hex(5)HexNAc(4) P04196 Hex(5)HexNAc(4) 345 HRG
#> 7 P04196-344-Hex(5)HexNAc(4)dHex… P04196 Hex(5)HexNAc(4)dH… 344 HRG
#> 8 P04196-344-Hex(4)HexNAc(3) P04196 Hex(4)HexNAc(3) 344 HRG
#> 9 P04196-344-Hex(4)HexNAc(4)NeuA… P04196 Hex(4)HexNAc(4)Ne… 344 HRG
#> 10 P10909-291-Hex(5)HexNAc(4) P10909 Hex(5)HexNAc(4) 291 CLU
#> # ℹ 215 more rowsThe variable information tibble is like a detailed ID card for each glycoform 🆔 - it contains everything you need to know: the protein, glycosylation site, and glycan structures.
get_sample_info(exp)
#> # A tibble: 12 × 2
#> sample group
#> <chr> <fct>
#> 1 20241224-LXJ-Nglyco-H_1 H
#> 2 20241224-LXJ-Nglyco-H_2 H
#> 3 20241224-LXJ-Nglyco-H_3 H
#> 4 20241224-LXJ-Nglyco-M_1 M
#> 5 20241224-LXJ-Nglyco-M_2 M
#> 6 20241224-LXJ-Nglyco-M_3 M
#> 7 20241224-LXJ-Nglyco-Y_1 Y
#> 8 20241224-LXJ-Nglyco-Y_2 Y
#> 9 20241224-LXJ-Nglyco-Y_3 Y
#> 10 20241224-LXJ-Nglyco-C_1 C
#> 11 20241224-LXJ-Nglyco-C_2 C
#> 12 20241224-LXJ-Nglyco-C_3 COur sample information tibble features a crucial “group” column 🏷️. Here’s the key: “H” = healthy, “M” = hepatitis, “Y” = cirrhosis, and “C” = hepatocellular carcinoma. This gives us a perfect setup for comparative analysis!
Here’s where glystats really shines! ✨ Every function
follows a simple, intuitive naming pattern: gly_ + analysis
name. Think gly_ttest() for t-tests, gly_pca()
for PCA, and so on.
Pro tip: leverage RStudio’s auto-completion to discover all available functions! 💡
Let’s dive into action with an ANOVA analysis to identify differentially expressed glycoforms:
anova_res <- gly_anova(exp)
#> ℹ Number of groups: 4
#> ℹ Groups: "C", "H", "M", and "Y"
#> ℹ Pairwise comparisons will be performed, with levels coming first as reference groups.Boom! 💥 Analysis complete in just one line! gly_anova()
intelligently follows the glycoverse naming conventions,
automatically detecting the “group” column in your sample info and
fitting an ANOVA model for each glycoform.
All glystats functions return a consistent,
well-structured list with two key components:
tidy_result: Clean, analysis-ready tibbles in tidy
format 📊raw_result: The original result objects from underlying
statistical functions 🔧For gly_anova(), the tidy_result contains
two informative tibbles: main_test and
post_hoc_test.
You can use get_tidy_result() to get the tidy result
tibble:
get_tidy_result(anova_res, "main_test")
#> # A tibble: 225 × 14
#> variable protein glycan_composition protein_site gene term df sumsq
#> <glue> <chr> <comp> <int> <chr> <chr> <dbl> <dbl>
#> 1 P08185-176-… P08185 Hex(5)HexNAc(4)Ne… 176 SERP… group 3 55.1
#> 2 P04196-344-… P04196 Hex(5)HexNAc(4)Ne… 344 HRG group 3 158.
#> 3 P04196-344-… P04196 Hex(5)HexNAc(4) 344 HRG group 3 81.4
#> 4 P10909-291-… P10909 Hex(6)HexNAc(5) 291 CLU group 3 21.1
#> 5 P04196-344-… P04196 Hex(5)HexNAc(4)Ne… 344 HRG group 3 405.
#> 6 P04196-345-… P04196 Hex(5)HexNAc(4) 345 HRG group 3 39.7
#> 7 P04196-344-… P04196 Hex(5)HexNAc(4)dH… 344 HRG group 3 273.
#> 8 P04196-344-… P04196 Hex(4)HexNAc(3) 344 HRG group 3 68.0
#> 9 P04196-344-… P04196 Hex(4)HexNAc(4)Ne… 344 HRG group 3 9.64
#> 10 P10909-291-… P10909 Hex(5)HexNAc(4) 291 CLU group 3 86.3
#> # ℹ 215 more rows
#> # ℹ 6 more variables: meansq <dbl>, statistic <dbl>, p_val <dbl>, p_adj <dbl>,
#> # effect_size <dbl>, post_hoc <chr>Notice something cool? 😎 gly_anova() thoughtfully adds
back all the descriptive columns from your variable tibble. Want to
control this behavior? Just use the add_info parameter!
The raw_result houses two lists of models - one for the
main test, one for post hoc comparisons:
get_tidy_result() and get_raw_result() are
useful to be used in pipes:
exp |>
gly_anova() |>
get_tidy_result("main_test") |>
filter(p_adj < 0.05)
#> ℹ Number of groups: 4
#> ℹ Groups: "C", "H", "M", and "Y"
#> ℹ Pairwise comparisons will be performed, with levels coming first as reference groups.
#> # A tibble: 53 × 14
#> variable protein glycan_composition protein_site gene term df sumsq
#> <glue> <chr> <comp> <int> <chr> <chr> <dbl> <dbl>
#> 1 P04196-344-H… P04196 Hex(5)HexNAc(4)Ne… 344 HRG group 3 158.
#> 2 P04196-344-H… P04196 Hex(5)HexNAc(4) 344 HRG group 3 81.4
#> 3 P04196-344-H… P04196 Hex(5)HexNAc(4)Ne… 344 HRG group 3 405.
#> 4 P04196-345-H… P04196 Hex(5)HexNAc(4) 345 HRG group 3 39.7
#> 5 P04196-344-H… P04196 Hex(5)HexNAc(4)dH… 344 HRG group 3 273.
#> 6 P10909-291-H… P10909 Hex(5)HexNAc(4) 291 CLU group 3 86.3
#> 7 P04196-344-H… P04196 Hex(5)HexNAc(4)dH… 344 HRG group 3 149.
#> 8 P13671-855-H… P13671 Hex(5)HexNAc(4)dH… 855 C6 group 3 80.9
#> 9 P04196-344-H… P04196 Hex(4)HexNAc(3)dH… 344 HRG group 3 138.
#> 10 P04196-344-H… P04196 Hex(5)HexNAc(4)dH… 344 HRG group 3 108.
#> # ℹ 43 more rows
#> # ℹ 6 more variables: meansq <dbl>, statistic <dbl>, p_val <dbl>, p_adj <dbl>,
#> # effect_size <dbl>, post_hoc <chr>Feeling constrained by [glyexp::experiment()]? Fear not! 🦸♀️ Every
gly_xxx() function comes with a flexible
gly_xxx_() sibling that works with standard R objects. For
instance, gly_anova_() happily accepts plain matrices:
expr_mat <- get_expr_mat(exp)
groups <- factor(get_sample_info(exp)$group)
anova_res2 <- gly_anova_(expr_mat, groups)This adaptability makes glystats a perfect team player
🤝 - it seamlessly integrates into existing analysis pipelines and
workflows, no matter what data structures you’re already using!
Ready to explore the full power of glystats? Here’s your
complete toolkit for glycomics and glycoproteomics data analysis:
gly_ttest(): Two-sample t-testgly_wilcox(): Wilcoxon rank sum testgly_anova(): One-way ANOVAgly_kruskal(): Kruskal-Wallis rank sum testgly_limma(): Linear models for microarray data
(limma)gly_fold_change(): Calculate fold changegly_pca(): Principal component analysisgly_tsne(): t-distributed stochastic neighbor embedding
(t-SNE)gly_umap(): Uniform manifold approximation and
projection (UMAP)gly_oplsda(): Orthogonal partial least squares
discriminant analysis (OPLS-DA)gly_plsda(): Partial least squares discriminant
analysis (PLS-DA)gly_kmeans(): K-means clusteringgly_hclust(): Hierarchical clusteringgly_cox(): Cox proportional hazards modelgly_enrich_go(): Gene Ontology enrichment analysisgly_enrich_kegg(): KEGG enrichment analysisgly_enrich_reactome(): Reactome enrichment
analysisgly_enrich_wikipathways(): WikiPathways enrichment
analysisgly_enrich_do(): Disease Ontology enrichment
analysisgly_enrich_ncg(): Network of Cancer Genes enrichment
analysisgly_enrich_dgn(): DisGeNET enrichment analysisgly_cor(): Correlation analysisgly_roc(): Receiver operating characteristic (ROC)
analysisReady to dive deeper into the glycoverse? Here’s your
roadmap to success:
📥 Data Import: Start with glyread
to seamlessly import your data into glyexp::experiment()
objects
🧹 Data Preprocessing: Use glyclean to polish and prepare your data for analysis
📊 Statistical Analysis: You’re here! Use
glystats to unlock powerful insights from your glycomics
data
🎨 Visualization: Stay tuned! We’re crafting an
amazing glyvis package for stunning data
visualizations
Happy analyzing! 🎉✨