Getting Started with glystats

When we talk about omics data analysis, we’re usually diving into exciting territories like differential expression analysis, PCA, survival analysis, and much more! 🧬 glystats brings all these powerful analyses together under one unified, user-friendly interface.

🎯 Key Feature: glystats offers two levels of interfaces to fit your workflow:

gly_xxx(): seamlessly works with [glyexp::experiment()], the beating heart of the glycoverse ecosystem 💓
gly_xxx_(): flexible enough to work with general inputs like matrices or data frames

This vignette focuses on the gly_xxx() interface. New to [glyexp::experiment()]? No worries! Check out its introduction first to get up to speed.

library(glystats)
library(glyexp)
library(glyread)
library(glyclean)
#> 
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#> 
#>     aggregate
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

🔍 Meet Your Data

Let’s start by exploring our demo dataset:

# Here we use `glyread` to read in the data and use `glyclean` to perform preprocessing
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv") |> auto_clean()
#> ℹ Reading data
#> ℹ Finding leader proteins
#> ✔ Finding leader proteins [181ms]
#> 
#> ℹ Reading dataColumn group converted to <factor>.ℹ Parsing glycan compositions and structures
#> Column group converted to <factor>.✔ Parsing glycan compositions and structures [283ms]
#> 
#> ℹ Reading data✔ Reading data [869ms]
#> 
#> 
#> ── Normalizing data ──
#> 
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> ✔ Normalization completed.
#> 
#> ── Removing variables with too many missing values ──
#> 
#> ℹ Applying preset "discovery"...
#> ℹ Total removed: 2 (0.67%) variables.
#> ✔ Variable removal completed.
#> 
#> ── Imputing missing values ──
#> 
#> ℹ Imputation method: `impute_min_prob()`
#> ℹ Reason: default for "glycoproteomics" with n_samples < 30.
#> ✔ Imputation completed.
#> 
#> ── Aggregating data ──
#> 
#> ℹ Aggregating to "gf" level
#> ✔ Aggregation completed.
#> 
#> ── Normalizing data again ──
#> 
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> ✔ Normalization completed.
#> 
#> ── Correcting batch effects ──
#> 
#> ℹ Batch column batch not found in sample_info. Skipping batch correction.
#> ✔ Batch correction completed.
exp
#> 
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 225 variables
#> ℹ Sample information fields: group <fct>
#> ℹ Variable information fields: protein <chr>, glycan_composition <comp>, protein_site <int>, gene <chr>

Look at that! 🎉 We’ve got a [glyexp::experiment()] packed with 12 samples and 263 glycoforms. That’s plenty of data to work with!

get_var_info(exp)
#> # A tibble: 225 × 5
#>    variable                        protein glycan_composition protein_site gene 
#>    <glue>                          <chr>   <comp>                    <int> <chr>
#>  1 P08185-176-Hex(5)HexNAc(4)NeuA… P08185  Hex(5)HexNAc(4)Ne…          176 SERP…
#>  2 P04196-344-Hex(5)HexNAc(4)NeuA… P04196  Hex(5)HexNAc(4)Ne…          344 HRG  
#>  3 P04196-344-Hex(5)HexNAc(4)      P04196  Hex(5)HexNAc(4)             344 HRG  
#>  4 P10909-291-Hex(6)HexNAc(5)      P10909  Hex(6)HexNAc(5)             291 CLU  
#>  5 P04196-344-Hex(5)HexNAc(4)NeuA… P04196  Hex(5)HexNAc(4)Ne…          344 HRG  
#>  6 P04196-345-Hex(5)HexNAc(4)      P04196  Hex(5)HexNAc(4)             345 HRG  
#>  7 P04196-344-Hex(5)HexNAc(4)dHex… P04196  Hex(5)HexNAc(4)dH…          344 HRG  
#>  8 P04196-344-Hex(4)HexNAc(3)      P04196  Hex(4)HexNAc(3)             344 HRG  
#>  9 P04196-344-Hex(4)HexNAc(4)NeuA… P04196  Hex(4)HexNAc(4)Ne…          344 HRG  
#> 10 P10909-291-Hex(5)HexNAc(4)      P10909  Hex(5)HexNAc(4)             291 CLU  
#> # ℹ 215 more rows

The variable information tibble is like a detailed ID card for each glycoform 🆔 - it contains everything you need to know: the protein, glycosylation site, and glycan structures.

get_sample_info(exp)
#> # A tibble: 12 × 2
#>    sample                  group
#>    <chr>                   <fct>
#>  1 20241224-LXJ-Nglyco-H_1 H    
#>  2 20241224-LXJ-Nglyco-H_2 H    
#>  3 20241224-LXJ-Nglyco-H_3 H    
#>  4 20241224-LXJ-Nglyco-M_1 M    
#>  5 20241224-LXJ-Nglyco-M_2 M    
#>  6 20241224-LXJ-Nglyco-M_3 M    
#>  7 20241224-LXJ-Nglyco-Y_1 Y    
#>  8 20241224-LXJ-Nglyco-Y_2 Y    
#>  9 20241224-LXJ-Nglyco-Y_3 Y    
#> 10 20241224-LXJ-Nglyco-C_1 C    
#> 11 20241224-LXJ-Nglyco-C_2 C    
#> 12 20241224-LXJ-Nglyco-C_3 C

Our sample information tibble features a crucial “group” column 🏷️. Here’s the key: “H” = healthy, “M” = hepatitis, “Y” = cirrhosis, and “C” = hepatocellular carcinoma. This gives us a perfect setup for comparative analysis!

🚀 One Interface to Rule Them All

Here’s where glystats really shines! ✨ Every function follows a simple, intuitive naming pattern: gly_ + analysis name. Think gly_ttest() for t-tests, gly_pca() for PCA, and so on.

Pro tip: leverage RStudio’s auto-completion to discover all available functions! 💡

Let’s dive into action with an ANOVA analysis to identify differentially expressed glycoforms:

anova_res <- gly_anova(exp)
#> ℹ Number of groups: 4
#> ℹ Groups: "C", "H", "M", and "Y"
#> ℹ Pairwise comparisons will be performed, with levels coming first as reference groups.

Boom! 💥 Analysis complete in just one line! gly_anova() intelligently follows the glycoverse naming conventions, automatically detecting the “group” column in your sample info and fitting an ANOVA model for each glycoform.

📋 Understanding Your Results

All glystats functions return a consistent, well-structured list with two key components:

tidy_result: Clean, analysis-ready tibbles in tidy format 📊
raw_result: The original result objects from underlying statistical functions 🔧

For gly_anova(), the tidy_result contains two informative tibbles: main_test and post_hoc_test.

You can use get_tidy_result() to get the tidy result tibble:

get_tidy_result(anova_res, "main_test")
#> # A tibble: 225 × 14
#>    variable     protein glycan_composition protein_site gene  term     df  sumsq
#>    <glue>       <chr>   <comp>                    <int> <chr> <chr> <dbl>  <dbl>
#>  1 P08185-176-… P08185  Hex(5)HexNAc(4)Ne…          176 SERP… group     3  55.1 
#>  2 P04196-344-… P04196  Hex(5)HexNAc(4)Ne…          344 HRG   group     3 158.  
#>  3 P04196-344-… P04196  Hex(5)HexNAc(4)             344 HRG   group     3  81.4 
#>  4 P10909-291-… P10909  Hex(6)HexNAc(5)             291 CLU   group     3  21.1 
#>  5 P04196-344-… P04196  Hex(5)HexNAc(4)Ne…          344 HRG   group     3 405.  
#>  6 P04196-345-… P04196  Hex(5)HexNAc(4)             345 HRG   group     3  39.7 
#>  7 P04196-344-… P04196  Hex(5)HexNAc(4)dH…          344 HRG   group     3 273.  
#>  8 P04196-344-… P04196  Hex(4)HexNAc(3)             344 HRG   group     3  68.0 
#>  9 P04196-344-… P04196  Hex(4)HexNAc(4)Ne…          344 HRG   group     3   9.64
#> 10 P10909-291-… P10909  Hex(5)HexNAc(4)             291 CLU   group     3  86.3 
#> # ℹ 215 more rows
#> # ℹ 6 more variables: meansq <dbl>, statistic <dbl>, p_val <dbl>, p_adj <dbl>,
#> #   effect_size <dbl>, post_hoc <chr>

Notice something cool? 😎 gly_anova() thoughtfully adds back all the descriptive columns from your variable tibble. Want to control this behavior? Just use the add_info parameter!

The raw_result houses two lists of models - one for the main test, one for post hoc comparisons:

names(get_raw_result(anova_res))
#> [1] "main_test"     "post_hoc_test"

get_tidy_result() and get_raw_result() are useful to be used in pipes:

exp |>
  gly_anova() |>
  get_tidy_result("main_test") |>
  filter(p_adj < 0.05)
#> ℹ Number of groups: 4
#> ℹ Groups: "C", "H", "M", and "Y"
#> ℹ Pairwise comparisons will be performed, with levels coming first as reference groups.
#> # A tibble: 53 × 14
#>    variable      protein glycan_composition protein_site gene  term     df sumsq
#>    <glue>        <chr>   <comp>                    <int> <chr> <chr> <dbl> <dbl>
#>  1 P04196-344-H… P04196  Hex(5)HexNAc(4)Ne…          344 HRG   group     3 158. 
#>  2 P04196-344-H… P04196  Hex(5)HexNAc(4)             344 HRG   group     3  81.4
#>  3 P04196-344-H… P04196  Hex(5)HexNAc(4)Ne…          344 HRG   group     3 405. 
#>  4 P04196-345-H… P04196  Hex(5)HexNAc(4)             345 HRG   group     3  39.7
#>  5 P04196-344-H… P04196  Hex(5)HexNAc(4)dH…          344 HRG   group     3 273. 
#>  6 P10909-291-H… P10909  Hex(5)HexNAc(4)             291 CLU   group     3  86.3
#>  7 P04196-344-H… P04196  Hex(5)HexNAc(4)dH…          344 HRG   group     3 149. 
#>  8 P13671-855-H… P13671  Hex(5)HexNAc(4)dH…          855 C6    group     3  80.9
#>  9 P04196-344-H… P04196  Hex(4)HexNAc(3)dH…          344 HRG   group     3 138. 
#> 10 P04196-344-H… P04196  Hex(5)HexNAc(4)dH…          344 HRG   group     3 108. 
#> # ℹ 43 more rows
#> # ℹ 6 more variables: meansq <dbl>, statistic <dbl>, p_val <dbl>, p_adj <dbl>,
#> #   effect_size <dbl>, post_hoc <chr>

🔧 Maximum Flexibility Mode

Feeling constrained by [glyexp::experiment()]? Fear not! 🦸‍♀️ Every gly_xxx() function comes with a flexible gly_xxx_() sibling that works with standard R objects. For instance, gly_anova_() happily accepts plain matrices:

expr_mat <- get_expr_mat(exp)
groups <- factor(get_sample_info(exp)$group)
anova_res2 <- gly_anova_(expr_mat, groups)

This adaptability makes glystats a perfect team player 🤝 - it seamlessly integrates into existing analysis pipelines and workflows, no matter what data structures you’re already using!

🎪 The Complete Analytical Arsenal

Ready to explore the full power of glystats? Here’s your complete toolkit for glycomics and glycoproteomics data analysis:

🔬 Differential Expression Analysis:
- gly_ttest(): Two-sample t-test
- gly_wilcox(): Wilcoxon rank sum test
- gly_anova(): One-way ANOVA
- gly_kruskal(): Kruskal-Wallis rank sum test
- gly_limma(): Linear models for microarray data (limma)
- gly_fold_change(): Calculate fold change
📐 Dimensionality Reduction:
- gly_pca(): Principal component analysis
- gly_tsne(): t-distributed stochastic neighbor embedding (t-SNE)
- gly_umap(): Uniform manifold approximation and projection (UMAP)
- gly_oplsda(): Orthogonal partial least squares discriminant analysis (OPLS-DA)
- gly_plsda(): Partial least squares discriminant analysis (PLS-DA)
🧩 Clustering:
- gly_kmeans(): K-means clustering
- gly_hclust(): Hierarchical clustering
⏱️ Survival Analysis:
- gly_cox(): Cox proportional hazards model
🎯 Enrichment Analysis:
- gly_enrich_go(): Gene Ontology enrichment analysis
- gly_enrich_kegg(): KEGG enrichment analysis
- gly_enrich_reactome(): Reactome enrichment analysis
- gly_enrich_wikipathways(): WikiPathways enrichment analysis
- gly_enrich_do(): Disease Ontology enrichment analysis
- gly_enrich_ncg(): Network of Cancer Genes enrichment analysis
- gly_enrich_dgn(): DisGeNET enrichment analysis
🔧 Additional Tools:
- gly_cor(): Correlation analysis
- gly_roc(): Receiver operating characteristic (ROC) analysis

🚀 What’s Next on Your Journey?

Ready to dive deeper into the glycoverse? Here’s your roadmap to success:

📥 Data Import: Start with glyread to seamlessly import your data into glyexp::experiment() objects
🧹 Data Preprocessing: Use glyclean to polish and prepare your data for analysis
📊 Statistical Analysis: You’re here! Use glystats to unlock powerful insights from your glycomics data
🎨 Visualization: Stay tuned! We’re crafting an amazing glyvis package for stunning data visualizations

Happy analyzing! 🎉✨