Every omics data analysis journey begins with the same challenge: taming your raw data. Think of it like preparing ingredients before cooking a gourmet meal – you need to wash, chop, and season everything just right. In the glycomics and glycoproteomics world, this means normalization, missing value handling, and batch effect correction.
Meet glyclean – your Swiss Army knife for
glycoproteomics and glycomics data preprocessing! This package provides
a comprehensive toolkit that takes the guesswork out of data cleaning,
with specialized methods designed specifically for the unique challenges
of glycan analysis.
Important Note: This package is primarily designed
for glyexp::experiment() objects. If you’re new to this
data structure, we highly recommend checking out its
introduction first. We’ll also be using the glyread package to load
our data – it’s the go-to tool in the glycoverse
ecosystem.
Let’s dive in with a real-world dataset that will showcase what
glyclean can do. We’ll use
glyread::read_pglyco3_pglycoquant() to load our data into a
proper glyexp::experiment() object.
exp <- real_experiment |>
mutate_obs(batch = factor(rep(c("A", "B", "C"), 4)))
exp
#>
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 4262 variables
#> ℹ Sample information fields: group <fct>, batch <fct>
#> ℹ Variable information fields: peptide <chr>, peptide_site <int>, protein <chr>, protein_site <int>, gene <chr>, glycan_composition <comp>, glycan_structure <struct>Let’s peek under the hood and see what we’re working with:
get_var_info(exp)
#> # A tibble: 4,262 Ă— 8
#> variable peptide peptide_site protein protein_site gene glycan_composition
#> <chr> <chr> <int> <chr> <int> <chr> <comp>
#> 1 P08185-N1… NKTQGK 1 P08185 176 SERP… Hex(5)HexNAc(4)Ne…
#> 2 P04196-N3… HSHNNN… 5 P04196 344 HRG Hex(5)HexNAc(4)Ne…
#> 3 P04196-N3… HSHNNN… 5 P04196 344 HRG Hex(5)HexNAc(4)
#> 4 P04196-N3… HSHNNN… 5 P04196 344 HRG Hex(5)HexNAc(4)Ne…
#> 5 P10909-N2… HNSTGC… 2 P10909 291 CLU Hex(6)HexNAc(5)
#> 6 P04196-N3… HSHNNN… 5 P04196 344 HRG Hex(5)HexNAc(4)Ne…
#> 7 P04196-N3… HSHNNN… 6 P04196 345 HRG Hex(5)HexNAc(4)
#> 8 P04196-N3… HSHNNN… 5 P04196 344 HRG Hex(5)HexNAc(4)dH…
#> 9 P04196-N3… HSHNNN… 5 P04196 344 HRG Hex(4)HexNAc(3)
#> 10 P04196-N3… HSHNNN… 5 P04196 344 HRG Hex(4)HexNAc(4)Ne…
#> # ℹ 4,252 more rows
#> # ℹ 1 more variable: glycan_structure <struct>get_sample_info(exp)
#> # A tibble: 12 Ă— 3
#> sample group batch
#> <chr> <fct> <fct>
#> 1 C1 C A
#> 2 C2 C B
#> 3 C3 C C
#> 4 H1 H A
#> 5 H2 H B
#> 6 H3 H C
#> 7 M1 M A
#> 8 M2 M B
#> 9 M3 M C
#> 10 Y1 Y A
#> 11 Y2 Y B
#> 12 Y3 Y CWhat we have here is a beautiful N-glycoproteomics dataset featuring 500 PSMs (Peptide Spectrum Matches) across 12 samples. These samples come from 3 different batches and represent 4 distinct biological groups – a perfect playground for demonstrating preprocessing techniques!
Ready for some magic? Watch this:
clean_exp <- auto_clean(exp)
#>
#> ── Normalizing data ──
#>
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> âś” Normalization completed.
#>
#> ── Removing variables with too many missing values ──
#>
#> ℹ Applying preset "discovery"...
#> ℹ Total removed: 24 (0.56%) variables.
#> âś” Variable removal completed.
#>
#> ── Imputing missing values ──
#>
#> ℹ Imputation method: `impute_min_prob()`
#> ℹ Reason: default for "glycoproteomics" with n_samples < 30.
#> âś” Imputation completed.
#>
#> ── Aggregating data ──
#>
#> ℹ Aggregating to "gfs" level
#> âś” Aggregation completed.
#>
#> ── Normalizing data again ──
#>
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> âś” Normalization completed.
#>
#> ── Correcting batch effects ──
#>
#> ℹ Batch effects detected in "2.2%" of variables (<= "30.0%"). Skipping batch correction.
#> ✔ Batch correction completed.That’s it! Your data is now preprocessed and ready for analysis! 🎉
But Wait, What Just Happened? auto_clean() isn’t
actually magic (sorry to disappoint) – it’s a carefully designed
intelligent pipeline that:
Let’s look into the details of what auto_clean() does.
Here is a simplified version of the function:
auto_clean <- function(exp, ...) {
# ... are other arguments, see documentation for more details
if (glyexp::get_exp_type(exp) == "glycoproteomics") {
exp <- auto_normalize(exp, ...)
exp <- auto_remove(exp, ...)
exp <- auto_impute(exp, ...)
exp <- auto_aggregate(exp)
exp <- auto_normalize(exp, ...)
exp <- auto_correct_batch_effect(exp, ...)
} else if (glyexp::get_exp_type(exp) == "glycomics") {
exp <- auto_remove(exp, ...)
exp <- auto_normalize(exp, ...)
exp <- normalize_total_area(exp)
exp <- auto_impute(exp, ...)
exp <- auto_correct_batch_effect(exp, ...)
} else {
stop("The experiment type must be 'glycoproteomics' or 'glycomics'.")
}
exp
}As you can see, auto_clean() calls the following
functions in sequence:
auto_normalize(): Automatically normalize the dataauto_remove(): Automatically remove variables with too
many missing valuesauto_impute(): Automatically impute the missing
valuesauto_aggregate(): Automatically aggregate the dataauto_correct_batch_effect(): Automatically correct the
batch effectsThese functions apply deterministic defaults for the given dataset.
How auto_normalize() works:
When QC samples are available: QC samples do not change the automatic method choice and are not used for normalization diagnostics.
For glycomics data: Total area normalization is used as the default.
For glycoproteomics data and other experiment types: Median normalization is used as the default.
You can make a custom pipeline by calling these functions in different orders:
clean_exp <- exp |>
auto_remove() |>
auto_normalize() |>
auto_impute() |>
auto_aggregate()
#> ℹ Applying preset "discovery"...
#> ℹ Total removed: 24 (0.56%) variables.
#> ℹ Normalization method: `normalize_median()`
#> ℹ Reason: default for "glycoproteomics".
#> ℹ Imputation method: `impute_min_prob()`
#> ℹ Reason: default for "glycoproteomics" with n_samples < 30.
#> ℹ Aggregating to "gfs" levelWhile auto_clean() and other auto_xxx()
functions are fantastic for getting started, you’ll eventually want more
control over your preprocessing pipeline. Let’s explore each step
individually – think of it as learning to cook rather than just ordering
takeout!
Imagine you’re comparing heights of people measured in different units – some in feet, some in meters, some in furlongs (okay, maybe not furlongs). Normalization does the same thing for your omics data, bringing all intensities to a comparable scale.
Available normalization methods in
glyclean:
| Function | Description | Best For |
|---|---|---|
normalize_median() |
Median-based normalization | General use, robust to outliers |
normalize_median_abs() |
Median absolute deviation | When you need extra robustness |
normalize_median_quotient() |
Median quotient method | Glycomics data (compositional) |
normalize_quantile() |
Quantile normalization | When you want identical distributions |
normalize_total_area() |
Total area normalization | Relative abundance data |
normalize_rlr() |
Robust linear regression | Complex batch designs |
normalize_rlrma() |
Robust linear regression with median adjustment | Complex batch designs |
normalize_rlrmacyc() |
Robust linear regression with median adjustment and cyclic smoothing | Complex batch designs |
normalize_loessf() |
LOESS with feature smoothing | Non-linear trends |
normalize_loesscyc() |
LOESS with cyclic smoothing | Cyclic data |
normalize_vsn() |
Variance stabilizing normalization | Heteroscedastic data |
transform_clr() |
Centered Log-Ratio transformation | Glycomics data (compositional) |
transform_alr() |
Additive Log-Ratio transformation | Glycomics data (compositional) |
Pro Tips:
by parameter in many functions? This allows
stratified normalization within groups – super useful when you have
distinct experimental conditions!by parameter
accepts both column names from your sample metadata and direct vectors!
This gives you complete control over grouping, even for custom groupings
that aren’t stored in your experiment object.# Example: Using by parameter with a custom vector
# Normalize within custom groups (e.g., based on some external criteria)
custom_groups <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "C", "C")
normalized_exp <- normalize_median(exp, by = custom_groups)Compositional Data: CLR, ALR, and Median Quotient Normalization
For glycomics data (where intensities represent relative abundances of glycans), standard normalization methods may not be appropriate. Instead, compositional data analysis techniques are recommended:
normalize_median_quotient(): Median
quotient normalization. A robust method specifically designed for
compositional data, dividing each sample by its median quotient relative
to a reference sample.
transform_clr(): Centered Log-Ratio
transformation with glycowork-compatible internal logic.
The transformation is carried out on the log2 scale and
then back-transformed to the original ratio space, so zeros remain zeros
in the returned result. The stochastic branch uses per-feature noise,
matching glycowork.
transform_alr(): Additive Log-Ratio
transformation with glycowork-compatible internal logic.
The reference glycan is chosen based on Procrustes correlation to CLR
geometry and minimal between-group variance, and the final result is
returned in the original ratio space. If no suitable reference is found,
ALR falls back to CLR.
The CLR and ALR methods align with the best practices described in DOI: 10.1038/s41467-025-56249-3 for glycomics data preprocessing.
Here we do the median normalization manually:
Some variables are like that friend who never shows up to plans – they’re missing most of the time and aren’t very helpful. In omics data, variables with too many missing values are often more noise than signal.
Available variable filtering functions in
glyclean:
| Function | Description | Best For |
|---|---|---|
remove_rare() |
Remove variables with too many missing values | General use |
remove_low_var() |
Remove variables with low variance | General use |
remove_low_cv() |
Remove variables with low coefficient of variation | General use |
remove_constant() |
Remove constant variables | General use |
remove_low_expr() |
Remove variables with low expression | General use |
Here we remove variables with more than 50% missing values in all samples:
Here’s where things get scientifically interesting! Missing values in mass spectrometry aren’t randomly distributed – they follow patterns based on the physics and chemistry of the measurement process.
The Science: Missing values in MS-based omics data are typically “Missing Not At Random” (MNAR), meaning they’re related to the intensity of the signal. Low-abundance ions are more likely to be missed, either due to poor ionization efficiency or because they fall below the detection threshold.
Your imputation toolkit:
| Function | Method | Best For |
|---|---|---|
impute_zero() |
Replace with zeros | Quick and simple |
impute_sample_min() |
Sample minimum values | Small datasets |
impute_half_sample_min() |
Half of sample minimum | Conservative approach |
impute_min_prob() |
Probabilistic minimum | Medium datasets |
impute_miss_forest() |
Random Forest ML | Large datasets |
impute_bpca() |
Bayesian PCA | High correlation structure |
impute_ppca() |
Probabilistic PCA | Linear relationships |
impute_sw_knn() |
K-nearest neighbors (Sample-wise) | Local similarity patterns |
impute_fw_knn() |
K-nearest neighbors (Feature-wise) | Local similarity patterns |
impute_svd() |
Singular value decomposition | High correlation structure |
For demonstration, let’s use the simple zero imputation:
Here’s where glycoproteomics gets uniquely challenging! Search engines typically report results at the PSM or peptide level, but what we really care about are glycoforms – the specific glycan structures attached to specific protein sites.
The Problem: One glycoform can appear multiple times in your data due to:
The Solution: Intelligent aggregation!
Aggregation levels available:
Pro move: Re-normalize after aggregation to account for the new intensity distributions:
Batch effects are the “different violinists playing the same piece” problem of omics data. Even when following identical protocols, subtle differences in instruments, reagents, or environmental conditions can introduce systematic bias.
The Good News: If your experimental design is well-controlled (conditions distributed across batches), batch effect correction can work wonders!
# To detect batch effects:
p_values <- detect_batch_effect(aggregated_exp2)
#> ℹ Detecting batch effects using ANOVA for 2738 variables...
#> âś” Batch effect detection completed. 19 out of 2738 variables show significant batch effects (p < 0.05).
p_values[1:5]
#> P08185-176-Hex(5)HexNAc(4)NeuAc(2) P04196-344-Hex(5)HexNAc(4)NeuAc(1)
#> 0.6998998 0.3951691
#> P04196-344-Hex(5)HexNAc(4) P10909-291-Hex(6)HexNAc(5)
#> 0.4852698 0.3329619
#> P04196-344-Hex(5)HexNAc(4)NeuAc(2)
#> 0.4907154Here we do not have batch effects, but we will correct it anyway for demonstration.
Here’s a fascinating biological puzzle: when you measure a glycopeptide’s intensity, you’re actually seeing two stories at once!
The Dual Nature Problem: Every glycopeptide intensity is a tale of two factors:
It’s like trying to understand applause volume – is the audience bigger, or are they just more enthusiastic?
The Solution: If you’re specifically interested in glycosylation changes (not protein abundance changes), you need to mathematically “subtract out” the protein expression component. Think of it as noise cancellation for biology!
# We don't run the code here because we don't have the protein expression matrix.
adjusted_exp <- adjust_protein(inferred_exp, pro_expr_mat)Important Prerequisites:
glyclean functions for this too!)Pro Tip: Most glyclean functions
happily accept plain matrices as input, so you can preprocess your
protein expression data using the same toolkit. Consistency is key!
Congratulations! You now have beautifully preprocessed, analysis-ready glycoproteomics data. Your data has been normalized, filtered, imputed, aggregated, and batch-corrected – it’s ready to reveal its biological secrets!
glyexp::experiment() Objects
đź”§While this vignette focuses on glyexp::experiment()
objects, glyclean is designed to be flexible and
accommodating to different workflows and data formats.
Most glyclean functions also support plain matrices as
input! This means you can use the package’s powerful preprocessing
capabilities even if you’re working with traditional data formats. The
functions intelligently detect your input type and handle it
appropriately.
Many functions in glyclean accept a by
parameter for stratified processing. This parameter offers maximum
flexibility – it accepts both column names from your sample metadata and
direct vectors!
# Using column names (standard approach)
normalized_exp <- normalize_median(exp, by = "group")
# Using custom vectors (advanced approach)
custom_groups <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "C", "C")
normalized_exp <- normalize_median(exp, by = custom_groups)
# This also works with matrices
normalized_matrix <- normalize_median(my_matrix, by = custom_groups)This gives you complete control over grouping, even for custom groupings that aren’t stored in your experiment object. Perfect for complex experimental designs or external grouping criteria!
glyclean provides a comprehensive suite of quality
control plotting functions to help you visualize and assess the quality
of your data at various stages of the preprocessing pipeline.
Available QC plotting functions in
glyclean:
| Function | Description | Best For |
|---|---|---|
plot_missing_heatmap() |
Binary heatmap of missing value patterns | Visualizing global missingness structure |
plot_missing_bar() |
Bar plot of missing proportions | Identifying samples/variables with high missingness |
plot_tic_bar() |
Total intensity (TIC) bar plot | Checking for systematic intensity differences between samples |
plot_rank_abundance() |
Protein rank abundance plot | Assessing the dynamic range of detected proteins |
plot_int_boxplot() |
Log2-intensity boxplots | Visualizing data distribution and normalization effects |
plot_rle() |
Relative Log Expression (RLE) boxplots | Detecting sample-wise bias and batch effects |
plot_cv_dent() |
CV density plot | Assessing reproducibility and technical variation |
plot_batch_pca() |
PCA score plot by batch | Visualizing batch effects and sample clustering |
plot_rep_scatter() |
Replicate scatter plots | Checking concordance between replicate samples |
These functions are designed to work seamlessly with
glyexp::experiment() objects and provide consistent,
high-quality visualizations using ggplot2.
For example, you can use plot_missing_heatmap() to
visualize the missing values in your data before and after cleaning:
With your clean data in hand, you’re ready to dive into the exciting world of glyco-omics analysis:
The glycoverse ecosystem has tools for all of these and
more!
glyexp::experiment() Objects đź”§