Creating Experiments

This vignette describes how to create an experiment() object from scratch.

Note: If you’re using glyread to import your data, sample information is already handled, and you can focus on understanding the structure.

library(glyexp)
library(glyrepr)
library(tibble)

Required Components

Creating an experiment() object requires three key components:

Expression Matrix: Numeric data with variables as rows and samples as columns
Sample Information: A tibble containing sample metadata (groups, batches, etc.)
Variable Information: A tibble describing variables (proteins, peptides, glycan compositions, etc.)

You also need to specify experiment metadata.

Step 1: Sample Information

The sample information table records metadata for each sample.

The golden rule here is simple: your first column must be named sample, and every entry needs to be unique (no duplicates allowed!). These are your sample identifiers – the names that tie everything together.

For the remaining columns, you have complete freedom! Add any information that matters for your analysis: groups, batches, patient demographics, treatment conditions – whatever helps tell your story.

Recommended Column Names

The glycoverse ecosystem uses these standard column names:

group: Experimental conditions or treatments (use factor type). Most glystats functions rely on this column.
batch: Sample batch information (use factor type). Essential for glyclean::remove_batch_effect(). If no batch column exists, glyclean::auto_clean() skips batch correction.
bio_rep: Biological replicates (use factor type).

Note: The function automatically coerces column types to expected types if they don’t match. This behavior can be controlled by the coerce_col_types and check_col_types arguments. Manual conversion is recommended for finer control over details like factor levels.

Let’s build our sample information table:

sample_info <- tibble(
  sample = c("S1", "S2", "S3", "S4", "S5", "S6"),
  group = factor(c("A", "A", "A", "B", "B", "B"), levels = c("A", "B")),
  batch = factor(c(1, 2, 1, 2, 1, 2), levels = c(1, 2))
)
sample_info
#> # A tibble: 6 × 3
#>   sample group batch
#>   <chr>  <fct> <fct>
#> 1 S1     A     1    
#> 2 S2     A     2    
#> 3 S3     A     1    
#> 4 S4     B     2    
#> 5 S5     B     1    
#> 6 S6     B     2

Step 2: Variable Information

The variable information table describes what each measurement represents.

For glycoproteomics: If you’re using glyread, variable information is automatically extracted from software output files.

For glycomics: Variable information may need to be built manually. Glycomics data structures are simpler than glycoproteomics.

Variable Naming

The first column must be named variable with unique values. Names don’t need to be meaningful - glyread uses simple names like “V1”, “V2”, “V3” by default.

Column Conventions by Experiment Type

For glycomics experiments:

glycan_composition: Required. Glycan composition as a glyrepr::glycan_composition() object
glycan_structure: Optional. Glycan structure as a glyrepr::glycan_structure() object (Parsed objects are recommended over character strings to avoid repeated parsing)

For glycoproteomics experiments:

protein: Required. UniProt accession (character)
protein_site: Required. Glycosylation site position on the protein (integer)
gene: Optional. Gene name (character)
peptide: Optional. Peptide sequence (character)
peptide_site: Optional. Peptide site position for glycan attachment (integer)

Let’s create our variable information table:

var_info <- tibble(
  variable = c("V1", "V2", "V3"),
  glycan_composition = glyrepr::glycan_composition(
    c(GalNAc = 1),
    c(Gal = 1, GalNAc = 1),
    c(Gal = 1, GalNAc = 1, GlcNAc = 1)
  ),
  glycan_structure = glyrepr::as_glycan_structure(c(
    "GalNAc(a1-",
    "Gal(b1-3)GalNAc(a1-",
    "Gal(b1-3)[GlcNAc(a1-6)]GalNAc(a1-"
  ))
)
var_info
#> # A tibble: 3 × 3
#>   variable glycan_composition       glycan_structure                 
#>   <chr>    <comp>                   <struct>                         
#> 1 V1       GalNAc(1)                GalNAc(a1-                       
#> 2 V2       Gal(1)GalNAc(1)          Gal(b1-3)GalNAc(a1-              
#> 3 V3       Gal(1)GlcNAc(1)GalNAc(1) Gal(b1-3)[GlcNAc(a1-6)]GalNAc(a1-

Step 3: Expression Matrix

The expression matrix contains your actual measurements. Layout: variables as rows, samples as columns.

No transformation (like log-transformation) is needed - glycoverse functions handle this internally if required.

Matching Requirements

Row names must match the variable column from variable information, and column names must match the sample column from sample information. The order doesn’t need to be perfect - functions handle alignment automatically.

Let’s create our expression matrix:

# Create a simple matrix with 3 variables and 6 samples
expr_mat <- matrix(
  rnorm(18, mean = 10, sd = 2), # Some realistic-looking data
  nrow = 3, 
  ncol = 6
)
rownames(expr_mat) <- var_info$variable
colnames(expr_mat) <- sample_info$sample
expr_mat
#>           S1        S2        S3       S4        S5       S6
#> V1  6.622033 10.552185  8.845018 6.086328 10.568779 10.91840
#> V2 10.503385 11.547696 11.787123 8.364052  9.603212 10.32853
#> V3  9.336817  8.737436  8.229095 8.252884 10.038534 11.46344

Step 4: Creating the Experiment

Now assemble the experiment object:

exp <- experiment(expr_mat, sample_info, var_info, exp_type = "glycomics", glycan_type = "N")
exp
#> 
#> ── Glycomics Experiment ────────────────────────────────────────────────────────
#> ℹ Expression matrix: 6 samples, 3 variables
#> ℹ Sample information fields: group <fct>, batch <fct>
#> ℹ Variable information fields: glycan_composition <comp>, glycan_structure <struct>

Specify the experiment type ("glycomics" or "glycoproteomics") and glycan type (like "N" for N-glycans, "O-GalNAc" for O-GalNAc glycans, etc.). These details help downstream functions interpret the data correctly.

Minimum Required Input

The minimum required input is only an expression matrix.

expr_mat <- matrix(runif(9), nrow = 3, ncol = 3)
colnames(expr_mat) <- c("S1", "S2", "S3")
rownames(expr_mat) <- c("V1", "V2", "V3")
experiment(expr_mat)
#> 
#> ── Others Experiment ───────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 3 variables
#> ℹ Sample information fields: none
#> ℹ Variable information fields: none

If any other information is not provided, it will be automatically generated based on the following rules:

sample_info: a tibble with only one column named “sample”, same as the column names of expr_mat.
var_info: a tibble with only one column named “variable”, same as the row names of expr_mat.
exp_type: “others”
glycan_type: NULL

This means you can create experiment() objects flexibly: first create the backbone using an expression matrix, then add information later using mutate_var() and mutate_obs(). If you have data in multiple tables, use left_join_var() and left_join_obs() to join them.