---
title: "Creating Experiments"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Creating Experiments}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette describes how to create an `experiment()` object from scratch.

> **Note:** If you're using `glyread` to import your data, sample information is already handled, and you can focus on understanding the structure.

```{r setup}
library(glyexp)
library(glyrepr)
library(tibble)
```

## Required Components

Creating an `experiment()` object requires three key components:

- **Expression Matrix**: Numeric data with variables as rows and samples as columns
- **Sample Information**: A tibble containing sample metadata (groups, batches, etc.)
- **Variable Information**: A tibble describing variables (proteins, peptides, glycan compositions, etc.)

You also need to specify experiment metadata.

## Step 1: Sample Information

The sample information table records metadata for each sample.

The golden rule here is simple: your first column **must** be named `sample`, and every entry needs to be unique (no duplicates allowed!). These are your sample identifiers: the names that tie everything together.

For the remaining columns, you have complete freedom! Add any information that matters for your analysis: groups, batches, patient demographics, treatment conditions, whatever helps tell your story.

### Recommended Column Names

The glycoverse ecosystem uses these standard column names:

- **`group`**: Experimental conditions or treatments (use factor type). Most `glystats` functions rely on this column.
- **`batch`**: Sample batch information (use factor type). Essential for `glyclean::remove_batch_effect()`. If no batch column exists, `glyclean::auto_clean()` skips batch correction.
- **`bio_rep`**: Biological replicates (use factor type).
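
The sample-table rules above (first column named `sample`, unique identifiers, factor-typed recommended columns) can be checked by hand before building an experiment. The chunk below is a minimal sketch using only base R and `tibble`; `check_sample_info()` and `demo_info` are hypothetical names introduced for illustration and are not part of `glyexp`.

```{r}
# Hypothetical helper (not part of glyexp): checks the sample-table rules
# described above before the table is passed to experiment().
check_sample_info <- function(x) {
  stopifnot(
    "first column must be named 'sample'" = identical(names(x)[1], "sample"),
    "sample identifiers must be unique" = !anyDuplicated(x$sample)
  )
  # Report which of the recommended columns are present and stored as factors
  recommended <- intersect(c("group", "batch", "bio_rep"), names(x))
  vapply(x[recommended], is.factor, logical(1))
}

demo_info <- tibble::tibble(
  sample = c("S1", "S2", "S3", "S4"),
  group = factor(c("A", "A", "B", "B"), levels = c("A", "B")),
  batch = c(1, 2, 1, 2) # deliberately left numeric for the demonstration
)
check_sample_info(demo_info)
```

The helper returns a named logical vector, so any recommended column that is not yet a factor is easy to spot and convert manually.
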

> **Note:** The function automatically coerces column types to expected types if they don't match. This behavior can be controlled by the `coerce_col_types` and `check_col_types` arguments. Manual conversion is recommended for finer control over details like factor levels.

Let's build our sample information table:

```{r}
sample_info <- tibble(
  sample = c("S1", "S2", "S3", "S4", "S5", "S6"),
  group = factor(c("A", "A", "A", "B", "B", "B"), levels = c("A", "B")),
  batch = factor(c(1, 2, 1, 2, 1, 2), levels = c(1, 2))
)
sample_info
```

## Step 2: Variable Information

The variable information table describes what each measurement represents.

**For glycoproteomics:** If you're using `glyread`, variable information is automatically extracted from software output files.

**For glycomics:** Variable information may need to be built manually. Glycomics data structures are simpler than glycoproteomics.

### Variable Naming

The first column must be named `variable` with unique values. Names don't need to be meaningful - `glyread` uses simple names like "V1", "V2", "V3" by default.

### Column Conventions by Experiment Type

**For glycomics experiments:**

- **`glycan_composition`**: Required. Glycan composition as a `glyrepr::glycan_composition()` object
- **`glycan_structure`**: Optional. Glycan structure as a `glyrepr::glycan_structure()` object

(Parsed objects are recommended over character strings to avoid repeated parsing)

**For glycoproteomics experiments:**

- **`protein`**: Required. UniProt accession (character)
- **`protein_site`**: Required. Glycosylation site position on the protein (integer)
- **`gene`**: Optional. Gene name (character)
- **`peptide`**: Optional. Peptide sequence (character)
- **`peptide_site`**: Optional.
  Peptide site position for glycan attachment (integer)

Let's create our variable information table:

```{r}
var_info <- tibble(
  variable = c("V1", "V2", "V3"),
  glycan_composition = glyrepr::glycan_composition(
    c(GalNAc = 1),
    c(Gal = 1, GalNAc = 1),
    c(Gal = 1, GalNAc = 1, GlcNAc = 1)
  ),
  glycan_structure = glyrepr::as_glycan_structure(c(
    "GalNAc(a1-",
    "Gal(b1-3)GalNAc(a1-",
    "Gal(b1-3)[GlcNAc(a1-6)]GalNAc(a1-"
  ))
)
var_info
```

## Step 3: Expression Matrix

The expression matrix contains your actual measurements. Layout: variables as rows, samples as columns.

No transformation (like log-transformation) is needed - glycoverse functions handle this internally if required.

### Matching Requirements

Row names must match the `variable` column from variable information, and column names must match the `sample` column from sample information. The order doesn't need to be perfect - functions handle alignment automatically.

Let's create our expression matrix:

```{r}
# Create a simple matrix with 3 variables and 6 samples
expr_mat <- matrix(
  rnorm(18, mean = 10, sd = 2), # Some realistic-looking data
  nrow = 3,
  ncol = 6
)
rownames(expr_mat) <- var_info$variable
colnames(expr_mat) <- sample_info$sample
expr_mat
```

## Step 4: Creating the Experiment

Now assemble the experiment object:

```{r}
exp <- experiment(expr_mat, sample_info, var_info, exp_type = "glycomics", glycan_type = "N")
exp
```

Specify the experiment type (`"glycomics"` or `"glycoproteomics"`) and glycan type (like `"N"` for N-glycans, `"O-GalNAc"` for O-GalNAc glycans, etc.). These details help downstream functions interpret the data correctly.

## Minimum Required Input

The minimum required input is only an expression matrix.

```{r}
expr_mat <- matrix(runif(9), nrow = 3, ncol = 3)
colnames(expr_mat) <- c("S1", "S2", "S3")
rownames(expr_mat) <- c("V1", "V2", "V3")
experiment(expr_mat)
```

Any component you leave out is generated automatically according to these rules:

- `sample_info`: a tibble with a single column named "sample", filled with the column names of `expr_mat`.
- `var_info`: a tibble with a single column named "variable", filled with the row names of `expr_mat`.
- `exp_type`: "others"
- `glycan_type`: `NULL`

This means you can create `experiment()` objects flexibly: first create the backbone from an expression matrix, then add information later using `mutate_var()` and `mutate_obs()`. If you have data in multiple tables, use `left_join_var()` and `left_join_obs()` to join them.
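
As a rough sketch of that incremental workflow, the chunk below starts from a matrix-only experiment and then enriches the sample information. It assumes that `mutate_obs()` follows `dplyr::mutate()` semantics, that `left_join_obs()` accepts a tibble and a `by` argument like `dplyr::left_join()`, and that `get_sample_info()` returns the sample table; `backbone` and `extra_info` are hypothetical names with made-up values, used only for illustration.

```{r}
library(glyexp)

# Backbone: an experiment built from the expression matrix alone
backbone_mat <- matrix(
  runif(9),
  nrow = 3,
  dimnames = list(c("V1", "V2", "V3"), c("S1", "S2", "S3"))
)
backbone <- experiment(backbone_mat)

# Add a sample-level column (assumes dplyr::mutate()-style semantics)
backbone <- mutate_obs(backbone, group = factor(c("A", "A", "B")))

# Join sample metadata kept in a separate, hypothetical table
extra_info <- tibble::tibble(
  sample = c("S1", "S2", "S3"),
  age = c(34, 51, 47) # illustrative values only
)
backbone <- left_join_obs(backbone, extra_info, by = "sample")

get_sample_info(backbone)
```

The same pattern should apply to variable information through `mutate_var()` and `left_join_var()`, keyed on the `variable` column.
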