---
title: "Get Started with glyexp"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Get Started with glyexp}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In glycomics and glycoproteomics experiments, you typically work with three types of data:

1. **Expression data** - the actual measurements of your biological molecules (glycans, glycopeptides, etc.)
2. **Molecular annotations** - the identifiers for your molecules (structures, sequences, etc.)
3. **Experimental metadata** - the context of your samples (time points, treatments, experimental conditions)

The `experiment()` class serves as a structured container that keeps all three data types organized and interconnected.

**Why should you care?** Every package in the `glycoverse` ecosystem speaks `experiment()` fluently. It's like having a universal translator for your glycomics workflow - everything just *clicks* together.

```{r setup}
library(glyexp)
library(dplyr)
library(conflicted)

# Resolve function conflicts - prefer glyexp version over deprecated dplyr version
# `dplyr::select_var` is deprecated anyway, so we can safely override it
conflicts_prefer(glyexp::select_var)
```

## Getting Started with glyexp

Let's begin with a simple example to illustrate the basic concepts.

```{r}
toy_exp <- toy_experiment
toy_exp
```

The summary provides an overview of your entire experiment - variables, observations, and all the metadata. The three core components can be extracted as follows:

### The Expression Matrix

The expression matrix contains your numerical data - rows are variables (molecules), columns are observations (samples).

```{r}
get_expr_mat(toy_exp)
```

This matrix is where the magic happens - the numbers in it tell your biological story.
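
To make the row/column orientation concrete, here is a small hand-built matrix in base R. The variable and sample names (`V1`, `S1`, and so on) and the values are made up for illustration; this is not the `toy_experiment` data.

```r
# Hand-built toy matrix (hypothetical names and values)
expr_mat <- matrix(
  c(10, 20, 30,
    40, 50, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("V1", "V2"),        # rows: variables (molecules)
    c("S1", "S2", "S3")   # columns: observations (samples)
  )
)

# One variable measured across all samples
expr_mat["V1", ]

# One sample measured across all variables
expr_mat[, "S2"]

# Number of variables, then number of samples
dim(expr_mat)
```

Reading a row gives the profile of one molecule across samples, and reading a column gives the profile of one sample across molecules.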

### Variable Information

The variable information table contains detailed annotations for each molecule.

```{r}
get_var_info(toy_exp)
```

Think of this as your molecular address book - every variable gets its own detailed profile.

### Sample Information

The sample information table records the experimental conditions for each sample.

```{r}
get_sample_info(toy_exp)
```

And this? This is your experimental diary - tracking every condition, timepoint, and treatment.

Notice that the "variable" column in `get_var_info()` and the "sample" column in `get_sample_info()` match the row and column names in your expression matrix. These are the **index columns** that maintain synchronization between data components.

## Data Manipulation with glyexp

glyexp provides dplyr-style functions for manipulating experiment objects. For each supported dplyr verb, glyexp provides two specialized versions:

- **`_obs()`** functions: work on your sample metadata
- **`_var()`** functions: work on your variable annotations

Here's an example of filtering for group "A" samples:

```{r}
subset_exp <- filter_obs(toy_exp, group == "A")
```

Let's check what happened to our sample info:

```{r}
get_sample_info(subset_exp)
```

Check the expression matrix:

```{r}
get_expr_mat(subset_exp)
```

The expression matrix is automatically filtered to match! That is what `filter_obs()` does: it filters the sample information and updates the expression matrix accordingly.

Variable filtering works the same way:

```{r}
toy_exp |>
  filter_obs(group == "A") |>
  filter_var(glycan_composition == "H5N2") |>
  get_expr_mat()
```

Notice how these functions support the pipe operator (`|>`)? That's the `dplyr` DNA in action!

The pattern is straightforward: glyexp functions expect and return `experiment()` objects, and they preserve the index columns during operations.

## Complete dplyr Function Reference

The following table lists all supported dplyr-style functions.
These functions maintain synchronization between the expression matrix, sample information, and variable information:

| dplyr Function | For Samples (`_obs`) | For Variables (`_var`) | What It Does |
|:---|:---|:---|:---|
| `filter()` | `filter_obs()` | `filter_var()` | Subset rows based on conditions |
| `select()` | `select_obs()` | `select_var()` | Choose specific columns |
| `arrange()` | `arrange_obs()` | `arrange_var()` | Reorder rows by column values |
| `mutate()` | `mutate_obs()` | `mutate_var()` | Create/modify columns |
| `rename()` | `rename_obs()` | `rename_var()` | Rename columns |
| `slice()` | `slice_obs()` | `slice_var()` | Select rows by position |
| `slice_head()` | `slice_head_obs()` | `slice_head_var()` | Select first n rows |
| `slice_tail()` | `slice_tail_obs()` | `slice_tail_var()` | Select last n rows |
| `slice_sample()` | `slice_sample_obs()` | `slice_sample_var()` | Select random rows |
| `slice_max()` | `slice_max_obs()` | `slice_max_var()` | Select rows with highest values |
| `slice_min()` | `slice_min_obs()` | `slice_min_var()` | Select rows with lowest values |
| `left_join()` | `left_join_obs()` | `left_join_var()` | Add new columns from another table (left join) |
| `inner_join()` | `inner_join_obs()` | `inner_join_var()` | Add new columns from another table (inner join) |
| `semi_join()` | `semi_join_obs()` | `semi_join_var()` | Keep rows that have a match in another table (semi join) |
| `anti_join()` | `anti_join_obs()` | `anti_join_var()` | Drop rows that have a match in another table (anti join) |

Each function automatically updates the expression matrix when you modify metadata. Filter samples, and the matrix follows. Rearrange variables, and the matrix adjusts accordingly.
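
The idea behind this synchronization can be sketched in base R. The following is a conceptual illustration only, not glyexp's actual implementation: the helper `filter_obs_sketch()` and the toy data are hypothetical, and the sketch assumes that the index column is what ties the table to the matrix.

```r
# Conceptual sketch only: NOT how glyexp is implemented internally.
expr_mat <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("V1", "V2", "V3"), c("S1", "S2", "S3", "S4"))
)
sample_info <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),  # index column, matches colnames(expr_mat)
  group  = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter the sample table first, then use the surviving
# index values to subset the matrix columns so both stay aligned.
filter_obs_sketch <- function(expr_mat, sample_info, keep) {
  kept_info <- sample_info[keep, , drop = FALSE]
  list(
    expr_mat    = expr_mat[, kept_info$sample, drop = FALSE],
    sample_info = kept_info
  )
}

res <- filter_obs_sketch(expr_mat, sample_info, sample_info$group == "A")

# The matrix columns and the index column agree after filtering
colnames(res$expr_mat)
res$sample_info$sample
```

Because the matrix is subset by name through the index column, the table and the matrix cannot drift apart, which is why the index columns matter.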

For functions not directly supported (like `distinct()`, `pull()`, `count()`, etc.), extract the tibble first:

```{r eval=FALSE}
# Extract the tibble, then use any dplyr function you want
toy_exp |>
  get_sample_info() |>
  distinct(group)

toy_exp |>
  get_var_info() |>
  pull(protein) |>
  unique()

toy_exp |>
  get_sample_info() |>
  count(group)
```

## Index Columns

As mentioned, the index columns maintain synchronization between data components. Avoid modifying these columns directly, as glyexp relies on them to keep everything connected.

![](experiment.png)

The index columns are essential for data integrity - they can be renamed but not removed.

Want to select specific columns from your sample info? Easy:

```{r}
toy_exp |>
  select_obs(group) |>
  get_sample_info()
```

The "sample" column remains protected: the index column cannot be removed.

## Matrix-Style Subsetting

Experiments can be subset using matrix-style indexing:

```{r}
subset_exp <- toy_exp[, 1:3]
```

This selects the first 3 samples and updates all components accordingly:

```{r}
get_expr_mat(subset_exp)
```

```{r}
get_sample_info(subset_exp)
```

Both the expression matrix and sample info are synchronized.

## Merging and Splitting

Experiments can be merged or split:

```R
merge(exp1, exp2)
```

The `merge()` function combines two experiments. If you need to preserve batch information, use `mutate_obs()` to add an ID column before merging.

What if you have more than two experiments to merge? Put them in a list and use `purrr::reduce()` to merge them:

```R
purrr::reduce(list(exp1, exp2, exp3), merge)
```

The `split()` function divides an experiment into a list of experiments based on a column:

```{r}
split(toy_exp, group, where = "sample_info")
```

Now the "A" experiment contains only samples from group "A", and the "B" experiment only samples from group "B".
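
As a rough mental model of splitting by a sample-info column, here is a base R sketch. It works on a plain matrix and data frame with made-up names, and is not how `split()` is implemented for `experiment()` objects.

```r
# Conceptual sketch with hypothetical toy data
expr_mat <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("V1", "V2"), c("S1", "S2", "S3", "S4"))
)
sample_info <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),
  group  = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Group the sample names by the chosen column ...
samples_by_group <- split(sample_info$sample, sample_info$group)

# ... then subset both components once per group
pieces <- lapply(samples_by_group, function(ids) {
  list(
    expr_mat    = expr_mat[, ids, drop = FALSE],
    sample_info = sample_info[sample_info$sample %in% ids, , drop = FALSE]
  )
})

names(pieces)
colnames(pieces$B$expr_mat)
```

Each element of `pieces` is named after a group value and carries a matrix and a sample table that cover the same samples.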

## Converting to Tibbles

For operations beyond what glyexp provides, you can extract the individual components with `get_expr_mat()`, `get_var_info()`, and `get_sample_info()`.

Alternatively, use `as_tibble()` to convert your experiment to a tibble in tidy format:

```{r}
as_tibble(toy_exp)
```

These tibbles can be large, so filtering first is recommended:

```{r}
toy_exp |>
  filter_var(glycan_composition == "H5N2") |>
  select_obs(group) |>
  select_var(-glycan_composition) |>
  as_tibble()
```

Much more manageable, right?

## Background and Design Principles

The `experiment()` class was designed with insights from related data containers:

**SummarizedExperiment**

The foundational omics data container from Bioconductor. Well-established for RNA-seq analysis.

**tidySummarizedExperiment**

The tidySummarizedExperiment package is an attempt to bring tidy principles to SummarizedExperiment. While the concept is sound, storing all components in a single tibble doesn't align with the mental model of separated data types.

**massdataset**

A related package for mass spectrometry data. It provides tidy operations, clean data separation, and data processing history tracking. We appreciate its approach to reproducibility. While object-oriented programming has its merits, glyexp takes a functional programming approach. Your analysis code serves as the reproducibility record - clear, transparent, and familiar to R users.

**Design Philosophy**

glyexp uses functional programming because it aligns with how R users work. The design emphasizes clear, chainable functions.

Thank you to all the developers who contributed to these foundational packages.

## What's Next?

Now you have a basic understanding of `experiment()`. Next, you can learn how to use `experiment()` in your analysis.

- [Creating Experiments](https://glycoverse.github.io/glyexp/articles/create-exp.html)
- [Experiment Types](https://glycoverse.github.io/glyexp/articles/exp-type.html)
- [Dplyr-Style Functions](https://glycoverse.github.io/glyexp/articles/dplyr-style-functions.html)