---
title: "dplyr-Style Functions: Data Harmony in Action"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{dplyr-Style Functions: Data Harmony in Action}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette describes glyexp's dplyr-style functions for synchronized data manipulation.

When working with multi-table datasets, filtering one table can desynchronize your data from other components. Rearranging another table can break carefully established relationships.

**glyexp's dplyr-style functions** address this by understanding the connection between your expression matrix, sample information, and variable annotations. When you transform one component, everything else follows in synchronization.

**Note:** These functions only work with `experiment()` objects - they cannot be used on regular data.frames, tibbles, or other data structures.

```{r setup}
library(glyexp)
library(dplyr)
library(conflicted)

conflicts_prefer(glyexp::select_var)
conflicts_prefer(dplyr::filter)
```

## Core Philosophy: One Action, Three Updates

glyexp's dplyr-style functions work on three components:

1. **Expression Matrix**: Numerical data
2. **Sample Info**: Experimental metadata
3. **Variable Info**: Molecular annotations

In traditional data analysis, filtering samples requires manually updating all related tables. glyexp's dplyr-style functions handle this synchronization automatically.

Here's an example:

```{r}
toy_exp <- toy_experiment
print(toy_exp)
```

## Two Flavors: `_obs()` and `_var()`

Every dplyr-style function in glyexp comes in two variants:

- **`_obs()` functions**: Work on sample information in `experiment()` objects
- **`_var()` functions**: Work on variable annotations in `experiment()` objects

Both variants automatically update the expression matrix to maintain synchronization.
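
To make "synchronization" concrete, here is a rough base R sketch of the manual bookkeeping that an `_obs()`-style filter saves you from. It is only an illustration under assumptions: `demo_mat` and `demo_samples` are hypothetical stand-ins for the expression matrix and sample info, and this is not how glyexp is implemented internally.

```{r}
# Hypothetical stand-ins: variables in rows, samples in columns
demo_mat <- matrix(
  1:12,
  nrow = 3,
  dimnames = list(c("V1", "V2", "V3"), c("S1", "S2", "S3", "S4"))
)
demo_samples <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),
  group = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Step 1: filter the sample table
kept_samples <- demo_samples[demo_samples$group == "A", ]

# Step 2: remember to subset the matrix columns to match
kept_mat <- demo_mat[, kept_samples$sample, drop = FALSE]

dim(kept_mat)
colnames(kept_mat)
```

Forgetting step 2 leaves the matrix with columns that no longer have a matching row in the sample table, which is the kind of mismatch the `_obs()` and `_var()` functions are meant to rule out.
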
These functions require an `experiment()` object as input and return an `experiment()` object as output. For standard tibbles or data.frames, use regular dplyr functions directly.

### Filtering

Filtering is the most common operation:

#### Sample-Based Filtering with `filter_obs()`

Say you want to focus only on group "A" samples:

```{r}
# Before filtering - let's see what we have
get_sample_info(toy_exp)
```

```{r}
# Filter for group A samples only
filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
```

Check the expression matrix:

```{r}
# Original matrix dimensions:
dim(get_expr_mat(toy_exp))
# Original matrix:
get_expr_mat(toy_exp)
```

```{r}
# Filtered expression matrix - automatically updated!
# Filtered matrix dimensions:
dim(get_expr_mat(filtered_exp))
# Filtered matrix:
get_expr_mat(filtered_exp)
```

The expression matrix columns are automatically filtered to match the remaining samples.

#### Variable-Based Filtering with `filter_var()`

Now let's filter variables and watch the same magic happen:

```{r}
# Filter for specific glycan compositions
var_filtered_exp <- filter_var(toy_exp, glycan_composition == "H5N2")
get_var_info(var_filtered_exp)
```

```{r}
# The expression matrix rows automatically follow suit!
get_expr_mat(var_filtered_exp)
```

**The matrix rows are automatically reduced to match the filtered variables!**

This is the core power of glyexp - you think about your metadata, and the expression data follows your lead.

#### Chaining Filters

Because every function takes and returns an `experiment()` object, the functions work with the pipe, and samples and variables can be filtered in a single chain:

```{r}
double_filtered <- toy_exp |>
  filter_obs(group == "A") |>
  filter_var(glycan_composition %in% c("H5N2", "N3N2"))

# Final dimensions after double filtering:
dim(get_expr_mat(double_filtered))
get_expr_mat(double_filtered)
```

## Index Columns: Guardians of Data Integrity

Index columns (like "sample" and "variable") are essential for maintaining data relationships.
Removing them would break synchronization. Let's see this protection in action:

### Attempting to Remove Index Columns

```{r error=TRUE}
# Try to select everything EXCEPT the sample index column
select_obs(toy_exp, -sample)
```

glyexp throws an error to protect data integrity. The same protection applies to the variable info:

```{r error=TRUE}
# Same protection for variable info
select_var(toy_exp, -variable)
```

Here glyexp throws an error to prevent the "variable" column from being removed.

### Why This Protection Matters

Without index columns, an `experiment()` object would lose its ability to:

- Keep expression matrix and metadata synchronized
- Validate data consistency
- Enable seamless subsetting operations
- Work with other glycoverse packages

In short, the index columns are what tie the three components together.

## Complete Function Reference

glyexp provides dplyr-style equivalents for common data manipulation functions. Each function comes in both `_obs()` and `_var()` variants, and all automatically maintain matrix synchronization.

These functions are methods specifically for `experiment()` objects.
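
As a rough illustration of why all of these functions depend on the index columns, the base R sketch below reorders a sample table and then uses its `sample` column to bring the matrix columns back in line. The objects `idx_mat` and `idx_samples` are hypothetical, and this is not glyexp's internal implementation; it only shows the bookkeeping that an index column makes possible.

```{r}
# Hypothetical stand-ins: variables in rows, samples in columns
idx_mat <- matrix(
  1:6,
  nrow = 2,
  dimnames = list(c("V1", "V2"), c("S1", "S2", "S3"))
)
idx_samples <- data.frame(
  sample = c("S1", "S2", "S3"),
  batch = c(2, 1, 2),
  stringsAsFactors = FALSE
)

# Sort the sample table by batch (what an arrange-style operation does)
sorted_samples <- idx_samples[order(idx_samples$batch), ]

# The index column tells us where each sample sits in the matrix
col_order <- match(sorted_samples$sample, colnames(idx_mat))
sorted_mat <- idx_mat[, col_order, drop = FALSE]

# Consistency check: matrix columns and sample rows line up again
identical(colnames(sorted_mat), sorted_samples$sample)
```

Without the `sample` column there would be nothing to pass to `match()`, so the reordered table could not be reconciled with the matrix.
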

### Core Data Manipulation Functions

| Standard dplyr | Sample Operations | Variable Operations | Description |
|:---|:---|:---|:---|
| `filter()` | `filter_obs()` | `filter_var()` | Subset with sync |
| `select()` | `select_obs()` | `select_var()` | Choose with protection |
| `arrange()` | `arrange_obs()` | `arrange_var()` | Sort with order |
| `mutate()` | `mutate_obs()` | `mutate_var()` | Create with consistency |
| `rename()` | `rename_obs()` | `rename_var()` | Rename with safety |

### Advanced Slicing Functions

| Standard dplyr | Sample Operations | Variable Operations | Description |
|:---|:---|:---|:---|
| `slice()` | `slice_obs()` | `slice_var()` | Position-based selection |
| `slice_head()` | `slice_head_obs()` | `slice_head_var()` | Top n with sync |
| `slice_tail()` | `slice_tail_obs()` | `slice_tail_var()` | Bottom n with sync |
| `slice_sample()` | `slice_sample_obs()` | `slice_sample_var()` | Random with consistency |
| `slice_max()` | `slice_max_obs()` | `slice_max_var()` | Highest values with order |
| `slice_min()` | `slice_min_obs()` | `slice_min_var()` | Lowest values with order |

### Joining Functions

| Standard dplyr | Sample Operations | Variable Operations | Description |
|:---|:---|:---|:---|
| `left_join()` | `left_join_obs()` | `left_join_var()` | Add new columns from another table (left join) |
| `inner_join()` | `inner_join_obs()` | `inner_join_var()` | Add new columns from another table, keeping only matched rows (inner join) |
| `semi_join()` | `semi_join_obs()` | `semi_join_var()` | Keep rows that have a match in another table (semi join) |
| `anti_join()` | `anti_join_obs()` | `anti_join_var()` | Keep rows that have no match in another table (anti join) |

## Function-by-Function Examples

### Selection

```{r}
# Select specific columns from sample info
selected_exp <- select_obs(toy_exp, group, batch)
get_sample_info(selected_exp)
```

```{r}
# Select columns from variable info (notice the index protection!)
var_selected_exp <- select_var(toy_exp, glycan_composition)
get_var_info(var_selected_exp)
```

Use `dplyr`-style helpers like `starts_with()`, `ends_with()`, and `contains()`:

```{r}
# Select columns starting with "glycan"
helper_exp <- select_var(toy_exp, starts_with("glycan"))
get_var_info(helper_exp)
```

### Arrangement

```{r}
# Arrange samples by batch and group
arranged_exp <- arrange_obs(toy_exp, batch, group)
get_sample_info(arranged_exp)
```

Check how the expression matrix columns rearranged to match:

```{r}
# Expression matrix columns follow the new sample order
get_expr_mat(arranged_exp)
```

### Mutation

```{r}
# Add a new calculated column to sample info
mutated_exp <- mutate_obs(
  toy_exp,
  group_batch = paste(group, batch, sep = "_")
)
get_sample_info(mutated_exp)
```

```{r}
# Create a complexity score for variables
complex_exp <- mutate_var(
  toy_exp,
  complexity = nchar(glycan_composition)
)
get_var_info(complex_exp)
```

### Slicing

```{r}
# Take the first 2 samples
head_exp <- slice_head_obs(toy_exp, n = 2)
get_sample_info(head_exp)
```

```{r}
# Expression matrix automatically adjusts
get_expr_mat(head_exp)
```

```{r}
# Sample randomly from variables
set.seed(123) # For reproducibility
random_exp <- slice_sample_var(toy_exp, n = 3)
get_var_info(random_exp)
```

### Renaming

```{r}
# Rename columns in sample info
renamed_exp <- rename_obs(toy_exp, experimental_group = group)
get_sample_info(renamed_exp)
```

The index column "sample" remains protected, but everything else can be renamed freely.

### Joining

These functions can be useful if you have additional information stored in a separate tibble, and you want to add it to your `experiment()` object.

```{r}
# Join additional sample-level information onto the sample info
more_sample_info <- tibble::tibble(
  sample = c("S1", "S2", "S3", "S4", "S5", "S6"),
  age = c(20, 21, 22, 23, 24, 25),
  gender = c("M", "F", "M", "F", "M", "F")
)
joined_exp <- left_join_obs(toy_exp, more_sample_info, by = "sample")
get_sample_info(joined_exp)
```

You might have noticed that we don't have equivalents of `dplyr::right_join()` and `dplyr::full_join()`. This is because, by design, joining functions in `glyexp` should only be used to add new information to your `experiment()` object. `right_join()` and `full_join()` can add observations to the resulting tibble that have no counterpart in the expression matrix, which is not suitable for `experiment()` objects.

For the same reason, the `relationship` parameter is fixed to "many-to-one" for all joining functions in `glyexp`. You probably don't need to know this, but if you do, check out the documentation of `dplyr::left_join()` for more details.

## Advanced Patterns: Chaining for Complex Operations

The real power emerges when you chain multiple operations together. Here are some patterns:

### Pattern 1: Filter → Select → Arrange

```{r}
complex_pipeline <- toy_exp |>
  filter_obs(group == "A") |>
  select_obs(group, batch) |>
  arrange_obs(desc(batch)) |>
  filter_var(protein == "PRO1") |>
  select_var(glycan_composition, protein)

print("Final pipeline result:")
print(complex_pipeline)
```

### Pattern 2: Mutate → Filter → Slice

```{r}
analytical_pipeline <- toy_exp |>
  mutate_var(composition_length = nchar(glycan_composition)) |>
  filter_var(composition_length >= 4) |>
  slice_max_var(composition_length, n = 3)

get_var_info(analytical_pipeline)
```

### Pattern 3: Random Sampling for Testing

```{r}
# Create a smaller dataset for testing
set.seed(456)
test_exp <- toy_exp |>
  slice_sample_obs(n = 3) |>
  slice_sample_var(n = 4)

print("Test dataset dimensions:")
print(test_exp)
```

## When dplyr-Style Functions Cannot Help

Sometimes you need functionality beyond what glyexp's dplyr-style functions provide.
In that case, extract the tibbles and use any dplyr function you want.

### Why Doesn't glyexp Implement All dplyr Functions?

glyexp only implements functions that preserve the synchronized multi-table structure of `experiment()` objects. Functions like `count()`, `distinct()`, `summarise()`, and `pull()` return aggregated or extracted results that no longer line up with the expression matrix, so they would break the original data relationships.

For these operations, extract the relevant tibble and use standard dplyr functions:

```{r}
# For counts per group
toy_exp |>
  get_sample_info() |>
  count(group)
```

```{r}
# For distinct values
toy_exp |>
  get_var_info() |>
  distinct(protein) |>
  pull(protein)
```

```{r}
# For advanced filtering with multiple conditions
complex_filter_conditions <- toy_exp |>
  get_sample_info() |>
  filter(group == "A", batch == 2) |>
  pull(sample)

# Then use the results to subset your experiment
filtered_by_complex <- filter_obs(toy_exp, sample %in% complex_filter_conditions)
```

## Common Pitfalls and How to Avoid Them

### Pitfall 1: Using glyexp Functions on Non-Experiment Objects

This won't work:

```{r error=TRUE}
library(tibble)
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter_obs(regular_tibble, group == "A")
```

Do this instead:

```{r}
# Plain tibbles: use dplyr directly
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter(regular_tibble, group == "A")

# experiment() objects: use the glyexp variants
filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
```

### Pitfall 2: Forgetting the Synchronization

Don't do this:

```{r eval=FALSE}
# Only the extracted tibble is filtered; toy_exp and its
# expression matrix still contain every sample
sample_info <- get_sample_info(toy_exp)
filtered_samples <- filter(sample_info, group == "A")
```

Do this instead:

```{r}
filtered_exp <- filter_obs(toy_exp, group == "A")
```

### Pitfall 3: Trying to Remove Index Columns

This won't work as expected:

```{r error=TRUE}
select_obs(toy_exp, -sample)
```

Embrace the protection:

```{r}
clean_exp <- select_obs(toy_exp, group, batch)
get_sample_info(clean_exp)
```

### Pitfall 4: Mismatched Operations

Don't mix operations inappropriately:

```{r eval=FALSE}
# glycan_composition is a variable info column, not a sample info column
arrange_obs(toy_exp, glycan_composition)
```

Use the right function for the right data:

```{r}
arranged_by_composition <- arrange_var(toy_exp, glycan_composition)
get_var_info(arranged_by_composition)
```

## Performance Considerations

glyexp's dplyr-style functions are designed to be fast, safe, and consistent. For large datasets, consider:

- Filtering early in your pipeline to reduce data size
- Using `select_obs()` and `select_var()` to keep only needed columns
- Chaining operations efficiently to minimize intermediate copies

```{r}
# Efficient pipeline: filter first, then manipulate
efficient_pipeline <- toy_exp |>
  filter_obs(group == "A") |>       # Reduce samples early
  filter_var(protein == "PRO1") |>  # Reduce variables early
  select_obs(group) |>              # Keep only needed sample columns
  select_var(glycan_composition)    # Keep only needed variable columns
```

## Philosophy Behind the Design

glyexp's dplyr-style functions embody a simple philosophy: **"Think about your metadata, and let the data follow."**

This design means:

1. **Mental Model Alignment**: Think in terms of samples and variables, not matrix indices
2. **Error Prevention**: Automatic synchronization prevents common data analysis mistakes
3. **Familiar Syntax**: If you know dplyr, you already know most of glyexp
4. **Composability**: Functions chain together naturally for complex analyses

## Summary

glyexp's dplyr-style functions are experiment-specific data manipulators designed exclusively for `experiment()` objects. They provide:

- **Automatic Synchronization**: Operations on metadata automatically update the expression matrix
- **Index Column Protection**: Critical relationship columns are protected from deletion
- **Familiar Syntax**: Standard dplyr operations with multi-table awareness
- **Type-Aware Operations**: `_obs()` for samples, `_var()` for variables

Start with `filter_obs()` and `select_var()`, then build complex pipelines.