This vignette describes glyexp’s dplyr-style functions for synchronized data manipulation.
When working with multi-table datasets, filtering one table can desynchronize your data from other components. Rearranging another table can break carefully established relationships.
glyexp’s dplyr-style functions address this by tracking the connection between your expression matrix, sample information, and variable annotations. When you transform one component, the others stay synchronized.
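To make the problem concrete, here is a minimal base R sketch of the manual bookkeeping these functions are meant to replace. The objects `expr_mat` and `sample_info` are hypothetical stand-ins built for illustration; they are not glyexp objects, and this is not how glyexp is implemented internally.

```r
# Hypothetical stand-ins for an expression matrix and its sample table
expr_mat <- matrix(
  1:24, nrow = 4,
  dimnames = list(paste0("V", 1:4), paste0("S", 1:6))
)
sample_info <- data.frame(
  sample = paste0("S", 1:6),
  group = rep(c("A", "B"), each = 3),
  stringsAsFactors = FALSE
)

# Filtering only the sample table leaves the matrix out of sync
kept_info <- sample_info[sample_info$group == "A", ]
ncol(expr_mat) == nrow(kept_info) # FALSE: 6 columns vs 3 rows

# Manual fix: subset the matrix columns by the surviving sample IDs
kept_mat <- expr_mat[, kept_info$sample, drop = FALSE]
identical(colnames(kept_mat), kept_info$sample) # TRUE
```

Every manual step like the last one is a place where the two tables can drift apart if it is forgotten.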
Note: These functions only work with
experiment() objects - they cannot be used on regular
data.frames, tibbles, or other data structures.
library(glyexp)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(conflicted)
conflicts_prefer(glyexp::select_var)
#> [conflicted] Will prefer glyexp::select_var over any other package.
conflicts_prefer(dplyr::filter)
#> [conflicted] Will prefer dplyr::filter over any other package.
glyexp’s dplyr-style functions work on three components: the expression matrix, the sample information, and the variable information.
In traditional data analysis, filtering samples requires manually updating all related tables. glyexp’s dplyr-style functions handle this synchronization automatically.
Here’s an example:
toy_exp <- toy_experiment
print(toy_exp)
#>
#> ── Others Experiment ───────────────────────────────────────────────────────────
#> ℹ Expression matrix: 6 samples, 4 variables
#> ℹ Sample information fields: group <chr>, batch <dbl>
#> ℹ Variable information fields: protein <chr>, peptide <chr>, glycan_composition <chr>
Every dplyr-style function in glyexp comes in two variants, _obs() and _var():
- _obs() functions work on sample information in experiment() objects.
- _var() functions work on variable annotations in experiment() objects.
Both variants automatically update the expression matrix to maintain synchronization.
These functions require an experiment() object as input
and return an experiment() object as output. For standard
tibbles or data.frames, use regular dplyr functions directly.
Filtering is the most common operation:
Say you want to keep only the group “A” samples, using filter_obs():
# Before filtering - let's see what we have
get_sample_info(toy_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
# Filter for group A samples only
filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
#> # A tibble: 3 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
Check the expression matrix:
# Original matrix dimensions:
dim(get_expr_mat(toy_exp))
#> [1] 4 6
# Original matrix:
get_expr_mat(toy_exp)
#> S1 S2 S3 S4 S5 S6
#> V1 1 5 9 13 17 21
#> V2 2 6 10 14 18 22
#> V3 3 7 11 15 19 23
#> V4 4 8 12 16 20 24
# Filtered expression matrix - automatically updated!
# Filtered matrix dimensions:
dim(get_expr_mat(filtered_exp))
#> [1] 4 3
# Filtered matrix:
get_expr_mat(filtered_exp)
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
#> V3 3 7 11
#> V4 4 8 12
The expression matrix is automatically filtered to match the remaining samples.
Now let’s filter variables with filter_var() and watch the same thing happen:
# Filter for specific glycan compositions
var_filtered_exp <- filter_var(toy_exp, glycan_composition == "H5N2")
get_var_info(var_filtered_exp)
#> # A tibble: 2 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V1 PRO1 PEP1 H5N2
#> 2 V2 PRO2 PEP2 H5N2
# The expression matrix rows automatically follow suit!
get_expr_mat(var_filtered_exp)
#> S1 S2 S3 S4 S5 S6
#> V1 1 5 9 13 17 21
#> V2 2 6 10 14 18 22
The matrix rows are automatically reduced to match the filtered variables. This is the core idea of glyexp: you work with your metadata, and the expression data follows.
Both samples and variables can be filtered by chaining operations:
double_filtered <- toy_exp |>
filter_obs(group == "A") |>
filter_var(glycan_composition %in% c("H5N2", "N3N2"))
# Final dimensions after double filtering:
dim(get_expr_mat(double_filtered))
#> [1] 2 3
get_expr_mat(double_filtered)
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
As this example shows, the functions work with the pipe operator.
Index columns (like “sample” and “variable”) are essential for maintaining data relationships. Removing them would break synchronization.
Let’s see this protection in action:
# Try to select everything EXCEPT the sample index column
protective_exp <- select_obs(toy_exp, -sample)
#> Error:
#> ! You should not explicitly select or deselect the "sample" column in
#> `sample_info`.
#> ℹ The "sample" column will be handled by `select_obs()` or `select_var()`
#> automatically.
get_sample_info(protective_exp)
#> Error:
#> ! object 'protective_exp' not found
glyexp throws an error to protect data integrity:
# Same protection for variable info
protective_var_exp <- select_var(toy_exp, -variable)
#> Error:
#> ! You should not explicitly select or deselect the "variable" column in
#> `var_info`.
#> ℹ The "variable" column will be handled by `select_obs()` or `select_var()`
#> automatically.
get_var_info(protective_var_exp)
#> Error:
#> ! object 'protective_var_exp' not found
Similarly, glyexp throws an error to protect the “variable” column from being removed.
Without index columns, an experiment() object would lose its ability to keep the expression matrix synchronized with the sample and variable information. Index columns are essential for maintaining these data relationships.
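The role of an index column can be sketched in base R. The `var_info` table and `expr_mat` matrix below are hypothetical stand-ins that mimic the toy data; the lookup by name illustrates the general idea and is not a description of glyexp's internals.

```r
# Hypothetical variable table whose "variable" column indexes the matrix rows
var_info <- data.frame(
  variable = c("V1", "V2", "V3", "V4"),
  glycan_composition = c("H5N2", "H5N2", "H3N2", "H3N2"),
  stringsAsFactors = FALSE
)
expr_mat <- matrix(
  1:24, nrow = 4,
  dimnames = list(var_info$variable, paste0("S", 1:6))
)

# With the index column, surviving rows can be looked up by name
kept <- var_info[var_info$glycan_composition == "H5N2", ]
sub_mat <- expr_mat[kept$variable, , drop = FALSE]
rownames(sub_mat) # "V1" "V2"

# Once the index column is dropped, nothing ties the table to the matrix
kept_no_index <- kept[, setdiff(names(kept), "variable"), drop = FALSE]
"variable" %in% names(kept_no_index) # FALSE
```

After the index column is gone, the remaining annotation rows can no longer be matched to matrix rows by name, which is the situation the error messages above guard against.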
glyexp provides dplyr-style equivalents for common data manipulation
functions. Each function comes in both _obs() and
_var() variants, and all automatically maintain matrix
synchronization.
These functions are methods specifically for
experiment() objects.
| Standard dplyr | Sample Operations | Variable Operations | Description |
|---|---|---|---|
| filter() | filter_obs() | filter_var() | Subset with sync |
| select() | select_obs() | select_var() | Choose with protection |
| arrange() | arrange_obs() | arrange_var() | Sort with order |
| mutate() | mutate_obs() | mutate_var() | Create with consistency |
| rename() | rename_obs() | rename_var() | Rename with safety |
| Standard dplyr | Sample Operations | Variable Operations | Description |
|---|---|---|---|
| slice() | slice_obs() | slice_var() | Position-based selection |
| slice_head() | slice_head_obs() | slice_head_var() | Top n with sync |
| slice_tail() | slice_tail_obs() | slice_tail_var() | Bottom n with sync |
| slice_sample() | slice_sample_obs() | slice_sample_var() | Random with consistency |
| slice_max() | slice_max_obs() | slice_max_var() | Highest values with order |
| slice_min() | slice_min_obs() | slice_min_var() | Lowest values with order |
| Standard dplyr | Sample Operations | Variable Operations | Description |
|---|---|---|---|
| left_join() | left_join_obs() | left_join_var() | Add new columns from another table (left join) |
| inner_join() | inner_join_obs() | inner_join_var() | Add new columns from another table (inner join) |
| semi_join() | semi_join_obs() | semi_join_var() | Filter rows from another table (semi join) |
| anti_join() | anti_join_obs() | anti_join_var() | Filter rows from another table (anti join) |
# Select specific columns from sample info
selected_exp <- select_obs(toy_exp, group, batch)
get_sample_info(selected_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
# Select columns from variable info (notice the index protection!)
var_selected_exp <- select_var(toy_exp, glycan_composition)
get_var_info(var_selected_exp)
#> # A tibble: 4 × 2
#> variable glycan_composition
#> <chr> <chr>
#> 1 V1 H5N2
#> 2 V2 H5N2
#> 3 V3 H3N2
#> 4 V4 H3N2
select_obs() and select_var() also accept dplyr-style helpers like starts_with(), ends_with(), and contains().
# Arrange samples by batch and group
arranged_exp <- arrange_obs(toy_exp, batch, group)
get_sample_info(arranged_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S3 A 1
#> 3 S5 B 1
#> 4 S2 A 2
#> 5 S4 B 2
#> 6 S6 B 2
The expression matrix columns are rearranged to match; call get_expr_mat(arranged_exp) to check.
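What synchronized sorting amounts to can be sketched in base R: compute one permutation from the sample table and apply it to both the table rows and the matrix columns. The objects below are hypothetical stand-ins mirroring the toy data, and the approach is an illustration rather than glyexp's actual implementation.

```r
# Hypothetical stand-ins mirroring the toy sample table and matrix
sample_info <- data.frame(
  sample = paste0("S", 1:6),
  group = rep(c("A", "B"), each = 3),
  batch = c(1, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)
expr_mat <- matrix(
  1:24, nrow = 4,
  dimnames = list(paste0("V", 1:4), sample_info$sample)
)

# One permutation, sorted by batch and then group
ord <- order(sample_info$batch, sample_info$group)

# Apply it to the table rows and to the matrix columns alike
sorted_info <- sample_info[ord, ]
sorted_mat <- expr_mat[, ord, drop = FALSE]

colnames(sorted_mat) # "S1" "S3" "S5" "S2" "S4" "S6"
```

The column order matches the sample order shown in the arranged sample information above.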
# Add a new calculated column to sample info
mutated_exp <- mutate_obs(
toy_exp,
group_batch = paste(group, batch, sep = "_")
)
get_sample_info(mutated_exp)
#> # A tibble: 6 × 4
#> sample group batch group_batch
#> <chr> <chr> <dbl> <chr>
#> 1 S1 A 1 A_1
#> 2 S2 A 2 A_2
#> 3 S3 A 1 A_1
#> 4 S4 B 2 B_2
#> 5 S5 B 1 B_1
#> 6 S6 B 2 B_2
# Create a complexity score for variables
complex_exp <- mutate_var(
toy_exp,
complexity = nchar(glycan_composition)
)
get_var_info(complex_exp)
#> # A tibble: 4 × 5
#> variable protein peptide glycan_composition complexity
#> <chr> <chr> <chr> <chr> <int>
#> 1 V1 PRO1 PEP1 H5N2 4
#> 2 V2 PRO2 PEP2 H5N2 4
#> 3 V3 PRO3 PEP3 H3N2 4
#> 4 V4 PRO3 PEP4 H3N2 4
# Take the first 2 samples
head_exp <- slice_head_obs(toy_exp, n = 2)
get_sample_info(head_exp)
#> # A tibble: 2 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
# Expression matrix automatically adjusts
get_expr_mat(head_exp)
#> S1 S2
#> V1 1 5
#> V2 2 6
#> V3 3 7
#> V4 4 8
# Sample randomly from variables
set.seed(123) # For reproducibility
random_exp <- slice_sample_var(toy_exp, n = 3)
get_var_info(random_exp)
#> # A tibble: 3 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V3 PRO3 PEP3 H3N2
#> 2 V4 PRO3 PEP4 H3N2
#> 3 V1 PRO1 PEP1 H5N2
# Rename columns in sample info
renamed_exp <- rename_obs(toy_exp, experimental_group = group)
get_sample_info(renamed_exp)
#> # A tibble: 6 × 3
#> sample experimental_group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
The index column “sample” remains protected, but every other column can be renamed freely.
The join functions are useful when you have additional information
stored in a separate tibble and want to add it to your
experiment() object.
# Add extra sample information from a separate tibble
more_sample_info <- tibble::tibble(
sample = c("S1", "S2", "S3", "S4", "S5", "S6"),
age = c(20, 21, 22, 23, 24, 25),
gender = c("M", "F", "M", "F", "M", "F")
)
joined_exp <- left_join_obs(toy_exp, more_sample_info, by = "sample")
get_sample_info(joined_exp)
#> # A tibble: 6 × 5
#> sample group batch age gender
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 S1 A 1 20 M
#> 2 S2 A 2 21 F
#> 3 S3 A 1 22 M
#> 4 S4 B 2 23 F
#> 5 S5 B 1 24 M
#> 6 S6 B 2 25 F
You might have noticed that glyexp has no equivalents of
dplyr::right_join() and dplyr::full_join().
This is by design: the join functions in glyexp
should only be used to add new information to your
experiment() object. right_join() and
full_join() can add observations to the resulting
tibbles, which is not suitable for experiment()
objects.
For the same reason, the relationship parameter is fixed
to “many-to-one” for all joining functions in glyexp. You
probably don’t need to know this, but if you do, check out the
documentation of dplyr::left_join() for more details.
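The row-count argument can be illustrated with base R's merge(), used here only as a stand-in for the dplyr joins. The two small tables are hypothetical, and the sketch says nothing about how glyexp implements its join functions.

```r
# Hypothetical sample table and an extra table with one unknown sample (S4)
sample_info <- data.frame(
  sample = c("S1", "S2", "S3"),
  group = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
extra <- data.frame(
  sample = c("S2", "S3", "S4"),
  age = c(21, 22, 23),
  stringsAsFactors = FALSE
)

# Left join: every original sample is kept, no new rows appear
left <- merge(sample_info, extra, by = "sample", all.x = TRUE)
nrow(left) # 3

# Full join: S4 is added, although no matrix column exists for it
full <- merge(sample_info, extra, by = "sample", all = TRUE)
nrow(full) # 4
setdiff(full$sample, sample_info$sample) # "S4"
```

A sample such as S4 would have metadata but no expression values, which is why joins that can add rows do not fit a synchronized structure.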
The real power emerges when you chain multiple operations together. Here are some patterns:
complex_pipeline <- toy_exp |>
filter_obs(group == "A") |>
select_obs(group, batch) |>
arrange_obs(desc(batch)) |>
filter_var(protein == "PRO1") |>
select_var(glycan_composition, protein)
print("Final pipeline result:")
#> [1] "Final pipeline result:"
print(complex_pipeline)
#>
#> ── Others Experiment ───────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 1 variables
#> ℹ Sample information fields: group <chr>, batch <dbl>
#> ℹ Variable information fields: glycan_composition <chr>, protein <chr>
analytical_pipeline <- toy_exp |>
mutate_var(composition_length = nchar(glycan_composition)) |>
filter_var(composition_length >= 4) |>
slice_max_var(composition_length, n = 3)
get_var_info(analytical_pipeline)
#> # A tibble: 4 × 5
#> variable protein peptide glycan_composition composition_length
#> <chr> <chr> <chr> <chr> <int>
#> 1 V1 PRO1 PEP1 H5N2 4
#> 2 V2 PRO2 PEP2 H5N2 4
#> 3 V3 PRO3 PEP3 H3N2 4
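#> 4 V4 PRO3 PEP4 H3N2 4
All four variables tie on composition_length, so n = 3 still returns four rows, consistent with the default tie handling (with_ties = TRUE) of dplyr::slice_max().
# Create a smaller dataset for testing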
set.seed(456)
test_exp <- toy_exp |>
slice_sample_obs(n = 3) |>
slice_sample_var(n = 4)
print("Test dataset dimensions:")
#> [1] "Test dataset dimensions:"
print(test_exp)
#>
#> ── Others Experiment ───────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 4 variables
#> ℹ Sample information fields: group <chr>, batch <dbl>
#> ℹ Variable information fields: protein <chr>, peptide <chr>, glycan_composition <chr>
Sometimes you need functionality beyond what glyexp’s dplyr-style functions provide. In that case, extract the tibbles and use any dplyr function you want.
glyexp only implements functions that preserve the synchronized
multi-table structure of experiment() objects.
Functions like count(), distinct(),
summarise(), and pull() return aggregated
results that break the original data relationships. For these
operations, extract the relevant tibble and use standard dplyr
functions:
# For complex aggregations
toy_exp |>
get_sample_info() |>
count(group)
#> # A tibble: 2 × 2
#> group n
#> <chr> <int>
#> 1 A 3
#> 2 B 3
This won’t work:
library(tibble)
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter_obs(regular_tibble, group == "A")
#> Error in `filter_info_data()`:
#> ! is_experiment(exp) is not TRUE
Do this instead:
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter(regular_tibble, group == "A")
#> # A tibble: 1 × 2
#> group value
#> <chr> <dbl>
#> 1 A 1
filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
#> # A tibble: 3 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
Don’t do this:
Do this instead:
This won’t work as expected:
select_obs(toy_exp, -sample)
#> Error:
#> ! You should not explicitly select or deselect the "sample" column in
#> `sample_info`.
#> ℹ The "sample" column will be handled by `select_obs()` or `select_var()`
#> automatically.
Embrace the protection:
Don’t mix operations inappropriately:
Use the right function for the right data:
glyexp’s dplyr-style functions are designed to be fast, safe, and consistent.
For large datasets, consider:
- Filtering early with filter_obs() and filter_var() to reduce the data size
- Using select_obs() and select_var() to keep only needed columns
# Efficient pipeline: filter first, then manipulate
efficient_pipeline <- toy_exp |>
filter_obs(group == "A") |> # Reduce samples early
filter_var(protein == "PRO1") |> # Reduce variables early
select_obs(group) |> # Keep only needed sample columns
select_var(glycan_composition) # Keep only needed variable columns
glyexp’s dplyr-style functions embody a simple philosophy:
“Think about your metadata, and let the data follow.”
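This design means that you manipulate the sample or variable information, and the expression matrix is kept in sync for you.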
glyexp’s dplyr-style functions are experiment-specific data
manipulators designed exclusively for experiment() objects.
They provide two variants of each operation: _obs() for
samples and _var() for variables.
Start with filter_obs() and select_var(),
then build complex pipelines.