---
title: "dplyr-Style Functions: Data Harmony in Action"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{dplyr-Style Functions: Data Harmony in Action}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette describes glyexp's dplyr-style functions for synchronized data manipulation.

When working with multi-table datasets, filtering one table can desynchronize your data from other components.
Rearranging another table can break carefully established relationships.

**glyexp's dplyr-style functions** address this by understanding the connection between your expression matrix,
sample information, and variable annotations.
When you transform one component, the others are updated to match.

**Note:** These functions only work with `experiment()` objects;
they cannot be used on regular data.frames, tibbles, or other data structures.

```{r setup}
library(glyexp)
library(dplyr)
library(conflicted)

conflicts_prefer(glyexp::select_var)
conflicts_prefer(dplyr::filter)
```

## Core Philosophy: One Action, Three Updates

glyexp's dplyr-style functions work on three components:

1. **Expression Matrix**: Numerical data
2. **Sample Info**: Experimental metadata
3. **Variable Info**: Molecular annotations

In traditional data analysis, filtering samples requires manually updating all related tables.
glyexp's dplyr-style functions handle this synchronization automatically.

Here's the toy experiment used throughout this vignette:

```{r}
toy_exp <- toy_experiment
print(toy_exp)
```
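
To make the problem concrete, here is a minimal base R sketch of the manual bookkeeping that the dplyr-style functions are meant to replace. The objects `expr_mat` and `sample_info` are made-up stand-ins for illustration, not glyexp internals.

```r
# Hypothetical stand-ins for two of the components (not glyexp internals)
expr_mat <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("V1", "V2", "V3"), c("S1", "S2", "S3", "S4"))
)
sample_info <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),
  group = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Filtering the metadata alone leaves the matrix out of step
sample_info_a <- sample_info[sample_info$group == "A", ]
ncol(expr_mat) == nrow(sample_info_a)  # FALSE: 4 columns vs 2 samples

# The manual fix: subset the matrix columns by the surviving sample IDs
expr_mat_a <- expr_mat[, sample_info_a$sample, drop = FALSE]
identical(colnames(expr_mat_a), sample_info_a$sample)  # TRUE
```

Every metadata operation needs a matching matrix operation like the last step, and forgetting it is easy.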

## Two Flavors: `_obs()` and `_var()`

Every dplyr-style function in glyexp comes in two variants:

- **`_obs()` functions**: Work on sample information in `experiment()` objects
- **`_var()` functions**: Work on variable annotations in `experiment()` objects

Both variants automatically update the expression matrix to maintain synchronization.

These functions require an `experiment()` object as input and return an `experiment()` object as output.
For standard tibbles or data.frames, use regular dplyr functions directly.

### Filtering

Filtering is the most common operation, so we start there.

#### Sample-Based Filtering with `filter_obs()`

Say you want to focus only on group "A" samples:

```{r}
# Before filtering - let's see what we have
get_sample_info(toy_exp)
```

```{r}
# Filter for group A samples only
filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
```

Check the expression matrix:

```{r}
# Original matrix dimensions:
dim(get_expr_mat(toy_exp))

# Original matrix:
get_expr_mat(toy_exp)
```

```{r}
# Filtered expression matrix - automatically updated!

# Filtered matrix dimensions:
dim(get_expr_mat(filtered_exp))

# Filtered matrix:
get_expr_mat(filtered_exp)
```

The expression matrix is automatically filtered to match the remaining samples.

#### Variable-Based Filtering with `filter_var()`

Now let's filter variables and watch the same magic happen:

```{r}
# Filter for specific glycan compositions
var_filtered_exp <- filter_var(toy_exp, glycan_composition == "H5N2")
get_var_info(var_filtered_exp)
```

```{r}
# The expression matrix rows automatically follow suit!
get_expr_mat(var_filtered_exp)
```

**The matrix rows are automatically reduced to match the filtered variables!**
This is the core power of glyexp:
you think about your metadata,
and the expression data follows your lead.

#### Chaining Filters

Both samples and variables can be filtered by chaining operations:

```{r}
double_filtered <- toy_exp |>
  filter_obs(group == "A") |>
  filter_var(glycan_composition %in% c("H5N2", "N3N2"))

# Final dimensions after double filtering:
dim(get_expr_mat(double_filtered))
get_expr_mat(double_filtered)
```

Because each function takes and returns an `experiment()` object, these functions work naturally with the pipe.
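
Conceptually, combining a sample filter with a variable filter amounts to subsetting the matrix along both dimensions. The base R sketch below uses made-up objects to illustrate the idea; it is an assumption about the general technique, not a description of how glyexp is implemented.

```r
# Made-up data for illustration only
expr_mat <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("V1", "V2"), c("S1", "S2", "S3", "S4"))
)
sample_info <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),
  group = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
var_info <- data.frame(
  variable = c("V1", "V2"),
  protein = c("PRO1", "PRO2"),
  stringsAsFactors = FALSE
)

# Decide what to keep from the metadata alone
keep_samples <- sample_info$sample[sample_info$group == "A"]
keep_vars <- var_info$variable[var_info$protein == "PRO1"]

# drop = FALSE keeps the result a matrix even when a single row remains
sub_mat <- expr_mat[keep_vars, keep_samples, drop = FALSE]
dim(sub_mat)  # 1 2
```

Without `drop = FALSE`, base R would collapse the single remaining row into a plain vector and lose the row name.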

## Index Columns: Guardians of Data Integrity

Index columns (like "sample" and "variable") are essential for maintaining data relationships.
Removing them would break synchronization.

Let's see this protection in action:

### Attempting to Remove Index Columns

```{r error=TRUE}
# Try to select everything EXCEPT the sample index column
protective_exp <- select_obs(toy_exp, -sample)
get_sample_info(protective_exp)
```

glyexp throws an error to protect data integrity. The same protection applies to variable info:

```{r error=TRUE}
# Same protection for variable info
protective_var_exp <- select_var(toy_exp, -variable)
get_var_info(protective_var_exp)
```

Similarly, glyexp throws an error to protect the "variable" column from being removed.

### Why This Protection Matters

Without index columns, an `experiment()` object would lose its ability to:

- Keep expression matrix and metadata synchronized
- Validate data consistency
- Enable seamless subsetting operations
- Work with other glycoverse packages

Index columns are essential for maintaining data relationships.
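
As a rough illustration of why the index columns matter, the sketch below defines a hypothetical `check_sync()` helper in base R. The helper and its data are invented for this example and are not part of glyexp.

```r
# Hypothetical helper: checks that the index columns agree with the matrix
check_sync <- function(expr_mat, sample_info, var_info) {
  identical(colnames(expr_mat), as.character(sample_info$sample)) &&
    identical(rownames(expr_mat), as.character(var_info$variable))
}

expr_mat <- matrix(1:4, nrow = 2, dimnames = list(c("V1", "V2"), c("S1", "S2")))
sample_info <- data.frame(sample = c("S1", "S2"), group = c("A", "B"))
var_info <- data.frame(variable = c("V1", "V2"))

ok_before <- check_sync(expr_mat, sample_info, var_info)
ok_before  # TRUE

# Without the index column there is nothing left to compare against
sample_info$sample <- NULL
ok_after <- check_sync(expr_mat, sample_info, var_info)
ok_after  # FALSE
```

Once the `sample` column is gone, the metadata rows can no longer be matched to matrix columns by name.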

## Complete Function Reference

glyexp provides dplyr-style equivalents for common data manipulation functions.
Each function comes in both `_obs()` and `_var()` variants, and all automatically maintain matrix synchronization.

These functions are designed specifically for `experiment()` objects.

### Core Data Manipulation Functions

| Standard dplyr | Sample Operations | Variable Operations | Description |
|:---|:---|:---|:---|
| `filter()` | `filter_obs()` | `filter_var()` | Keep rows matching a condition; the matrix is subset to match |
| `select()` | `select_obs()` | `select_var()` | Choose columns; index columns are protected |
| `arrange()` | `arrange_obs()` | `arrange_var()` | Sort rows; the matrix is reordered to match |
| `mutate()` | `mutate_obs()` | `mutate_var()` | Add or modify columns |
| `rename()` | `rename_obs()` | `rename_var()` | Rename columns; index columns are protected |

### Advanced Slicing Functions

| Standard dplyr | Sample Operations | Variable Operations | Description |
|:---|:---|:---|:---|
| `slice()` | `slice_obs()` | `slice_var()` | Select rows by position |
| `slice_head()` | `slice_head_obs()` | `slice_head_var()` | First n rows |
| `slice_tail()` | `slice_tail_obs()` | `slice_tail_var()` | Last n rows |
| `slice_sample()` | `slice_sample_obs()` | `slice_sample_var()` | Random rows |
| `slice_max()` | `slice_max_obs()` | `slice_max_var()` | Rows with the highest values of a column |
| `slice_min()` | `slice_min_obs()` | `slice_min_var()` | Rows with the lowest values of a column |

### Joining Functions

| Standard dplyr | Sample Operations | Variable Operations | Description |
|:---|:---|:---|:---|
| `left_join()` | `left_join_obs()` | `left_join_var()` | Add new columns from another table (left join) |
| `inner_join()` | `inner_join_obs()` | `inner_join_var()` | Add new columns from another table, keeping only matched rows (inner join) |
| `semi_join()` | `semi_join_obs()` | `semi_join_var()` | Keep rows that have a match in another table (semi join) |
| `anti_join()` | `anti_join_obs()` | `anti_join_var()` | Keep rows that have no match in another table (anti join) |
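
If the filtering joins are unfamiliar, the sketch below shows how the plain dplyr versions behave on ordinary tibbles. The `passed_qc` lookup table is a made-up example; the `_obs()` and `_var()` variants are assumed to apply the same row logic to the experiment's metadata.

```r
library(dplyr)

sample_info <- tibble(sample = c("S1", "S2", "S3"), group = c("A", "A", "B"))
passed_qc <- tibble(sample = c("S1", "S3"))  # hypothetical lookup table

# semi_join keeps rows with a match; no columns are added
kept <- semi_join(sample_info, passed_qc, by = "sample")
kept$sample     # "S1" "S3"

# anti_join keeps rows without a match
dropped <- anti_join(sample_info, passed_qc, by = "sample")
dropped$sample  # "S2"
```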

## Function-by-Function Examples

### Selection

```{r}
# Select specific columns from sample info
selected_exp <- select_obs(toy_exp, group, batch)
get_sample_info(selected_exp)
```

```{r}
# Select columns from variable info (notice the index protection!)
var_selected_exp <- select_var(toy_exp, glycan_composition)
get_var_info(var_selected_exp)
```

Use tidyselect helpers like `starts_with()`, `ends_with()`, and `contains()`, just as you would with `dplyr::select()`:

```{r}
# Select columns starting with "glycan"
helper_exp <- select_var(toy_exp, starts_with("glycan"))
get_var_info(helper_exp)
```

### Arrangement

```{r}
# Arrange samples by batch and group
arranged_exp <- arrange_obs(toy_exp, batch, group)
get_sample_info(arranged_exp)
```

Check how the expression matrix columns rearranged to match:

```{r}
# Expression matrix columns follow the new sample order
get_expr_mat(arranged_exp)
```
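
For intuition, keeping a matrix in step with sorted metadata can be sketched in base R by reusing one ordering for both objects. The data below are made up, and this is an illustration of the idea rather than glyexp's actual implementation.

```r
# Made-up data for illustration only
expr_mat <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("V1", "V2"), c("S1", "S2", "S3"))
)
sample_info <- data.frame(
  sample = c("S1", "S2", "S3"),
  batch = c(2, 1, 2),
  stringsAsFactors = FALSE
)

# Sort the metadata, then reuse the same ordering on the matrix columns
ord <- order(sample_info$batch)
sample_info_sorted <- sample_info[ord, ]
expr_mat_sorted <- expr_mat[, ord, drop = FALSE]

colnames(expr_mat_sorted)  # "S2" "S1" "S3"
```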

### Mutation

```{r}
# Add a new calculated column to sample info
mutated_exp <- mutate_obs(
  toy_exp,
  group_batch = paste(group, batch, sep = "_")
)
get_sample_info(mutated_exp)
```

```{r}
# Create a complexity score for variables
complex_exp <- mutate_var(
  toy_exp,
  complexity = nchar(glycan_composition)
)
get_var_info(complex_exp)
```

### Slicing

```{r}
# Take the first 2 samples
head_exp <- slice_head_obs(toy_exp, n = 2)
get_sample_info(head_exp)
```

```{r}
# Expression matrix automatically adjusts
get_expr_mat(head_exp)
```

```{r}
# Sample randomly from variables
set.seed(123)  # For reproducibility
random_exp <- slice_sample_var(toy_exp, n = 3)
get_var_info(random_exp)
```

### Renaming

```{r}
# Rename columns in sample info
renamed_exp <- rename_obs(toy_exp, experimental_group = group)
get_sample_info(renamed_exp)
```

The index column "sample" remains protected, but everything else can be renamed freely.

### Joining

These functions can be useful if you have additional information stored in a separate tibble,
and you want to add it to your `experiment()` object.

```{r}
# Join extra sample-level information onto the sample info
more_sample_info <- tibble::tibble(
  sample = c("S1", "S2", "S3", "S4", "S5", "S6"),
  age = c(20, 21, 22, 23, 24, 25),
  gender = c("M", "F", "M", "F", "M", "F")
)
joined_exp <- left_join_obs(toy_exp, more_sample_info, by = "sample")
get_sample_info(joined_exp)
```

You might have noticed that `glyexp` has no counterparts for `dplyr::right_join()` and `dplyr::full_join()`.
By design, joining functions in `glyexp` should only add new information to your `experiment()` object.
`right_join()` and `full_join()` can add observations that have no counterpart in the expression matrix,
which is not suitable for `experiment()` objects.

For the same reason, the `relationship` parameter is fixed to "many-to-one" for all joining functions in `glyexp`:
each row of the sample or variable information may match at most one row of the other table, so a join can never duplicate observations.
See the documentation of `dplyr::left_join()` for more details.
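
The plain dplyr sketch below illustrates both points on ordinary tibbles. It assumes dplyr 1.1.0 or later (where the `relationship` argument was introduced), and the `extra` table is made up for the example.

```r
library(dplyr)

sample_info <- tibble(sample = c("S1", "S2"), group = c("A", "B"))

# A lookup table with a duplicated key and an unknown sample (made up)
extra <- tibble(sample = c("S1", "S1", "S3"), age = c(20, 21, 30))

# full_join grows the table: S1 is duplicated and S3 is appended
n_full <- nrow(full_join(sample_info, extra, by = "sample"))
n_full  # 4

# relationship = "many-to-one" turns the duplicated key into an error
result <- tryCatch(
  left_join(sample_info, extra, by = "sample", relationship = "many-to-one"),
  error = function(e) "error"
)
result  # "error"
```

Either outcome of the unconstrained joins would leave sample rows with no matching matrix column, which is why a strict relationship check is a sensible default here.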

## Advanced Patterns: Chaining for Complex Operations

The real power emerges when you chain multiple operations together. Here are some patterns:

### Pattern 1: Filter → Select → Arrange

```{r}
complex_pipeline <- toy_exp |>
  filter_obs(group == "A") |>
  select_obs(group, batch) |>
  arrange_obs(desc(batch)) |>
  filter_var(protein == "PRO1") |>
  select_var(glycan_composition, protein)

print("Final pipeline result:")
print(complex_pipeline)
```

### Pattern 2: Mutate → Filter → Slice

```{r}
analytical_pipeline <- toy_exp |>
  mutate_var(composition_length = nchar(glycan_composition)) |>
  filter_var(composition_length >= 4) |>
  slice_max_var(composition_length, n = 3)

get_var_info(analytical_pipeline)
```

### Pattern 3: Random Sampling for Testing

```{r}
# Create a smaller dataset for testing
set.seed(456)
test_exp <- toy_exp |>
  slice_sample_obs(n = 3) |>
  slice_sample_var(n = 4)

print("Test dataset dimensions:")
print(test_exp)
```

## When dplyr-Style Functions Cannot Help

Sometimes you need functionality beyond what glyexp's dplyr-style functions provide.
Extract the tibbles and use any dplyr function you want.

### Why Doesn't glyexp Implement All dplyr Functions?

glyexp only implements functions that preserve the synchronized multi-table structure of `experiment()` objects. 

Functions like `count()`, `distinct()`, `summarise()`, and `pull()` return aggregated tables, deduplicated rows, or bare vectors
that no longer line up with the expression matrix.
For these operations, extract the relevant tibble and use standard dplyr functions:

```{r}
# For complex aggregations
toy_exp |>
  get_sample_info() |>
  count(group)
```

```{r}
# For distinct values
toy_exp |>
  get_var_info() |>
  distinct(protein) |>
  pull(protein)
```

```{r}
# For advanced filtering with multiple conditions
complex_filter_conditions <- toy_exp |>
  get_sample_info() |>
  filter(group == "A", batch == 2) |>
  pull(sample)

# Then use the results to subset your experiment
filtered_by_complex <- filter_obs(toy_exp, sample %in% complex_filter_conditions)
```

## Common Pitfalls and How to Avoid Them

### Pitfall 1: Using glyexp Functions on Non-Experiment Objects

This won't work:
```{r error=TRUE}
library(tibble)
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter_obs(regular_tibble, group == "A")
```

Do this instead:
```{r}
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter(regular_tibble, group == "A")

filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
```

### Pitfall 2: Forgetting the Synchronization

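Don't do this; it filters only an extracted copy of the sample information and leaves `toy_exp` and its expression matrix unchanged: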
```{r eval=FALSE}
sample_info <- get_sample_info(toy_exp)
filtered_samples <- filter(sample_info, group == "A")
```

Do this instead:
```{r}
filtered_exp <- filter_obs(toy_exp, group == "A")
```

### Pitfall 3: Trying to Remove Index Columns

This throws an error:
```{r error=TRUE}
select_obs(toy_exp, -sample)
```

Select the columns you want instead; the index column is kept automatically:
```{r}
clean_exp <- select_obs(toy_exp, group, batch)
get_sample_info(clean_exp)
```

### Pitfall 4: Mismatched Operations

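Don't pass a column from one table to a function that operates on the other; `glycan_composition` lives in the variable information, so `arrange_obs()` cannot find it: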
```{r eval=FALSE}
arrange_obs(toy_exp, glycan_composition)
```

Use the right function for the right data:
```{r}
arranged_by_composition <- arrange_var(toy_exp, glycan_composition)
get_var_info(arranged_by_composition)
```

## Performance Considerations

glyexp's dplyr-style functions are designed to be fast, safe, and consistent.

For large datasets, consider:

- Filtering early in your pipeline to reduce data size
- Using `select_obs()` and `select_var()` to keep only needed columns
- Chaining operations efficiently to minimize intermediate copies

```{r}
# Efficient pipeline: filter first, then manipulate
efficient_pipeline <- toy_exp |>
  filter_obs(group == "A") |>          # Reduce samples early
  filter_var(protein == "PRO1") |>     # Reduce variables early
  select_obs(group) |>                 # Keep only needed sample columns
  select_var(glycan_composition)       # Keep only needed variable columns
```

## Philosophy Behind the Design

glyexp's dplyr-style functions embody a simple philosophy:

**"Think about your metadata, and let the data follow."**

This design means:

1. **Mental Model Alignment**: Think in terms of samples and variables, not matrix indices
2. **Error Prevention**: Automatic synchronization prevents common data analysis mistakes
3. **Familiar Syntax**: If you know dplyr, you already know most of glyexp
4. **Composability**: Functions chain together naturally for complex analyses

## Summary

glyexp's dplyr-style functions are experiment-specific data manipulators designed exclusively for `experiment()` objects. They provide:

- **Automatic Synchronization**: Operations on metadata automatically update the expression matrix
- **Index Column Protection**: Critical relationship columns are protected from deletion
- **Familiar Syntax**: Standard dplyr operations with multi-table awareness
- **Type-Aware Operations**: `_obs()` for samples, `_var()` for variables

Start with `filter_obs()` and `select_var()`, then build complex pipelines.