---
title: "Custom Blueprint"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Custom Blueprint}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(glysmith)
```

A blueprint is simply an analytical plan that tells `forge_analysis()` what to do.
By default, `forge_analysis()` runs a default blueprint:

```{r}
blueprint_default()
```

This vignette shows you how to build your own blueprint for custom analyses.

**Heads up:** Before diving into the details of manual blueprint creation,
you might want to try having an LLM generate one for you with `inquire_blueprint()`.
Check out [this vignette](https://glycoverse.github.io/glysmith/articles/ai.html)
for the details.

## Step functions

`glysmith` comes with various step functions (anything starting with `step_`),
each handling a specific analytical task.
Chain them together with `blueprint()`:

```{r}
bp <- blueprint(
  step_ident_overview(),  # Identification overview
  step_preprocess(),      # Preprocess (normalization, imputation, etc.)
  step_dea_limma(),       # DEA with limma
  step_volcano(),         # Volcano plot
)
```

Now `bp` is ready to go — just pass it as the second argument to `forge_analysis()`:

```r
res <- forge_analysis(exp, bp)
```

Here's the full lineup of available step functions:

| Function | Description | Require Glycan Structure | Experiment Type |
|:---------|:------------|:------------------------:|:---------------:|
| **Preprocessing** | | | |
| `step_ident_overview()` | Summarize the experiment (identification overview) | No | GP/G |
| `step_plot_qc()` | Generate quality control plots | No | GP/G |
| `step_preprocess()` | Preprocess the experiment (normalization, imputation, etc.) | No | GP/G |
| `step_subset_groups()` | Subset the experiment to specific groups | No | GP/G |
| `step_adjust_protein()` | Adjust glycoform quantification by protein abundance | No | GP |
| **Differential Analysis** | | | |
| `step_dea_limma()` | Differential expression analysis using limma | No | GP/G |
| `step_dea_ttest()` | Differential expression analysis using t-test | No | GP/G |
| `step_dea_anova()` | Differential expression analysis using ANOVA | No | GP/G |
| `step_dea_wilcox()` | Differential expression analysis using Wilcoxon test | No | GP/G |
| `step_dea_kruskal()` | Differential expression analysis using Kruskal-Wallis test | No | GP/G |
| **Visualization** | | | |
| `step_volcano()` | Create volcano plot from DEA results | No | GP/G |
| `step_heatmap()` | Create heatmap plot | No | GP/G |
| `step_logo()` | Create logo plot for glycosylation sites | No | GP |
| `step_sig_boxplot()` | Boxplots for significant variables from DEA | No | GP/G |
| **Dimension Reduction** | | | |
| `step_pca()` | Principal Component Analysis | No | GP/G |
| `step_tsne()` | t-SNE analysis | No | GP/G |
| `step_umap()` | UMAP analysis | No | GP/G |
| `step_plsda()` | Partial Least Squares Discriminant Analysis | No | GP/G |
| `step_oplsda()` | Orthogonal Partial Least Squares Discriminant Analysis | No | GP/G |
| **Correlation & Enrichment** | | | |
| `step_correlation()` | Pairwise correlation analysis | No | GP/G |
| `step_sig_enrich_go()` | GO enrichment analysis on DE variables | No | GP |
| `step_sig_enrich_kegg()` | KEGG enrichment analysis on DE variables | No | GP |
| `step_sig_enrich_reactome()` | Reactome enrichment analysis on DE variables | No | GP |
| `step_sig_enrich_ncg()` | NCG enrichment analysis on DE variables | No | GP |
| `step_sig_enrich_wp()` | WikiPathways enrichment analysis on DE variables | No | GP |
| `step_sig_enrich_do()` | Disease Ontology enrichment analysis on DE variables | No | GP |
| **Advanced Analysis** | | | |
| `step_infer_structure()` | Infer glycan structures from glycan compositions | No | GP/G |
| `step_derive_traits()` | Calculate glycan derived traits | **Yes** | GP/G |
| `step_quantify_dynamic_motifs()` | Quantify dynamic glycan motifs | **Yes** | GP/G |
| `step_quantify_branch_motifs()` | Quantify N-glycan branch motifs | **Yes** | GP/G |
| `step_roc()` | ROC analysis for biomarker discovery | No | GP/G |
| **Survival Analysis** | | | |
| `step_cox()` | Cox proportional hazards model | No | GP/G |

*GP: glycoproteomics, G: glycomics*

## Step dependencies

Some steps need data that other steps produce.
For instance, `step_volcano()` needs results from a DEA step like `step_dea_limma()`.
So this blueprint won't work:

```r
blueprint(  # this will error
  step_preprocess(),      # Preprocess (normalization, imputation, etc.)
  step_volcano(),         # Volcano plot
)
```

The easiest way to figure out dependencies is to check the documentation for each step.
For example, `?step_volcano()` tells you it
"requires one of the DEA steps to be run:
`step_dea_limma()`, `step_dea_ttest()`, `step_dea_wilcox()`".
So as long as you run one of these DEA steps before `step_volcano()`, you're good to go:

```{r}
bp <- blueprint(
  step_preprocess(),      # Preprocess (normalization, imputation, etc.)
  step_dea_ttest(),       # DEA with t-test
  step_volcano(),         # Volcano plot
)
```

Steps and their dependencies don't need to be right next to each other — 
all intermediate data gets stored along the way,
so the blueprint is valid as long as the required data exists somewhere upstream:

```{r}
bp <- blueprint(
  step_preprocess(),      # Preprocess (normalization, imputation, etc.)
  step_dea_ttest(),       # DEA with t-test (generates data required by `step_volcano()`)
  step_heatmap(),         # Heatmap
  step_volcano(),         # Volcano plot (requires data generated by `step_dea_ttest()`)
)
```

## The `on` parameter

Many step functions have an `on` parameter that controls their dependencies.
For example, `step_heatmap()` defaults to `on = "exp"`,
which plots a heatmap from the experiment (after preprocessing, if you did that).
But you can point it at other data like "sig_exp" instead — 
just remember you'll need to run a DEA step first:

```{r}
bp <- blueprint(
  step_preprocess(),             # Preprocess (normalization, imputation, etc.)
  step_dea_limma(),              # DEA with limma
  step_heatmap(on = "sig_exp"),  # Heatmap on significantly expressed variables
)
```

Check `?step_dea_limma()` and you'll see it generates a field called exactly "sig_exp".
This is how data flows in `glysmith`:
each step stores its results in a shared context object,
where downstream steps can grab what they need.

Let's see another example:

```{r}
bp <- blueprint(
  # Data required: `exp`
  # Data generated: `exp` (overwrite)
  step_preprocess(),

  # Data required: `exp`
  # Data generated: `trait_exp` (trait experiment)
  step_derive_traits(),

  # Data required: `trait_exp` (specified by `on`)
  # Data generated: `sig_trait_exp` (significantly different trait experiment)
  step_dea_limma(on = "trait_exp"),

  # Data required: `sig_trait_exp`
  step_heatmap(on = "sig_trait_exp"),
)
```

Here's the general rule for dataflow in `glysmith`:
steps don't overwrite data generated by earlier steps.
The exception is `step_preprocess()`, which updates `exp` with the preprocessed version
(but don't worry — the raw data is safely backed up in `raw_exp`).

To build valid blueprints, you'll need to know what each step needs and what it produces.
Let's look at a more complex example to cement this.
Try tracing through the dataflow yourself:

```{r}
bp <- blueprint(
  step_ident_overview(),
  step_preprocess(),
  step_pca(),
  
  # DEA
  step_dea_limma(),
  step_volcano(),
  step_heatmap(on = "sig_exp"),

  # Derived trait analysis
  step_derive_traits(),
  step_dea_limma(on = "trait_exp"),
  step_heatmap(on = "sig_trait_exp"),

  # Branch motif analysis
  step_quantify_branch_motifs(),
  step_dea_limma(on = "branch_motif_exp"),
  step_heatmap(on = "sig_branch_motif_exp")
)
```

## Branches

As we mentioned, steps can't overwrite data from earlier steps.
But what if you want to compare different DEA methods?
You might try something like this:

```r
blueprint(
  step_preprocess(),
  step_dea_limma(),
  step_volcano(),
  step_dea_ttest(),
  step_volcano(),
)
```

This won't fly for two reasons:

1. `step_dea_ttest()` would overwrite the `sig_exp` from `step_dea_limma()`.
2. `step_volcano()` shows up twice, which isn't allowed.

These rules keep table and plot names unambiguous.

To pull this off, use **branches**. So far our blueprints have been linear — steps run one after another.
To branch out, just wrap related steps in `br()`:

```{r}
bp <- blueprint(
  step_preprocess(),
  br("limma", step_dea_limma(), step_volcano()),
  br("t-test", step_dea_ttest(), step_volcano()),
)
```

The first argument names the branch, and the rest are the steps to run inside it.
Each branch gets its own isolated context,
so the `step_volcano()` in the "limma" branch naturally picks up the `sig_exp` from the `step_dea_limma()` in that same branch.
Branches can also access data from before the branch point — 
so both branches here can use the preprocessed `exp` from `step_preprocess()`.

## Summary

In this vignette, we covered how to build custom blueprints,
how the `on` parameter and step dependencies work, and how to use branches.

All of this takes some practice, so if you'd rather not dive deep into the details,
remember you can always [have AI generate blueprints](https://glycoverse.github.io/glysmith/articles/ai.html) for you.