| Title: | Parsing Glycan Structure Text Representations |
|---|---|
| Description: | Provides functions to parse glycan structure text representations into 'glyrepr' glycan structures. Currently, it supports StrucGP-style, pGlyco-style, IUPAC-condensed, IUPAC-extended, IUPAC-short, WURCS, Linear Code, and GlycoCT format. It also provides an automatic parser to detect the format and parse the structure string. |
| Authors: | Bin Fu [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-8567-2997>) |
| Maintainer: | Bin Fu <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.6.0 |
| Built: | 2026-05-14 15:35:45 UTC |
| Source: | https://github.com/glycoverse/glyparse |
Detect the structure string type and use the appropriate parser to parse automatically. Mixed types are supported.
Supported types:
GlycoCT
IUPAC-condensed
IUPAC-extended
IUPAC-short
WURCS
Linear Code
pGlyco
StrucGP
    auto_parse(x, on_failure = "error")
| x | A character vector of structure strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
A glyrepr::glycan_structure() object.
    # Single structure
    x <- "Gal(b1-3)GlcNAc(b1-4)Glc(a1-"  # IUPAC-condensed
    auto_parse(x)

    # Mixed types
    x <- c(
      "Gal(b1-3)GlcNAc(b1-4)Glc(a1-",  # IUPAC-condensed
      "Neu5Aca3Gala3(Fuca6)GlcNAcb-"   # IUPAC-short
    )
    auto_parse(x)
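Format detection can be pictured as a chain of checks applied to each string. The sketch below is not how auto_parse() is implemented: `guess_format()` is a hypothetical helper, and its rules (a "WURCS=" prefix, a leading "RES" line, an arrow for IUPAC-extended, a parenthesised linkage for IUPAC-condensed) are assumptions that cover only four of the supported types.

```r
# Hypothetical helper for illustration only; not part of glyparse.
guess_format <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    if (startsWith(s, "WURCS=")) return("WURCS")
    if (startsWith(s, "RES")) return("GlycoCT")
    # Plain-text or Unicode arrow suggests IUPAC-extended
    if (grepl("->", s, fixed = TRUE) || grepl("\u2192", s, fixed = TRUE)) {
      return("IUPAC-extended")
    }
    # A linkage such as "(b1-3)" or "(a1-" suggests IUPAC-condensed
    if (grepl("\\([ab?][0-9?]-", s)) return("IUPAC-condensed")
    "unknown"
  }, character(1), USE.NAMES = FALSE)
}

formats <- guess_format(c(
  "Gal(b1-3)GlcNAc(b1-4)Glc(a1-",
  "beta-D-Galp-(1->3)-alpha-D-GalpNAc-(1->",
  NA
))
print(formats)
```

Because each element is classified independently, a vector of mixed types needs no special handling in such a scheme.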
This function parses GlycoCT strings into a glyrepr::glycan_structure().
GlycoCT is a format used by databases like GlyTouCan and GlyGen.
    parse_glycoct(x, on_failure = "error")
| x | A character vector of GlycoCT strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
GlycoCT format consists of two parts:
RES: Contains monosaccharides (lines whose index is followed by 'b:', e.g. "1b:a-dgal-HEX-1:5") and substituents (lines whose index is followed by 's:', e.g. "2s:n-acetyl")
LIN: Contains linkage information between residues
For more information about GlycoCT format, see the glycoct.md documentation.
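The two-part layout can be explored with base R alone. The following is a minimal sketch, not the parser used by parse_glycoct(): it assumes a well-formed string with exactly one "RES" and one "LIN" line, and the variable names are illustrative.

```r
glycoct <- paste0(
  "RES\n",
  "1b:a-dgal-HEX-1:5\n",
  "2s:n-acetyl\n",
  "3b:b-dgal-HEX-1:5\n",
  "LIN\n",
  "1:1d(2+1)2n\n",
  "2:1o(3+1)3d"
)

lines <- strsplit(glycoct, "\n", fixed = TRUE)[[1]]
lin_at <- which(lines == "LIN")

# Everything between "RES" and "LIN" describes residues
res <- lines[seq(2, lin_at - 1)]
# Everything after "LIN" describes linkages
lin <- lines[seq(lin_at + 1, length(lines))]

monosaccharides <- res[grepl("^[0-9]+b:", res)]
substituents <- res[grepl("^[0-9]+s:", res)]

print(monosaccharides)
print(substituents)
print(lin)
```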
A glyrepr::glycan_structure() object.
    glycoct <- paste0(
      "RES\n",
      "1b:a-dgal-HEX-1:5\n",
      "2s:n-acetyl\n",
      "3b:b-dgal-HEX-1:5\n",
      "LIN\n",
      "1:1d(2+1)2n\n",
      "2:1o(3+1)3d"
    )
    parse_glycoct(glycoct)
This function parses IUPAC-condensed strings into a glyrepr::glycan_structure().
For more information about IUPAC-condensed notation, see doi:10.1351/pac199668101919.
    parse_iupac_condensed(x, on_failure = "error")
| x | A character vector of IUPAC-condensed strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
The IUPAC-condensed notation is a compact form of IUPAC-extended notation. It is used by the GlyConnect database. It contains the following information:
Monosaccharide name, e.g. "Gal", "GlcNAc", "Neu5Ac".
Substituent, e.g. "9Ac", "4Ac", "3Me", "?S".
Linkage, e.g. "b1-3", "a1-2", "a1-?".
An example of IUPAC-condensed string is "Gal(b1-3)GlcNAc(b1-4)Glc(a1-".
The reducing-end monosaccharide may be written with or without anomer information. For example, both strings below are valid:
"Neu5Ac(a2-"
"Neu5Ac"
In the first case, the anomer is "a2". In the second case, the anomer is "?2".
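To see how the pieces of an IUPAC-condensed string fit together, it can be split into residues and linkages with regular expressions. This is only a sketch for linear strings: it ignores branching brackets and substituents, and the pattern is an assumption, not the grammar used by parse_iupac_condensed().

```r
x <- "Gal(b1-3)GlcNAc(b1-4)Glc(a1-"

# A linkage is "(" + anomer + anomeric position + "-" + optional target position + optional ")"
linkage_pattern <- "\\([ab?][0-9?]-[0-9?]*\\)?"

linkages <- regmatches(x, gregexpr(linkage_pattern, x))[[1]]
residues <- strsplit(x, linkage_pattern)[[1]]

# The last linkage is open-ended and carries the reducing-end anomer
reducing_anomer <- gsub("[()-]", "", linkages[length(linkages)])

print(residues)
print(linkages)
print(reducing_anomer)
```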
A glyrepr::glycan_structure() object.
parse_iupac_short(), parse_iupac_extended()
    iupac <- "Gal(b1-3)GlcNAc(b1-4)Glc(a1-"
    parse_iupac_condensed(iupac)
Parse IUPAC-extended-style structure strings into a glyrepr::glycan_structure().
For more information about IUPAC-extended format, see doi:10.1351/pac199668101919.
    parse_iupac_extended(x, on_failure = "error")
| x | A character vector of IUPAC-extended strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
The function accepts both a Unicode format (using the Greek letters α/β and the arrow symbol →) and a plain-text format (using the strings "alpha", "beta", and "->"). For example, both "\u03b2-D-Galp-(1\u21923)-\u03b1-D-GalpNAc-(1\u2192" and "beta-D-Galp-(1->3)-alpha-D-GalpNAc-(1->" are valid inputs.
A glyrepr::glycan_structure() object.
parse_iupac_condensed(), parse_iupac_short()
    iupac <- "\u03b2-D-Galp-(1\u21923)-\u03b1-D-GalpNAc-(1\u2192"
    parse_iupac_extended(iupac)
    parse_iupac_extended("beta-D-Galp-(1->3)-alpha-D-GalpNAc-(1->")
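The Unicode and plain-text forms differ only in three tokens, so one can be rewritten into the other by plain substitution. The helper below, `to_unicode_iupac()`, is a hypothetical illustration and not a glyparse function; whether the package normalises input this way internally is not stated here.

```r
# Hypothetical helper for illustration only; not part of glyparse.
to_unicode_iupac <- function(x) {
  x <- gsub("alpha", "\u03b1", x, fixed = TRUE)
  x <- gsub("beta", "\u03b2", x, fixed = TRUE)
  gsub("->", "\u2192", x, fixed = TRUE)
}

plain <- "beta-D-Galp-(1->3)-alpha-D-GalpNAc-(1->"
unicode <- to_unicode_iupac(plain)
print(unicode)
```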
Parse IUPAC-short-style structure strings into a glyrepr::glycan_structure().
For more information about IUPAC-short format, see doi:10.1351/pac199668101919.
    parse_iupac_short(x, on_failure = "error")
| x | A character vector of IUPAC-short strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
The IUPAC-short notation is a compact form of IUPAC-condensed notation. It is rarely used in databases but appears often in the literature because of its conciseness. Compared with IUPAC-condensed notation, IUPAC-short notation omits the anomeric carbon positions, assuming they are known for common monosaccharides. For example, "Neu5Aca3Gala-" assumes the anomeric carbon of Neu5Ac is C2 (that is, a2-3 linked). The parentheses around linkages are also omitted; parentheses instead indicate branching, e.g. "Neu5Aca3Gala3(Fuca6)GlcNAcb-".
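The relationship between the two notations can be made concrete by expanding a short string back into condensed form. This is a rough sketch under several assumptions: it handles linear strings only, knows a small fixed set of monosaccharide names, and takes the anomeric carbon to be C2 for Neu5Ac and C1 for everything else. `short_to_condensed()` is a hypothetical helper, not part of glyparse.

```r
# Hypothetical helper for illustration only; linear strings, small name table.
short_to_condensed <- function(x) {
  # Longer names come first so that "GlcNAc" is not read as "Glc"
  pattern <- "(Neu5Ac|GlcNAc|GalNAc|Gal|Glc|Man|Fuc)([ab?])([0-9?]?)"
  tokens <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  parts <- regmatches(tokens, regexec(pattern, tokens, perl = TRUE))
  out <- vapply(parts, function(p) {
    # p: full match, residue name, anomer, target position (may be empty)
    carbon <- if (p[2] == "Neu5Ac") "2" else "1"
    closing <- if (nzchar(p[4])) paste0(p[4], ")") else ""
    paste0(p[2], "(", p[3], carbon, "-", closing)
  }, character(1))
  paste(out, collapse = "")
}

condensed <- short_to_condensed("Neu5Aca3Gala-")
print(condensed)
```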
A glyrepr::glycan_structure() object.
parse_iupac_condensed(), parse_iupac_extended()
    iupac <- "Neu5Aca3Gala3(Fuca6)GlcNAcb-"
    parse_iupac_short(iupac)
Parse Linear Code structures into a glyrepr::glycan_structure().
To know more about Linear Code, see this article.
    parse_linear_code(x, on_failure = "error")
| x | A character vector of Linear Code strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
A glyrepr::glycan_structure() object.
    linear_code <- "Ma3(Ma6)Mb4GNb4GNb"
    parse_linear_code(linear_code)
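Linear Code abbreviates monosaccharides to one- or two-letter codes. The sketch below spells out the codes in the example string using IUPAC-style names. The lookup table is a small, assumed subset of the Linear Code dictionary (for instance "M" for Man and "GN" for GlcNAc), and the code does not reflect how parse_linear_code() works internally.

```r
# Assumed subset of the Linear Code dictionary, for illustration only
codes <- c(GN = "GlcNAc", AN = "GalNAc", NN = "Neu5Ac",
           G = "Glc", A = "Gal", M = "Man", F = "Fuc")

x <- "Ma3(Ma6)Mb4GNb4GNb"

# Two-letter codes are listed first so "GN" is not read as "G"
m <- gregexpr("GN|AN|NN|G|A|M|F", x, perl = TRUE)
found <- regmatches(x, m)[[1]]

expanded <- x
regmatches(expanded, m) <- list(unname(codes[found]))
print(expanded)
```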
Parse pGlyco-style structure strings into a glyrepr::glycan_structure().
See example below for the structure format.
    parse_pglyco_struc(x, on_failure = "error")
| x | A character vector of pGlyco-style structure strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
A glyrepr::glycan_structure() object.
    glycan <- parse_pglyco_struc("(N(F)(N(H(H(N))(H(N(H))))))")
    print(glycan, verbose = TRUE)
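The nested parentheses in the example can be read as a tree. The sketch below assumes, from the example alone, that each parenthesis pair encloses a one-letter residue code followed by its child branches, and it recovers the parent of every residue with a stack. `parse_brackets()` is a hypothetical helper, not the implementation of parse_pglyco_struc().

```r
# Hypothetical helper for illustration only; not part of glyparse.
parse_brackets <- function(s) {
  chars <- strsplit(s, "")[[1]]
  stack <- integer(0)      # ids of residues that are still open
  labels <- character(0)
  parent <- integer(0)
  for (ch in chars) {
    if (ch == "(") next
    if (ch == ")") {
      stack <- stack[-length(stack)]  # close the innermost open residue
      next
    }
    labels <- c(labels, ch)
    parent <- c(parent, if (length(stack)) stack[length(stack)] else NA_integer_)
    stack <- c(stack, length(labels))
  }
  data.frame(id = seq_along(labels), residue = labels, parent = parent)
}

tree <- parse_brackets("(N(F)(N(H(H(N))(H(N(H))))))")
print(tree)
```

The root residue has no parent, so its `parent` entry is NA.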
Parse StrucGP-style structure strings into a glyrepr::glycan_structure().
See example below for the structure format.
    parse_strucgp_struc(x, on_failure = "error")
| x | A character vector of StrucGP-style structure strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
A glyrepr::glycan_structure() object.
    glycan <- parse_strucgp_struc("A2B2C1D1E2F1fedD1E2edcbB5ba")
    print(glycan, verbose = TRUE)
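The StrucGP string in the example can also be unpacked into a tree. This sketch assumes, from the example alone, that an uppercase letter gives the depth of a residue (A is the root), the digit after it is a monosaccharide code, and the matching lowercase letter closes the residue. It is an illustration with made-up variable names, not the logic of parse_strucgp_struc().

```r
s <- "A2B2C1D1E2F1fedD1E2edcbB5ba"

# Opening tokens: one uppercase letter followed by one digit
tokens <- regmatches(s, gregexpr("[A-Z][0-9]", s))[[1]]
depth <- match(substr(tokens, 1, 1), LETTERS)
code <- substr(tokens, 2, 2)

# The parent of a residue is the most recent earlier residue one level up
parent <- vapply(seq_along(tokens), function(i) {
  earlier <- which(depth[seq_len(i - 1)] == depth[i] - 1)
  if (length(earlier)) max(earlier) else NA_integer_
}, integer(1))

print(data.frame(id = seq_along(tokens), code = code, depth = depth, parent = parent))
```

Under these assumptions the lowercase closing letters are redundant for recovering the tree, since the depth letters already determine each parent.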
This function parses WURCS strings into a glyrepr::glycan_structure().
Currently, only WURCS 2.0 is supported.
For more information about WURCS, see WURCS.
    parse_wurcs(x, on_failure = "error")
| x | A character vector of WURCS strings. NA values are allowed and will be returned as NA structures. |
| on_failure | How to handle parsing failures. |
A glyrepr::glycan_structure() object.
    wurcs <- paste0(
      "WURCS=2.0/3,5,4/",
      "[a2122h-1b_1-5_2*NCC/3=O][a1122h-1b_1-5][a1122h-1a_1-5]/",
      "1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1"
    )
    parse_wurcs(wurcs)
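A WURCS 2.0 string declares three counts after the version (unique residues, residues, linkages), and these can be checked against the sections that follow. The snippet below is a minimal sketch using base R regular expressions. It assumes a well-formed string like the example and is not how parse_wurcs() is implemented. Note that a unique-residue entry may itself contain "/", so the string is not split naively on that character.

```r
wurcs <- paste0(
  "WURCS=2.0/3,5,4/",
  "[a2122h-1b_1-5_2*NCC/3=O][a1122h-1b_1-5][a1122h-1a_1-5]/",
  "1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1"
)

# Version and the three declared counts
header <- regmatches(
  wurcs,
  regexec("^WURCS=([0-9.]+)/([0-9]+),([0-9]+),([0-9]+)/", wurcs)
)[[1]]
declared <- as.integer(header[3:5])

# Unique residues are the bracketed entries
unique_res <- regmatches(wurcs, gregexpr("\\[[^]]*\\]", wurcs))[[1]]

# After the last "]/" come the residue sequence and the linkage list
tail_part <- sub("^.*\\]/", "", wurcs)
fields <- strsplit(tail_part, "/", fixed = TRUE)[[1]]
residues <- strsplit(fields[1], "-", fixed = TRUE)[[1]]
linkages <- strsplit(fields[2], "_", fixed = TRUE)[[1]]

observed <- c(length(unique_res), length(residues), length(linkages))
print(rbind(declared, observed))
```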