Despite the flaws, the molfile has been a defacto standard for chemical representation for several decades. The core format (atom and bond block) is well supported in many toolkits but more advanced features (dark corners) of the property block may be skipped.
At this year's Fall ACS (Boston '15) I bumped into an old colleague from ChEBI who told me they (ChEBI) couldn't use CDK because they wanted to display repeating brackets on records and CDK didn't do that.
Polymer representation (more precisely Structural Repeat Unit) used by ChEBI falls under the category of a Ctab Sgroup. I'd wanted to add support for Sgroups for some time and now had motivation to do so.
Substructure (or Substance) Groups
Over the years there seems to have been a shift in definition. The original literature uses the term "substructure groups" but more recent materials use "substance groups"[2,3]. Personally I prefer "substructure" since it concisely summarises what they really are about.
Essentially an Sgroup annotates some part of the connection table (a substructure) with meta-information (data). There are several types of Sgroup that formalise the types of annotation present:
- Display Shortcuts
- Multiple Groups
- Structural Repeat Unit (SRU)
- Copolymer (alternating, block, or random)
- Unordered Mixture
- Ordered Mixture (formulation)
Example ChEBI Depictions
Egon reviewed the first patch (pull/149) last week that focussed on representation and molfile round tripping. The second patch enhances the rendering code to handle more than basic SRUs (e.g. >2 brackets) and display shortcuts.
As of ChEBI 131 there are 809 entries with at least one Sgroup. Generating the depictions of these from an SDfile took < 3 seconds, then a further 11 to actually write the files to disk. The rest of this post demonstrates some example of those depictions.
Display Shortcut, Abbreviations
Previously referred to as "superatoms", parts of a structure can be abbreviated to a more concise name (e.g. Ph for a phenyl substituent). The full structure is present but is only displayed when the expansion flag is set.
Display Shortcut, Multiple Group
Multiple groups allow structures with fixed repeating parts to be drawn more concisely. Similar to abbreviations, all the atoms and bonds are present but are hidden from display. They're actually all overlaid on one another with duplicated coordinates but for rendering you still want omit them from display.
The most common Sgroup used in ChEBI is the Structure Repeat Unit (SRU), an SRU defines a repeat unit of variable length. The brackets do not necessarily come in pairs, are parallel, or point towards each other.
A few entries encode copolymers and source-based representations (monomer).
|CHEBI:59599||CHEBI:3814 (overlap in original)|
A structure can have more than one Sgroup and they can be nested. Here we see a multiple group within an SRU. There is also a data Sgroup attached to the Zn-N bond marking it as a coordination bond for Marvin. I've not decided whether to render those yet, but we have the information there.
- Gushurst et al. The substance module: the representation, storage, and searching of complex structures. J. Chem. Inf. Comput. Sci. (1991)
- Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. Online - StructurePendium Technologies GmbH
- Accelrys Chemical Representation
- CTfile Formats Specification