Tuesday, 8 September 2015

Bringing Molfile Sgroups to the CDK - Demo

Despite its flaws, the molfile has been a de facto standard for chemical representation for several decades. The core format (the atom and bond blocks) is well supported by many toolkits, but the more advanced features (the dark corners) of the property block may be skipped.

At this year's Fall ACS (Boston '15) I bumped into an old colleague from ChEBI who told me they (ChEBI) couldn't use CDK because they wanted to display repeating brackets on records and CDK didn't do that.

The polymer representation used by ChEBI (more precisely, the Structural Repeat Unit) falls under the category of Ctab Sgroups. I'd wanted to add support for Sgroups for some time and now had the motivation to do so.

Substructure (or Substance) Groups

Over the years there seems to have been a shift in definition. The original literature[1] uses the term "substructure groups" but more recent materials use "substance groups"[2,3]. Personally I prefer "substructure" since it concisely summarises what they really are about.

Essentially an Sgroup annotates some part of the connection table (a substructure) with meta-information (data). There are several types of Sgroup that formalise the types of annotation present:

  • Display Shortcuts
    • Abbreviations
    • Multiple Groups
  • Polymers
    • Structural Repeat Unit (SRU)
    • Monomer
    • Copolymer (alternating, block, or random)
    • Mer
    • Crosslink
    • Graft
    • Modified
    • Any
  • Mixtures
    • Unordered Mixture
    • Ordered Mixture (formulation)
    • Component
  • Generic
  • Data
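For reference, a V2000 molfile records each of these types as a three-letter code on an `M  STY` line. The lookup below is a small Python sketch of that mapping, not CDK code: the codes are those of the CTfile format, while the grouping follows the list above and the names `SGROUP_TYPES` and `describe` are made up for illustration. The copolymer flavour (alternating, block, or random) is not part of the type code; it is stored separately as a subtype on `M  SST` lines.

```python
# Sgroup type codes as they appear on "M  STY" lines, grouped as in the list above.
SGROUP_TYPES = {
    "SUP": ("Display Shortcuts", "Abbreviation (superatom)"),
    "MUL": ("Display Shortcuts", "Multiple Group"),
    "SRU": ("Polymers", "Structural Repeat Unit (SRU)"),
    "MON": ("Polymers", "Monomer"),
    "COP": ("Polymers", "Copolymer"),
    "MER": ("Polymers", "Mer"),
    "CRO": ("Polymers", "Crosslink"),
    "GRA": ("Polymers", "Graft"),
    "MOD": ("Polymers", "Modified"),
    "ANY": ("Polymers", "Any"),
    "MIX": ("Mixtures", "Unordered Mixture"),
    "FOR": ("Mixtures", "Ordered Mixture (formulation)"),
    "COM": ("Mixtures", "Component"),
    "GEN": ("Generic", "Generic"),
    "DAT": ("Data", "Data"),
}


def describe(code):
    """Return 'Category: Name' for a three-letter Sgroup type code."""
    category, name = SGROUP_TYPES[code.upper()]
    return f"{category}: {name}"
```

For example, `describe("sru")` gives the category and name of the type ChEBI uses most.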

Example ChEBI Depictions

Egon reviewed the first patch (pull/149) last week that focussed on representation and molfile round tripping. The second patch enhances the rendering code to handle more than basic SRUs (e.g. >2 brackets) and display shortcuts.

As of ChEBI 131 there are 809 entries with at least one Sgroup. Generating the depictions of these from an SDfile took < 3 seconds, then a further 11 seconds to actually write the files to disk. The rest of this post shows some examples of those depictions.
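A count like the one above can be approximated without any toolkit by scanning the SDfile for Sgroup lines. The sketch below is one way to do it in Python and is not how the figure above was produced. It assumes records are terminated by `$$$$`, that V2000 Sgroups are announced by `M  STY` lines, and that V3000 Sgroups start with `M  V30 BEGIN SGROUP`. The function name is hypothetical.

```python
def count_sgroup_records(lines):
    """Return (total records, records with at least one Sgroup).

    Only the molfile part of each record (up to "M  END") is inspected,
    so data fields that happen to contain similar text are ignored.
    A trailing record without a "$$$$" terminator is not counted.
    """
    total = 0
    with_sgroups = 0
    in_molfile = True   # every record begins with its molfile
    has_sgroup = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("$$$$"):
            total += 1
            if has_sgroup:
                with_sgroups += 1
            in_molfile, has_sgroup = True, False
        elif in_molfile:
            if line.startswith("M  END"):
                in_molfile = False
            elif line.startswith(("M  STY", "M  V30 BEGIN SGROUP")):
                has_sgroup = True
    return total, with_sgroups
```

Any iterable of lines works, so an open file handle can be passed in directly.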

Display Shortcut, Abbreviations

Abbreviations, previously referred to as "superatoms", allow parts of a structure to be replaced by a more concise name (e.g. Ph for a phenyl substituent). The full structure is present but is only displayed when the expansion flag is set.

CHEBI:29441 CHEBI:7725
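In a V2000 molfile the pieces needed for this are spread over three kinds of property line: `M  STY` gives the type (`SUP` for abbreviations), `M  SMT` gives the label, and `M  SDS EXP` lists the Sgroups that should be drawn expanded. The Python sketch below collects them. It splits on whitespace rather than reading the fixed-width columns the format defines, which is an assumption that holds for well-formed files, and the function name is made up for illustration.

```python
def abbreviations(lines):
    """Map Sgroup index -> (label, expanded) for every SUP Sgroup."""
    types, labels, expanded = {}, {}, set()
    for line in lines:
        if line.startswith("M  STY"):
            # count, then (index, type) pairs
            fields = line[6:].split()
            for i in range(int(fields[0])):
                types[int(fields[1 + 2 * i])] = fields[2 + 2 * i]
        elif line.startswith("M  SMT"):
            # index, then the label text
            fields = line[6:].split(maxsplit=1)
            labels[int(fields[0])] = fields[1].strip() if len(fields) > 1 else ""
        elif line.startswith("M  SDS EXP"):
            # count, then the indices of expanded Sgroups
            fields = line[10:].split()
            expanded.update(int(x) for x in fields[1:1 + int(fields[0])])
    return {
        idx: (labels.get(idx, ""), idx in expanded)
        for idx, sgroup_type in types.items()
        if sgroup_type == "SUP"
    }
```

A renderer would then draw the label for every entry whose second value is `False` and the full atoms and bonds for the rest.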

Display Shortcut, Multiple Group

Multiple groups allow structures with fixed repeating parts to be drawn more concisely. Similar to abbreviations, all the atoms and bonds are present but are hidden from display. They're actually all overlaid on one another with duplicated coordinates, but for rendering you still want to omit them from display.

CHEBI:1233 CHEBI:79399
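Working out which atoms to hide is a simple set difference. In the molfile, the Sgroup atom list (`M  SAL`) holds every copy of the repeated part, the parent atom list (`M  SPA`) holds the one copy that is drawn, and the subscript (`M  SMT`) holds the multiplier. The Python sketch below takes those three values as already-parsed arguments; the function name and the consistency checks are my own illustration rather than anything taken from the CDK patch.

```python
def atoms_to_hide(sgroup_atoms, parent_atoms, multiplier):
    """Return the sorted atom indices of a multiple group that should not be drawn."""
    atoms, parents = set(sgroup_atoms), set(parent_atoms)
    if not parents <= atoms:
        raise ValueError("parent atoms must be part of the Sgroup")
    # Assumption: the group stores exactly `multiplier` copies of the parent atoms.
    if len(atoms) != multiplier * len(parents):
        raise ValueError("atom count does not match the multiplier")
    return sorted(atoms - parents)
```

For a two-atom unit repeated three times, atoms 1 and 2 stay visible and the other four are hidden.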

Polymer, SRUs

The most common Sgroup used in ChEBI is the Structural Repeat Unit (SRU). An SRU defines a repeat unit of variable length. The brackets are not necessarily paired, parallel, or pointing towards each other.

CHEBI:16838 CHEBI:4294
CHEBI:53422 CHEBI:59342
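Because the brackets are stored as independent line segments (one `M  SDI` line each, holding the two end points), a renderer cannot assume anything about how they relate. As a rough illustration, the Python sketch below tests whether two brackets are parallel using the 2D cross product of their direction vectors. The representation of a bracket as a pair of `(x, y)` tuples, the function name, and the tolerance are assumptions made for this example.

```python
import math


def are_parallel(bracket_a, bracket_b, tol=1e-6):
    """True if two brackets, each ((x1, y1), (x2, y2)), are parallel.

    Anti-parallel brackets (drawn in opposite directions) count as parallel.
    """
    (ax1, ay1), (ax2, ay2) = bracket_a
    (bx1, by1), (bx2, by2) = bracket_b
    adx, ady = ax2 - ax1, ay2 - ay1
    bdx, bdy = bx2 - bx1, by2 - by1
    # The cross product is |a||b|sin(angle), so scale the tolerance by the lengths.
    cross = adx * bdy - ady * bdx
    return abs(cross) <= tol * math.hypot(adx, ady) * math.hypot(bdx, bdy)
```

The same direction vectors could also be used to decide which side of a bracket the subscript belongs on, but that is left out here.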

Polymer, Others

A few entries encode copolymers and source-based representations (monomer).

CHEBI:59599 CHEBI:3814 (overlap in original)


A structure can have more than one Sgroup and they can be nested. Here we see a multiple group within an SRU. There is also a data Sgroup attached to the Zn-N bond marking it as a coordination bond for Marvin. I've not decided whether to render those yet, but we have the information there.
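Nesting is recorded in the molfile on `M  SPL` lines, which list child and parent Sgroup indices in pairs. The Python sketch below reads those pairs and computes how deeply a given Sgroup is nested. As with any quick parser, it splits on whitespace instead of reading fixed-width columns, and both function names are hypothetical.

```python
def parse_parents(lines):
    """Read "M  SPL" lines into a {child index: parent index} dictionary."""
    parents = {}
    for line in lines:
        if not line.startswith("M  SPL"):
            continue
        fields = line[6:].split()
        count = int(fields[0])
        pairs = fields[1:1 + 2 * count]
        for i in range(0, len(pairs), 2):
            parents[int(pairs[i])] = int(pairs[i + 1])
    return parents


def nesting_depth(sgroup, parents):
    """Number of enclosing Sgroups; 0 for a top-level Sgroup."""
    depth = 0
    seen = {sgroup}
    while sgroup in parents:
        sgroup = parents[sgroup]
        if sgroup in seen:
            raise ValueError("cyclic Sgroup hierarchy")
        seen.add(sgroup)
        depth += 1
    return depth
```

For a multiple group (index 1) inside an SRU (index 2), the line `M  SPL  1   1   2` gives a depth of 1 for the multiple group and 0 for the SRU.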


Additional Reading

  1. Gushurst et al. The substance module: the representation, storage, and searching of complex structures. J. Chem. Inf. Comput. Sci. (1991)
  2. Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. Online - StructurePendium Technologies GmbH
  3. Accelrys Chemical Representation
  4. CTfile Formats Specification