Stanford University Libraries

Chemical Literature (Chem 184/284)
University of California at Santa Barbara

Lecture 15: Chemical Abstracts Registry File, Part 2: Structure Searching

Structure Searching on STN

  • One of the most powerful features of the chemical substance files on STN is the ability to search by chemical structure.
  • A large number of STN files contain searchable chemical structures of various types.
  • The REGISTRY, BEILSTEIN, GMELIN and Derwent Drug Files all contain records for individual compounds.
Input ------> Output
Specific structure ------> Single compound
Generic structure ------> Set of compounds
  • The CASREACT, CHEMINFORMRX, “Derwent Journal of Synthetic Methods” and CHEMREACT files all contain information on organic chemical reactions. The reactants and products are structure searchable.
Input ------> Output
Specific or generic structures ------> Set of reactions with desired features
  • The MARPAT and MARPATpreviews files contain the generic Markush structures appearing in chemical patents in structure searchable form.
Input ------> Output
Specific or generic structures ------> Patents containing appropriate Markush structures

How Structures are Stored

  • In STN files, structures are stored as connection tables — a list of each atom in the structure, which atoms each is linked to, and by what kind of bond.
  • Structures with stereochemistry have additional information about the spatial arrangement of the bonds.
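As an illustration only, a connection table can be pictured as a small data structure. The sketch below is a hypothetical Python model, not the STN storage format; the class name `ConnectionTable`, its methods and the bond labels are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionTable:
    """Hypothetical in-memory connection table: atoms plus typed bonds."""
    atoms: dict = field(default_factory=dict)   # node number -> element symbol
    bonds: dict = field(default_factory=dict)   # frozenset({a, b}) -> bond type

    def add_atom(self, number, element):
        self.atoms[number] = element

    def add_bond(self, a, b, bond_type):
        self.bonds[frozenset((a, b))] = bond_type

    def neighbors(self, number):
        """Return (neighbor, bond type) pairs for one atom, sorted by neighbor."""
        result = []
        for pair, bond_type in self.bonds.items():
            if number in pair:
                (other,) = pair - {number}
                result.append((other, bond_type))
        return sorted(result)


# Heavy atoms of acetic acid: C1-C2(=O3)-O4. Hydrogens are left implicit.
acetic_acid = ConnectionTable()
for number, element in [(1, "C"), (2, "C"), (3, "O"), (4, "O")]:
    acetic_acid.add_atom(number, element)
acetic_acid.add_bond(1, 2, "single")
acetic_acid.add_bond(2, 3, "double")
acetic_acid.add_bond(2, 4, "single")

print(acetic_acid.neighbors(2))
```

Each atom is listed once, and every bond records which two atoms it joins and its type, which is the information the bullet above describes.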

How Structures are Searched

  • The Messenger software searches structure information in two steps: screening and atom-by-atom match.
  • Screening quickly rules out structures that cannot match by checking for certain common features, leaving a smaller set of likely matches.
  • Atom-by-atom match then compares the whole of the connection table of the query structure with that of the possible matches.
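The two steps can be sketched in Python as follows. This is a toy model under stated assumptions, not the Messenger algorithm: the "screen" here is simply the set of elements present, bond types are ignored, and the names `screen`, `atom_by_atom`, `features` and `search` are invented for the example.

```python
def screen(query_features, candidate_features):
    """Cheap first pass: keep a candidate only if it has every query feature."""
    return query_features <= candidate_features


def atom_by_atom(query, target):
    """Backtracking check that every query atom and bond maps onto the target.

    Each structure is (atoms, bonds): atoms maps node number -> element,
    bonds is a set of frozenset({a, b}) pairs. Bond types are ignored here.
    """
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    order = sorted(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        q = order[len(mapping)]
        for t in t_atoms:
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Every query bond from q to an already-mapped atom must exist
            # between the corresponding target atoms.
            ok = all(
                frozenset((t, mapping[p])) in t_bonds
                for p in mapping
                if frozenset((q, p)) in q_bonds
            )
            if ok:
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


def features(structure):
    """Toy screen: the set of elements present in the structure."""
    return set(structure[0].values())


def search(query, file):
    candidates = {n: s for n, s in file.items() if screen(features(query), features(s))}
    return [n for n, s in candidates.items() if atom_by_atom(query, s)]


query = ({1: "N", 2: "O"}, {frozenset((1, 2))})  # an N-O bond
file = {
    "propane": ({1: "C", 2: "C", 3: "C"}, {frozenset((1, 2)), frozenset((2, 3))}),
    "ethanolamine": (
        {1: "N", 2: "C", 3: "C", 4: "O"},
        {frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 4))},
    ),
    "hydroxylamine": ({1: "N", 2: "O"}, {frozenset((1, 2))}),
}
print(search(query, file))
```

In this example propane is rejected by the screen, ethanolamine passes the screen but fails the atom-by-atom match (it has N and O, but they are not bonded), and only hydroxylamine is returned.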

Building Query Structures

  • Messenger has a whole set of commands for “drawing” chemical structures within the system itself.
  • STN Express is a specialized software package which includes structure drawing software and the ability to upload these structures to the online system. For a manual on building structures using STN Express, see Structure Searching in the CAS Registry File at http://www.cas.org/ACAD/casreg.pdf. Note: this is a large PDF file requiring a current version of the Adobe Acrobat Reader for viewing.

Structure Building Commands — STRUCTURE

  • The command STRUCTURE initiates the structure building process.
  • The system responds by prompting for a structure to recall.
  • You may respond with the name of a template, a Registry Number, a previously-built structure L# or NONE.
  • When in structure building mode, the arrow prompt is replaced by a colon.

GRAPH — Creating the Pieces

  • The GRAPH command (abbreviated GRA) tells the system to create atoms or sets of atoms, as either chains or rings. Note that, in general, you do not have to draw hydrogen atoms as part of the structure—they are assumed to be present by the system.
  • The default atom is carbon; the default bond is unspecified.
  • When structure building with GRA commands, the system automatically assigns a number to each node in the order constructed.
    : gra c3
    
    creates a three carbon chain, while
    : gra r6
    
    creates a six-carbon ring (i.e., the beginnings of cyclohexane or benzene).
    : gra r66
    
    creates two six-membered rings fused along one side (i.e., the beginnings of naphthalene).
  • You can also attach chains to specific atoms:
    : gra 2 c4
    
    attaches a 4 carbon chain to atom 2
  • You can create bonds between existing atoms:
    : gra 1 2
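The node numbering described above can be modeled with a short Python sketch. It is a toy illustration, not Messenger itself: the class `StructureBuilder` and its methods are hypothetical, fused ring systems such as `r66` are not modeled, and every new atom defaults to carbon with an unspecified bond.

```python
class StructureBuilder:
    """Toy model of GRA node numbering; not the Messenger implementation."""

    def __init__(self):
        self.atoms = {}       # node number -> element symbol (default carbon)
        self.bonds = set()    # frozenset({a, b}); bond type left unspecified
        self._next = 1        # next node number to assign

    def _new_atoms(self, count):
        numbers = list(range(self._next, self._next + count))
        self._next += count
        for n in numbers:
            self.atoms[n] = "C"
        return numbers

    def chain(self, length, attach_to=None):
        """Like 'gra c<length>', or 'gra <attach_to> c<length>' when attached."""
        numbers = self._new_atoms(length)
        path = ([attach_to] if attach_to is not None else []) + numbers
        for a, b in zip(path, path[1:]):
            self.bonds.add(frozenset((a, b)))
        return numbers

    def ring(self, size):
        """Like 'gra r<size>' for a single ring."""
        numbers = self._new_atoms(size)
        for a, b in zip(numbers, numbers[1:] + numbers[:1]):
            self.bonds.add(frozenset((a, b)))
        return numbers

    def bond(self, a, b):
        """Like 'gra <a> <b>': bond two existing atoms."""
        self.bonds.add(frozenset((a, b)))


builder = StructureBuilder()
first_chain = builder.chain(3)               # nodes 1-3
ring = builder.ring(6)                       # nodes 4-9
side_chain = builder.chain(4, attach_to=2)   # nodes 10-13, bonded to node 2
builder.bond(1, 4)                           # extra bond between existing atoms
print(first_chain, ring, side_chain)
```

Numbers are handed out strictly in order of construction, which is why it pays to display the structure before referring to a node by number.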
    

DELETE — Removing Atoms and Bonds

  • DELETE can be used to remove atoms or groups of atoms:
    : del 1 5 8
    
  • Or it can be used to remove bonds:
    : del 1-2 7-9
    

NODE — Transmuting Elements

  • The NODE command is used to change atoms from one type to another, such as changing the default carbon atoms to other types.
    : nod 1 o
    : nod 2 4 10 n
    : nod 6 Cl, 7 f
    
  • In addition to normal elements, you can use NODE to introduce special groups:
    • X — any halogen
    • M — any metal
    • Q — any atom except C or H
    • A — any atom except H
    • Ak — any alkyl group
    • Cy — any cyclic group
    • Cb — any carbocyclic group
    • Hy — any heterocyclic group
  • There are also specific multi-atom groups called “shortcuts” e.g. Me for methyl groups, Ph for phenyl groups.
  • Shortcut symbols may not be substituted upon in a substructure search, i.e. a COOH shortcut will find only acids, not esters of those acids.
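How a generic node symbol widens a match can be sketched as below. This is an illustrative Python function with invented names (`node_matches`, `HALOGENS`); it covers only X, Q and A from the list above, and the halogen set is an assumption of the sketch. M and the group symbols (Ak, Cy, Cb, Hy) are left out because they need data that is not given here.

```python
# Assumed halogen list for this sketch.
HALOGENS = {"F", "Cl", "Br", "I", "At"}


def node_matches(query_symbol, element):
    """Return True if a file atom's element satisfies the query node symbol."""
    if query_symbol == "X":      # any halogen
        return element in HALOGENS
    if query_symbol == "Q":      # any atom except C or H
        return element not in {"C", "H"}
    if query_symbol == "A":      # any atom except H
        return element != "H"
    return query_symbol == element   # ordinary element: exact match only


for symbol in ("X", "Q", "A", "N"):
    print(symbol, [e for e in ("C", "H", "N", "O", "Cl") if node_matches(symbol, e)])
```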

VARIABLE — Defining Your Own Generic Groups

  • The VARIABLE command lets you define your own special groups, called G# groups.
    : nod 5 g1
    : var g1=o/n/no2/me/x/12
    
  • G groups can include elements, shortcuts, generics, structure fragments (which can contain other G# groups!)
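To make the slash-separated definition concrete, the sketch below splits a VAR definition into its alternatives and labels each one. It is a hypothetical Python helper, not part of STN: the function names and the category labels are invented, and treating a bare number such as `12` as the node number of a structure fragment is an assumption of the sketch.

```python
GENERICS = {"X", "M", "Q", "A", "AK", "CY", "CB", "HY"}


def parse_variable(definition):
    """Split a 'g1=o/n/no2/me/x/12' style definition into (name, alternatives)."""
    name, _, rhs = definition.partition("=")
    return name.strip().upper(), [alt.strip() for alt in rhs.split("/")]


def classify(alternative):
    """Label one alternative; the categories are this sketch's own."""
    if alternative.isdigit():
        # Assumption: a bare number refers to a node of a structure fragment.
        return "fragment at node " + alternative
    upper = alternative.upper()
    if upper in GENERICS:
        return "generic group"
    if upper.startswith("G") and upper[1:].isdigit():
        return "nested G group"
    return "element or shortcut"


name, alternatives = parse_variable("g1=o/n/no2/me/x/12")
for alt in alternatives:
    print(name, alt, "->", classify(alt))
```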

BOND — Specifying Bond Types

  • Initially, bonds are drawn as unspecified.
  • The BOND command lets you specify bond types; it does not create bonds.
  • You name the end atoms of the bond, and the desired type.
    : bon 9-12 se
    : bon all se; 8-9 de
    
  • The bond types used by STN Messenger are:
    • se — single exact
    • de — double exact
    • t — triple
    • n — normalized
    • s — single or normalized
    • d — double or normalized
  • Normalized Bonds — Normalized bonds are a convention used where a given bond may be either single or double, such as in aromatic rings like benzene or naphthalene, or in carboxylates (where either oxygen might be single or double bonded to the carboxyl carbon).
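The bond type list above amounts to a lookup from each query bond type to the file bonds it accepts. The Python sketch below shows that reading; the names are invented, the file bond labels are this sketch's own, and the rule that an unspecified query bond (represented as `None`) matches any bond is an assumption.

```python
# Query bond type -> set of file bond values it is taken to accept.
QUERY_BOND_MATCHES = {
    "se": {"single"},
    "de": {"double"},
    "t": {"triple"},
    "n": {"normalized"},
    "s": {"single", "normalized"},
    "d": {"double", "normalized"},
}
ALL_FILE_BONDS = {"single", "double", "triple", "normalized"}


def bond_matches(query_type, file_bond):
    """Return True if the file bond satisfies the query bond type."""
    # Assumption: an unspecified query bond (None) matches any bond.
    allowed = ALL_FILE_BONDS if query_type is None else QUERY_BOND_MATCHES[query_type]
    return file_bond in allowed


# A benzene ring bond is stored as normalized, so 'de' misses it but 'd' or 'n' hits.
print(bond_matches("de", "normalized"), bond_matches("d", "normalized"))
```

This is why a ring drawn with exact single and double bonds will not retrieve aromatic compounds, while the same ring drawn with normalized bonds will.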

DISPLAY — Seeing What You've Built

  • The DISPLAY command in structure building may be used at any step along the way to see what the current structure looks like. It is frequently added to a structure building command to save time.
    : gra c3, nod 2 o, dis
    
  • DIS SIA displays both the diagram and the attributes (see below) of the structure.

Attribute Commands

  • None of these commands directly changes the structure, but each affects how it will be searched.
  • Hydrogen Count (HCO) — lets you specify the number of hydrogens attached to a node.
    : hco 1 e2  (exactly 2 hydrogens)
    : hco 1 m1  (minimum 1 hydrogen)
    : hco 1 x2  (maximum 2 hydrogens)
    
  • Connect (con) — Works the same as HCO, but for non-hydrogen attachments.
  • Ring Specification (rsp) — eliminates the possibility of other rings fused onto the query ring system. The command consists of rsp and the node number of any atom in the ring system you wish to isolate.
  • Node Specification (nsp) — Messenger assumes that atoms can be only in rings or in chains. NSP lets you change that characteristic.
  • There are some other attributes as well, but these are by far the most commonly used ones.
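The e/m/x count codes used by HCO (and, per the text, by CON in the same way) can be read as a simple comparison. The following Python sketch uses invented function names and only models the comparison itself, not how Messenger applies it during a search.

```python
def parse_count(spec):
    """Split a count code such as 'e2', 'm1' or 'x2' into (mode, number)."""
    mode, number = spec[0].lower(), int(spec[1:])
    if mode not in {"e", "m", "x"}:
        raise ValueError("unknown count mode: " + spec)
    return mode, number


def count_ok(spec, actual):
    """Check an actual hydrogen (or connection) count against a count code."""
    mode, number = parse_count(spec)
    if mode == "e":           # exactly
        return actual == number
    if mode == "m":           # minimum
        return actual >= number
    return actual <= number   # 'x': maximum


# A CH2 carbon has two hydrogens.
print(count_ok("e2", 2), count_ok("m1", 2), count_ok("x2", 3))
```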

END — Going from Building to Searching

  • When all your structure building is complete, the END command creates an L# for the structure and returns you to the normal search mode.
  • You must END a structure before you can search it.

Types of Structure Searches

  • Messenger allows four types of structure search:
    • EXACT: Looks for the compound exactly as drawn; only possible variations are isotopic (and stereochemical if unspecified)
    • FAMILY: Same as above, but will also pick up salts (of acids) or polymers (of monomers).
    • CSS: Stands for Closed Substructure Search. This type of search will only allow substitution where you have specifically allowed it, as with a CONNECT attribute or the use of a variable or generic group.
    • SSS: Stands for Substructure Search. Will allow any substitution at any atom except as you have specifically restricted it.
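The difference between CSS and SSS can be pictured at the level of a single atom. The Python sketch below is a deliberately simplified model with invented names: it compares the number of non-hydrogen attachments on a query atom with the number on a candidate file atom, and it does not cover EXACT or FAMILY searches, generic groups, or anything else the real system considers.

```python
def substitution_ok(search_type, query_degree, file_degree,
                    opened=False, max_degree=None):
    """Decide whether a file atom's extra attachments are acceptable.

    query_degree / file_degree: non-hydrogen attachments on each atom.
    opened: the searcher explicitly allowed substitution here (CSS only).
    max_degree: an upper limit the searcher imposed (SSS only), or None.
    """
    if file_degree < query_degree:
        return False                  # the file atom lacks a drawn attachment
    extra = file_degree - query_degree
    if search_type == "CSS":
        return extra == 0 or opened   # closed unless explicitly opened
    if search_type == "SSS":
        return max_degree is None or file_degree <= max_degree
    raise ValueError("this sketch handles only CSS and SSS")


# A query atom drawn with 2 attachments meets a file atom that has 3.
print(substitution_ok("CSS", 2, 3), substitution_ok("SSS", 2, 3))
```

In other words, CSS starts closed and is opened atom by atom, while SSS starts open and is restricted atom by atom.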

Ranges of Searching

  • You can also specify how much of the database you wish to search:
    • SAMPLE: This is a fixed, randomly selected, 5% of the database. Always search this before doing a substructure search to see if the search will work. Sample searches are always free!
    • FULL: Self-explanatory
    • RANGE: You may specify a range of Registry Numbers to search; useful for update searches or to continue searches which were too big to complete in one step. Ranges of less than 100K RNs are cheaper than a full search.
    • SUBSET: Lets you use a previously created L# (by name, mol. formula, ring data, structure or combinations) as the defined set to search on. Can be a very powerful tool.

Structure Search Hints

  • When doing a structure search, always use SEARCH, not S. This way, the system will prompt you for type of search and range of search.
  • SAMPLE searches aren’t necessary for EXACT or FAMILY searches, but are strongly recommended for substructure searches.
  • If a structure is unsearchable (exceeds system limits), consider whether you can create a suitable subset with name fragments or molecular formula or ring information which would bring a subset search within system limits. Alternatively, modify the structure to make it more specific. Note that changing HCO or CON attributes does not affect the search at the screening level, so these limitations do not generally keep a search within system limits.

Structure Building Example: Feropolone

  • First, build the rings:
    :gra r6
    :gra r66, dis
    
  • Then, build the chain connecting the two:
    :gra 1 c6
    :gra 22 7
    
  • Now, build the side chains:
    :gra 2 c1, 2 c1, 3 c1, 5 c1, 5 c1, 11 c1, 19 c1, 19 c1, dis
    
  • Then, use the NOD command to change atoms as necessary:
    :nod 10 22 25 28 o, 23 24 29 me,  27 30 oh, dis
    
  • Now apply the BON command:
    :bon all se, dis
    :bon  3-25 11-28 12-13 de, dis
    :bon 7-8 8-9 9-14 14-15 15-16 16-7 n; dis
    
  • Then apply attribute commands as necessary.
  • When the structure is completed, use the END command to complete the structure and return to the regular search mode.
  • You may display completed structures online with “display query L#”
  • Search the query with “search L# [search type] [search range]”
    => search L1 exact full
    => search L1 sss sample
    EXAMPLE:
    
    => search l3
    
    ENTER TYPE OF SEARCH (SSS), CSS, FAMILY, OR EXACT:sss
    
    ENTER SCOPE OF SEARCH (SAMPLE), FULL, RANGE, OR SUBSET:full
    
    FULL SEARCH INITIATED 23:13:37
    FULL SCREEN SEARCH COMPLETED -     71 TO ITERATE
    100.0% PROCESSED      71 ITERATIONS                            1 ANSWERS
    SEARCH TIME: 00.00.02
    

This page created by Chuck Huber (huber@library.ucsb.edu).