
SDsearch 5.0 and PDBDock 1.0 Documentation

Contents

  1. What SDsearch does and who it is for
  2. SDsearch and PDBdock: Operating systems, installation and software dependencies
  3. SDsearch user options
  4. SDsearch examples

1. What SDsearch does and who it is for

SDsearch is an application that may be used to cherry-pick chemical compounds from vendor collections that meet user-defined chemical properties and/or which contain specified substructures. If a subfragment of the compound can be positioned within the three-dimensional structure of the protein target site (for example, from crystallographic fragment screening or by analogy with the structures of known bound ligands) the software may be used to check the compounds for steric fit to the dimensions of the site.

Although chemical vendors typically provide web-based substructure search interfaces, more precise and flexible searches, together with additional search criteria, are more conveniently and securely performed in the local computing environment. Unfortunately, the implementation of chemical database systems typically involves technical pre-requisites and an outlay of time and cost that is mainly appropriate for a professional chemist in a group setting. SDsearch is intended to provide a simple solution to chemical searching for other scientists, who wish to quickly find, purchase and test compounds for protein binding.

The basic concept of SDsearch is that the user inputs a set of chemical compounds as an SD file (the format available from most chemical vendors) and the application outputs an SD file containing only the compounds that meet the search criteria. Following visual inspection, the filtered list of compounds provides the available molecules that might be purchased for experimental binding or activity analysis.

SDsearch has been used to design small fragment libraries for protein crystallographic screening, to design scaffold libraries containing specific kinase binding motifs for activity-based screening and to select follow-on compounds to initial hits from fragment and scaffold screening.

1.1 Chemical property filters

Chemical property filters that may be applied by SDsearch include:

SDsearch currently understands the annotation tags for molecular weight, numbers of hydrogen bond acceptors and donors, solubility (clogP), number of rotatable bonds and accessible polar surface area for compound collections from several major chemical vendors. Mechanisms are also present to include novel tags for these properties and, if necessary, calculate values. These property filters are typically used to restrict compounds to variations on the ‘rule-of-three’ (Ro3, for fragment and scaffold screening) or the ‘rule-of-five’ (Ro5, for early lead molecules) and presets are available for these selections.
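As an illustration of what such annotation tags look like in practice, the following minimal Python sketch reads the data items attached to each compound in SD-format text. It relies only on the general SD file layout (records terminated by a `$$$$` line, data items introduced by a `> <TAG>` header line and closed by a blank line). The tag names `ID` and `MW` in the demo are placeholders rather than entries from property_tags.txt, and this is not SDsearch's own reader.

```python
import re

# A data item header looks like "> <TAG>"; other text may surround the tag.
TAG_HEADER = re.compile(r"^>.*<([^>]+)>")

def _parse_record(lines):
    """Split one SD record into its molblock text and a dict of data items."""
    mol_lines, tags = [], {}
    current = None         # tag whose value lines are being collected
    in_data = False        # becomes True once the first data header is seen
    for line in lines:
        header = TAG_HEADER.match(line)
        if header:
            current = header.group(1)
            tags[current] = []
            in_data = True
        elif current is not None:
            if line.strip():
                tags[current].append(line)
            else:
                current = None          # blank line closes the data item
        elif not in_data:
            mol_lines.append(line)      # still inside the molblock
    return "\n".join(mol_lines), {k: "\n".join(v) for k, v in tags.items()}

def read_sd_records(text):
    """Yield (molblock, tags) for every record in SD-format text."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield _parse_record(record)
            record = []
        else:
            record.append(line)
    if any(line.strip() for line in record):
        yield _parse_record(record)     # tolerate a missing final delimiter

# Demo record with an empty atom block; tag names are placeholders.
demo_sd = "\n".join([
    "cmpd-1", "  sketch", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <ID>", "V001", "",
    "> <MW>", "151.2", "",
    "$$$$",
])
demo_records = list(read_sd_records(demo_sd))
```

A filter along the lines described above would then convert the tag values it recognizes (here `MW`) to numbers and compare them against the selected property ranges.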

1.2 Chemical substructure filters

Mechanisms for searching for compounds that contain specific substructures include:

1.3 Identity filters

SDsearch may also compare libraries (in the form of SD files or vendor IDs) in order to avoid compound duplication when selecting compounds via different search criteria or from multiple vendors.

1.4 Protein target site filters and scaffold hopping

The protein target site filters are intended to eliminate screening library compounds that appear incompatible with the dimensions of the protein target site. The initial docking of compounds in the library requires a prepositioned fragment in the three-dimensional structure of the protein. This fragment might have been obtained directly from crystallographic small fragment screening data. Alternatively, the fragment might be a small binding substructure (as little as 3 atoms) positioned in the protein target site based on analogy with the crystal structures of known protein-ligand complexes.

SDsearch automatically superimposes each test compound onto the fragment by finding and matching all the atomic sites in the common substructure. A substructure match is rejected if the rms deviation between the positions of the common ligand and fragment atoms exceeds 0.5 Å. Low-energy conformers of the compound are generated and checked for interactions with the surrounding protein structure. An optimization procedure on the ligand atomic positions is used to relieve minor clashes between compound and protein. The SDsearch software normally uses OpenBabel to generate a 'base' 3D structure from the 2D representations found in the input SD file. Occasionally, OpenBabel produces poor 3D structures. However, if the input SD file already contains 3D information, that information is used for the initial 3D structure.
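The rms deviation test can be sketched in a few lines of Python. This is only an illustration of the acceptance criterion, not the SDsearch implementation: it assumes the matched atoms have already been paired and superimposed (the least-squares fit itself is not shown), and the function names and demo coordinates are hypothetical.

```python
import math

RMS_LIMIT = 0.5  # Å, the rejection threshold quoted in the text

def rms_deviation(coords_a, coords_b):
    """Root-mean-square distance between two equally ordered atom lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(total / len(coords_a))

def match_is_acceptable(fragment_coords, ligand_coords, limit=RMS_LIMIT):
    """Accept a substructure match only if the rms deviation is within the limit."""
    return rms_deviation(fragment_coords, ligand_coords) <= limit

# Demo: a three-atom fragment and two candidate matches (made-up coordinates).
fragment = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
close_match = [(x + 0.3, y, z) for x, y, z in fragment]  # every atom 0.3 Å off
poor_match = [(x + 0.6, y, z) for x, y, z in fragment]   # every atom 0.6 Å off
```

With these demo coordinates `close_match` falls within the limit and `poor_match` does not.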

A compound is accepted as a feasible candidate for protein binding studies if a conformer is found for which:

The ensemble of feasible conformers should contain a pose close to the 'true' binding mode and might be used as the starting point for more detailed molecular modeling studies. The protein target site filters are not aimed at providing an assessment of binding free energy but are intended as a relatively quick way to eliminate useless compounds from experimental consideration.

Note: This docking capability may also be run without any protein atoms in the input PDB file. In this case the user may wish to provide a pseudo-fragment that contains a constellation of critical atoms, and the software will simply find compounds that match the 3D arrangement of those atoms. In other words, the software finds compounds via 'scaffold hopping', since the matched compounds may contain differing chemical substructures.

1.5 Protein structure alignment with PDBdock

PDBdock is a utility program for automatically superimposing large numbers of closely related structures (identical or very nearly identical sequences) based on the CA atoms within a relatively small region.

A typical use for PDBdock would be to superimpose multiple cocrystal structures based on the compound binding sites, thereby obtaining an accurate alignment of all of the compounds within the binding site. A separate file for each aligned target structure is written out for analysis. Currently PDBdock supplies one superposition per structure file. If superpositions over multiple copies of a structure within a crystal asymmetric unit are required, the individual protein copies will need to be split into different files.

PDBdock commands are documented via help dialogs specified by the '?' alongside each item in the GUI. The basic version of PDBdock is free and does not require a license.

2. SDsearch and PDBdock: Operating systems, installation and software dependencies

2.1 Operating systems

SDsearch and PDBdock are written in Python and compiled to run on Windows. Implementations of the software on other operating systems will depend on user demand.

2.2 Installation

The SDsearch and PDBdock software are provided as applications within a zipped folder, FBLDapps5.zip.

To install the SDsearch and PDBdock software:

  1. Uncompress the FBLDapps5.zip archive
  2. Move the resulting FBLDapps5 folder to the top level of Local Disk (C:), so that the contents of the application folder are located in the folder C:\FBLDapps5.
  3. Create a shortcut to the SDsearch graphical interface by right-clicking on the application file C:\FBLDapps5\SDsearchGUI.exe and selecting the ‘Create Shortcut’ option. Move the SDsearchGUI.exe shortcut icon to your desktop or other convenient location.
  4. Create a shortcut to the PDBdock graphical interface by right-clicking on the application file C:\FBLDapps5\PDBdock.exe and selecting the ‘Create Shortcut’ option. Move the PDBdock.exe shortcut icon to your desktop or other convenient location.
  5. The SDsearch software requires a license in order to use it but PDBdock is freely available for use without licensing.
    Add your license file, license.txt, to the C:\FBLDapps5 folder. In the absence of a license file the SDsearch GUI will load but the software will not run. Licenses may be obtained from DeltaG Technologies. Licenses from previous versions of SDsearch may be transferred to this installation.
  6. Optionally, note that the file C:\FBLDapps5\property_tags.txt provides a standard set of annotation tags for compound properties that have been culled from the SD files supplied by major chemical vendors. These tags will be recognized by the SDsearch application. If your SD files use other tags for the same chemical properties these may be added to the property_tags.txt file so that they will be understood by the program. (A backup copy of this file, called property_tags_backup_copy.txt, is also present in case of editing mistakes!).

The SDsearch and PDBdock GUIs should launch when you double-click on the SDsearchGUI and PDBdock shortcuts. You should see the GUI and a text window that is used to relay in-process calculation information to the user. Selecting Quit exits from the application.

2.3 External software dependencies

PDBdock is a standalone application, without any external software dependencies.

Some of the more advanced aspects of the SDsearch application employ executable programs from the open source software project, OpenBabel. Specifically, molecular property calculations, SMILES string substructure searches and some aspects of the compound docking calculations rely on operations performed with the OpenBabel software. OpenBabel is not included with the SDsearch installation because this could violate the terms of the open source license, but installers for OpenBabel may be obtained from the OpenBabel installer page. SDsearch is compatible with versions 2.* of OpenBabel.

SDsearch is not a program for viewing the chemical structures of compounds in an SD file. One suitable free viewer for this purpose is the ChemFileBrowser from Hyleos. Note that when an SD file is opened by ChemFileBrowser it is locked and is inaccessible to SDsearch.

3. SDsearch user options

The SDsearch GUI divides the user input into a required input and three sections that may be used to run different types of searches:

The Run button launches calculations and the Quit button is used to shut down the GUI. Information may be entered through the GUI in any order.

3.1 Required Information

All searches require that the user enters the file name for the SD file containing the set of compounds that will be assessed. This input is achieved by clicking on the Load input SD file button, which brings up a file browser, keyed to show files containing the extension .sdf.

The Set output SD file button is used to establish the name and location of the output file of compounds that meet the search criteria. The root file name should be entered in the file name field of the file browser that appears after clicking this button. Each search will result in three output files with names based on this root: an SD file containing the filtered compounds, a text file containing the vendor compound ID codes for the filtered compounds and a log file containing diagnostics and analysis information from the search.

The Calculate properties checkbox may be selected if compound properties were not supplied by the vendor as annotations within the input SD file. Note that for mixed SD files, where annotation is incomplete across compounds, values for all compound properties will be filled in so as to maintain calculation consistency. Properties that may be calculated are molecular weight, the number of hydrogen bond donors, the number of hydrogen bond acceptors, the polar surface area and solubility (as logP). Be aware that different property calculation programs do not give identical results, particularly in the classification of atoms as polar (especially S atoms) and the handling of nitrogen atoms within aromatic rings; it is not guaranteed that the program used by a vendor gave values identical to those that will be obtained by local property calculations.

3.2 Chemical Property Filters

The property filter buttons may be used to specify the chemical property range for the acceptable compounds by loading preset values. By default, the property values applied by SDsearch correspond to the No filters selection.

The Fragment option applies criteria that limit compounds to very small compounds (typically containing individual or fully conjugated cyclic groups) that might be included in crystallographic fragment screening libraries.

The Scaffold option applies similar criteria to those for the fragment compounds but is geared to accessing slightly larger compound types (typically molecules containing two cyclic groups connected by a short linker).

The Lead option sets compound properties similar to the Lipinski Rule-of-five.

The No filters option eliminates application of property filters. This option is useful when a compound set has already been filtered to an appropriate range of properties, as it both avoids the risk of accidentally applying different filters and speeds up subsequent calculation runs.

The values applied for the fragment, scaffold and lead defaults are listed in the table below.

 

Preset    No. atoms  MW (Da)    No. H-bond acceptors  No. H-bond donors  clogP   No. rot. bonds  Polar surface area (Å²)
Fragment  8 – 20     100 – 250  0 – 3                 0 – 3              -4 – 3  0 – 3           5 – 60
Scaffold  12 – 22    150 – 275  0 – 3                 0 – 3              -4 – 3  0 – 3           5 – 60
Lead      12 – 35    150 – 450  0 – 10                0 – 5              -4 – 5  0 – 5           5 – 150

Table 1 Property values corresponding to SDsearch GUI presets
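The way such presets act on a compound can be sketched as follows. The ranges are copied from Table 1; everything else is an assumption made for illustration: the dictionary keys, the function name, the inclusive treatment of the range limits, and the choice to skip any property that is missing for a compound (in line with the run log behaviour described in section 4.2, where a filter is not applied when its annotation is absent).

```python
# (min, max) ranges from Table 1; key names are hypothetical, not SDsearch tags.
PRESETS = {
    "fragment": {"natoms": (8, 20), "mw": (100, 250), "hba": (0, 3), "hbd": (0, 3),
                 "clogp": (-4, 3), "rotb": (0, 3), "psa": (5, 60)},
    "scaffold": {"natoms": (12, 22), "mw": (150, 275), "hba": (0, 3), "hbd": (0, 3),
                 "clogp": (-4, 3), "rotb": (0, 3), "psa": (5, 60)},
    "lead":     {"natoms": (12, 35), "mw": (150, 450), "hba": (0, 10), "hbd": (0, 5),
                 "clogp": (-4, 5), "rotb": (0, 5), "psa": (5, 150)},
}

def passes_preset(properties, preset_name):
    """Return True if every available property lies within the preset range.

    Properties absent from `properties` are skipped rather than failed
    (an assumption for this sketch). Range limits are treated as inclusive.
    """
    for key, (low, high) in PRESETS[preset_name].items():
        value = properties.get(key)
        if value is None:
            continue
        if not low <= value <= high:
            return False
    return True

# Demo compounds with made-up property values.
small = {"natoms": 14, "mw": 190.2, "hba": 2, "hbd": 1, "clogp": 1.8, "rotb": 2, "psa": 45.0}
large = {"natoms": 28, "mw": 390.0, "hba": 6, "hbd": 2, "clogp": 3.5, "rotb": 4, "psa": 95.0}
partly_annotated = {"natoms": 14, "clogp": 1.8}  # no MW annotation available
```

In this sketch `small` passes the fragment preset, `large` fails it but passes the lead preset, and `partly_annotated` is judged only on the two properties it carries.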

The User file option allows the user to create and load a personal template text file containing selected property ranges. Any properties not defined in this template keep their currently active values. An example of such a template is the file of fragment property presets, C:\FBLDapps\standard_fragment_defaults (be careful not to change this file!). The notation for the parameter tags within the property preset files, which may also be used in a user-defined property file, is explained in the table below.

Property tag        Meaning                                   Value type
#NATOMS_MIN#        Minimum number of non-H atoms             Integer number
#NATOMS_MAX#        Maximum number of non-H atoms             Integer number
#MW_MIN#            Minimum molecular weight                  Real number
#MW_MAX#            Maximum molecular weight                  Real number
#HBD_MIN#           Minimum number of H-bond donors           Integer number
#HBD_MAX#           Maximum number of H-bond donors           Integer number
#HBA_MIN#           Minimum number of H-bond acceptors        Integer number
#HBA_MAX#           Maximum number of H-bond acceptors        Integer number
#CLOGP_MIN#         Minimum solubility                        Real number
#CLOGP_MAX#         Maximum solubility                        Real number
#RB_MIN#            Minimum number of rotatable bonds         Integer number
#RB_MAX#            Maximum number of rotatable bonds         Integer number
#TPSA_MIN#          Minimum polar surface area                Real number
#TPSA_MAX#          Maximum polar surface area                Real number
#BBB_CALCULATION#   Apply a loose blood-brain-barrier filter  Text (yes/no)
#PROPERTY_FILTERS#  Apply or void all property filters.       Text (yes/no)
                    Required to invoke user-defined property filters.

Table 2 Annotation tags for use in setting a file of user defined property filters

The notation for user defined docking parameters (not normally applied) is defined in the table below.

Docking tag     Meaning                                                                                            Value type
#MAXDOCK_CONF#  Maximum number of rotatable bonds allowed for docking (default 5, max 8)                           Integer number
#MAX_CONF#      Number of docking conformers (default is to automatically define)                                  Integer number
#CP_SIGMA#      Standard deviation for defining the range of the atomic marker contact energy (default is 0.75 Å)  Real number

Table 3 Annotation tags for use in setting a file of user defined docking parameters

3.3 Identity Filters

The identity filters buttons provide two mechanisms for excluding compounds from the output SD file.

Vendor IDs

The ID list file load button activates a file browser which may be used to load a text file containing a list of compound ID values (one ID per line) for exclusion. The list might be generated by hand after examining compounds in an SD file, or it might be an edited version of the ID list file output from a previous SDsearch run. Note that compound identification tags are currently treated as a compound property, so the ID check will be bypassed if the Property Filters/No defaults option is selected. If compound ID annotations are not present but the SD file contains molecule name information in each compound header block, the molecule name will be used instead.

The ID list file remove button turns off filtering by compound ID values.

Reference SD file

The Reference SD file load button provides a mechanism for loading a second, reference SD file and checking the chemical structures of compounds in your input SD file for duplication in the reference SD file. Matched compounds are then removed. This capability might be used when working with compound collections from multiple vendors in order to avoid duplication. Similarity checking can be relatively slow and may require ~2 hours to check 1200 compounds against a reference set of similar size.

The Reference SD file remove button turns off filtering by chemical structure identity.

3.4 Substructure Filters

SDsearch implements two different mechanisms for selecting compounds through chemical substructures.

SMILES strings

The SMILES file load button allows the user to load a text file containing a substructure represented by a SMILES string, selecting only compounds containing that substructure. Most chemical drawing programs (for example, the ChemSketch software from ACD) provide a mechanism for generating SMILES strings. The SDsearch application will also accept a list of SMILES strings, in which case it also creates a separate SD file for the structures matching each SMILES string. By creating a file containing multiple SMILES strings, compound collections may be analyzed in terms of chemical core classes. When using multiple SMILES strings the processing is (deliberately) order-dependent: once a compound is assigned as matching a SMILES string it is not analyzed again. This means that issues with, for example, benzene groups appearing within naphthalene classes can be avoided by placing the naphthalene substructure higher in the SMILES string list than benzene. In general, larger and more specific SMILES strings should appear higher in the list.

The SMILES file remove button turns off filtering by SMILES string substructure matching.
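The order-dependent assignment of compounds to core classes can be illustrated with a short Python sketch. The real substructure test is delegated to OpenBabel in SDsearch; here it is replaced by a caller-supplied `contains` predicate and a toy lookup table, both of which are hypothetical stand-ins used only to show the first-match-wins logic.

```python
def assign_core_classes(compounds, cores, contains):
    """Assign each compound to the first core in `cores` that it contains.

    `contains(compound, core)` is a stand-in for a real substructure test.
    Compounds matching no core are returned separately.
    """
    classes = {core: [] for core in cores}
    unassigned = []
    for compound in compounds:
        for core in cores:
            if contains(compound, core):
                classes[core].append(compound)
                break                   # first match wins; do not test further
        else:
            unassigned.append(compound)
    return classes, unassigned

# Toy substructure table: a naphthalene-containing compound also contains benzene.
TOY_SUBSTRUCTURES = {
    "1-naphthol": {"naphthalene", "benzene"},
    "phenol": {"benzene"},
    "cyclohexanol": set(),
}

def toy_contains(compound, core):
    return core in TOY_SUBSTRUCTURES[compound]

names = list(TOY_SUBSTRUCTURES)
specific_first, leftover = assign_core_classes(names, ["naphthalene", "benzene"], toy_contains)
generic_first, _ = assign_core_classes(names, ["benzene", "naphthalene"], toy_contains)
```

With naphthalene listed first, 1-naphthol lands in the naphthalene class and phenol in the benzene class; with benzene listed first, both end up under benzene and the naphthalene class stays empty.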

Exposed atomic groups

The second selection mechanism uses the Motif(s) text box, which is either empty or contains the text ‘none’ when inactive. The text box may contain 1 or 2 separate selections indicating exposed interaction groups. The format of the notation is either A-B or A-B-C, where A, B and C are element codes and ‘-’ indicates a single bond, which may be replaced with ‘=’ (double bond) or ‘#’ (triple bond). The example A-B indicates that atom B is exposed and only connected to atom A. The example A-B-C indicates that atom B is exposed and only connected to atoms A and C. The motif mechanism is relatively rapid and is useful for capturing compounds when a required type of interaction group is indicated. For example, setting ‘c=o’ requires that all compounds contain a carbonyl, and the notation ‘c-n=c’ requires that all compounds contain a nitrogen hydrogen-bond acceptor. The element codes are case insensitive.
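The motif notation is simple enough to parse with a regular expression. The sketch below is an illustrative reading of the A-B and A-B-C formats described above, not SDsearch's own parser; the function name and the returned dictionary layout are assumptions.

```python
import re

BOND_ORDERS = {"-": 1, "=": 2, "#": 3}

# Two or three element codes (one or two letters each) joined by bond symbols.
MOTIF_RE = re.compile(
    r"^([A-Za-z]{1,2})([-=#])([A-Za-z]{1,2})(?:([-=#])([A-Za-z]{1,2}))?$"
)

def parse_motif(text):
    """Parse 'A-B' or 'A-B-C' into elements, bond orders and the exposed atom B."""
    match = MOTIF_RE.match(text.strip())
    if match is None:
        raise ValueError("not a valid motif: %r" % text)
    atom_a, bond_ab, atom_b, bond_bc, atom_c = match.groups()
    # Element codes are case insensitive, so normalise to e.g. 'Br'.
    elements = [code.capitalize() for code in (atom_a, atom_b, atom_c) if code]
    bond_orders = [BOND_ORDERS[symbol] for symbol in (bond_ab, bond_bc) if symbol]
    return {"elements": elements, "bond_orders": bond_orders, "exposed": elements[1]}

carbonyl = parse_motif("c=o")
acceptor_n = parse_motif("c-n=c")
bromide = parse_motif("C-BR")
```

In both formats the exposed atom is B, the second element in the motif, which is why `exposed` is always taken from position 1.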

3.5 Protein Target Site Filters

SDsearch includes compound modeling algorithms that can be used to evaluate library compounds that contain substructures at known sites within the protein. These libraries might consist of compounds designed ('grown') from hits in crystallographic fragment screening experiments. More generally, the key binding substructure might have been identified and positioned by analogy with other protein-ligand complexes. This application processes libraries at the rate of several seconds per compound, so it is mainly suited to analysing small libraries (tens to hundreds of compounds) that have already been filtered by other methods. Run time depends on the number of rotatable bonds within the ligand (default maximum of 5). Note that the number of rotatable bonds for docking is determined by the software and may differ slightly from that provided within the SD file annotation.

The PDB protein coordinate file (which also contains the target fragment) and the fragment ID are required items for defining the targets of the filtering process. Compounds for which no available pose can be accommodated within the target site without clashing with the protein are rejected.

The Coordinate file load button must be used to load a PDB coordinate file containing the protein model and the positioned fragment in the target site. All atoms in the fragment will be matched, so in some cases it may be necessary to create an edited fragment containing only key atoms from the experimentally determined fragment. It is best if the fragment atoms comprise a relatively rigid substructure. If this file does not contain any protein atoms then the application functions in a scaffold hopping mode: it will find all matches to the target atoms in the fragment. The remove button may be used to remove this selection.

The Fragment ID must be used to identify the fragment within the protein coordinate file. The field should contain the single character chain ID code (if present) followed by the 'residue' identification number for the fragment.
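To make the Fragment ID convention concrete, the following sketch splits an ID such as `A501` into chain and residue number and pulls the matching atoms out of PDB-format text using the standard fixed columns (chain ID in column 22, residue number in columns 23-26, coordinates in columns 31-54). It is an illustration only: the function names, the demo coordinates and the treatment of a missing chain ID as a blank chain column are assumptions, not a description of SDsearch internals.

```python
def parse_fragment_id(fragment_id):
    """Split e.g. 'A501' into ('A', 501); '501' gives (' ', 501)."""
    text = fragment_id.strip()
    if text and text[0].isalpha():
        return text[0], int(text[1:])
    return " ", int(text)   # no chain ID: assume a blank chain column

def fragment_atoms(pdb_text, fragment_id):
    """Return (atom name, x, y, z) for atoms in the residue named by fragment_id."""
    chain, resnum = parse_fragment_id(fragment_id)
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
            continue
        if line[21] == chain and int(line[22:26]) == resnum:
            atoms.append((
                line[12:16].strip(),
                float(line[30:38]), float(line[38:46]), float(line[46:54]),
            ))
    return atoms

def _demo_line(record, serial, name, resname, chain, resseq, x, y, z):
    """Build a fixed-column PDB atom line for the demo below."""
    return "{:<6}{:>5} {:<4} {:>3} {:1}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
        record, serial, name, resname, chain, resseq, x, y, z
    )

# Demo: one protein atom and a two-atom fragment (made-up coordinates).
demo_pdb = "\n".join([
    _demo_line("ATOM", 1, "CA", "ALA", "A", 10, 5.0, 5.0, 5.0),
    _demo_line("HETATM", 2, "C1", "LIG", "A", 501, 1.0, 2.0, 3.0),
    _demo_line("HETATM", 3, "N2", "LIG", "A", 501, 2.2, 2.0, 3.0),
])
demo_fragment = fragment_atoms(demo_pdb, "A501")
```

Here `demo_fragment` contains only the two `LIG` atoms; the protein atom in residue 10 is ignored.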

The VDW shrinkage factor field specifies a small decrement (default 0.2 Å) on protein and compound VDW radii before a VDW violation is incurred. This factor helps accommodate some inaccuracies in the pose modeling and protein conformational changes. With small ligands it may be possible to reduce this value to 0.0, and with very large ligands it may need to be slightly increased. However, the ability of the software to discriminate between feasible and non-feasible compounds will diminish if this factor is too large.

The Marker file load button provides an option to load a 'marker file' of atoms or pseudoatoms in PDB format that may be used to guide the compound modeling towards required conformations. The marker file could be a previously known ligand that maps out the dockable space in the protein. The file is used to create an atomic population presence density which may be used to score a compound conformation for overlap of marker atoms. The scoring function has unit value for a ligand atom on top of a marker atom and decreases according to the function exp(-r²/s²), where r is the distance of a compound atom from a marker atom. The default value of s is 0.75 Å but may be altered using the #CP_SIGMA# keyword. Note that the input fragment atoms are included in this score. The remove button eliminates application of the marker file.
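A possible reading of this score is sketched below. The text specifies the per-atom function exp(-r²/s²) but not how several marker atoms are combined, so the sketch assumes that each compound atom is scored against its nearest marker atom and that the per-atom values are summed; the function name is hypothetical and the actual SDsearch scoring may differ.

```python
import math

def marker_contact_score(compound_atoms, marker_atoms, sigma=0.75):
    """Sum exp(-r^2 / sigma^2) over compound atoms.

    r is taken as the distance to the nearest marker atom (an assumption).
    sigma corresponds to s in the text (default 0.75 Å).
    """
    if not marker_atoms:
        raise ValueError("at least one marker atom is required")
    score = 0.0
    for atom in compound_atoms:
        nearest_r2 = min(
            sum((a - m) ** 2 for a, m in zip(atom, marker))
            for marker in marker_atoms
        )
        score += math.exp(-nearest_r2 / sigma ** 2)
    return score

# Demo with made-up coordinates: two atoms sit on markers, one is 0.75 Å away.
markers = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.75, 0.0, 0.0)]
pose_score = marker_contact_score(pose, markers)
```

Under these assumptions an atom exactly on a marker contributes 1, and an atom one s away contributes exp(-1), so the score can be read roughly as the number of compound atoms that overlap the marker set.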

The Min marker contact score field specifies the minimum acceptable score for overlap of a compound conformation with a set of marker atoms in order to accept the pose as viable. Typical values for the score are 3.0-4.0, for a compound partially overlapping a marker atom set but this may vary significantly in different cases.

The Skip ligand optimization checkbox may be used to omit the optimization of the ligand conformation to fit the target site in the protein. This option speeds up calculations but may result in minor clashes between protein and ligand causing the compound to be rejected. The option may be useful for preliminary calculations when used with relatively large VDW shrink factors.

The Write docked ensemble checkbox controls the processing of ligand conformation generation. If not checked, the conformer generation stops for a compound once an acceptable pose is found. If checked, a full ensemble of ligand conformations is created and all acceptable poses for each compound are written into an output PDB file. This PDB file is created in the same folder as the output SD file and named after the vendor ID of the compound (if present). Generation of the full ensemble is a much lengthier process than simple screening and is only needed if the user wishes to visualize docked conformations. At the present time SDsearch applies simple but relatively reliable restrictions on possible poses. It is expected that the ensemble contains a low energy ligand conformation close to the true structure but, depending on the available space and flexibility of the compound, it might also contain very different poses.

4. SDsearch examples

Note that the example SD file data is not included in the SDsearch distribution.

4.1 A note on run times

Since the number of compounds within an input SD file may vary from tens to hundreds of thousands, it will often be difficult to estimate the software run time in advance. It may be safer to divide a complex search into a series of SDsearch runs, using output from one run as input for the next. In this way it is also easier to back up and rerun parts of the search with modified parameters, without having to repeat the entire procedure, if results at one step are unsatisfactory.

The various types of search operations that may be executed within SDsearch are arranged in an order that will usually minimize run times for common tasks.   If all operations were executed in a single run the order would be: (i) filter on compound atom count, (ii) filter on compound properties, (iii) filter on compound id/similarity, (iv) filter on interaction motif, (v) filter on SMILES description, (vi) filter on protein target site structure. Use of this final filter requires considerably more time than other searches, typically many seconds for each candidate compound, depending on the number of rotatable bonds.
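The benefit of this ordering is that each stage only sees the compounds that survived the cheaper stages before it. A minimal sketch of such a staged filter is given below; the stage names follow the list above, but the predicates, property keys and demo compounds are invented for illustration and do not reflect SDsearch internals.

```python
def run_filters(compounds, stages):
    """Apply (name, predicate) stages in order; return survivors and per-stage counts."""
    survivors = list(compounds)
    counts = []
    for name, keep in stages:
        survivors = [compound for compound in survivors if keep(compound)]
        counts.append((name, len(survivors)))
    return survivors, counts

# Made-up compounds; keys are hypothetical.
demo_compounds = [
    {"id": "V001", "natoms": 12, "clogp": 1.5, "has_br": True},
    {"id": "V002", "natoms": 30, "clogp": 2.0, "has_br": True},
    {"id": "V003", "natoms": 15, "clogp": 4.5, "has_br": False},
    {"id": "V004", "natoms": 10, "clogp": 0.5, "has_br": False},
]

# Cheap checks first, more expensive checks later.
demo_stages = [
    ("atom count", lambda c: 8 <= c["natoms"] <= 20),
    ("properties", lambda c: -4 <= c["clogp"] <= 3),
    ("interaction motif", lambda c: c["has_br"]),
]

kept, stage_counts = run_filters(demo_compounds, demo_stages)
```

In the demo the atom count stage removes one compound, the property stage a second and the motif stage a third, so any slow final stage (such as the protein target site filter) would only need to process the single remaining compound.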

One trick for improving performance in the initial search is to always assume that compound properties are annotated within the SD file i.e. leave the Required Information/Calculate Properties box unchecked. If the run log then indicates that some of the properties were not available (often the molecular weight annotation is absent although in this case the user may wish to just use the atom count check as a sufficient proxy) it will be much faster to rerun with property calculation active on just the compounds that were output on this first run. This is because, although property calculations are relatively fast, when an initial input file contains 500,000 compounds a run time of some hours might be anticipated. However, with some property filters in place (or most compounds fully annotated) the initial generation of a fragment/scaffold library may well reduce the compound list to <5000 compounds that need to be passed through a repeat calculation.

4.2 Create a low molecular weight Ro3 fragment library from a large screening collection

  1. Click on the Load input SD file button and use the resulting file browser to navigate to the folder containing the input SD file. Load the input SD file by clicking on it and then select the Open button in the file browser window.
  2. Click on the Set output SD file button and use the resulting file browser to navigate to the directory where you wish the output to appear. Enter the file name root in the File name field in the file browser and select the Save button in the file browser window.
  3. Fragment properties are the software default so there is no need to select from the Property Filters options unless a prior selection in this session has been made.
  4. Click on the Run button at the bottom of the SDsearch GUI.

In-process calculation data is displayed in the text port. At the conclusion of the calculation there will be three output files. If the root file name ‘pass1’ was entered in step 2 above, these files will be called ‘pass1.sdf’, ‘pass1_ids.txt’ and ‘pass1_log.log’.

When applied to the Chembridge ‘exp_sdf’ screening collection (~440,000 compounds) the text port information (reproduced in the output log file) shows that ~91,000 compounds were prefiltered by the requirement that compounds contain 8-20 non-hydrogen atoms and that ~18,000 compounds are finally output following the property analysis. The run time for this calculation on a moderately high end laptop computer is under 4 minutes. The output annotation shows that the MW filter was not applied since this annotation is not present in the input SD file. The user may now choose to apply a strict molecular weight filter by calculating this property.

  1. Use the Load input SD file button to load the output SD file from the initial run (‘pass1.sdf’).
  2. Use the Set output SD file button to create a new output file name (say, ‘pass2’).
  3. Select the Calculate properties checkbox.
  4. Click on the Run button at the bottom of the SDsearch GUI.

This calculation takes ~7 minutes and reduces the ~18,000 input compounds to ~9,000 compounds that obey the precise molecular weight criteria. (Note: The calculation time is relatively long because all properties are calculated for the input compounds. It would be faster, and arguably better, to fine-tune the allowed molecular size range using the number of non-hydrogen atoms, which could be input from a parameter template file containing modified #NATOMS_MIN# and #NATOMS_MAX# items through the User defaults button.)

4.3 Determine all compounds in a library that contain an exposed Br atom

  1. Click on the Load input SD file button and use the resulting file browser to navigate to the folder containing the input SD file. Load the input SD file by clicking on it and then select the Open button in the file browser window.
  2. Click on the Set output SD file button and use the resulting file browser to navigate to the directory where you wish the output to appear. Enter the file name root in the File name field in the file browser and select the Save button in the file browser window.
  3. If the compound properties for this input compound set are acceptable, select the Property Filters/No defaults button. Otherwise make an appropriate selection from Property Filters.
  4. Enter the text ‘c-br’ (to indicate a carbon-bromine group) in the Motif(s) text box.
  5. Click on the Run button at the bottom of the SDsearch GUI.

This calculation takes ~0.5min for the ~18,000 compounds exported from the initial pass in the calculation described in section 4.2, with ~1450 Br-containing compounds output.

4.4 Analyze a fragment library for the example cyclic cores identified by Hartshorn et al. (Fragment-Based Lead Discovery Using X-ray Crystallography, J. Med. Chem. 48, 403-413, 2005)

  1. Click on the Load input SD file button and use the resulting file browser to navigate to the folder containing the input SD file. Load the input SD file by clicking on it and then select the Open button in the file browser window.
  2. Click on the Set output SD file button and use the resulting file browser to navigate to the directory where you wish the output to appear. Enter the file name root in the File name field in the file browser and select the Save button in the file browser window.
  3. If the compound properties for this input compound set are acceptable, select the Property Filters/No defaults button. Otherwise make an appropriate selection from Property Filters.
  4. Click on the Load SMILES file button and use the resulting file browser to navigate to the folder containing the input list of SMILES string containing the cores. (The installation directory contains an example file, called smiles_hartshorn_fig2.txt). Load the input smiles file by clicking on it and then select the Open button in the file browser window.
  5. Click on the Run button at the bottom of the SDsearch GUI.

This calculation takes about 0.5 min when applied to the 352 compound small fragment library supplied by Zenobia Therapeutics. The output log shows that 18/19 of the cores tested are present in this library and, in addition to the usual output files, a separate SD file is created for each core class. The small number of compounds not assigned to these core classes are captured in an SD file prefixed ‘nosmiles_’. In some cases an examination of the core groups indicates that it might be better to provide additional core class types to subdivide compounds in the calculation. For example, any core types containing a benzene as a subgroup but which are not independently identified by a SMILES string above benzene in the SMILES string list are loaded into the benzene class.
