Lomap Legacy API#

class lomap.dbmol.DBMolecules(directory: str, parallel: int = 1, verbose: str = 'off', time: int = 20, ecrscore: float = 0.0, threed: bool = False, max3d: float = 1000.0, element_change: bool = True, output: bool = False, name: str = 'out', output_no_images: bool = False, output_no_graph: bool = False, display: bool = False, allow_tree: bool = False, max: int = 6, cutoff: float = 0.4, radial: bool = False, hub: str | None = None, fast: bool = False, links_file: str | None = None, known_actives_file: str | None = None, max_dist_from_actives: int = 2, use_common_core: bool = True, shift: bool = True)#

This class is used as a container for all the Molecules

Initialization of the Molecule Database Class

Parameters:
  • directory (str) – the name of the directory containing the mol2/sdf files

  • parallel (int) – the number of cores used to generate the similarity score matrices

  • verbose (str) – verbose mode, one of ‘off’, ‘info’, or ‘pedantic’

  • time (int) – the maximum time in seconds used to perform the MCS search

  • ecrscore (float) – the electrostatic score to be used (if != 0) if two molecules have different charges

  • threed (bool) – If true, symmetry-equivalent MCSes are filtered to prefer the one with the best real-space alignment

  • max3d (float) – The MCS is filtered to remove atoms which are further apart than this threshold. The default of 1000 is effectively “no filter”

  • element_change (bool) – Whether to allow changes in elements between two mappings. Defaults to True

  • output (bool) – whether to generate the output files

  • name (str) – the file name prefix used to produce the output files

  • output_no_images (bool) – a flag used to disable the generation of the output image files

  • output_no_graph (bool) – a flag used to disable the generation of the output graph (.dot) file

  • display (bool) – whether to display the network using matplotlib

  • allow_tree (bool) – if set, then the final graph does not need a cycle covering and will be a tree

  • max (int) – the maximum diameter of the resulting graph

  • cutoff (float) – the Minimum Similarity Score (MSS) used to build the graph

  • links_file (str) – the name of a file containing links to seed the graph with

  • known_actives_file (str) – the name of a file containing mols whose activity is known

  • max_dist_from_actives (int) – The maximum number of links from any molecule to an active

  • use_common_core (bool, optional) – Whether to search among all input molecules for a common core to speed up pairwise MCS calculations, default True

  • shift (bool, optional) – When using the ‘threed’ option, whether to translate the two molecules to superimpose them before checking real-space alignment, default True
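A minimal usage sketch for the parameters above. The helper names `build_settings` and `plan_network` and the `ligands/` path are hypothetical, the call order (matrices before graph) is an assumption, and the return values of the two methods are not documented in this section, so they are ignored here.

```python
def build_settings(directory, prefix="out"):
    """Collect keyword arguments for DBMolecules.

    The keys follow the constructor signature documented above; the values
    chosen here simply repeat the documented defaults, except output=True.
    """
    return {
        "directory": directory,  # folder holding the mol2/sdf files
        "parallel": 1,           # cores used for the score matrices
        "time": 20,              # MCS timeout in seconds
        "cutoff": 0.4,           # Minimum Similarity Score (MSS)
        "max": 6,                # maximum diameter of the graph
        "output": True,          # write the output files
        "name": prefix,          # output file name prefix
    }


def plan_network(directory):
    """Build the matrices and then the graph for one directory of ligands."""
    # Imported lazily so the sketch can be read without Lomap installed.
    from lomap.dbmol import DBMolecules

    db = DBMolecules(**build_settings(directory))
    db.build_matrices()  # assumed to be required before build_graph()
    db.build_graph()
    return db


settings = build_settings("ligands/")  # hypothetical directory name
print(sorted(settings))
```

Calling `plan_network("ligands/")` would then run the full workflow, provided Lomap is installed and the directory exists.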

build_graph()#

This function coordinates the Graph generation

build_matrices()#

This function coordinates the calculation of the similarity score matrices by distributing chunks of the matrices between the allocated processes

compute_mtx(a, b, strict_mtx, loose_mtx, true_strict_mtx, MCS_map)#

Compute a chunk of the similarity score matrices. The chunk is selected by the start index a and the final index b. The matrices are treated as linear arrays

Parameters:
  • a (int) – the start index of the chunk

  • b (int) – the final index of the chunk

  • strict_mtx (python multiprocessing array) – strict similarity score matrix. This array is used as shared memory array managed by the different allocated processes. Each process operates on a separate chunk selected by the indexes a and b

  • loose_mtx (python multiprocessing array) – loose similarity score matrix. This array is used as shared memory array managed by the different allocated processes. Each process operates on a separate chunk selected by the indexes a and b

  • true_strict_mtx (python multiprocessing array) – Holds the strict score before it is potentially modified by the prespecified link function (which sets the link score to 1.0).

  • MCS_map (dict (multiprocessing)) – Holds a dict of (index tuple) -> string with the strings being the MCS atom index map between the two molecules
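The chunking idea behind the a and b indices can be sketched in plain Python. This is an illustration, not Lomap's own code: the helper `chunk_bounds` is hypothetical, b is treated as exclusive, and the linear array is assumed to hold one entry per unordered pair of molecules.

```python
def chunk_bounds(n_items, n_procs):
    """Split indices 0..n_items-1 into contiguous (a, b) ranges.

    b is exclusive in this sketch (an assumption). Ranges differ in size
    by at most one element, and no empty range is produced.
    """
    if n_items <= 0 or n_procs <= 0:
        return []
    n_procs = min(n_procs, n_items)
    base, extra = divmod(n_items, n_procs)
    bounds = []
    start = 0
    for proc in range(n_procs):
        # The first `extra` processes take one additional element.
        size = base + (1 if proc < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


n_mols = 5
n_pairs = n_mols * (n_mols - 1) // 2  # assumed length of the linear array
chunks = chunk_bounds(n_pairs, 3)
print(chunks)
```

Each (a, b) pair would then be handed to one worker, which writes only to its own slice of the shared arrays.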

next()#

Select the molecule during an iteration

nums()#

This function recovers the total number of molecules currently stored in the molecule database

read_molecule_files()#

Read in all the mol2 or SDF files

Returns:

molid_list – the container list of all the allocated Molecule objects

Return type:

list of Molecule objects

write_dic()#

This function writes out a text file with the mapping between the generated molecule indexes and the corresponding molecule file names

class lomap.dbmol.SMatrix(shape, dtype=<class 'float'>, buffer=None, offset=0, strides=None, order=None)#

This class implements a “basic” interface for symmetric matrices by subclassing ndarray. The class internally stores a two-dimensional numpy array as a linear array A[k]; however, the user can still access the matrix elements using the two-index notation A[i,j]
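The mapping from A[i,j] to A[k] can be illustrated with a small plain-Python class. This is a sketch under assumptions: the class name `SymmetricScores` is hypothetical, only the strictly upper triangle is stored, and the diagonal is taken to be 0.0; the layout used by SMatrix itself is not stated in this section.

```python
class SymmetricScores:
    """Symmetric n x n matrix stored as a linear list of n*(n-1)/2 values."""

    def __init__(self, n):
        self.n = n
        self.data = [0.0] * (n * (n - 1) // 2)

    def _index(self, i, j):
        """Map an off-diagonal pair (i, j) to its position k in the list."""
        if i == j:
            raise IndexError("the diagonal is not stored in this sketch")
        if i > j:
            i, j = j, i  # symmetry: (i, j) and (j, i) share one slot
        # Rows 0..i-1 hold (n-1) + (n-2) + ... + (n-i) entries in total.
        return i * (2 * self.n - i - 1) // 2 + (j - i - 1)

    def get(self, i, j):
        return 0.0 if i == j else self.data[self._index(i, j)]

    def set(self, i, j, value):
        self.data[self._index(i, j)] = value

    def to_2d(self):
        """Expand the linear storage into a full list-of-lists matrix."""
        return [[self.get(i, j) for j in range(self.n)] for i in range(self.n)]


scores = SymmetricScores(4)
scores.set(2, 0, 0.7)  # stored once, visible as both (0, 2) and (2, 0)
print(scores.to_2d())
```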

mat_size()#

This function returns the size of the square similarity score matrix

Returns:

n – the size of the similarity score matrix

Return type:

int

to_numpy_2D_array()#

This function returns the symmetric similarity score numpy matrix generated from the linear array

Returns:

np_mat – the symmetric similarity score numpy matrix built by using the linear array

Return type:

numpy matrix

class lomap.dbmol.Molecule(molecule, mol_id, molname)#

This Class stores the Rdkit molecule objects, their identification number and the total number of instantiated molecules

Initialization class function

Parameters:
  • molecule (Rdkit molecule object) – the molecule

  • mol_id (int) – the molecule identification number

  • molname (str) – the molecule file name

getID()#

Get the molecule ID number

Returns:

the molecule ID number

Return type:

int

getMolecule()#

Get the Rdkit molecule object

Returns:

mol_copy – The copy of the RDkit molecule

Return type:

Rdkit molecule object

getName()#

Get the molecule file name

Returns:

the molecule file name

Return type:

str

isActive()#

Get whether the molecule is active

Returns:

the molecule active status

Return type:

bool

setActive(active)#

Set whether the molecule is active

class lomap.mcs.MCS(moli, molj, time: int = 20, verbose: str = 'info', max3d: float = 1000.0, threed: bool = False, element_change: bool = True, seed: str = '', shift: bool = True)#

This class is used to compute the Maximum Common Subgraph (MCS) between two RDkit molecule objects and to score their similarity by using defined rules

Initialization function

Parameters:
  • moli (RDKit molecule object) – the first molecule used to perform the MCS calculation

  • molj (RDKit molecule object) – the second molecule used to perform the MCS calculation

  • time (int, optional) – timeout on MCS, default 20

  • verbose (str, optional) – logging level, default ‘info’

  • max3d (float, optional) – The MCS is trimmed to remove atoms which are further apart than this distance (in units of Angstrom), default 1,000.0 (i.e. do not trim)

  • threed (bool, optional) – When disambiguating the substructure found back to the original molecules, if True 3D coordinates are used, otherwise the number of elemental changes is minimised. default False.

  • element_change (bool, optional) – whether to allow elemental changes in mappings, default True

  • seed (string, optional) – Initial SMARTS seed for MCS search.

  • shift (bool, optional) – When using threed, whether to shift the molecule coordinates to maximise overlap, default True

Changed in version 2.1.0: Added element_change kwarg

Changed in version 2.2.0: Added seed option

Changed in version 2.3.0: Added shift option

all_atom_match_list()#

Gives a string listing the MCS match between the two molecules as atom_m1:atom_m2,atom_m1:atom_m2,…

All atoms including hydrogens. The string is sorted by first index. We need to be careful that this function is symmetric, and that hydrogens are mapped correctly.
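The atom_m1:atom_m2 string format described above is easy to convert to and from index pairs. The helpers below are hypothetical and only illustrate the format; they assume integer atom indices and no whitespace in the string.

```python
def parse_match_list(text):
    """Turn 'i:j,i:j,...' into a list of (i, j) integer tuples."""
    pairs = []
    for item in text.split(","):
        if not item:
            continue  # tolerate an empty string or a trailing comma
        left, right = item.split(":")
        pairs.append((int(left), int(right)))
    return pairs


def format_match_list(pairs):
    """Inverse of parse_match_list; output is sorted by the first index."""
    return ",".join(f"{i}:{j}" for i, j in sorted(pairs))


mapping = parse_match_list("0:3,1:4,2:5")
print(mapping)
print(format_match_list([(2, 5), (0, 3)]))
```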

atomic_number_rule()#

This rule checks how many elements have been changed in the MCS and returns a score based on the fraction of MCS matches that have the same atomic number. When used with beta=0.1 and multiplied by mcsr, this is equivalent to counting mismatched atoms at only half weight.

This has been extended to modify the amount of mismatch according to the atoms being mapped.

static getMapping(moli, molj, hydrogens=False, fname=None, time_out=150)#

Compute the MCS between two passed molecules

Parameters:
  • moli (RDKit molecule object) – the first molecule used to perform the MCS calculation

  • molj (RDKit molecule object) – the second molecule used to perform the MCS calculation

  • hydrogens (bool) – whether to include the hydrogens in the MCS calculation

  • fname (string) – the filename used to output a png file depicting the MCS mapping

  • time_out (int) – the max time in seconds used to compute the MCS

Returns:

map_moli_molj – the list of tuples containing the atom mapping indexes between the two molecules. The indexes (i,j) refer respectively to the first (moli) and the second (molj) passed molecules

Return type:

python list of tuple […(i,j)…]

heavy_atom_match_list()#

Gives a string listing the MCS match between the two molecules as atom_m1:atom_m2,atom_m1:atom_m2,…

Heavy atoms only.

heavy_atom_mcs_map()#

Gives a list of tuples mapping atoms from moli to molj. Heavy atoms only, returned sorted by first index

heterocycles_rule(penalty=4)#

This rule checks to see if we are growing a heterocycle from a hydrogen, and returns <1 if we are. This means that if this rule is used we penalise this transition. Testing has shown that growing a pyridine or other heterocycle is unlikely to work (better to grow phenyl then mutate)

Parameters:

penalty – the number of atom mismatches by which failing this rule lowers the score

hybridization_rule(penalty_weight=1.5)#

This rule checks how many atoms have changed hybridization state. The penalty weight means how many “atoms” different a hybridization state change is: 1 means that the atom is effectively removed from the MCS for scoring purposes, 0 means that hybridization changes are free. When used with beta=0.1 and multiplied by mcsr, this is equivalent to counting mismatched atoms at a weight of (1-penalty_weight)

mcsr()#

This rule computes the similarity between the two passed molecules used to compute the MCS

Returns:

scr_mcsr – the rule score

Return type:

float

mncar(ths=4)#

This rule cuts the similarity score between two molecules if they do not share the selected number of atoms

Parameters:

ths (float) – the minimum number of atoms to share

Returns:

scr_mncar – the rule score

Return type:

float

sulfonamides_rule(penalty=4)#

This rule checks to see if we are growing a complete sulfonamide, and returns 0 if we are. This means that if this rule is used we effectively disallow this transition. Testing has shown that growing -SO2NH2 from scratch performs very badly.

Parameters:

penalty – the number of atom mismatches by which failing this rule lowers the score

transmuting_methyl_into_ring_rule(penalty=6)#

Rule to prevent turning a methyl into a ring atom and similar transformations (you can grow a ring, but you can’t transmute into one)

Parameters:

penalty – the number of atom mismatches by which failing this rule lowers the score

transmuting_ring_sizes_rule()#

Rule to prevent turning a ring atom into a ring atom with a different ring size (you can grow a ring, but you can’t turn a cyclopentyl into a cyclohexyl)

Hard rule: sets score to near zero if violated

class lomap.graphgen.GraphGen(score_matrix: ndarray, ids: list, names: list[str], max_path_length, actives: list[bool], max_dist_from_active: int, similarity_cutoff: float, require_cycle_covering, radial: bool, fast: bool, hub: str | None = None)#

This class is used to set up and generate the graph used to plan binding free energy calculations

Parameters:
  • score_matrix (np.ndarray) – array of scores between each molecule. Should be a symmetric (n x n) matrix

  • ids (list[int]) – indices for each molecule. Should be the same length as the score_matrix. These ids are used as the ‘ID’ attribute in the resulting graph

  • names (list[str]) – list of string identifiers for each ligand; these names are used as the ‘fname_comp’ attribute in the resulting graph

  • max_path_length (int) – the maximum path length allowed between any two compounds in the resulting graph

  • actives (list[bool]) – for each ligand in input, if they are considered active. This is used in conjunction with the max_dist_from_active argument

  • max_dist_from_active (int) – the maximum number of links from any molecule to an active

  • similarity_cutoff (float) – the value that an edge’s score must be above for the edge to be considered viable. 0.0 would allow all edges

  • require_cycle_covering (bool) – whether to enforce cycle covering; if False, the resulting graph may be a tree

  • radial (bool) – whether to construct a radial graph. Note that this radial graph will still include cycles

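  • fast (bool) – chooses one of two algorithms for generating the initial graph (see fast_map in generate_initial_subgraph_list)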

  • hub (str, optional) – the name of the ligand to use as the center of the hub

add_surrounding_edges(subgraphs: list, score_matrix: ndarray, lead_index: int, similarity_score_limit: float)#

Add surrounding edges in each subgraph to make sure all nodes are in a cycle

check_constraints(subgraph, numComp, require_cycle_covering)#

Determine if the given subgraph still meets the constraints

Parameters:
  • subgraph (NetworkX subgraph obj) – the subgraph to check for the constraints

  • numComp (int) – the number of connected components

  • require_cycle_covering (bool) – whether to enforce cycle covering

Returns:

constraintsMet – True if all the constraints are met, False otherwise

Return type:

bool

static check_cycle_covering(subgraph, non_cycle_edges_set)#

Checks if the subgraph has a cycle covering. Note that this has been extended from the original algorithm: we not only care if the number of acyclic nodes has increased, but we also care if the number of acyclic edges (bridges) has increased. Note that if the number of acyclic edges hasn’t increased, then the number of acyclic nodes hasn’t either, so that test is included in the edges test.

Parameters:
  • subgraph (NetworkX subgraph obj) – the subgraph to check for connection after the edge deletion

  • non_cycle_edges_set

Returns:

hasCovering – True if the subgraph has a cycle covering, False otherwise

Return type:

bool
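The bridge-based test described for check_cycle_covering can be sketched without NetworkX. This is an illustration under assumptions, not the library's implementation: the helper names are hypothetical, and a covering is taken to hold when no edge outside an allowed set is a bridge.

```python
from collections import defaultdict


def _reachable(adj, start, goal, skipped):
    """Depth-first search from start to goal that ignores one edge."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nb in adj[node]:
            if frozenset((node, nb)) == skipped or nb in seen:
                continue
            seen.add(nb)
            stack.append(nb)
    return False


def find_bridges(edges):
    """Return the edges whose removal would disconnect their endpoints."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    bridges = set()
    for u, v in edges:
        edge = frozenset((u, v))
        if not _reachable(adj, u, v, edge):
            bridges.add(edge)  # no alternative path, so not on any cycle
    return bridges


def has_cycle_covering(edges, allowed_bridges):
    """True if every bridge was already a known non-cycle edge."""
    return find_bridges(edges) <= set(allowed_bridges)


ring_with_tail = [(0, 1), (1, 2), (2, 0), (2, 3)]
known = find_bridges(ring_with_tail)  # only the tail edge 2-3
print(known)
print(has_cycle_covering([(1, 2), (2, 0), (2, 3)], known))  # edge 0-1 removed
```

The simple per-edge search is quadratic in the worst case; it is used here for readability rather than speed.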

check_distance_to_active(subgraph, distance_to_active_failures, max_distance_from_active)#

Check to see if we have increased the number of distance-to-active failures

Parameters:
  • subgraph (NetworkX subgraph obj) – the subgraph to check for the max distance between nodes

  • distance_to_active_failures

  • max_distance_from_active

Returns:

ok – True if we have not increased the number of failed nodes

Return type:

bool

static check_max_distance(subgraph, max_path_length) bool#

Check to see if the graph has paths from all compounds to all other compounds within the specified limit

Parameters:
  • subgraph (NetworkX subgraph obj) – the subgraph to check for the max distance between nodes

  • max_path_length

Returns:

withinMaxDistance – True if the subgraph has all the nodes within the specified max distance

Return type:

bool
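A plain-Python sketch of the all-pairs distance check follows. The function names and the adjacency-dict representation are assumptions made for illustration; unreachable nodes are treated as violating the limit.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length, in edges, from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def within_max_distance(adj, max_path_length):
    """True if every pair of nodes is joined by a path within the limit."""
    for node in adj:
        dist = bfs_distances(adj, node)
        if len(dist) < len(adj):
            return False  # some node cannot be reached at all
        if max(dist.values()) > max_path_length:
            return False
    return True


# A chain of four compounds: 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(within_max_distance(chain, 3), within_max_distance(chain, 2))
```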

connect_graph_components_brute_force()#

Adds edges to the resultGraph to connect all components that can be connected. Only one edge is added per component, forming a tree-like structure between the different components of the resultGraph

Returns:

True if the addition of edges was possible in strict mode, False otherwise

Return type:

bool

connect_graph_components_brute_force_2()#

Adds a second edge between each of the (former) components of the resultGraph to try to provide cycles between (former) components

Returns:

True if the addition of edges was possible in loose mode, False otherwise

Return type:

bool

connect_subgraphs()#

Adds edges to the resultGraph to connect as many components of the final graph as possible

static count_distance_to_active_failures(subgraph, max_dist_from_active)#

Count the number of compounds that don’t have a minimum-length path to an active within the specified limit

Parameters:
  • subgraph (NetworkX subgraph obj) – the subgraph to check for the max distance between nodes

  • max_dist_from_active

Returns:

failures – Number of nodes that are not within the max distance to any active node

Return type:

int
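Counting distance-to-active failures amounts to a breadth-first search started from all active nodes at once. The sketch below is a hypothetical plain-Python version; the adjacency dict and the explicit list of actives are assumptions, since the real method reads this information from the NetworkX subgraph.

```python
from collections import deque


def count_failures(adj, actives, max_dist):
    """Count nodes whose nearest active is more than max_dist links away."""
    dist = {node: 0 for node in actives}  # every active is at distance 0
    queue = deque(actives)
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    # Nodes never reached from any active count as failures too.
    return sum(1 for node in adj if dist.get(node, float("inf")) > max_dist)


# A chain of five compounds: 0 - 1 - 2 - 3 - 4
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(count_failures(chain, [0], 2))
print(count_failures(chain, [0, 4], 2))
```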

draw(dbase, max_images: int = 2000, max_nodes: int = 100, edge_labels: bool = True)#

This function plots the NetworkX graph by using Matplotlib

Parameters:
  • dbase

  • max_images (int) – Max number of displayed chemical compound images as graph nodes

  • max_nodes (int) – Max number of displayed nodes in the graph

  • edge_labels (bool) –

generate_depictions(dbase, max_images: int = 2000, max_mol_size: float = 50.0, edge_labels: bool = True)#

Parameters:
  • dbase

  • max_images (int) – Max number of displayed chemical compound images as graph nodes

  • max_mol_size (float) – The maximum threshold distance, in angstroms, used to decide whether a molecule is depicted

  • edge_labels (bool) – whether to add labels on edges

static generate_initial_subgraph_list(fast_map, strict_mtx, ids, names, is_active, lead_index: int)#

This function generates a starting graph connecting with edges all the compounds with a positive strict similarity score

Parameters:
  • fast_map (bool) – chooses one of two algorithms

  • strict_mtx (np.ndarray) – matrix of scores between molecules

  • ids (list) – list of identifiers for each molecule

  • names (list) – names of each molecule

  • is_active (list[bool]) – for each molecule, whether it is active

  • lead_index (int) –

Returns:

initialSubgraphList – the list of connected component graphs

Return type:

list of NetworkX graph
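The idea of linking every pair with a positive strict score and then splitting the result into connected components can be sketched as follows. This is plain Python on a list-of-lists matrix, with a hypothetical function name; it does not model the fast_map option or the lead compound.

```python
def initial_components(scores):
    """Group molecule indices into connected components.

    Two molecules are linked when their score is strictly positive.
    """
    n = len(scores)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if scores[i][j] > 0.0:
                adj[i].add(j)
                adj[j].add(i)

    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(component))
    return components


strict = [
    [0.0, 0.8, 0.0, 0.0],
    [0.8, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # molecule 3 has no positive score
]
print(initial_components(strict))
```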

static generate_subgraph_scores_lists(subgraphList)#

This function generates a list of lists where each inner list is the weights of each edge in a given subgraph in the subgraphList, sorted from lowest to highest

Returns:

subgraphScoresLists – each list contains a tuple with the graph node indexes and their similarity as weight

Return type:

list of lists

static generate_working_subgraphs_list(subgraph_list)#

After the deletion of the edges that have a weight less than the selected threshold, the subgraph may be disconnected, and a new master list of connected subgraphs is generated

Returns:

workingSubgraphsList – each list contains a tuple with the graph node indexes and their similarity as weight

Return type:

list of lists

static merge_all_subgraphs(working_subgraphs)#

Generates a single networkx graph object from the subgraphs that have been processed

Returns:

finalGraph – the final graph produced by merging all the subgraphs. The produced graph may have disconnected parts

Return type:

NetworkX graph obj

minimize_edges(require_cycle_covering)#

Minimize edges in each subgraph while ensuring constraints are met
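One way to picture edge minimisation is a greedy pass that tries to drop the lowest-scoring edges first and keeps an edge only when removing it would break a constraint. The sketch below is an assumption-laden illustration, not Lomap's algorithm: the names are hypothetical, and connectivity is the only constraint checked (no cycle covering, path length, or distance to actives).

```python
def is_connected(nodes, edges):
    """True if every node can be reached from the first one."""
    nodes = list(nodes)
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    stack, seen = [nodes[0]], {nodes[0]}
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(nodes)


def minimize_edges(nodes, weighted_edges):
    """Greedily drop edges, lowest weight first, while staying connected."""
    kept = dict(weighted_edges)
    # sorted() builds a separate list, so deleting from `kept` is safe.
    for edge in sorted(kept, key=kept.get):
        trial = [e for e in kept if e != edge]
        if is_connected(nodes, trial):
            del kept[edge]  # the constraint still holds without this edge
    return kept


triangle = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.3}
print(minimize_edges([0, 1, 2], triangle))
```

With connectivity as the sole constraint the result is always a spanning tree; enforcing cycle covering would keep more edges.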

static pick_lead(hub: str, names: list[str], strict_mtx) int#

Pick lead compound

Parameters:
  • hub (str) – input of desired hub

  • names (list[str]) – names of each molecule

  • strict_mtx – scoring matrix

Returns:

index of lead compound

Return type:

int

static remains_connected(subgraph, numComponents) bool#

Determine if the subgraph remains connected after an edge has been removed

Parameters:
  • subgraph (NetworkX subgraph obj) – the subgraph to check for connection after the edge deletion

  • numComponents (int) – the number of connected components

Returns:

isConnected – True if the subgraph is connected, False otherwise

Return type:

bool
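The connectivity test can be pictured as comparing the number of connected components before and after an edge is removed. The union-find sketch below uses hypothetical names and plain node and edge lists rather than a NetworkX subgraph.

```python
def count_components(nodes, edges):
    """Count connected components with a small union-find structure."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})


def remains_connected(nodes, edges, num_components):
    """True if the component count is unchanged after an edge removal."""
    return count_components(nodes, edges) == num_components


nodes = [0, 1, 2]
triangle = [(0, 1), (1, 2), (0, 2)]
before = count_components(nodes, triangle)
print(remains_connected(nodes, [(0, 1), (1, 2)], before))  # edge 0-2 removed
print(remains_connected(nodes, [(0, 1)], before))          # edge 1-2 removed as well
```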

static remove_edges_below_hard_limit(subgraphlist, scores, similarity_scores_limit)#

This function removes edges below the set hard limit from each subGraph and from each weightsList

Operates on subgraphlist in-place!

Parameters:
  • subgraphlist (list) –

  • scores (list) –

  • similarity_scores_limit
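The in-place behaviour noted above can be illustrated with plain containers. In this hypothetical sketch a subgraph is a dict from edge to weight and the scores are a list of (i, j, weight) tuples; edges whose weight is exactly equal to the limit are kept, which is an assumption about the boundary.

```python
def remove_edges_below_limit(edge_weights, scores, limit):
    """Delete edges with weight < limit from both containers, in place."""
    # Collect the keys first so the dict is not modified while iterating.
    for edge in [e for e, w in edge_weights.items() if w < limit]:
        del edge_weights[edge]
    # Slice assignment mutates the caller's list rather than rebinding it.
    scores[:] = [entry for entry in scores if entry[2] >= limit]


edge_weights = {(0, 1): 0.9, (1, 2): 0.35, (0, 2): 0.4}
scores = [(1, 2, 0.35), (0, 2, 0.4), (0, 1, 0.9)]
remove_edges_below_limit(edge_weights, scores, 0.4)
print(edge_weights)
print(scores)
```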

write_graph(dbase, output_no_images, output_no_graph)#

This function writes the final generated NetworkX graph to file as .dot and .ps files. The mapping between molecule IDs and compound names is saved as a text file