The Animals (API)

Animals listed in alphabetical order.

BONOBO

To run BONOBO, first initialize the class with the expression data, then call the run_bonobo method to get the results:

bonobo_obj_sparse = Bonobo(expression_file)
bonobo_obj_sparse.run_bonobo(keep_in_memory=True, output_fmt='.hdf', sparsify=True, output_folder='../data/processed/bonobo_sparse_pvals/', save_pvals=False)

Here are the main functions and classes in the BONOBO module:

Other functions

CONDOR

class netZooPy.condor.condor.condor_object(network_file=None, sep=',', index_col=0, header=0, dataframe=None, silent=False)[source]

Initialization of the condor object. The function takes a network in edgelist format, either as a path to a file or as a pandas DataFrame, and builds a condor_object with an edgelist, an igraph network, and the names of the targets and regulators.

Note: The edgelist is assumed to contain a bipartite network. The program will relabel the nodes so that the edgelist represents a bipartite network anyway. It is up to the user to know that the network they are using is suitable for the method.

Parameters:

network_file (str) – Path to file encoding an edgelist.
sep (str) – Separator used in the file.
index_col (int) – Column that stores the index of the edgelist. E.g. None, 0…
header (int) – Row that stores the header of the edgelist. E.g. None, 0…
dataframe (dataFrame) – Pandas DataFrame containing the edgelist. Use as an alternative to network_file.
silent (bool) – Silent mode; nothing is printed to the console.

Returns:

condor_object (object) – The object has the following attributes:
- net: Contains the network edgelist as a pandas DataFrame.
- graph: iGraph object containing the bipartite network.
- reg_names: List of the nodes in the first column.
- tar_names: List of the nodes in the second column.
- index_dict: Dictionary keeping track of the indices in the “graph” variable and the actual names of the nodes.
warning – Condor uses iGraph for the node assignment. The initialization of the assignment involves some stochasticity, which can be controlled by setting the Python random seed.

For instance, running condor twice on the same dataset in the same Python process might result in slightly different assignments. To avoid this behavior, set the seed (for example, random.seed(1)) before each call to condor.
```
>>> import random
>>> random.seed(1)
>>> c1 = condor(...)
>>> random.seed(1)
>>> c2 = condor(...)
```
In this case c1 and c2 are exactly the same.

Conversely, if condor is called in two separate Python runs, you will get the exact same results, as the random seed will have been reset. To keep the stochasticity of the initial assignment, set a random seed at the beginning:
```
>>> import random
>>> random.seed(random.randint(1,10000000))
>>> condor(...)
```
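
The effect of reseeding can be checked without condor itself. The sketch below uses a hypothetical function, toy_initial_assignment, as a stand-in for any step that draws from Python's random module; it is not part of netZooPy and only illustrates the seeding pattern described above.

```python
import random


def toy_initial_assignment(n_nodes, n_communities):
    """Hypothetical stand-in for a stochastic initial community assignment."""
    return [random.randrange(n_communities) for _ in range(n_nodes)]


# Reseeding with the same value before each call makes both draws identical.
random.seed(1)
a1 = toy_initial_assignment(20, 4)
random.seed(1)
a2 = toy_initial_assignment(20, 4)
print(a1 == a2)  # prints True
```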

bipartite_modularity(B, m, R, T)[source]

Computation of the bipartite modularity as described in Michael J. Barber. Modularity and community detection in bipartite networks.

Parameters:

B (array) – modularity matrix.
m (array) – sum of the weights (or number of edges in the unweighted case).
R (array) – community assignment matrix for reg nodes.
T (array) – community assignment matrix for tar nodes.

Returns:

Q – Modularity score.

Notes

self.Qcoms: Modularity contribution by each community.
self.modularity: Modularity score.
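
As a rough illustration of the quantity being computed, Barber's bipartite modularity can be written as Q = (1/m) Tr(R^T B T) for one-hot assignment matrices R and T. The pure-Python sketch below evaluates it per community; the function name and the toy inputs are assumptions made for this example, and the library's own implementation may differ.

```python
def bipartite_modularity_sketch(B, m, R, T):
    """Return (Q, per-community contributions) for one-hot matrices R and T."""
    n_reg, n_tar, n_comm = len(B), len(B[0]), len(R[0])
    q_coms = []
    for c in range(n_comm):
        # Sum B over reg/tar pairs that both belong to community c.
        total = sum(
            R[i][c] * B[i][j] * T[j][c]
            for i in range(n_reg)
            for j in range(n_tar)
        )
        q_coms.append(total / m)
    return sum(q_coms), q_coms


# Toy network: two reg nodes, two tar nodes, edges reg0-tar0 and reg1-tar1.
B = [[0.5, -0.5], [-0.5, 0.5]]  # A - k d^T / m with m = 2
m = 2
R = [[1, 0], [0, 1]]  # reg0 in community 0, reg1 in community 1
T = [[1, 0], [0, 1]]  # tar0 in community 0, tar1 in community 1
Q, Qcoms = bipartite_modularity_sketch(B, m, R, T)
```

In this toy case each community contributes 0.25, for a total modularity of 0.5.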

brim(deltaQmin='def', c='def', resolution=1)[source]

Implementation of the BRIM algorithm to iteratively maximize bipartite modularity. Note that c is the maximum number of communities. Dynamic choice of c is not yet implemented.

Parameters:

deltaQmin (str) – Difference modularity threshold for stopping the iterative process
c (int) – max number of communities.
resolution (float)

Notes

Updates condor object with the following:
- self.modularity: Modularity score for the final assignment.
- self.tar_memb: Final community assignment for tar nodes.
- self.reg_memb: Final community assignment for reg nodes.

Note

c: Has to be larger than the number of communities given by the initial community assignment; otherwise the program will crash. The default option leaves room for 20% more communities, which rarely fails.
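
The alternating update at the heart of BRIM can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not netZooPy's implementation: brim_sketch, its arguments, and the toy network are hypothetical, communities are plain label lists rather than matrices, and ties are broken by the lowest community index.

```python
def brim_sketch(B, m, reg_comm, tar_comm, n_comm, delta_q_min=1e-9, max_iter=100):
    """Alternately reassign reg and tar nodes until the modularity gain is small."""
    n_reg, n_tar = len(B), len(B[0])

    def modularity(reg, tar):
        return sum(B[i][j] for i in range(n_reg) for j in range(n_tar)
                   if reg[i] == tar[j]) / m

    def best_label(weights, labels):
        # Sum the weights per community and pick the community with the largest sum.
        scores = [0.0] * n_comm
        for w, c in zip(weights, labels):
            scores[c] += w
        return max(range(n_comm), key=scores.__getitem__)

    q = modularity(reg_comm, tar_comm)
    for _ in range(max_iter):
        # Update reg nodes given the tar labels, then tar nodes given the new reg labels.
        reg_comm = [best_label(B[i], tar_comm) for i in range(n_reg)]
        tar_comm = [best_label([B[i][j] for i in range(n_reg)], reg_comm)
                    for j in range(n_tar)]
        q_new = modularity(reg_comm, tar_comm)
        gain, q = q_new - q, q_new
        if gain < delta_q_min:
            break
    return reg_comm, tar_comm, q


# Toy network with two obvious blocks; the initial reg labels are deliberately wrong.
A = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
m = sum(map(sum, A))
k = [sum(row) for row in A]
d = [sum(col) for col in zip(*A)]
B = [[A[i][j] - k[i] * d[j] / m for j in range(4)] for i in range(4)]
reg, tar, q = brim_sketch(B, m, [0, 1, 0, 1], [0, 0, 1, 1], n_comm=2)
```

Here n_comm plays the role of c: every label must stay below it, which is why c has to exceed the number of communities in the initial assignment.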

initial_community(method='LDN', project=False, resolution=1)[source]

Computation of the initial community structure based on unipartite methods.

Parameters:

method (str) –
Method to determine initial community assignment.
- "LCS": Multilevel method.
- "LDN": Leiden method.
project (bool) – Whether to apply the initial community structure on the bipartite network disregarding the bipartite structure or apply it to the unipartite network resulting from the projection onto one of these nodes.

Outputs:

self – Updates the condor object with:
- tar_memb: DataFrame of initial target node membership.
- reg_memb: DataFrame of initial reg node membership.

matrices(c, resolution)[source]

Computation of modularity matrix and initial community matrix.

Parameters:

c (int) – max number of communities.

Returns:

B (array) – Modularity matrix.
m (int) – Sum of the weights (or number of edges in the unweighted case).
T0 (array) – Initial community structure matrix for target nodes.
R0 (array) – Initial community structure matrix for reg nodes.
gn (dict) – Index dictionary for tar node names.
rg (dict) – Index dictionary for reg node names.
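
To make the returned objects more concrete, the sketch below builds a modularity matrix B = A - k d^T / m, the total weight m, and two index dictionaries from a small weighted edgelist. It assumes that gn and rg map node names to row/column indices, which the description above does not state explicitly; the function name and edge data are hypothetical.

```python
def modularity_matrix_sketch(edges):
    """Build B, m and name-to-index dicts from (reg, tar, weight) triples."""
    reg_names = sorted({r for r, _, _ in edges})
    tar_names = sorted({t for _, t, _ in edges})
    rg = {name: i for i, name in enumerate(reg_names)}
    gn = {name: j for j, name in enumerate(tar_names)}

    # Weighted biadjacency matrix (reg nodes as rows, tar nodes as columns).
    A = [[0.0] * len(tar_names) for _ in reg_names]
    for r, t, w in edges:
        A[rg[r]][gn[t]] += w

    m = sum(map(sum, A))               # total edge weight
    k = [sum(row) for row in A]        # reg node degrees
    d = [sum(col) for col in zip(*A)]  # tar node degrees
    B = [[A[i][j] - k[i] * d[j] / m for j in range(len(d))]
         for i in range(len(k))]
    return B, m, gn, rg


edges = [("reg1", "tar1", 1.0), ("reg1", "tar2", 1.0), ("reg2", "tar3", 1.0)]
B, m, gn, rg = modularity_matrix_sketch(edges)
```

A quick sanity check on any such B is that every row and every column sums to zero.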

qscores()[source]: Computes the qscores (contribution of a vertex to its community modularity) for each vertex in the network.

Other functions

netZooPy.condor.run_condor(network_file, sep=',', index_col=0, header=0, initial_method='LDN', initial_project=False, com_num='def', deltaQmin='def', resolution=1, return_output=False, tar_output='tar_memb.txt', reg_output='reg_memb.txt', silent=False)[source]

Computation of the whole condor process. It creates a condor object and runs all the steps of BRIM on it. The function writes the final tar and reg node memberships to the files given by tar_output and reg_output.

Note: The edgelist is assumed to contain a bipartite network. The program will relabel the nodes so that the edgelist represents a bipartite network anyway. It is up to the user to know that the network they are using is suitable for the method.

Parameters:

network_file (str) – Path to file encoding an edgelist.
sep (str) – Separator used in the file.
index_col (int) – Column that stores the index of the edgelist. E.g. None, 0…
header (int) – Row that stores the header of the edgelist. E.g. None, 0…
initial_method (str) – Method to determine initial community assignment. (By default Leiden method).
initial_project (bool) – Whether to project the network onto one of the bipartite sets for the initial community detection.
com_num (int) – Max number of communities. It is recommended to leave this at its default; otherwise, if the initial community assignment is bigger, the program will crash.
deltaQmin (float) – Difference modularity threshold for stopping the iterative process.
resolution (int) – Not yet implemented.
return_output (bool) – Whether the function returns the created condor object.
tar_output (str) – Filename for saving the tar node final membership.
reg_output (str) – Filename for saving the reg node final membership.
silent (bool) – Run in silent mode.

Return type:

Files “tar_memb.txt” and “reg_memb.txt” encoding the final tar and reg node membership.

LIONESS

class netZooPy.lioness.lioness.Lioness(obj, computing='cpu', precision='double', ncores=1, start=1, end=None, subset_numbers=None, subset_names=None, save_dir='lioness_output', save_fmt='npy', output='network', alpha=0.1, save_single=False, export_filename=None, ignore_final=False, online_coexpression=False)[source]

Using LIONESS to infer single-sample gene regulatory networks.

Reading in PANDA network and preprocessed middle data
Computing coexpression network
Normalizing coexpression network
Running PANDA algorithm
Writing out LIONESS networks
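
The idea behind these steps is usually summarized by the LIONESS interpolation: the network for sample q is estimated as N * (net_all - net_without_q) + net_without_q, where net_all is built from all N samples and net_without_q from all samples except q. The sketch below applies that formula with a plain mean standing in for the PANDA network (an assumption made to keep the example self-contained); lioness_sketch and mean are hypothetical names.

```python
def lioness_sketch(samples, aggregate):
    """Apply the LIONESS interpolation with a user-supplied aggregate function."""
    n = len(samples)
    net_all = aggregate(samples)
    per_sample = []
    for q in range(n):
        # Aggregate "network" built from every sample except sample q.
        net_without_q = aggregate(samples[:q] + samples[q + 1:])
        per_sample.append(n * (net_all - net_without_q) + net_without_q)
    return per_sample


def mean(values):
    return sum(values) / len(values)


values = [1.0, 2.0, 6.0]
recovered = lioness_sketch(values, mean)
```

With a linear aggregate such as the mean, the formula recovers each sample's own value exactly; with a nonlinear aggregate such as PANDA, the per-sample networks are estimates.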

Parameters:

obj (object) – PANDA object, generated with keep_expression_matrix=True.
computing (str) – ‘cpu’ uses the Central Processing Unit (CPU) to run PANDA; ‘gpu’ uses the Graphical Processing Unit (GPU) to run PANDA.
precision (str) – ‘double’ computes the regulatory network in double precision (15 decimal digits). ‘single’ computes the regulatory network in single precision (7 decimal digits), which is faster and requires half the memory but is less accurate.
subset_numbers (list) – List of sample indices on which LIONESS should be run, e.g. [1,10,20].
subset_names (list) – List of sample names on which LIONESS should be run, e.g. [‘s1’,‘s2’,‘s3’].
start (int) – Index of the first sample for which to compute the network. If subset_numbers or subset_names is passed, this is ignored.
end (int) – Index of the last sample for which to compute the network. If subset_numbers or subset_names is passed, this is ignored.
all_background (bool) – Pass the flag if you want to keep all samples as background.
save_dir (str) – Directory to save the networks.
save_fmt (str) – Save format.
- ‘.npy’: (Default) Numpy file of the network.
- ‘.txt’: Text file, only values are saved, no tf or gene names. Will be deprecated.
- ‘.csv’: Text file with index (tf) and column (gene) names.
- ‘.h5’: HDF file, fastest way to save the lioness dataframe with index/column names.
- ‘.mat’: MATLAB file.
output (str) –
- ‘network’ returns all networks in a single edge-by-sample matrix (lioness_obj.total_lioness_network is the unlabeled variable and lioness_obj.export_lioness_results is the row-labeled variable). For large sample sizes, this variable requires a large amount of RAM.
- ‘gene_targeting’ returns gene targeting scores for all networks in a single gene-by-sample matrix (lioness_obj.total_lioness_network).
- ‘tf_targeting’ returns tf targeting scores for all networks in a single gene-by-sample matrix (lioness_obj.total_lioness_network).
alpha (float) – learning rate, set to 0.1 by default but has to be changed manually to match the learning rate of the PANDA object.
save_single (bool) – when set to True it will save each lioness network with its sample name inside the lioness output folder
export_filename (str) – if passed, the final lioness table will be saved with all tf-gene edges as dataframe index and samples as column name
ignore_final (bool) – if True, no lioness network is kept in memory. This requires saving single networks at each step
online_coexpression (bool) – if True, each LIONESS correlation is computed using the online coexpression method.

Returns:

export_lioness_results – Depending on the output argument, this can be either all the lioness networks or their gene/tf targeting scores.

Notes

Example on how to use Lioness and plot the network

>>> from netZooPy.panda.panda import Panda
>>> from netZooPy.lioness.lioness import Lioness
>>> from netZooPy.lioness.analyze_lioness import AnalyzeLioness
>>> # To run the Lioness algorithm for single sample networks, first run PANDA using the keep_expression_matrix flag, then use Lioness as follows:
>>> panda_obj = Panda('../../tests/ToyData/ToyExpressionData.txt', '../../tests/ToyData/ToyMotifData.txt', '../../tests/ToyData/ToyPPIData.txt', remove_missing=False, keep_expression_matrix=True)
>>> lioness_obj = Lioness(panda_obj)

>>> # Save Lioness results:
>>> lioness_obj.save_lioness_results('Toy_Lioness.txt')
>>> # Return a network plot for one of the Lioness single sample networks:
>>> plot = AnalyzeLioness(lioness_obj)
>>> plot.top_network_plot(index=0, top=100, file='top_100_genes.png')

Example lioness output:

TF, Gene and Motif order is identical to the panda output file.

Sample1 Sample2 Sample3 Sample4
-0.667452814003 -1.70433776179 -0.158129613892 -0.655795512803
-0.843366539284 -0.733709815256 -0.84849895139 -0.915217389738
3.23445386464 2.68888472802 3.35809757371 3.05297381396
2.39500370135 1.84608635425 2.80179804094 2.67540878165
-0.117475863987 0.494923925853 0.0518448588965 -0.0584810456421

References

Authors: Cho-Yi Chen, David Vi, Daniel Morgan

__compute_subset_panda(i)

Compute the subset panda network using the correlation matrix and the motif matrix.

Parameters: correlation_matrix (array) – The coexpression network to be used for computing the subset panda network.
Returns: subset_panda_network – The subset panda network.
Return type: array

__lioness_loop(i)

Initialize instance of Lioness class and load data.

Returns: self.total_lioness_network – An edge-by-sample matrix containing sample-specific networks.
Return type: array

__lioness_to_disk(net, path)

__par_lioness_loop

_normalize_network(x)

Standardizes the input data matrices.

Parameters: x (array) – Input adjacency matrix.
Returns: normalized_matrix – Standardized adjacency matrix.
Return type: array

export_lioness_table(output_filename='lioness_table.txt', header=False, output='network')[source]

Saves LIONESS network with edge names. This saves a dataframe with the corresponding header and indexes.

Parameters: output_filename (str) – Path to save the network. Specify relative path and format. Choose between .csv, .tsv and .txt. (Defaults to lioness_table.txt)

panda_loop(correlation_matrix, motif_matrix, ppi_matrix, computing='cpu', alpha=0.1)

The PANDA algorithm.

Parameters:

correlation_matrix (array) – Input coexpression matrix.
motif_matrix (array) – Input motif regulation prior network.
ppi_matrix (array) – Input PPI matrix.
computing (str) – ‘cpu’ uses the Central Processing Unit (CPU) to run PANDA. ‘gpu’ uses the Graphical Processing Unit (GPU) to run PANDA.

processData(modeProcess, motif_file, expression_file, ppi_file, remove_missing, keep_expression_matrix, start=1, end=None, with_header=False, cobra_design_matrix=None, cobra_covariate_to_keep=0)

Processes data files into data matrices.

Parameters:

expression_file (str) – Path to file containing the gene expression data or pandas dataframe. By default, the expression file does not have a header, and the cells are separated by a tab. Pass with_header=True if the expression data includes the sample names.
motif_file (str) – Path to file containing the transcription factor DNA binding motif data in the form of TF-gene-weight(0/1) or pandas dataframe. If set to none, the gene coexpression matrix is returned as a result network.
ppi_file (str) – Path to file containing the PPI data, or pandas dataframe. The PPI can be symmetrical; if not, it will be transformed into a symmetrical adjacency matrix.
remove_missing (bool) – Removes the genes and TFs that are not present in one of the priors. Works only if modeProcess=’legacy’.
keep_expression_matrix (bool) – Keeps the input expression matrix in the result Panda object.
modeProcess (str) – The input data processing mode.
- ‘legacy’: refers to the processing mode in netZooPy<=0.5
- ‘union’ (default): takes the union of all TFs and genes across priors and fills the missing genes in the priors with zeros.
- ‘intersection’: intersects the input genes and TFs across priors and removes the missing TFs/genes.
with_header (bool) – pass True when the expression file has a header with the sample names
tmp_folder (str) – Path to the folder to save temporary files

return_panda_indegree(): Computes indegree of PANDA network, only if save_memory = False.

return_panda_outdegree(): computes outdegree of PANDA network, only if save_memory = False.

save_lioness_results(lioness_output_filename=None)[source]: Saves LIONESS network. Uses self.save_fmt, self.save_dir to save the data into self.total_lioness_network

save_panda_results(path='panda.npy', save_adjacency=False, old_compatible=True)

Saves PANDA network.

Parameters:

path (str) – Path to save the network.
save_adjacency (bool) – if True the output is an adjacency matrix and not the edge list
old_compatible (bool) – if True saves the data as it was saved until netzoopy 0.9.11

top_network_plot(top=100, file='panda_top_100.png', plot_bipart=False)

Selects top genes.

Parameters:

top (int) – Top number of genes to plot.
file (str) – File to save the network plot.
plot_bipart (bool) – Plot the network as a bipartite layout.

class netZooPy.lioness.analyze_lioness.AnalyzeLioness(lioness_data)[source]

Plots LIONESS network.

Parameters: lioness_data (object) – lioness object.

top_network_plot(index=0, top=100, file='lioness_top_100.png')[source]

Network of top genes.

Parameters:

index (int (defaults to 0)) – Index of sample to plot.
top (int (defaults to 100)) – Top number of genes to plot.
file (str) – File to save the network plot.

PANDA

class netZooPy.panda.panda.Panda(expression_file, motif_file, ppi_file, computing='cpu', precision='double', save_memory=True, save_tmp=False, remove_missing=False, keep_expression_matrix=False, modeProcess='union', alpha=0.1, start=1, end=None, with_header=False, cobra_design_matrix=None, cobra_covariate_to_keep=0, tmp_folder='./tmp/', process_data_only=False)[source]

Using PANDA to infer gene regulatory network.

Reading in input data (expression data, motif prior, TF PPI data)
Computing coexpression network
Normalizing networks
Running PANDA algorithm
Writing out PANDA network

Parameters:

expression_file (str) – Either i) a string of a path to a file containing the gene expression data or ii) a pandas dataframe. By default, the expression file does not have a header, and the cells are separated by a tab.
motif_file (str) – Either i) a string of a path to a file containing the transcription factor DNA binding motif data in the form of TF-gene-weight(0/1) as a tab-separated file without a header or ii) a pandas dataframe. If set to none, the gene coexpression matrix is returned as a result network.
ppi_file (str) – Either i) a path to a file containing the PPI data or ii) a pandas dataframe. The PPI has to reflect an undirected network (A - B); if not, it will be transformed into an undirected network by building a symmetrical adjacency matrix (A -> B, B -> A).
computing (str) – ‘cpu’ uses the Central Processing Unit (CPU) to run PANDA. ‘gpu’ uses the Graphical Processing Unit (GPU) to run PANDA.
precision (str) –
- ‘double’ computes the regulatory network in double precision (15 decimal digits).
- ‘single’ computes the regulatory network in single precision (7 decimal digits), which is faster and requires half the memory but is less accurate.
save_memory (bool) –
- True: removes temporary results from memory. The result network is a weighted adjacency matrix of size (nTFs, nGenes).
- False: keeps the temporary files in memory. The result network has 4 columns in the form gene - TF - weight in motif prior - PANDA edge.
save_tmp (bool) – Save temporary variables.
remove_missing (bool) – Removes the genes and TFs that are not present in one of the priors. Works only if modeProcess=’legacy’.
keep_expression_matrix (bool) – Keeps the input expression matrix in the result Panda object.
modeProcess (str) – The input data processing mode.
- ‘legacy’: refers to the processing mode in netZooPy<=0.5
- ‘union’ (default): takes the union of all TFs and genes across priors and fills the missing genes in the priors with zeros.
- ‘intersection’: intersects the input genes and TFs across priors and removes the missing TFs/genes.
alpha (float) – Learning rate (default: 0.1)
start (int) – First sample of the expression dataset. This replicates the behavior of Lioness (default : 1)
end (int) – Last sample of the expression dataset. This replicates the behavior of Lioness (default : None )
cobra_design_matrix (np.ndarray, pd.DataFrame) – COBRA design matrix of size (n, q), n = number of samples, q = number of covariates
cobra_covariate_to_keep (int) – Zero-based index of the COBRA co-expression component to use
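
The difference between the ‘union’ and ‘intersection’ modes of modeProcess can be illustrated with plain sets. This is only a sketch of the idea for a single TF's motif prior; align_prior and the toy gene names are hypothetical and do not reproduce PANDA's actual data handling.

```python
def align_prior(prior, expression_genes, mode="union"):
    """Align one TF's motif prior (gene -> weight) with the expression genes."""
    prior_genes = set(prior)
    if mode == "union":
        genes = sorted(prior_genes | set(expression_genes))
    elif mode == "intersection":
        genes = sorted(prior_genes & set(expression_genes))
    else:
        raise ValueError("mode must be 'union' or 'intersection'")
    # Genes missing from the prior are filled in with a zero weight.
    return {gene: prior.get(gene, 0.0) for gene in genes}


prior = {"G2": 1.0, "G3": 1.0}   # genes with a motif hit for the TF
expression_genes = ["G1", "G2"]  # genes present in the expression data
union_prior = align_prior(prior, expression_genes, "union")
intersection_prior = align_prior(prior, expression_genes, "intersection")
```

In union mode G1 is kept and receives a zero prior weight, while intersection mode keeps only G2, the one gene present in both inputs.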

Examples

Note that these examples use small toy data that may not reflect an actual use case. To use actual gene expression, motif, and PPI data, please refer to the [GRAND](https://grand.networkmedicine.org/) database.

>>> #Import the classes in the pypanda library:
>>> from netZooPy.panda.panda import Panda
>>> #Run the Panda algorithm, leave out motif and PPI data to use Pearson correlation network:
>>> panda_obj = Panda('../../tests/ToyData/ToyExpressionData.txt', '../../tests/ToyData/ToyMotifData.txt', '../../tests/ToyData/ToyPPIData.txt', remove_missing=False)
>>> #Save the results:
>>> panda_obj.save_panda_results('Toy_Panda.pairs.txt')
>>> #Return a network plot:
>>> panda_obj.top_network_plot(top=70, file='top_genes.png')
>>> #Calculate in- and outdegrees for further analysis:
>>> indegree = panda_obj.return_panda_indegree()
>>> outdegree = panda_obj.return_panda_outdegree()

Notes

Toy data: The example gene expression data that we have available here contains gene expression profiles for different samples in the columns. Of note, this is just a small subset of a larger gene expression dataset. We provided these “toy” data so that the user can test the method.

Gene naming nomenclature: Gene names have to be consistent between the gene expression and motif columns, and between the TF PPI matrix and motif rows. For example, gene expression and motif columns can be in Ensembl gene IDs (ENSG), and TF PPI and motif rows can be in HUGO gene symbols.

Sample PANDA results:

TF Gene Motif Force
CEBPA AACSL 0.0 -0.951416589143
CREB1 AACSL 0.0 -0.904241609324
DDIT3 AACSL 0.0 -0.956471642313
E2F1 AACSL 1.0 3.685316051
EGR1 AACSL 0.0 -0.695698519643
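
A result table in this form is easy to post-process. The sketch below parses the five rows shown above, assuming whitespace-separated columns exactly as displayed (the real file may use a different delimiter or omit the header), and picks the edge with the largest Force value.

```python
sample = """TF Gene Motif Force
CEBPA AACSL 0.0 -0.951416589143
CREB1 AACSL 0.0 -0.904241609324
DDIT3 AACSL 0.0 -0.956471642313
E2F1 AACSL 1.0 3.685316051
EGR1 AACSL 0.0 -0.695698519643"""

rows = []
for line in sample.splitlines()[1:]:  # skip the header line
    tf, gene, motif, force = line.split()
    rows.append({"TF": tf, "Gene": gene,
                 "Motif": float(motif), "Force": float(force)})

# Edge with the strongest predicted regulatory relationship.
strongest = max(rows, key=lambda row: row["Force"])
print(strongest["TF"], strongest["Gene"])  # prints: E2F1 AACSL
```

In this excerpt, the only edge with a motif prior of 1.0 is also the one with the highest Force.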

References

Authors: Cho-Yi Chen, David Vi, Alessandro Marin, Marouen Ben Guebila, Daniel Morgan

__create_plot(unique_genes, links, file='panda.png', plot_bipart=False)

Runs the plot.

Parameters:

unique_genes (list) – Unique list of PANDA genes.
links (list) – Edges of the subset PANDA network to the top genes.
file (str) – File to save the network plot.
plot_bipart (bool) – Plot the network as a bipartite layout.

Notes

split_label: Splits the plot label over several lines for plotting purposes.

__pearson_results_data_frame(): Saves PANDA network in edges format.

__remove_missing(): Removes the genes and TFs that are not present in one of the priors. Works only if modeProcess=’legacy’.

__shape_plot_network(subset_panda_results, file='panda.png', plot_bipart=False)

Creates plot.

Parameters:

subset_panda_results (array) – Reduced PANDA network to the top genes.
file (str) – File to save the network plot.
plot_bipart (bool) – Plot the network as a bipartite layout.

_normalize_network(x)[source]

Standardizes the input data matrices.

Parameters: x (array) – Input adjacency matrix.
Returns: normalized_matrix – Standardized adjacency matrix.
Return type: array
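
The description above only says that the matrix is standardized. One common PANDA-style reading, assumed here and not confirmed by this page, is to z-score the matrix by rows and by columns and combine the two as (Z_row + Z_col) / sqrt(2). The pure-Python sketch below implements that assumption; the function names are hypothetical, and the code expects every row and column to have nonzero variance.

```python
import math
from statistics import mean, pstdev


def zscore_rows(x):
    """Z-score each row using the population standard deviation."""
    out = []
    for row in x:
        mu, sigma = mean(row), pstdev(row)
        out.append([(value - mu) / sigma for value in row])
    return out


def normalize_network_sketch(x):
    """Combine row-wise and column-wise z-scores as (Z_row + Z_col) / sqrt(2)."""
    z_row = zscore_rows(x)
    # Z-score the columns by z-scoring the rows of the transpose.
    z_col_t = zscore_rows([list(col) for col in zip(*x)])
    return [
        [(z_row[i][j] + z_col_t[j][i]) / math.sqrt(2) for j in range(len(x[0]))]
        for i in range(len(x))
    ]


normalized = normalize_network_sketch([[1.0, 2.0], [3.0, 4.0]])
```

For this 2-by-2 input the result is a matrix with -sqrt(2) and +sqrt(2) on the diagonal and zeros elsewhere.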

panda_loop(correlation_matrix, motif_matrix, ppi_matrix, computing='cpu', alpha=0.1)[source]

The PANDA algorithm.

Parameters:

correlation_matrix (array) – Input coexpression matrix.
motif_matrix (array) – Input motif regulation prior network.
ppi_matrix (array) – Input PPI matrix.
computing (str) – ‘cpu’ uses the Central Processing Unit (CPU) to run PANDA. ‘gpu’ uses the Graphical Processing Unit (GPU) to run PANDA.

processData(modeProcess, motif_file, expression_file, ppi_file, remove_missing, keep_expression_matrix, start=1, end=None, with_header=False, cobra_design_matrix=None, cobra_covariate_to_keep=0)[source]

Processes data files into data matrices.

Parameters:

expression_file (str) – Path to file containing the gene expression data or pandas dataframe. By default, the expression file does not have a header, and the cells are separated by a tab. Pass with_header=True if the expression data includes the sample names.
motif_file (str) – Path to file containing the transcription factor DNA binding motif data in the form of TF-gene-weight(0/1) or pandas dataframe. If set to none, the gene coexpression matrix is returned as a result network.
ppi_file (str) – Path to file containing the PPI data, or pandas dataframe. The PPI can be symmetrical; if not, it will be transformed into a symmetrical adjacency matrix.
remove_missing (bool) – Removes the genes and TFs that are not present in one of the priors. Works only if modeProcess=’legacy’.
keep_expression_matrix (bool) – Keeps the input expression matrix in the result Panda object.
modeProcess (str) – The input data processing mode.
- ‘legacy’: refers to the processing mode in netZooPy<=0.5
- ‘union’ (default): takes the union of all TFs and genes across priors and fills the missing genes in the priors with zeros.
- ‘intersection’: intersects the input genes and TFs across priors and removes the missing TFs/genes.
with_header (bool) – pass True when the expression file has a header with the sample names
tmp_folder (str) – Path to the folder to save temporary files
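
The symmetrization of a non-symmetrical PPI mentioned for ppi_file can be pictured as adding every edge in both directions. The sketch below does this with nested dictionaries; symmetrize_ppi and the toy edges are hypothetical, and when the two directions carry different weights this sketch simply keeps the last one seen, which may differ from the library's behavior.

```python
def symmetrize_ppi(edges):
    """Return a symmetric adjacency dict from (tf_a, tf_b, weight) triples."""
    adjacency = {}
    for tf_a, tf_b, weight in edges:
        # Store the edge in both directions: A -> B and B -> A.
        adjacency.setdefault(tf_a, {})[tf_b] = weight
        adjacency.setdefault(tf_b, {})[tf_a] = weight
    return adjacency


ppi = symmetrize_ppi([("TF1", "TF2", 1.0), ("TF2", "TF3", 1.0)])
```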

return_panda_indegree()[source]: Computes indegree of PANDA network, only if save_memory = False.

return_panda_outdegree()[source]: Computes outdegree of PANDA network, only if save_memory = False.
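
For a weighted TF-to-gene edge list, indegree and outdegree are commonly taken as the summed edge weights per gene and per TF, respectively. The sketch below computes both under that assumption from a hypothetical toy edge list; it does not call netZooPy.

```python
from collections import defaultdict

# Hypothetical (TF, gene, edge weight) triples.
edges = [("TF1", "G1", 1.0), ("TF2", "G1", 2.0), ("TF1", "G2", 0.5)]

indegree = defaultdict(float)   # summed incoming weight per gene
outdegree = defaultdict(float)  # summed outgoing weight per TF
for tf, gene, weight in edges:
    indegree[gene] += weight
    outdegree[tf] += weight
```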

save_panda_results(path='panda.npy', save_adjacency=False, old_compatible=True)[source]

Saves PANDA network.

Parameters:

path (str) – Path to save the network.
save_adjacency (bool) – if True the output is an adjacency matrix and not the edge list
old_compatible (bool) – if True saves the data as it was saved until netzoopy 0.9.11

top_network_plot(top=100, file='panda_top_100.png', plot_bipart=False)[source]

Selects top genes.

Parameters:

top (int) – Top number of genes to plot.
file (str) – File to save the network plot.
plot_bipart (bool) – Plot the network as a bipartite layout.

class netZooPy.panda.analyze_panda.AnalyzePanda(panda_data)[source]

Plots PANDA network.

Parameters: panda_data (object) – PANDA object

__create_plot(unique_genes, links, file='panda.png')

Plot panda network on specified genes and edges

Parameters:

unique_genes (list) – Unique list of PANDA genes.
links (edges) – Edges of the subset PANDA network to the top genes.
file (str (Defaults to panda.png)) – File to save the network plot.

__shape_plot_network(subset_panda_results, file='panda.png')

Creates plot.

Parameters:

subset_panda_results (array) – Reduced PANDA network to the top genes.
file (str (Defaults to panda.png)) – File to save the network plot.

top_network_plot(top=100, file='panda_top_100.png')[source]

Selects top genes.

Parameters:

top (int (Defaults to 100)) – Top number of genes to plot.
file (str (Defaults to panda_top_100.png)) – File to save the network plot.

PUMA

class netZooPy.puma.puma.Puma(expression_file, motif_file, ppi_file, mir_file, modeProcess='union', computing='cpu', precision='double', save_memory=False, save_tmp=False, remove_missing=False, keep_expression_matrix=False, alpha=0.1, start=1, end=None, df_correlation_matrix=None)[source]

Using PUMA to infer gene regulatory network.

Reading in input data (expression data, motif prior, TF PPI data, miR)
Computing coexpression network
Normalizing networks
Running PUMA algorithm
Writing out PUMA network

Parameters:

expression_file (str) – Path to file containing the gene expression data or pandas dataframe. By default, the expression file does not have a header, and the cells are separated by a tab.
motif_file (str) – Path to file containing the regulation prior as a tab-separated file without a header. This can be a miRNA-Gene predicted network from TargetScan/miRanda. However, this can be combined with transcription factor DNA binding motif data in the form of TF-gene-weight(0/1) to estimate gene regulation by TF and miRNA. Alternatively, this can be a dataframe with the motif network. If set to none, the gene coexpression matrix is returned as a result network.
ppi_file (str) – Path to file containing the TF PPI data, or pandas dataframe. This can be provided as ‘None’ if no TF data is given, in which case PUMA will estimate a miRNA-Gene network.
mir_file (str) – Path to file containing miRNA list or a list. A standard mir_file can be read as:
>>> with open(mir_file, "r") as f:
...     miR = f.read().splitlines()
computing (str) –
- ‘cpu’ uses the Central Processing Unit (CPU) to run PANDA.
- ‘gpu’ uses the Graphical Processing Unit (GPU) to run PANDA.
precision (str) – ‘double’ computes the regulatory network in double precision (15 decimal digits). ‘single’ computes the regulatory network in single precision (7 decimal digits), which is faster and requires half the memory but is less accurate.
save_memory (bool) –
- True: removes temporary results from memory. The result network is a weighted adjacency matrix of size (nTFs, nGenes).
- False: keeps the temporary files in memory. The result network has 4 columns in the form gene - TF - weight in motif prior - PUMA edge.
save_tmp (bool) – Save temporary variables.
remove_missing (bool) – Removes the genes and TFs that are not present in one of the priors. Works only if modeProcess=’legacy’.
keep_expression_matrix (bool) – Keeps the input expression matrix in the result Puma object.
modeProcess (str) – The input data processing mode.
- ‘legacy’: refers to the processing mode in netZooPy<=0.5
- ‘union’ (default): takes the union of all TFs and genes across priors and fills the missing genes in the priors with zeros.
- ‘intersection’: intersects the input genes and TFs across priors and removes the missing TFs/genes.
alpha (float) – Learning rate (default: 0.1)
start (int) – first sample of expression (default 1)
end (int) – last sample of expression (default None)

Examples

Run the PUMA algorithm, leave out motif and PPI data to use Pearson correlation network:

>>> from netZooPy.puma.puma import Puma
>>> puma_obj = Puma('../../tests/ToyData/ToyExpressionData.txt', '../../tests/ToyData/ToyMotifData.txt', '../../tests/ToyData/ToyPPIData.txt', '../../tests/ToyData/ToyMiRList.txt')

References

[1] Kuijjer, Marieke L., et al. “PUMA: PANDA Using MicroRNA Associations.” OUP Bioinformatics (2019).

Authors: cychen, davidvi, alessandromarin

__create_plot(unique_genes, links, file='puma.png')

Runs the plot.

Parameters:

unique_genes (list) – Unique list of PUMA genes.
links (list) – Edges of the subset PUMA network to the top genes.
file (str) – File to save the network plot.

__pearson_results_data_frame(): Saves PUMA network in edges format.

__remove_missing(): Removes the gens and TFs that are not present in one of the priors. Works only if modeProcess=’legacy’.

__shape_plot_network(subset_puma_results, file='puma.png')

Creates plot.

Parameters:

subset_puma_results (array) – Reduced PUMA network to the top genes.
file (str) – File to save the network plot.

_normalize_network(x)[source]

Standardizes the input data matrices.

Parameters: x (array) – Input adjacency matrix.
Returns: normalized_matrix – Standardized adjacency matrix.
Return type: array

puma_loop(correlation_matrix, motif_matrix, ppi_matrix, computing='cpu', alpha=0.1)[source]

The PUMA algorithm.

Parameters:

correlation_matrix (array) – Input coexpression matrix.
motif_matrix (array) – Input motif regulation prior network.
ppi_matrix (array) – Input PPI matrix.
computing (str) –
- ‘cpu’ uses the Central Processing Unit (CPU) to run PANDA.
- ‘gpu’ uses the Graphical Processing Unit (GPU) to run PANDA.

return_puma_indegree()[source]: Computes indegree of PUMA network, only if save_memory = False.

return_puma_outdegree()[source]: Computes outdegree of PUMA network, only if save_memory = False.

save_puma_results(path='puma.npy')[source]

Saves PUMA network.

Parameters: path (str) – Path to save the network.

top_network_plot(top=100, file='puma_top_100.png')[source]

Selects top genes and plots the network.

Parameters:

top (int) – Top number of genes to plot.
file (str) – File to save the network plot.

SAMBAR

netZooPy.sambar.sambar(mut_file='/home/docs/checkouts/readthedocs.org/user_builds/netzoopy/checkouts/stable/netZooPy/sambar/mut.ucec.csv', esize_file='/home/docs/checkouts/readthedocs.org/user_builds/netzoopy/checkouts/stable/netZooPy/sambar/esizef.csv', genes_file='/home/docs/checkouts/readthedocs.org/user_builds/netzoopy/checkouts/stable/netZooPy/sambar/genes.txt', gmtfile='/home/docs/checkouts/readthedocs.org/user_builds/netzoopy/checkouts/stable/netZooPy/sambar/h.all.v6.1.symbols.gmt', normPatient=True, kmin=2, kmax=4, gmtMSigDB=True, subcangenes=True, distance='binomial', linkagem='complete', cluster=True)[source]

Runs SAMBAR and outputs the pt matrix, the mt matrix and the clustering matrix.

Parameters:

mut_file – Matrix of mutations with genes as columns and samples as rows, in CSV format.
esize_file – File with genes and their lengths.
genes_file – File with the list of cancer-associated genes.
gmtfile – Gene list by pathway, in MSigDB format.
normPatient – Whether to normalize mutation data by the number of mutations in a sample.
kmin – Minimum number of groups in the clustering.
kmax – Maximum number of groups in the clustering.
gmtMSigDB – Whether the signature file comes from MSigDB (important for processing the file).
subcangenes – Whether to subset to cancer-associated genes.
distance – Distance metric for the clustering. Default is binomial distance, but any distance from scipy.spatial.distance can be used.
linkagem – Linkage method. Default is complete.
cluster – Whether the clustering is computed; if not, the output is just the pathway mutation scores.

Returns:

pt – The pathway matrix.
groups – The groups dataframe, returned as a Python object.
mt_out.csv – Processed gene mutation scores.
pt_out.csv – Pathway mutation scores.
clustergroups.csv – Matrix of group membership in the clustering.
dist_matrix.csv – The distance matrix. Computing it is resource-consuming, so it is written to disk so that it doesn’t have to be computed again.

Reference: Kuijjer, Marieke Lydia, et al. “Cancer subtype identification using somatic mutation data.” British Journal of Cancer 118.11 (2018): 1492-1501.

Extra sambar functions

netZooPy.sambar.corgenelength(mut, cangenes, esize, normbysample=True, subcangenes=True)[source]

Normalizes gene mutation scores by gene length. mut should be a dataframe of mutation scores with genes as columns and samples as rows. Important: if the data is oriented the other way, transpose the matrix first, or the function will not work.

Parameters:

mut (dataFrame) – Mutation scores, with genes as columns and samples as rows.
cangenes (list) – A set of cancer-associated genes.
esize (dataFrame) – A dataframe of gene lengths.
normbysample (bool) –
- True : Normalizes the gene mutation scores in a sample by the total mutations within the sample.
- False: Deactivate normalization.
subcangenes (bool) –
- True: Subsets mutation data to cancer-associated genes.
- False: Takes all genes.

Returns:

mut – Mutation scores normalized by gene length.
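The normalization described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the corgenelength implementation: the toy data, the variable names, and the order of the three steps (subset, divide by gene length, divide by the per-sample total) are all hypothetical.

```python
import pandas as pd

# Toy inputs (all values are made up): samples as rows, genes as columns
mut = pd.DataFrame({"TP53": [2, 0], "KRAS": [1, 1], "TTN": [4, 2]},
                   index=["sample1", "sample2"])
gene_length = pd.Series({"TP53": 1000, "KRAS": 500, "TTN": 4000})
cancer_genes = ["TP53", "KRAS"]

# Optionally subset to cancer-associated genes (the subcangenes idea)
mut = mut[[g for g in mut.columns if g in cancer_genes]]
# Divide each gene's mutation count by that gene's length
mut_by_length = mut.div(gene_length[mut.columns], axis=1)
# Optionally divide each sample by its total (the normbysample idea)
mut_norm = mut_by_length.div(mut_by_length.sum(axis=1), axis=0)
```

With samples as rows, both divisions align on labels, which is why the orientation of mut matters.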

netZooPy.sambar.convertgmt(gmtfile, cangenes, gmtMSigDB=True, subcangenes=True)[source]

This function takes as input the name of a GMT file containing lists of genes associated with pathways. It outputs an adjacency matrix of genes and pathways. It can also subset the genes to a list of cancer-associated genes.

Parameters:

gmtfile (str) – Path to the GMT file.
cangenes (list) – A set of cancer-associated genes.
gmtMSigDB (bool, optional) – If True, the GMT file comes from MSigDB. Defaults to True.
subcangenes (bool, optional) – If True, subsets mutation data to cancer-associated genes; otherwise takes all genes. Defaults to True.

Returns:

sign_matrix – Adjacency matrix of genes and pathways.
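To make the conversion concrete, the sketch below parses GMT-style text (pathway name, a description field, then member genes, all tab-separated) into a binary membership table. It is an illustration only: the toy pathways, the gene subset, and the choice of pathways as rows are assumptions, and convertgmt may orient or encode its output differently.

```python
# Toy GMT content: pathway name, description, then member genes (tab-separated)
gmt_text = (
    "PATHWAY_A\tdescription_a\tTP53\tKRAS\tEGFR\n"
    "PATHWAY_B\tdescription_b\tKRAS\tBRAF\n"
)
cancer_genes = {"TP53", "KRAS", "BRAF"}  # hypothetical cancer-associated subset

pathways = {}
for line in gmt_text.splitlines():
    name, _description, *members = line.split("\t")
    # Keep only the cancer-associated genes
    pathways[name] = [g for g in members if g in cancer_genes]

genes = sorted({g for members in pathways.values() for g in members})
# Rows are pathways, columns follow the order of `genes`; 1 marks membership
adjacency = {name: [int(g in members) for g in genes]
             for name, members in pathways.items()}
```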

netZooPy.sambar.desparsify(mutdata, exonsize, gmtfile, cangenes, normMut=True, gmtMSigDB=True, subcangenes=True)[source]

Applies the sambar method to de-sparsify the mutation data using the pathway signatures in the gmtfile.

Parameters:

mutdata – Mutation scores.
exonsize – A dataframe of gene lengths.
gmtfile – Path to the GMT file.
cangenes – A set of cancer-associated genes.
normMut – True: Normalizes the gene mutation scores in a sample by the total mutations within the sample. False: Deactivates normalization.
gmtMSigDB – True: GMT file from MSigDB. False: file not from MSigDB.
subcangenes – True: Subsets mutation data to cancer-associated genes. False: Takes all genes.

Returns:

mt – Genes in both pathways and mutation data.
pathway_scores – Pathway scores.

netZooPy.sambar.binomial_dist(u, v)[source]

Implementation of the binomial dissimilarity function (Millar distance) from the vegdist function of the vegan package in R.

Parameters:

u – First vector to compare distance from.
v – Second vector to compare distance to.

Returns:

bd – binomial dissimilarity.
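For reference, the binomial (Millar) dissimilarity as defined for vegan's vegdist sums, over all features with a nonzero total n = u_i + v_i, the term [u_i log(u_i/n) + v_i log(v_i/n) + n log 2] / n, with 0 log 0 taken as 0. The pure-Python sketch below follows that definition; the function name millar_binomial is hypothetical, and the netZooPy implementation may differ in details such as input types or edge-case handling.

```python
import math

def millar_binomial(u, v):
    """Sketch of the Millar (binomial) dissimilarity between two vectors."""
    dist = 0.0
    for a, b in zip(u, v):
        n = a + b
        if n == 0:
            continue  # features absent from both vectors contribute nothing
        t1 = a * math.log(a / n) if a > 0 else 0.0
        t2 = b * math.log(b / n) if b > 0 else 0.0
        dist += (t1 + t2 + n * math.log(2)) / n
    # Guard against tiny negative values from floating-point rounding
    return max(dist, 0.0)

same = millar_binomial([1, 2], [1, 2])      # identical vectors
disjoint = millar_binomial([1, 0], [0, 1])  # no shared features
```

Identical vectors give a dissimilarity of zero, and each feature present in only one of the two vectors contributes log 2.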

netZooPy.sambar.clustering(pt, kmin, kmax, distance, linkagem)[source]

Computes the clustering for the pathways matrix and returns a dataframe of group assignments for each number of clusters k from kmin to kmax.

Parameters:

pt – Pathway scores.
kmin – Minimum number of groups in the clustering.
kmax – Maximum number of groups in the clustering.
distance – Distance metric for the clustering. Default is binomial distance, but any distance from scipy.spatial.distance can be used.
linkagem – Linkage method. Default is complete.

Returns:

df – Cluster assignment dataframe.
