mpc-bioinformatics / ProtGraph

ProtGraph - A Graph-Generator for Proteins

Potential Bottlenecks in ProtGraph

Luxxii opened this issue

I ran a line profiler on the generate_graph_consumer function, using the human_review dataset (~20k proteins).

It gives me the following output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   114                                           def generate_graph_consumer(entry_queue, graph_queue, common_out_queue, proc_id, **kwargs):
   115                                               """
   116                                               TODO
   117                                               describe kwargs and consumer until a graph is generated and digested etc ...
   118                                               """
   119                                               # Set proc id
   120         1          3.0      3.0      0.0      kwargs["proc_id"] = proc_id
   121                                           
   122                                               # Set feature_table dict boolean table
   123         1          1.0      1.0      0.0      ft_dict = dict()
   124         1          1.0      1.0      0.0      if kwargs["feature_table"] is None or len(kwargs["feature_table"]) == 0 or "ALL" in kwargs["feature_table"]:
   125         1          6.0      6.0      0.0          ft_dict = dict(VARIANT=True, VAR_SEQ=True, SIGNAL=True, INIT_MET=True, MUTAGEN=True, CONFLICT=True)
   126                                               else:
   127                                                   for i in kwargs["feature_table"]:
   128                                                       ft_dict[i] = True
   129                                           
   130                                               # Initialize the exporters for graphs
   131         1         58.0     58.0      0.0      graph_exporters = Exporters(**kwargs)
   132                                           
   133                                               while True:
   134                                                   # Get next entry
   135     20387    7568810.0    371.3      0.7          entry = entry_queue.get()
   136                                           
   137                                                   # Stop if entry is None
   138     20387      18156.0      0.9      0.0          if entry is None:
   139                                                       # --> Stop Condition of Process
   140         1          3.0      3.0      0.0              break
   141                                           
   142                                                   # Beginning of Graph-Generation
   143                                                   # We also collect interesting information here!
   144                                           
   145                                                   # Generate canonical graph (initialization of the graph)
   146     20386    4416910.0    216.7      0.4          graph = _generate_canonical_graph(entry.sequence, entry.accessions[0])
   147                                           
   148                                                   # FT parsing and appending of Nodes and Edges into the graph
   149                                                   # The amount of isoforms, etc.. can be retrieved on the fly
   150     20386      22188.0      1.1      0.0          num_isoforms, num_initm, num_signal, num_variant, num_mutagens, num_conficts =\
   151     20386  312577641.0  15333.0     29.4              _include_ft_information(entry, graph, ft_dict)
   152                                           
   153                                                   # Replace Amino Acids based on user defined rules: E.G.: "X -> A,B,C"
   154     20386      83272.0      4.1      0.0          replace_aa(graph, kwargs["replace_aa"])
   155                                           
   156                                                   # Digest graph with enzyme (unlimited miscleavages)
   157     20386  457306111.0  22432.4     43.0          num_of_cleavages = digest(graph, kwargs["digestion"])
   158                                           
   159                                                   # Merge (summarize) graph if wanted
   160     20386      29893.0      1.5      0.0          if not kwargs["no_merge"]:
   161     20386  268518281.0  13171.7     25.3              merge_aminoacids(graph)
   162                                           
   163                                                   # Collapse parallel edges in a graph
   164     20386      29694.0      1.5      0.0          if not kwargs["no_collapsing_edges"]:
   165     20386   10804029.0    530.0      1.0              collapse_parallel_edges(graph)
   166                                           
   167                                                   # Annotate weights for edges and nodes (maybe even the smallest weight possible to get to the end node)
   168     20386     948172.0     46.5      0.1          annotate_weights(graph, **kwargs)
   169                                           
   170                                                   # Calculate statistics on the graph:
   171     20386      11921.0      0.6      0.0          (
   172     20386      12094.0      0.6      0.0              num_nodes, num_edges, num_paths, num_paths_miscleavages, num_paths_hops,
   173     20386       9768.0      0.5      0.0              num_paths_var, num_path_mut, num_path_con
   174     20386     297176.0     14.6      0.0          ) = get_statistics(graph, **kwargs)
   175                                           
   176                                                   # Verify graphs if wanted:
   177     20386      11624.0      0.6      0.0          if kwargs["verify_graph"]:
   178                                                       verify_graph(graph)
   179                                           
   180                                                   # Persist or export graphs with speicified exporters
   181     20386      38415.0      1.9      0.0          graph_exporters.export_graph(graph, common_out_queue)
   182                                           
   183                                                   # Output statistics we gathered during processing
   184     20386      10500.0      0.5      0.0          if kwargs["no_description"]:
   185                                                       entry_protein_desc = None
   186                                                   else:
   187     20386      37338.0      1.8      0.0              entry_protein_desc = entry.description.split(";", 1)[0]
   188     20386      37142.0      1.8      0.0              entry_protein_desc = entry_protein_desc[entry_protein_desc.index("=") + 1:]
   189                                           
   190     40772     312422.0      7.7      0.0          graph_queue.put(
   191     20386      12337.0      0.6      0.0              (
   192     20386      11818.0      0.6      0.0                  entry.accessions[0],  # Protein Accesion
   193     20386      10432.0      0.5      0.0                  entry.entry_name,  # Protein displayed name
   194     20386       9196.0      0.5      0.0                  num_isoforms,  # Number of Isoforms
   195     20386       9231.0      0.5      0.0                  num_initm,  # Number of Init_M (either 0 or 1)
   196     20386       9244.0      0.5      0.0                  num_signal,  # Number of Signal Peptides used (either 0 or 1)
   197     20386       9232.0      0.5      0.0                  num_variant,  # Number of Variants applied to this protein
   198     20386       9227.0      0.5      0.0                  num_mutagens,  # Number of applied mutagens on the graph
   199     20386       9231.0      0.5      0.0                  num_conficts,  # Number of applied conflicts on the graph
   200     20386       9274.0      0.5      0.0                  num_of_cleavages,  # Number of cleavages (marked edges) this protein has
   201     20386       9240.0      0.5      0.0                  num_nodes,  # Number of nodes for the Protein/Peptide Graph
   202     20386       9269.0      0.5      0.0                  num_edges,  # Number of edges for the Protein/Peptide Graph
   203     20386       9311.0      0.5      0.0                  num_paths,  # Possible (non repeating paths) to the end of a graph. (may conatin repeating peptides)
   204     20386       9318.0      0.5      0.0                  num_paths_miscleavages,  # As num_paths, but binned to the number of miscleavages (by list idx, at 0)
   205     20386       9288.0      0.5      0.0                  num_paths_hops,  # As num_paths, only that we bin by hops (E.G. useful for determine DFS or BFS depths)
   206     20386       9363.0      0.5      0.0                  num_paths_var,  # Num paths of feture variant
   207     20386       9519.0      0.5      0.0                  num_path_mut,  # Num paths of feture mutagen
   208     20386       9508.0      0.5      0.0                  num_path_con,  # Num paths of feture conflict
   209     20386       9476.0      0.5      0.0                  entry_protein_desc,  # Description name of the Protein (can be lenghty)
   210                                                       )
   211                                                   )
   212                                           
   213                                               # Close exporters (maybe opened files, database connections, etc... )
   214         1         13.0     13.0      0.0      graph_exporters.close()

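The bottlenecks, which together account for roughly 98% of the measured time, are (largest first):

  • Digestion, digest (~43%)
  • Applying features, _include_ft_information (~29%)
  • Merging amino acids, merge_aminoacids (~25%)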
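
For anyone who wants to re-check this ranking after a change, here is a minimal sketch that pulls the hottest lines out of a profiler table in the format shown above. The names top_hotspots, ROW and SAMPLE and the 5% threshold are illustrative choices of mine, not part of ProtGraph or line_profiler, and the parser assumes the column layout of the output pasted in this issue.

```python
import re

# One executed row of the profiler table:
# Line #, Hits, Time, Per Hit, % Time, Line Contents.
# Rows that were never executed have no numeric columns and do not match.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(.*)$")


def top_hotspots(profile_text, min_percent=5.0):
    """Return (percent, line_no, source) tuples at or above min_percent, hottest first."""
    hotspots = []
    for line in profile_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator, or a line without timing data
        line_no, _hits, _time, _per_hit, percent, source = match.groups()
        if float(percent) >= min_percent:
            hotspots.append((float(percent), int(line_no), source.strip()))
    return sorted(hotspots, reverse=True)


# A few rows copied from the output above (hypothetical variable name).
SAMPLE = """\
   151     20386  312577641.0  15333.0     29.4              _include_ft_information(entry, graph, ft_dict)
   154     20386      83272.0      4.1      0.0          replace_aa(graph, kwargs["replace_aa"])
   157     20386  457306111.0  22432.4     43.0          num_of_cleavages = digest(graph, kwargs["digestion"])
   161     20386  268518281.0  13171.7     25.3              merge_aminoacids(graph)
"""

for percent, line_no, source in top_hotspots(SAMPLE):
    print(f"{percent:5.1f}%  line {line_no}: {source}")
```

On the sample rows this lists the digest call first, then _include_ft_information, then merge_aminoacids, and drops replace_aa because it falls below the threshold.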