Potential Bottlenecks in ProtGraph
Luxxii opened this issue
Dominik Lux commented
I ran a line profiler on the generate_graph_consumer method
using the human_review dataset (about 20k proteins).
It gives me the following output:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
114 def generate_graph_consumer(entry_queue, graph_queue, common_out_queue, proc_id, **kwargs):
115 """
116 TODO
117 describe kwargs and consumer until a graph is generated and digested etc ...
118 """
119 # Set proc id
120 1 3.0 3.0 0.0 kwargs["proc_id"] = proc_id
121
122 # Set feature_table dict boolean table
123 1 1.0 1.0 0.0 ft_dict = dict()
124 1 1.0 1.0 0.0 if kwargs["feature_table"] is None or len(kwargs["feature_table"]) == 0 or "ALL" in kwargs["feature_table"]:
125 1 6.0 6.0 0.0 ft_dict = dict(VARIANT=True, VAR_SEQ=True, SIGNAL=True, INIT_MET=True, MUTAGEN=True, CONFLICT=True)
126 else:
127 for i in kwargs["feature_table"]:
128 ft_dict[i] = True
129
130 # Initialize the exporters for graphs
131 1 58.0 58.0 0.0 graph_exporters = Exporters(**kwargs)
132
133 while True:
134 # Get next entry
135 20387 7568810.0 371.3 0.7 entry = entry_queue.get()
136
137 # Stop if entry is None
138 20387 18156.0 0.9 0.0 if entry is None:
139 # --> Stop Condition of Process
140 1 3.0 3.0 0.0 break
141
142 # Beginning of Graph-Generation
143 # We also collect interesting information here!
144
145 # Generate canonical graph (initialization of the graph)
146 20386 4416910.0 216.7 0.4 graph = _generate_canonical_graph(entry.sequence, entry.accessions[0])
147
148 # FT parsing and appending of Nodes and Edges into the graph
149 # The amount of isoforms, etc.. can be retrieved on the fly
150 20386 22188.0 1.1 0.0 num_isoforms, num_initm, num_signal, num_variant, num_mutagens, num_conficts =\
151 20386 312577641.0 15333.0 29.4 _include_ft_information(entry, graph, ft_dict)
152
153 # Replace Amino Acids based on user defined rules: E.G.: "X -> A,B,C"
154 20386 83272.0 4.1 0.0 replace_aa(graph, kwargs["replace_aa"])
155
156 # Digest graph with enzyme (unlimited miscleavages)
157 20386 457306111.0 22432.4 43.0 num_of_cleavages = digest(graph, kwargs["digestion"])
158
159 # Merge (summarize) graph if wanted
160 20386 29893.0 1.5 0.0 if not kwargs["no_merge"]:
161 20386 268518281.0 13171.7 25.3 merge_aminoacids(graph)
162
163 # Collapse parallel edges in a graph
164 20386 29694.0 1.5 0.0 if not kwargs["no_collapsing_edges"]:
165 20386 10804029.0 530.0 1.0 collapse_parallel_edges(graph)
166
167 # Annotate weights for edges and nodes (maybe even the smallest weight possible to get to the end node)
168 20386 948172.0 46.5 0.1 annotate_weights(graph, **kwargs)
169
170 # Calculate statistics on the graph:
171 20386 11921.0 0.6 0.0 (
172 20386 12094.0 0.6 0.0 num_nodes, num_edges, num_paths, num_paths_miscleavages, num_paths_hops,
173 20386 9768.0 0.5 0.0 num_paths_var, num_path_mut, num_path_con
174 20386 297176.0 14.6 0.0 ) = get_statistics(graph, **kwargs)
175
176 # Verify graphs if wanted:
177 20386 11624.0 0.6 0.0 if kwargs["verify_graph"]:
178 verify_graph(graph)
179
180 # Persist or export graphs with specified exporters
181 20386 38415.0 1.9 0.0 graph_exporters.export_graph(graph, common_out_queue)
182
183 # Output statistics we gathered during processing
184 20386 10500.0 0.5 0.0 if kwargs["no_description"]:
185 entry_protein_desc = None
186 else:
187 20386 37338.0 1.8 0.0 entry_protein_desc = entry.description.split(";", 1)[0]
188 20386 37142.0 1.8 0.0 entry_protein_desc = entry_protein_desc[entry_protein_desc.index("=") + 1:]
189
190 40772 312422.0 7.7 0.0 graph_queue.put(
191 20386 12337.0 0.6 0.0 (
192 20386 11818.0 0.6 0.0 entry.accessions[0], # Protein Accession
193 20386 10432.0 0.5 0.0 entry.entry_name, # Protein displayed name
194 20386 9196.0 0.5 0.0 num_isoforms, # Number of Isoforms
195 20386 9231.0 0.5 0.0 num_initm, # Number of Init_M (either 0 or 1)
196 20386 9244.0 0.5 0.0 num_signal, # Number of Signal Peptides used (either 0 or 1)
197 20386 9232.0 0.5 0.0 num_variant, # Number of Variants applied to this protein
198 20386 9227.0 0.5 0.0 num_mutagens, # Number of applied mutagens on the graph
199 20386 9231.0 0.5 0.0 num_conficts, # Number of applied conflicts on the graph
200 20386 9274.0 0.5 0.0 num_of_cleavages, # Number of cleavages (marked edges) this protein has
201 20386 9240.0 0.5 0.0 num_nodes, # Number of nodes for the Protein/Peptide Graph
202 20386 9269.0 0.5 0.0 num_edges, # Number of edges for the Protein/Peptide Graph
203 20386 9311.0 0.5 0.0 num_paths, # Possible (non repeating paths) to the end of a graph. (may contain repeating peptides)
204 20386 9318.0 0.5 0.0 num_paths_miscleavages, # As num_paths, but binned to the number of miscleavages (by list idx, at 0)
205 20386 9288.0 0.5 0.0 num_paths_hops, # As num_paths, only that we bin by hops (E.G. useful for determine DFS or BFS depths)
206 20386 9363.0 0.5 0.0 num_paths_var, # Num paths of feature variant
207 20386 9519.0 0.5 0.0 num_path_mut, # Num paths of feature mutagen
208 20386 9508.0 0.5 0.0 num_path_con, # Num paths of feature conflict
209 20386 9476.0 0.5 0.0 entry_protein_desc, # Description name of the Protein (can be lengthy)
210 )
211 )
212
213 # Close exporters (maybe opened files, database connections, etc... )
214 1 13.0 13.0 0.0 graph_exporters.close()
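The bottlenecks, which together account for roughly 98% of the time spent in this function, are:
- Merge Aminoacids, merge_aminoacids (~25%)
- Apply Features, _include_ft_information (~29%)
- Digestion, digest (~43%)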
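To double-check this ranking, here is a small sketch that recomputes each call's share from the "Time" column of the profile. The numbers are copied from the output above. The helper rank_hotspots and its min_share threshold are made up for illustration and are not part of ProtGraph. The timer unit is not shown in the output, so only ratios are computed, and the total is the sum of the listed calls only (every other line is reported at 0.0%).

```python
# Total time per call, copied from the "Time" column of the profile above.
# The timer unit is not part of the pasted output, so only ratios are used.
TIMINGS = {
    "entry_queue.get": 7568810.0,
    "_generate_canonical_graph": 4416910.0,
    "_include_ft_information": 312577641.0,
    "digest": 457306111.0,
    "merge_aminoacids": 268518281.0,
    "collapse_parallel_edges": 10804029.0,
    "annotate_weights": 948172.0,
    "get_statistics": 297176.0,
}


def rank_hotspots(timings, min_share=0.05):
    """Return (name, share) pairs with share >= min_share, largest first.

    Hypothetical helper: 'share' is relative to the sum of the given
    timings, which slightly underestimates the true function total.
    """
    total = sum(timings.values())
    hot = [
        (name, elapsed / total)
        for name, elapsed in timings.items()
        if elapsed / total >= min_share
    ]
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


hotspots = rank_hotspots(TIMINGS)
for name, share in hotspots:
    print(f"{name:28s} {share:6.1%}")
```

Only the three calls listed above pass the 5% threshold, so any optimization effort is probably best spent on digest first, then _include_ft_information and merge_aminoacids.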