calamari 2.1.* slower than 2.0.*, slower than 1.0.5
wosiu opened this issue
After bumping to:
- Calamari 2.1.4
- TF 2.4.3

and setting

```python
self.predictor.data.params.pre_proc.run_parallel = False
self.predictor.data.params.post_proc.run_parallel = False
```

it is indeed faster than the previous 2.1.* releases; however, the newest Calamari is still 4-5 times slower than 1.0.5.
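For reference, a self-contained sketch of that setup (the checkpoint path is a placeholder; `Predictor.from_checkpoint` and `PredictorParams` are the same entry points used in the benchmark script further down):

```python
# Minimal sketch: load a Calamari 2.1.x predictor and disable the parallel
# pre-/post-processing pipelines. "model.ckpt.json" is a placeholder path.
from calamari_ocr.ocr.predict.predictor import Predictor, PredictorParams

predictor = Predictor.from_checkpoint(PredictorParams(silent=True), "model.ckpt.json")
predictor.data.params.pre_proc.run_parallel = False
predictor.data.params.post_proc.run_parallel = False
```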
I did a quick benchmark:
1.0.5: 588ms
2.1.4: 2849ms
I managed to run an isolated benchmark. For all versions after 1.0.5 I ran the benchmark script from this repository (calamari_ocr/scripts/benchmark.py). A separate virtual env was created for each version.
Summary:
- Calamari v1.0.5 is the fastest. No matter which tensorflow version was used with 1.0.5, the performance was similar.
- Calamari v2.0.* is slightly slower.
- Calamari v2.1.* is even slower, for all combinations of parameters.
Hardware
Model Name: MacBook Pro
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 64 GB
Benchmarks
Python 3.7.11
columns: batch size
rows: number of input line images (n)
values: average prediction time in seconds
calamari v1.0.5, tensorflow 2.7.0
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.007948589324951173 | 0.0076048851013183595 | 0.007726287841796875 | 0.007823896408081055 | 0.00724799633026123 |
10 | 0.26404080390930174 | 0.2289862871170044 | 0.1603088855743408 | 0.15910162925720214 | 0.16165380477905272 |
100 | 0.7921005010604858 | 0.3960395812988281 | 0.34024958610534667 | 0.30459458827972413 | 0.3046471834182739 |
1000 | 6.125180435180664 | 2.2707329034805297 | 1.6477508068084716 | 1.266060209274292 | 1.2575418949127197 |
calamari v1.0.5, tensorflow 2.4.4
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.007915997505187988 | 0.0074397802352905275 | 0.007631373405456543 | 0.007540416717529297 | 0.007537722587585449 |
10 | 0.3019905090332031 | 0.23226828575134278 | 0.16275520324707032 | 0.16215720176696777 | 0.15758802890777587 |
100 | 0.7962063074111938 | 0.3950160026550293 | 0.33384888172149657 | 0.2988450050354004 | 0.30655417442321775 |
1000 | 6.195057010650634 | 2.2250738859176638 | 1.6299351692199706 | 1.2984890937805176 | 1.276983118057251 |
calamari v1.0.5, tensorflow 2.3.4
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.007065892219543457 | 0.006709504127502442 | 0.006894207000732422 | 0.006652998924255371 | 0.006623005867004395 |
10 | 0.25193932056427004 | 0.23570570945739747 | 0.16034402847290039 | 0.1643200159072876 | 0.1634167194366455 |
100 | 0.6924072980880738 | 0.37671122550964353 | 0.3360576152801514 | 0.30185441970825194 | 0.30163798332214353 |
1000 | 5.139006614685059 | 2.152212619781494 | 1.547921085357666 | 1.2356892824172974 | 1.282170796394348 |
calamari v2.0.0, tensorflow 2.3.4, tfaip 1.0.1
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.17449259757995605 | 0.17991750240325927 | 0.18773868083953857 | 0.19791319370269775 | 0.20436220169067382 |
10 | 0.23921489715576172 | 0.2374429941177368 | 0.2518466949462891 | 0.2559601068496704 | 0.26382319927215575 |
100 | 0.5276783227920532 | 0.45921080112457274 | 0.44627621173858645 | 0.46705169677734376 | 0.464670991897583 |
1000 | 3.062366080284119 | 2.370705986022949 | 2.1865496158599855 | 2.0342562198638916 | 1.9911694765090941 |
calamari v2.0.1, tensorflow 2.3.4, tfaip 1.0.1
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.17778341770172118 | 0.18811919689178466 | 0.1945850133895874 | 0.20300788879394532 | 0.22730581760406493 |
10 | 0.2520615816116333 | 0.23547468185424805 | 0.2761404037475586 | 0.3625166893005371 | 0.2599898099899292 |
100 | 0.494154691696167 | 0.4478676080703735 | 0.43113758563995364 | 0.4207663297653198 | 0.43292860984802245 |
1000 | 2.4657802104949953 | 2.1499099254608156 | 2.1115876913070677 | 1.9334032773971557 | 1.8338363885879516 |
calamari v2.0.2, tensorflow 2.3.4, tfaip 1.0.1
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.17584502696990967 | 0.17831790447235107 | 0.1874023199081421 | 0.19749109745025634 | 0.20571470260620117 |
10 | 0.278959321975708 | 0.27796192169189454 | 0.30696120262146 | 0.33047828674316404 | 0.34730279445648193 |
100 | 0.6894578218460083 | 0.5354434967041015 | 0.5252492904663086 | 0.6007394790649414 | 0.5599848031997681 |
1000 | 2.5389096975326537 | 2.1279093980789185 | 1.9120651960372925 | 1.7973292112350463 | 1.8482789754867555 |
calamari v2.1.0, tensorflow 2.4.4, tfaip 1.1.1
It took ages, printing plenty of:
"You imported tensorflow at some point in the parallel pipeline. Running the code will work, but consume more memory and takes more time for initialization. Consider to remove all tensorflow imports from your data workers."
I therefore interrupted the run, and skipped calamari versions 2.1.1, 2.1.2, and 2.1.3 due to the same problem.
2.1.4 has a fix for this:

```python
predictor.data.params.pre_proc.run_parallel = False
predictor.data.params.post_proc.run_parallel = False
```
calamari v2.1.4, tensorflow 2.4.4, tfaip 1.2.5
For n = 1 it prints each time:
WARNING:tensorflow:11 out of the last 11 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7ffbfbc8f560> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.7328215122222901 | 0.722108793258667 | 0.7123070240020752 | 0.732320499420166 | 0.8124824047088623 |
10 | 0.7532444000244141 | 0.7343388080596924 | 0.7387914896011353 | 0.7460015296936036 | 0.7545709133148193 |
100 | 0.9022891998291016 | 0.9495761156082153 | 0.9188367128372192 | 0.9964001893997192 | 1.011446976661682 |
1000 | 2.6287662982940674 | 2.5199357986450197 | 2.5726132154464723 | 2.483173394203186 | 2.546015906333923 |
Hi @wosiu, thanks for the benchmarks, that's really interesting! To avoid the tensorflow imports messing things up, I'd suggest putting them in a subprocess:
```python
import argparse
import multiprocessing
import tempfile
import time

import numpy as np
from prettytable import PrettyTable


def benchmark_prediction(queue, model, batch_size, processes, n_examples, runs=10):
    # Import calamari (and thus tensorflow) only inside the subprocess, so the
    # parent process and the pipeline workers stay free of tensorflow imports.
    from calamari_ocr.ocr.predict.predictor import Predictor, PredictorParams

    params = PredictorParams(silent=True)
    predictor = Predictor.from_checkpoint(params, model)
    predictor.data.params.pre_proc.run_parallel = False
    predictor.data.params.post_proc.run_parallel = False

    # One synthetic 400x48 grayscale line image, repeated n_examples times.
    data = (np.random.random((400, 48)) * 255).astype(np.uint8)
    print("Running with bs={}, proc={}, n={}".format(batch_size, processes, n_examples))
    start = time.time()
    for _ in range(runs):
        list(predictor.predict_raw([data] * n_examples))
    end = time.time()
    # Report the average wall-clock time per run to the parent process.
    queue.put((end - start) / runs)


def benchmark_subprocess(*args, **kwargs):
    """Run in a subprocess to avoid tf imports messing up performance."""
    c = multiprocessing.get_context("spawn")
    q = c.Queue()
    p = c.Process(target=benchmark_prediction, args=(q, *args), kwargs=kwargs)
    p.start()
    res = q.get()
    p.join()
    return res


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--working_dir", default=tempfile.gettempdir(), type=str)
    parser.add_argument("--model", required=True)
    parser.add_argument("--processes", default=multiprocessing.cpu_count(), type=int)
    args = parser.parse_args()

    batch_sizes = [1, 5, 10, 20, 50]
    tab = PrettyTable(["n"] + list(map(str, batch_sizes)))
    for n_examples in [1, 10, 100, 1000]:
        results = [benchmark_subprocess(args.model, bs, args.processes, n_examples) for bs in batch_sizes]
        tab.add_row([n_examples] + results)
    print(tab)


if __name__ == "__main__":
    main()
```
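A note on the design: `multiprocessing.get_context("spawn")` gives the child a fresh interpreter, so tensorflow is only ever imported inside `benchmark_prediction` and the parent process stays clean. If you save the script (say, as `benchmark_subprocess.py`, a name made up here), it can be run as `python benchmark_subprocess.py --model <checkpoint>`.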
Here are my results on a Quad Core Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz:
py 3.9, tf 2.7, calamari 1.0.5
(newer python version, newer tensorflow, no subprocess for the benchmark)
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.01315162181854248 | 0.011742305755615235 | 0.011842823028564453 | 0.01209120750427246 | 0.01231849193572998 |
10 | 0.22974395751953125 | 0.16737902164459229 | 0.082615327835083 | 0.08360829353332519 | 0.08581769466400146 |
100 | 1.200769853591919 | 0.8308326959609985 | 0.41378633975982665 | 0.3671584129333496 | 0.35127384662628175 |
1000 | 10.828134250640868 | 4.3135666847229 | 3.5323989152908326 | 3.127266025543213 | 2.448887801170349 |
py 3.8, tf 2.4.4, calamari 1.0.5
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.013362288475036621 | 0.013656997680664062 | 0.012458014488220214 | 0.012681293487548827 | 0.0136399507522583 |
10 | 1.9097131729125976 | 1.8371705055236816 | 1.7854376554489135 | 1.7675397157669068 | 1.7485485076904297 |
100 | 2.851202964782715 | 2.189711308479309 | 2.0937893390655518 | 2.024563717842102 | 2.0267707109451294 |
1000 | 16.534863686561586 | 9.73701639175415 | 9.043330550193787 | 8.469449067115784 | 8.457994651794433 |
py 3.8, tf 2.4.4, calamari 2.1.4, pre/post_proc.run_parallel=True
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 3.06041624546051 | 3.0336771249771117 | 3.0510018825531007 | 3.0461528062820435 | 3.0369704723358155 |
10 | 3.04496431350708 | 3.126779556274414 | 3.0625845193862915 | 3.0559497356414793 | 3.0517446041107177 |
100 | 3.2692791700363157 | 3.309327816963196 | 3.3099854230880736 | 3.3720837593078614 | 3.361532998085022 |
1000 | 7.211246156692505 | 7.0749754190444945 | 7.302390193939209 | 7.71282467842102 | 7.275307273864746 |
py 3.8, tf 2.4.4, calamari 2.1.4, pre/post_proc.run_parallel=False
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 1.018608522415161 | 1.0248649358749389 | 1.0350969314575196 | 1.0125337839126587 | 1.0169085264205933 |
10 | 1.167862606048584 | 1.1205103158950807 | 1.0690792798995972 | 1.2309166431427 | 1.0451623916625976 |
100 | 1.2943114042282104 | 1.2826362371444702 | 1.2807279586791993 | 1.2944347858428955 | 1.2827207803726197 |
1000 | 3.8914117574691773 | 4.033728289604187 | 4.072843170166015 | 4.284870386123657 | 3.9819567918777468 |
Conclusion?
The positive effect of increasing the batch size seems to be missing in C2. Parallel pre- and post-processing slows down prediction when running on a limited number of CPU cores and without a GPU. With the same python and tensorflow versions, C2 seems to be faster when run on more than just half a page of line images, at least on my machine (this could be different for a larger number of cores?).
@ChWick: tf 2.7 seems to be quite a bit faster than 2.4.4. Something prevents us from upgrading; what was that again?
The retracing warnings can be ignored, I suppose. Tensorflow counts how often a prediction function is retraced and complains if that happens too often.
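For illustration, here is a standalone toy (not Calamari code; `row_sum` and the fixed height of 48 are made up for the example) showing why feeding line images of varying widths one by one triggers that warning, and how an explicit `input_signature` avoids it:

```python
import tensorflow as tf

@tf.function
def row_sum(x):
    return tf.reduce_sum(x, axis=-1)

# Default behaviour: every new input shape triggers a retrace of the graph.
# This per-shape retracing is what the warning above is counting.
for width in range(1, 8):
    row_sum(tf.zeros([48, width]))

# Tracing once with a None dimension reuses the same graph for all widths:
relaxed = tf.function(
    row_sum.python_function,
    input_signature=[tf.TensorSpec([48, None], tf.float32)],
)
relaxed(tf.zeros([48, 100]))
relaxed(tf.zeros([48, 250]))  # no retrace: the width is unconstrained
```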
Edit: py 3.9, tf 2.7, calamari 2.1.4, pre/post_proc.run_parallel=False
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 1.001286768913269 | 0.9967088222503662 | 1.0192787408828736 | 1.0088038206100465 | 0.9938497066497802 |
10 | 1.0237231254577637 | 1.0264565706253053 | 1.0123045444488525 | 1.015321946144104 | 1.0173277854919434 |
100 | 1.2190410852432252 | 1.2147423505783081 | 1.2920137882232665 | 1.3734655141830445 | 1.2653615951538086 |
1000 | 3.4991682052612303 | 3.545653557777405 | 3.4963221073150637 | 3.5203847169876097 | 3.661887550354004 |
Some more benchmarks, this time with a GPU:
Quad-core Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, GeForce GTX 1050 Ti
py 3.9, tf 2.4.1, calamari 1.0.5
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.007213759422302246 | 0.007359981536865234 | 0.007574772834777832 | 0.0069646120071411135 | 0.006994128227233887 |
10 | 0.7749871015548706 | 0.7228883028030395 | 0.6644667625427246 | 0.6639476537704467 | 0.675786304473877 |
100 | 1.2960800647735595 | 0.8731077194213868 | 0.8346931457519531 | 0.7854649305343628 | 0.7720790863037109 |
1000 | 8.359328556060792 | 4.0110067367553714 | 3.4892958402633667 | 3.207125020027161 | 3.0731616973876954 |
py 3.9, tf 2.4.1, calamari 2.1.4, pre/post_proc.run_parallel=True
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 1.5037858009338378 | 1.5199187040328979 | 1.522319531440735 | 1.515892791748047 | 1.5198529243469239 |
10 | 1.5575855493545532 | 1.544282031059265 | 1.5332003593444825 | 1.5490708827972413 | 1.5521154880523682 |
100 | 1.6002052068710326 | 1.586380672454834 | 1.594750738143921 | 1.6014111042022705 | 1.5885547637939452 |
1000 | 2.8979954719543457 | 2.9154956579208373 | 2.9702220916748048 | 2.8750507116317747 | 2.975974941253662 |
py 3.9, tf 2.4.1, calamari 2.1.4, pre/post_proc.run_parallel=False
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.7080289602279664 | 0.6896860599517822 | 0.6918380260467529 | 0.6905864715576172 | 0.6862347841262817 |
10 | 0.7059344053268433 | 0.7037941932678222 | 0.6990894556045533 | 0.7024942874908447 | 0.7001413822174072 |
100 | 0.8140160322189331 | 0.8167436599731446 | 0.81710786819458 | 0.8103108644485474 | 0.8118137121200562 |
1000 | 2.1732884883880614 | 2.1853001356124877 | 2.1897367000579835 | 2.1922594547271728 | 2.1852585554122923 |
All in all, it seems some big overhead was introduced after 1.0.5, and it hits small n the hardest :(
That's true. It is optimized for the use case of large collections (books or groups of books) and a GPU. Also, the main focus so far has been on training time, much less on prediction.