calamari 2.1.* slower than 2.0.*, slower than 1.0.5
wosiu opened this issue
After bumping to:
- Calamari 2.1.4
- TF 2.4.3

and setting

```python
self.predictor.data.params.pre_proc.run_parallel = False
self.predictor.data.params.post_proc.run_parallel = False
```

it is indeed faster than the previous 2.1.* releases; however, the newest Calamari is still 4-5 times slower than 1.0.5.
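For reference, a self-contained sketch of that setup (the checkpoint path is a placeholder; `Predictor.from_checkpoint` and `PredictorParams` are the same entry points used in the benchmark script further down):

```python
# Minimal sketch: load a Calamari 2.1.x predictor and disable the parallel
# pre-/post-processing pipelines. "model.ckpt.json" is a placeholder path.
from calamari_ocr.ocr.predict.predictor import Predictor, PredictorParams

predictor = Predictor.from_checkpoint(PredictorParams(silent=True), "model.ckpt.json")
predictor.data.params.pre_proc.run_parallel = False
predictor.data.params.post_proc.run_parallel = False
```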
I did a quick benchmark:
1.0.5: 588ms
2.1.4: 2849ms
I managed to run an isolated benchmark. For all versions after 1.0.5 I ran the benchmark script from this repository (calamari_ocr/scripts/benchmark.py). A separate virtual env was created for each version.
Summary:
- Calamari v1.0.5 is the fastest. No matter which tensorflow version was used with 1.0.5, the performance was similar.
- Calamari v2.0.* is slightly slower.
- Calamari v2.1.* is even slower, for all combinations of parameters.
Hardware
Model Name: MacBook Pro
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 64 GB
Benchmarks
Python 3.7.11
columns: batch size
rows: number of input line images (n)
values: average prediction time in seconds
calamari v1.0.5, tensorflow 2.7.0
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.007948589324951173 | 0.0076048851013183595 | 0.007726287841796875 | 0.007823896408081055 | 0.00724799633026123 |
10 | 0.26404080390930174 | 0.2289862871170044 | 0.1603088855743408 | 0.15910162925720214 | 0.16165380477905272 |
100 | 0.7921005010604858 | 0.3960395812988281 | 0.34024958610534667 | 0.30459458827972413 | 0.3046471834182739 |
1000 | 6.125180435180664 | 2.2707329034805297 | 1.6477508068084716 | 1.266060209274292 | 1.2575418949127197 |
calamari v1.0.5, tensorflow 2.4.4
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.007915997505187988 | 0.0074397802352905275 | 0.007631373405456543 | 0.007540416717529297 | 0.007537722587585449 |
10 | 0.3019905090332031 | 0.23226828575134278 | 0.16275520324707032 | 0.16215720176696777 | 0.15758802890777587 |
100 | 0.7962063074111938 | 0.3950160026550293 | 0.33384888172149657 | 0.2988450050354004 | 0.30655417442321775 |
1000 | 6.195057010650634 | 2.2250738859176638 | 1.6299351692199706 | 1.2984890937805176 | 1.276983118057251 |
calamari v1.0.5, tensorflow 2.3.4
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.007065892219543457 | 0.006709504127502442 | 0.006894207000732422 | 0.006652998924255371 | 0.006623005867004395 |
10 | 0.25193932056427004 | 0.23570570945739747 | 0.16034402847290039 | 0.1643200159072876 | 0.1634167194366455 |
100 | 0.6924072980880738 | 0.37671122550964353 | 0.3360576152801514 | 0.30185441970825194 | 0.30163798332214353 |
1000 | 5.139006614685059 | 2.152212619781494 | 1.547921085357666 | 1.2356892824172974 | 1.282170796394348 |
calamari v2.0.0, tensorflow 2.3.4, tfaip 1.0.1
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.17449259757995605 | 0.17991750240325927 | 0.18773868083953857 | 0.19791319370269775 | 0.20436220169067382 |
10 | 0.23921489715576172 | 0.2374429941177368 | 0.2518466949462891 | 0.2559601068496704 | 0.26382319927215575 |
100 | 0.5276783227920532 | 0.45921080112457274 | 0.44627621173858645 | 0.46705169677734376 | 0.464670991897583 |
1000 | 3.062366080284119 | 2.370705986022949 | 2.1865496158599855 | 2.0342562198638916 | 1.9911694765090941 |
calamari v2.0.1, tensorflow 2.3.4, tfaip 1.0.1
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.17778341770172118 | 0.18811919689178466 | 0.1945850133895874 | 0.20300788879394532 | 0.22730581760406493 |
10 | 0.2520615816116333 | 0.23547468185424805 | 0.2761404037475586 | 0.3625166893005371 | 0.2599898099899292 |
100 | 0.494154691696167 | 0.4478676080703735 | 0.43113758563995364 | 0.4207663297653198 | 0.43292860984802245 |
1000 | 2.4657802104949953 | 2.1499099254608156 | 2.1115876913070677 | 1.9334032773971557 | 1.8338363885879516 |
calamari v2.0.2, tensorflow 2.3.4, tfaip 1.0.1
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.17584502696990967 | 0.17831790447235107 | 0.1874023199081421 | 0.19749109745025634 | 0.20571470260620117 |
10 | 0.278959321975708 | 0.27796192169189454 | 0.30696120262146 | 0.33047828674316404 | 0.34730279445648193 |
100 | 0.6894578218460083 | 0.5354434967041015 | 0.5252492904663086 | 0.6007394790649414 | 0.5599848031997681 |
1000 | 2.5389096975326537 | 2.1279093980789185 | 1.9120651960372925 | 1.7973292112350463 | 1.8482789754867555 |
calamari v2.1.0, tensorflow 2.4.4, tfaip 1.1.1
It took ages, printing plenty of:
"You imported tensorflow at some point in the parallel pipeline. Running the code will work, but consume more memory and takes more time for initialization. Consider to remove all tensorflow imports from your data workers."
I therefore interrupted the run, and skipped calamari versions 2.1.1, 2.1.2, and 2.1.3 due to the same problem.
2.1.4 has a fix for this:

```python
predictor.data.params.pre_proc.run_parallel = False
predictor.data.params.post_proc.run_parallel = False
```
calamari v2.1.4, tensorflow 2.4.4, tfaip 1.2.5
For n = 1 it prints each time:
WARNING:tensorflow:11 out of the last 11 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7ffbfbc8f560> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.7328215122222901 | 0.722108793258667 | 0.7123070240020752 | 0.732320499420166 | 0.8124824047088623 |
10 | 0.7532444000244141 | 0.7343388080596924 | 0.7387914896011353 | 0.7460015296936036 | 0.7545709133148193 |
100 | 0.9022891998291016 | 0.9495761156082153 | 0.9188367128372192 | 0.9964001893997192 | 1.011446976661682 |
1000 | 2.6287662982940674 | 2.5199357986450197 | 2.5726132154464723 | 2.483173394203186 | 2.546015906333923 |
Hi @wosiu, thanks for the benchmarks, that's really interesting! To avoid the tensorflow imports messing things up, I'd suggest putting them in a subprocess:
```python
import argparse
import multiprocessing
import tempfile
import time

import numpy as np
from prettytable import PrettyTable


def benchmark_prediction(queue, model, batch_size, processes, n_examples, runs=10):
    # Import calamari (and thus tensorflow) only inside the subprocess, so the
    # parent process and the pipeline workers stay free of tensorflow imports.
    from calamari_ocr.ocr.predict.predictor import Predictor, PredictorParams

    params = PredictorParams(silent=True)
    predictor = Predictor.from_checkpoint(params, model)
    predictor.data.params.pre_proc.run_parallel = False
    predictor.data.params.post_proc.run_parallel = False

    # One synthetic 400x48 grayscale line image, repeated n_examples times.
    data = (np.random.random((400, 48)) * 255).astype(np.uint8)
    print("Running with bs={}, proc={}, n={}".format(batch_size, processes, n_examples))
    start = time.time()
    for _ in range(runs):
        list(predictor.predict_raw([data] * n_examples))
    end = time.time()
    # Report the average wall-clock time per run to the parent process.
    queue.put((end - start) / runs)


def benchmark_subprocess(*args, **kwargs):
    """Run in a subprocess to avoid tf imports messing up performance."""
    c = multiprocessing.get_context("spawn")
    q = c.Queue()
    p = c.Process(target=benchmark_prediction, args=(q, *args), kwargs=kwargs)
    p.start()
    res = q.get()
    p.join()
    return res


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--working_dir", default=tempfile.gettempdir(), type=str)
    parser.add_argument("--model", required=True)
    parser.add_argument("--processes", default=multiprocessing.cpu_count(), type=int)
    args = parser.parse_args()

    batch_sizes = [1, 5, 10, 20, 50]
    tab = PrettyTable(["n"] + list(map(str, batch_sizes)))
    for n_examples in [1, 10, 100, 1000]:
        results = [benchmark_subprocess(args.model, bs, args.processes, n_examples) for bs in batch_sizes]
        tab.add_row([n_examples] + results)
    print(tab)


if __name__ == "__main__":
    main()
```
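A note on the design: `multiprocessing.get_context("spawn")` gives the child a fresh interpreter, so tensorflow is only ever imported inside `benchmark_prediction` and the parent process stays clean. If you save the script (say, as `benchmark_subprocess.py`, a name made up here), it can be run as `python benchmark_subprocess.py --model <checkpoint>`.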
Here are my results on a Quad Core Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz:
py 3.9, tf 2.7, calamari 1.0.5
(newer python version, newer tensorflow, no subprocess for the benchmark)
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.01315162181854248 | 0.011742305755615235 | 0.011842823028564453 | 0.01209120750427246 | 0.01231849193572998 |
10 | 0.22974395751953125 | 0.16737902164459229 | 0.082615327835083 | 0.08360829353332519 | 0.08581769466400146 |
100 | 1.200769853591919 | 0.8308326959609985 | 0.41378633975982665 | 0.3671584129333496 | 0.35127384662628175 |
1000 | 10.828134250640868 | 4.3135666847229 | 3.5323989152908326 | 3.127266025543213 | 2.448887801170349 |
py 3.8, tf 2.4.4, calamari 1.0.5
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.013362288475036621 | 0.013656997680664062 | 0.012458014488220214 | 0.012681293487548827 | 0.0136399507522583 |
10 | 1.9097131729125976 | 1.8371705055236816 | 1.7854376554489135 | 1.7675397157669068 | 1.7485485076904297 |
100 | 2.851202964782715 | 2.189711308479309 | 2.0937893390655518 | 2.024563717842102 | 2.0267707109451294 |
1000 | 16.534863686561586 | 9.73701639175415 | 9.043330550193787 | 8.469449067115784 | 8.457994651794433 |
py 3.8, tf 2.4.4, calamari 2.1.4, pre/post_proc.run_parallel=True
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 3.06041624546051 | 3.0336771249771117 | 3.0510018825531007 | 3.0461528062820435 | 3.0369704723358155 |
10 | 3.04496431350708 | 3.126779556274414 | 3.0625845193862915 | 3.0559497356414793 | 3.0517446041107177 |
100 | 3.2692791700363157 | 3.309327816963196 | 3.3099854230880736 | 3.3720837593078614 | 3.361532998085022 |
1000 | 7.211246156692505 | 7.0749754190444945 | 7.302390193939209 | 7.71282467842102 | 7.275307273864746 |
py 3.8, tf 2.4.4, calamari 2.1.4, pre/post_proc.run_parallel=False
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 1.018608522415161 | 1.0248649358749389 | 1.0350969314575196 | 1.0125337839126587 | 1.0169085264205933 |
10 | 1.167862606048584 | 1.1205103158950807 | 1.0690792798995972 | 1.2309166431427 | 1.0451623916625976 |
100 | 1.2943114042282104 | 1.2826362371444702 | 1.2807279586791993 | 1.2944347858428955 | 1.2827207803726197 |
1000 | 3.8914117574691773 | 4.033728289604187 | 4.072843170166015 | 4.284870386123657 | 3.9819567918777468 |
Conclusion?
The positive effect of increasing the batch size seems to be missing in C2. Parallel pre- and post-processing slows down prediction when running on a limited number of CPU cores and without a GPU. With the same python and tensorflow versions, C2 seems to be faster when run on more than just half a page of line images, at least on my machine (this could be different for a larger number of cores?).
@ChWick: tf 2.7 seems to be quite a bit faster than 2.4.4. Something prevents us from upgrading; what was that again?
The retracing warnings can be ignored, I suppose. Tensorflow counts how often a prediction function is retraced and complains if that happens too often.
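For illustration, here is a standalone toy (not Calamari code; `row_sum` and the fixed height of 48 are made up for the example) showing why feeding line images of varying widths one by one triggers that warning, and how an explicit `input_signature` avoids it:

```python
import tensorflow as tf

@tf.function
def row_sum(x):
    return tf.reduce_sum(x, axis=-1)

# Default behaviour: every new input shape triggers a retrace of the graph.
# This per-shape retracing is what the warning above is counting.
for width in range(1, 8):
    row_sum(tf.zeros([48, width]))

# Tracing once with a None dimension reuses the same graph for all widths:
relaxed = tf.function(
    row_sum.python_function,
    input_signature=[tf.TensorSpec([48, None], tf.float32)],
)
relaxed(tf.zeros([48, 100]))
relaxed(tf.zeros([48, 250]))  # no retrace: the width is unconstrained
```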
Edit: py 3.9, tf 2.7, calamari 2.1.4, pre/post_proc.run_parallel=False
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 1.001286768913269 | 0.9967088222503662 | 1.0192787408828736 | 1.0088038206100465 | 0.9938497066497802 |
10 | 1.0237231254577637 | 1.0264565706253053 | 1.0123045444488525 | 1.015321946144104 | 1.0173277854919434 |
100 | 1.2190410852432252 | 1.2147423505783081 | 1.2920137882232665 | 1.3734655141830445 | 1.2653615951538086 |
1000 | 3.4991682052612303 | 3.545653557777405 | 3.4963221073150637 | 3.5203847169876097 | 3.661887550354004 |
Some more benchmarks, this time with a GPU:
Quad-core Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, GeForce GTX 1050 Ti
py 3.9, tf 2.4.1, calamari 1.0.5
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.007213759422302246 | 0.007359981536865234 | 0.007574772834777832 | 0.0069646120071411135 | 0.006994128227233887 |
10 | 0.7749871015548706 | 0.7228883028030395 | 0.6644667625427246 | 0.6639476537704467 | 0.675786304473877 |
100 | 1.2960800647735595 | 0.8731077194213868 | 0.8346931457519531 | 0.7854649305343628 | 0.7720790863037109 |
1000 | 8.359328556060792 | 4.0110067367553714 | 3.4892958402633667 | 3.207125020027161 | 3.0731616973876954 |
py 3.9, tf 2.4.1, calamari 2.1.4, pre/post_proc.run_parallel=True
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 1.5037858009338378 | 1.5199187040328979 | 1.522319531440735 | 1.515892791748047 | 1.5198529243469239 |
10 | 1.5575855493545532 | 1.544282031059265 | 1.5332003593444825 | 1.5490708827972413 | 1.5521154880523682 |
100 | 1.6002052068710326 | 1.586380672454834 | 1.594750738143921 | 1.6014111042022705 | 1.5885547637939452 |
1000 | 2.8979954719543457 | 2.9154956579208373 | 2.9702220916748048 | 2.8750507116317747 | 2.975974941253662 |
py 3.9, tf 2.4.1, calamari 2.1.4, pre/post_proc.run_parallel=False
n | 1 | 5 | 10 | 20 | 50 |
---|---|---|---|---|---|
1 | 0.7080289602279664 | 0.6896860599517822 | 0.6918380260467529 | 0.6905864715576172 | 0.6862347841262817 |
10 | 0.7059344053268433 | 0.7037941932678222 | 0.6990894556045533 | 0.7024942874908447 | 0.7001413822174072 |
100 | 0.8140160322189331 | 0.8167436599731446 | 0.81710786819458 | 0.8103108644485474 | 0.8118137121200562 |
1000 | 2.1732884883880614 | 2.1853001356124877 | 2.1897367000579835 | 2.1922594547271728 | 2.1852585554122923 |
All in all, it seems some big overhead was introduced after 1.0.5, and it hits small n the hardest :(
That's true. It is optimized for the use case of large collections (books or groups of books) and a GPU. Also, the main focus so far has been on training time, much less on prediction.