StanfordSNR / gg

The Stanford Builder

Issues with parallelizing thousands of tasks

drunksaint opened this issue · comments

Hi! I am trying to run a script that can be parallelized trivially - basically each line of the input is independent of the others. I have a 15k-line input file that takes around 10 minutes to process on a single core. A single line takes ~1.4s to run. I split the input file into 15k files containing single lines, generated the makefile, and passed it to Lambda through gg with -j=2000. I encountered a few issues:

  • The time the full job took to run was almost 3 minutes. It should have taken a few seconds plus the download of around 200 MB from S3 to Lambda, which shouldn't take much. Is there a way to debug which step is taking time on Lambda and why? It is probably the file download, but it could be something else.
  • The job never runs to completion the first time; there are always ~1% of the jobs stuck. Killing the process and restarting it once or twice always completes the execution, though. Is there a way to specify retries in case of error? Or something else that I can do to fix this?
  • The creation of thunks takes around 45 seconds using gg infer make -j$(nproc) on a 6-core machine, and the wrapper uses gg create-thunk because I used pyinstaller to build the binary. The wrapper does just two things: it collects the single-line input file and creates a thunk for the command. The binary and common input file are collected just once, manually, outside the wrapper function. I was considering using gg to convert the whole process into something that takes around 5-10 sec max to run, so that it can be run as an API. Is there a faster way to create these thunks?

The command I use is of the format

command single_line_input_file.txt big_constant_input_file.txt single_line_output_file.txt

The single_line_input_file.txt and single_line_output_file.txt change for each execution. The command and big_constant_input_file.txt are the same for every execution. The command is around 70 MB and the big_constant_input_file.txt is around 140 MB. So the same file is being downloaded to 2k Lambdas in parallel. I remember @sadjad, you mentioned in a talk that gg does each file transfer in parallel chunks. Perhaps this, combined with the 2k parallel Lambdas trying to download the same 2 files, is hitting the S3 concurrency limit?

Hi @drunksaint,

The time the full job took to run was almost 3 minutes. It should have taken a few seconds plus the download of around 200 MB from S3 to Lambda, which shouldn't take much. Is there a way to debug which step is taking time on Lambda and why? It is probably the file download, but it could be something else.

When you run gg force, it tells you how much time it spent on uploading and downloading. And I agree with you, downloading 200 MB from S3 should be fast. I wonder if the issue is stragglers: some workers just take too long to finish their task and drag out the runtime. Could you please run it again and tell me the total time vs. the upload/download time?

The job never runs to completion the first time; there are always ~1% of the jobs stuck. Killing the process and restarting it once or twice always completes the execution, though. Is there a way to specify retries in case of error? Or something else that I can do to fix this?

Good question! This is quite common with Lambdas, and gg force has built-in support to fight stragglers. Start gg force with the --timeout [SECONDS] option; if a Lambda task takes more than SECONDS to finish, gg will spawn a new Lambda and redo the task, while letting the first one continue running. This should effectively mitigate the issue.
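
For example, with the kind of command you're already running (the job count, timeout, and output file pattern are placeholders for your setup):

gg force --jobs 2000 --engine lambda --timeout 15 *.output.txt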

The creation of thunks takes around 45 seconds using gg infer make -j$(nproc) on a 6-core machine, and the wrapper uses gg create-thunk because I used pyinstaller to build the binary. The wrapper does just two things: it collects the single-line input file and creates a thunk for the command. The binary and common input file are collected just once, manually, outside the wrapper function. I was considering using gg to convert the whole process into something that takes around 5-10 sec max to run, so that it can be run as an API. Is there a faster way to create these thunks?

Although gg create-thunk is convenient, it's really slow, and is not suggested when you're dealing with a large number of thunks. Similarly, gg infer's main goal is to bridge the gap between legacy code and Lambda, and it is not recommended if you're building an application from scratch. There are a couple of alternatives: the gg repl utility and the C++ API.

The idea is that gg repl keeps five stacks: (1) arguments, (2) data (thunks and values), (3) executables, (4) outputs and (5) symlinks. You use commands like push arg test, push output out, pop arg and pop output to manipulate these stacks, and when you issue the create thunk command, it will use the current values of those five stacks to generate the thunk. The stacks won't reset after generating a thunk, allowing you to reuse the values already in there and eliminating the need to rehash files.
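
As a very rough sketch of what a session might look like for your job (file names here are placeholders; depending on how your program opens its inputs, you may need a link or a hashed argument instead of a plain file name):

$ gg repl
push executable command
push arg command
push data line_0001.txt
push arg line_0001.txt
push data big_constant_input_file.txt
push arg big_constant_input_file.txt
push arg out_0001.txt
push output out_0001.txt
create thunk
exit

For the next thunk, only the per-line input and output entries would change; the executable and the big constant input stay on their stacks, so they don't need to be rehashed.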

But of course, if you're going to go down either of these two routes, that means you need to forgo gg infer.

So the same file is being downloaded to 2k Lambdas in parallel. I remember @sadjad, you mentioned in a talk that gg does each file transfer in parallel chunks. Perhaps this, combined with the 2k parallel Lambdas trying to download the same 2 files, is hitting the S3 concurrency limit?

gg downloads files in parallel, but never breaks a single file down into chunks. I really doubt that's the issue (for example, in the talk, my demo downloads 2000 instances of gcc at the same time from S3 without breaking a sweat).


I think setting the --timeout option will fix the issues you're facing, but if not, I'm more than happy to work with you to get to the bottom of this.

Another point I missed: make sure your AWS account actually has the capacity to run 2,000 parallel Lambdas.

The red number in the status bar shows the number of running Lambdas:

[screenshot: gg status bar]

Hey @sadjad, thanks for the super detailed response!

For the thunk creation step, I'll try out one of the approaches you mentioned - will probably go with the C++ API since you recommended it. Is either one faster than the other?

For the execution step, earlier when I mentioned the 3-minute time, I had timed it manually from the time the upload from my local machine completed to the time the download to my local machine started. So I measured just the total Lambda execution time, since I'm running these tests from outside EC2 and upload/download times to my machine are higher than they should be (I should probably start doing this from EC2). The timeout parameter fixed the straggler issue though! But the total time still didn't really improve by much. I ran two tests, one with 1000 jobs and one with 2000. This time, with 2000 jobs and a 15 sec timeout, it still took 2.5 min to run just the Lambda step. I've appended the outputs below:

First test:

$ gg force --jobs 1000 --engine lambda --timeout 10 *.output.txt
→ Loading the thunks...  done (186 ms).
↗ Uploading 4148 files (205.6 MiB)... done (153212 ms).
rmdir /tmp/thunk-execute.6bVRXk: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TvG3e42nAXIhLImjTFvikEiPj9sIMdJs4sVwIs5i6zf0000002e6
[warning] rate limited: TKxjxyPWSYUzPeLPRcf3doyZYgXYQQWuGpsf2H9sTS4w000002e6
rmdir /tmp/thunk-execute.3dGcNc: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TV1YlVimHOtPd7mO3IY9bJxQltwF5oL1LjwTU_Mi0eRc000002e6
rmdir /tmp/thunk-execute.2qBDth: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: THxOqTrQaRSIUdVO2wvUvlvMl5xjLnuHfELAOcj1vopE000002ea
[info] duplicating 149 jobs
[info] duplicating 399 jobs
[info] duplicating 69 jobs
[info] duplicating 43 jobs
[info] duplicating 238 jobs
[info] duplicating 79 jobs
[info] duplicating 101 jobs
[info] duplicating 80 jobs
[info] duplicating 201 jobs
[info] duplicating 37 jobs
[info] duplicating 98 jobs
[info] duplicating 109 jobs
[info] duplicating 101 jobs
[info] duplicating 53 jobs
[info] duplicating 74 jobs
[info] duplicating 95 jobs
[info] duplicating 101 jobs
[info] duplicating 45 jobs
[info] duplicating 64 jobs
[info] duplicating 138 jobs
[info] duplicating 59 jobs
[info] duplicating 26 jobs
[info] duplicating 1 job
↘ Downloading output files (17.6 MiB)... done (9187 ms).

Second test (I cancelled the first attempt during the upload when I realized I wanted to update the parameters):

$ gg force --jobs 1000 --engine lambda --timeout 20 *.output.txt
→ Loading the thunks...  done (228 ms).
↗ Uploading 15174 files (206.2 MiB)... ^Cterminate called without an active exception
Aborted (core dumped)
$ gg force --jobs 2000 --engine lambda --timeout 15 *.output.txt
→ Loading the thunks...  done (163 ms).
↗ Uploading 181 files (205.4 MiB)... done (147898 ms).
rmdir /tmp/thunk-execute.zujCy7: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TypeR0fToxn3kjFRIMbP1r0.UD4Kv9jXcVk6lUAustDc000002e6
[info] duplicating 1159 jobs
rmdir /tmp/thunk-execute.LMIDqS: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TNvivcWNITQf3SRSDb__rz38.qSw789cMLhiKhJvSd0U000002ea
[info] duplicating 207 jobs
[info] duplicating 161 jobs
[info] duplicating 856 jobs
[info] duplicating 373 jobs
[info] duplicating 152 jobs
[info] duplicating 204 jobs
[info] duplicating 681 jobs
[info] duplicating 347 jobs
[info] duplicating 233 jobs
[info] duplicating 214 jobs
[info] duplicating 98 jobs
[info] duplicating 124 jobs
[info] duplicating 326 jobs
[info] duplicating 147 jobs
[info] duplicating 92 jobs
[warning] socket failure: TYYQB0rx_acg7sYE2cUn2.AlxkzJbew8koPz6iKNyfuo000002ea
[warning] socket failure: T5SZiZGGjE_2QDYUOyrxFBEj.S_EqALdvgVQoOqyvvNc000002ea
[warning] socket failure: TKu3v_OblGbusnBMwVVcItGnMtWyVISZspk0lDWqHFKY000002ea
[warning] socket failure: TQZysrHv9M4spxI5PQoD1ApkVNcTC3zrj7ExsSd.9cew000002ea
[warning] socket failure: TM3SwZvUy2EUuUwYKQbgLTzurjN5nPuIoht8IXzlshAE000002ea
[warning] socket failure: TfpRJzBXByhT3gVg2lA.FBdHc2qBo17_30PPOwhQtu6A000002ea
[warning] socket failure: T6vQej2134h7IxTuLkAr4luV_cVi72_cvY_urWY1JJ14000002ea
[warning] socket failure: TzqAhHqt7HBPIPO0Vtm246MfS4DY6KsXAsSsWKgLxE4A000002ea
[warning] socket failure: TkI5y8PbTTOXitiy2TruAthv7BhLFCQbqDvrYnl_5E8o000002ea
[warning] socket failure: TnJ5SP6aDwZRbJMQNPqKqYJfB0QzSoFGaDXn_SybC2Xw000002ea
[warning] socket failure: TVV7VCpQr_giVcLBLipCxw_eLHWyjV9H2MDYSLVW0JNE000002ea
[warning] socket failure: TvIG5iszbcgQrA4uB5scYmd98u.gl8gSaatngpbpdrUU000002ea
[warning] socket failure: TtcDAubANGN547EpeAUqoGPkF2vJYg644WhjbeFEJMoQ000002ea
[warning] socket failure: Tq4_5mxh..CMnW7.KlBx64WrLmEhOO8_ar6OK6hwFOho000002ea
[warning] socket failure: TgM4bjwvZqsJDbwxWKxj8fJpvmv7GpkjtVOK0r.O1LlM000002ea
[warning] socket failure: T_Qmv_OsMG.ocsO.HWkV87UirG3HD7onvSqy1VSwdaSg000002ea
[warning] socket failure: TwGPj9G_KHCUh0qF8P8LYdPh2KoRIOyaxPRBuMnrrjOQ000002ea
[info] duplicating 8 jobs
[info] duplicating 1 job
↘ Downloading output files (17.6 MiB)... done (11257 ms).

And yes, I could see 2000 lambdas running (the number in red).

Could you please measure the time necessary for forcing just one thunk on Lambda? You can run gg force with --no-download to skip downloading and just time the whole command forcing one thunk (given that the files are already uploaded).

My hunch is that the overhead of each task (launching a Lambda + downloading the inputs + uploading the output + winding down) compared to the actual job is just too high. And you know, that would be okay if you had 15,000 workers (the same as the number of thunks you have); all of them start in parallel, take some time to initiate, and then run their task, finish, and deliver the results. But now, you need to pay that price multiple times.

Based on what you said, each function needs around 210 MiB of input data (which would take 4-6s to download) and produces a small output (which will be uploaded quickly). But your link to the cloud can also add to these overheads, and I can't help but wonder if running the job from an EC2 machine would make a difference. I think this is something you should definitely try.
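
For a rough sense of where the 4-6s figure comes from, that range corresponds to a per-worker S3 download throughput of roughly 35-50 MiB/s (actual throughput will of course vary):

210 MiB / 50 MiB/s ≈ 4 s
210 MiB / 35 MiB/s ≈ 6 s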

For the thunk creation step, I'll try out one of the approaches you mentioned - will probably go with the C++ API since you recommended it. Is either one faster than the other?

The C++ API is faster, and both are way faster than gg create-thunk.

Hope this helps!

Forcing a single thunk without download took ~6s which seems about right.

$ time gg force --engine lambda --no-download output.txt
→ Loading the thunks...  done (0 ms).
↗ Uploading 1 file (148.0 B)... done (228 ms).

real	0m6.332s
user	0m0.022s
sys	0m0.008s

Running it a second time with a different input takes less time (~3s), perhaps because it reuses the same instance, with the binary and the big input file already cached:

$ time gg force --engine lambda --no-download output.txt
→ Loading the thunks...  done (0 ms).
↗ Uploading 1 file (97.0 B)... done (167 ms).

real	0m2.761s
user	0m0.021s
sys	0m0.013s

I reran the experiment by splitting the input into 8-line pieces (<2k files) so that each worker runs just once. This had much better results. Using the old gg infer way (with gg create-thunk), thunk creation took less time (perhaps slightly less than I expected).

$ time gg infer make -j$(nproc) >/dev/null 2>/dev/null 3>/dev/null

real	0m3.869s
user	0m15.167s
sys	0m5.574s

And forcing the thunks took around 30s, which I guess is as expected given the 15s timeout; perhaps this is due to failures the first time around.

$ gg force --jobs 2000 --engine lambda --timeout 15 *.output.txt
→ Loading the thunks...  done (24 ms).
↗ Uploading 1897 files (870.4 KiB)... done (8465 ms).
rmdir /tmp/thunk-execute.opWqyJ: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TVnRJKH9ZXaZIyXRvdHufeeiug2LxrsKz9g30plGWDBI000002e6
rmdir /tmp/thunk-execute.5OqGFo: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: Tn2odK3pXai6tAkUhEcZ9hkQJmQFlpVbt.VvjZPo64OM000002e6
rmdir /tmp/thunk-execute.So1R9r: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TmtwB1RVfwN.ww0ZJJvzN.1pE3J0Dt90M_sDfzFqyELM000002e6
rmdir /tmp/thunk-execute.mxHqub: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: T1UYAmQ5w17LbmxgZb__CQ1Z_6A3m3CbabLNTzvGrv.A000002e6
rmdir /tmp/thunk-execute.izEKdX: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TMyRR0rSha4Rj6i08Rru.WnvjvA447N3jrZgjXzH5YKM000002e6
rmdir /tmp/thunk-execute.GEy3eJ: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: Tucwqf0nF9hSlZs55FikeihfYfcaZNHvmSK6E9FHMrmQ000002ea
rmdir /tmp/thunk-execute.03EnhB: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: T7h_HO_OP1bs7wiGi7h5seMCHxGYbCOx2nTRzf6df5z8000002e6
rmdir /tmp/thunk-execute.gx65Yl: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: Tcbl2q5qhp9a7t9y1YyZnSY9eiuOxRFAQ_TwSSN0sCdk000002ea
rmdir /tmp/thunk-execute.pSg7YC: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TbxWHEe.CPbwsIEah3rxwVQJrI9tVVi2mKY_vjnGc1Z8000002ea
rmdir /tmp/thunk-execute.wBtBiW: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: ThEiYfYsfBtQwDZ8e04U6wC1f.rRw72Eyv6N3YcyyeWU000002ea
rmdir /tmp/thunk-execute.BtfjXE: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TDVeqWRjOSYbsnSuKrwVH9pD9OgCCn7Lq5xEECz_eZ4k000002e6
[info] duplicating 88 jobs
↘ Downloading output files (10.1 MiB)... done (2192 ms).

I ran a second test with a 10s timeout and this also took around 30s. Perhaps this time there were failures two times in a row.

$ time gg force --jobs 2000 --engine lambda --timeout 10 *.output.txt
→ Loading the thunks...  done (32 ms).
↗ Uploading 1897 files (870.4 KiB)... done (9648 ms).
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'std::runtime_error'\n  what():  HTTP failure in downloading 'Vr2Cz0xI_PP_57Xd63bssY.5mLFUIO3sE6QJkTYg_Svc000001c7': HTTP/1.1 404 Not Found\n"}
[warning] operational failure: T4l4_41poxM2rXpL9jWbxXe97IIhvoRXwH1_uNZJpP1E000002ea
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'std::runtime_error'\n  what():  HTTP failure in downloading 'Vr2Cz0xI_PP_57Xd63bssY.5mLFUIO3sE6QJkTYg_Svc000001c7': HTTP/1.1 404 Not Found\n"}
[warning] operational failure: T4l4_41poxM2rXpL9jWbxXe97IIhvoRXwH1_uNZJpP1E000002ea
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'std::runtime_error'\n  what():  HTTP failure in downloading 'Vr2Cz0xI_PP_57Xd63bssY.5mLFUIO3sE6QJkTYg_Svc000001c7': HTTP/1.1 404 Not Found\n"}
[warning] operational failure: T4l4_41poxM2rXpL9jWbxXe97IIhvoRXwH1_uNZJpP1E000002ea
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'std::runtime_error'\n  what():  HTTP failure in downloading 'VxDi_0YAklL2LZDOCLRr1glEgxeAIqSMYmkoZPYowWxs000001c6': HTTP/1.1 404 Not Found\n"}
[warning] operational failure: TW7Bk2S1u5_mxUoy8hfKo7uXCrJfCFf9df0B8NXRrT0k000002ea
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'std::runtime_error'\n  what():  HTTP failure in downloading 'Vr2Cz0xI_PP_57Xd63bssY.5mLFUIO3sE6QJkTYg_Svc000001c7': HTTP/1.1 404 Not Found\n"}
[warning] operational failure: T4l4_41poxM2rXpL9jWbxXe97IIhvoRXwH1_uNZJpP1E000002ea
rmdir /tmp/thunk-execute.fQd3KB: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TfprX2jivLu4FizWi92XWkTR891GpisK_akuiavvvYCs000002e6
rmdir /tmp/thunk-execute.QHhmc6: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TUA2qaiLT5i_m4sG7Qs2JPjOVO2H52iS.IkRgcwHZJv0000002ea
[info] duplicating 1481 jobs
rmdir /tmp/thunk-execute.K1dc04: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: TyIkFnfUnwEnnGERg8eK5PbknEEh5zxUgfD0LpXVrPCE000002ea
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'std::runtime_error'\n  what():  HTTP failure in downloading 'Vr2Cz0xI_PP_57Xd63bssY.5mLFUIO3sE6QJkTYg_Svc000001c7': HTTP/1.1 404 Not Found\n"}
[warning] operational failure: T4l4_41poxM2rXpL9jWbxXe97IIhvoRXwH1_uNZJpP1E000002ea
[info] duplicating 5 jobs
[info] duplicating 165 jobs
↘ Downloading output files (10.1 MiB)... done (1692 ms).

real	0m44.202s
user	0m6.549s
sys	0m8.387s

Is it possible to debug these error messages? I see the ChildProcess and 404 messages quite frequently.
I'm going to set up the test from EC2 now and see if I see any improvement. Will keep you posted on how that goes.

Unrelated question: Is there a command to clear the thunk cache for just the current gg project, from S3 and locally?

I just ran the test on EC2. Good news! You were right! It took less time, despite getting a ton of rate-limited warnings.

$ time gg force --jobs 2000 --engine lambda --timeout 10 *.output.txt
→ Loading the thunks...  done (39 ms).
↗ Uploading 1899 files (206.2 MiB)... done (5845 ms).
[warning] rate limited: TQZxVGrlkGDnSkDSvz8kWKLa04UsfRI_JhPC4QlR82Yo000002ea
[warning] rate limited: TA_C9r.pN2YlTIqxS.Wwvnnb2yPCyiojxRLB39DJw2bI000002ea
[warning] rate limited: T6JXnu5O55PH3_RC9EH6rmbWpt.zwRyFLsuTCgjHnoDY000002ea
[warning] rate limited: Twn8tKkiEUfiq22w5UTLCodEGIx.awOMOi2BVUqcnlik000002ea
.
.
.
[warning] rate limited: T_Mo_HWO9m4CRPTOeIejLI73Hcecjk7nOA2_Kl0J0F9I000002ea
[warning] rate limited: T2z3jKYXXQYCEijHpCJjuI..aGJLBNzYviI3G6aBGBcE000002ea
[warning] rate limited: TMiWcy9WeimaHEA8BTmwKX1OnlzEyeqIvdyFTjlr823k000002ea
[info] duplicating 2 jobs
↘ Downloading output files (10.1 MiB)... done (1043 ms).

real	0m25.222s
user	0m10.136s
sys	0m3.295s

Looks like the Lambda execution took around 18s (25s total - 6s upload - 1s download). This is despite the ~8000 rate-limited warning messages in the output. It also had 6 of the ChildProcess errors. And it looks like only 2 jobs failed the first time, so it was more reliable running on EC2. I'm guessing that if the rate-limiting issue is figured out, this could be sub-10s! Seems like this requires talking to AWS support to get the limit increased. Have you seen this many rate-limited messages in your runs?

Inference took much longer using the gg infer approach because I was running this on a t3.micro instance. So I'll try out the C++ API as you suggested, or maybe gg repl, to improve this.

$ time gg infer make -j$(nproc) >/dev/null 2>/dev/null 3>/dev/null
real	0m19.489s
user	0m25.846s
sys	0m12.209s

Also, is it possible to get gg force to skip the error cases after a timeout, without retrying?

I looked at gg repl and it seemed quite easy to use, so I thought I'd try it out and see if the performance is good enough for this use case.
I'm trying to create and execute a thunk for filetest input.txt output.txt. There were 2 issues I ran into while trying a few things out after looking at the file you linked.

  • The description of the same thunk created using gg infer has a @{GGHASH:...} argument, but the thunk created using gg repl didn't have this hash value. Do I have to hash the input file outside gg repl and construct the hash string to pass in? Or does gg repl handle it?
  • The output file didn't exist in the directory after using gg repl, so I'm not sure what to gg force.

The old way using gg infer:

$ gg infer filetest i.txt o.txt 
$ cat o.txt 
#!/usr/bin/env gg-force-and-run
TSVo_wrmjmC3H7wcmwu8ZGNDldsdc9vb6Iu2ZDxJr2QY00000108
$ gg describe TSVo_wrmjmC3H7wcmwu8ZGNDldsdc9vb6Iu2ZDxJr2QY00000108
{
 "function": {
  "hash": "VYjZ0uPlVFYiusJnE5Saokwsp6k8EuQm_uY8jr3G3hiQ00002160",
  "args": [
   "filetest",
   "@{GGHASH:VziaXgzsiNzeCIBBjDSZ9oHqywVIPYTBP5.ksiphPP_000000016}",
   "o.txt"
  ],
  "envars": []
 },
 "values": [
  "VziaXgzsiNzeCIBBjDSZ9oHqywVIPYTBP5.ksiphPP_000000016"
 ],
 "thunks": [],
 "executables": [
  "VYjZ0uPlVFYiusJnE5Saokwsp6k8EuQm_uY8jr3G3hiQ00002160"
 ],
 "outputs": [
  "o.txt"
 ],
 "links": [],
 "timeout": 0
}

What I tried with gg repl

$ gg repl
push data i.txt
push executable filetest
push arg filetest
push arg i.txt
push arg out.txt
push output out.txt
create thunk
T3818yEAgkwX8vuBV6uhGKduacn4bWNLS8YAScYy5nLI000000d2
exit
$ gg describe T3818yEAgkwX8vuBV6uhGKduacn4bWNLS8YAScYy5nLI000000d2
{
 "function": {
  "hash": "VYjZ0uPlVFYiusJnE5Saokwsp6k8EuQm_uY8jr3G3hiQ00002160",
  "args": [
   "filetest",
   "i.txt",
   "out.txt"
  ],
  "envars": []
 },
 "values": [
  "VziaXgzsiNzeCIBBjDSZ9oHqywVIPYTBP5.ksiphPP_000000016"
 ],
 "thunks": [],
 "executables": [
  "VYjZ0uPlVFYiusJnE5Saokwsp6k8EuQm_uY8jr3G3hiQ00002160"
 ],
 "outputs": [
  "out.txt"
 ],
 "links": [],
 "timeout": 0
}
$ gg force T3818yEAgkwX8vuBV6uhGKduacn4bWNLS8YAScYy5nLI000000d2
gg-force: open (T3818yEAgkwX8vuBV6uhGKduacn4bWNLS8YAScYy5nLI000000d2): No such file or directory
$ gg force out.txt
gg-force: open (out.txt): No such file or directory

Sending push arg @{GGHASH:VziaXgzsiNzeCIBBjDSZ9oHqywVIPYTBP5.ksiphPP_000000016} instead of push arg i.txt created the same thunk, but this required running gg hash i.txt outside gg repl to get the hash. And it still didn't create the out.txt file (containing the thunk ID) that I could force to get the output.
I'm obviously missing something here. I'd appreciate your help with this. Thanks :)

From the AWS Lambda console:
[screenshot: AWS Lambda console, 2020-08-06]
Looks like my account was allowed just 1k Lambdas even though the number in red in the gg force output was 2k. I've reached out to them to get this increased.

Hi @drunksaint -- apologies for my delayed response.

Is it possible to debug these error messages? I see the ChildProcess and 404 messages quite frequently.

The ChildProcess error is something that I'm wondering about, too, and I guess it has something to do with a race condition in the kernel. We use this class to fork the process and execute the thunk, and there's a check that makes sure the parent process is not a multi-threaded program. But for some reason, that check sometimes fails, even though no other threads are running... @keithw: Any ideas why this is happening?

re: 404 messages, are you sure those blobs exist in the S3 bucket? That usually means that the object is missing and can happen when there's a mismatch between the local index and the remote bucket.

Unrelated question: Is there a command to clear the thunk cache for just the current gg project, from S3 and locally?

Sadly, no, not yet. But it's a good thing to consider.

Seems like this requires talking to AWS support to get the limit increased. Have you seen this many rate-limited messages in your runs?

We also talked to AWS support to get our limit increased. If you want to get 2,000 concurrent Lambdas, you need them to increase your limit to around 10,000 concurrent requests (expected duration × expected requests per second).
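
To put rough numbers on that formula, using the figures from the limit-increase request we made for gg:

1,000 expected requests per second × 10 s expected duration ≈ 10,000 in-flight requests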

Also, is it possible to get gg force to skip the error cases after a timeout, without retrying?

Unfortunately, no. All the thunks have to be completed for a job to finish.

The description of the same thunk created using gg infer has a @{GGHASH:...} argument, but the thunk created using gg repl didn't have this hash value. Do I have to hash the input file outside gg repl and construct the hash string to pass in? Or does gg repl handle it?

gg repl doesn't directly support that. Instead, you can try pushing a link, which creates a symlink in the execution directory to the blob. For example, in your case, after push data i.txt, you can do a push link NAME TARGET, and then your program can just access NAME without any of those @{GGHASH:...} shenanigans.

The output file didn't exist in the directory after using gg repl, so I'm not sure what to gg force.

Normally, gg repl only creates the thunk in the blobs directory. If you want an output placeholder file to use with gg force, you need to pass the create placeholder OUTPUT_TAG PLACEHOLDER_PATH command to gg repl.
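
Putting the two together, something along these lines should work. (This is off the top of my head, so double-check the exact arguments of push link and create placeholder against the gg repl source; [HASH_OF_i.txt] is a placeholder for the blob hash the link should point to.)

$ gg repl
push executable filetest
push arg filetest
push data i.txt
push link i.txt [HASH_OF_i.txt]
push arg i.txt
push arg out.txt
push output out.txt
create thunk
create placeholder out.txt out.txt
exit
$ gg force --engine lambda out.txt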

Looks like my account was allowed just 1k Lambdas even though the number in red in the gg force output was 2k. I've reached out to them to get this increased.

Nice, yeah, I think you need to get your limit increased.

Please let me know if there's anything else I can help with!

--Sadjad

Hey @sadjad! Thanks for your response!

re: 404 messages, are you sure those blobs exist in the S3 bucket? That usually means that the object is missing and can happen when there's a mismatch between the local index and the remote bucket.

I saw 404s initially, but the execution went through eventually. And the output file was correct and had everything in place. So I'm guessing this might be due to some sort of race condition.

I got gg repl to work on my local machine after the changes you suggested! And it's much faster! But I've encountered a new issue. The upload step from EC2 to Lambda is getting killed for some reason. The exact same thing works when I run it from my local machine to Lambda, or locally on EC2, though.

user@ec2$ gg force --engine lambda --timeout 10 outputpieces/*.out.txt
→ Loading the thunks...  done (0 ms).
↗ Uploading 2 files (205.4 MiB)... Killed
user@ec2$

me@localmachine$ gg force --engine lambda --timeout 10 outputpieces/*.out.txt
→ Loading the thunks...  done (0 ms).
↗ Uploading 4 files (205.4 MiB)... done (148358 ms).
↘ Downloading output files (2.4 KiB)... done (145 ms).
me@localmachine$

user@ec2$ gg force --timeout 10 outputpieces/*.out.txt
→ Loading the thunks...  done (1 ms).
user@ec2$

Do you know why this might be happening? Is there a way I can debug this?

Hi @drunksaint,

My best guess is that the EC2 machine runs out of memory, and then the process is killed. Could you please check if that is the case?

--Sadjad

Yup, you're right! I'll move to a fatter instance. I was surprised because the machine has more than 750 MB free for ~200 MB of files.

Got it working end to end on EC2! The whole thing (16k lines in 2k tasks - 8 lines per task) now takes around 23s for the full upload+execution+download cycle, of which upload+download is around 5s. So execution is around 18s (with a timeout of 5s and 2k Lambdas).

A single task, when run independently from EC2, takes around 2.5s to run, though. I must've run close to 20 tests now, and it always takes around 23-24s. I always see >50 of the ChildProcess and 404 errors - the ratio and number vary between runs. But the runs always complete with valid output, so the 404s are most likely a race condition. The ChildProcess error I'm not sure about, though. Is it possible that these two errors are leading to the higher-than-expected execution times?

The two errors:

rmdir /tmp/thunk-execute.mCj4Vp: Directory not empty
gg-execute-static: ChildProcess constructed in multi-threaded program

[warning] operational failure: T4OFoGB7grd8LhkRCXfwp.WiNGYcBeCMkTpUQVQ3c3Gg00000308

and

invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'std::runtime_error'\n  what():  HTTP failure in downloading 'VlKG0pQdehrKSj79PAvohBCG7YoEDXXnJevvcOVxirkk000001d6': HTTP/1.1 404 Not Found\n"}
[warning] operational failure: TCfcC03B0eKzA4V9V90irN.FswxhvZy9Tif_BiHKbmFE00000300

That still seems like too much... I wonder if we're hitting some limits within S3. Would you be willing to get me some timing logs from an actual run? Here are the steps:

  1. Set the GG_TIMELOG=1 environment variable, before running gg force
  2. After the job is done, run aws s3 cp --recursive s3://[YOUR_BUCKET]/timelog [LOCAL_DIR]/ to download all the timing logs.

If the job failed for any reason, please get rid of the timelog directory in your bucket first and then try again.
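
Concretely, that would look something like this (the gg force arguments, bucket name, and local directory are placeholders for your setup):

GG_TIMELOG=1 gg force --jobs 2000 --engine lambda --timeout 10 *.output.txt
aws s3 cp --recursive s3://[YOUR_BUCKET]/timelog ./timelogs/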

Seems like it's the download from S3 that's causing problems; execution times seem very consistent.

[chart]

Top few values from get_dependencies:

[table]

I'm assuming the first value in the timelog file is the timestamp. This was run with a 5s timeout, so there's a dip after 5s. Looks like the replacement jobs run better.

[charts]

Top few upload times:

[table]

[additional charts]

[raw data attached]

Are there any other stats that you'd like to see?
Note: "time from start" or "time from first" is t - t0, where t0 is the minimum timestamp in the logs.

Hi @drunksaint,

Thank you for getting the data!

So, get_dependencies doesn't look too bad, but do_cleanup seems to be much worse (the worst case is about 25s, which is unreasonable). Let's fix this one first, and then we'll figure something out about S3. For starters, can you please comment out this line and see if anything changes?

Also, can you run this script over the timing logs? It will convert them to CSV format, and I can use it to generate something similar to Figure 10 in the paper.

Thank you!

--Sadjad

Hi @sadjad!

I thought the do_cleanup numbers were in ms XD.

After removing the cleanup stage, run times varied quite a bit - from a second or two less to almost 10s less! Many times the run fails due to disk-space issues, perhaps because of the lack of cleanup, and I'm seeing the 'Slow Down' error code more often now. The charts look different from run to run, so I've attached stats from multiple runs below:

  • Stats from 3 of the successful runs that took >20s are here, here and here (processed using the script that you linked). It's strange that these still show workers performing cleanup.
  • The best successful run's stats: parse_runlogs.out.txt. This run had the fewest errors by a big margin.
→ Loading the thunks...  done (37 ms).
↗ Uploading 1897 files (870.4 KiB)... done (3594 ms).
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'unix_error'\n  what():  write: No space left on device\n"}
[warning] operational failure: ThqY972yCH_FX212HmJtViNAr4L_PArxfyilRvOEsDUg00000308
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'unix_error'\n  what():  write: No space left on device\n"}
[warning] operational failure: ThqY972yCH_FX212HmJtViNAr4L_PArxfyilRvOEsDUg00000308
[info] duplicating 20 jobs
invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'unix_error'\n  what():  write: No space left on device\n"}
[warning] operational failure: TIi2wYixq1N7MxdF_GCMpHpzvjJnG_qwNe1OvragJBBg00000308
↘ Downloading output files (10.0 MiB)... done (1369 ms).

real	0m14.613s
user	0m2.359s
sys	0m1.011s

The timelog folder has the same number of files every time, but I see a different number of error messages each time. Does this log only the successful executions?

Let me know if you'd like me to send any more stats or run it more times.

Thanks!
drunksaint

Some of the do_cleanup histograms from the above successful runs:
[histograms of do_cleanup from three runs]

Excel automatically adjusts the axes, so there are high values with low occurrences that aren't visible in all the charts. For example, the values that can't be seen in the last chart:

[screenshot: the off-chart do_cleanup values]

Could this possibly be the reason?

Hello @drunksaint,

I thought the do_cleanup numbers were in ms XD.

Oops. You are right, you're right. All numbers are in milliseconds, so do_cleanup is not the culprit. My apologies for sending you down the rabbit hole.

Let's look at one of those >20s runs:

[graph: worker timings for one of the >20s runs]

A few points:

  • There are some spikes -- generally workers that take a long time to fetch the dependencies from S3. But straggler mitigation (--timeout) should take care of those.

  • I think the problem is that it takes so long to launch all the Lambdas. So, I tried to mimic the job you are running: I created a job with 2000 thunks, each with a 100-MB executable and a 100-MB input file (200 MB total), and executed it on Lambda with --jobs 2000. The executable just sleeps for a few seconds and then dies. Look at the graph:

[graph: Lambda start times for the 2,000-thunk test job]

Almost 1,700 Lambdas are launched within the first second of the job. Compared to that, yours takes way longer to start (I now remember that the capacity we asked for gives us 1,500 concurrent Lambdas).

This was our request to the Lambda team, which gave us enough capacity to launch 1,500 concurrent Lambdas:

Expected Average Requests Per Second: 1000
Expected Peak Requests per Second: 1250
Expected Function Duration in Seconds: 10
Function Memory Size: 3008
Invocation Type (Event / Request-response): Request-Response
Event Source (If Applicable): N/A

I think the Expected Average Requests Per Second is the key. Do you happen to know the parameters of your limit increase request?

Hi @sadjad

Oops. You are right, you're right. All numbers are in milliseconds, so do_cleanup is not the culprit.

I'm glad XD. I had started adding code to measure the backend creation time XD.

Wow, our execution profiles look quite different. The charts are super neat. My support request looked like this:

1- Expected average requests per second: I am currently in development stage. So can't say for sure. Average might be low since this is for burst workloads.
2- Expected peak requests per second: 10k
3- Expected function duration: 10-15s
4- Function memory size: 3008
5- Invocation Type: RequestResponse
6- Event Source (If Applicable): - 

Yeah, you might be right - I asked for a low average requests per second, so they have provided a low burst capacity but a 10k max. I'll reach out to AWS support and try to get it fixed.

Thanks!

@sadjad did you see any 404/503/ChildProcess errors when you ran your simulation?

Not really. I got plenty of "rate limited" errors, but those are okay. I think I saw a few ChildProcess errors, but no 404s or 503s.

Hi @sadjad

I've been busy the last week talking to people at AWS to figure out what the issue is, and that seems to be taking time. Seems like some problem with the burst concurrency at their end. I sent them the two charts you sent me, and they asked me if I could replicate your test exactly and see if I'm still observing the same issue in my region. I'm guessing you're using Ubuntu 18.04 to run the script from an EC2 instance in us-west-2 (Oregon) to Lambdas in us-west-2 (Oregon). I'm using a t3.medium EC2 instance in us-east-1 (N. Virginia) to Lambdas in us-east-1 (N. Virginia). Can you confirm if my guess is right? Also, which instance type are you using? I doubt that'll matter much, but it might help reduce the number of variables in the two tests.

Hi @drunksaint,

Actually, I'm running my script from my machine at Stanford (Northern California) to Lambdas in us-west-2 (Oregon). I'm running Ubuntu 20.04 on an eight-core, 3.2 GHz Intel Core i7 CPU.

The gg force program is single-threaded, so having more cores wouldn't help with its performance, but I wonder if gg force on your t3.medium instance is maxing out the CPU? I suggest trying to run gg force on a c5.large or a similar instance.

Also, in my experience, Lambda performance is not the same in all regions. I usually use us-west-1 and us-west-2, but I haven't used us-east-1 in a long time. Back in the day, it wasn't as good as the west regions in terms of concurrency.

Here's the code for my function:

#include <fstream>
#include <thread>
#include <chrono>
#include <cstdlib>
#include <ctime>

// ~100 MB of initialized data, so the statically linked binary is ~100 MB.
#define SIZE (1024 * 1024 * 100)

char dummy[SIZE] = { 'j' };

using namespace std;
using namespace chrono;

int main()
{
  dummy[SIZE - 1] = '\0';
  srand( time( nullptr ) );
  this_thread::sleep_for( 5s );  // simulate a few seconds of work
  ofstream fout { "output" };
  fout << rand() << endl;        // write a small output file
  return 0;
}

Compiled with g++ prog.cc -o prog -static -s.

Create a large, random input file: dd bs=1024 count=102400 </dev/urandom >input

Thunk creation:

F_HASH=`gg hash prog`
I_HASH=`gg hash input`

for i in {1..2000}
do
    gg create-thunk --value ${I_HASH} --executable ${F_HASH} --output output \
                    --placeholder ${i}.task ${F_HASH} prog ${i}
done

Execution:

gg force --jobs 2000 --engine lambda *.task --timeout 10

Thanks @sadjad for the details of your test! This helps move the conversation forward with the AWS person I'm in touch with.

I'm seeing quite different Lambda start profiles in us-west-2 and us-east-1, even from a c5n instance.
us-west-2:

[charts: worker start times]

us-east-1:

[charts: worker start times]
I see a lot of HTTP 503 Slow Down errors, especially in the us-east-1 case.

invalid response: {"returnCode": -6, "stdout": "terminate called after throwing an instance of 'std::runtime_error'\n  what():  HTTP failure in downloading 'VZkA3Vvq5jyZh8K7XxyWNfvEUsZUUmadCVRL9oUj8bT406400000': HTTP/1.1 503 Slow Down\n"}
[warning] operational failure: TyW8xitrvArhitQBkbVVB9qtH8HhXTnlu_YRR8TXYbvg000000c3

It's strange that the step is always ~10sec+ in the us-east case. Do you think this could be related to the 10sec timeout value in gg? As in, maybe the first set of tasks actually ran, but failed due to the 503, and 10s later, replacement tasks were started for all these failed tasks.

Thanks
-drunksaint

I have a theory - S3 has an upper limit on how many simultaneous downloads can happen from a single key. Since I'm running from an EC2 instance in the same region, the requests are being sent faster than S3 can keep up with, resulting in the high number of 503s. And gg retries after 10s. I'm not sure if gg retries 503 failures within the same Lambda or if it creates a different Lambda in this case. If it does create a different Lambda, then the failed Lambdas most likely aren't logged in the timelogs. And if it retries within the same Lambda, is the start time measured as the start time of the original execution or the start time of the new execution?

When I send requests to Lambdas in us-west from an EC2 instance in us-east, or you send them from your home, we don't see the 503s because the requests are slow enough for S3. And this results in a smoother execution because gg doesn't hit any errors and doesn't need to retry.

Do you think this makes sense? Or is there something wrong with some of the assumptions?

Hey @sadjad! I ran the test without gg and am not seeing the bump up to >10s in the start times. So it looks like there might be something in gg that's causing that. There is one difference in the test though - I'm not using files for the small inputs and the outputs - I'm sending these directly through the Lambda payload. But this shouldn't affect the start time profile.
[chart: worker start times without gg]

Hi @drunksaint,

Lunch on me for the delayed response!

I'm seeing quite different Lambda start profiles in us-west-2 and us-east-1, even from a c5n instance.

This definitely has something to do with Lambda's burst concurrency quotas.

It's strange that the step is always ~10sec+ in the us-east case. Do you think this could be related to the 10sec timeout value in gg? As in, maybe the first set of tasks actually ran, but failed due to the 503, and 10s later, replacement tasks were started for all these failed tasks.

Okay, it seems I had it wrong. I didn't pay enough attention and thought the 503 errors were coming from Lambda, telling us to slow down launching workers... But they're coming from S3. My bad!

However, the 10s timeout is for the tasks that are started, but are still executing after 10s. Failed tasks are retried immediately.

I'm not sure if gg retries 503 failures within the same Lambda or if it creates a different Lambda in this case. If it does create a different Lambda, then the failed Lambdas most likely aren't logged in the timelogs.

Yes, you're right. It creates a different Lambda, and the failed ones are not logged in the timelogs (you see one entry per thunk in the timelogs, even if it failed multiple times or was retried due to a timeout -- our logging system is definitely lacking).

When I send requests to Lambdas in us-west from an EC2 instance in us-east, or you send them from your home, we don't see the 503s because the requests are slow enough for S3. And this results in a smoother execution because gg doesn't hit any errors and doesn't need to retry.

FWIW, I did run the same experiment from an EC2 machine in the same region, and it ran just as smoothly.

I'm not using files for the small inputs and the outputs.

Oh, I didn't know that you had small inputs. They can be the reason that S3 is acting up -- we had a similar problem in another project, where we had to deal with small files (less than a kilobyte in size) on S3 and we kept hitting the limits pretty quickly. Since gg doesn't do exponential backoff in case of 503 errors, the situation just gets worse as we keep a high request rate. I imagine that in your case, it's not possible to bundle up the big and small inputs in one file.

If that's the issue, I can easily think about an enhancement to gg where the small inputs are sent with the Lambda payload (I think we already have code for that), as you're doing in your code.

Hi @sadjad

Lunch on me for the delayed response!

I thought I lost you for some time :)

This definitely has something to do with Lambda's burst concurrency quotas.

I've been speaking to someone at AWS. They haven't found the issue at their end yet. I pointed him to gg and he looked intrigued; I believe he is running some tests using gg.

Okay, it seems I had it wrong. I didn't pay enough attention and thought the 503 errors were coming from Lambda, telling us to slow down launching workers... But they're coming from S3. My bad!

Yeah, I guess I should've been clearer. They are definitely from S3.

However, the 10s timeout is for the tasks that are started, but are still executing after 10s. Failed tasks are retried immediately.

OK, then it might be possible that the Lambdas get throttled when they see too many 503s from S3. But the AWS folks didn't bring that up. So I'm still not sure why the 10-12 sec jump happened.

FWIW, I did run the same experiment from an EC2 machine in the same region, and it ran just as smoothly.

Strange. I was getting 503s even with your test (the 5-second sleep), where there are no extra inputs.

I imagine that in your case, it's not possible to bundle up the big and small inputs in one file.

Yeah, unfortunately it isn't possible to bundle them.

If that's the issue, I can easily think about an enhancement to gg where the small inputs are sent with the Lambda payload (I think we already have code for that), as you're doing in your code.

Thanks, that would be super useful! Can the outputs also be bundled in the response payload? I'm not sure if that's possible with the way gg is set up, though.

Can the outputs also be bundled in the response payload?

Yes! Give me a couple days and I'll have that on master :)

oh awesome! thanks! :)