microsoft / azure-pipelines-agent

Azure Pipelines Agent 🚀


[Question]: Any known issues that might cause a delay of exactly 100 s between receiving a job and starting its execution?

tvachev opened this issue

Describe your question

We experience a delay of 100 s (1:40 min) between queuing and initializing a job when executing the same pipeline on self-hosted Linux agents (on CentOS and Debian).

2023-03-23T14:06:16.0676904Z ##[section]Starting: Job
2023-03-23T14:07:56.4086577Z ##[section]Starting: Initialize job

On Microsoft hosted agents it is nearly instantaneous.

Part of the agent worker log when it happens:

[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Root': '/home/ladmin/myagent'
[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Work': '/home/ladmin/myagent/_work'
[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Bin': '/home/ladmin/myagent/bin'
[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Root': '/home/ladmin/myagent'
[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Work': '/home/ladmin/myagent/_work'
[2023-03-23 11:52:19Z INFO JobServerQueue] Try to append 1 batches web console lines for record '12f1170f-54f2-53f3-20dd-22fc7dff55f9', success rate: 1/1.
[2023-03-23 11:52:34Z INFO JobServerQueue] Stop aggressive process web console line queue.
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Bin': '/home/ladmin/myagent/bin'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Root': '/home/ladmin/myagent'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Work': '/home/ladmin/myagent/_work'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Tools': '/home/ladmin/myagent/_work/_tool'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Bin': '/home/ladmin/myagent/bin'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Root': '/home/ladmin/myagent'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Work': '/home/ladmin/myagent/_work'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Temp': '/home/ladmin/myagent/_work/_temp'

The pipeline definition is probably irrelevant, as we also tried extremely simple ones.
The problem does not stem from waiting for an available agent in the pool.

Any idea why this might be happening or if it is a known issue?

Thanks!

Versions

Agent version is 2.218.1.

Environment type (Please select at least one environment where you face this issue)

  • Self-Hosted
  • Microsoft Hosted
  • VMSS Pool
  • Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Operating system

CentOS 8, Debian 11

Version control system

git

Azure DevOps Server Version (if applicable)

No response

https://github.com/microsoft/azure-pipelines-agent/pull/4166/files seems like a likely candidate

Is there any way of turning the telemetry off completely?

(the default HttpClient timeout is 100s)
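A quick way to sanity-check this from an affected agent host (the metadata URL, header, and 10-second cap below are only illustrative, not necessarily the exact request the agent makes) is to see whether a call to 169.254.169.254 hangs rather than failing fast:

# If this hangs until --max-time expires instead of failing immediately, packets
# to the instance metadata address are being silently dropped, and the agent's
# telemetry request will sit there until HttpClient's default 100 s timeout fires.
time curl -sS --max-time 10 -H "Metadata: true" \
  "http://169.254.169.254/metadata/instance?api-version=2021-02-01"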

Downgrading to v2.217.2 has fixed the delay in our environment.

Hi @tvachev, thanks for reporting! We will take a look

@blushingpenguin Hi Mark. This is also hitting us. Many thanks for reporting.
What is the procedure to downgrade to an older agent version? I could not find any good information.


@mmunte-impeo
  • Remove the newer version.
  • Disable agent auto-update in Azure DevOps.
  • Install an older version; @blushingpenguin suggests 2.217.2 (see the sketch below).

I can confirm it works. (Strangely enough, we don't see the problem when running newer agent versions on Windows.)
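A rough command-line sketch of the manual downgrade on a Linux host, assuming the agent lives in /home/ladmin/myagent (the path from the log above) and runs as a systemd service; <PAT>, <org> and <pool> are placeholders:

# Stop and unregister the current (newer) agent.
cd /home/ladmin/myagent
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove --auth pat --token <PAT>

# Unpack the older release into a fresh directory
# (same download URL pattern used further down in this thread).
mkdir -p /home/ladmin/myagent-2.217.2 && cd /home/ladmin/myagent-2.217.2
curl -LsS https://vstsagentpackage.azureedge.net/agent/2.217.2/vsts-agent-linux-x64-2.217.2.tar.gz | tar -xz

# Re-register and start the service. Also disable agent auto-update in the
# organization's pipeline settings so it is not upgraded straight back.
./config.sh --url https://dev.azure.com/<org> --auth pat --token <PAT> --pool <pool>
sudo ./svc.sh install
sudo ./svc.sh start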

I can confirm that 2.218.1 has the 100-second delay.

Running our agents as docker containers made downgrading complicated.
Instead we blocked 169.254.169.254 on the host using iptables.

iptables -I DOCKER-USER -d 169.254.169.254 -j REJECT

It is important to add the rule to the DOCKER-USER chain, or it will have no effect inside the containers.
Using REJECT (rather than DROP) makes the HTTP request fail instantly.
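To verify the rule actually takes effect inside containers, here is a quick check (the curlimages/curl image is just a convenient throwaway example):

# Should fail immediately (connection refused) rather than hanging,
# because the REJECT rule answers the connection attempt right away.
docker run --rm curlimages/curl -sS --max-time 5 http://169.254.169.254/metadata/instance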

We are also having this issue.

Same issue with self-hosted 3.xxx.x on macOS. Can't say if we had the same with 2.218.1.

Same issue here, exactly 1:40 each time a pipeline runs, on self-hosted Azure agents.


Same issue.
Agent Version 2.218.0/3.218.0

It looks to me like there is a connection that doesn't use the proxy.

If I understand the code correctly, src/Agent.Sdk/Util/PlatformUtil.cs at line 325 makes a connection without the host context:
HttpResponseMessage response = await httpClient.GetAsync(serverFileUrl);

The same file, at line 28, creates the HttpClient without the context:
private static HttpClient httpClient = new HttpClient();

In other parts of the code I see (src/Agent.Listener/Configuration/ConfigurationProvider.cs, line 359/260):
using (var handler = HostContext.CreateHttpClientHandler())
using (var httpClient = new HttpClient(handler))

Hope this helps.
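As a side note, when the agent is configured with a proxy via ./config.sh --proxyurl, that setting is stored in a .proxy file in the agent root, so you can at least confirm a proxy is configured even though the client above bypasses it (the path below is the one from the log earlier in this thread):

# Prints the configured proxy URL; the file does not exist if no proxy was set.
cat /home/ladmin/myagent/.proxy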


Downgrading worked for me too. Running in a docker container based on: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#linux

In that setup, to downgrade, make the following change in the start.sh script, then rebuild the image and redeploy the container(s):

#curl -LsS $AZP_AGENT_PACKAGE_LATEST_URL | tar -xz & wait $!
curl -LsS https://vstsagentpackage.azureedge.net/agent/2.217.2/vsts-agent-linux-x64-2.217.2.tar.gz | tar -xz & wait $!

The route via capabilities did not work for me; perhaps I'm doing something wrong there.
Hope this helps people like me who land on this issue from Google.

Downgrading by using curl directly with a previous version doesn't seem to work for me. Bizarre how it's all fine if I target the latest version. Any ideas?



Look in the /azp/_diag folder logs for error messages, e.g.:

docker container list # find container name
docker exec -it $CONTAINER_NAME /bin/bash
    cd /azp/_diag/
    cat Agent_*
    exit

If you are hitting a bug that is fixed in the newer versions then I guess you are kinda out of luck on this one, unless there is an easy workaround there.

I believe this is a bug at this point. Do we know when this will be resolved?

Downgrading by using curl directly with a previous version doesn't seem to work for me. Bizarre how it's all fine if I target the latest version. Any ideas?


Make sure your target host / container is linux-x64. If not, you may need to tweak the curl link.

I was able to downgrade successfully using a Linux Docker container. With v2.217.2 there is no more delay in the agent. I'll set it back to the latest version once I confirm this issue is resolved.

@kirill-ivlev This issue is causing tremendous slowdowns in our builds and releases. When is this going to be fixed? It has been 3 weeks with no update.

Facing the same issue. Was using 3.218.0 and discovered this issue. Downgrading to 2.218.1 made no difference. Downgrading further to 2.217.2 works, for now.

Downgrading by using curl directly with a previous version doesn't seem to work for me. Bizarre how it's all fine if I target the latest version. Any ideas?


I figure this might be related to Ubuntu 22.04 and the way it handles SSL. Downgrading to Ubuntu 20.04 allowed me to downgrade the agent version to 2.217.2.


Facing the same issue in 3.218. When is it going to be fixed?

We are having the same issue on all our Linux agents with 3.218. The downgrade didn't work for us because of an SSL exception. When will this be fixed?

If you don't want to use iptables, you can also work around this by adding a blackhole route:

sudo ip route add blackhole 169.254.169.254

It does not persist across reboots, so you need to handle that.
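One way to handle that, assuming a systemd-based distro (the unit name below is arbitrary; adjust the path to ip if it lives elsewhere, e.g. /sbin/ip):

# Create a oneshot unit that re-adds the blackhole route at every boot.
sudo tee /etc/systemd/system/blackhole-imds.service >/dev/null <<'EOF'
[Unit]
Description=Blackhole 169.254.169.254 to avoid the agent telemetry hang
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ip route add blackhole 169.254.169.254
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now blackhole-imds.service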

The blackhole trick worked on our servers.
Should this maybe be retagged as BUG instead of QUESTION?
Might that be the reason why this is not moving forward?

@tfabraham We are working on the fix now and expect that it will be delivered with the next version.
I updated the title and added a bug label.

At my company we are facing the same issue!
If the mentioned PR is the cause, I would strongly suggest providing a feature to turn this telemetry off completely.
The thing is, self-hosted agents are usually behind the company's firewall, which will eventually block or throttle this traffic (as in my case).

So I beg you, please provide a feature to turn this off.


At my company we are facing the same issue! If the mentioned PR is the cause, I would strongly suggest providing a feature to turn this telemetry off completely. The thing is, self-hosted agents are usually behind the company's firewall, which will eventually block or throttle this traffic (as in my case).

So I beg you, please provide a feature to turn this off.

This helps >> sudo ip route add blackhole 169.254.169.254

At my company we are facing the same issue! If the mentioned PR is the cause, I would strongly suggest providing a feature to turn this telemetry off completely. The thing is, self-hosted agents are usually behind the company's firewall, which will eventually block or throttle this traffic (as in my case).
So I beg you, please provide a feature to turn this off.

This helps >> sudo ip route add blackhole 169.254.169.254

Just a hotfix, but it works. Thank you!

Running our agents as docker containers made downgrading complicated. Instead we blocked 169.254.169.254 on the host using iptables.

iptables -I DOCKER-USER -d 169.254.169.254 -j REJECT

It is important to add the rule to the DOCKER-USER chain, or it will have no effect inside the containers. Using REJECT (rather than DROP) makes the HTTP request fail instantly.

This solved my problem. I find it very interesting that the problem is still not permanently resolved.

For Mac this doesn't seem to solve the problem:

sudo ip route add blackhole 169.254.169.254

For Mac this doesn't seem to solve the problem:

sudo ip route add blackhole 169.254.169.254

This works on Ubuntu! Thanks!

Any update on this one? It's causing a lot of waiting time in our pipelines.

Seeing the same behaviour with Azure DevOps Agent version 3.218.0 on Ubuntu 22.04.


We host our pipeline agents in Ubuntu 22.04 containers and are seeing this same behavior: jobs get assigned an agent, then freeze for 100 s before continuing.

When will we see a fix for this issue?

Same issue. Lots of wasted build time. Following for fix.

FYI everyone the release that contains the fix just came out four hours ago (May 4, 2023 7:42 AM EDT). I will be deploying tonight.

https://github.com/microsoft/azure-pipelines-agent/releases/tag/v3.220.1

FYI everyone the release that contains the fix just came out four hours ago (May 4, 2023 7:42 AM EDT). I will be deploying tonight.

https://github.com/microsoft/azure-pipelines-agent/releases/tag/v3.220.1

I just updated my agent to 3.220.1 and the issue still persists.
The only thing that helps is blocking the IP, as @SabareeshGC stated above, with sudo ip route add blackhole 169.254.169.254 on Ubuntu.

Edit: Never mind, I accidentally updated to 3.220.0... I will update my comment when 3.220.1 is officially released.

Edit 2: After 3.220.2 was released, everything works as expected again; the workaround is no longer needed!

I restarted my agents last night assuming they would pull in the new version, but they did not. I noticed this morning that v3.220.1 is a pre-release, so I have to change things to specifically use that version.

Tested the new agents last night and have been very happy since. The 100-second timeout is resolved. Ubuntu 22.04.

When can we expect v3.220.1 to go out of pre-release? Sorry, I just read the text on the repo's homepage about pre-releases :)

How can I tell my agent to go and update to the pre-release version?

Running our agents as docker containers made downgrading complicated. Instead we blocked 169.254.169.254 on the host using iptables.

iptables -I DOCKER-USER -d 169.254.169.254 -j REJECT

It is important to add the rule to the DOCKER-USER chain, or it will have no effect inside the containers. Using REJECT (rather than DROP) makes the HTTP request fail instantly.

Unfortunately that didn't help. I ran it on the host, not in the agent container.

How can I tell my agent to go and update to the pre-release version?

If you're updating them from Azure DevOps Server, drop the zip/tar into "$env:programdata\Microsoft\Azure DevOps\Agents".
The ones used by default are the "vsts-agent" files.

I am running the agents as Linux containers. I can recreate them, but then they auto-update to the most recent release anyway and the issue persists.

The update capability checks that folder for the latest agent builds. Just dump them there; they will override the build from the latest tag because the build number is higher.

Cool, will do!

Version 3.220.1 was released on May 4th. However, our Azure DevOps environment still hasn't listed it as available. The README in this repo states (if I understand correctly) that a new agent should normally be available after at most 6 to 8 days.
Is there some delay in the release process, or should we do something to make the latest (pre-release) version available?
I know we can force-install the agent by specifying the install URL manually, but we'd rather use the latest agent that is presented through the '_apis/distributedtask/packages/agent' REST endpoint.
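For what it's worth, you can check which agent packages that endpoint currently advertises to your organization (a hedged sketch: <org> and <PAT> are placeholders, and the exact response shape may differ):

# Lists the agent packages currently offered to this organization;
# look for the linux-x64 entry and its version.
curl -sS -u :<PAT> "https://dev.azure.com/<org>/_apis/distributedtask/packages/agent"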

@HumanPrinter It's a beta release; you need to apply it manually as per the posts above.

@HumanPrinter It's a beta release; you need to apply it manually as per the posts above.

I don't think that is correct. According to the README and issue #4221, pre-release only means that the version is still in the process of being gradually rolled out to all Azure DevOps organisations. Once all organisations have the version available, the release should be labeled 'Latest'. According to the README, this should normally take 6 to 8 days.

Well then you have answered your own question.

How can I tell my agent to go and update to the pre-release version?

If you're updating them from Azure DevOps Server, drop the zip/tar into "$env:programdata\Microsoft\Azure DevOps\Agents". The ones used by default are the "vsts-agent" files.

@desmondkung @turowicz The post above is specific to a Windows machine. That environment variable doesn't appear to exist on a Linux machine. How do we force the update on Linux?

How can I tell my agent to go and update to the pre-release version?

If you're updating them from Azure DevOps Server, drop the zip/tar into "$env:programdata\Microsoft\Azure DevOps\Agents". The ones used by default are the "vsts-agent" files.

@desmondkung @turowicz The post above is specific to a Windows machine. That environment variable doesn't appear to exist on a Linux machine. How do we force the update on Linux?

@jovere I don't recall that Azure DevOps Server can be installed on a Linux machine. The folder I specified is on the server, not on the agent.

I am using Azure DevOps and self-hosted Linux agents. Currently there seems to be no way to run a pre-release version of the agent.

If you're using Azure DevOps Services (the online offering), you're stuck until MS updates the backend.

If there's no way of running a pre-release version, why is it available? Seems pretty counterintuitive. There's got to be some way to update this manually without completely uninstalling it first.

@turowicz What about spinning up a Linux VM and installing the agent there manually as a temporary workaround? The command to link it back to Azure DevOps Services is identical to adding a new agent. You could add a new capability to differentiate it from other agents in the same pool, or create a new agent pool.

I use the self-hosted agent in a Linux container, as documented here. I updated the start.sh file that the docs talk about by adding these lines right after the existing AZP_AGENT_PACKAGE_LATEST_URL=... line.

AZP_AGENT_PACKAGE_LATEST_URL=https://vstsagentpackage.azureedge.net/agent/3.220.1/vsts-agent-linux-x64-3.220.1.tar.gz
export AZP_AGENT_DOWNGRADE_DISABLED=true

This works. I will of course have to remove those lines once 3.220.1 or later becomes "latest" so that the agent can auto-upgrade in the future.

@twinter-amosfivesix that script change helped, thank you so much!

It seems that version v3.220.1 was skipped and version v3.220.2 is the new latest, which fixed the delay issue 👍

Closing this since it has already been resolved starting with v3.220.2.
Thank you all for your patience.