microsoft / azure-pipelines-agent

Azure Pipelines Agent 🚀


[Question]: Any known issues that might cause a delay of exactly 100 s between receiving a job and starting its execution?

tvachev opened this issue

Describe your question

We experience a delay of 100 s (1:40 min) between queuing and initializing a job when executing the same pipeline on self-hosted Linux agents (on CentOS and Debian).

2023-03-23T14:06:16.0676904Z ##[section]Starting: Job
2023-03-23T14:07:56.4086577Z ##[section]Starting: Initialize job

On Microsoft hosted agents it is nearly instantaneous.

Part of the agent worker log when it happens:

[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Root': '/home/ladmin/myagent'
[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Work': '/home/ladmin/myagent/_work'
[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Bin': '/home/ladmin/myagent/bin'
[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Root': '/home/ladmin/myagent'
[2023-03-23 11:52:19Z INFO HostContext] Well known directory 'Work': '/home/ladmin/myagent/_work'
[2023-03-23 11:52:19Z INFO JobServerQueue] Try to append 1 batches web console lines for record '12f1170f-54f2-53f3-20dd-22fc7dff55f9', success rate: 1/1.
[2023-03-23 11:52:34Z INFO JobServerQueue] Stop aggressive process web console line queue.
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Bin': '/home/ladmin/myagent/bin'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Root': '/home/ladmin/myagent'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Work': '/home/ladmin/myagent/_work'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Tools': '/home/ladmin/myagent/_work/_tool'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Bin': '/home/ladmin/myagent/bin'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Root': '/home/ladmin/myagent'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Work': '/home/ladmin/myagent/_work'
[2023-03-23 11:53:59Z INFO HostContext] Well known directory 'Temp': '/home/ladmin/myagent/_work/_temp'

The pipeline definition is probably irrelevant, as we also tried extremely simple ones.
The problem does not stem from waiting for an available agent in the pool.

Any idea why this might be happening or if it is a known issue?

Thanks!

Versions

Agent version is 2.218.1.

Environment type (Please select at least one environment where you face this issue)

  • Self-Hosted
  • Microsoft Hosted
  • VMSS Pool
  • Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Operating system

CentOS 8, Debian 11

Version control system

git

Azure DevOps Server Version (if applicable)

No response

https://github.com/microsoft/azure-pipelines-agent/pull/4166/files seems like a likely candidate

Is there any way of turning the telemetry off completely?

(the default HttpClient timeout is 100s)
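A quick way to sanity-check this from an affected agent host (the metadata URL, header, and 10-second cap below are only illustrative, not necessarily the exact request the agent makes) is to see whether a call to 169.254.169.254 hangs rather than failing fast:

# If this hangs until --max-time expires instead of failing immediately, packets
# to the instance metadata address are being silently dropped, and the agent's
# telemetry request will sit there until HttpClient's default 100 s timeout fires.
time curl -sS --max-time 10 -H "Metadata: true" \
  "http://169.254.169.254/metadata/instance?api-version=2021-02-01"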

Downgrading to v2.217.2 has fixed the delay in our environment.

Hi @tvachev, thanks for reporting! We will take a look

@blushingpenguin Hi Mark. This is also hitting us. Many thanks for reporting.
What is the procedure to downgrade to an older agent version? I could not find any good information.


@mmunte-impeo
  • Remove the newer version.
  • Disable agent auto-update in Azure DevOps.
  • Install an older version; @blushingpenguin suggests 2.217.2 (see the sketch below).

I can confirm it works. (Strangely enough, we don't see the problem when running newer agent versions on Windows.)
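A rough command-line sketch of the manual downgrade on a Linux host, assuming the agent lives in /home/ladmin/myagent (the path from the log above) and runs as a systemd service; <PAT>, <org> and <pool> are placeholders:

# Stop and unregister the current (newer) agent.
cd /home/ladmin/myagent
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove --auth pat --token <PAT>

# Unpack the older release into a fresh directory
# (same download URL pattern used further down in this thread).
mkdir -p /home/ladmin/myagent-2.217.2 && cd /home/ladmin/myagent-2.217.2
curl -LsS https://vstsagentpackage.azureedge.net/agent/2.217.2/vsts-agent-linux-x64-2.217.2.tar.gz | tar -xz

# Re-register and start the service. Also disable agent auto-update in the
# organization's pipeline settings so it is not upgraded straight back.
./config.sh --url https://dev.azure.com/<org> --auth pat --token <PAT> --pool <pool>
sudo ./svc.sh install
sudo ./svc.sh start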

I can confirm that 2.218.1 has the 100-second delay.

Running our agents as docker containers made downgrading complicated.
Instead we blocked 169.254.169.254 on the host using iptables.

iptables -I DOCKER-USER -d 169.254.169.254 -j REJECT

It is important to add the rule to the DOCKER-USER chain, or it will have no effect inside the containers.
Using REJECT (rather than DROP) makes the HTTP request fail instantly.
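To verify the rule actually takes effect inside containers, here is a quick check (the curlimages/curl image is just a convenient throwaway example):

# Should fail immediately (connection refused) rather than hanging,
# because the REJECT rule answers the connection attempt right away.
docker run --rm curlimages/curl -sS --max-time 5 http://169.254.169.254/metadata/instance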

We are also having this issue.

Same issue with self-hosted 3.xxx.x on macOS. Can't say if we had the same with 2.218.1.

Same issue here, exactly 1:40 each time a pipeline runs, on self-hosted Azure agents.


Same issue.
Agent Version 2.218.0/3.218.0

It looks to me like there is a connection that doesn't use the proxy.

If I understand the code correctly, src/Agent.Sdk/Util/PlatformUtil.cs at line 325 makes a connection without the host context:
HttpResponseMessage response = await httpClient.GetAsync(serverFileUrl);

The same file, at line 28, creates the HttpClient without the context:
private static HttpClient httpClient = new HttpClient();

In other parts of the code I see (src/Agent.Listener/Configuration/ConfigurationProvider.cs, line 359/260):
using (var handler = HostContext.CreateHttpClientHandler())
using (var httpClient = new HttpClient(handler))

Hope this helps.
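As a side note, when the agent is configured with a proxy via ./config.sh --proxyurl, that setting is stored in a .proxy file in the agent root, so you can at least confirm a proxy is configured even though the client above bypasses it (the path below is the one from the log earlier in this thread):

# Prints the configured proxy URL; the file does not exist if no proxy was set.
cat /home/ladmin/myagent/.proxy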


Downgrading worked for me too. Running in a docker container based on: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#linux

In that setup, to downgrade, make the following change in the start.sh script, then rebuild the image and redeploy the container(s):

#curl -LsS $AZP_AGENT_PACKAGE_LATEST_URL | tar -xz & wait $!
curl -LsS https://vstsagentpackage.azureedge.net/agent/2.217.2/vsts-agent-linux-x64-2.217.2.tar.gz | tar -xz & wait $!

The route via capabilities did not work for me; perhaps I'm doing something wrong there.
Hope this helps people like me who land on this issue from Google.

Downgrading by using curl directly with a previous version doesn't seem to work for me. Bizarre how it's all fine if I target the latest version. Any ideas?



Look in the /azp/_diag folder logs for error messages, e.g.:

docker container list # find container name
docker exec -it $CONTAINER_NAME /bin/bash
    cd /azp/_diag/
    cat Agent_*
    exit

If you are hitting a bug that is fixed in the newer versions then I guess you are kinda out of luck on this one, unless there is an easy workaround there.

I believe this is a bug at this point. Do we know when this will be resolved?

Downgrading by using curl directly with a previous version doesn't seem to work for me. Bizarre how it's all fine if I target the latest version. Any ideas?


Make sure your target host / container is linux-x64. If not, you may need to tweak the curl link.

I was able to downgrade successfully using a Linux Docker container. With v2.217.2 there is no more delay in the agent. I'll set it back to the latest version once I confirm this issue is resolved.

@kirill-ivlev This issue is causing tremendous slowdowns in our builds and releases. When is this going to be fixed? It has been 3 weeks with no update.

Facing the same issue. Was using 3.218.0 and discovered this issue. Downgrading to 2.218.1 made no difference. Downgrading further to 2.217.2 works, for now.

Downgrading by using curl directly with a previous version doesn't seem to work for me. Bizarre how it's all fine if I target the latest version. Any ideas?


I figure this might be related to Ubuntu 22.04 and the way it handles SSL. Downgrading to Ubuntu 20.04 allowed me to downgrade the agent version to 2.217.2.


Facing the same issue in 3.218. When is it going to be fixed?

We are having the same issue on all our Linux agents with 3.218. The downgrade didn't work for us because of an SSL exception. When will this be fixed?

If you don't want to use iptables, you can also work around this by adding a blackhole route:

sudo ip route add blackhole 169.254.169.254

It does not persist across reboots, so you need to handle that.
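One way to handle that, assuming a systemd-based distro (the unit name below is arbitrary; adjust the path to ip if it lives elsewhere, e.g. /sbin/ip):

# Create a oneshot unit that re-adds the blackhole route at every boot.
sudo tee /etc/systemd/system/blackhole-imds.service >/dev/null <<'EOF'
[Unit]
Description=Blackhole 169.254.169.254 to avoid the agent telemetry hang
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ip route add blackhole 169.254.169.254
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now blackhole-imds.service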

The blackhole trick worked on our servers.
Should this maybe be retagged as BUG instead of QUESTION?
Might that be the reason why this is not moving forward?

@tfabraham We are working on the fix now and expect that it will be delivered with the next version.
I updated the title and added a bug label.

At my company we are facing the same issue!
If the mentioned PR is the cause, I would strongly suggest providing a feature to turn this telemetry off completely.
The thing is, self-hosted agents are usually behind the company's firewall, which will eventually block or throttle this traffic (as in my case).

So I beg you, please provide a feature to turn this off.


At my company we are facing the same issue! If the mentioned PR is the cause, I would strongly suggest providing a feature to turn this telemetry off completely. The thing is, self-hosted agents are usually behind the company's firewall, which will eventually block or throttle this traffic (as in my case).

So I beg you, please provide a feature to turn this off.

This helps >> sudo ip route add blackhole 169.254.169.254

At my company we are facing the same issue! If the mentioned PR is the cause, I would strongly suggest providing a feature to turn this telemetry off completely. The thing is, self-hosted agents are usually behind the company's firewall, which will eventually block or throttle this traffic (as in my case).
So I beg you, please provide a feature to turn this off.

This helps >> sudo ip route add blackhole 169.254.169.254

Just a hotfix, but it works. Thank you!

Running our agents as docker containers made downgrading complicated. Instead we blocked 169.254.169.254 on the host using iptables.

iptables -I DOCKER-USER -d 169.254.169.254 -j REJECT

It is important to add the rule to the DOCKER-USER chain, or it will have no effect inside the containers. Using REJECT (rather than DROP) makes the HTTP request fail instantly.

This solved my problem. I find it very interesting that the problem is still not permanently resolved.

For Mac this doesn't seem to solve the problem:

sudo ip route add blackhole 169.254.169.254

For Mac this doesn't seem to solve the problem:

sudo ip route add blackhole 169.254.169.254

This works on Ubuntu! Thanks!

Any update on this one? It's causing a lot of waiting time in our pipelines.

Seeing the same behaviour with Azure DevOps Agent version 3.218.0 on Ubuntu 22.04.


We host our pipeline agents in Ubuntu 22.04 containers and are seeing this same behavior: jobs get assigned an agent, then freeze for 100 s before continuing.

When will we see a fix for this issue?

Same issue. Lots of wasted build time. Following for fix.

FYI everyone the release that contains the fix just came out four hours ago (May 4, 2023 7:42 AM EDT). I will be deploying tonight.

https://github.com/microsoft/azure-pipelines-agent/releases/tag/v3.220.1

FYI everyone the release that contains the fix just came out four hours ago (May 4, 2023 7:42 AM EDT). I will be deploying tonight.

https://github.com/microsoft/azure-pipelines-agent/releases/tag/v3.220.1

I just updated my agent to 3.220.1 and the issue still persists.
The only thing that helps is blocking the IP, as @SabareeshGC stated above, with sudo ip route add blackhole 169.254.169.254 on Ubuntu.

Edit: Never mind, I accidentally updated to 3.220.0... I will update my comment when 3.220.1 is officially released.

Edit 2: After 3.220.2 was released, everything works as expected again; the workaround is no longer needed!

I restarted my agents last night assuming they would pull in the new version, but they did not. I noticed this morning that v3.220.1 is a pre-release, so I have to change things to specifically use that version.

Tested the new agents last night and have been very happy since. The 100-second timeout is resolved. Ubuntu 22.04.

When can we expect v3.220.1 to go out of pre-release? Sorry, I just read the text on the repo's homepage about pre-releases :)

How can I tell my agent to go and update to the pre-release version?

Running our agents as docker containers made downgrading complicated. Instead we blocked 169.254.169.254 on the host using iptables.

iptables -I DOCKER-USER -d 169.254.169.254 -j REJECT

It is important to add the rule to the DOCKER-USER chain, or it will have no effect inside the containers. Using REJECT (rather than DROP) makes the HTTP request fail instantly.

Unfortunately that didn't help. I ran it on the host, not in the agent container.

How can I tell my agent to go and update to the pre-release version?

If you're updating them from Azure DevOps Server, drop the zip/tar into "$env:programdata\Microsoft\Azure DevOps\Agents".
The ones used by default are the "vsts-agent" files.

I am running the agents as Linux containers. I can recreate them, but then they auto-update to the most recent release anyway and the issue persists.

The update capability checks that folder for the latest agent builds. Just dump them there; they will override the build from the latest tag because the build number is higher.

Cool, will do!

Version 3.220.1 was released on May 4th. However, our Azure DevOps environment still hasn't listed it as available. The README in this repo states (if I understand correctly) that a new agent should normally be available after at most 6 to 8 days.
Is there some delay in the release process, or should we do something to make the latest (pre-release) version available?
I know we can force-install the agent by specifying the install URL manually, but we'd rather use the latest agent that is presented through the '_apis/distributedtask/packages/agent' REST endpoint.
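For what it's worth, you can check which agent packages that endpoint currently advertises to your organization (a hedged sketch: <org> and <PAT> are placeholders, and the exact response shape may differ):

# Lists the agent packages currently offered to this organization;
# look for the linux-x64 entry and its version.
curl -sS -u :<PAT> "https://dev.azure.com/<org>/_apis/distributedtask/packages/agent"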

@HumanPrinter It's a beta release; you need to apply it manually as per the posts above.

@HumanPrinter It's a beta release; you need to apply it manually as per the posts above.

I don't think that is correct. According to the README and issue #4221, pre-release only means that the version is still in the process of being gradually rolled out to all Azure DevOps organisations. Once all organisations have the version available, the release should be labeled 'Latest'. According to the README, this should normally take 6 to 8 days.

Well then you have answered your own question.

How can I tell my agent to go and update to the pre-release version?

If you're updating them from Azure DevOps Server, drop the zip/tar into "$env:programdata\Microsoft\Azure DevOps\Agents". The ones used by default are the "vsts-agent" files.

@desmondkung @turowicz The post above is specific to a Windows machine. That environment variable doesn't appear to exist on a Linux machine. How do we force the update on Linux?

How can I tell my agent to go and update to the pre-release version?

If you're updating them from Azure DevOps Server, drop the zip/tar into "$env:programdata\Microsoft\Azure DevOps\Agents". The ones used by default are the "vsts-agent" files.

@desmondkung @turowicz The post above is specific to a Windows machine. That environment variable doesn't appear to exist on a Linux machine. How do we force the update on Linux?

@jovere I don't recall that Azure DevOps Server can be installed on a Linux machine. The folder I specified is on the server, not on the agent.

I am using Azure DevOps and self-hosted Linux agents. Currently there seems to be no way to run a pre-release version of the agent.

If you're using Azure DevOps Services (the online offering), you're stuck until MS updates the backend.

If there's no way of running a pre-release version, why is it available? Seems pretty counterintuitive. There's got to be some way to update this manually without completely uninstalling it first.

@turowicz What about spinning up a Linux VM and installing the agent there manually as a temporary workaround? The command to link it back to Azure DevOps Services is identical to adding a new agent. You could add a new capability to differentiate it from other agents in the same pool, or create a new agent pool.

I use the self-hosted agent in a Linux container, as documented here. I updated the start.sh file that the docs talk about by adding these lines right after the existing AZP_AGENT_PACKAGE_LATEST_URL=... line.

AZP_AGENT_PACKAGE_LATEST_URL=https://vstsagentpackage.azureedge.net/agent/3.220.1/vsts-agent-linux-x64-3.220.1.tar.gz
export AZP_AGENT_DOWNGRADE_DISABLED=true

This works. I will of course have to remove those lines once 3.220.1 or later becomes "latest" so that the agent can auto-upgrade in the future.

@twinter-amosfivesix that script change helped, thank you so much!

It seems that version v3.220.1 was skipped and version v3.220.2 is the new latest, which fixed the delay issue 👍

Closing this since it has already been resolved starting with v3.220.2.
Thank you all for your patience.