tiredofit / docker-traefik-cloudflare-companion

Automatically Create CNAME records for containers served by Traefik

Container stalls after "Starting Zabbix Agent"

zombielinux opened this issue

I've got the following log:

[s6-init] making user provided files available at /var/run/s6/etc...exited 0.
[s6-init] ensuring user provided files have correct perms...exited 0.
[fix-attrs.d] applying ownership & permissions fixes...
[fix-attrs.d] 00-functions: applying... 
[fix-attrs.d] 00-functions: exited 0.
[fix-attrs.d] 01-s6: applying... 
[fix-attrs.d] 01-s6: exited 0.
[fix-attrs.d] 02-zabbix: applying... 
[fix-attrs.d] 02-zabbix: exited 0.
[fix-attrs.d] 03-logrotate: applying... 
[fix-attrs.d] 03-logrotate: exited 0.
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] 00-startup: executing... 
[cont-init.d] 00-startup: exited 0.
[cont-init.d] 01-timezone: executing... 
[NOTICE] ** [timezone] Timezone: Setting to 'America/New_York' from 'Etc/GMT'
[cont-init.d] 01-timezone: exited 0.
[cont-init.d] 02-permissions: executing... 
[cont-init.d] 02-permissions: exited 0.
[cont-init.d] 03-zabbix: executing... 
[cont-init.d] 03-zabbix: exited 0.
[cont-init.d] 04-cron: executing... 
[NOTICE] ** [cron] Disabling Cron
[cont-init.d] 04-cron: exited 0.
[cont-init.d] 05-smtp: executing... 
[NOTICE] ** [smtp] Disabling SMTP Features
[cont-init.d] 05-smtp: exited 0.
[cont-init.d] 10-cloudflare-companion: executing... 
[NOTICE] ** [traefik-cloudflare-companion] Setting Traefik 2.x Mode
[cont-init.d] 10-cloudflare-companion: exited 0.
[cont-init.d] 99-container: executing... 
[cont-init.d] 99-container: exited 0.
[cont-init.d] done.
[services.d] starting services
[services.d] done.
[INFO] ** [traefik-cloudflare-companion] Starting Traefik Cloudflare Companion
[INFO] ** [zabbix] Starting Zabbix Agent

My docker-compose looks like this:

    image: tiredofit/traefik-cloudflare-companion:latest
    container_name: cloudflare-companion
    networks:
     - traefik_proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - TIMEZONE=$TZ
      - TRAEFIK_VERSION=2
      - CF_EMAIL=$CLOUDFLARE_EMAIL
      - CF_TOKEN=$CLOUDFLARE_API_KEY
      - TARGET_DOMAIN=$DOMAINNAME
      - DOMAIN1=$DOMAINNAME
      - DOMAIN1_ZONE_ID=$CLOUDFLARE_ZONEID
      - DOMAIN1_PROXIED=FALSE
    restart: always
    deploy:
      placement:
        constraints:
          - "node.role==manager"

Logging into the container and executing the scripts in /etc/cont-init.d/ shows only a single issue, with "03-zabbix", as shown below:

mkdir: can't create directory '': No such file or directory
chown: unknown user 
chown: unknown user 

At a cursory glance, it's failing to create a logfile somewhere along the line and then dropping out of the whole thing.

You can turn Zabbix off with ENABLE_ZABBIX=FALSE.

All looks normal to me. Can you give me the output of a ps -ef?
Thanks

Sure can. See below

    1 root      0:00 s6-svscan -t0 /var/run/s6/services
   31 root      0:00 s6-supervise s6-fdholderd
  758 root      0:00 s6-supervise 03-zabbix
  760 root      0:00 s6-supervise 10-cloudflare-companion
  762 zabbix    0:00 zabbix_agentd -f
  764 root      0:01 python -u /usr/sbin/cloudflare-companion
  798 zabbix    0:00 zabbix_agentd: collector [idle 1 sec]
  799 zabbix    0:00 zabbix_agentd: listener #1 [waiting for connection]
  800 zabbix    0:00 zabbix_agentd: listener #2 [waiting for connection]
  801 zabbix    0:00 zabbix_agentd: active checks #1 [idle 1 sec]
  841 root      0:00 bash
  846 root      0:00 ps -ef

PID 764 shows that the container is running, as is Zabbix.
I find that with some of the changes I've made to the base images lately, just running the scripts inside cont-init.d won't give you the expected output, as it's hardcoded to look for a different path (I believe /var/run/s6/cont-init.d).

But back to the matter at hand, you'll only get output from the python script if it can find a matching rule in your Traefik labels section.

Try this on a sample container:

    labels:
      - traefik.enable=true
      - traefik.http.routers.example.rule=Host(`dns.example.com`)
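For illustration, the matching step boils down to pulling the backtick-quoted hostnames out of the rule label. This is a sketch, not the script's actual code; the regex and function name are assumptions:

```python
import re

# Sketch: extract hostnames from a Traefik rule label,
# e.g. Host(`dns.example.com`) or HostHeader(`dns.example.com`).
def extract_domains(rule: str) -> list:
    return re.findall(r"Host(?:Header)?\(`([^`]+)`\)", rule)
```

A label whose rule contains no Host()/HostHeader() matcher yields nothing, which is why a container without such a rule produces no output from the companion.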

I have a helloworld container running with the following labels:

      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.helloworld.rule=HostHeader(`helloworld2.$DOMAINNAME`)"
        #- "traefik.http.routers.helloworld.rule=Host(`helloworld2.$DOMAINNAME`)"
        - "traefik.http.routers.helloworld.rule=Host(`helloworld2.tld.org`)"
        - "traefik.http.routers.helloworld.entrypoints=websecure"
        - "traefik.http.routers.helloworld.tls.certresolver=dns-cloudflare"
        - "traefik.http.services.helloworld.loadbalancer.server.port=80"
        #HTTPS Redirect Code
        - "traefik.http.middlewares.helloworld-https.redirectscheme.scheme=https"
        - "traefik.http.routers.helloworld-insecure.middlewares=helloworld-https@docker"
        - "traefik.http.routers.helloworld-insecure.rule=Host(`helloworld2.$DOMAINNAME`)"
        - "traefik.http.routers.helloworld-insecure.entrypoints=web"

Where $DOMAINNAME = tld.org

That should definitely do it.
It should be monitoring the docker socket and showing some sort of response like so:


cloudflare-companion    | container rule value:  Host(`dns.example.com`)
cloudflare-companion    | extracted_domains from rule:  [u'dns.example.com']
cloudflare-companion    | Found Container: 670c82dc337067c35c7603969211e701b5d0fe6f28c60e4c92a7f77a038739e2 with Hostname dns.example.com

FWIW, I'm running docker-swarm across a few machines, but the cloudflare-companion container is able to ping the traefik container with ease.

But I'm a bit befuddled as to why it doesn't work.

Typically when running swarm, your socket would be network based. How do you have the socket connected with Traefik? We're assuming that you are using /var/run/docker.sock on your system, which is where I believe this issue is occurring.

You can change the socket entry point to something TCP oriented by setting the environment variable DOCKER_ENTRYPOINT.
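In Python terms, that selection could look like the sketch below. Only the variable name DOCKER_ENTRYPOINT comes from the image; the helper name and default are illustrative, with the default mirroring the /var/run/docker.sock mount from the compose file above:

```python
import os

# Sketch: pick the Docker API endpoint the companion should talk to.
# DOCKER_ENTRYPOINT (from the image's docs) overrides the default
# unix socket with a TCP endpoint such as tcp://192.0.2.10:2376.
def docker_base_url(env=None) -> str:
    env = os.environ if env is None else env
    return env.get("DOCKER_ENTRYPOINT", "unix://var/run/docker.sock")
```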

Hopefully someone else can speak up on this one; there are an awful lot of users of this image (and its companion, nginx-proxy-cloudflare-companion), so I would think someone must have figured this out.

Yep. That's how I'm doing it.

      - /var/run/docker.sock:/var/run/docker.sock:ro

Both containers are bound to the same manager host as well.

I'm honestly pretty new to the whole docker ecosystem, so I need to find some good documentation on choosing socket entry points.

From inside your cloudflare companion container, can you make sure you can talk to the socket?
This command will show all the images pulled onto your Docker host:
curl --unix-socket /var/run/docker.sock http://localhost/v1.24/images/json

I don't really need to see the output, but let's just make sure that it's showing a JSON list of images. (It should be a series of IDs and parent IDs.)
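The same check can be scripted against the socket with Python's stdlib. This is a sketch under the same assumptions as the curl command (API version v1.24, socket mounted at /var/run/docker.sock); the function names are illustrative:

```python
import json
import socket

# Sketch: GET /v1.24/images/json over the mounted unix socket,
# the same request the curl command above makes.
def fetch_images_raw(sock_path="/var/run/docker.sock") -> bytes:
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    s.sendall(b"GET /v1.24/images/json HTTP/1.0\r\nHost: localhost\r\n\r\n")
    chunks = []
    while True:
        data = s.recv(65536)
        if not data:
            break
        chunks.append(data)
    s.close()
    # Drop the HTTP response headers, keep the JSON body
    return b"".join(chunks).split(b"\r\n\r\n", 1)[1]

# Each element of the JSON list should carry an "Id" beginning with "sha256:"
def image_ids(body: bytes) -> list:
    return [img["Id"] for img in json.loads(body)]
```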

I've got a big JSON string; each element has an "Id" (starting with sha256:). There is a "ParentId" key as well, but its value is null for all of them.

I DO see all my running and expected containers though.

OK, good enough. So at least we can talk to the socket, which is where all the info comes from. I truly am stumped here as to what could be happening. The image itself is pretty basic in what it does, in a very hackish way. I'm going to reach out to some of my clients and see if any are running in a swarm environment, to see if there is something being missed here.

Any word from your clients?

I have a few who are running swarm and on the TCP socket, yet none are reporting issues. I'll test on a burner system today to see if I can recreate it. We did have some nasty stuff happening in the past week with Traefik 2.2.2+, which was finally resolved when we moved to 2.2.6 - but the cloudflare container wouldn't have been affected by it.

I set up an Ubuntu 20.04 server last night, exposed the Docker socket via TCP, and ran a few tests from both the local machine and a remote machine. I used the value DOCKER_ENTRYPOINT=tcp://(host_ip):2376. My Docker socket was listening on 0.0.0.0 and I turned off any firewalls that would limit access.

On test #1 (same machine to Docker socket) it worked as expected.
On test #2 (remote cloudflare-companion image pointing at the remote IP host) it again worked as expected.

There has to be something else that is blocking this. I didn't do anything fancy with my setup; I had it up and running within 10 minutes of first boot. I'm back to being stumped and hoping someone is able to step in.

Have a peek at this PR.

The submitter added TLS support along with some additional variables, which might solve your problem, assuming it is TLS related. Note that you'd probably have to export a data volume for your Docker certificates as well; there is an environment variable built to support that. Let me know if that changes anything.

Edited to add reference links

Hi there,

I am also having this issue. I am running Traefik v2.2.7 in swarm mode on a single node. The "manual" entry for a non-docker service works, and the CNAME in CF has been added correctly. The dockerized services are not being found.

I've been trying to figure out what's happening by doing a straight comparison of the differences between the cf-companion service and all my other services...

I am not too well versed in the code, but how/where is cf-companion looking for the labels? My 30-second google-fu shows that there might be a difference between service labels and container labels. In Portainer, all of my services have service labels, not container labels. I noticed that the manual entry on the cf-companion service shows up as a container label, not a service label.

cf-companion labels:
[screenshots omitted]

random-other-service labels:
[screenshots omitted]

The only difference I see between the yml for cf-companion and all my services is the deploy key:

#My regular services yml
deploy:
  labels:
    - "LABELS"

versus

#CF-Companion yml
labels:
  - "LABELS"

Can you give me the output (fuzz it if you need to) of one of your containers with labels?
docker inspect (containername)
You may be onto something here... We're pulling them from the JSON keys Config: , Labels:

[screenshot omitted]

I navigated down to the Config: Labels area. I see no mention of the traefik labels... or anything that would show the Host(`etc....

So I did docker service inspect, and tada!
[screenshot omitted]

OK then. This helps tremendously. Going to think about it for a night, and let's see what I can come up with.

I don't know much about it, but perhaps SWARM_MODE=TRUE might be good if it can be directed to the service labels versus the container labels. And then if it is swarm mode, the "non-dockerized services" could go under deploy in the cf-companion yml as well. Unfortunately I don't know enough about coding to do a PR!
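In docker SDK terms, the distinction the inspect output shows can be sketched like this. The attribute paths (Spec.Labels for services, Config.Labels for containers) come from the inspect output discussed above; the helper name and guards are illustrative, not the companion's actual code:

```python
# Sketch: where labels live in `docker inspect` output.
# Swarm services keep them under Spec.Labels; plain containers
# under Config.Labels. The `or {}` guards avoid the
# "'NoneType' object has no attribute 'get'" error that appears
# when a service defines no labels at all.
def get_labels(attrs: dict, swarm_mode: bool = False) -> dict:
    if swarm_mode:
        return (attrs.get("Spec") or {}).get("Labels") or {}
    return (attrs.get("Config") or {}).get("Labels") or {}
```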

I like the SWARM_MODE idea. I've put together a test version on Docker Hub. Can you try pulling tiredofit/traefik-cloudflare-companion:develop with SWARM_MODE=TRUE as an environment variable?

Got this in the logs:

[INFO] ** [traefik-cloudflare-companion] Starting Traefik Cloudflare Companion
Traceback (most recent call last):
  File "/usr/sbin/cloudflare-companion", line 59, in <module>
    init()
  File "/usr/sbin/cloudflare-companion", line 49, in init
    check_container(c)
  File "/usr/sbin/cloudflare-companion", line 29, in check_container
    for prop in c.attrs.get(u'Spec').get(u'Labels'):
AttributeError: 'NoneType' object has no attribute 'get'

Can you send me privately the entire inspect output? Based on the indenting of the output above I may be missing a key. I'm dave at tiredofit dot ca .

Sent!

Odd. Still haven't seen it. No sign on my MTA either.

Sent it again. It might be because I pasted it as plain text in the email... so this time I sent it as a .txt attachment.

Received (oddly, at the same time). I'm going to need a few days (a week max) to parse it and set up a test environment. My 'real world' role has just required more time than I anticipated. I'll be in touch.

See tiredofit/traefik-cloudflare-companion:6.0.0 for a working SWARM_MODE=TRUE.