kubernetes / kubernetes

Production-Grade Container Scheduling and Management

Home Page: https://kubernetes.io

Connecting containers

smarterclayton opened this issue · comments

I wanted to lay out the general problem of interconnecting containers in terms of use cases and requirements and map out some of the solutions that matter at different scales, to help start a discussion about what additional constructs would be useful in Kubernetes.

Problem statement:

  1. A container may depend on one or more remote network services, available over standard IPv4/6 networking
  2. The address of the remote network service can change at any time - but it is desirable not to require a restart of the container or containers that depend on that remote service.
  3. It should be possible to make the connection between the container and the remote network service secure without changing the container (introduce a TLS tunnel or VPN)
  4. Software inside a container must be able to identify the network address / destination of a specific remote service and distinguish between multiple services
  5. Stateless remote services should be able to be transparently load balanced (from the container's perspective) without the software inside the container having to change
  6. Some client software, talking to a remote network service, is more reliable when some or all of the remote service endpoints are directly available (a MongoDB client can tolerate the loss of one of the cluster instances if all of the instance IPs are available)
  7. In general, solutions that favor the broadest possible set of software (including software that is not specifically written with the solution in mind) are preferred
  8. Where possible, performance of network traffic should be on par with direct connections

Ways of identifying the address of a remote service to software in a container:

  1. Environment variables
    • Client code has a convention to look up certain environment variables passed to it (an example sketch follows this list)
      • Examples: Docker links v1, most PaaS, 12-factor apps
    • Advantages
      • common pattern
    • Disadvantages
      • client code has to be written to use the specific name
      • no one uses the same names, not all names make sense for all use cases
        • Reqt: allow redefining names
      • have to restart the process to pick up a new environment value
  2. Dynamic config file
    • Client code reads a configuration file generated by the host infrastructure from a template
      • Examples: confd, OpenShift templatizing conf with erb
    • Advantages
      • works well for most common services
    • Disadvantages
      • someone needs to write the template
      • have to define a template format everyone can/will use, or support multiple (good luck...)
      • requires specific client code to reload the file, otherwise a process restart
  3. Static cluster description file
    • A file describing relationship destinations, generated by the host infrastructure and placed into the container
    • Advantages
      • can be more complex than environment variables
    • Disadvantages
      • client code has to read the file and use it
      • have to define a cluster format everyone can/will use (good luck...)
      • requires specific client code to reload file, otherwise process restart
  4. Read from remote service discovery service
    • Client code runs and connects to a remote service, reads values
      • Examples: Zookeeper/etcd
    • Advantages
      • client code can react to changes
    • Disadvantages
      • client code may need a library, has to be written to load that value and react to changes, and the rest of the code still has to react
      • harder than environment variables
      • how does the client code know where the remote service is? (recursive)
      • how is the connection between the client code and the remote service secured?
      • can the infrastructure for the service discovery be shared efficiently?
  5. Read from local service discovery service
    • Client code runs and connects to a local service, reads values, updates anytime something changes
      • Examples: Zookeeper/etcd with local cache, Docker links v2 (ish)
    • Advantages
      • can perform more complicated lookups
      • arbitrarily low latency on changes, as long as code is written to pull that
      • client code can depend on a file descriptor passed to the process or a convention
    • Disadvantages
      • client code may need a library, has to be written to load that value and react to changes, and the rest of the code still has to react
      • requires distribution of destination mappings to many systems (all outbound links to each container on a host)
  6. Run local service discovery client in the container
    • The top process in the container is a process monitor / manager, which connects to a remote endpoint and provides information
      • Examples: fabric8 Java Docker containers (each container is itself a host for smaller services in the JVM)
      • Note that this is no different than any other container that exposes no metadata to the infrastructure, so it's supportable and supported.
    • Advantages
      • Is opaque to the calling infrastructure, and can connect to other service discovery mechanisms
    • Disadvantages
      • hides details about the container processes from the infrastructure
      • depends on a specific service discovery infrastructure, so the container code is coupled to an infrastructure.
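To make option 1 concrete (and to show why a restart is needed to pick up changes), here is a minimal Go sketch of a client resolving a database endpoint from link-style environment variables. The variable names assume a hypothetical link alias of "db" and are illustrative only:

```go
// Minimal sketch of option 1: resolve a database endpoint from
// Docker-links-v1-style environment variables. The names
// DB_PORT_3306_TCP_ADDR / DB_PORT_3306_TCP_PORT assume a hypothetical
// link alias of "db".
package main

import (
	"fmt"
	"net"
	"os"
)

func endpointFromEnv() (string, error) {
	host := os.Getenv("DB_PORT_3306_TCP_ADDR")
	port := os.Getenv("DB_PORT_3306_TCP_PORT")
	if host == "" || port == "" {
		return "", fmt.Errorf("DB_PORT_3306_TCP_ADDR/PORT not set")
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	addr, err := endpointFromEnv()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The value is read once at startup: picking up a new address
	// requires restarting the process, which is the main drawback above.
	fmt.Println("connecting to", addr)
}
```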

Ways of allowing the address of a remote service to change over time:

  1. DNS
    • Requires that cluster membership changes be propagated to a DNS server efficiently
      • Examples: SkyDNS
    • Advantages
      • allows client software to easily get the address
      • allows multiple remote IP addresses to be returned
        • Caveat: only some client libraries properly try multiple IP addresses, and most browsers do not (in a way that is useful to end users); see the client-side retry sketch after this list
    • Disadvantages
      • latency for DNS propagation can be high (can be solved via clients)
      • most client libraries and code cache DNS entries for the entire life of the process, requiring a restart
      • client code still needs to hardcode the address, or get it from an environment
  2. Remote Proxy
    • Create a highly available proxy remotely that proxies ports to backends
    • Advantages
      • client software can rely on the proxy not moving
    • Disadvantages
      • all traffic routed through proxy may saturate proxy
      • proxy has to be HA, implement ip-address failover / heartbeating
      • adds latency to every network connection
      • malicious containers can attack proxy
  3. Local Proxy
    • Have an outbound proxy on each host that connects to remote hosts
    • Advantages
      • client software can rely on the proxy not moving
      • can implement outbound security rules to other locations
      • proxy is implicitly highly available and load balancing is local
    • Disadvantages
      • adds (small) latency to every network connection
      • requires distribution of destination mappings to many systems (all outbound links to each container on a host)
  4. iptables/tc Local Proxy
    • iptables, tc, or hardware routing defined to perform destination network-address-translation (DNAT) from local addresses to remote ones
      • Examples: geard, CoreOS jumpering
      • 127.0.0.1:3306 inside container maps to remote host 9.34.8.3:34856, running a hosted MySQL service
    • Advantages
      • can remap ports to match default development conventions (high port on proxy to common port internally)
      • can make container environment look identical across vastly different systems
      • can be changed on the fly without restart, without software changes
      • less latency than a local proxy for 1-1 connections
      • like SDN but cheaper
    • Disadvantages
      • iptables DNAT is slower than direct network access - although tc can get very close
      • requires distribution of destination mappings to many systems (all outbound links to each container on a host)
  5. Read from remote service discovery service
    • As above
  6. Read from local service discovery service
    • As above

Docker links (current and future)

Docker linking currently injects environment variables of a known form into a container, representing links defined on the host. The next iteration of Docker links will most likely implement local service discovery (a discovery endpoint injected into a container) via the definition of links on the host, with a proxy on that host connecting to outbound servers. It will also likely support adapters for exposing environment variables, dynamic config files, or a static cluster file.

Observations:

  1. The remote connections of a service do not change EXCEPT:
    • ... when the destination is a stateless scalable service (suitable for load balancing)
    • ... when creating or changing topology (itself rare)
    • ... when a failure occurs on a non-resilient network service and a new component replaces the old one at a different address
  2. Propagating changes is complex due to fan out (one container reused by ten or more containers) and the potential for propagation to traverse more than a single edge. Limiting propagation is important in a large scale solution.
  3. Hardcoding actual IP addresses (and ports, when port proxying is in play) inside environment variables is the simplest thing that works, but it has a set of disadvantages that led us, in OpenShift, to look into other options.
  4. We need to keep in mind that a significant part of connection handling will move into Docker links. However, there are still concerns about performance (a required proxy), environment variables (no way to change the environment variable names to match real use cases by parameterizing links), and the state of implementation (it doesn't exist yet).

I'm just going to suggest some other options too.

You can also run code inside the container; you might have two options:

You could try spending time to make two really great libraries that do service discovery and load balancing, with bindings in many languages, and let the software running inside the container use them.

Another option: because many already use a supervisor process, why not inject a supervisor process into the container that can handle things like the following (a minimal restart-only sketch follows below):

  • automatic restart of the process if it dies
  • automatic restart of the process if it runs out of memory (see the Facebook Tupperware video)
  • service discovery registration
  • health monitoring by connecting to a listening port
  • forking a proxy listening on 127.0.0.1 if the processes aren't supposed to use a direct connection

Configuration of the supervisor process can be through environment variables, service discovery, or both.

Update: what if that supervisor process could also ask Docker to create iptables or other rules for 127.0.0.1 to make the direct-connection endpoint IP dynamic again?
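For illustration only, here is a bare-bones version of that supervisor idea in Go, handling just the first bullet (restart on exit). Registration, health checks, and the local proxy are left out, and the one-second backoff is an arbitrary choice:

```go
// Minimal sketch of the in-container supervisor idea: run the wrapped
// command and restart it whenever it exits. Everything else the comment
// mentions (registration, health checks, local proxy) is omitted.
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: supervisor <command> [args...]")
	}
	for {
		cmd := exec.Command(os.Args[1], os.Args[2:]...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		start := time.Now()
		err := cmd.Run()
		log.Printf("process exited after %s: %v; restarting", time.Since(start), err)
		// A real supervisor would also report restarts to the
		// infrastructure so that "flapping" is visible outside the
		// container (a concern raised below).
		time.Sleep(time.Second)
	}
}
```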

Has anyone thought of assigning the same IP address to the same service on many hosts and using Equal-Cost Multi-Path (ECMP) routing with a metric and/or hashing to route the traffic to the different hosts (the gateway for the route is the host)? This does assume a flat network, though.

Update: flat unless you use tunnels between the hosts.

I'll add a note about libraries inside the container. That's kind of the docker links v2 approach, or the fabric approach (for Java micro services).

One thing I don't like about injecting a supervisor is that you lose info outside the container (about restarts) - users probably want to know that a service is "flapping", but a supervisor process hides that. Sometimes that's unavoidable, but in the general case there seems to be a preference for exposing enough static metadata about your container that general-purpose consumers can use it.

Some of our guys have tunnels being prototyped - worth noting.

Obviously, if you inject a supervisor which is owned by the infrastructure it can report restarts/flapping.

Yeah - that's one of the ideas floated for Docker links v2 - the challenge is that you then get into the business of process control. It would be interesting to explore Foreman- and systemd-style plugins that could handle this. There are some folks working on systemd-container, a set of changes that make systemd able to do userspace management more easily - I'll mention this to them.

It kind of reminds me of Pacemaker Remote too:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Remote/#_linux_container_use_case

Maybe just drop a unix domain socket in the container for communication?

@Lennie I agree with @smarterclayton on the downsides of running a supervisor inside the container, which are discussed in the pod documentation: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/pods.md . I don't think an in-container supervisor is necessary, though. Lifecycle hooks (#140) could be used for custom registration logic, though I think we should provide a solution that works out of the box.

FWIW, there is some discussion of this issue in the networking doc: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/networking.md .

I dislike environment-variable-based approaches to addressing (including Docker links and the current approach used by the service proxy), due to the lack of updatability, the resulting start order dependencies, scalability challenges, lack of true namespacing and hierarchy, etc. Imagine running a web browser in a container, with web sites running in other containers.

Breaking down this problem a bit:

  • Port lookup: The k8s IP-per-pod approach is intended to make port lookups unnecessary where the service has a well-known port. Theoretically, DNS-SRV could work for cases where dynamic port lookup was desirable for some reason (see the SRV lookup sketch after this list).
  • DDNS: I think we do want a decent DDNS system to quickly publish DNS for newly created services.
  • IP mobility: For example, IP per service. IP mobility would enable us to increase the stability of name-to-IP mappings in DDNS, would retain the fungibility and disposability of most pods, and could be used for live migration in the future. Ideally, this would not require quadratic amounts of routing state (which probably rules out iptables-based approaches).
  • Group lookup, watch, and readiness: Load balancers, load-balancing client libraries, client libraries supporting master election and/or failover, worker pool managers, etc. need to be able to lookup group membership and keep that membership up to date. In addition to addresses, these use cases also usually want readiness information (#620). We should provide an API for watching group members identified by a label selector. It would be nice if this API weren't tied to Kubernetes, so that it could be used more broadly. If there were a good enough 3rd-party API (in particular, that wasn't tied to a particular storage system), we could create a way to publish groups identified by label selectors to that API.
  • Application-specific discovery logic, especially for highly dynamic scenarios, such as mapreduce worker registration: We should ensure that it's possible to use etcd, Consul, Eureka, worker self-registration libraries, etc. for application-specific needs, e.g., by ensuring we retain the flat address space, making the pod IP available within the pod's containers, etc.
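As a sketch of the DNS-SRV possibility mentioned under "Port lookup", here is a Go client resolving both the host and the dynamically assigned port of a service from SRV records. The service and domain names ("mysql", "example.internal") are hypothetical:

```go
// Sketch of dynamic port lookup via DNS-SRV: resolve host and port of a
// service in one query. The names used here are hypothetical.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Looks up _mysql._tcp.example.internal SRV records.
	_, srvs, err := net.LookupSRV("mysql", "tcp", "example.internal")
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}
	for _, s := range srvs {
		// Each record carries a target host and a port, so clients do
		// not need a well-known port baked into their configuration.
		fmt.Printf("%s:%d (priority %d, weight %d)\n", s.Target, s.Port, s.Priority, s.Weight)
	}
}
```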

OK, so a system with multiple methods? Not just one. Yeah, I can see that. It is probably best. Maybe there are just too many situations to easily capture in one solution.

When live-migrating, I guess if DDNS TTL issues are a problem, in theory you could set up iptables rules on the old host/old IP pointing to the new host/new IP, for a very short time, if you really have to.

The best way is to harden the simple clients, though. They need to be able to handle a lost connection and re-connect. Now they'll have to do two things: do a DNS lookup and connect to the first IP (or connect to the DNS name if that is in the higher-level API). And those that can't can still use a proxy.

I've also had another thought before: is it possible to represent group lookup as a DNS lookup somehow? If the only thing you are returning is pod IP addresses, a DNS lookup could be possible.

Maybe something like the following, where the order of the labels in the DNS query doesn't matter:
_environment.prod._backendtype.sql.grouplookup.local or _backendtype.sql._environment.prod.grouplookup.local

Has anyone considered that yet ?

That would restrict the characters for the labels and values to the ones allowed in DNS, of course; maybe that is already the case, I haven't checked. Unless every resolver library includes punycode support, which I doubt.

I should add: you should make loosely coupled components. So every application that needs it should just include a proxy in its own pod.
