
[18.06] Improve scalability of the Linux load balancing #16

Merged
merged 3 commits into 18.06 from 18.06-backport-scalable-lb on Jul 10, 2018

Conversation

thaJeztah
Member

backport of moby#37372 for 18.06

Changes included:

git checkout -b 18.06-backport-scalable-lb ce-engine/18.06
git cherry-pick -s -S -x 92335eaef12fb0fb86e5e7503f88b8873a9fa973
git cherry-pick -s -S -x 8e0f6bc90351525f3e52f3bc357378fcccccdd27
git cherry-pick -s -S -x 0e162d992394d9216f590176991d33ff48ef3389
git push -u origin

The cherry-picks were clean; no conflicts.

Bump libnetwork to b0186632522c68f4e1222c4f6d7dbe518882024f. This
includes the following changes:
 * Dockerize protocol buffer generation and update (78d9390a..e12dd44c)
 * Use new plugin interfaces provided by plugin pkg (be94e134)
 * Improve linux load-balancing scalability (5111c24e..366b9110)

Signed-off-by: Chris Telfer <[email protected]>
(cherry picked from commit 92335ea)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
This patch is required for the updated version of libnetwork and entails
two minor changes.

First, it passes the new libnetwork.NetworkDeleteOptionRemoveLB option to
the network.Delete() method to automatically remove the load balancing
endpoint for ingress networks. This allows removal of the
deleteLoadBalancerSandbox() function, whose functionality is now within
libnetwork.
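As a rough sketch of what that call-site change amounts to on the moby side (the helper and variable names here are hypothetical; only NetworkDeleteOptionRemoveLB and Network.Delete() come from libnetwork as described above):

package lbsketch

import (
	"fmt"

	"github.com/docker/libnetwork"
)

// removeIngressNetwork is a hypothetical helper: with the updated libnetwork,
// deleting the network also tears down its load-balancing endpoint/sandbox,
// so no separate deleteLoadBalancerSandbox() call is needed beforehand.
func removeIngressNetwork(nw libnetwork.Network) error {
	if err := nw.Delete(libnetwork.NetworkDeleteOptionRemoveLB); err != nil {
		return fmt.Errorf("error deleting ingress network %s: %v", nw.Name(), err)
	}
	return nil
}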

The second change is to allocate a load balancer endpoint IP address for
all overlay networks rather than just "ingress" and Windows overlay
networks. Swarmkit is already performing this allocation, but moby was
not making use of these IP addresses for Linux overlay networks (except
ingress). The current version of libnetwork makes use of these IP
addresses by creating a load balancing sandbox and endpoint, similar to
the ingress network's, for every overlay network, and putting all load
balancing state for a given node in that sandbox only. This reduces the
amount of Linux kernel state required per node.

In the prior scheme, libnetwork would program each container's network
namespace with every piece of load balancing state for every other
container that shared *any* network with the first container.  This
meant that the amount of kernel state on a given node scaled with the
square of the number of services in the cluster and with the square of
the number of containers per service.  With the new scheme, kernel state
at each node scales linearly with the number of services and the number
of containers per service.  This also reduces the number of system calls
required to add or remove tasks and containers.  Previously the number
of system calls required grew linearly with the number of other
tasks that shared a network with the container.  Now the number of
system calls grows linearly only with the number of networks that the
task/container is attached to. This results in a significant
performance improvement when adding services to and removing services
from a cluster that is already heavily loaded.
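As a rough back-of-envelope illustration of the difference (the numbers below are made up; the counting simply follows the description above, assuming all services share one overlay network):

package lbsketch

// Hypothetical cluster: 100 services with 10 tasks each, all on one shared
// overlay network, and 20 of those tasks scheduled on the node in question.
const (
	services    = 100
	tasksPerSvc = 10
	localTasks  = 20
)

// Old scheme: each local container's namespace carries load-balancing state
// for every service (and every backend) on the networks it shares.
const oldEntriesPerNode = localTasks * services * tasksPerSvc // 20,000

// New scheme: one load-balancing sandbox per network per node holds that
// state exactly once, independent of how many local containers share it.
const newEntriesPerNode = services * tasksPerSvc // 1,000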

The primary disadvantage to this scheme is that it requires the
allocation of an additional IP address per node per subnet for every
node in the cluster that has a task on the given subnet.  However, as
mentioned, swarmkit is already allocating these IP addresses for every
node and they are going unused.  Future swarmkit modifications should be
examined to only allocate said IP addresses when nodes actually require
them.

Signed-off-by: Chris Telfer <[email protected]>
(cherry picked from commit 8e0f6bc)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
@thaJeztah
Member Author

ping @fcrisciani @ctelfer PTAL

@thaJeztah
Member Author

18:14:05 2018/07/09 18:14:05 Errors on clone:
18:14:05 google.golang.org/api: Err: exit status 128, out: Cloning into '/go/src/github.com/docker/docker/vendor/google.golang.org/api'...
18:14:05 fatal: git fetch-pack: expected ACK/NAK, got 'error: Git repository not found'
18:14:05 fatal: The remote end hung up unexpectedly
18:14:05 Build step 'Execute shell' marked build as failure

hm... outage somewhere?

@ctelfer

ctelfer commented Jul 9, 2018

We should also include 6225d1f if that is not in 18.06. Other than that LGTM.

Bump libnetwork to 3ac297bc7fd0afec9051bbb47024c9bc1d75bf5b in order to
get fix 0c3d9f00 which addresses a flaw that the scalable load balancing
code revealed. Attempting to print sandbox IDs where the sandbox name
was too short resulted in a goroutine panic. In the previous code this
could occur only for sandboxes with names of 1 or 2 characters, but due
to naming updates in the scalable load balancing code it could now occur
for networks whose names were 3 characters, and at least one of the
integration tests employed such networks (named 'foo', 'bar' and 'baz').
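The underlying class of bug is plain Go string slicing; a minimal hedged sketch (the 7-character truncation is illustrative, not the exact libnetwork code):

package lbsketch

// shortIDUnsafe mimics the flaw: slicing past the end of a short ID panics
// with "slice bounds out of range", e.g. for a sandbox named after a
// 3-character network such as "foo".
func shortIDUnsafe(id string) string {
	return id[:7]
}

// shortIDSafe shows the shape of the fix: bound the slice by the length.
func shortIDSafe(id string) string {
	if len(id) < 7 {
		return id
	}
	return id[:7]
}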

This update also brings in several other changes:
 * 6c7c6017 - Fix error handling about bridgeSetup
 * 5ed38221 - Optimize networkDB queue
 * cfa9afdb - ndots: produce error on negative numbers
 * 5586e226 - improve error message for invalid ndots number
 * 449672e5 - Allows to set generic knobs on the Sandbox
 * 6b4c4af7 - do not ignore user-provided "ndots:0" option
 * 843a0e42 - Adjust corner case for reconnect logic

Signed-off-by: Chris Telfer <[email protected]>
(cherry picked from commit 0e162d9)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
@thaJeztah
Member Author

@ctelfer Looks like moby#37156 is not in the 18.06 branch, so we shouldn't have the regression (it won't hurt to include 6225d1f, but I think it's not necessary).

If you can double-check in the 18.06 branch though (in case I overlooked something) 🤗

@thaJeztah thaJeztah added this to the 18.06.0 milestone Jul 9, 2018
@ctelfer

ctelfer commented Jul 9, 2018

Oh! That's true. The 18.06 branch does not have the wrapped error changes. I'm a little surprised, but not displeased.

@fcrisciani

Looks like we need more tests to catch issues like the one that @ctelfer fixed; this is not the first time that something gets broken when the error types get touched :(

@fcrisciani fcrisciani left a comment

LGTM

@ctelfer ctelfer left a comment

LGTM

@andrewhsu andrewhsu merged commit c863835 into docker-archive:18.06 Jul 10, 2018
@thaJeztah thaJeztah deleted the 18.06-backport-scalable-lb branch July 10, 2018 16:30
@yuvalo

yuvalo commented Oct 8, 2018

Adding the new load balancer introduces a new behavior:
When a container communicates with another service, the traffic goes through the load balancer, which hides the source IP address.
This is a potential problem in cases where the source IP matters for an application.

@ctelfer

ctelfer commented Oct 8, 2018

@yuvalo Thanks for the comment. A few thoughts and then queries:

  • fwiw, this change only affects east-west traffic and not north-south (which always used masquerading)
  • this behavior has been present in the Windows implementation of the engine since 17.09, so at least in mixed-OS deployments having the source address unmodified was never a guarantee
  • there is an outstanding PR in libnetwork (Use direct server return in east-west overlay load balancing, moby/libnetwork#2270) that would use direct server return to avoid changing the source IP
  • unfortunately, the main issue with that approach is, again, Windows. Windows' networking stack does not support (at least as far as we know) the direct-route load balancing primitives required to implement DSR, meaning that Windows containers will not work with it. We're still pondering how we might get around this.

As to the queries portion: I'd be interested to hear what kinds of east-west services you know of that require the source address to be left unmodified. I've heard some very vague statements to the effect that they exist, but haven't heard many real use cases. It would be helpful to understand how and why source NAT breaks things for east-west applications, for the purpose of identifying a proper fix.

Again, thanks!

@yuvalo

yuvalo commented Oct 8, 2018

@ctelfer Thanks for following up.
The reason I got to this PR was that updating to 18.06 broke our deployment in the test environment. We are still working around it, but don't have enough confidence that everything will work out.

Some of the use cases that come to mind are:

  1. Access auditing where you need to record the source of a request.
  2. High availability / scalable solutions such as Hadoop HDFS where the Datanode registers with the Namenode or Spark workers with the Spark master. There are workarounds in most cases, but updating Docker will break existing deployments. This is where I'm at today btw.
  3. IP hash based internal HTTP load balancing
  4. In the future, it will be harder to introduce network restrictions (think Kubernetes Network Policies).

People tend to think of the overlay network as still internal and expect it to be more "flat" than it is in reality.

Hope the above makes a compelling case 🤗

@ctelfer

ctelfer commented Oct 8, 2018

@yuvalo Thanks for the update and the quick response. Overlays are definitely flat networks. But services are an abstraction clearly built around a layer of indirection between clients and servers. Having said that, while I'm not a huge fan of IP-based identity as a concept, it is certainly traditional.

Item 1 above was the main complaint that I've heard re: source NATing. There are, of course, workarounds for this as well such as accounting via TLS or http header fields. (These can give even more semantically relevant info and stronger identity guarantees.) But obviously, some apps still use just basic IP info.

Item 2 is one I haven't encountered before and is very good to know about. I'm guessing that redundant NameNodes each have to be a separate Docker service because otherwise datanodes would have the same problem in reverse (i.e. the datanodes don't know which namenodes they are talking to). So Docker's networking abstractions are interfering with application abstractions built for similar purposes. ☹️ What we need to build, therefore, seems like a set of abstractions that can both work stand-alone and work with application services for things like load balancing, audit, scaling, etc. The NameNode design clearly shows that the application needs more than just compute and network redundancy to support hit-less upgrade or fail-over. (This is more for longer-term planning.)

In any case, I think that the DSR approach will address the cases you listed above. Now, just need a clean way to keep compatibility w/ the Windows overlay services.

@ctelfer

ctelfer commented Oct 12, 2018

@yuvalo So we have merged the libnetwork changes to support DSR-mode load balancing as a per-network feature. This won't prevent upgrade failures, but it does provide a solution for when these situations arise. One can create a network using docker network create ... -d overlay --opt dsr ... to enable this behavior and avoid NATing for services in that network. Note that this only works for overlay networks. Alternately, a docker-compose.yml file can be augmented as:

...
networks:
  my-network:
    driver: overlay
    driver_opts:
      dsr: ""
...
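For anyone driving the engine programmatically, an equivalent hedged sketch with the Go API client (assuming the dsr driver option lands as described above; the network name and helper are illustrative):

package lbsketch

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// createDSROverlay mirrors the CLI/compose examples above: an overlay network
// with the (assumed) "dsr" driver option set so services on it skip NAT.
func createDSROverlay(ctx context.Context, cli *client.Client, name string) error {
	_, err := cli.NetworkCreate(ctx, name, types.NetworkCreate{
		Driver:  "overlay",
		Options: map[string]string{"dsr": ""},
	})
	return err
}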

Unfortunately, this is not yet merged into the docker engine itself, but we are looking to get it in soon (i.e. not waiting for the next stable release), including looking at backports.

The rationale for making this feature opt-in rather than the default boils down to:

  • doing live updates from pre-18.06 to 18.09+ engines w/ DSR as default will cause outages no matter what
  • doing live updates from pre-18.06 to 18.06+ engines w/ NAT as default may cause issues, but may not... it depends on the sensitivity of the applications in the cluster to source NATing
  • DSR mode still will not work w/ Windows and the default overlay configuration should work cross-platform

I hope we'll have a version of Docker soon that can mitigate your issues.

@TincaTibo

Hi
I faced a situation linked to this change when upgrading to 18.06.

  • I have a tool that allows debugging of the exchanges on the cluster; it gets the names of the replicas involved in a communication by reverse resolving their IPs in DNS.
  • After upgrading to 18.06, I was blind, because the IPs used in communication changed from the replicas' IPs to the services' VIPs.
  • I worked around it by resolving all services first to get their VIPs, but then the tool no longer autodiscovers the services... it preloads them :(

I have nothing against this change, although it was a surprise and cost me some hours to figure out.
It seems normal that since services address the VIP to communicate, the VIPs are the addresses seen in communications.
But could we add PTR records to the DNS to allow reverse resolving of the VIPs?
Reverse resolving of replica IPs works, but not of their VIPs.
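For reference, a small hedged sketch of the lookups involved, run from inside a container attached to the overlay network ("my-service" is a placeholder name):

package lbsketch

import (
	"context"
	"net"
)

// lookupService shows the behavior described above: the bare service name
// resolves to the VIP, tasks.<service> resolves to the replica IPs, and a
// reverse (PTR) lookup works for a replica IP but not for the VIP.
func lookupService(ctx context.Context) error {
	var r net.Resolver

	vips, err := r.LookupHost(ctx, "my-service") // service VIP
	if err != nil {
		return err
	}

	taskIPs, err := r.LookupHost(ctx, "tasks.my-service") // replica IPs
	if err != nil {
		return err
	}

	if len(taskIPs) > 0 {
		if _, err := r.LookupAddr(ctx, taskIPs[0]); err != nil { // resolves to the container name
			return err
		}
	}
	if len(vips) > 0 {
		if _, err := r.LookupAddr(ctx, vips[0]); err != nil { // currently fails: no PTR record for the VIP
			return err
		}
	}
	return nil
}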
