Netflix Zuul连接泄露问题

Zuul 1.x的连接泄露问题,目前仍然存在。

2018年,公司内做微服务化改造,将单体服务拆分为多个微服务,那么就需要一个网关层做服务分发。常见的选型有Kong(Openresty)、nginx、Zuul等。出于Java和Spring Cloud技术栈的考虑,最终选择了Netflix Zuul,基于Spring Cloud Netflix开发公司内部的安全校验、路由分发等功能,版本为Spring Boot 2.0/Spring Cloud Finchley。

上线后按业务配置路由,网关正常工作。突然有一天,通过网关访问某个业务整体不可用,但访问其它服务正常。使用jstack查看,发现有许多如下线程:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
"http-nio-8080-exec-200" #263 daemon prio=5 os_prio=0 tid=0x00007f109630e800 nid=0x4e30 waiting on condition [0x00007f102003f000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006ce533488> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)
at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246)
- locked <0x000000075d6584e0> (a org.apache.http.pool.AbstractConnPool$2)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:282)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:269)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:118)
// 省略

Locked ownable synchronizers:
- <0x000000075d9168c0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

第一步:连接池获取连接的超时时间问题?

找到获取连接的部分代码,发现这里获取连接的超时时间取自RequestConfig.getConnectionRequestTimeout,而RequestConfig.DEFAULT中,connectionRequestTimeout默认为-1,可能导致阻塞。

首先尝试解决获取连接的阻塞问题。由于ZuulProperties.Host中配置项较少,没有connectRequestTimeoutMills,如何配置这个参数?观察ZuulProxyAutoConfiguration,发现需要重写CloseableHttpClient

1
2
3
4
5
6
7
8
9
10
11
12
@Bean
@ConditionalOnMissingBean({SimpleHostRoutingFilter.class, CloseableHttpClient.class})
public SimpleHostRoutingFilter simpleHostRoutingFilter(ProxyRequestHelper helper, ZuulProperties zuulProperties,
ApacheHttpClientConnectionManagerFactory connectionManagerFactory, ApacheHttpClientFactory httpClientFactory) {
return new SimpleHostRoutingFilter(helper, zuulProperties, connectionManagerFactory, httpClientFactory);
}

@Bean
@ConditionalOnMissingBean({SimpleHostRoutingFilter.class})
public SimpleHostRoutingFilter simpleHostRoutingFilter2(ProxyRequestHelper helper, ZuulProperties zuulProperties, CloseableHttpClient httpClient) {
return new SimpleHostRoutingFilter(helper, zuulProperties, httpClient);
}

再看SimpleHostRoutingFilter@PostConstruct,还会创建connectionManagerTimer定时关闭连接。因此我们也需要重写timer,配置代码如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
@Configuration
@Slf4j
public class ZuulProxyConfig extends ZuulProxyAutoConfiguration {

@Value("${zuul.host.connect-request-timeout-millis}")
private int connectRequestTimeoutMills;

@Bean
public CloseableHttpClient httpClient(ApacheHttpClientFactory httpClientFactory,
ApacheHttpClientConnectionManagerFactory connectionManagerFactory,
ZuulProperties zuulProperties) {
ZuulProperties.Host hostProperties = zuulProperties.getHost();
final RequestConfig requestConfig = RequestConfig.custom()....build();
HttpClientConnectionManager connectionManager = connectionManagerFactory.newConnectionManager(...);
CloseableHttpClient httpClient = httpClientFactory.createBuilder().
setDefaultRequestConfig(requestConfig).
setConnectionManager(connectionManager).build();
Timer connectionManagerTimer = new Timer("SimpleHostRoutingFilter.connectionManagerTimer", true);
connectionManagerTimer.schedule(...);
return httpClient;
}
}

修改后部署环境,但是一段时间后问题再次出现。设置了connectRequestTimeoutMills后,日志中出现大量com.netflix.zuul.exception.ZuulException: Timeout waiting for connection from pool。因此根本原因不在获取连接超时时间,而在于为什么获取不到连接。

第二步:为何拿不到连接?

第一步的改动中重写了connectionManagerTimer,这个timer的作用是定时回收过期线程,我们在其中增加日志定时打印各Route的连接池情况,再次出现问题时日志如下:

1
2
3
4
5
6
// leased:使用中;pending:等待中;available:空闲,max:最大连接数
route: {}->http://xxx.xxx.idc:80, stats: [leased: 7; pending: 0; available: 51; max: 1000]
route: {}->http://x.x.x.x:5000, stats: [leased: 0; pending: 0; available: 2; max: 1000]
route: {}->http://xxx.xxx.idc:80, stats: [leased: 1000; pending: 0; available: 0; max: 1000]
// ...
total route stats: [leased: 1045; pending: 0; available: 59; max: 2000]

可以发现多数route是正常的,但是出问题的这个服务,连接池占满了。因此原因指向了连接没有正常关闭。

第三步:Zuul是如何关闭连接的?

分析Zuul的代码,其核心就是三步:pre、route、post,出现异常则走error+post。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
public class ZuulServlet extends HttpServlet {
@Override
public void service(javax.servlet.ServletRequest servletRequest, javax.servlet.ServletResponse servletResponse) throws ServletException, IOException {
try {
init((HttpServletRequest) servletRequest, (HttpServletResponse) servletResponse);

RequestContext context = RequestContext.getCurrentContext();
context.setZuulEngineRan();

try {
preRoute();
} catch (ZuulException e) {
error(e);
postRoute();
return;
}
try {
route();
} catch (ZuulException e) {
error(e);
postRoute();
return;
}
try {
postRoute();
} catch (ZuulException e) {
error(e);
return;
}

} catch (Throwable e) {
error(new ZuulException(e, 500, "UNHANDLED_EXCEPTION_" + e.getClass().getName()));
} finally {
RequestContext.getCurrentContext().unset();
}
}

在我们的网关中,存在两个postRoute:自定义的PostErrorHandlerFilter和Zuul的SendResponseFilter。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
public Object runFilters(String sType) throws Throwable {   // sType = post
if (RequestContext.getCurrentContext().debugRouting()) {
Debug.addRoutingDebug("Invoking {" + sType + "} type filters");
}
boolean bResult = false;
// list = [PostErrorHandlerFilter(自定义), SendResponseFilter]
List<ZuulFilter> list = FilterLoader.getInstance().getFiltersByType(sType);
if (list != null) {
for (int i = 0; i < list.size(); i++) {
ZuulFilter zuulFilter = list.get(i);
Object result = processZuulFilter(zuulFilter);
if (result != null && result instanceof Boolean) {
bResult |= ((Boolean) result);
}
}
}
return bResult;
}

从runFilter和ZuulServlet代码可发现,假设postRoute的filter抛出异常,就会被catch并进行error处理,然后返回。我们的项目在自定义的PostErrorHandlerFilter中非200请求统一抛出业务异常,通过Spring的DefaultErrorAttributes处理异常。

1
2
3
4
5
public Map<String, Object> getErrorAttributes(WebRequest webRequest, boolean includeStackTrace) {
// log error and add stats.
// return data
return o;
}

而Zuul是在SendResponseFilter中关闭连接的,因此在runFilters抛出后,打断了postRoute的流程,导致ZuulResponse的close不会被执行:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// SendResponseFilter中关闭
private void writeResponse() throws Exception {
// ...
finally {
try {
Object zuulResponse = RequestContext.getCurrentContext().get("zuulResponse");
if (zuulResponse instanceof Closeable) {
((Closeable) zuulResponse).close();
}
outStream.flush();
} catch (IOException ex) {
log.warn("Error while sending response to client: " + ex.getMessage());
}
}
}

这就是发生连接泄露的原因。解决方式是手动关闭连接(我们是在postFilter抛出,errorFilter中关闭连接),从RequestContext中获取到ZuulResponse,执行close方法即可。

总结及后续

这个问题虽然排查花费了一些时间,但解决仅用了两个commit,分别对应上面几步流程所述。

IMAGE

总结几个值得关注的问题:

  1. 参数配置如连接池参数等。Spring Cloud/Boot的AutoConfiguration做了很多封装,但有时也会省略一些配置(如connectionRequestTimeout),需要关注配置情况。
  2. 网关要做到路由隔离。比如Zuul不同路由可以配置不同的连接池大小,即便某个连接不可用,也不会影响其它服务。max-total-connections一定要大于max-per-route-connections。
  3. Zuul本身的问题。从代码可以看出,除了上面所示的postRoute抛异常导致连接泄露,errorRoute抛异常同样有可能导致连接泄露,需要在代码中进行处理。考虑到Zuul 1.x不再维护,后续不建议再使用Zuul做网关。

这个问题我们在Spring Cloud Netflix的GitHub中提了issue:may be found a bug for httpclient connection leak #2831,有国内公司也遇到过相同问题。

我们的解决方式是在postFilter和errorFilter中关闭连接。也有其它开发者提了pull request,通过注册回调的方式确保连接关闭,不再需要SendResponseFilter中关闭,详见:Make sure resources are disposed at request completion #2849。但是由于Zuul停止维护,这个提交没有被合并,所以该问题至今仍可复现。