Dubbo ReconnectTimerTask keeps reconnecting to a provider

Problem description: the Dubbo consumer keeps retrying to reconnect to a Dubbo provider and logs an error on every attempt:

    [DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [10.1.1.12:0 -> /10.1.1.228:20888]], dubbo version: 2.7.3, current host: 10.1.1.12
    2019-08-30 20:33:52.283 [/] httpwrapper [dubbo-client-idleCheck-thread-1] ERROR o.a.d.r.e.s.h.ReconnectTimerTask - [DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [10.1.1.12:0 -> /10.1.1.228:20888]], dubbo version: 2.7.3, current host: 10.1.1.12
    org.apache.dubbo.remoting.RemotingException: client(url: dubbo://10.1.1.228:20888/com.cxq56.service.GeoService?actives=0&anyhost=true&application=httpwrapper&async=false&bean.name=providers:dubbo:com.cxq56.service.GeoService&check=false&cluster=failover&codec=dubbo&default.deprecated=false&default.dynamic=false&default.register=true&default.retries=1&default.timeout=10000&deprecated=false&dubbo=2.0.2&dynamic=false&generic=false&heartbeat=60000&interface=com.cxq56.service.GeoService&lazy=false&loadbalance=random&methods=createForbiddenGeo,calculatedDistance,createSiteInfo,getSiteAndDistance,getAllGeoByCityId,searchForPOI,createGeo&pid=1&qos.enable=false&register=true&register.ip=10.1.1.12&release=2.7.1&remote.application=geo-provider&retries=0&revision=1.0-SNAPSHOT&shutwait=40000&side=consumer&sticky=false&timeout=3000&timestamp=1567049198218&validation=false) failed to connect to server /10.1.1.228:20888 client-side timeout 3000ms (elapsed: 3000ms) from netty client 10.1.1.12 using dubbo version 2.7.3
        at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:171)
        at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:190)
        at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:246)
        at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:155)
        at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
        at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
        at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:648)
        at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:727)
        at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:449)
        at java.lang.Thread.run(Thread.java:748)

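On inspection, the provider's IP was no longer part of the production cluster, yet its entry still existed in the Redis registry. That points to a missing graceful shutdown: the provider's registration data was never deleted. The other provider services had no problem, though; only this one provider was being reconnected to endlessly. That made it a real headache, because the exception was thrown over and over.

Background: Dubbo has two mechanisms to keep services available. One is the heartbeat mechanism, which probes whether the peer is still alive; the other is the reconnect mechanism. Both periodic probes are initialized by the HeaderExchangeClient class: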

    private final Client client;
    private final ExchangeChannel channel;

    private static final HashedWheelTimer IDLE_CHECK_TIMER = new HashedWheelTimer(
            new NamedThreadFactory("dubbo-client-idleCheck", true), 1, TimeUnit.SECONDS, TICKS_PER_WHEEL);
    private HeartbeatTimerTask heartBeatTimerTask;
    private ReconnectTimerTask reconnectTimerTask;

    public HeaderExchangeClient(Client client, boolean startTimer) {
        Assert.notNull(client, "Client can't be null");
        this.client = client;
        this.channel = new HeaderExchangeChannel(client);

        if (startTimer) {
            URL url = client.getUrl();
            startReconnectTask(url);
            startHeartBeatTask(url);
        }
    }

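As the code shows, a HashedWheelTimer drives the periodic polling. When the reconnect task fails, it prints the exception log shown above, and it does not stop after a failure: it keeps retrying indefinitely. That raises a question: does the consumer try to reconnect to every stale registration left in Redis?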
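To make the retry-forever behaviour concrete, here is a minimal sketch in plain Java. It is not Dubbo's code: the class name `ReconnectLoopSketch`, the method `countFailedAttempts`, and the use of a `BooleanSupplier` as the "connect" call are all assumptions made for illustration, and each loop iteration merely stands in for one firing of the timer. It assumes the task has no retry limit and no back-off, as described above.

```java
import java.util.function.BooleanSupplier;

public class ReconnectLoopSketch {

    /**
     * Simulates `ticks` timer firings and returns how many connect attempts failed.
     * `connector` is a hypothetical stand-in for the client's reconnect call;
     * it returns true when the connection succeeds.
     */
    public static int countFailedAttempts(BooleanSupplier connector, int ticks) {
        boolean connected = false;
        int failures = 0;
        for (int tick = 0; tick < ticks; tick++) {
            if (connected) {
                continue; // channel is up, nothing to reconnect on this tick
            }
            connected = connector.getAsBoolean();
            if (!connected) {
                failures++; // this is where the real task would log the ERROR
            }
            // No retry limit and no back-off: the next tick simply tries again.
        }
        return failures;
    }

    public static void main(String[] args) {
        // A provider that is gone for good: every attempt fails.
        System.out.println(countFailedAttempts(() -> false, 5)); // prints 5
        // A reachable provider: the first attempt succeeds, so nothing fails.
        System.out.println(countFailedAttempts(() -> true, 5));  // prints 0
    }
}
```

With a provider that never comes back, the failure count grows by one on every tick, which matches the endless stream of identical errors in the log.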
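So we set a breakpoint to find out. Tracing back up the call chain shows that when a Dubbo consumer starts, it registers its own consumer information in Redis, fetches all provider registrations by interface name, and then filters them in RedisRegistry. The code is as follows: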

    private void doNotify(Jedis jedis, Collection<String> keys, URL url, Collection<NotifyListener> listeners) {
        if (keys == null || keys.isEmpty()
                || listeners == null || listeners.isEmpty()) {
            return;
        }
        long now = System.currentTimeMillis();
        List<URL> result = new ArrayList<>();
        List<String> categories = Arrays.asList(url.getParameter(CATEGORY_KEY, new String[0]));
        String consumerService = url.getServiceInterface();
        for (String key : keys) {
            if (!ANY_VALUE.equals(consumerService)) {
                String providerService = toServiceName(key);
                if (!providerService.equals(consumerService)) {
                    continue;
                }
            }
            String category = toCategoryName(key);
            if (!categories.contains(ANY_VALUE) && !categories.contains(category)) {
                continue;
            }
            List<URL> urls = new ArrayList<>();
            Map<String, String> values = jedis.hgetAll(key);
            if (CollectionUtils.isNotEmptyMap(values)) {
                for (Map.Entry<String, String> entry : values.entrySet()) {
                    URL u = URL.valueOf(entry.getKey());
                    // If dynamic is false, or the expiry time is not earlier than now,
                    // keep this registered URL; the consumer will reconnect to it later.
                    if (!u.getParameter(DYNAMIC_KEY, true)
                            || Long.parseLong(entry.getValue()) >= now) {
                        if (UrlUtils.isMatch(url, u)) {
                            urls.add(u);
                        }
                    }
                }
            }
            if (urls.isEmpty()) {
                urls.add(URLBuilder.from(url)
                        .setProtocol(EMPTY_PROTOCOL)
                        .setAddress(ANYHOST_VALUE)
                        .setPath(toServiceName(key))
                        .addParameter(CATEGORY_KEY, category)
                        .build());
            }
            result.addAll(urls);
            if (logger.isInfoEnabled()) {
                logger.info("redis notify: " + key + " = " + urls);
            }
        }
        if (CollectionUtils.isEmpty(result)) {
            return;
        }
        for (NotifyListener listener : listeners) {
            notify(url, listener, result);
        }
    }

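A provider registration is filtered out only when dynamic is true and its expiry time is earlier than the current time. Stale registrations almost always have an expiry time earlier than now (such entries are dirty data, which either graceful shutdown or the Dubbo monitor can remove), so the root cause lies in dynamic. This provider runs Dubbo 2.7.1, which has a bug that makes dynamic default to false, and that directly causes the problem seen here. The official documentation describes dynamic roughly as: "Whether the service is registered dynamically. If set to false, the service shows as disabled after registration and must be enabled manually; when the provider stops, it is not unregistered automatically and must be disabled manually." It says nothing, however, about the consumer reconnecting forever.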
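The effect of that condition is easy to check in isolation. The sketch below copies only the boolean expression from doNotify into a standalone method; the class name `RegistryFilterSketch`, the method `isKept`, and the timestamps are hypothetical and chosen purely for illustration.

```java
public class RegistryFilterSketch {

    /**
     * Same boolean condition as in doNotify: a registered URL is kept
     * if it is not dynamic, or if its expiry time is not earlier than now.
     */
    public static boolean isKept(boolean dynamic, long expireAt, long now) {
        return !dynamic || expireAt >= now;
    }

    public static void main(String[] args) {
        long now = 1_567_000_000_000L;   // hypothetical "current time" in milliseconds
        long expired = now - 60_000L;    // expiry one minute in the past
        long fresh = now + 60_000L;      // expiry one minute in the future

        System.out.println(isKept(true, fresh, now));    // true: live dynamic provider
        System.out.println(isKept(true, expired, now));  // false: stale dynamic entry is dropped
        System.out.println(isKept(false, expired, now)); // true: stale entry survives when dynamic=false
    }
}
```

The third case is the one from this incident: the expiry check is short-circuited as soon as dynamic is false, so the dead provider's URL is still handed to the consumer.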
