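I have been working with CloudStack + KVM a lot lately, and noticed that after restarting the CloudStack service the KVM host's state keeps showing as Alert. The management-server log contains the following error: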
ERROR [agent.manager.AgentManagerImpl] (AgentManager-Handler-7:) Monitor ClusteredVirtualMachineManagerImpl$$EnhancerByCGLIB$$121cf44e says there is an error in the connect process for 1 due to null
java.lang.NullPointerException
at com.cloud.vm.VirtualMachineManagerImpl.fullHostSync(VirtualMachineManagerImpl.java:1643)
at com.cloud.vm.VirtualMachineManagerImpl.processConnect(VirtualMachineManagerImpl.java:2289)
at com.cloud.agent.manager.AgentManagerImpl.notifyMonitorsOfConnection(AgentManagerImpl.java:605)
at com.cloud.agent.manager.AgentManagerImpl.handleConnectedAgent(AgentManagerImpl.java:1157)
at com.cloud.agent.manager.AgentManagerImpl.access$100(AgentManagerImpl.java:142)
at com.cloud.agent.manager.AgentManagerImpl$AgentHandler.processRequest(AgentManagerImpl.java:1235)
at com.cloud.agent.manager.AgentManagerImpl$AgentHandler.doTask(AgentManagerImpl.java:1374)
at com.cloud.agent.manager.ClusteredAgentManagerImpl$ClusteredAgentHandler.doTask(ClusteredAgentManagerImpl.java:618)
at com.cloud.utils.nio.Task.run(Task.java:83)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
The agent log shows:
2013-08-09 11:27:18,746 INFO [cloud.agent.Agent] (Agent-Handler-3:null) Reconnecting...
2013-08-09 11:27:18,747 INFO [utils.nio.NioClient] (Agent-Selector:null) Connecting to 20.1.134.190:8250
2013-08-09 11:27:18,855 INFO [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done
2013-08-09 11:27:19,422 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Proccess agent startup answer, agent id = 1
2013-08-09 11:27:19,422 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Set agent id 1
2013-08-09 11:27:19,423 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Startup Response Received: agent id = 1
2013-08-09 11:27:19,539 WARN [cloud.agent.Agent] (UgentTask-5:null) Unable to send request: null
2013-08-09 11:27:23,856 INFO [cloud.agent.Agent] (Agent-Handler-3:null) Connected to the server
2013-08-09 11:27:24,481 INFO [cloud.agent.Agent] (Agent-Handler-3:null) Lost connection to the server. Dealing with the remaining commands...
2013-08-09 11:27:29,483 INFO [cloud.agent.Agent] (Agent-Handler-3:null) Reconnecting...
2013-08-09 11:27:29,484 INFO [utils.nio.NioClient] (Agent-Selector:null) Connecting to 20.1.134.190:8250
2013-08-09 11:27:29,580 INFO [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done
2013-08-09 11:27:30,223 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Proccess agent startup answer, agent id = 1
2013-08-09 11:27:30,224 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Set agent id 1
2013-08-09 11:27:30,225 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Startup Response Received: agent id = 1
2013-08-09 11:27:30,350 WARN [cloud.agent.Agent] (UgentTask-5:null) Unable to send request: null
2013-08-09 11:27:34,581 INFO [cloud.agent.Agent] (Agent-Handler-3:null) Connected to the server
2013-08-09 11:27:35,310 INFO [cloud.agent.Agent] (Agent-Handler-3:null) Lost connection to the server. Dealing with the remaining commands...
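Restarting the agent and libvirtd services did not clear the error. Rebooting the host made no difference either.
The logs show that the exception is thrown when management-server, having connected to cloud-agent, refreshes the VM states. At that point every VM was in the Stopped state except the vRouter, which was still marked Running, and that turned out to be the problem. For some reason, virsh list on the host does not show the vRouter, yet management-server believes it is Running and tries to refresh its state. The lookup on the management-server side cannot find the vRouter, so the exception is thrown. This looks like a bug that needs fixing.
Solution: delete the vRouter. (First set its state to Stopped in the database by running the SQL "update vm_instance set state = 'Stopped' where vm_type = 'DomainRouter';".)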
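The mismatch described here (a VM recorded as Running that the host does not actually run) can be checked with a small script. The following is only a sketch under assumptions: it parses text in the usual table layout of virsh list --all, the helper names parse_virsh_list and find_stale_running are made up for illustration, and the VM names in the sample are hypothetical. How the list of "Running" names is read from the CloudStack database is left out.

```python
def parse_virsh_list(output):
    """Map domain name -> state from the text printed by `virsh list --all`."""
    domains = {}
    for line in output.splitlines():
        # Columns are id, name, state; the state itself may contain a space
        # ("shut off"), so split into at most three fields.
        parts = line.split(None, 2)
        if len(parts) < 3 or parts[0] == "Id":
            continue  # blank line, dashed separator, or header row
        domains[parts[1]] = parts[2].strip()
    return domains


def find_stale_running(db_running_names, virsh_output):
    """Return names the database marks Running that are not running on the host."""
    on_host = {
        name
        for name, state in parse_virsh_list(virsh_output).items()
        if state == "running"
    }
    return sorted(set(db_running_names) - on_host)


# Hypothetical sample: the host only knows one shut-off guest,
# while the database still lists a router VM as Running.
SAMPLE = """\
 Id    Name                           State
----------------------------------------------------
 -     i-2-3-VM                       shut off
"""

stale = find_stale_running(["r-4-VM"], SAMPLE)
print(stale)
```

Any name returned by find_stale_running is a candidate for the kind of stale record that triggered the NullPointerException above.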
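Note that the statement above touches every row whose vm_type is DomainRouter, not only the one stale vRouter. The sketch below replays it against an in-memory SQLite table to show that scope. This is an illustration only: CloudStack itself uses MySQL, the table here is a simplified stand-in with just four columns, and the row values and the helper name mark_routers_stopped are hypothetical.

```python
import sqlite3


def mark_routers_stopped(conn):
    """Run the statement from the post and return how many rows it changed."""
    cur = conn.execute(
        "update vm_instance set state = 'Stopped' where vm_type = 'DomainRouter'"
    )
    conn.commit()
    return cur.rowcount


# Simplified stand-in for the real vm_instance table (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table vm_instance ("
    "id integer primary key, name text, vm_type text, state text)"
)
conn.executemany(
    "insert into vm_instance (name, vm_type, state) values (?, ?, ?)",
    [
        ("r-4-VM", "DomainRouter", "Running"),  # hypothetical stale router
        ("i-2-3-VM", "User", "Running"),        # hypothetical user VM
    ],
)

changed = mark_routers_stopped(conn)
states = dict(conn.execute("select name, state from vm_instance"))
print(changed, states)
```

Only the router row changes; rows of other VM types are left alone. On a deployment with several routers, checking the affected rows with a select before running the update would be a reasonable precaution.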