Souparno Adhikary - 1 year ago
Linux Question

PBS Communication Error: Nodes can not communicate

I successfully installed the PBS server, started the services, and can view the nodes with the pbsnodes command. The queue shows up properly in the output of qstat -q. After I submit a test job, the following messages appear in my sched_log, in my server_log, and in the mom_log on the MOM node:

sched_log:

08/16/2017 14:18:48.476;64; pbs_sched.19885;Job;2.headnode;Job Run
08/16/2017 14:19:28.215;02; pbs_sched.19885;Req;headnode3;Can not open connection to mom
08/16/2017 14:19:28.215;02; pbs_sched.19885;Req;headnode4;Can not open connection to mom
08/16/2017 14:19:28.238;02; pbs_sched.19885;Req;headnode5;Can not open connection to mom
08/16/2017 14:19:28.239;02; pbs_sched.19885;Req;headnode6;Can not open connection to mom


server_log:

08/16/2017 14:40:37.829;01;PBS_Server.27737;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.233:15003]
08/16/2017 14:40:37.829;01;PBS_Server.27739;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.232:15003]
08/16/2017 14:40:37.829;01;PBS_Server.27793;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.235:15003]
08/16/2017 14:40:38.828;01;PBS_Server.27736;Svr;PBS_Server;LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 192.168.89.234:15003]


mom_log:

08/16/2017 18:50:36.215;01; pbs_mom.10833;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 11123 MOM status update intervals
08/16/2017 18:51:22.308;01; pbs_mom.10838;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
08/16/2017 18:51:22.308;01; pbs_mom.10838;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 11124 MOM status update intervals
08/16/2017 18:52:06.402;01; pbs_mom.10859;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status update successfully sent after 11124 MOM status update intervals
08/16/2017 18:53:21.555;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 18:58:26.182;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:03:31.815;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:08:31.407;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:13:37.039;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:18:41.670;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
08/16/2017 19:23:46.455;02; pbs_mom.13039;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0


How can this problem be solved? Is it caused by some kind of authentication failure? If so, should I set up SSH key-based logins?

Interestingly, I have another server running Torque, named headnode2 with ip .89.231, which is not showing any errors. I did not follow any extra steps to configure that one.

Answer

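You may just need to configure the firewall. The server_log shows connect() failing on port 15003 of every node, which is consistent with a firewall blocking that port. I'd run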

# iptables-save > iptables.bak && iptables -F

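on the server and on one test node, and then submit a job to that node to see if it runs. The first command saves the current rules to iptables.bak and the second flushes them, so you can put the original rules back afterwards with iptables-restore < iptables.bak.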
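
If you would rather confirm the diagnosis before touching any rules, a quick reachability check from the server can help. This is only a rough sketch: it assumes bash and the coreutils timeout command are installed, check_port and report_port are hypothetical helper names made up for this example, and the port number 15003 is the one that appears in the server_log above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if a TCP connection to host:port
# can be opened within 3 seconds, using bash's /dev/tcp redirection.
check_port() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Hypothetical helper: prints a one-line verdict for host:port.
report_port() {
    if check_port "$1" "$2"; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 unreachable"
    fi
}

# Check port 15003 (taken from the server_log) on every host given as an argument.
for host in "$@"; do
    report_port "$host" 15003
done
```

Saved as, say, check_mom_ports.sh, it could be run from the server as `sh check_mom_ports.sh 192.168.89.232 192.168.89.233 192.168.89.234 192.168.89.235`. An "unreachable" result does not prove the firewall is at fault, since a pbs_mom that is not running gives the same result, but a "reachable" result would suggest looking elsewhere.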
