Strange Problem or Stupid Administrator?
A misguided administrator (i.e. me) accidentally deleted the fabricuser account on a 10-node fabric (redhat 4.6). Everything worked ok until the nodes had to be rebooted and I realized my mistake (fabric_keeper would not start). I recreated the fabricuser account (with a different UID, however) and changed file/group ownership on the /usr/local/fabric files to match the new UID. Everything then seemed to be working ok; the services and fabric tasks all started and I was able to query and monitor the fabric.
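For anyone retracing this step, one way to double-check the ownership change is to look for files whose owner or group no longer maps to any account. This is only a sketch: the helper name list_orphans is made up for illustration, and it assumes the old UID was not handed out to some other account in the meantime.

```shell
# List files under a directory whose numeric owner or group has no
# matching account. Files still carrying the deleted fabricuser UID
# would show up here (assuming that UID has not been reused).
list_orphans() {
    dir=${1:-/usr/local/fabric}
    find "$dir" \( -nouser -o -nogroup \) -print
}

# Usage on a worker:
#   list_orphans /usr/local/fabric
```

Empty output suggests every file under the tree maps to an existing user and group.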
Yesterday I was informed by one of the developers that new tasks could not be deployed to the fabric. "fabric_ctl ... deploy" and "fabric_ctl ... get" commands fail with a "Failed to Connect to Fabric" error while "fabric_ctl ... query", "fabric_ctl ... start", etc. commands with the same multicast/port address work fine. I shut all but one node down to isolate to a single node fabric, uninstalled appistry, made sure all /usr/local/fabric files were deleted as well as the fabricuser account, and cleanly reinstalled appistry. Same problem.
Out of desperation I tried changing the multicast address and found that any other multicast address (other than the original) works fine. I can "deploy", "get", "query" ... all fabric_ctl commands work. If I change back to the original multicast address, the "deploy" and "get" commands fail again (but other fabric_ctl commands work fine). Any ideas what would cause these symptoms?
Thanks,
Keith

I must say, good troubleshooting so far. The fact that it works with the other MCAST address eliminates a lot of possibilities. Changing the MCAST address is what actually gave you a true single-node test. As long as the MCAST address matches that of the other workers, their fabric_keeper services can still interact with your single-node worker, even if those workers are technically stopped. That is why changing the MCAST address was such a useful step.
That being said, it may be as simple as rebooting all of the workers. In your post, the problem started occurring after a reboot, but I do not see a reboot after you reinstated the fabricuser user/group. I suspect that the fabric_keeper service is attached to the old UID of fabricuser.
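A quick way to check that suspicion before rebooting anything is to compare the UID the account has now with the UID of the running process. The sketch below is hedged: it assumes the service shows up as a process literally named fabric_keeper and that ps is the procps version shipped with Red Hat; the helper name keeper_uid_report is hypothetical.

```shell
# Print the UID the account has now, then the UID/PID of any running
# keeper process. Assumption: the process is named fabric_keeper.
keeper_uid_report() {
    name=${1:-fabric_keeper}
    user=${2:-fabricuser}
    echo "account uid: $(id -u "$user" 2>/dev/null || echo missing)"
    # Separate -o options keep the empty-header form unambiguous in procps ps.
    ps -C "$name" -o uid= -o pid= -o comm= 2>/dev/null ||
        echo "no process named $name"
}

# Usage on a worker:
#   keeper_uid_report
```

If the UID printed for the process differs from the account UID, the process was started under the old account and a restart or reboot is the likely fix.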
One option is to reboot all the workers and see if everything works afterward.
If that doesn't work or rebooting 9 workers is not an option in your environment, you can do the following:
Change the MCAST address of one of the problem workers. Normally, you would do a fabric_ctl get then a fabric_ctl put-addr, but this situation does not allow us to have this luxury. Therefore, please do the following steps:
1. Pick a problem worker and connect to it.
2. Edit /usr/local/fabric/system/staging/addr.cfg (note that this is the staging directory).
3. Change the fabric-address to the fabric-address of the working worker.
4. Reboot that worker
Once the worker comes up, it should be on the new MCAST address, and I suspect that it will work fine. I am having you bring it up on the new MCAST address so there is no chance that an existing fabric_keeper in an unknown state will confuse the issue.
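Since the file is being edited by hand, it may be worth saving a copy first and looking at the current value. The sketch below makes no assumption about the layout of addr.cfg beyond the fabric-address name mentioned in step 3; the helper name backup_and_show and the /tmp backup location are illustrative choices, not part of the product.

```shell
# Copy a config file to a backup directory outside the fabric tree,
# then print the lines that mention fabric-address so the current
# value is visible before hand-editing.
backup_and_show() {
    cfg=$1
    backup_dir=${2:-/tmp}
    cp -p "$cfg" "$backup_dir/$(basename "$cfg").bak" || return 1
    grep -n 'fabric-address' "$cfg"
}

# Usage on the problem worker (path taken from step 2):
#   backup_and_show /usr/local/fabric/system/staging/addr.cfg
```

The backup is kept out of the staging directory so that no stray file ends up there.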
Can you do these steps and tell me what happens? Another troubleshooting tip is to start a log_monitor on that good worker and have it listen to the other MCAST address. You may see an error show up while the problem fabric is loading. To use log_monitor, you would simply type log_monitor [MCAST Address]. I like to use the tee command so that I can view the output and log it to a file at the same time, because it goes by fast sometimes: log_monitor [MCAST Address] | tee log.txt
Tell me how this goes, and we can get those workers back to their good state.
Mark,
Thanks for the reply. Now that it is working on the new multicast address I won't be able to touch the systems until after next week as we are having customer demos and we're locking things down until then (I don't think they trust me :).
However ... let me ask this. The assumption was that this problem is an artifact of the fabricuser account problem ... once guilty forever accused! I found out today that someone installed a windows worker with the same "problem" multicast address as the redhat fabric. Can having a mixed redhat/windows fabric cause problems? The windows worker (now in a fabric by itself) seems to be having similar problems (services are running, log monitor is trapping messages, but fabric_ctl commands all fail with "unable to locate fabric").
What I think I'll try after the demos is to remove the Appistry software from the windows server, return the redhat fabric to the original multicast address, and see if the problems go away. If that doesn't work I'll try your suggestions and let you know.
Regards,
Keith
Keith,
Yes, having two workers with different OSes on the same fabric address will cause them to attempt to exchange files and fail. I don't believe it typically causes any issues beyond a lot of complaining in the logs, but it will prevent the newest worker from coming up all the way: it can never really "get in sync" with the existing other-OS fabric, and so it is not allowed to join.
We'll watch for your next post to see how things go post-demos.
Are you demoing an application using the fabric? If so, I'd be curious what you guys are doing. If you can't talk about it here, I understand. We could go off-band with email....
Thanks!
Guerry
---
Guerry A. Semones
guerry -- at -- appistry.com
Appistry | The Fabric of Business: Scalability Simplified
Product Manager / Developer Relations
If you are wondering if a different OS has slipped into your fabric, you may be able to successfully run a "query-detail" command to determine the operating systems of the workers. That should help people troubleshoot this problem in the future.