Appistry findings
Posted: Thu, 06/05/2008 - 18:26
When I package and deploy an application, the log monitor tells me the fabric keeper is spreading the application. Why doesn't the log monitor on the other machine tell me the application is being deployed there as well? Why isn't the application spread across the fabric? Answer: Multicast (MCAST) is most likely being blocked by our routers. The fabric monitor uses another mechanism to be aware of the workers.
I have a network of 3 computers, one of which is a Core 2 Duo. I have set up my fabric to run exclusive tasks, 1 worker per CPU. The fabric tells me I have 3 workers. Shouldn't it tell me I have 4 workers, since one node is dual CPU? Answer: It is still supposed to tell you that you have 3 workers (the docs are wrong), but the fabric is aware that the node is dual CPU and will schedule tasks on it accordingly. Use the query-detail command to see that the node is known as dual CPU.
In my 3-node network, I set one node to down and ran a client from there so that the other 2 nodes would perform the work. Why was the faster dual-CPU computer tasked with less work than the slower computer? I thought the faster computer should do more work and speed up the process. Answer: It's true that the theory behind the load balancing is that work gets distributed to the worker with the most CPU available for the task. Each worker's operating system reports a CPU rating number, and tasks are sent to the worker with the highest CPU available. Therefore the operating system on the dual-CPU box should report a higher number than a single core, given that both are doing the same amount of work. You are probably now wondering why your fabric distributed more work to the single core than to the dual core. The reason has to do with the size of your fabric. Since there were only 2 workers in your fabric, the load balancing model looks inverted. This is how it works:
- The transaction is sent to the fabric.
- The fabric gives the transaction to the first request handler that responds. Although both workers are request handlers by default, the fabric tends to choose the stronger, faster computer (the dual core) as the request handler, since it replies more quickly.
- The request handler then tries to find a worker to actually do the work. It will always attempt to find a worker other than itself first. Therefore, it finds the single core worker to do the work. This is done to provide more reliability.
- This results in the slower worker technically doing more of the actual work than the faster worker.
This is not ideal, of course, but changing the model would result in lower reliability in general. This load balancing model is beneficial with larger fabrics. With more workers (say 4 or 5), the stronger workers would receive a larger share of the load.
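The steps above can be sketched as a toy simulation. This is only my reading of the explanation, not Appistry's scheduler: the worker names, the speed numbers, and the assumption that the fastest worker always wins the request-handler role are all made up for illustration.

```python
def assign_work(worker_speeds, num_transactions):
    """Toy model of the described flow.

    worker_speeds maps a (hypothetical) worker name to a relative speed.
    Returns how many transactions each worker ends up executing.
    """
    executed = {name: 0 for name in worker_speeds}
    for _ in range(num_transactions):
        # Assumption: the fastest worker always replies first and so
        # becomes the request handler for every transaction.
        handler = max(worker_speeds, key=worker_speeds.get)
        # The handler looks for a worker other than itself first.
        others = [name for name in worker_speeds if name != handler]
        target = max(others, key=worker_speeds.get) if others else handler
        executed[target] += 1
    return executed


# Two-worker fabric: the dual core handles requests, the single core works.
print(assign_work({"dual_core": 2.0, "single_core": 1.0}, 10))
# prints {'dual_core': 0, 'single_core': 10}
```

In this simplified model all of the work lands on the single core. In the real fabric the split was less extreme, but the direction is the same: with only 2 workers, the faster box spends its time as request handler.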
I noticed in the log monitor that when I distribute an application, the node shuts down and restarts. Once I ran a client and got errors because no application was available to run, even though I had two nodes hosting the application. Answer: Nodes need to restart when an application is spread. There is a concurrent_shutdown setting that defaults to 2. This allows the fabric to shut down no more than 2 nodes at a time. With a larger fabric, you would not see this reliability issue.
Why must I shut down an entire node? I just want to shut down one application on a node. Answer: This should work in a future release.
When I set up a new computer and added it to the fabric, the fabric was aware of the new worker, but it was down until I started it. I think that when I add a new node to the fabric it should just work. I don't want to have to keep track of nodes being down. Why did this happen? Answer: I think that when you set up the node you must have set it to down before connecting it to the fabric. The fabric recognizes new nodes but will not restart them if they are down when they are added. That would be presumptuous.
When multicast was not working, the fabric monitor could still report on all workers in the fabric. Since our routers don't allow multicast, and the fabric would only run the fabric application on the same machine as the client, I put the computers on a separate network with a hub. After doing this the workers were all able to participate and complete a client's request, but the fabric monitor did not work. Why is that? I used IP addresses and no names.
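To see why a 2-node fabric can go completely dark during a spread, here is a small sketch. It assumes the fabric restarts nodes in fixed batches of at most concurrent_shutdown; the actual restart scheduling is not documented in this post, and the node names are hypothetical.

```python
def restart_batches(nodes, concurrent_shutdown=2):
    """Group nodes into restart batches of at most concurrent_shutdown."""
    return [nodes[i:i + concurrent_shutdown]
            for i in range(0, len(nodes), concurrent_shutdown)]


def min_nodes_up(nodes, concurrent_shutdown=2):
    """Fewest nodes still serving while any one batch is restarting."""
    if not nodes:
        return 0
    return min(len(nodes) - len(batch)
               for batch in restart_batches(nodes, concurrent_shutdown))


print(min_nodes_up(["node1", "node2"]))                             # prints 0
print(min_nodes_up(["node1", "node2", "node3", "node4", "node5"]))  # prints 3
```

With two nodes and the default of 2, both can be down at once, which matches the client errors I saw. With five nodes, at least three stay up under this model.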
How does the fabric come up and run on Linux? Under what shell? Under what user? Are you aware that i386/client is under jre?
After installing on Linux, the fabric was not running on the worker. I was going to reboot, but first decided to run the fabric with fabric_system_service. I did this as my own user, and the command was run from /usr/local/fabric/system. The fabric did start, but it communicated with another node and ended up "spreading" all the necessary Appistry files across to my machine and putting them all into my home directory. This was very odd and quite confusing. I killed the processes, deleted all the files and rebooted, and everything came up fine. Answer: This occurred because I ran the commands directly on the command line. The correct way to run the fabric keeper (or any other component) from the command line is to use the Unix/Linux service command. You must be root to run this, and the command is of the form service fabric_keeper start. Appistry will try to modify the software to indicate when a component is run incorrectly.
On machine gridhost-7-165 I can build and package hello world as root. I can repeatedly package hello world without updating the revision. This is an error, is it not? There is no error on deploy. On the same machine, as user fabricuser, I cannot package the app. I get the strange error "hive package creation failed". On machine grid-master 10.35.37.92, I can build as root or as fabricuser but cannot package. I keep getting the error: "helloWorld task failed validation Class: 'com.appistry.samples.hello_world.HelloWorld' Method: 'greet' File: 'hello_world.jar' XML: 'hello_world_task.xml' Reason: java.lang.NoSuchMethodError: ". The Java code does in fact have the correct method. Last night I was finishing loading the fabric software onto the fabric. Unbeknownst to me, Appistry had released a new version. This new version has supposedly spread across the fabric, and I have seen the version with the command fabric-version-detail. I suppose that something didn't go right with this process, though, and either the new release has errors or old releases did not update properly. Answer: Although the fabric was automatically updated, the fabric admin tools were not. You can run fabric_pkg, for instance, to see that its version number does not match that of the fabric, as seen with the command fabric_ctl -d239.255.0.1:30000 -ufabric-admin/fabrid-admin fabric-version-detail all. When this is the case, the fabric tools will not function correctly when building fabric application packages, and perhaps more. The solution is to remove the Appistry software and reinstall it. Use the following command to erase the rpm: "rpm -qa | grep appistry | xargs rpm -e", then reinstall. I mentioned that in industry people don't want software upgrading itself without their knowledge, and I suggested a feature: when a worker is added or updated and its version is newer, the worker could be put into quarantine and the admin alerted. The quarantined worker would not be added until the admin said it was OK.
Note: Appistry's env file /etc/fabric_env sets up the PATH variable and appends the Java bin path. On Red Hat, Java is installed in /usr/bin. The newer version, JDK 1.5, is installed into /usr/java/default. Appistry should put the Java bin path at the beginning of PATH so that the newer version of Java takes precedence.
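The reason the order matters is that the shell uses the first match it finds while walking PATH from left to right. A minimal sketch of that lookup, assuming the JDK's binaries live in /usr/java/default/bin (the bin subdirectory is my assumption) and using a fake file check so it runs anywhere:

```python
import posixpath


def first_on_path(command, path_value, exists):
    """Return the first directory entry in path_value that holds command.

    exists is a callable used instead of touching the real filesystem.
    """
    for directory in path_value.split(":"):
        candidate = posixpath.join(directory, command)
        if exists(candidate):
            return candidate
    return None


# Pretend both Java installations are present on the machine.
installed = {"/usr/bin/java", "/usr/java/default/bin/java"}

appended = "/usr/bin:/usr/java/default/bin"    # what fabric_env effectively does
prepended = "/usr/java/default/bin:/usr/bin"   # what I am suggesting

print(first_on_path("java", appended, installed.__contains__))
# prints /usr/bin/java
print(first_on_path("java", prepended, installed.__contains__))
# prints /usr/java/default/bin/java
```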
Why am I getting the strange error "hive package creation failed" when I run fabric_pkg create? Answer: Unknown. Appistry says this is an error message they used a long time ago. They suspect there are old libraries on the machine. I suspect that when I inadvertently got their new release and installed it, some developer had reintroduced this. Why am I getting an error when trying to create a fabric package? Answer: After some digging, Appistry has come to the conclusion that this error occurred because of permissions and different users, i.e. I was able to package as user root but not as user fabricuser, probably due to the way I installed and the resulting permissions. Appistry is going to add some code to inform the customer of the issue better. They say it will be in the next release.
If the fabric was using a worker that died midstream, it could potentially present you with a few different exception scenarios. Each of these can be dealt with. For example, if you have auto-recoveries on, the fabric client API will attempt to auto-recover that transaction. If you do not have auto-recoveries on and you receive an exception stating that the transaction is recoverable, you can still submit a recover transaction to retrieve its result. If the transaction never even got to the fabric, the API can auto-retry the transaction. These are just examples, but for the most part the fabric will do what it thinks is right behind the scenes, and if it doesn't know what to do, it will give you an exception so you can make the call.
What if I want to pass parameters of a more complex type? Can I do that? Answer: No, you can only pass a serialized object if the server knows how to deserialize it.
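A rough sketch of how a client might handle those cases by hand when auto-recovery is off. None of these names come from the Appistry client API: RecoverableError, NotDeliveredError, and the submit/recover callables are hypothetical stand-ins for whatever the real API exposes.

```python
class RecoverableError(Exception):
    """Hypothetical: the fabric accepted the transaction but the worker died."""

    def __init__(self, transaction_id):
        super().__init__(f"transaction {transaction_id} is recoverable")
        self.transaction_id = transaction_id


class NotDeliveredError(Exception):
    """Hypothetical: the transaction never reached the fabric."""


def run_transaction(submit, recover, request, max_retries=3):
    """Submit a request, recovering or retrying as the situation allows."""
    for attempt in range(max_retries + 1):
        try:
            return submit(request)
        except RecoverableError as err:
            # The fabric holds the transaction: ask for its result.
            return recover(err.transaction_id)
        except NotDeliveredError:
            # Nothing reached the fabric, so resubmitting is safe.
            if attempt == max_retries:
                raise
```

Anything other than these two cases propagates to the caller, which mirrors the "it will give you an exception so you can make the call" behavior described above.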
How did one of my fabric installations get screwed up? All installations of the fabric gave fabricuser a user id of 12228 and a group id of 12222, but one gave fabricuser a user id of 12222 as well. This was a problem when all workers in the fabric were trying to write a file to a shared disk area and that one worker did not have permission. I also noticed that one installation gave fabricuser a group id of 12172. This was not affecting the writing of the file, but it is not consistent with the other 4 worker installations. Answer: My own fault, with identities changing on the Linux systems in between Appistry installs and fabricuser creation.
What if I need to kill tasks, say misbehaving tasks, or my client is being shut down and I want to kill all associated tasks? Answer: The only way to do this now is with a fabric_ctl kill-worker command. This means from the command line. Our Java Exec could do this. It means killing the entire worker and all other tasks on the worker. They are working on an API to kill a specific task and app. This needs to be addressed by Appistry so that an admin can kill tasks without a developer having to write code to allow it.
How can I see programmatically which workers are up or down and how many cores there are? Answer: Right now you can only see this from fabric_ctl on the command line. Answer: Currently you'd have to write code to run the fabric_ctl executable to get these results, then scrape the data from the output. Appistry will look into adding these.
Is it possible to get the processing percentage in use by the fabric? Answer: Yes. Since the Monitor GUI can get it, so can you. I'm not sure exactly how, but their Apache web server can get the information.
In general, installing and running on Windows was pretty easy. Installing and running on Linux was more problematic, since you have to deal with different users, permissions, and shells. Appistry's fabric_env runs on bash; we normally use tcsh. My installed fabricuser ended up with different uid values on a couple of machines due to external screw-ups. This resulted in different workers having different permissions and not being able to delete or add fabric applications. One time I incorrectly started the fabric on the command line as a user other than fabricuser, and the entire fabric spread to my home directory.
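A check like the following would have caught this earlier. It is only a sketch: the host names are hypothetical, and the (uid, gid) pairs are typed in by hand from the numbers above. On a real Linux host the pair could be read with pwd.getpwnam("fabricuser") and its pw_uid and pw_gid fields.

```python
from collections import Counter


def inconsistent_hosts(ids_by_host):
    """Return the hosts whose (uid, gid) pair differs from the most common one."""
    if not ids_by_host:
        return {}
    expected, _count = Counter(ids_by_host.values()).most_common(1)[0]
    return {host: ids for host, ids in ids_by_host.items() if ids != expected}


# Hypothetical host names; ids taken from the situation described above.
fabricuser_ids = {
    "node1": (12228, 12222),
    "node2": (12228, 12222),
    "node3": (12228, 12222),
    "node4": (12222, 12222),  # wrong uid: could not write to the shared disk
    "node5": (12228, 12172),  # wrong gid: harmless so far, but inconsistent
}

print(inconsistent_hosts(fabricuser_ids))
# prints {'node4': (12222, 12222), 'node5': (12228, 12172)}
```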

This is a very good post. Thanks bmhardy.
I know you provide an answer below this quote, but I want to add.
The fabric keeper will start and then take care of starting the remaining services. We use bash init scripts on normal Red Hat Linux distributions to set up the fabricuser and start the fabric. Most fabric services run as that fabricuser, with the exception of the system service, which runs as root.