After many iterations, when the load on the cloud is higher, workers start going missing from the console, and I don't know why. In this case I get some chunks back from the fabric request, but I do not get the chunks from whichever worker goes down during execution (so part of the image is blank). Does Appistry have any mechanism for retrieving the result of a submitted job if a worker goes down between submit() and request.get()?
We have added two more machines to the existing 239.255.0.5 fabric, so there are now 5 workers in total. However, in the console each machine shows a different number of available workers, and that number varies over time from 1 to 5.
We have installed the CloudIQ development environment on each machine. Could that be the cause?
About tomcatwin: it is tomcatwin.far (the Tomcat service), which is starting and stopping on the 73 IP for some reason. For now I have stopped the service. I will check again and let you know if the problem remains.
As you told me, I moved the following code to a place where it executes only once in the entire application. Even so, I am still getting the above-mentioned error.
In the web application the web.config file is mapped (loaded) automatically, so I can avoid the above-mentioned error by removing the following code:
But in a Windows application we need to map the config file manually. Can you suggest a way to overcome this error in a Windows-based application?
Regarding the CPU / TPS charts in the console: without going into deep detail, the console speaks to a worker via HTTP, and the worker responds with CPU and TPS data aggregated from among the workers by mcast. If an mcast message is missed or does not arrive, the worker still responds to the console with the data it did receive. Mcast is lossy and has no retries, so it is possible to lose packets. I believe the primary issue we are observing here is that your application is overwhelming the three workers: all three are hitting 100 percent CPU and cannot keep up with the work being asked of them. These charts are not really telling you much about what is happening inside the Engine part of the system. The log_monitor logs are most helpful for that.
Let's examine that.
I do not have a copy of your Task XML, but I'm presuming you are using normal tasks and not limited or exclusive tasks. Normal tasks are "unlimited", meaning that if you submit thousands of jobs into a fabric, the software-based load balancing will spread the work among the workers as best it can. The workers, in response, will try to do as many *simultaneous* tasks as memory and CPU allow. Your inFlightUpperThreshold is 4800. With three workers, that means that *very* quickly each worker is trying to process up to 1600 jobs at once (4800 / 3). The task service has a default pool of 200 threads, and so will attempt to process up to 200 of the 1600 jobs at a given time. None of this should cause an issue with the fabric, because we are throttling at the thread pool. We test at 100% CPU, and as long as your application is not timing out at the client or at the process flow level, the work will eventually get done, though with CPU thrashing it will likely finish more slowly.
Here are things I recommend at this point:
1) I think your best starting point is to dial down the inFlightUpperThreshold to a lower number. This may take some experimentation. Again, even with the high number, the work will complete, but it may take more time. This will help with the out of memory issue also, though not alleviate it (see #2).
2) Next, it looks like you are running the Java heap out of memory. That is a problem, though only a Java problem. By default, we use the JVM defaults for memory. You should expand the memory configuration for the JVMs in the fabric by following the information here: http://www.appistry.com/community/wiki/display/cloudiq43/Java+Configuration
3) The for loop in CollectValuesFromFabric.java is throwing a null exception? It looks odd. It looks like the exception message itself is null, because it prints out "In exception null". There should be a stack trace there, right? Each exception uses up one iteration of the for loop counter, so the for loop will finish before all the jobs are ever collected (if they were collectable, which is doubtful since the service is running out of memory). I think that "null" needs to be figured out and fixed.
4) I don't think this is happening, but you might also check the default_timeout in your process flow XML. By default, the setting is INFINITE, which means that the process flow will wait forever for the task to complete. However, if you changed it to like "1" second, then with CPU levels at 100%, the process flow step will time out waiting for the task to complete, and will retry the task, adding more jobs to the system. The original task will continue to run, but the retry will start another. Of course, that second task will time out, and the cycle will continue until the retries are gone. I don't believe this is happening, but you might want to check if you specified a default_timeout value.
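To illustrate point 3, here is a minimal, self-contained Java sketch. It does not use the Engine API: `collectOne()` is a hypothetical stand-in for collecting one result, made to fail on every third call. It shows why printing only `getMessage()` can yield "In exception null" (an exception constructed without a message has a null message), and how a counter-driven for loop ends early when failures use up iterations, compared with a loop driven by the number of results actually collected.

```java
import java.util.ArrayList;
import java.util.List;

class CollectLoopSketch {
    static int calls = 0;

    // Hypothetical stand-in for collecting one job result; fails on every third call.
    static String collectOne() {
        calls++;
        if (calls % 3 == 0) {
            throw new NullPointerException(); // constructed without a message
        }
        return "chunk-" + calls;
    }

    // Counter-driven loop: each failure still consumes one iteration,
    // so fewer than 'expected' results come back.
    static List<String> collectWithForLoop(int expected) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < expected; i++) {
            try {
                results.add(collectOne());
            } catch (Exception e) {
                // getMessage() is null here, so this prints "In exception null"
                System.out.println("In exception " + e.getMessage());
            }
        }
        return results;
    }

    // Result-driven loop: keeps going until 'expected' results are in hand,
    // or gives up after 'maxFailures' failures, and prints the full stack trace.
    static List<String> collectUntilComplete(int expected, int maxFailures) {
        List<String> results = new ArrayList<>();
        int failures = 0;
        while (results.size() < expected && failures < maxFailures) {
            try {
                results.add(collectOne());
            } catch (Exception e) {
                failures++;
                e.printStackTrace(); // shows the exception type and where it was thrown
            }
        }
        return results;
    }
}
```

With six expected results, the for loop version returns only four, while the result-driven version returns all six after two logged failures.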
Looking at the log_monitor log, what is the role of the service named "tomcatwin"? Is Tomcat running as part of this application? Or did your fractal application accidentally get named "tomcatwin"? That service is stopping and restarting 28 times in 1 minute on 172.29.120.73. I'd like to understand what that is all about.
By "two workers go down" I mean that they go missing from the cloud console (I also noticed it in log_monitor).
After some minutes of execution, we repeatedly get the log below in log_monitor:
Aug 26 11:25:29 172.29.120.47 FabricProcess[E:3022] could not get task service for 'fractal_app.fractal_flow', state 'start', guid '2c678d95-21a1-4947-aa17-9dab89381fe7'
Aug 26 11:25:31 172.29.120.47 RegionService[I:63] 172.29.120.47 - region 32000 heartbeat - size: 1
I have attached log.zip on another thread.
appistry.com/community/forums/content/performance-issue-attachments
A screenshot of the CloudIQ console is also attached, where you can see how workers go missing during execution and how CPU utilization stays at 100% continuously. You can see the TPS as well.
The rolling.log file and part of the log_monitor output are attached.
The Java console logs are attached too, where you can see we are also getting an out-of-memory exception.
When you say "2 workers go down" I believe you mean they stop reporting TPS. Is that correct? Please describe what you mean by "go down."
Can you watch log_monitor messages while doing a run with fabric.submit version of your code? In a window, run "log_monitor 239.255.0.5:4000 > log.txt" I'd like to see the log output file if you don't mind. I'm wondering if something is happening to the services, and so causing the other two workers to drop out. If that happens, the last worker could be overwhelmed on network and CPU, especially since he's the client.
Also, what is your "inFlightUpperThreshold" set to? At 400 TPS, I'd imagine you're not overwhelming the network or CPUs on the three workers, though of course that is all relative to how powerful the workers are. Also, are the three workers comparable in CPU power, memory, etc.?
The fact that the 2 workers "go down", makes me think that is where the problem lies, or is an indicator of the problem.
I'd also consider running the client on a separate box against the three workers (what you are doing is legitimate, I just want to eliminate possibilities).
By the way, what you've done is great! We just need to understand what's going on and fix it.
We have 3 workers, including my machine, which also acts as the client.
We can see 3 workers running until we start the fractal application. After starting the application, the console shows a high TPS. After a few minutes, 2 workers go down for some reason, and only 1 worker (my machine, which is both client and worker) continues responding to the fabric calls. During that time it uses 100% of the CPU. The worker machines themselves do not go down, but my machine slows down, and so the TPS drops (with 3 workers the TPS is around 400; with 1 worker it is between 0 and 50, 100 at most).
Comparing my normal fractal application with the fabric version using fabric.submit: where the normal application takes 5 seconds, the fabric application takes more than 30 minutes to compute the first fractal.
Got your point regarding dividing a job. I will try that out and let you know if any issue arises.
How many workers are you running in your environment? Are they single core/multi-core?
Is your client also one of the workers?
As an aside, our load balancing works best with a 3 worker (or larger) fabric. With only 2 workers, there can be utilization like you describe, due to the pairing strategies. We're a little curious about the statement, "We have also noticed that, our workers gradually gone down while processing the requests." Is the CPU utilization going down? Did a machine go down? Do they seem to be taking less requests? Perhaps you could elaborate.
Finally, when dividing a job into smaller pieces to spread across machines, there is a balance between running the entire job on one worker and segmenting the job into the smallest possible pieces. The ideal point is somewhere in the middle, because there is inherent overhead in sending the jobs out to workers and returning the data. Ideally, you want to submit blocks of work in each request. In this case, rather than sending out an individual pixel with each request, you would send out a block of pixels in a request, say 100x100 or 200x200. When we design jobs like these, we usually try to make the block size a parameter, so we can adjust it based on the number of workers.
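As a rough illustration of a parameterized block size, the sketch below splits an image into rectangular blocks in plain Java. The class and method names (`BlockPartitioner`, `partition`) are made up for this example and are not part of any Appistry API; each `Block` would correspond to one request submitted to the fabric.

```java
import java.util.ArrayList;
import java.util.List;

class BlockPartitioner {
    // One rectangular block of pixels: top-left corner plus size.
    static final class Block {
        final int x, y, width, height;

        Block(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Splits an image into blocks of at most blockSize x blockSize pixels.
    // Blocks on the right and bottom edges are clipped to the image bounds.
    static List<Block> partition(int imageWidth, int imageHeight, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        for (int y = 0; y < imageHeight; y += blockSize) {
            for (int x = 0; x < imageWidth; x += blockSize) {
                int w = Math.min(blockSize, imageWidth - x);
                int h = Math.min(blockSize, imageHeight - y);
                blocks.add(new Block(x, y, w, h));
            }
        }
        return blocks;
    }
}
```

For a 470x470 image (the size mentioned elsewhere in this thread), a block size of 1 gives 220,900 requests, 47 gives 100, and 200 gives 9. Tuning that single parameter is how you move between the two extremes.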
What we have typically seen is application servers fronted by some sort of load balancer. In this way, web requests are evenly distributed to your application servers. The calls the application servers make into CloudIQ Engine will automatically be balanced via the Engine API. It's no different than having multiple smart clients making calls into Engine. Engine itself looks at the load of the machine overall, not just the load of Engine apps. So if machines are under heavy load because of external processes, Engine will adjust its load balancing accordingly.
Thank you, Brett. But your answer has kindled one more query in my mind.
Say I have WebSphere installed as a service and I wish to create a cluster for load-balancing purposes. The intended nodes would reside on different physical machines. If we install the nodes on different workers, will it affect CloudIQ's load balancing capability? Will it add overhead?
Welcome to Appistry Shweta! Let me see if I can help with your questions.
CloudIQ Engine does not have a built in web container. It is purely a high-availability, execution framework. It supports C/C++, Java and .NET. You can of course run other languages by wrapping them in one of the previously listed languages. In order to execute web-based apps, you do need to deploy an application server.
We have deployed servers like WebSphere and JBoss. As they support silent installs, we can deploy them via CloudIQ Manager. You can also easily deploy apps to these servers via Manager. The biggest challenges usually revolve around load balancers. If you are fronting the Cloud with a particular load balancer, you want to make sure it stays in sync with your application servers. However, many load balancers can be configured programmatically. So when you develop the start/stop scripts for your application server, you can make calls to update the load balancer as well.
If you are in the development phase, this allows you to architect things in a manner that makes cloud-enablement a little easier. The first point I would discuss is separating your business logic from your servlets. You mention your logic being in the doGet method. I would code your business logic as POJO's (Plain Old Java Objects) that can be deployed to CloudIQ Engine. I would then have your servlets do nothing more than forward the calls from your web clients (using the Engine API) to the CloudIQ Engine housing these POJO objects. So your basic architecture would be as follows:
CloudIQ Manager - Running Apache with your web app. This would be presentation layer only. No business logic. Your Servlets would use the CloudIQ Engine API's to call your code running in Engine.
CloudIQ Engine - All of your business logic would live here as Plain Old Java Objects (POJO's). This also allows you to keep your Java code much simpler. Utilizing CloudIQ Engine allows you to develop in pure Java, without the overhead of the EJB framework.
And of course, if you need more capacity, simply add more workers to your Cloud. Manager and Engine will automatically spread your code to the new machines.
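A minimal sketch of that separation, in plain Java, might look like the following. Everything here is hypothetical: `SurveyScorer` stands in for a piece of business logic and `SurveyHandler` for the body of a servlet's doGet. In a real deployment the handler would forward the call through the Engine API rather than invoking the object directly; that call is not shown because its exact form is not given in this thread.

```java
import java.util.Map;

// Business logic as a plain Java object: no servlet, HTTP, or JSP types involved,
// so it can be tested and deployed independently of the web container.
class SurveyScorer {
    double averageRating(int[] ratings) {
        if (ratings.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (int r : ratings) {
            sum += r;
        }
        return (double) sum / ratings.length;
    }
}

// Thin presentation layer: parses request parameters, delegates, formats the reply.
// Assumption: a real servlet would pass these values to Engine instead of calling
// SurveyScorer in-process.
class SurveyHandler {
    private final SurveyScorer scorer = new SurveyScorer();

    String handleGet(Map<String, String> params) {
        String raw = params.getOrDefault("ratings", "");
        String[] parts = raw.isEmpty() ? new String[0] : raw.split(",");
        int[] ratings = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            ratings[i] = Integer.parseInt(parts[i].trim());
        }
        return "average=" + scorer.averageRating(ratings);
    }
}
```

Because `SurveyScorer` knows nothing about the web layer, moving it behind a remote call later only changes the one line in the handler that invokes it.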
As far as SQL Server, I would leave that on its current infrastructure. Typical RDBMS's have specific OS related requirements that don't lend themselves to automatic provisioning. That being said, many organizations are minimizing the touch points they have with the RDBMS to long term storage only. In the past the RDBMS has been used as a temporary store for data during application processing. By moving your business logic to CloudIQ Engine, you can break up your business logic into smaller tasks that run in process flows. Using the Snapshot feature built into Process Flows, you can guarantee your execution state from step to step without having to use an external data store. The Cloud automatically backs up the data in your request object.
So if you have a process flow with 5 tasks, the data in your request object can be saved after each task. If there is a hardware failure on one of the later tasks, you do not have to start the entire process flow over. The Cloud has backed up your state after each task for you.
We are still in the development phase, so we have not packaged our application as a WAR file. We want to start making it cloud-enabled in parallel.
Our complete logic is in the doGet method of the servlets. We are using SQL Server as the database. So, how can we proceed further?
Please help.
Looking over the code, I don't see anything that jumps out at me as wrong with what you've done. However, I do have a couple of comments.
In regards to your comment: "I am not getting any TPS on console." The single fabric.execute call to the process flow would only generate a single transaction. Each step in a process flow calls a task or component method, and each call is a transaction. So, with your setup, at most you'll only see one TPS, a single blip per run unless this thread is being run continuously and so calling the process flow repeatedly.
However, I'd like to discuss what you are trying to accomplish. You have fractal.Fractal applet running on your desktop or browser. That applet calls makeNewFractal() which spins a thread running a FractalCalculator. When the calculator completes the hard work of generating the newDrawing, it calls calculatorCallback on the applet (on your desktop/browser) instructing it to update its canvas with newDrawing. In your scenario, you have moved the calculatorCallback method out into the fabric. When that call occurs, a *new* applet is created on an Engine worker for the duration of that single call, that applet (and not the one on your desktop) is updated by the calculatorCallback call, and then discarded. Your local desktop/browser-based applet is not affected at all.
What I think you really want to do to take advantage of CloudIQ Engine is to look at the FractalCalculator.calcFractal() method. The double for-loops inside that method do the costly calculations which can be run in parallel. Basically, you want to take this single-threaded set of for-loops and make them into a set of asynchronous, parallel operations or "jobs." Each "job" will compute a single pixel or a block of pixels. You could do this with local threads, but what you really want to do is take advantage of the many CPUs in your CloudIQ workers. Let's see how to do that at a high level....
1) You want to take the logic in the for-loops and move the actual computations of each pixel or a block of pixels into methods to be deployed as a CloudIQ Engine application. You'll likely deploy pieces or perhaps all of the fractal library with that application, or at least the pieces needed to do the actual pixel computations. These component methods will replace calculatorCallback.
2) You rewrite the client-side for-loop logic in calcFractal() to asynchronously submit calls to the fabric to generate a pixel or block of pixels. You do this asynchronously so that the computations run in parallel. To submit asynchronously, you do not use fabric.execute because it is single threaded and blocks until the call returns. Instead, you use fabric.submit() or fabric.submitCorrelated().
3) You add a new thread to the FractalCalculator class. This new thread runs on the client side, and collects the results of the submitted Engine jobs using fabric.waitAny() or fabric.waitCorrelated().
4) Once the client has collected all the jobs back from Engine, it would assemble the results into newDrawing (as the for-loops do now) and the client would then call the applet's calculatorCallback with the newDrawing.
By doing this, you've moved all the heavy computations out onto a fabric of Engine workers, rather than just utilizing the local computing power of the PC running the applet.
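The shape of steps 2 through 4 can be sketched locally in Java with a thread pool standing in for the fabric. This is only an analogy under stated assumptions: `ExecutorCompletionService.submit()` plays the role of fabric.submit(), `take()` plays the role of fabric.waitAny(), and `computeRow` is a made-up placeholder for the per-block pixel computation. For brevity the results are collected on the calling thread rather than on a separate collector thread.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SubmitCollectSketch {
    // Result of one job: which row it covers and the computed pixel values.
    static final class RowResult {
        final int y;
        final int[] pixels;

        RowResult(int y, int[] pixels) {
            this.y = y;
            this.pixels = pixels;
        }
    }

    // Placeholder for the costly per-pixel computation done by a worker.
    static RowResult computeRow(int y, int width) {
        int[] pixels = new int[width];
        for (int x = 0; x < width; x++) {
            pixels[x] = (x * y) % 256;
        }
        return new RowResult(y, pixels);
    }

    static int[][] render(int width, int height, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        ExecutorCompletionService<RowResult> jobs = new ExecutorCompletionService<>(pool);
        try {
            // Step 2: submit every job without waiting for any of them.
            for (int y = 0; y < height; y++) {
                final int row = y;
                jobs.submit(() -> computeRow(row, width));
            }
            // Steps 3 and 4: collect results in completion order and assemble the drawing.
            int[][] drawing = new int[height][];
            for (int collected = 0; collected < height; collected++) {
                RowResult r = jobs.take().get(); // get() rethrows a job's failure
                drawing[r.y] = r.pixels;
            }
            return drawing;
        } catch (InterruptedException | ExecutionException e) {
            // Surface failures to the caller instead of returning a partial image.
            throw new IllegalStateException("job failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Each result carries its own row index, so the order in which jobs finish does not matter when assembling the final drawing.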
I have a .NET C# simple asynchronous client example handy, which I'll post below. The .NET project files are part of the samples from our doc site here. In the samples, look at src/dotnet/monte_carlo_pi/monte_carlo_pi_app (which is the Engine application), and at src/dotnet/simple_async_client (which is below). SimpleAsyncClient calls the monte_carlo_pi_app Engine application asynchronously (submitting and collecting results on separate threads) to generate monte carlo PI values (like generating your pixels or blocks of pixels) and then generates a final monte carlo PI calculation and outputs it from the client (like calling calculatorCallback to update the fractal applet).
Though threading is different between C# and Java, the basic principles of the methods for SubmitRequests (step #2 above) and WaitOnRequests (step #3 above) are the same.
I hope this helps!
Thanks,
Guerry
The asynchronous client sample:
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
using Appistry.FabricAPI;

namespace Appistry.Samples.SimpleAsyncClient
{
    // NOTE: THIS SAMPLE CODE WORKS BUT MAY CHANGE AS IT IS A WORK IN PROGRESS
    // THE TUTORIAL THAT WILL ACCOMPANY IT IS NOT YET COMPLETE.
    class SimpleClientAsync
    {
        // async operations related members
        private Fabric fabric;
        private AutoResetEvent processCompletedRequests;
        private AutoResetEvent allowMoreRequestsInFlight;
        private object mutex = new object();
        private int inFlightUpperThreshold;
        private int inFlightLowerThreshold;
        private int workerCount;
        private DateTime startTime;
        private long totalFabricRequestsToSubmit;
        private int requestsInFlight;

        // ONLY use and reference the property to assure thread safety
        public int RequestsInFlight
        {
            get { return requestsInFlight; }
            // Make requestsInFlight setter thread safe
            set { lock (mutex) { requestsInFlight = value; } }
        }

        // monte carlo pi calculation related properties
        private int pi_totalPointsToCalculate;
        private int pi_pointsToCalculatePerRequest;

        public SimpleClientAsync()
        {
            // TODO: MAY MAKE THE CONFIGURABLE FIELDS INTO PROPERTIES
            // AND SET THE PROPERTIES IN MAIN
            processCompletedRequests = new AutoResetEvent(false);
            allowMoreRequestsInFlight = new AutoResetEvent(false);

            // CONFIGURE: set the number of workers to your fabric. we use
            // this value to figure out how many jobs to keep in flight in
            // the fabric at once so that the client can drive load to the fabric
            workerCount = 1;

            // CONFIGURE: construct a fabric API object instance specifying fabric
            // address values from fabric.cfg, Mcast TTL and encryption type
            // Mcast TTL must be 1 or greater if the fabric client is running
            // on a box separate from the fabric workers.
            fabric = new Fabric("239.255.0.1", 31000, 1, Encryption.NONE);

            // CONFIGURE: set total number of monte carlo pi points to calculate
            pi_totalPointsToCalculate = 20000000;

            // CONFIGURE: tune this to number of monte carlo pi points
            // for each worker to compute per individual fabric request
            pi_pointsToCalculatePerRequest = 5000;

            // CONFIGURE: if desired, tune upper threshold (1600 here) to be
            // max number of concurrent requests allowed per worker. When this
            // threshold is hit, the client suspends sending more requests to
            // fabric. The lower threshold can also be tuned. When the number of
            // active requests in the fabric falls below this threshold, the
            // client resumes submitting jobs to the fabric. Typically the lower
            // threshold is about 60% of the upper threshold.
            inFlightUpperThreshold = workerCount * 1600;
            inFlightLowerThreshold = workerCount * 1300;
            totalFabricRequestsToSubmit = pi_totalPointsToCalculate / pi_pointsToCalculatePerRequest;
        }

        public void SubmitRequests()
        {
            FabricRequest request = null;
            startTime = DateTime.Now;
            Console.WriteLine("Submitting fabric requests");
            for (int i = 0; i < totalFabricRequestsToSubmit; i++)
            {
                try
                {
                    if (RequestsInFlight > inFlightUpperThreshold)
                        allowMoreRequestsInFlight.WaitOne();

                    // construct a fabric request object instance specifying
                    // the fabric application and process flow to run
                    request = new FabricRequest("monte_carlo_pi_dotnet_app", "monte_carlo_pi_flow");
                    request["total-points-to-compute"] = pi_pointsToCalculatePerRequest;
                    fabric.Submit(request);
                    RequestsInFlight++;
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error on submit to fabric {0}:" + ex.Message, i);
                }
                finally
                {
                    // request is still null if construction failed on the first pass
                    if (request != null)
                        request.Dispose();
                    if (RequestsInFlight == 10)
                        processCompletedRequests.Set();
                }
            }
            Console.WriteLine("Done submitting requests to fabric");
        }

        public void WaitOnRequests()
        {
            // wait for submitter notify ready
            processCompletedRequests.WaitOne();

            // monte carlo pi calculation related variables
            long pi_totalPointsAttempted = 0;
            long pi_totalPointsInCircle = 0;
            while (RequestsInFlight >= 1)
            {
                try
                {
                    FabricRequest request;
                    if (fabric.Wait(out request, 100000))
                    {
                        pi_totalPointsInCircle += (long)request["computed-points"];
                        pi_totalPointsAttempted += pi_pointsToCalculatePerRequest;
                        RequestsInFlight--;
                    }
                    else
                    {
                        Console.WriteLine("Wait timed out! You may want to tune some parameters.");
                    }
                    // guard against a null request after a timed-out wait
                    if (request != null)
                        request.Dispose();
                    if ((RequestsInFlight < inFlightLowerThreshold))
                        allowMoreRequestsInFlight.Set();
                }
                catch (Exception ex)
                {
                    Console.WriteLine("Wait Exception:" + ex.Message);
                }
            }
            TimeSpan elapsed = DateTime.Now - startTime;
            double pi = (4.0 * pi_totalPointsInCircle / pi_totalPointsAttempted);
            Console.WriteLine("Computed Pi of {0} using Monte Carlo method in {1} seconds", pi, elapsed.TotalSeconds);
            Console.WriteLine("Workers: {0}", workerCount);
            Console.WriteLine("Points computed per request: {0}", pi_pointsToCalculatePerRequest);
            Console.WriteLine("Total computed points: " + pi_totalPointsAttempted);
            Console.WriteLine("Total valid points: " + pi_totalPointsInCircle);
        }

        [MTAThread]
        static void Main(string[] args)
        {
            SimpleClientAsync client = new SimpleClientAsync();
            Thread submitThread = new Thread(new ThreadStart(client.SubmitRequests));
            Thread waitThread = new Thread(new ThreadStart(client.WaitOnRequests));
            submitThread.Start();
            waitThread.Start();
            submitThread.Join();
            waitThread.Join();
        }
    }
}
The Engine application sample:
using System;
using Appistry.Task;

namespace Appistry.Samples.Pi
{
    public class MonteCarloPi
    {
        // OUTPUT: map method return value into "computed-points"
        // key in current fabric request object
        // INPUT: on execution, map value for "total-points-to-compute"
        // from current fabric request object into method parameter
        [return: TaskReturnValue("computed-points")]
        public long computePoints([TaskParameter("total-points-to-compute")] long totalPoints)
        {
            double x, y;
            int pointsInCircle = 0;

            // one could insert a better random number generator here....
            Random generator = new Random((int)DateTime.Now.Ticks);

            // compute points using monte carlo method by keeping track of
            // points falling in the "circle" (i.e. computed as <= 1)
            for (int i = 0; i < totalPoints; i++)
            {
                x = generator.NextDouble();
                y = generator.NextDouble();
                if (Math.Sqrt(x * x + y * y) <= 1)
                    pointsInCircle++;
            }

            // return count of points to client for Pi computation
            return pointsInCircle;
        }
    }
}
First off, I would suggest starting a new thread. If you are having an issue seeing a particular method, having a new posting with that title would probably be better for the community. It makes searching easier and benefits all users.
That said, it looks like you have declared your method as 'protected'. That will limit its scope. We cannot access a protected method. I would suggest making that method public and see if that solves the issue.
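How Engine actually looks up task methods is not described in this thread, so the following is only an illustration of the general Java behaviour behind that advice: the standard reflection call `Class.getMethods()` returns public methods only, so a protected method is invisible to any framework that discovers methods that way. `FractalTask` and its method names are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

class VisibilitySketch {
    // Hypothetical task class with one public and one protected method.
    static class FractalTask {
        public int computeBlock(int size) {
            return size * size;
        }

        protected int hiddenHelper(int size) {
            return size;
        }
    }

    // Lists the public methods a caller can discover via getMethods(),
    // skipping the ones inherited from java.lang.Object.
    static List<String> publicMethodNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Method m : type.getMethods()) {
            if (m.getDeclaringClass() != Object.class) {
                names.add(m.getName());
            }
        }
        return names;
    }
}
```

Calling `publicMethodNames(FractalTask.class)` lists `computeBlock` but not `hiddenHelper`; changing the modifier to public makes the second method show up as well.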
So you have deployed Apache as a Service on CloudIQ Manager? Is your web app packaged up as a WAR file? If so, I would read here: http://www.appistry.com/community/wiki/display/cloudiq43/Deploy+an+Appli... on deploying an Application to a Service. In this case, Apache is the service your app depends on. If you look at the examples further down the page I referenced, you'll see this line:
<service-app service="win-service">
Instead of "win-service" you would put the name of your apache service (the exact name of your apache service in its FAR XML definition)
Deploying a Service to an app allows you to define the dependency. Your app depends on Apache being installed, and the XML definition will enforce this.
Over time I would recommend separating out the business logic. For one, it will make code management easier, but two, it would allow you to put those processes into CloudIQ Engine for reliability.
As I mentioned earlier, I am working on the fractal application at http://www.gui.net/fractaljava.html (complete source available online).
As a first step, I am just trying to put one of its methods into a Cloud Engine application.
I am putting the fractal.Fractal.calculatorCallback method into the cloud.
Hi,
We have created a survey application in which we create surveys for a telecom service. We are using Apache Tomcat. I have deployed Tomcat on the cloud.
We are using simple servlets and JSP, and our business logic is coupled with the presentation layer.
Please tell me how I can proceed further.
Thanks in advance.
You should probably give a little more detail about your application. What is your web container? Apache? What type of business logic is in the application? Are you using an MVC type of framework? Do you have your business logic separated from your presentation layer? If your business logic is tightly coupled to the presentation layer, that limits the benefits you can achieve.
Perhaps if you could elaborate a little more, we could talk about the best way to help you move forward.
Dan,
Just to verify, you have one request object, that you pass to each thread. Then in the thread you get the FamSession out of that request, correct? I'll try to reproduce this error.
As far as using different Fam Sessions, you can create an external FamSession object. So if you had a separate thread,whose only job was to perform a FAM op, then get out, you could create your own session. It basically takes an ID, and a version. You could generate a GUID for each use. I would use the Fam Session from the Request for the long running transaction. This might be worth trying.
Brett
Hi Nilay,
Glad to see it's going much better!
I do not think the workers are "going down". The data fed to the console is being collected by multicast which is UDP-based. As opposed to TCP, UDP is a lossy, non-retried protocol, and packets can be dropped by the network. If your workers are very busy CPU-wise, it's possible packets are being missed, and so the data coming back to the console will be spotty.
I'm more curious about the chunks not coming back. I noticed in your Fabric Method code sample above that you are catching and dropping exceptions. If an exception is occurring in that method, the imageMap returned will be empty, and so you'd have a blank chunk. If you do not catch the exception there, it will propagate to the process flow, where you can branch to another task to handle it, or just let it continue to propagate. Ultimately, it will make it to your client where you can handle the exception, and print out what happened. I'm not saying you're getting an exception, but I'd make that change so you can see one if you do.
The same is true in your collect thread (per the code above): you are eating the exception and not printing anything out. If your process flow and/or task encounters a problem, we report it back through exceptions. You need to be looking at those. That may well explain the blank image chunks that you are getting.
For example, one exception you may be eating in the collect thread would be a process flow timeout. Are your process flow timeouts set to "infinite" in the XML? If not, and your task work takes too long (as defined by the timeouts in the flow XML), then the flow will time out the task. If the flow is simple and does not branch on a timeout, the timeout will result in the flow just going to the "finish" state and returning to the client. In that case, you'd likely have a null imageMap or nothing.
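To make the point about eaten exceptions concrete, here is a minimal sketch in plain Java. It does not use the CloudIQ API; a local thread pool stands in for the fabric, and the names ChunkErrorDemo, compute, computeSwallowing and collect are made up for illustration. It only shows how a swallowed exception turns into an empty map, while a propagated one reaches the collecting side with its cause intact.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkErrorDemo {

    // Hypothetical stand-in for the real chunk computation; it fails for chunk 3.
    static Map<Integer, Integer> compute(int chunkId) {
        if (chunkId == 3) {
            throw new IllegalStateException("chunk " + chunkId + " failed");
        }
        Map<Integer, Integer> imageMap = new HashMap<>();
        imageMap.put(chunkId, chunkId * 10);
        return imageMap;
    }

    // Anti-pattern: the exception is eaten, so the caller only sees an empty map (a blank chunk).
    static Map<Integer, Integer> computeSwallowing(int chunkId) {
        try {
            return compute(chunkId);
        } catch (RuntimeException e) {
            return Collections.emptyMap();
        }
    }

    // Collect side: report the failure instead of dropping it.
    static String collect(Future<Map<Integer, Integer>> future) throws InterruptedException {
        try {
            return "ok " + future.get();
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2); // local stand-in for the fabric
        try {
            System.out.println(computeSwallowing(3));                   // {}
            System.out.println(collect(pool.submit(() -> compute(1)))); // ok {1=10}
            System.out.println(collect(pool.submit(() -> compute(3)))); // failed: chunk 3 failed
        } finally {
            pool.shutdown();
        }
    }
}
```

In the first call the failure is invisible; in the last one the client can see which chunk failed and why.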
Last, I'd highly recommend watching log_monitor output rather than the console for what you are doing. You will get far more detailed information. For example, if a worker really is "going down" or if the task process is crashing, you'll see it there, along with details about what is happening. Just do "log_monitor <fabric-address>".
Hope that helps!
Hi,
I'm not sure I follow the problem. You can run that line of code instead of removing it. However, as stated, you need to run it only once in your application. .NET will not allow you to register the channel referenced in the config twice. This is a .NET thing, not a CloudIQ thing.
Without more detail and code, I'm not sure what else to add.
Hi
First of all thanks Brett & Guerry for your support.
The idea of a balance between running the entire job on one worker and segmenting the job into the smallest possible pieces works. Earlier I was submitting each pixel to the fabric and getting its response, so it was taking lots of time.
There were 470*470 pixels in total. I have created chunks of 47*47 and created 100 threads to submit and collect jobs, and the performance is now much, much better.
At the start, when there are fewer iterations, the asynchronous call to the fabric gives lower performance, so at that point I use the synchronous call (fabric.execute()), and after some iterations I switch to the asynchronous call (fabric.submit()). The complete application now performs better than the standalone application (100 to 150% better).
After many iterations, when the load on the cloud is higher, workers start going missing from the console; I don't know why. In this case I get some chunks back from the fabric request, but I do not get the chunks from whichever worker goes down during execution (so some part of the image is blank). Does Appistry have any mechanism to get the result of a submitted job if a worker goes down between submit() and request.get()?
We have added two more machines to the existing 239.255.0.5 fabric, so there are now 5 workers in total. But in the console, each machine shows a different number of workers available, and it varies with time, i.e. the number of workers varies from 1 to 5.
We have installed the CloudIQ development environment on each machine; could that be the cause?
About tomcatwin: it's tomcatwin.far (the Tomcat service), which is starting and stopping on the 73 IP for some reason. For now I have stopped the service. I will check again and let you know if the problem remains.
Nilay
Hello
Thanks for replying
As you told me, I moved the following code to a place where it executes only once in the entire application. Even so, I am still getting the error mentioned above.
In the web application, the web.config file is mapped automatically, so I can avoid the error by removing the following code:
RemotingConfiguration.Configure(Server.MapPath("web.config"), false);
But in a Windows application we need to map the config file manually. Can you give me a solution for this error in a Windows-based application?
Nilay,
In regards to the CPU / TPS charts in the console. Without going into deep detail, the console speaks to a worker via HTTP, and the worker responds with CPU and TPS data aggregated from among the workers by mcast. If an mcast message is missed or does not arrive, the worker still responds with the data it received to the console. Mcast is a lossy protocol without retries, so it is possible to lose packets. I believe the primary issue we are observing here is that your application is overwhelming the three workers, and all three workers are hitting 100 percent CPU and cannot keep up with the work being asked of them. These charts are not really telling you much about what is happening inside the Engine part of the system. The log monitor logs are most helpful with that.
Let's examine that.
I do not have a copy of your Task XML, but I'm presuming you are using normal tasks and not limited or exclusive tasks. Normal tasks are "unlimited" meaning that if you submit 1000's of jobs into a fabric, the software-based load balancing will spread the work among the workers as best as it can. And, the workers, in response, will try to do as many *simultaneous* tasks as memory and CPU allow. Your inFlightUpperThreshold is 4800. That means that *very* quickly each worker is trying to process up to 1600 jobs at once. The task service has a default pool of 200 threads, and so will attempt to process up to 200 of the 1600 jobs at a given time. None of this should cause an issue with the fabric. We are throttling at the thread pool. We test at 100% CPU and as long as your application is not timing out at the client or at the process flow level, the work will eventually get done, though with CPU thrashing, you'll likely get done slower.
Here are things I recommend at this point:
1) I think your best starting point is to dial down the inFlightUpperThreshold to a lower number. This may take some experimentation. Again, even with the high number, the work will complete, but it may take more time. This will help with the out of memory issue also, though not alleviate it (see #2).
2) Next, it looks like you are running the Java Heap out of memory. That is a problem, though only a Java problem. By default, we use the JVM defaults for memory. You should expand the memory configuration for the JVM's in the fabric by following the information here:
http://www.appistry.com/community/wiki/display/cloudiq43/Java+Configuration
3) The for loop in CollectValuesFromFabric.java is throwing a null exception? It looks odd. It looks like the Exception itself is null because it prints out "In exception null". There should be a stack trace there, right? That exception is eating up the for loop counter, so the for loop will finish before all the jobs are collected (if they were collectable at all, given that the service is running out of memory). I think that "null" needs to be figured out and fixed.
4) I don't think this is happening, but you might also check the default_timeout in your process flow XML. By default, the setting is INFINITE, which means that the process flow will wait forever for the task to complete. However, if you changed it to like "1" second, then with CPU levels at 100%, the process flow step will time out waiting for the task to complete, and will retry the task, adding more jobs to the system. The original task will continue to run, but the retry will start another. Of course, that second task will time out, and the cycle will continue until the retries are gone. I don't believe this is happening, but you might want to check if you specified a default_timeout value.
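On item 3, a small sketch of one way a log line like "In exception null" can come about in Java: an exception created without a message returns null from getMessage(). This is only a guess at what CollectValuesFromFabric.java is doing, since I don't have that file; CollectLoopDemo, collectJob and collectAll are hypothetical names, and no CloudIQ calls are involved. The loop also records which job failed, so a failure does not silently use up an iteration.

```java
import java.util.ArrayList;
import java.util.List;

class CollectLoopDemo {

    // Hypothetical stand-in for collecting one job; job 2 fails with a message-less exception.
    static int collectJob(int jobId) {
        if (jobId == 2) {
            throw new IllegalStateException(); // no message, so getMessage() returns null
        }
        return jobId * jobId;
    }

    // The kind of terse log line that hides what actually went wrong.
    static String terseLog(Exception e) {
        return "In exception " + e.getMessage();
    }

    // Collects every job and records failures by job id instead of losing them.
    static List<Integer> collectAll(int jobCount, List<Integer> failedJobs) {
        List<Integer> results = new ArrayList<>();
        for (int jobId = 0; jobId < jobCount; jobId++) {
            try {
                results.add(collectJob(jobId));
            } catch (RuntimeException e) {
                failedJobs.add(jobId);
                e.printStackTrace(); // shows the exception class and the line that threw
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(terseLog(new IllegalStateException())); // In exception null
        List<Integer> failed = new ArrayList<>();
        List<Integer> results = collectAll(4, failed);
        System.out.println(results + " failed=" + failed);         // [0, 1, 9] failed=[2]
    }
}
```

Printing the stack trace (or at least e.toString()) rather than only getMessage() is usually enough to see what the "null" really is.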
Looking at the log_monitor log, what is the role of the service named "tomcatwin"? Is Tomcat running as part of this application? Or did your fractal application accidentally get named "tomcatwin"? That service is stopping and restarting 28 times in 1 minute on 172.29.120.73. I'd like to understand what that is all about.
That's all for now. Let's try that.
Hi Guerry
By "two workers go down" I mean they went missing from the cloud console (I also noticed it in log_monitor).
After some minutes of execution, we repeatedly get the log below in log_monitor:
I have attached log.zip on another thread.
appistry.com/community/forums/content/performance-issue-attachments
A screenshot of the CloudIQ console is also attached, where you can see how workers go missing during execution and CPU utilization stays at 100% continuously. You can see the TPS as well.
I have attached the rolling.log file and part of the log_monitor command output.
I have also attached logs from the Java console, where you can see we are getting an out of memory exception too.
Nilay,
When you say "2 workers go down" I believe you mean they stop reporting TPS. Is that correct? Please describe what you mean by "go down."
Can you watch log_monitor messages while doing a run with fabric.submit version of your code? In a window, run "log_monitor 239.255.0.5:4000 > log.txt" I'd like to see the log output file if you don't mind. I'm wondering if something is happening to the services, and so causing the other two workers to drop out. If that happens, the last worker could be overwhelmed on network and CPU, especially since he's the client.
Also, what is your "inFlightUpperThreshold" set to? At 400 TPS, I'd imagine you're not overwhelming the network or CPUs on the three workers, though of course that is all relative to how powerful the workers are. Also, are the three workers comparable in CPU power, memory, etc.?
The fact that the 2 workers "go down", makes me think that is where the problem lies, or is an indicator of the problem.
I'd also consider running the client on a separate box against the three workers (what you are doing is legitimate, I just want to eliminate possibilities).
By the way, what you've done is great! We just need to understand what's going on and fix it.
Thanks,
Guerry
Hi Brett
We have 3 workers, including my machine, which acts as the client too.
We can see 3 workers running until we start the fractal application. After starting the application, the console shows a high TPS, and after a few minutes 2 workers go down for some reason, and only 1 worker (my machine, which is the client as well as a worker) continues responding to the fabric calls. During that time it uses 100% of the CPU. Apart from my machine, none of the worker machines actually go down, but my machine slows down, and so the TPS drops (with 3 workers the TPS is around 400; with 1 worker it is between 0 and 50, 100 at most).
Comparing my normal fractal application to the fabric fractal application using fabric.submit: where the normal application takes 5 seconds, the fabric application takes more than 30 minutes to compute the first fractal.
Got your point regarding dividing a job. Will try out that and let you know if any issue arises.
Nilay
A few quick questions and comments.
How many workers are you running in your environment? Are they single core/multi-core?
Is your client also one of the workers?
As an aside, our load balancing works best with a 3 worker (or larger) fabric. With only 2 workers, there can be utilization like you describe, due to the pairing strategies. We're a little curious about the statement, "We have also noticed that, our workers gradually gone down while processing the requests." Is the CPU utilization going down? Did a machine go down? Do they seem to be taking fewer requests? Perhaps you could elaborate.
Finally, when dividing a job into smaller pieces to spread across machines, there is a balance between running the entire job on one worker, vs segmenting the job into the smallest possible number of pieces. The ideal point is actually somewhere in the middle (due to the inherent overhead with sending out the jobs to workers, and returning the data). Ideally, you want to submit blocks of work in each request. In this case, rather than sending out an individual pixel with each request, you would ideally send out a block of pixels in a request, say 100x100 or 200x200. When we design jobs like these, we usually try to make that parametrized, so we can adjust the block size based on the number of workers.
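As a rough illustration of making the block size a parameter, here is a small Java sketch that tiles an image into square blocks, with smaller blocks at the right and bottom edges when the size does not divide evenly. BlockPartitioner and Block are hypothetical names, not part of any CloudIQ sample, and 470x470 is just the image size mentioned elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;

class BlockPartitioner {

    // One rectangular block of pixels: top-left corner plus size.
    static final class Block {
        final int x;
        final int y;
        final int width;
        final int height;

        Block(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Splits the image into blocks of at most blockSize x blockSize pixels.
    static List<Block> partition(int imageWidth, int imageHeight, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        for (int y = 0; y < imageHeight; y += blockSize) {
            for (int x = 0; x < imageWidth; x += blockSize) {
                int width = Math.min(blockSize, imageWidth - x);   // clip at the right edge
                int height = Math.min(blockSize, imageHeight - y); // clip at the bottom edge
                blocks.add(new Block(x, y, width, height));
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(partition(470, 470, 47).size());  // 100
        System.out.println(partition(470, 470, 100).size()); // 25
        System.out.println(partition(470, 470, 200).size()); // 9
    }
}
```

Each block would then become one request, so changing a single number trades per-request overhead against how evenly the work spreads over the workers.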
We recently ran through an update of this recipe for installing on Ubuntu. Here's the latest:
Ubuntu / CloudIQ Install Notes
The standard Linux installation notes still apply. Please read them here:
http://www.appistry.com/community/wiki/display/cloudiq43/Linux+Installation
0. Download RHEL 5 32-bit or 64-bit version from Peer2Peer
http://www.appistry.com/community/content/downloads
1. Setup addr.cfg.install as explained here: http://www.appistry.com/community/wiki/display/cloudiq43/Linux+Installation
2. Convert .rpm to .deb
sudo alien -kc <.rpm installer name e.g. appistry-cloudiq-4.3.4.1-rhel5.rpm>
3. Install .deb
dpkg --install <.deb installer name e.g. appistry-cloudiq-4.3.4.1-1_i386.deb>
4. Script fails while trying to add user "fabricuser" (this doesn't always fail)
Proceed with steps below to recover
5. CloudIQ services aren't installed b/c chkconfig doesn't exist on Debian-based distros:
update-rc.d fabric_keeper defaults
update-rc.d fabric_system_service defaults
6. Manually create fabric user
sudo useradd -s /bin/bash -g fabricuser fabricuser
7. Give fabricuser ownership of the CloudIQ installation:
chown -R fabricuser:fabricuser /usr/local/appistry
8. Create links for libcrypto and libssl
cd /usr/lib
ln -s libcrypto.so.0.9.8 libcrypto.so.6
ln -s libssl.so.0.9.8 libssl.so.6
9. Patch /bin/arch which does not exist in Ubuntu since v7.10. The /etc/fabric_env script references /bin/arch
echo "uname -m" | sudo tee /bin/arch
sudo chmod +x /bin/arch
10. Source /etc/fabric_env
source /etc/fabric_env
11. Add ". /etc/fabric_env" to the user's ~/.bashrc file
12. Reboot and fabric_keeper and fabric_system_service should start automatically
13. Run "log_monitor <fabric-address>" and see log output from your new worker. The fabric-address here is the one you created when doing the addr.cfg.install step.
If you see output after step 13, you're good to go. If not, then refer to items #13 and #14 in the FAQ: http://www.appistry.com/community/wiki/display/cloudiq43/Frequently+Aske...
Thanks,
Guerry
What we have typically seen is application servers fronted by some sort of load balancer. In this way, web requests are evenly distributed to your application servers. The calls the application servers make into CloudIQ Engine will be automatically balanced via the Engine API. It's no different than having multiple smart clients making calls into Engine. Engine itself looks at the load of the machine overall, not just the load of Engine apps. So if machines are under heavy load because of external processes, Engine will adjust its load balancing accordingly.
Thank you, Justin
Thank you, Brett. But your answer has kindled one more query in my mind.
Say I have WebSphere installed as a service and I wish to make a cluster for load balancing purposes. The intended nodes would reside on different physical machines. If we install the nodes on different workers, will it affect CloudIQ's load balancing capability? Will this be an overhead?
Hi, see http://www.appistry.com/community/forums/content/example-linux-service-d... for Linux examples.
Welcome to Appistry Shweta! Let me see if I can help with your questions.
CloudIQ Engine does not have a built-in web container. It is purely a high-availability execution framework. It supports C/C++, Java and .NET. You can of course run other languages by wrapping them in one of the previously listed languages. In order to execute web-based apps, you do need to deploy an application server.
We have deployed servers like WebSphere and JBoss. As they support silent installs, we can deploy them via CloudIQ Manager. You can also easily deploy apps to these servers via Manager. The biggest challenges usually revolve around load balancers. If you are fronting the Cloud with a particular load balancer, you want to make sure it stays in sync with your application servers. However many load balancers can be configured programmatically. So when you develop the start/stop scripts for your application server, you can make calls to update the load balancer as well.
Hope this helps you get started.
If you are in the development phase, you can architect things in a way that makes cloud-enablement a little easier. The first point I would discuss is separating your business logic from your servlets. You mention your logic being in the doGet method. I would code your business logic as POJOs (Plain Old Java Objects) that can be deployed to CloudIQ Engine. I would then have your servlets do nothing more than forward the calls from your web clients (using the Engine API) to the CloudIQ Engine housing these POJOs. So your basic architecture would be as follows:
CloudIQ Manager - Running Apache with your web app. This would be presentation layer only. No business logic. Your Servlets would use the CloudIQ Engine API's to call your code running in Engine.
CloudIQ Engine - All of your business logic would live here as Plain Old Java Objects (POJO's). This also allows you to keep your Java code much simpler. Utilizing CloudIQ Engine allows you to develop in pure Java, without the overhead of the EJB framework.
And of course, if you need more capacity, simply add more workers to your Cloud. Manager and Engine will automatically spread your code to the new machines.
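To sketch the separation described above in code: the business logic lives in a plain class with no servlet or container types, and the presentation side only parses the request and forwards the call. Everything here is made up for illustration (SurveyRequestHandler, ScoringBackend, SurveyScorer and the rating rule). In a real web app the handler role would be played by your servlet's doGet, and the backend call would go through the Engine API rather than a direct method reference.

```java
import java.util.Locale;

// Presentation-layer stand-in: parses input, forwards the call, formats the reply.
class SurveyRequestHandler {

    // Hypothetical seam where the call into the business logic is made.
    interface ScoringBackend {
        double averageRating(int[] ratings);
    }

    private final ScoringBackend backend;

    SurveyRequestHandler(ScoringBackend backend) {
        this.backend = backend;
    }

    // Takes a raw request parameter such as "4,5,3"; contains no business rules itself.
    String handle(String ratingsParam) {
        String[] parts = ratingsParam.split(",");
        int[] ratings = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            ratings[i] = Integer.parseInt(parts[i].trim());
        }
        return String.format(Locale.ROOT, "Average rating: %.2f", backend.averageRating(ratings));
    }

    public static void main(String[] args) {
        SurveyScorer scorer = new SurveyScorer();
        SurveyRequestHandler handler = new SurveyRequestHandler(scorer::averageRating);
        System.out.println(handler.handle("4,5,3")); // Average rating: 4.00
    }
}

// Business logic as a POJO: no web or container dependencies, so it can be tested and moved freely.
class SurveyScorer {
    double averageRating(int[] ratings) {
        if (ratings.length == 0) {
            throw new IllegalArgumentException("no ratings");
        }
        int sum = 0;
        for (int rating : ratings) {
            sum += rating;
        }
        return (double) sum / ratings.length;
    }
}
```

Because the handler only depends on the small interface, swapping the local object for a remote call later does not touch the presentation code.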
As far as SQL Server, I would leave that on its current infrastructure. Typical RDBMS's have specific OS related requirements that don't lend themselves to automatic provisioning. That being said, many organizations are minimizing the touch points they have with the RDBMS to long term storage only. In the past the RDBMS has been used as a temporary store for data during application processing. By moving your business logic to CloudIQ Engine, you can break up your business logic into smaller tasks that run in process flows. Using the Snapshot feature built into Process Flows, you can guarantee your execution state from step to step without having to use an external data store. The Cloud automatically backs up the data in your request object.
So if you have a process flow with 5 tasks, the data in your request object can be saved after each task. If there is a hardware failure on one of the later tasks, you do not have to start the entire process flow over. The Cloud has backed up your state after each task for you.
Hi Brett,
We are still in the development phase, so we have not packaged our application as a WAR file. We want to start making it cloud-enabled in parallel.
Our complete logic is in the doGet method of the servlets. We are using SQL Server as the database. So, how can we proceed further?
Please help.
Hi Nilay,
Looking over the code, I don't see anything that jumps out at me as wrong with what you've done. However, I do have a couple of comments.
In regards to your comment: "I am not getting any TPS on console." The single fabric.execute call to the process flow would only generate a single transaction. Each step in a process flow calls a task or component method, and each call is a transaction. So, with your setup, at most you'll only see one TPS, a single blip per run unless this thread is being run continuously and so calling the process flow repeatedly.
However, I'd like to discuss what you are trying to accomplish. You have fractal.Fractal applet running on your desktop or browser. That applet calls makeNewFractal() which spins a thread running a FractalCalculator. When the calculator completes the hard work of generating the newDrawing, it calls calculatorCallback on the applet (on your desktop/browser) instructing it to update its canvas with newDrawing. In your scenario, you have moved the calculatorCallback method out into the fabric. When that call occurs, a *new* applet is created on an Engine worker for the duration of that single call, that applet (and not the one on your desktop) is updated by the calculatorCallback call, and then discarded. Your local desktop/browser-based applet is not affected at all.
What I think you really want to do to take advantage of CloudIQ Engine is to look at the FractalCalculator.calcFractal() method. The double for-loops inside that method do the costly calculations which can be run in parallel. Basically, you want to take this single-threaded set of for-loops and make them into a set of asynchronous, parallel operations or "jobs." Each "job" will compute a single pixel or a block of pixels. You could do this with local threads, but what you really want to do is take advantage of the many CPUs in your CloudIQ workers. Let's see how to do that at a high level....
1) You want to take the logic in the for-loops and move the actual computations of each pixel or a block of pixels into methods to be deployed as a CloudIQ Engine application. You'll likely deploy pieces or perhaps all of the fractal library with that application, or at least the pieces needed to do the actual pixel computations. These component methods will replace calculatorCallback.
2) You rewrite the client-side for-loop logic in calcFractal() to asynchronously submit calls to the fabric to generate a pixel or block of pixels. You do this asynchronously so that the computations run in parallel. To submit asynchronously, you do not use fabric.execute because it is single threaded and blocks until the call returns. Instead, you use fabric.submit() or fabric.submitCorrelated().
3) You add a new thread to the FractalCalculator class. This new thread runs on the client side, and collects the results of the submitted Engine jobs using fabric.waitAny() or fabric.waitCorrelated().
4) Once the client has collected all the jobs back from Engine, it would assemble the results into newDrawing (as the for-loops do now) and the client would then call the applet's calculatorCallback with the newDrawing.
By doing this, you've moved all the heavy computations out onto a fabric of Engine workers, rather than just utilizing the local computing power of the PC running the applet.
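The four steps above can be sketched locally in Java. To be clear about the assumptions: this does not use the Engine API. A java.util.concurrent thread pool stands in for the fabric, CompletionService.submit() plays the role of fabric.submit(), and take() plays the role of waiting for any finished job. The pixel function is a generic escape-time iteration, not the code from the fractal library, and ParallelFractalSketch and its members are hypothetical names.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelFractalSketch {

    static final int MAX_ITER = 100;

    // Result of one job: the first row it covers and the values for its rows.
    static final class RowBlock {
        final int startRow;
        final int[][] pixels;

        RowBlock(int startRow, int[][] pixels) {
            this.startRow = startRow;
            this.pixels = pixels;
        }
    }

    // Step 1: the per-pixel computation (a generic escape-time count as a stand-in).
    static int iterations(double cr, double ci) {
        double zr = 0.0;
        double zi = 0.0;
        int n = 0;
        while (n < MAX_ITER && zr * zr + zi * zi <= 4.0) {
            double next = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = next;
            n++;
        }
        return n;
    }

    // One "job": compute a block of rows of a size x size image.
    static RowBlock computeRows(int startRow, int rowCount, int size) {
        int[][] pixels = new int[rowCount][size];
        for (int r = 0; r < rowCount; r++) {
            for (int c = 0; c < size; c++) {
                double cr = -2.0 + 3.0 * c / size;
                double ci = -1.5 + 3.0 * (startRow + r) / size;
                pixels[r][c] = iterations(cr, ci);
            }
        }
        return new RowBlock(startRow, pixels);
    }

    static int[][] render(int size, int rowsPerJob, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<RowBlock> jobs = new ExecutorCompletionService<>(pool);
            int submitted = 0;
            // Step 2: submit every block without waiting for any of them.
            for (int start = 0; start < size; start += rowsPerJob) {
                final int startRow = start;
                final int rowCount = Math.min(rowsPerJob, size - start);
                jobs.submit(() -> computeRows(startRow, rowCount, size));
                submitted++;
            }
            // Steps 3 and 4: collect jobs in completion order and assemble the image.
            int[][] image = new int[size][];
            for (int i = 0; i < submitted; i++) {
                RowBlock block = jobs.take().get(); // a failed job surfaces here as ExecutionException
                for (int r = 0; r < block.pixels.length; r++) {
                    image[block.startRow + r] = block.pixels[r];
                }
            }
            return image;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[][] image = render(470, 47, 4);
        System.out.println("rows collected: " + image.length); // rows collected: 470
    }
}
```

Note that the collect loop counts submitted jobs rather than assuming a fixed number, and it lets a failed job raise an exception instead of leaving a blank region in the image.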
I have a .NET C# simple asynchronous client example handy, which I'll post below. The .NET project files are part of the samples from our doc site here. In the samples, look at src/dotnet/monte_carlo_pi/monte_carlo_pi_app (which is the Engine application), and at src/dotnet/simple_async_client (which is below). SimpleAsyncClient calls the monte_carlo_pi_app Engine application asynchronously (submitting and collecting results on separate threads) to generate monte carlo PI values (like generating your pixels or blocks of pixels) and then generates a final monte carlo PI calculation and outputs it from the client (like calling calculatorCallback to update the fractal applet).
Though threading is different between C# and Java, the basic principles of the methods for SubmitRequests (step #2 above) and WaitOnRequests (step #3 above) are the same.
I hope this helps!
Thanks,
Guerry
The asynchronous client sample:
The Engine application piece (not including Engine XML--download samples zip for that):
Thanks Brett,
I have made the method public, and application is deployed now.
I have created another thread for this application specific issue.
Nilay
Nilay,
First off, I would suggest starting a new thread. If you are having an issue seeing a particular method, having a new posting with that title would probably be better for the community. It makes searching easier and benefits all users.
That said, it looks like you have declared your method as 'protected'. That will limit its scope. We cannot access a protected method. I would suggest making that method public and see if that solves the issue.
Brett
So you have deployed Apache as a Service on CloudIQ Manager? Is your web app packaged up as a WAR file? If so, I would read here: http://www.appistry.com/community/wiki/display/cloudiq43/Deploy+an+Appli... on deploying an Application to a Service. In this case, Apache is the service your app depends on. If you look at the examples further down the page I referenced, you'll see this line:
Instead of "win-service" you would put the name of your apache service (the exact name of your apache service in its FAR XML definition)
Deploying a Service to an app allows you to define the dependency. Your app depends on Apache being installed, and the XML definition will enforce this.
Over time I would recommend separating out the business logic. For one, it will make code management easier, but two, it would allow you to put those processes into CloudIQ Engine for reliability.
As I mentioned earlier, I am working on the fractal application at http://www.gui.net/fractaljava.html
(complete source available online).
To start with, I am just trying to put one of its methods into a CloudIQ Engine application.
I am putting the fractal.Fractal.calculatorCallback method into the cloud:
protected synchronized void calculatorCallback( @TaskParameter("success") boolean success, @TaskParameter("newDrawing") Drawing newDrawing ) {
And I made the components, flow and app XML files as below:
fractal-components.xml
<?xml version="1.0"?>
<java-components xmlns="http://www.appistry.com/ns/component"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.appistry.com/ns/component eaf-component.xsd">
  <component name="fractal_component">
    <class name="fractal.Fractal"/>
    <method name="calculatorCallback">
      <signature>
        <argument type="boolean"/>
        <argument type="fractal.Drawing"/>
      </signature>
    </method>
  </component>
</java-components>
fractal-flow.xml
fractal-app.xml
<?xml version='1.0'?>
And when I try to deploy it, it gives the error below.
Let me know where I am going wrong.
Nilay
Hi,
We have created a survey application in which we create surveys for a telecom service. We are using Apache Tomcat, and I have deployed Tomcat on the cloud.
We are using simple servlets and JSPs, and our business logic is coupled with the presentation layer.
Please tell me how I can proceed further.
Thanks in advance.
Nilesh,
You should probably give a little more detail about your application. What is your web container? Apache? What type of business logic is in the application? Are you using an MVC type of framework? Do you have your business logic separated from your presentation layer? If your business logic is tightly coupled to the presentation layer, that limits the benefits you can achieve.
Perhaps if you could elaborate a little more, we could talk about the best way to help you move forward.
Brett