Application Logging in Appistry/Java

vitopn 
Joined: 2008-09-13
Posts: 15

Any best-practice advice on application logging in Java?

My current plan is to use Log4j, but it seems I am going to end up with a lot of log files (one for each application on each worker). I could use a SocketAppender or DBAppender to collect all error events from any worker, but that would introduce a single point of failure. My other thought was to write some code that scans each App/Work log file and reports the combined results (but gathering all of the log files from all of the worker machines is not trivial).

Has anyone else tackled these issues?

-Vito

guerry  Appistry employee
Joined: 2007-12-21
Posts: 95

vitopn,

Folks have used a number of different approaches. I'll say up front that we are planning logging enhancements in the near future. We want to address this issue in a fabric-like manner (i.e. distributed, reliable, with no central point of failure, etc.).

Really, there are at least two types of things that folks want to do with logging. First, there's "event" monitoring, where the log messages warn of immediate problems. Then there's the whole historical-analysis type of stuff. As you said, the distributed environment makes this type of problem non-trivial. The "single point of failure" outside the fabric is, of course, another problem, and once you get used to having all the redundancy and reliability of the fabric, single points are an irritant. :-)

Let's address the single point of failure first. At some point, you may have to ask "what is good enough?", "how critical is my data?" and "how time critical is my data delivery?" If you have a reliable database (perhaps a mirrored MySQL instance), and that's good enough for something like log catching, then you can use that. Some folks use network attached storage (NAS) with RAID or a storage area network (SAN) outside the fabric. If you must store something outside and you are really worried about redundancy, you could, for example, have two NAS devices on separate switches accessible to the fabric and save to both.

Here's a list of ways that some have handled the type of thing you are interested in. Each has its own strengths and weaknesses, and they are not presented in any particular order.

Solution A:
Incidentally, what you described above has been done. We have a Log4j tutorial coming online soon that basically lays out what you have described. However, it does introduce a single point of failure. Perhaps two SocketAppenders pointed at separate collectors? Again, you'd have to take network topology into account to make sure a single switch outage didn't cut the whole fabric off from the outside resource.
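To make the "two collectors" idea concrete, here is a minimal plain-JDK sketch of fanning each message out to several collectors so that one collector being down does not block the others. It is not Log4j's SocketAppender API; `LogSink` and `FanOutSink` are hypothetical names used only for illustration, and in a real setup each sink would wrap an appender or socket connection.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sink interface; stands in for an appender in this sketch. */
interface LogSink {
    void send(String message) throws Exception;
}

/** Forwards each message to every sink and reports how many accepted it. */
final class FanOutSink {
    private final List<LogSink> sinks;

    FanOutSink(List<LogSink> sinks) {
        this.sinks = new ArrayList<>(sinks);
    }

    /** Returns the number of collectors that accepted the message. */
    int send(String message) {
        int delivered = 0;
        for (LogSink sink : sinks) {
            try {
                sink.send(message);
                delivered++;
            } catch (Exception e) {
                // One collector being unreachable must not block the others.
            }
        }
        return delivered;
    }
}
```

A return value of 0 tells the caller that no collector took the message, which is the point at which a local fallback file might be worth writing.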

Solution B:
The fabric application sends multicast messages. Limited tasks, each running its own thread, monitor and collect the multicast messages, then 1) back them up locally, and 2) store them primarily on NAS, SAN, or a database. Last, depending on requirements, you could choose to flush the local copies from the worker once they are successfully saved elsewhere.

At a minimum, you want one limited task running. However, if the task's host goes down, you will likely miss some log messages in the short window when the fabric spins up the replacement limited task. If you run two or more limited tasks, then you won't have that short window in the case of an outage, but you now have the issue of how to merge the data at the primary storage point, or the issue of storing the log data at the primary storage point N times (one per limited task).
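One way to handle the merge problem with two or more limited tasks is to de-duplicate at the primary storage point. The sketch below assumes every log message carries the originating worker's ID and a per-worker sequence number; neither field is something the fabric provides, so the application would have to add them. `CapturedMessage` and `LogMerger` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One captured log message; workerId and sequence are assumed fields. */
final class CapturedMessage {
    final String workerId;
    final long sequence;
    final String text;

    CapturedMessage(String workerId, long sequence, String text) {
        this.workerId = workerId;
        this.sequence = sequence;
        this.text = text;
    }

    /** Identity of a message regardless of which collector captured it. */
    String key() {
        return workerId + "#" + sequence;
    }
}

final class LogMerger {
    /** Merges the captures of several collectors, keeping one copy of each message. */
    static List<CapturedMessage> merge(List<List<CapturedMessage>> captures) {
        Map<String, CapturedMessage> unique = new LinkedHashMap<>();
        for (List<CapturedMessage> capture : captures) {
            for (CapturedMessage m : capture) {
                unique.putIfAbsent(m.key(), m); // first copy wins, duplicates are skipped
            }
        }
        return new ArrayList<>(unique.values());
    }
}
```

If the primary store is a database, the same effect could be had with a unique constraint on (worker ID, sequence number) so that the second insert of a message is rejected.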

The limited tasks can also respond to fabric requests (the log monitoring is in a separate thread), and can take administrative commands to clear their local copies, return local file copies, etc.

Since limited tasks can move around as the fabric changes (workers coming and going, etc.), you would likely end up with log files on machines where the limited task is no longer running. You'd likely want to deal with this by removing files when the limited task is shut down.

Another comment has to do with multicast. Multicast is UDP-based, so unlike TCP it does not guarantee delivery, and messages can be dropped. Now, in years of dealing with multicast traffic, I've never seen a real problem arise, but there is always the chance. It's the nature of the protocol.
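For reference, a minimal sketch of the multicast send and receive sides using only the standard `java.net` classes. The group address, port, and tab-separated wire format are assumptions made for illustration, and `MulticastLog` is a hypothetical class; a real limited task would loop on the receive call in its monitoring thread.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;

final class MulticastLog {
    // Assumed values for illustration; pick a group and port that suit your network.
    static final String GROUP = "239.1.2.3";
    static final int PORT = 45454;

    /** Wire format (assumed): workerId, a tab, then the message text. */
    static byte[] encode(String workerId, String message) {
        return (workerId + "\t" + message).getBytes(StandardCharsets.UTF_8);
    }

    /** Splits a datagram payload back into {workerId, message}. */
    static String[] decode(byte[] data, int length) {
        return new String(data, 0, length, StandardCharsets.UTF_8).split("\t", 2);
    }

    /** Application side: fire-and-forget send; UDP gives no delivery guarantee. */
    static void send(String workerId, String message) throws Exception {
        byte[] payload = encode(workerId, message);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(GROUP), PORT));
        }
    }

    /** Collector side: blocks until one datagram arrives and returns it decoded. */
    static String[] receiveOne() throws Exception {
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            // null lets the JVM pick the default network interface
            socket.joinGroup(new InetSocketAddress(GROUP, PORT), null);
            byte[] buffer = new byte[8192];
            DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
            socket.receive(packet);
            return decode(packet.getData(), packet.getLength());
        }
    }
}
```

Because a dropped datagram is silent, including a per-worker sequence number in the payload would at least let the collector detect gaps.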

Solution C:
The fabric application writes log messages to disk on each worker. Background tasks on each worker collect the files from disk and store them to a central storage point, as in Solution A. Like Solution B, this is a push model instead of a pull model. The main difference lies in how the log messages are generated, and the fact that the background tasks run on *every* worker, and not just a subset like the limited tasks.

Again, you could augment the background tasks with additional tasks that perform administrative tasks like purging files, sending file copies on request, etc. Background tasks do not respond to fabric requests (they are daemon-like), and so you would need separate tasks established in this case.
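The collect-and-store step of such a background task might look roughly like the following, using only `java.nio.file`. This is a sketch under assumptions: the central storage point is taken to be a mounted directory (for example a NAS share), only closed or rotated `*.log` files are assumed to be present in the local directory, and `LogShipper` is a hypothetical name rather than anything the fabric provides.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class LogShipper {
    /**
     * Copies every "*.log" file in localDir to centralDir, prefixing the
     * worker name so files from different workers cannot collide, then
     * deletes the local copy. Returns the number of files shipped.
     */
    static int ship(Path localDir, Path centralDir, String workerName) throws IOException {
        Files.createDirectories(centralDir);
        int shipped = 0;
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(localDir, "*.log")) {
            for (Path log : logs) {
                Path target = centralDir.resolve(workerName + "-" + log.getFileName());
                Files.copy(log, target, StandardCopyOption.REPLACE_EXISTING);
                Files.delete(log); // purge only after the copy succeeded
                shipped++;
            }
        }
        return shipped;
    }
}
```

A background task would call `ship` on a timer; if the copy throws, the local file is left in place and picked up on the next pass.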

Solution D:
We have had customers store some data in the FAM, our in-memory cache service. In some cases, they are storing performance metrics, and similar atomic pieces of application data. This might include errors, exceptions, and other vital data.

However, if your application is disgorging huge amounts of log traffic, then I'm not sure I'd stream a constant flow of messages into a FAM space. As stated, FAM is memory-based, so if you queued the messages in the FAM, I'd have another service dequeuing them to be processed, stored, etc. You wouldn't be storing them in FAM permanently.

If you establish a 2/1 fabric space, then your data will be stored on more than one worker, giving you redundancy.
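The queue-then-dequeue pattern described for Solution D can be sketched with a bounded in-memory queue. This does not use the FAM API; `java.util.concurrent.ArrayBlockingQueue` is a plain-JDK stand-in for the FAM space, and `LogBuffer` is a hypothetical name. The bound matters because the store is memory-based: when the consumer falls behind, the producer has to drop or divert messages rather than grow without limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class LogBuffer {
    private final BlockingQueue<String> queue;

    LogBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Producer side: returns false (message not queued) when the buffer is full. */
    boolean enqueue(String message) {
        return queue.offer(message);
    }

    /** Consumer side: removes up to maxBatch messages to hand to durable storage. */
    List<String> drain(int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }
}
```

The dequeuing service would call `drain` periodically and write each batch to the database, NAS, or SAN.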

Solution E:
You could brute-force store the log files on each worker and let them accumulate. You could then implement an affinity task that reports that it has log files that have grown to size N, or that its log files were last collected N time ago. You could then have a client that regularly calls into the fabric using the affinity to find workers that need their logs gathered. On each call, the client would find a worker needing attention and harvest its files to another location. After being harvested, the worker would reset its counters according to the logic used (log file size is now smaller, or last collected time, etc.).
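The "does this worker need attention?" check in Solution E reduces to a size-or-age test. Below is a minimal sketch of that decision only; how the result is published as an affinity is fabric-specific and not shown. `HarvestPolicy` and its thresholds are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

final class HarvestPolicy {
    private final long maxBytes;
    private final Duration maxAge;

    HarvestPolicy(long maxBytes, Duration maxAge) {
        this.maxBytes = maxBytes;
        this.maxAge = maxAge;
    }

    /**
     * True when the worker's logs have reached the size threshold or
     * have not been collected for at least maxAge.
     */
    boolean needsHarvest(long logBytes, Instant lastCollected, Instant now) {
        return logBytes >= maxBytes
                || Duration.between(lastCollected, now).compareTo(maxAge) >= 0;
    }
}
```

After a successful harvest the worker would record the new collection time and the reduced log size, so the next check returns false until a threshold is crossed again.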

Solution F:
A scripted "pull" solution would be to write a script (Ruby, Python, Perl, etc.) that uses fabric_ctl to get a list of worker IPs and then cycles over the IPs, harvesting the log files with secure copy or similar and purging them afterward.

I'm sure there are other methods. I have someone else I'd like to ping about this question, but he's not available to me today.

Either way, we are going to be addressing this functionality in the near future so that the fabric can do the heavy lifting for you, but for right now perhaps one of these patterns will help.

As we are exploring the requirements, I would like to contact you and mrbahr about your requirements, if that would be okay.

Thanks!

Guerry

vitopn 
Joined: 2008-09-13
Posts: 15

Guerry,
OK. Excellent response. Great information. Given the "what is good enough?" question, I think I will opt for a simple, easy-to-maintain solution (DBAppender for warnings and up). Hopefully, by the time this becomes a scaling issue, Appistry will have a better fabric-based approach for this.

And yes, feel free to contact us.

Cheers,
-Vito