Monitoring the IBM i with AAG (1/3) – Simply i

Using AAG to monitor an IBM i system could not be simpler. Over the next 3 posts I want to give a quick overview of how to monitor 3 different use cases. The first is ensuring an application is running correctly. The second is ensuring our replication is accurate and reliable using AAG’s HA4i checks. Finally, we will go over how to monitor the system status to ensure our workload is not overloading the system.

To start, we will ensure Shield’s EM4i product is running correctly. However, to show that this can be configured for non-Shield products, we will NOT use the product-specific checks that have been built for EM4i, HA4i and other Shield products.

For EM4i to be running correctly, there are 3 critical jobs that must be running in the EM4iSBS subsystem, and there should not be any other jobs running in that subsystem. We can ensure there are at least 3 jobs running by using the check_Shield_SBSJOB check. We pass in the subsystem information, set a critical state if the number of jobs falls in the range from negative infinity to 2, and set a warning state if the number of jobs is above 3. The check will also fall through to a critical state if the subsystem does not exist, as the number of jobs returned will be 0.
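The threshold logic described above can be sketched in a few lines of Python. This is only an illustration of the rule, not how check_Shield_SBSJOB is implemented; the function and parameter names are made up for the example, and the exit codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def job_count_state(job_count, crit_max=2, warn_above=3):
    """Map a subsystem job count to a Nagios state.

    crit_max and warn_above are illustrative names, not AAG parameters.
    """
    if job_count <= crit_max:
        # Too few jobs. A count of 0 also covers a missing subsystem.
        return CRITICAL
    if job_count > warn_above:
        # More jobs than expected are running in the subsystem.
        return WARNING
    return OK


# Try a few job counts against the thresholds from the text.
for count in (0, 2, 3, 4):
    print(count, job_count_state(count))
```

With these thresholds, exactly 3 jobs is the only count that comes back OK.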

As we can see, we are in the clear so far, as there are 3 jobs currently running within the subsystem. There is an issue, though: what if these are not the correct jobs? For this we will implement another set of checks.

Using the check_Shield_JOBSRCH check command, we can search for jobs matching a set of criteria: job name, user, and job number. The check returns the number of jobs found that match the criteria entered.

We need to ensure MSGMON, MSGPOL, and SMSMGR are running on the system. Because other applications may use similar naming conventions, we will also pass in the EM4i user that has been set up for the EM4i application.
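To make the idea of a job search concrete, here is a small Python sketch that counts jobs matching whichever criteria are supplied. It is not the check_Shield_JOBSRCH implementation; the job list, the job numbers, and the EM4IUSER and OTHERAPP profiles are hypothetical values invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user: str
    number: str


def count_matching_jobs(jobs, name=None, user=None, number=None):
    """Count jobs that match every criterion that was supplied."""
    return sum(
        1
        for job in jobs
        if (name is None or job.name == name)
        and (user is None or job.user == user)
        and (number is None or job.number == number)
    )


# Hypothetical active jobs; numbers and user profiles are made up.
active_jobs = [
    Job("MSGMON", "EM4IUSER", "123456"),
    Job("MSGPOL", "EM4IUSER", "123457"),
    Job("SMSMGR", "EM4IUSER", "123458"),
    Job("MSGMON", "OTHERAPP", "123459"),  # similar name, different application
]

required = ("MSGMON", "MSGPOL", "SMSMGR")
found = {
    name: count_matching_jobs(active_jobs, name=name, user="EM4IUSER")
    for name in required
}
print(found)
```

Searching on the name alone would count the MSGMON job owned by the other application as well, which is why the user is added to the criteria.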

Now we can see that there are 3 jobs, all matching the criteria of the jobs required by EM4i.

Another issue could be an application that is unable to run because the keys for that product have expired. Using the check_Shield_KEYEXP check command, we can not only return a critical state if our keys have expired, but also set a warning to remind us to renew the keys as they approach their expiry date. We have set our check to ensure the EM4i keys do not expire within 10 days. If they expire within 10 days we will receive a warning as a reminder, and if they expire within 2 days we will receive a critical state as a hard reminder. These parameters can be set by the user to fit their scheduled key replacements.
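The expiry rule can be written out as a short Python sketch. Again, this only illustrates the thresholds described above and is not the check_Shield_KEYEXP implementation; the function name, the parameter names, the dates, and the decision to treat the boundary days (exactly 10 and exactly 2) as inside the window are all assumptions.

```python
from datetime import date

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def key_expiry_state(expiry, today, warn_days=10, crit_days=2):
    """Map the days left on a licence key to a Nagios state."""
    days_left = (expiry - today).days
    if days_left <= crit_days:
        # Expiring very soon; a negative value means already expired.
        return CRITICAL
    if days_left <= warn_days:
        return WARNING
    return OK


# Hypothetical dates: the key expires 9 days after "today".
today = date(2024, 1, 1)
print(key_expiry_state(date(2024, 1, 10), today))
```

A key with 9 days left falls inside the 10-day window but outside the 2-day window, so it is reported as a warning.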

We can see our keys expire in 9 days, which has pushed a warning notice out to us. With AAG it is possible to receive this message through Nagios, by email, or as a Pushover message, using the Pushover API to get live notifications on your phone.

AAG and EM4i are separate Shield products. That being said, they complement each other to add accuracy and confidence to your monitoring solution. EM4i is a message monitoring tool that sends notifications to users and allows remote responses to be sent securely via a web interface. The following will be the definition of overkill, but I have always believed in checking a check to be sure we don’t miss anything. Another AAG check command we can use to double-check EM4i is check_Shield_RPYW. Using this command, we can not only check a message queue to see if there are messages awaiting replies, but also check how long a message has been waiting. With this feature, we can see if EM4i is running incorrectly and has not pushed out a message, OR, if a message has been picked up by EM4i, we can see whether the system admin in charge of replying to those messages has not been on the ball.

For this check we pass in the message queue information, a critical threshold of more than 5 messages waiting for replies, a warning threshold of a single message, and the critical and warning time periods for a message that has been left unanswered.
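The two kinds of threshold in this check, message count and wait time, can be sketched separately in Python. This is not the check_Shield_RPYW implementation. The count thresholds come from the text, but the time thresholds (30 and 60 minutes) are invented for the example, as are the function and parameter names.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def message_count_state(waiting, warn_at=1, crit_above=5):
    """State based on how many messages are awaiting a reply."""
    if waiting > crit_above:
        return CRITICAL
    if waiting >= warn_at:
        return WARNING
    return OK


def wait_time_state(minutes_waiting, warn_minutes=30, crit_minutes=60):
    """State based on how long the oldest message has waited.

    The 30 and 60 minute defaults are assumed values, not AAG defaults.
    """
    if minutes_waiting >= crit_minutes:
        return CRITICAL
    if minutes_waiting >= warn_minutes:
        return WARNING
    return OK


# One message that has been waiting for 20 minutes.
print(message_count_state(1), wait_time_state(20))
```

Keeping the two evaluations apart makes it easy to tell a busy queue from a single message that nobody has answered.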

As we can see, we have a test message waiting on QSYSOPR, and it has been waiting for 20 minutes, which is still in the okay range. AAG will also pull and display information about the messages waiting on your system.

As I hope you can see, AAG does not just provide individual checks. We aim to provide checks that work in unison and can be built into the ultimate system status check. Using AAG’s flexible commands, a user can tailor what they are checking to suit each system they wish to monitor.
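One simple way to think about checks working in unison is to roll the individual results up into a single worst-case state. The Python sketch below shows that idea only; it is not an AAG or Nagios feature as written, and the check labels and result values are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def overall_state(results):
    """Return the worst state and the checks that are in that state."""
    worst = max(results.values(), default=OK)
    if worst == OK:
        return OK, []
    offenders = sorted(name for name, state in results.items() if state == worst)
    return worst, offenders


# Hypothetical results for the four checks covered in this post.
results = {
    "subsystem jobs": OK,
    "job search": OK,
    "key expiry": WARNING,
    "reply wait": OK,
}
print(overall_state(results))
```

Here the key expiry warning is what sets the overall status, so it is the first thing an operator would be pointed to.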

I will be posting the 2 other use cases in the coming days to showcase AAG working in this manner. Check back for more!

Charlie…