The issue for me ended up being I was specifying the User=zabbix when I should just have left that line out since it defaults to zabbix
AS/400 Monitoring solutions
-
Version 0.7.0 was released.
Gerald, please try this version. I've seriously reworked the auto-reconnection mechanism. Unfortunately, it had not been tested enough before (and really worked very badly). I hope it works better now :-)
Other enhancements:
- metrics vfs.fs.state and as400.disk.asp added;
- metric as400.disk.state improved (returns a fictitious value “4294967295” if the disk unit is NOT owned by the system);
- initialization and output improved. All messages now go to the log file only; if the log file cannot be accessed or created, they go to stdout. The presence of required libraries is checked explicitly. The metrics agent.hostname and system.uname are written to the log during initialization. If any problem occurs during initialization, the program stops immediately.
-
Hi Gerald,
are there any results from your testing?
I'm curious whether the auto-reconnection finally works ;-)
Thanks in advance!
-
Hi KOS,
Thanks for coding this emulator. We were able to get it working today and we are successfully collecting data. We have one problem we are trying to figure out:
Error: 27730:20170609:143008.788 cannot send list of active checks to "x.x.x.96"
Our machine uses a virtual IP address of x.x.x.6 (x.x.x left out for security reasons).
We also have two physical IPs: x.x.x.86 and x.x.x.96.
In Zabbix, the agent is defined as x.x.x.6 and we are successfully collecting data.
We plan to use the eventlog function to obtain QSYSOPR messages, so we need active checks working.
Any ideas?
Thanks
-
Kos,
unfortunately there is no improvement.
There are no really helpful messages; in fact, there are not many messages at all.
I have now switched on the debug level and restarted the client,
so after next weekend I should have more details.
13:20170611:150519.487 current cpu_used=0, ji.cpu_used=33
13:20170611:150522.437 Procstat.updateJobinfoList() communication to AS/400 error: java.net.SocketException: Connection reset
13:20170611:172406.794 Procstat.updateJobinfoList() error: com.ibm.as400.access.AS400Exception: CPF3C53 Job 688412/QLWISVR/QP0ZSPWP nicht gefunden.
13:20170611:172409.570 Procstat.updateJobinfoList() communication to AS/400 is working again
13:20170611:172439.523 Procstat.updateJobinfoList() error: com.ibm.as400.access.AS400Exception: CPF3C53 Job 688629/QCPMGTDIR/QP0ZSPWP nicht gefunden.
13:20170612:052507.478 Procstat.updateJobinfoList() error: com.ibm.as400.access.AS400Exception: CPF3C53 Job 688862/QUSER/ZDASOINIT nicht gefunden.
root@DEZABBIX:/GRprog/as400#
-
Thank you for the response, even though it is not very good news for me.
It is a bit strange to me that there is no improvement.
There should be messages (at "WARNING" level) from the different Java threads (such as "active checks" or "listener") about communication errors and, later, about reconnections. Something like the following:
Code:
14:20170525:192326.040 agent #1 started [collector]
15:20170525:192326.040 agent #2 (zabbix.*******.***) started [active checks #3]
17:20170525:192326.040 agent #4 started[listener #2]
16:20170525:192326.040 agent #3 started[listener #1]
18:20170525:192326.040 agent #5 started[listener #3]
14:20170525:192525.511 Procstat.updateJobinfoList() communication to AS/400 error: java.net.SocketException: Software caused connection abort: recv failed
18:20170525:192533.873 ZabbixAgent.process(): 'vfs.fs.state[33]' communication error: java.net.SocketException: Software caused connection abort: recv failed, trying to reconnect...
18:20170525:192535.386 ZabbixAgent.process() 'vfs.fs.state[33]' communication error: java.net.SocketTimeoutException: connect timed out
15:20170525:192545.589 ZabbixAgent.process(): 'system.users.num' communication error: java.net.SocketException: Software caused connection abort: recv failed, trying to reconnect...
15:20170525:192547.102 ZabbixAgent.process() 'system.users.num' communication error: java.net.SocketTimeoutException: connect timed out
15:20170525:192650.969 active check "system.users.num" is not supported: java.net.SocketTimeoutException: connect timed out
15:20170525:192826.332 ZabbixAgent.process() 'vfs.fs.size[33,pfree]' communication to AS/400 is working again
14:20170525:192827.455 Procstat.updateJobinfoList() communication to AS/400 is working again
18:20170525:192911.816 ZabbixAgent.process() 'system.hostname' communication to AS/400 is working again

The first part of each line (before the colon and the timestamp) is the Java thread number. So, in this example we can see that there were communication errors in threads 14 (collector), 18 (listener, i.e. the thread for passive checks) and 15 (active checks). Accordingly, each of these threads reconnected when communication with the AS/400 host was restored.
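To see at a glance which threads logged anything, the log lines can be grouped by that leading thread number. The following is only a rough sketch: the class `LogByThread` and its methods are hypothetical helpers, not part of the emulator, and the line layout (thread number, date, time, message) is assumed from the sample lines in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the emulator): groups log messages by Java thread number.
final class LogByThread {
    // Assumed layout: <thread>:<yyyyMMdd>:<HHmmss.SSS> <message>
    private static final Pattern LINE =
            Pattern.compile("^(\\d+):(\\d{8}):(\\d{6}\\.\\d{3})\\s+(.*)$");

    // Returns the leading thread number, or -1 if the line does not follow the assumed layout.
    static int threadId(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Groups the message texts by thread number, in order of first appearance.
    // Lines that do not match the layout are skipped.
    static Map<Integer, List<String>> group(List<String> lines) {
        Map<Integer, List<String>> byThread = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                byThread.computeIfAbsent(Integer.parseInt(m.group(1)), k -> new ArrayList<>())
                        .add(m.group(4));
            }
        }
        return byThread;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "14:20170525:192326.040 agent #1 started [collector]",
                "18:20170525:192911.816 ZabbixAgent.process() 'system.hostname' communication to AS/400 is working again");
        // Print how many messages each thread produced.
        group(sample).forEach((thread, msgs) ->
                System.out.println("thread " + thread + ": " + msgs.size() + " message(s)"));
    }
}
```

If only one thread number ever shows up in the grouped result, the threads that process the actual metrics wrote nothing to the log.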
In your case I can see only messages from thread #13 (the "collector" thread), and it worked as designed. However, I don't see any messages from the threads that should process the real metrics (active or passive checks). They should look like: "ZabbixAgent.process() '<some metric>' communication error [...]". I don't understand what is happening.
Maybe it really would be useful to set "DebugLevel=4" for a weekend and then send me your logs for analysis. Please note that the log will be much larger, so it will also be necessary to set the "LogFileSize=100" parameter (I hope that 100 MB will be enough). Of course, you need to restart the program for the modified settings to take effect.
-
Hello KevC,
your message appeared only now (it has probably just been approved by a moderator).
When you use active checks, the connections are established from the Zabbix agent to the Zabbix server. So you need to set the "ServerActive=" parameter correctly, pointing to your Zabbix server. It does not matter which IP address the Zabbix agent has.
The message
Error: 27730:20170609:143008.788 cannot send list of active checks to "x.x.x.96"
suggests that the address "x.x.x.96" is probably not your Zabbix server. Please check that your Zabbix server really listens on that IP address and uses the appropriate port (the default is 10051 for incoming connections).
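A quick way to check the port advice above from the machine where the agent emulator runs is to attempt a plain TCP connection to the server. This is only a minimal sketch: `PortCheck` is a hypothetical helper, not part of the emulator, and the default host and the 3-second timeout are placeholders. It only shows that something accepts connections on the port, not that the Zabbix server knows the host.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of the emulator): tests whether a TCP port accepts connections.
final class PortCheck {
    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder defaults; pass your Zabbix server address and port as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 10051;
        boolean ok = canConnect(host, port, 3000);
        System.out.println(host + ":" + port + (ok ? " is reachable" : " is NOT reachable"));
    }
}
```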
-
Thanks Kos. We do have the Zabbix server defined in the ServerActive= parameter, but the ListenPort= may be missing; we use 10050 as the ListenPort. Our AS400 guru is checking on that, and I will keep you posted. We are so close. We appreciate all your work on this emulator.
-
Hi Kos,
We were able to get this working by making sure the hostname on the ServerActive= parameter matches the DNS name in Zabbix.
Thanks for your help
-
Hi Kos,
Now that we have the emulator working, we have a couple of challenges and would like to know if you have any thoughts on a solution.
In order to meet our customers' requirements we need to filter on the following from QSYSOPR:
Full Message id: We need to exclude specific message IDs, such as CPA3138. The trouble is that there is also a CPF3138, so excluding based on the number alone will exclude additional messages. We could use the trigger to do this, but that would mean bringing extra messages into the Zabbix DB that are not needed, and there could be 100 or more per day. One solution might be to put the message ID in the source (we don't need to filter on the job name; not sure about others).
Users: We need to filter out messages from specific users (for example DEV testers), but this field is not provided. Any thoughts?
Severity 50 or greater: This is provided and works great.
For MIMIX Queue:
We need to monitor for messages in library QUSRSYS with severity 50 or greater.
We don't believe the library is provided, but we are not sure, since we are not yet able to generate a MIMIX message in test; we are working on that.
Please let me know if you have any thoughts on how we might be able to handle the above requirements. We appreciate all your work on this.
-
Hello KevC,
I'm glad that my work is useful to somebody :-)
Regarding your questions:
Full Message id:
You can filter by the full EventID field using a regular expression if you use it in the item's key. In this case the filtering is performed on the agent's side, where the full EventID is available (in contrast to filtering in a trigger on the server side, where only the numeric part of the EventID is available). By the way, the documentation has an example of this trick.
Users:
Unfortunately, there is no such functionality at the moment. In theory, the API provides the methods getUser() (returns the sender job's user) and getCurrentUser() (returns the current user name). However, in the general case both of these methods can return an empty string (""). Additionally, there is no way to transfer these values to the Zabbix server (the only option I see is to concatenate this field to some other existing field, such as the message text). Finally, it would require extending the item's key to accept an additional parameter. So, at the moment I'm not sure it is really needed. Maybe filtering by the job name would be enough in your case?
Library:
As I understand it, the library name is part of the IFS path name of the message queue (again, see the example in the documentation). You could consult your AS/400 admin for details, but I suppose you could use something like this as the queue name: "/QUSRSYS.LIB/MIMIX.MSGQ".
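For the full message ID filtering discussed above, a regular expression with a negative lookahead can exclude CPA3138 while still letting CPF3138 through. The following is only a sketch of the pattern itself, checked with plain `java.util.regex`. The class `EventIdFilter` is hypothetical, and whether the emulator's item key accepts this exact syntax is an assumption that should be checked against its documentation.

```java
import java.util.regex.Pattern;

// Hypothetical illustration (not part of the emulator): a pattern that accepts
// every full message ID except exactly "CPA3138".
final class EventIdFilter {
    // The negative lookahead (?!CPA3138$) rejects the ID only when the whole ID is CPA3138.
    static final Pattern ALL_BUT_CPA3138 = Pattern.compile("^(?!CPA3138$).+$");

    // Returns true if the message ID should be collected.
    static boolean accept(String eventId) {
        return ALL_BUT_CPA3138.matcher(eventId).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"CPA3138", "CPF3138", "CPA3139"}) {
            System.out.println(id + " accepted: " + accept(id));
        }
    }
}
```

Because the match is made against the full ID, including the letter prefix, CPF3138 is not affected by excluding CPA3138.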
-
As400 monitoring
Kos,
attached is the compressed log file.
This was already recorded with debug level four, but I forgot to
switch on the larger file size...
I did this today, so after next weekend I should be able to provide more details.
But maybe the small file is already helpful.
Our AS400 makes a full system backup on Sunday at 15:00.
Regards
-
Thanks, Gerald, for the debug log; it is really useful.
If I understand correctly, this time our Zabbix agent emulator stopped gracefully after the problem instead of staying up and trying to reconnect later.
However, it collected a good dying trace (including the stack trace of the Java exception that caused it to stop).
The root cause of the problem is the following: some API calls wrap the original exception ("java.net.NoRouteToHostException" in this case, a subclass of "java.io.IOException") in another type of exception ("java.lang.RuntimeException" in this case). My program tries to handle all communication errors by catching IOException. In this case, however, it receives a RuntimeException instead of an IOException. It does not know what to do in this situation and simply stops as a result.
By the way, this is much better behaviour than in previous versions: 1) it clearly signals that something went wrong; 2) it saves the current log, preventing it from being overwritten; 3) it no longer happens that some threads die unexpectedly while the others keep running.
I need to think about how to fix this problem.
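One possible way to deal with the wrapped exception described above is to walk the cause chain of whatever was caught and look for an IOException inside it. This is only a sketch of the general technique, not the emulator's actual fix: `CauseChain` and `findIOException` are hypothetical names, and the depth limit of 32 is an arbitrary guard against cyclic cause chains.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;

// Hypothetical helper (not the emulator's code): finds a wrapped IOException.
final class CauseChain {
    // Returns the first IOException in the cause chain (including t itself), or null if none.
    static IOException findIOException(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof IOException) {
                return (IOException) cur;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulates an API call that wraps a communication error in a RuntimeException.
        RuntimeException wrapped =
                new RuntimeException(new NoRouteToHostException("No route to host"));
        IOException io = findIOException(wrapped);
        if (io != null) {
            System.out.println("communication error, would try to reconnect: " + io);
        } else {
            System.out.println("unexpected error, would stop: " + wrapped);
        }
    }
}
```

A handler built this way could treat a RuntimeException with an IOException cause like any other communication error, and still stop on anything it does not recognise.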