Saturday, 19 September 2015

WebSphere Application Server - SIbus Resilience with DB2 HADR - There and back again

Yet more on DB2 HADR: -

I'm testing DB2 HADR using a pair of VMs, running v10.5.0.5, and a 3rd VM running WAS 8.5.5.2 ( this is Business Monitor 8.5.5 but that's not important right now ).

I'm trying to get a handle (!) on Messaging Engine ( SIbus ) failover between the primary and standby DB2 boxes.

I've configured the ME Custom Property sib.msgstore.jdbcFailoverOnDBConnectionLoss = false rather than the WAS 8.5 default of true.

I've also tuned the TCP/IP keep-alives on all three VMs to: -

net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_retries2 = 2

and yet SIB DB failover does not ... failover.
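As a rough sanity check on those keep-alive values, here's a minimal sketch of the arithmetic, assuming the usual Linux behaviour ( the connection sits idle for tcp_keepalive_time seconds, then tcp_keepalive_probes probes go out tcp_keepalive_intvl seconds apart ). The function name keepalive_detect_seconds is my own invention, not a real tool: -

```shell
# Hypothetical helper: approximate worst-case seconds before TCP keep-alive
# declares an idle connection dead.
#   idle time + ( probe interval * probe count )
keepalive_detect_seconds() {
  idle=$1
  intvl=$2
  probes=$3
  echo $(( idle + intvl * probes ))
}

# With the values above: 60 + 10 * 3
keepalive_detect_seconds 60 10 3
```

To feed it the live values rather than hard-coded ones, each argument can be read with sysctl -n, e.g. "$(sysctl -n net.ipv4.tcp_keepalive_time)". Bear in mind that keep-alive only applies to sockets that have SO_KEEPALIVE enabled, and that tcp_retries2 governs retransmission of unacknowledged data rather than idle probing.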

I start the MECluster ( which hosts the SIB ME ), and it happily connects to the primary DB ( db2one.uk.ibm.com ).

When I do the DB2 HADR takeover on db2two.uk.ibm.com, I see: -

[18/09/15 21:18:42:274 BST] 0000006a ConnectionEve A   J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for resource jdbc/wbm/MonitorDatabase. The exception is: com.ibm.db2.jcc.am.ClientRerouteException: [jcc][t4][2027][11212][4.11.69] A connection failed but has been re-established. The host name or IP address is "db2one" and the service name or port number is 60,006.

in SystemOut.log, before it connects: -

[18/09/15 21:19:02:232 BST] 00000069 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1537I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=7E0AFF46E21AA864, has acquired an exclusive lock on the data store.

However, when I reverse the takeover ( back to db2one.uk.ibm.com ), I see: -

[18/09/15 21:19:22:269 BST] 000000a6 WSJccConnecti W   DSRA8650W: Error closing a JDBC child wrapper, com.ibm.ws.rsadapter.jdbc.WSJccPreparedStatement@c693be3a
com.ibm.db2.jcc.am.SqlException: [jcc][10120][10943][4.11.69] Invalid operation: statement is closed. ERRORCODE=-4470, SQLSTATE=null

...

[18/09/15 21:19:22:278 BST] 000000a6 ConnectionEve W   J2CA0206W: A connection error occurred.  To help determine the problem, enable the Diagnose Connection Usage option on the Connection Factory or Data Source. This is the multithreaded access detection option. Alternatively check that the Database or MessageProvider is available.
[18/09/15 21:19:22:279 BST] 000000a6 ConnectionEve A   J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for resource jdbc/wbm/MonitorDatabase. The exception is: com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException: [jcc][t4][2030][11211][4.11.69] A communication error occurred during operations on the connection's underlying socket, socket input stream,
or socket output stream.  Error location: Reply.fill() - insufficient data (-1).  Message: Insufficient data. ERRORCODE=-4499, SQLSTATE=08001
[18/09/15 21:19:42:234 BST] 00000069 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1594I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=7E0AFF46E21AA864, has lost the lock on the data store.
[18/09/15 21:19:42:240 BST] 00000069 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1538I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=7E0AFF46E21AA864, is attempting to obtain an exclusive lock on the data store.

...

In other words, it fails over from the configured primary db2one to the configured standby db2two but not back again.
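A quick way to see where the ME has got to is to look at the most recent lock message in SystemOut.log. This is a minimal sketch, assuming only the three message IDs seen above ( CWSIS1537I acquired, CWSIS1594I lost, CWSIS1538I attempting ); the function name me_lock_state is made up for illustration: -

```shell
# Hypothetical helper: read a SystemOut.log on stdin and report the state
# implied by the LAST data store lock message found in it.
me_lock_state() {
  awk '
    /CWSIS1537I/ { state = "acquired" }    # has acquired an exclusive lock
    /CWSIS1594I/ { state = "lost" }        # has lost the lock
    /CWSIS1538I/ { state = "attempting" }  # is attempting to obtain the lock
    END { print (state == "" ? "unknown" : state) }
  '
}
```

Usage would be something like me_lock_state < SystemOut.log; a result of "attempting" that never changes to "acquired" matches the stuck behaviour described here.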

This is what I have for the primary DB: -

and this is what I have for the standby: -

I then made some progress :-)

So this is what I have on db2two: -

db2 list db directory

 System Database Directory

 Number of entries in the directory = 2

Database 1 entry:

 Database alias                       = COGNOS
 Database name                        = COGNOS
 Local database directory             = /home/db2inst1
 Database release level               = 10.00
 Comment                              = IBM Cognos Content Store
 Directory entry type                 = Indirect
 Catalog database partition number    = 0
 Alternate server hostname            = db2one
 Alternate server port number         = 60006

Database 2 entry:

 Database alias                       = MONITOR
 Database name                        = MONITOR
 Local database directory             = /home/db2inst1
 Database release level               = 10.00
 Comment                              =
 Directory entry type                 = Indirect
 Catalog database partition number    = 0
 Alternate server hostname            = db2one
 Alternate server port number         = 60006

and this is what I have on db2one: -

db2 list db directory

 System Database Directory

 Number of entries in the directory = 2

Database 1 entry:

 Database alias                       = COGNOS
 Database name                        = COGNOS
 Local database directory             = /home/db2inst1
 Database release level               = 10.00
 Comment                              = IBM Cognos Content Store
 Directory entry type                 = Indirect
 Catalog database partition number    = 0
 Alternate server hostname            = db2two
 Alternate server port number         = 60006

Database 2 entry:

 Database alias                       = MONITOR
 Database name                        = MONITOR
 Local database directory             = /home/db2inst1
 Database release level               = 10.00
 Comment                              =
 Directory entry type                 = Indirect
 Catalog database partition number    = 0
 Alternate server hostname            =
 Alternate server port number         =

Spot the difference ?

Yep, the system database directory on db2one has no Alternate server settings for the MONITOR DB, and guess which DB I'm having problems with ?
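Rather than eyeballing the two listings, the same check can be scripted. This is only a sketch, assuming the output layout shown above ( "Database alias" starts each entry, and "Alternate server hostname" is either blank or absent when nothing is set ); the function name missing_alternate_server is hypothetical: -

```shell
# Hypothetical helper: read `db2 list db directory` output on stdin and print
# the alias of every database with no Alternate server hostname.
missing_alternate_server() {
  awk -F'=' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Emit the previous entry if it had no alternate server recorded.
    function report() { if (name != "" && alt == "") print name }
    /Database alias/            { report(); name = trim($2); alt = "" }
    /Alternate server hostname/ { alt = trim($2) }
    END                         { report() }
  '
}
```

Run on each DB2 box as db2 list db directory | missing_alternate_server; against the db2one listing above it would flag MONITOR.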

Added to that, I see this: -

2015-09-19-07.28.52.105153+060 I61711243E775         LEVEL: Warning
PID     : 2100                 TID : 140240484296448 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : MONITOR
APPHDL  : 0-13494              APPID: 192.168.33.100.56862.150919073103
AUTHID  : DB2USER1             HOSTNAME: db2one.uk.ibm.com
EDUID   : 156                  EDUNAME: db2agent (MONITOR) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrCheckDb, probe:18200
MESSAGE : SQL1776N  The command cannot be issued on an HADR database. Reason
          code = "1".
DATA #1 : Hex integer, 4 bytes
0x00000000
DATA #2 : sqeApplication_acbInfo, PD_TYPE_sqeApplication_acbInfo, 4 bytes
x0
DATA #3 : String, 50 bytes
Connections are not allowed on a standby database.

in ~/sqllib/db2dump/db2diag.log.

I updated the Alternate server settings on db2one as follows: -

db2 "update alternate server for database monitor using hostname db2two port 60006"

and restarted DB2: -

db2stop force

db2start

and ... guess what ?

Yep, takeover now works both ways: -

From db2two to db2one

[19/09/15 07:34:11:758 BST] 000000bf ConnectionEve W   J2CA0206W: A connection error occurred.  To help determine the problem, enable the Diagnose Connection Usage option on the Connection Factory or Data Source. This is the multithreaded access detection option. Alternatively check that the Database or MessageProvider is available.
[19/09/15 07:34:11:759 BST] 000000bf ConnectionEve A   J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for resource jdbc/wbm/MonitorDatabase. The exception is: com.ibm.db2.jcc.am.ClientRerouteException: [jcc][t4][2027][11212][4.11.69] A connection failed but has been re-established. The host name or IP address is "db2two.uk.ibm.com" and the service name or port number is 60,006.

Special registers may or may not be re-attempted (Reason code = 1). ERRORCODE=-4498, SQLSTATE=08506
[19/09/15 07:34:31:487 BST] 00000068 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1594I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=A35AD93CE23D2EA7, has lost the lock on the data store.
[19/09/15 07:34:31:488 BST] 00000068 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1538I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=A35AD93CE23D2EA7, is attempting to obtain an exclusive lock on the data store.
[19/09/15 07:34:31:655 BST] 00000154 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1545I: A single previous owner was found in the messaging engine's data store, ME_UUID=FE28F02EE5B5F496, INC_UUID=A35AD93CE23D2EA7
[19/09/15 07:34:31:659 BST] 00000068 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1537I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=A35AD93CE23D2EA7, has acquired an exclusive lock on the data store.

From db2one to db2two

[19/09/15 07:37:31:801 BST] 00000154 ConnectionEve W   J2CA0206W: A connection error occurred.  To help determine the problem, enable the Diagnose Connection Usage option on the Connection Factory or Data Source. This is the multithreaded access detection option. Alternatively check that the Database or MessageProvider is available.

[19/09/15 07:37:31:802 BST] 00000154 ConnectionEve A   J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for resource jdbc/wbm/MonitorDatabase. The exception is: com.ibm.db2.jcc.am.ClientRerouteException: [jcc][t4][2027][11212][4.11.69] A connection failed but has been re-established. The host name or IP address is "db2one.uk.ibm.com" and the service name or port number is 60,006.
Special registers may or may not be re-attempted (Reason code = 1). ERRORCODE=-4498, SQLSTATE=08506
[19/09/15 07:37:51:666 BST] 00000068 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1594I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=A35AD93CE23D2EA7, has lost the lock on the data store.
[19/09/15 07:37:51:668 BST] 00000068 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1538I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=A35AD93CE23D2EA7, is attempting to obtain an exclusive lock on the data store.
[19/09/15 07:37:51:787 BST] 0000015a SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1545I: A single previous owner was found in the messaging engine's data store, ME_UUID=FE28F02EE5B5F496, INC_UUID=A35AD93CE23D2EA7
[19/09/15 07:37:51:797 BST] 00000068 SibMessage    I   [MONITOR.BAMCell1.Bus:MECluster.000-MONITOR.BAMCell1.Bus] CWSIS1537I: The messaging engine, ME_UUID=FE28F02EE5B5F496, INC_UUID=A35AD93CE23D2EA7, has acquired an exclusive lock on the data store.

