Friday 21 November 2014

IBM Business Process Manager - Missing the Bus

I've just built a single-cell, two-node, three-cluster IBM BPM Advanced 8.5.5 environment against a remote DB2 ESE 10.1.0.3 server.

So I was a little startled when, after starting the Deployment Environment, the Service Integration Bus (SIBus) failed to start properly.

This is what I saw in one of my Cluster Member logs: -

[21/11/14 13:17:03:719 GMT] 00000073 SibMessage    I   [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSIS1593I: The messaging engine, ME_UUID=E997A9EFA09498FC, INC_UUID=6DC2A53AD19710D7, has failed to gain an initial lock on the data store.
[21/11/14 13:17:03:719 GMT] 00000073 SibMessage    I   [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSIS1538I: The messaging engine, ME_UUID=E997A9EFA09498FC, INC_UUID=6DC2A53AD19710D7, is attempting to obtain an exclusive lock on the data store.


This was a clean build, so the Messaging Engine database should have been OK.

The tables were definitely there: -

SIB000                          DB2USER1        T     2014-11-21-13.43.55.547439
SIB001                          DB2USER1        T     2014-11-21-13.43.55.682333
SIB002                          DB2USER1        T     2014-11-21-13.43.55.819494
SIBCLASSMAP                     DB2USER1        T     2014-11-21-13.43.55.334938
SIBKEYS                         DB2USER1        T     2014-11-21-13.43.55.947883
SIBLISTING                      DB2USER1        T     2014-11-21-13.43.55.420531
SIBOWNER                        DB2USER1        T     2014-11-21-13.43.55.151963
SIBOWNERO                       DB2USER1        T     2014-11-21-13.43.55.081007
SIBXACTS                        DB2USER1        T     2014-11-21-13.43.56.039355

and yet .... they were ALL empty :-(

As this is MY own environment, I called the ball and dropped the SIB tables: -

db2 drop table db2user1.sib000
db2 drop table db2user1.sib001
db2 drop table db2user1.sib002
db2 drop table db2user1.sibclassmap
db2 drop table db2user1.sibkeys
db2 drop table db2user1.siblisting
db2 drop table db2user1.sibowner
db2 drop table db2user1.sibownero
db2 drop table db2user1.sibxacts
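Rather than typing nine statements by hand, the same drops could be generated in a loop. This is only a sketch under assumptions: sib_drop_statements is a made-up helper name, the schema defaults to db2user1 as above, and the table list is simply the nine tables seen in this environment. It only prints the commands, so they can be reviewed before being run in a session that is already connected to the database.

```shell
# Print a "db2 drop table" command for each SIB table listed in this post.
# sib_drop_statements is a hypothetical helper; nothing is executed here.
sib_drop_statements() {
    # Schema name is an assumption taken from this environment (db2user1)
    schema="${1:-db2user1}"
    for table in sib000 sib001 sib002 sibclassmap sibkeys \
                 siblisting sibowner sibownero sibxacts; do
        echo "db2 drop table ${schema}.${table}"
    done
}

# Dry run: print the statements for review
sib_drop_statements
```

Passing a different schema name as the first argument changes the prefix, e.g. sib_drop_statements myschema.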

and restarted the MECluster, leaving the messaging engine to recreate them.

This time around, the tables were nicely populated e.g.

db2 "select id from db2user1.sib000"

...
                 252
                 253
                 254
                 255
                 256
                 257
                 258
                 259
                 260
                 261
                 262
                 263
                 264
                 265
                 266
                 272

  269 record(s) selected.

...

and the SIBus comes up nicely.

JVM1 reports: -

[21/11/14 13:43:58:431 GMT] 0000006a SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Started.

and JVM2 reports: -   

[21/11/14 13:47:23:859 GMT] 00000065 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Joined. 

In other words, the Bus Member on node 1 is active, with the Bus Member on node 2 standing by to take over.

When I stop MEClusterMember1 on node 1, I see this from node 2: -

[21/11/14 13:51:53:684 GMT] 00000097 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Started. 

which again is as expected.

And, as a final acid test, when I restart MEClusterMember1, I see this: -

[21/11/14 13:55:33:043 GMT] 00000062 SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Joined.

and when I stop MEClusterMember2, I see this: -

[21/11/14 13:57:33:123 GMT] 0000008f SibMessage I [BPM.ProcessServer.Bus:MECluster.000-BPM.ProcessServer.Bus] CWSID0016I: Messaging engine MECluster.000-BPM.ProcessServer.Bus is in state Started.

with both messages coming from node 1.

This shows that, once the SIB tables had been dropped and recreated, the bus comes up nicely, and failover works both ways: node 1 to node 2 and node 2 to node 1.
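To follow the same state changes without reading the whole log, the CWSID0016I lines can be boiled down to a timestamp and a state. This is a rough sketch rather than anything official: me_state_changes is a made-up name, and the pattern assumes the log line layout shown in the excerpts above.

```shell
# Print "<timestamp> <state>" for every CWSID0016I line.
# me_state_changes is a hypothetical helper; it reads the files given as
# arguments, or standard input when none are given.
me_state_changes() {
    sed -n 's/^\[\([^]]*\)].*CWSID0016I: Messaging engine .* is in state \([A-Za-z]*\)\..*$/\1 \2/p' "$@"
}
```

Running it against a cluster member's SystemOut.log (the location depends on the profile) gives one line per transition, such as a Joined line followed later by a Started line when the other member stops.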

This ties in with the pattern that IBM BPM uses, known as 1-of-n, where only one ME / Bus Member can be active at any one time, regardless of the number of nodes in the cell / members in the cluster.

Which is nice.

So what went wrong? I do not know, but I now know how to resolve it AND, more importantly, what to watch for.
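One way of watching for the problem is to check the cluster member logs for the CWSIS1593I message seen earlier. The following is a minimal sketch, assuming that message is a reliable symptom; me_lock_check is a made-up name, and the log files to check are whatever is passed in.

```shell
# Report whether any CWSIS1593I ("failed to gain an initial lock") message
# appears in the given log files, or on standard input if none are given.
# me_lock_check is a hypothetical helper; it returns 1 when the message is found.
me_lock_check() {
    if grep -q 'CWSIS1593I' "$@"; then
        echo "WARNING: messaging engine failed to gain its initial data store lock"
        return 1
    fi
    echo "OK: no CWSIS1593I messages found"
}
```

The non-zero return code means it could be dropped into a monitoring script or cron job, which is a design choice of this sketch rather than anything IBM prescribes.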

Some background reading: -


