Monday, 13 June 2011

IBM Tivoli Directory Server - It's the ulimit, man

During a piece of planned work last week, we suffered an outage on one of our two LDAP servers, running IBM Tivoli Directory Server 6.1.0.37.

After a fair bit of problem determination (PD), we identified the problem, but the logs (ibmslapd.log, to be precise) only provided part of the answer:
07/06/11 21:27:18 GLPCTL113I Largest core file size creation limit for the process (in bytes): '0'(Soft limit) and '-1'(Hard limit).
07/06/11 21:27:18 GLPCTL119I Maximum Data Segment(Kbytes) soft ulimit for the process is -1 and the prescribed minimum is 262144.
07/06/11 21:27:18 GLPCTL119I Maximum File Size(512 bytes block) soft ulimit for the process is -1 and the prescribed minimum is 2097152.
07/06/11 21:27:18 GLPCTL122I Maximum Open Files soft ulimit for the process is 1024 and the prescribed minimum is 500.
07/06/11 21:27:18 GLPCTL120I Maximum Stack Size(Kbytes) soft ulimit for the process is 2048; the attempt to modify to the prescribed minimum 10240 failed.
07/06/11 21:27:18 GLPSRV151E Prescribed soft limit is out of range or greater than hard limit value for Stack Size(Kbytes) ulimit option.
07/06/11 21:27:18 GLPCTL119I Maximum Virtual Memory(Kbytes) soft ulimit for the process is -1 and the prescribed minimum is 1048576.
07/06/11 21:27:18 GLPSRV040E Server starting in configuration only mode due to errors.
07/06/11 21:27:18 GLPCOM024I The extended Operation plugin is successfully loaded from libevent.so.
07/06/11 21:27:18 GLPSRV155I The DIGEST-MD5 SASL Bind mechanism is enabled in the configuration file.
07/06/11 21:27:18 GLPCOM021I The preoperation plugin is successfully loaded from libDigest.so.
07/06/11 21:27:18 GLPCOM024I The extended Operation plugin is successfully loaded from libevent.so.
07/06/11 21:27:18 GLPCOM024I The extended Operation plugin is successfully loaded from libtranext.so.
07/06/11 21:27:18 GLPCOM023I The postoperation plugin is successfully loaded from libpsearch.so.
07/06/11 21:27:18 GLPCOM024I The extended Operation plugin is successfully loaded from libpsearch.so.
07/06/11 21:27:18 GLPCOM025I The audit plugin is successfully loaded from libldapaudit.so.
07/06/11 21:27:18 GLPCOM024I The extended Operation plugin is successfully loaded from libevent.so.
07/06/11 21:27:18 GLPCOM023I The postoperation plugin is successfully loaded from libpsearch.so.
07/06/11 21:27:18 GLPCOM024I The extended Operation plugin is successfully loaded from libpsearch.so.
07/06/11 21:27:18 GLPCOM022I The database plugin is successfully loaded from libback-config.so.
07/06/11 21:27:18 GLPSRV017I Server configured for secure connections only.
07/06/11 21:27:18 GLPSRV015I Server configured to use 636 as the secure port.
07/06/11 21:27:18 GLPCOM024I The extended Operation plugin is successfully loaded from libloga.so.
07/06/11 21:27:18 GLPCOM024I The extended Operation plugin is successfully loaded from libidsfget.so.
07/06/11 21:27:18 GLPSRV180I Pass-through authentication is disabled.
07/06/11 21:27:18 GLPCOM004I SSL port initialized to 636.
07/06/11 21:27:18 GLPSRV098I Directory server audit logging started.
07/06/11 21:27:18 GLPSRV009I 6.1.0.37 server started.
07/06/11 21:27:18 GLPSRV036E Errors were encountered while starting the server; started in configuration only mode.
07/06/11 21:27:18 GLPSRV048I Started 15 worker threads to handle client requests.
07/06/11 21:27:54 GLPSSL019E The SSL layer has reported an unidentified internal error.
As you can see from the last line (GLPSSL019E), the problem APPEARED to be with SSL.

Whilst LDAP was starting, it was only coming up in so-called "configuration only mode", meaning that it would not accept user authentication requests.

This led to an interesting problem - we have a load balancer sitting in front of a pair of replicated ITDS boxes. The load balancer "saw" the failing server coming up on port 636, even though it was in configuration mode, and started directing traffic to it :-(

In fact, that was only the symptom. The actual problem was much earlier in the logs:

07/06/11 21:27:18 GLPCTL120I Maximum Stack Size(Kbytes) soft ulimit for the process is 2048; the attempt to modify to the prescribed minimum 10240 failed.

It turned out that idsinst, the user that starts LDAP (albeit via a sudo command), was unable to raise the stack size user limit ( ulimit -s ) from 2048 KB to the prescribed minimum of 10240 KB.
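
A quick pre-flight check along these lines might have surfaced the problem before the server was started. This is only a minimal sketch in Python, not anything that ships with ITDS: the 10240 KB figure is the prescribed minimum taken from the log above, and the function name stack_limit_ok is made up for illustration. It assumes the check is run in the same session (same user, same sudo route) that starts the server, since limits are inherited per process.

```python
import resource

PRESCRIBED_STACK_KB = 10240  # prescribed minimum reported by GLPCTL120I above


def stack_limit_ok(soft, hard, minimum_kb=PRESCRIBED_STACK_KB):
    """Return True if the soft stack limit is, or can be raised to, minimum_kb.

    soft and hard are in bytes, as returned by resource.getrlimit();
    resource.RLIM_INFINITY means "unlimited".
    """
    needed = minimum_kb * 1024
    if soft == resource.RLIM_INFINITY or soft >= needed:
        return True  # already high enough
    # An unprivileged process may raise its soft limit only up to the hard limit.
    return hard == resource.RLIM_INFINITY or hard >= needed


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("stack soft/hard (bytes):", soft, hard)
    print("OK" if stack_limit_ok(soft, hard) else "hard limit too low for the prescribed minimum")
```

If the check fails, the hard limit in that session is the thing to look at, which is what GLPSRV151E was hinting at.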

This was partly due to the method that is used to access the server via command-line ( rather than using the more traditional SSH route ), but the lesson that we learned is that some errors are more important than others.

The moral of the story? Check ALL of the log file, not just the bits that look interesting.
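
One way to make that easier is to filter the log down to the lines worth reading. The sketch below is an assumption-laden illustration rather than an IBM tool: the message ID pattern is inferred from the excerpt above (GLP, three letters, three digits, then a severity letter where I is informational, W is warning and E is error), and suspicious_lines is a hypothetical helper name. It also flags any line containing "failed", because the line that mattered here, GLPCTL120I, was logged as informational.

```python
import re
import sys

# Message IDs in the excerpt look like GLPSRV040E: "GLP", a three-letter
# component, a three-digit number, and a severity letter (I, W or E).
# This pattern is inferred from the excerpt, not from IBM documentation.
MSG_ID = re.compile(r"\b(GLP[A-Z]{3}\d{3})([IWE])\b")


def suspicious_lines(lines):
    """Yield lines that are warnings or errors, or that mention a failure."""
    for line in lines:
        match = MSG_ID.search(line)
        severity = match.group(2) if match else None
        if severity in ("E", "W") or "failed" in line.lower():
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py /path/to/ibmslapd.log
    with open(sys.argv[1]) as log:
        for hit in suspicious_lines(log):
            print(hit)
```

Run against the excerpt above, this would list the stack size line first, ahead of the SSL error that originally caught the eye.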

1 comment:

Ryanm29 said...

It's an easy mistake to make, especially when TDS is involved! :)
