What happened to staff email on Monday?
On Saturday June 4th at approximately 14.00, email stopped synchronising between the WBS staff GroupWise system and mobile devices such as Blackberries and iPhones. This was first reported to the Helpdesk on Sunday. We tried to fix it remotely, but when that didn't work we left the problem to be looked at first thing on Monday morning. Affected users could still check email via GroupWise or Webmail.
We quickly found the root cause of the problem to be an error in the product we use for our new service to synchronise GroupWise with non-Blackberry mobile phones. The manufacturer of this product, Novell, has recognised this problem and produced a hotfix, which we shall apply soon at a time that will cause the minimum amount of disruption. The full technical details of this problem and the hotfix are here.
We now knew the cause, but the immediate problem remained. To resolve this we decided to restart the staff Post Office, the part of the email system that the mobile devices talk to. An email was sent to all staff warning of a short outage at 10.30 that day... or so the theory went. When the server was restarted it popped up a warning that a disk check had not been performed in the last 146 days (only a computer could consider that a sensible figure) and that it was strongly recommended that one be performed now. We took the advice, as not doing so could have caused greater problems. The check took about one hour, during which time the staff email system was not available. Once the check had finished, the system came back up and worked correctly.
Steps taken to prevent this happening again...
1) The hotfix will be applied soon to prevent the original problem that stopped the mobile phone synchronisation.
2) We will set an automatic job to restart the server regularly on Tuesday mornings at about 5.00am. Even if a disk check is required, an outage from 5.00am to 6.00am is unlikely to be too disruptive.
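
For anyone curious how the restart window in step 2 works out, here is a rough sketch in Python. It is only an illustration, not the job that will actually run on the server: the function name and constants are made up for this example, and the one-hour worst case is simply the length of Monday's disk check.

```python
from datetime import datetime, time, timedelta

# Assumed values taken from step 2 above; the names are illustrative only.
RESTART_WEEKDAY = 1                     # Tuesday (Monday is 0)
RESTART_TIME = time(5, 0)               # 5.00am
WORST_CASE_OUTAGE = timedelta(hours=1)  # length of Monday's disk check


def next_restart_window(now):
    """Return (start, latest_end) of the next Tuesday 5.00am restart after `now`."""
    days_ahead = (RESTART_WEEKDAY - now.weekday()) % 7
    start = datetime.combine(now.date() + timedelta(days=days_ahead), RESTART_TIME)
    if start <= now:
        # This week's slot has already passed, so use next week's.
        start += timedelta(days=7)
    return start, start + WORST_CASE_OUTAGE


if __name__ == "__main__":
    start, latest_end = next_restart_window(datetime.now())
    print("Next restart:", start.strftime("%A %d %B, %H.%M"))
    print("Back by (worst case):", latest_end.strftime("%H.%M"))
```

If a disk check is not needed, the outage should be much shorter than the worst case shown.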