Orbis 2 Systems Installation, Implementation and Operations Group (SIIOG)
Meeting Notes: October 25, 2001
Present: Gail Barnett, Roy Lechich, Wes Most, Ernie Marinko, Fred Martz
Issues relating to our recovery abilities in terms of database synchronicity and system redundancy (server side)
Now that the MMG has selected one of the hardware configurations proposed by ITS, we have a good idea of how many servers will be used for production, as well as how many ancillary servers will be used for various support functions, including backup/recovery, training, testing and development.
We intend to use Sun products SNDR (Sun Network Data Replicator) and II (Instant Image) in tandem to synchronize the backup database with the production database (this includes both the Oracle database and the Keyword Index data file):
- Instant Image will periodically take a snapshot of the production data (actually the last set of changes, or "deltas", since the last snapshot).
- SNDR will ship this snapshot to the Backup Oracle database server.
- The backup database server will be refreshed with the new data from this snapshot.
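The three steps above can be pictured with a toy model. This is only a sketch under loud assumptions: it treats the data as a dictionary of records, and the function names take_delta and apply_delta are hypothetical. It does not use or imitate the SNDR or Instant Image interfaces, and the real products may track changes quite differently.

```python
def take_delta(production, last_snapshot):
    """Collect what changed in production since the last snapshot (toy model)."""
    changed = {
        key: value
        for key, value in production.items()
        if key not in last_snapshot or last_snapshot[key] != value
    }
    deleted = [key for key in last_snapshot if key not in production]
    return changed, deleted


def apply_delta(backup, delta):
    """Refresh the backup copy with a shipped delta."""
    changed, deleted = delta
    backup.update(changed)
    for key in deleted:
        backup.pop(key, None)


# Hypothetical records standing in for production data.
production = {"rec1": "a", "rec2": "b"}
backup = {}
last_snapshot = {}

# First cycle: everything is new, so the delta is the full data set.
apply_delta(backup, take_delta(production, last_snapshot))
last_snapshot = dict(production)

# Production changes between snapshots.
production["rec2"] = "b2"
production["rec3"] = "c"
del production["rec1"]

# Second cycle: only the changes since the last snapshot are shipped.
delta = take_delta(production, last_snapshot)
apply_delta(backup, delta)
last_snapshot = dict(production)
```

In this model the size of each delta grows with the time between snapshots, which is one reason the snapshot frequency is worth testing on our own equipment.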
Two questions which quickly arose:
We decided we would explore the frequency question by testing on our equipment as well as by trying to obtain information on how other sites have implemented this. The "reversing" question is also something to be resolved by research and experimentation.
The discussion expanded to the larger question: how do we determine the optimum set of "moves" or procedures we need to have at our disposal to best react to any possible adverse situation in our production environment? We agreed that we should explore the full range of possible problems, from a temporary application hang to a smoky fire at the Whitney machine room.
Treating this question as a preliminary loose topic for now, various questions came up:
How do we actually effect a switchover from production to backup systems? How will the user be affected? We discussed setting up workstations with multiple configurations, containing .ini files which point to either the production or the backup instance of Voyager. If the workstation setup also contains an easy-to-run procedure, this simplifies what the user must do after being notified. One example is an icon which invokes a batch file that terminates any current Voyager clients in an "orderly" way, renames the .ini files, and restarts the clients, which would now point to the backup system (or vice versa).
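A minimal sketch of the file-switching part of such a procedure follows. The file names (voyager.ini, voyager.production.ini, voyager.backup.ini), the host names and the function switch_target are all hypothetical, since the actual Voyager client file names were not discussed. The sketch copies a stored profile over the active file instead of renaming, so that both profiles survive every switch, and it leaves out stopping and restarting the clients.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical file names; the real client .ini names are not given in these notes.
ACTIVE_INI = "voyager.ini"
PROFILE_INI = {
    "production": "voyager.production.ini",
    "backup": "voyager.backup.ini",
}


def switch_target(config_dir, target):
    """Make the stored profile for `target` the active client configuration."""
    config_dir = Path(config_dir)
    if target not in PROFILE_INI:
        raise ValueError(f"unknown target: {target!r}")
    source = config_dir / PROFILE_INI[target]
    if not source.is_file():
        raise FileNotFoundError(source)
    # Copy rather than rename, so the stored profile is still there next time.
    shutil.copyfile(source, config_dir / ACTIVE_INI)
    return target


def demo():
    """Exercise the switch in a throwaway directory with made-up host names."""
    with tempfile.TemporaryDirectory() as tmp:
        config_dir = Path(tmp)
        (config_dir / PROFILE_INI["production"]).write_text("[Server]\nhost=prod.example\n")
        (config_dir / PROFILE_INI["backup"]).write_text("[Server]\nhost=backup.example\n")
        switch_target(config_dir, "backup")
        return (config_dir / ACTIVE_INI).read_text()
```

Calling switch_target(config_dir, "production") reverses the switch, which is the "vice versa" case.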
Another solution, which struck all of us as superior in its simplicity, was for the ITS/Library staff to change the DSN alias to point to the desired target system. This would simply require that the user be notified of the need to exit and restart any Voyager clients; the .ini file would contain the same DSN alias as always but would end up being pointed to the desired system.
Wes asked Ernie whether the Workstation Support Group has experimented with having multiple instances of Voyager clients installed on a workstation, either different versions for testing new releases or the same version for connecting to different systems. Ernie replied that they had not. Although having duplicate clients for the purpose of switching over based on multiple .ini files is probably not a good option, the general issue of multiple clients on a single workstation will be important in the future for people who use both the current version of Voyager for production work and newer releases for testing.
Gail brought up the interesting potential situation in which the production Oracle server goes down, but not the production Voyager. Do we then re-configure the Voyager server to point to the backup Oracle server?
Can we do this transparently? How would this affect the user? What would be the fate of current transactions? Clearly we have to consider a wide range of scenarios with different permutations of servers going down.
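To get a feel for how many permutations there are, the combinations can be listed mechanically. The sketch below assumes just four servers (production and backup, Oracle and Voyager); the final list of servers may well be longer, and the name failure_scenarios is only illustrative.

```python
from itertools import product

# Assumed server list for illustration; the real configuration may differ.
SERVERS = [
    "production Oracle",
    "production Voyager",
    "backup Oracle",
    "backup Voyager",
]


def failure_scenarios(servers):
    """Yield every combination in which at least one server is down."""
    for states in product([True, False], repeat=len(servers)):
        down = [name for name, is_up in zip(servers, states) if not is_up]
        if down:
            yield down


scenarios = list(failure_scenarios(SERVERS))
print(len(scenarios))  # 15 combinations for four servers
```

The case Gail raised appears in this list as the single entry ["production Oracle"]. With n servers there are 2**n - 1 such combinations, so it will probably pay to group them by the "moves" they call for instead of writing a separate procedure for each.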
For our next meeting, we will try to enumerate the adverse situations we’re most likely to have to be prepared for, and from them derive a set of steps or "moves" which we would need to have mastered and tested, and talk about how these could be combined into procedures to handle any situation we’re likely to have to deal with.