DiskSuite DB replica management

Management of DiskSuite metadb DB replicas

DiskSuite uses metadb DB replicas to hold all the critical information needed to access your data stored on metadevices. These replicas are essential in many cases to maintain your data integrity, and no discussion of DiskSuite would be complete without some discussion of the care and maintenance of these DB replicas.

Topics include:

Where to locate the DB replicas?
Setting up your metadb DB replicas
Checking the status of your DB replicas

Where to locate the DB replicas?

DiskSuite requires space on the disk to store its metadb database replicas. Because this database contains the critical information needed for you to access the disks, it must be replicated as widely as possible. In general, you should spread the replicas out evenly over as many disks as are available, and where possible they should be evenly distributed over multiple controllers as well (although this is often not feasible). On a system with only 2 internal disks, the replicas will most likely be limited to those 2 disks, and should be divided equally between them (this way the system can stay up if either disk fails). We also have some systems with 4 internal drives, and in these cases we replicate the databases across all 4 drives, even if only two of them are to be mirrored.
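The reasoning behind an even split can be checked with simple arithmetic. The following is only a sketch: after_failure is a made-up helper name, not a DiskSuite command, and it relies on the point made later in this document that ending up with less than half of the replicas is the problem case (with exactly half being marginal). It just counts what survives when one disk is lost.

```shell
#!/bin/sh
# Classify how many replicas are left after one disk fails.
# Usage: after_failure FAILED_INDEX COUNT1 COUNT2 ...
# COUNTn is the number of replicas on disk n (1-based).
# after_failure is a hypothetical helper, not a DiskSuite command.
after_failure() {
    failed=$1
    shift
    total=0
    left=0
    i=1
    for n in "$@"; do
        total=$(( total + n ))
        if [ "$i" -ne "$failed" ]; then
            left=$(( left + n ))
        fi
        i=$(( i + 1 ))
    done
    if [ $(( left * 2 )) -lt "$total" ]; then
        echo "$left of $total: less than half"
    elif [ $(( left * 2 )) -eq "$total" ]; then
        echo "$left of $total: exactly half"
    else
        echo "$left of $total: more than half"
    fi
}

after_failure 1 3 3       # two disks, 3 replicas each
after_failure 1 3 2       # uneven split, the fuller disk fails
after_failure 1 2 2 2 2   # four disks, 2 replicas each
```

With two disks the best you can do after a failure is exactly half, and an uneven split makes it worse if the fuller disk is the one that dies. Spreading replicas over four disks leaves well over half.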

The database replicas take up room on the disk which cannot be used for filesystems, etc. The typical partition scheme for a Glued Solaris box is as follows:

0: / (root slice)
1: swap partition
2: whole disk (backup) slice, not used by Glue
3: /usr
4: /usr/vice/cache (AFS cache)
5: /var
6: generally unused, /usr/afs on AFS servers, maybe DB replicas
7: generally unused, maybe DB replicas

As can be seen, the standard Glue setup uses most of the slices available. Slice 2 might be usable, but I would recommend against it, especially on a system disk. That leaves slices 6 and 7 free. Physics generally puts the DB replicas on one of these 2 slices. The replicas are not that large: 8192 blocks, or about 4MB, on recent versions (Solaris 9), and much smaller (about 1/2MB) on earlier versions (Solaris 7 and 8). We usually put 2-3 copies on each disk. (NOTE: it is important to spread the copies out over multiple disks, and to have the same number of replicas on each disk.) Since I dislike making slices smaller than 50 or so MB, we usually waste a fair amount of space anyway. The other slice may have additional local space available if the disk is big enough that I cannot justify expending the entire disk on system slices.
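As a rough check on the sizes quoted above, the space the replicas need on a slice is the replica size in blocks times the number of copies. This is a minimal sketch assuming 512-byte disk blocks; replica_kb is an illustrative name, not a real command.

```shell
#!/bin/sh
# Space needed for the DB replicas on one slice, in KB.
# Assumes 512-byte disk blocks; replica_kb is a hypothetical helper.
replica_kb() {
    blocks=$1   # size of one replica in blocks (8192 on Solaris 9)
    copies=$2   # number of replicas placed on the slice
    echo $(( blocks * 512 * copies / 1024 ))
}

replica_kb 8192 1   # one Solaris 9 replica
replica_kb 8192 3   # three copies on one slice
```

The first call prints 4096 (the 4MB mentioned above) and the second 12288, so even three Solaris 9 replicas use only about 12MB of a 50MB slice.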

However, if you hope to make the system an AFS server (thus using slice 6), and possibly put data on slice 7, you have a problem, as there are no more partitions free to put the DB replicas. Fortunately, there is a way around that, at least if you do the mirroring before making the system an AFS server. Disksuite can share a slice between the DB replicas and a filesystem in some cases:

The slice in question must only be used as a metadisk device, so that the DiskSuite drivers handle the device.
There cannot be any data on the disk when the metadisk device for the slice is created (other than the DB replicas). E.g., you cannot use this feature to install the replicas on a slice already containing data.

Since the DB replicas do not take up that much space, this can be very useful. Note: this can cause problems in an upgrade because of the change in the default replica size from Solaris 7,8 to Solaris 9. You may need to look at a reference from SunSolve discussing this and the use of the -l option to metadb.

Because it is unwise to have DiskSuite manage a /vicep partition on an AFS server, and since you would also want the AFS server software mirrored, the best bet is to mirror the system before the AFS server software is installed. Put the DB replicas on slice 6, mirror root (/), /usr, /var, swap, and the AFS cache as normal, then create an empty metadevice on slice 6, newfs it, and mount it on /usr/afs.

Some example configurations from Physics:

A system with 2 internal disks, not an AFS server: slices 0-5 are standard, and three DB replicas put on either slice 6 or 7, with the other available for extra storage if more room on the disk. Slices 0, 1, 3, 4, and 5 are mirrored, and possibly the extra slice for local storage as well.
An AFS server system with 2 internal disks, no non-system storage internal: slices 0-5 are standard, /usr/afs on slice 6, and three DB replicas on slice 7. Slices 0, 1, 3, 4, 5, and 6 are all mirrored.
An AFS server system with 2 internal disks, with extra space on the 2 disks for local storage of non-system files: slices 0-5 are standard, slice 6 contains 3 DB replicas AND /usr/afs, and slice 7 contains the extra space. Slices 0, 1, 3, 4, 5, and 6 are mirrored. Slice 7 may or may not be mirrored (definitely not if used as a vice partition).
A system with 4 internal disks, not an AFS server: Two disks are mirrored system disks, with standard slices 0-5. Slice 7 on all 4 disks reserved for two DB replicas, and slices 0, 1, 3, 4, and 5 on system disks are mirrored. Slice 6 on system disks available if extra space on them, and slices 0-6 on the other two disks also available. These slices may or may not be mirrored.
An AFS server with 4 internal disks, no extra room on system disks: Two disks are mirrored system disks, with standard slices 0-5. Slice 7 on all 4 disks is made about 1GB and two DB replicas are placed on each of them. Slice 6 on the system disks contains /usr/afs. Slices 0, 1, 3, 4, 5, and 6 on the system disks are mirrored. Slices 0-6 on the other two disks are available, and may or may not be mirrored (definitely not if used as a vice partition).

Setting up Disksuite Database replicas

Regardless of whether you want to do logging, mirroring, striping, or RAID, you need to create the metadb DB replicas for Disksuite. Because this step is so universal, it is being covered in its own section.

Before creating the DB replicas, you should have:

Decided where you are going to put them. This is discussed in more detail in the section on DB location policies.
Defined the partitions the DB replicas are going to go on. You need an empty slice for this (you may be able to use the slice for a filesystem later, but the slice must be empty at this stage). The partition needs to be defined on each disk the DB replicas are going on (generally all disks). You can use the command to do this. If you are mirroring the disks, you want them to have the same partition structure anyway, so once the first disk is set up, you can use the command
prtvtoc /dev/rdsk/DISK1 | fmthard -s - /dev/rdsk/DISK2
to copy the partition table from DISK1 to DISK2.

We are now ready to create the state meta-databases. First, make sure no one configured DiskSuite without your knowledge by checking for existing DB replicas with the metadb command. This should return an error complaining that "there are no existing databases". It might also just return nothing (usually indicating that DB replicas were set up once and then all were deleted). Solaris 2.7 users may have to give a full path to the metadb command, e.g. /usr/opt/SUNWmd/sbin/metadb. On Solaris 8 and 9, it is in /usr/sbin, which should be in your path.

If you get a list of replicas, STOP. Someone set up or tried to set up DiskSuite before you, and you need to figure out what the status is before proceeding further. Using the command below to try to create another initial database set will hopefully yield an error, but if it does not, the result could be disastrous: wiping out the previous DB and making the previously mirrored, striped, etc. disks inaccessible.
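If you script the setup, the check above can be turned into a guard. This is a sketch only: has_replicas is a made-up helper, and it assumes each replica line in the metadb listing ends in a /dev/dsk/ path. It needs a shell that supports functions.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the text on stdin lists at least one replica.
# Assumption: every replica line in the metadb listing contains a
# /dev/dsk/ path. has_replicas is a hypothetical helper.
has_replicas() {
    grep '/dev/dsk/' >/dev/null
}

# Intended use on a real system, before creating the first replicas:
#   if metadb 2>/dev/null | has_replicas; then
#       echo "DB replicas already exist; stop and investigate." >&2
#       exit 1
#   fi

# Empty input (metadb printed nothing) counts as "no replicas":
if printf '' | has_replicas; then
    echo "replicas found"
else
    echo "no replicas found"
fi
```

The guard only tells you that something is there. Working out what state the existing setup is in remains a manual job.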

For a two-disk system, Sun advises a minimum of 2 replicas per disk; Physics uses 3. To create the initial replicas, issue the command (as root):

metadb -a -f -c 3 slice

replacing slice with the slice to put the initial replicas on, e.g. /dev/dsk/c0t0d0s7 to put them on slice 7 of the 1st disk. The -c 3 in the above command instructs it to put three copies of the DB there. The -a says we are adding replicas, and the -f forces the addition. NOTE: the -f option should only be used for the initial DB replicas, when it is REQUIRED to avoid errors due to the lack of any previously existing replicas.

NOTE: if you are replacing a replica set on a partition that has a filesystem on it, be aware of the change in the default replica size between Solaris 7,8 and Solaris 9. You may need to use the -l option on metadb to limit the size of the new replicas so as not to overwrite the beginning of the filesystem, or do some nasty recreation of the filesystem to a smaller size.

You can check the databases were created successfully by issuing the metadb command without arguments, or with just the -i argument to get a legend for the flags. You should now see 3 (or whatever value you gave to the -c argument) DB replicas in successive blocks on the slice you specified. At this point, only the a (active) and u (up-to-date) flags should be set.

Now add the replicas on the second (and any other) drives. This is done with a command like:

metadb -a -c 3 /dev/dsk/c0t1d0s7

NOTE: make sure you add the same number of replicas to each disk, or it may cause problems when a disk fails (you may end up with less than half the replicas; having exactly half is bad enough). If there are problems, use the -d option to delete all replicas on the named partition, and then re-add the correct number.
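To keep the replica count the same on every disk, it can help to generate the commands from a single list of slices. The sketch below only prints the commands and does not run them. print_metadb_cmds is a made-up helper name, and the slice names are just the examples used above.

```shell
#!/bin/sh
# Print (do not run) the metadb commands for a list of slices, using the
# same number of copies on each. print_metadb_cmds is a hypothetical helper.
# Usage: print_metadb_cmds COPIES SLICE1 SLICE2 ...
print_metadb_cmds() {
    copies=$1
    shift
    first=yes
    for slice in "$@"; do
        if [ "$first" = yes ]; then
            # -f is only for the very first replicas on the system
            echo "metadb -a -f -c $copies $slice"
            first=no
        else
            echo "metadb -a -c $copies $slice"
        fi
    done
}

print_metadb_cmds 3 /dev/dsk/c0t0d0s7 /dev/dsk/c0t1d0s7
```

Review the printed commands, then run them by hand as root, checking the output of metadb after each one.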

Checking the status of your DB replicas

You can use the plain metadb command (or give it the -i option for the flags legend) to verify the databases are functioning properly. This should be used right after creation to ensure they were created successfully, and is also useful to use later to verify things are OK.

You should again see a line for each replica on each disk, along with some flags indicating the status of each replica. In general, lower case flags are good, upper case flags are bad. The following flags seem to be set on a functioning system (flags should appear for every replica unless otherwise stated):

a: the replica is active. This should always be set.
m: flagging the master replica (only one replica should have this set, usually the first)
p: the replica is patched into the kernel. This should get set after the first reboot (why? what does it mean?)
l: the replica was read successfully. This should get set after the first reboot?
u: the replica is up-to-date. This should always be set.
o: the replica was active prior to the last database change. This should get set after the first reboot.
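The "lower case good, upper case bad" rule of thumb can be checked mechanically. This is a rough sketch: bad_replicas is a made-up helper, and it assumes that replica lines contain a /dev/dsk/ path and that the flags are everything before the first digit on the line. It needs a POSIX shell (ksh or /usr/xpg4/bin/sh rather than the old /bin/sh).

```shell
#!/bin/sh
# Print the metadb listing lines whose flags column has an upper case flag.
# bad_replicas is a hypothetical helper, not part of DiskSuite.
# Assumptions: replica lines contain a /dev/dsk/ path, and the flags are
# everything before the first digit on the line.
bad_replicas() {
    grep '/dev/dsk/' | while read -r line; do
        flags=${line%%[0-9]*}    # strip from the first digit to end of line
        case $flags in
            *[[:upper:]]*) echo "$line" ;;
        esac
    done
}

# Intended use on a real system:
#   metadb | bad_replicas
# Made-up sample line, for illustration only:
printf '%s\n' '      W   p  l    16    8192    /dev/dsk/c0t1d0s7' | bad_replicas
```

No output means no upper case flags were found. Any line printed is a replica worth looking at with metadb -i to see what the flag means.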