Notes on Setting Up and Using Sun DiskSuite on Glued Solaris Machines



DiskSuite is a standard Sun software-based RAID package available on Glued Solaris boxes. The following are some notes on setting up and using the Sun DiskSuite package on Glued Solaris systems, primarily for the purpose of mirroring a pair of internal system disks (mirroring /, /usr, swap, et al). It is based on experiences mirroring a pair of internal system disks on Sun Enterprise 220R, SunFire 280R, and SunFire V440 systems, running Solaris 2.7 through 2.9.


Preliminary Steps

There are some preliminary notes and comments about the Disksuite package which are relevant to its use on Glue, as follows.

After the above, you should be at the "normal" beginning point for using DiskSuite. The procedure from here will vary greatly depending on what is desired. This page covers some common uses in the Physics Dept, like mirroring system disks, but other uses should be fairly straightforward and the standard resources (see next section) should be useful.

Useful Disksuite References

General References (not Glue specific)

In addition, the purpose of mirroring the system drives is to ensure that the system stays up, so it is useful to be cognizant of the recovery procedure should a disk fail. In particular, there are some small tricks involved when one of a set of mirrored system disks fails. Also, familiarity with the recovery procedure may affect choices made during the initial setup, and recovery can be made easier by proper documentation during the setup. The following references may be of use:

Glue, Physics department conventions, policies

Before proceeding with a discussion of the setup of Disksuite, it is useful to discuss some conventions and policies in use by either Glue, the Physics department, or both. While these are not mandatory (unless you are configuring on a Physics system:), you should probably have an idea of how you want things set up, and these may provide useful guides. It is also a good idea for you to clearly document your conventions; Physics documents them here as well as in comments in the md.tab file.

Where to locate the DB replicas?

Disksuite requires space on the disk to store its metadb database replicas. Because this database contains the critical information needed for you to access the disks, it must be replicated as widely as possible. You should in general spread the replicas out evenly over as many disks as are available. On a system with only 2 internal disks, the replicas will most likely be limited to those 2 disks, and should be divided equally between them (this way the system can stay up if either disk fails). We also have some systems with 4 internal drives, and in these cases we replicate the databases across all 4 drives, even if only two of them are to be mirrored.

The database replicas take up room on the disk which cannot be used for filesystems, etc. The typical partition scheme for a Glued Solaris box is as follows:

As can be seen, the standard Glue setup uses most of the slices available. Slice 2 might be usable, but I would recommend against it, especially on a system disk. That leaves slices 6 and 7 free. Physics generally puts the DB replicas on one of these 2 slices. The replicas are not that large, 8192 blocks or about 4MB, and we usually put 2-3 copies on each disk. (NOTE: it is important to spread the copies out over multiple disks, and to have the same number of replicas on each disk.) Since I dislike making slices smaller than 50 or so MB, we usually waste a fair amount of space anyway. The other slice may provide additional local space if the disk is big enough that I cannot justify expending the entire disk on system slices.

However, if you hope to make the system an AFS server (thus using slice 6), and possibly put data on slice 7, you have a problem, as there are no partitions left free for the DB replicas. Fortunately, there is a way around that, at least if you do the mirroring before making the system an AFS server. Disksuite can share a slice between the DB replicas and a filesystem in some cases:

  1. The slice in question must only be used as a metadisk device, so disksuite drivers handle the device
  2. There cannot be any data on the slice when the metadisk device for the slice is created (other than the DB replicas). E.g., you cannot use this feature to install the replicas on a slice already containing data.
Since the DB replicas do not take up that much space, this can be very useful.

Because it is unwise to have disksuite manage a /vicep partition on an AFS server, and since you would want the AFS server software mirrored as well, the best bet is to mirror the system before the AFS server software is installed. Put the DB replicas on slice 6, mirror root (/), /usr, /var, swap, and the AFS cache as normal, then create an empty metadevice on slice 6, newfs it, and mount it on /usr/afs.
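For example, a minimal sketch of that last step might look like the following, assuming the DB replicas sit on slice 6 of c0t0d0 and c0t1d0 and using the illustrative metadevice names d61, d62, and d60 (substitute your own naming scheme):

# submirrors on the (replica-holding, otherwise empty) slice 6 of each disk
metainit d61 1 1 c0t0d0s6
metainit d62 1 1 c0t1d0s6
# one-way mirror, then build and mount the filesystem on the mirror device
metainit d60 -m d61
newfs /dev/md/rdsk/d60
mkdir -p /usr/afs
mount /dev/md/dsk/d60 /usr/afs
# attach the second submirror once the filesystem is mounted via the mirror
metattach d60 d62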

Some example configurations from Physics:

Metadevice naming conventions

In addition to the locations of the DB replicas, you will need to come up with a naming scheme for the mirrors and submirrors. Each slice to be mirrored will need a distinctly named disksuite submirror device on each of the disks being mirrored, and the redundant mirror device itself also needs to be named. The metadevice names should be of the form dN or dMN, where M and N are digits and M cannot be 0 (e.g. d01 is not allowed).

You probably want to come up with a reasonable naming scheme; ideally it should make it obvious what slice the mirror metadevice refers to, and what disk and slice each submirror refers to. Physics uses the following:

Other than convenience, however, there is nothing special about the naming.

Setting up Disksuite Database replicas

Regardless of whether you want to do logging, mirroring, striping, or RAID, you need to create the metadb DB replicas for Disksuite. Because this step is so universal, it is being covered in its own section.

Before creating the DB replicas, you should have:

  1. Decided where you are going to put them. This is discussed in more detail in the section on DB location policies.
  2. Defined the partitions the DB replicas are going to go on. You need an empty slice for this (you may be able to use the slice for a filesystem later, but the slice must be empty at this stage). The partition needs to be defined on each disk the DB replicas are going on (generally all disks). You can define the partitions with the standard Solaris partitioning tools (e.g. format). If you are mirroring the disks, you want them to have the same partition structure anyway, so once the first disk is set up, you can use the command
    prtvtoc /dev/rdsk/DISK1 | fmthard -s - /dev/rdsk/DISK2
    to copy the partition table from DISK1 to DISK2.

We are now ready to create the state meta-databases. First, make sure no one configured disksuite without your knowledge by checking for the existence of DB replicas with the command metadb. Solaris 2.7 users may have to give a full path to the metadb command, e.g. /usr/opt/SUNWmd/sbin/metadb. On Solaris 8 and 9, it is in /usr/sbin which should be in your path. This should return an error complaining that "there are no existing databases". It might also just return nothing (usually indicating that DB replicas were set up once and then all were deleted).

If you get a list of replicas, STOP. Someone set up or tried to set up disksuite before you; figure out what the status is before proceeding further. Using the command below to try to create another initial database set will hopefully yield an error, but if not it could be disastrous, wiping out the previous DB and making the previously mirrored, striped, etc. disks inaccessible.

For a two disk system, Sun advises a minimum of 2 replicas per disk; Physics uses 3. To create the initial replicas, issue the command (as root):

metadb -a -f -c 3 slice
replacing slice with the slice to put the initial replicas on. E.g., /dev/dsk/c0t0d0s7 to put it on slice 7 of the 1st disk. The -c 3 in the above command instructs it to put three copies of the DB there. The -a says we are adding replicas, and the -f forces the addition. NOTE: the -f option should only be used for the initial DB replica, when it is REQUIRED to avoid errors due to lack of any previously existing replicas.

You can check the databases were created successfully by issuing the metadb command without arguments, or with just the -i argument to get a legend for the flags. You should now see 3 (or whatever value you gave to the -c argument) DB replicas in successive blocks on the slice you specified. At this point, only the a (active) and u (up-to-date) flags should be set.

Now add the replicas on the second (and any other) drives. This is done with a command like:

metadb -a -c 3 /dev/dsk/c0t1d0s7
NOTE: make sure to add the same number of replicas to each disk, or it may cause problems when a disk fails (you may end up with less than half the replicas; having exactly half is bad enough). If there are problems, use the -d option to delete all replicas on the named partition, and then re-add the correct number.
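For example, if the wrong number of replicas ended up on the second disk, something like the following (using the same slice as above) deletes them and re-adds the correct count:

metadb -d /dev/dsk/c0t1d0s7
metadb -a -c 3 /dev/dsk/c0t1d0s7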

Again, you can use the plain metadb command (or give it the -i option for the flags legend) to verify the databases were created successfully. This command is also useful to use later to verify things are OK. At this early stage, you should see a line with the a and u flags for each replica on each disk.

Once disksuite is fully functioning and operational, you should again see a line for each replica on each disk. The following flags seem to be set on a functioning system (flags should appear for every replica unless otherwise stated):

Setting up DiskSuite to Mirror the Root disk of a Glued Solaris Box

This section gives instructions on setting up mirroring of the root disk on a new Glued Solaris box with 2 internal disks. Most of the instructions can be easily adapted for systems with more than 2 disks, or if mirroring something other than the root disk.

The instructions assume that we are enabling mirroring on a newly installed, non-production machine. These restrictions are not strictly required, but obviously enabling mirroring of the system disk on a system already in production runs some risks. The system WILL need to be rebooted at least once when mirroring the system disk; if a non-system disk is being mirrored you can probably get away with just unmounting and remounting the partitions being mirrored at the appropriate times. NOTE: all filesystems being mirrored must be mounted using the single-ply metadisk mirror before attaching the second submirror to the mirror metadevice; otherwise new data written to the filesystem will only be written to one submirror, and disksuite will get very upset upon an attempt to reboot (as the two submirrors are listed as being in sync but are not); if you do this with root you will probably not even be able to get to single-user mode.

  1. Do a "normal" install of Glue onto the first of the two internal disks. The only thing special compared to any other Glue install is that you need to have an idea of where the metadb DB replicas are to go and make sure the partition table has room for them. See this section for more information about locating the DB replicas.
  2. After Glue is installed, the initial rdist done, and the system is up and healthy, create the partition table on the second disk. I am not sure if this is strictly required, but even if not, it is strongly advised that all slices to be mirrored have the same slice number and be identical in the partition table (same starting and ending cylinder number, tags, flags, etc.). Indeed, since typically you will be mirroring all defined slices on the two disks, it is usually easiest to have identical partition tables, which you can do by copying the table from DISK1 to DISK2:
    prtvtoc /dev/rdsk/DISK1 | fmthard -s - /dev/rdsk/DISK2
  3. Perform any preliminary steps required. In particular, for Solaris 2.7, a number of devices and files need to be created.
  4. Decide on the location of the DB replicas (see here for help), and install the DB replicas as discussed in the previous section. E.g., if you are installing to slice 7 of the two disks c0t0d0 and c0t1d0, you would use
    metadb -f -c3 -a /dev/dsk/c0t0d0s7
    metadb -c3 -a /dev/dsk/c0t1d0s7
  5. Decide on a metadevice naming scheme.
  6. Create the metadevices for the two submirrors and for a single-ply mirror device for each partition to be mirrored. I find this easiest to do by editing the md.tab file (found in /etc/opt/SUNWmd on Solaris 2.7 systems, and /etc/lvm on Solaris 8 and 9 systems); this has the advantage of allowing comments and a more lasting record of what was done (especially if you save a copy in the Glue config tree). A rough sketch of what the md.tab entries and corresponding commands might look like is given after this list.
  7. Issue the mount command and make sure all filesystems to be mirrored are listed as mounted under the metadisk mirror device, not the physical device. Do not proceed any further unless this is the case, as attaching the second submirror while the physical device for the first submirror is mounted will result in metadisk syncing the disks, but any writes to the filesystem go to one disk only, thereby breaking the synchronization. What is worse, metadisk thinks they are synchronized, and will fail hard. You might not even be able to get into single-user mode on the next reboot. Make sure all filesystems to be mirrored are mounted with the metadisk mirror device before attaching the other submirror. This generally means a reboot after editing vfstab as indicated in the previous step (see the sketch after this list).
  8. At this point, the filesystems to be mirrored are using the mirrored metadevices, but we are not really mirroring yet because the mirrors all consist of only a single submirror. We can now attach the second submirror to each of the mirrors with a command like:
    metattach MIRROR SUBMIRROR2
    This causes SUBMIRROR2 to be attached to MIRROR, and the newly attached submirror will be synchronized with the previous submirror, copying information from the old submirror to the new one. For the /var filesystem in our example, metattach d6 d61. Repeat this for all filesystems to mirror. Because the synchronization can take a while and keep the disks quite busy, you may want to allow each slice to finish synchronizing before starting the next on a production system (see the metastat command below). On non-production systems I usually let them all sync in parallel.
  9. You can run the metastat command to see how things are going. All the mirrors you recently attached should be syncing up. This can take some time. The new submirrors should now show up as "Submirrors" and not as "Concat/Stripes". States for the newly attached submirrors will be "Resyncing" but should eventually change to "Okay" when synchronized.
  10. If the swap device was mirrored, you will need to change the crash dump device to the swap metadevice. This can be done with the command
    dumpadm -d `swap -l | tail -1 | awk '{print $1}'`
    It should report the new dump device as the swap mirror metadevice, e.g. /dev/md/dsk/d2. You can verify this at a later point by issuing the command dumpadm without any arguments. Failure to do this may cause serious problems if a crash dump ever gets written, and may make the system unbootable.
  11. If you are mirroring the boot disk (e.g. root (/)), you need to make the mirror disk bootable. On SPARC systems, this can be done with the command
    installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/ROOT2
    where ROOT2 is the root slice of the mirrored disk. E.g. c0t1d0s0.
  12. If you are mirroring the boot disk you will probably want to create an alias for the alternate boot disk. Openboot does not know about the software mirroring, and will only boot from physical disks. Normally it boots from the first disk, but it is helpful to define an alias for the mirrored disk so you can easily boot off of that disk if the primary disk fails. This can be done either from the shell or from openboot (as was mentioned when we suggested that you might want to halt instead of reboot after modifying the vfstab file). If you did not do it then, you can do it now, as follows:
    1. Determine the physical device path of the mirror disk by examining the symbolic link for the device under /dev/dsk and extracting everything following the /devices part. E.g., if the mirrored root disk is c0t1d0s0, issue the command ls -l /dev/dsk/c0t1d0s0. This should return something like (the example below happens to be from a machine whose second disk is c1t1d0)
      benfranklin:~# ls -l /dev/dsk/c1t1d0s0
      lrwxrwxrwx 1 root root 70 May 4 2004 /dev/dsk/c1t1d0s0 -> ../../devices/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf8a6403,0:a
      so what you want is /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf8a6403,0:a.
    2. If you are doing this from a system that is up in unix, first check for any previously existing device aliases with the command eeprom nvramrc. If it returns something like data not available, you can just go ahead. Otherwise you need to determine if the previous definitions should be kept or not. If you want to wipe them out, just proceed as if there were no previous definitions, otherwise you need to cut and paste the previous definitions into the next command.
    3. Create a device alias with an easily remembered name (e.g. "mirror") for this disk, and then issue the following command from the Unix prompt
      eeprom "nvramrc=OLDDEFS devalias mirror PHYSPATH
      where OLDDEFS are any old definitions in the NVRAM that you wish to keep, and PHYSPATH is the physical path to the secondary boot disk. For the above example, assuming no previous definitions exist (or none that you want to keep), you would have:
      eeprom "nvramrc=devalias mirror /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf8a6403,0:a"
      You can, of course, choose a name other than "mirror" for the alias if desired. Alternatively, you can do this with the corresponding command at the openboot prompt, e.g.
      nvalias mirror /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf8a6403,0:a
      (Actually, you can save some typing by running the show-disks command, and selecting the device you want, and then typing ^Y (control-Y) instead of the long device name in the nvalias command).
    4. Make sure the NVRAM aliases are read at boot up. From Unix, use the command
      eeprom "use-nvram?=true"
      or from openboot prompt the command
      setenv use-nvramrc? true
    5. You can verify the settings with the commands
      eeprom "nvramrc"
      from Unix or
      devalias
      from openboot prompt.
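
To make steps 6 and 7 above more concrete, here is a rough sketch of what the md.tab entries and the commands to activate them might look like for mirroring root (/) across c0t0d0 and c0t1d0. The metadevice names (d0, d10, d20) and slice are illustrative only; use whatever naming scheme you settled on. The other filesystems are handled the same way, except that their /etc/vfstab entries must be edited by hand (to point at /dev/md/dsk/dN and /dev/md/rdsk/dN) instead of via metaroot.

# md.tab entries: a submirror on the root slice of each disk, plus a one-way mirror
d10 1 1 c0t0d0s0
d20 1 1 c0t1d0s0
d0 -m d10

# create the devices defined in md.tab; -f is needed for d10 because
# c0t0d0s0 holds the existing, mounted root filesystem
metainit -f d10
metainit d20
metainit d0

# for root only, metaroot edits /etc/vfstab and /etc/system to use d0
metaroot d0

After the reboot in step 7, metattach d0 d20 attaches the second submirror as described in step 8.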



