Some issues that have cropped up since gluster was installed.

  1. Slow writes to disk. These come in two categories:

    1. For large enough Fortran/MATLAB (netCDF) writes, writing to gluster seems to be about 1.4 times slower than writing to the home area (see, especially, the top of pe_timing_results.txt).
    2. NetCDF writes from our PE model are catastrophically slow. See the "short PE run" tests in pe_timing_results.txt. We still need to isolate this further; see PEtest/plans.txt. A quick way to reproduce the general write slowdown outside the model is sketched below.
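
    A minimal timing comparison, assuming /projects is the gluster mount and using dd as a stand-in for the model writes (the test paths and file size are placeholders):

      # Write the same ~1 GB file to the gluster volume and to the home area, timing each.
      time dd if=/dev/zero of=/projects/dd_test.dat bs=1M count=1024
      time dd if=/dev/zero of=$HOME/dd_test.dat bs=1M count=1024
      # Clean up the test files afterwards.
      rm -f /projects/dd_test.dat $HOME/dd_test.dat
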
  2. Read permissions have apparently been lost for some files. This does not necessarily show up in "ls".

    This issue seems to be caused by 0-byte meta-pointer files left behind by gluster (see the Gluster Community Chat Archive Logs, search on "sticky"). Logging onto each of mseas-data, nas-0-0 and nas-0-1 and deleting the files we come across (ls -lhat | grep "\-\-\-\-\-\-\-\-\-T") clears this up, but isn't a long-term solution. For an example of this on our systems, see the gmeta files in /projects/philex/PE/2011/Jan09/arch75 (a sketch for locating these leftovers follows the listing below):

    ls -lh gmeta*
    -rw-rw-r-- 1 phaley philex  94M Feb 11 16:11 gmeta_ccnt_arch75
    -rw-rw-r-- 1 phaley philex  16M Feb 11 16:17 gmeta_ccntW0_arch75
    -rw-rw-r-- 1 phaley philex 9.3M Feb 11 16:15 gmeta_ccntW_arch75
    -rw-rw-r-- 1 phaley philex  36M Feb 11 16:14 gmeta_dccnt_arch75
    
    file gmeta*
    gmeta_ccnt_arch75:   writable, regular file, no read permission
    gmeta_ccntW0_arch75: writable, regular file, no read permission
    gmeta_ccntW_arch75:  writable, regular file, no read permission
    gmeta_dccnt_arch75:  data
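
    A rough sketch for finding these leftover pointer files on each of the three servers without hunting by hand; /export is a placeholder for the local brick path on that server:

      # Leftover meta-pointers show up as zero-length files with mode ---------T (1000).
      find /export -type f -perm 1000 -size 0 -ls
      # After reviewing the list, the same find with -delete removes them.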
    
  3. Inability to re-install compute nodes. Need to test this on an otherwise healthy node when Greg is around.

  4. When nas-0-0 required rebooting, the 10GbE card came up as eth2 rather than eth3. This confused gluster. Why did this happen? Can it be prevented from happening again? Is nas-0-1 vulnerable to the same thing? One way to pin the interface name is sketched below.
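
    One common way to pin the name on RHEL-type systems is to tie eth3 to the 10GbE card's MAC address in its ifcfg file (the MAC below is a placeholder; the exact mechanism depends on the OS release):

      # Find the MAC address of the 10GbE card.
      ifconfig eth3 | grep HWaddr
      # Pin that MAC in /etc/sysconfig/network-scripts/ifcfg-eth3 so the name survives reboots:
      #   DEVICE=eth3
      #   HWADDR=00:11:22:33:44:55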

  5. Intermittent failure of SGE jobs to start on compute nodes. This seems to happen on compute nodes that can't find my home area. Rebooting may help, but the fix may not persist. Need to see if reinstallation works. A quick check of which nodes can see the home area is sketched below.
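
    A quick loop to check which compute nodes can see the home area; the node names and /home/phaley are placeholders for the actual node list and home path:

      # Report for each node whether the home area is visible.
      for node in compute-0-0 compute-0-1 compute-0-2; do
          echo -n "$node: "
          ssh $node "ls /home/phaley > /dev/null 2>&1 && echo ok || echo MISSING"
      done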

  6. PFJL has noticed that glusterfs averages around 10% CPU/memory in top on mseas. Is this normal? Note that what is 10% of memory on mseas may translate to more on a compute node. A quick way to snapshot gluster's usage is sketched below.
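
    A simple snapshot of gluster process usage, for comparing mseas against a compute node (the process may show up as glusterfs or glusterfsd depending on the machine's role):

      # List CPU, memory and runtime for every gluster process.
      ps -eo pid,%cpu,%mem,rss,etime,comm | grep gluster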

  7. We probably should synchronize the clocks on nas-0-0 and nas-0-1 to the same time server that serves mseas and mseas-data. A sketch is below.
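
    A minimal NTP sketch, assuming the standard ntpd/ntpdate tools are available; time.server.example is a placeholder for whatever server mseas and mseas-data actually sync against:

      # On mseas or mseas-data, find the time server currently in use.
      grep '^server' /etc/ntp.conf
      # On nas-0-0 and nas-0-1, do a one-off sync against that server,
      # add a matching "server" line to /etc/ntp.conf, then keep ntpd running.
      ntpdate time.server.example
      service ntpd restart
      chkconfig ntpd on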


Some issues that seem to have been solved since gluster was installed.

  1. Intermittent general slowness of mseas (e.g. 30-60s before any response from "ls"). For examples, look in the GenSlow subdirectory. (We have seen significant swap usage on mseas-data.)

    The issue seems to have been setroubleshootd grabbing massive amounts of memory and pushing mseas-data into swap. A quick check/disable sketch is below.
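
    A sketch for confirming and disabling it (the service name is assumed to be "setroubleshoot" on this OS release):

      # Check setroubleshootd's memory use and whether the machine is in swap.
      ps -eo pid,%mem,rss,comm | grep setroubleshoot
      free -m
      # If it is the culprit, stop it and keep it from starting at boot.
      service setroubleshoot stop
      chkconfig setroubleshoot off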