[Note: this report is rather lengthy and technical, but it is meant as a
fair, blunt summary of the month's happenings. The (ideal) target audience
is the Sysadmin staff, who can use it to understand what happened and take
corrective steps to proactively prevent problems in the future.]

     Server and Network Availability Report for June/1997


Except where noted, we have complete data for the entire month.
We are conveniently ignoring downtime for the swatch host,
which really wasn't that much ...

The first set of entries below gives availability/uptime for each
segment (see below for the machines monitored on each).
The second set of entries is machine availability ... but note that
downtime due to a network outage on the machine's segment is subtracted
to give a "true" machine availability (since all of the monitoring is
done from a machine in BLD on the .198 subnet). Note that "ping" is used
except where noted ... it basically tests that a machine is alive
and reachable ... but not higher-level services.
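
As a rough illustration of the arithmetic (a sketch only, not the actual
script; the function name and the sample minute counts are made up for
the example, and it assumes the network-outage minutes are a subset of
the minutes the host looked down), the correction might be computed like
this in Python:

```python
def availability_pct(total_minutes, host_down_minutes, network_down_minutes=0):
    """Percent uptime after discounting downtime caused by the network segment."""
    # Only count minutes the host looked down while its segment was reachable.
    true_down = max(host_down_minutes - network_down_minutes, 0)
    return round(100.0 * (total_minutes - true_down) / total_minutes, 2)


JUNE_MINUTES = 30 * 24 * 60  # 43200 one-minute samples in a 30-day month

# Hypothetical host: unreachable for 300 minutes, 250 of them segment-wide.
print(availability_pct(JUNE_MINUTES, 300))       # uncorrected figure
print(availability_pct(JUNE_MINUTES, 300, 250))  # "true" machine figure
```

For scale, 0.01% of a 30-day month is a little over four minutes, so
small differences in the last digit amount to only a few samples.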



                                        UPTIME%
Host/Location           Jun/97    May/97    Apr/97    Mar/97    Feb/97    Jan/97

XXX.YYY.AA.BBB           99.47     97.89     99.94     99.97     99.76     99.53    
XXX.YYY.AAA.BBB          99.91    100.00    100.00    100.00    100.00    100.00    
XXX.YYY.AAA.BBB          99.88    100.00    100.00    100.00    100.00    100.00    
DNS (AA.B/AA.B)        99.3/98.8   99.7/8   99.97/7   99.8/7    99.99/9   99.98/98.8 
BL1 (old subnet)                                                99.98     99.99     
BL1 (.AA subnet)         99.41     99.81     99.99
BL1 (.AAA subnet)        99.40     99.81     99.96     99.24*M1  99.30*F1
BL2 (.AA subnet)         99.33     99.87     99.98     99.89     99.91     99.98    
BL2 (.AAA subnet)              
BL3 (.AAA subnet)        99.21     99.79     99.51     99.36*M2  99.52     99.70    
BL4 (.AA subnet)         99.11     99.75     99.61     99.82     99.81     99.87    
BL5 (.AAA subnet)        99.07     99.63     99.42
BL5 (.AAA subnet)        99.07     99.63     99.42
BLD (.AA subnet)         99.45     97.89     99.89     99.96     99.76     
BLD (.AAA subnet)        99.91    100.00    100.00    100.00    100.00     
BLD (.AAA subnet)        99.84     99.99    100.00    100.00     99.15*F2  
REMOTE1                  99.34     99.79     99.97     99.83     99.82     99.95    
REMOTE2                  99.15*J1  99.71*M1  99.75     99.79     98.89*F3  99.88*J1 

REMOTE3                  99.24*J1  99.79*M1  98.95*A1
REMOTE4                  99.34*J1  99.79*M1  99.94     99.81     99.95*F3  99.96*J1
REMOTE5                  99.12     99.78     98.86     99.93*M3  99.94     99.91   

H1 (mountd)              99.87     99.89     99.96     99.92     99.94     99.98    
H1 (nfs)                 99.96     99.99    100.00     99.98     99.94     99.98    
H1 (ping)                99.96    100.00    100.00     99.98     99.94     99.98    
HOST22 (mountd)          99.87*J2  99.99*M2  99.90     99.96     99.46*F5  99.99    
HOST22 (nfs)             99.87*J2  99.99*M2  99.92     99.96     99.72*F5 100.00    
HOST22 (ping)            99.93*J2  99.99*M2  99.99     99.99     99.97*F5 100.00  
HOST1 (mountd)           99.96     99.02*M3  99.99*A2  99.90     99.97     97.78  
HOST1 (nfs)              99.96     99.02*M3  99.99*A2  99.90     99.97     97.78  
HOST1 (ping)             99.96     99.02*M3  99.99*A2  99.94     99.97     97.78  
HOST2 (mountd)           99.97     99.99*M4  99.99     99.97     99.98*F6  99.99  
HOST2 (nfs)              99.97     99.99*M4  99.99     99.98    100.00*F6  99.99  
HOST2 (ping)             99.97     99.99*M4  99.99     99.98    100.00*F6  99.97     
HOST3 (mountd)           99.87     99.99     99.99     99.94     99.64*F7  99.99*J2  
HOST3 (nfs)              99.87     99.99     99.99     99.94     99.64*F7  99.99*J2  
HOST3 (ping)             99.87     99.99     99.99     99.94    100.00*F7  99.99*J2  
HOST4 (ping)             99.71     99.68     99.70     99.68     99.11*F4  99.68*J1  
HOST123 (ping)           99.97     99.99     99.56    100.00     99.98     99.41     
HOST5 (ping)             99.89*J3  99.87*M5  99.90     99.89     99.98    100.00     


No correction has been made for network-related or scheduled/planned downtime on these machines.
HOST6                    99.19     95.06     99.93     99.07
HOST7                    99.46     99.90     99.97     98.43
HOST8                    99.41     99.81     99.91     98.41  
HOST9                    99.42     99.90     99.97     98.78


Application Availability (based on actual tests to see if it is up)
FCS                        *J4   see below   85        85.0      83.3*F8   84.3*J1  
PDM                                                               *F9      99.97*J3 

Footnotes are grouped by month, newest first: June (*J), May (*M),
April (*A), March (*M), February (*F), January (*J). The *J and *M labels
are reused, so read each marker against the month of its column.


*J1: Does not include power outage from 1519 6/21 to 1646 6/22
*J2: Does not include scheduled weekend outages to swap out disk arrays
*J3: Does not include weekend outage from 2114 6/27 to 0840 6/28
*J4: FCS monitoring turned off per request ...
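
To give a feel for the size of an excluded window, here is a small Python
sketch that measures the *J1 power outage against the month. The year and
the 30-day month length are taken from the report title; the variable
names are just for illustration.

```python
from datetime import datetime

# *J1 window: 1519 on 6/21 to 1646 on 6/22
start = datetime(1997, 6, 21, 15, 19)
end = datetime(1997, 6, 22, 16, 46)

minutes = int((end - start).total_seconds() // 60)
pct = round(100.0 * minutes / (30 * 24 * 60), 2)

print(minutes, "minutes, or", pct, "% of June")
```

An outage of that length would dominate any of the figures marked *J1,
which is why it is reported separately rather than folded in.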


*M1: Does not include scheduled network outage on 5/22 from 17:45 to midnight ...
*M2: Does not include scheduled downtime on Sat, May 10th, Sun, May 11th, and Sun, May 18th
*M3: Most of this was on Sunday mornings due to failure of HOST4 to reboot cleanly
*M4: Does not include scheduled Memorial weekend upgrade downtime
*M5: Does not include weekend outage from 1832 5/16 to 0748 5/19


*A1: 93% of the downtime was on Fri, April 4th from 0925 to 1618
               Monitoring was started at 1141 on April 3rd
*A2: Does not include HOST4 getting "stuck" from 0321 4/6 to 0043 4/7


*M1: Does not include scheduled machine room outage on 3/4/97 from 1711 to 1854
*M2: Does not include scheduled network outage on 3/12/97 from 1926 to 0016
*M3: Does not include extended outage on 3/30/97 from 1255 to 0921

*F1: BLDG1 subnet monitoring started Feb 7th at 1601
*F2: Almost all of this was on 2/18 from 0619 to 1200
*F3: Does not include Feb 15th weekend downtime due to planned power outage in REMOTE
*F4: Interestingly enough, the sdibm was NOT affected by the power outage!  ;-)
*F5: Automounter loopback on 2/28 caused HOST1 downtime - see comments below
*F6: Does not include scheduled downtime on Feb 6th between 0620 and 0638
*F7: First three days of month ignored ... machine was in transit and not coordinated
*F8: FCS-checker burped from 2250 Feb 9th to 2200 Feb 15th ... so data corrected for this dropout
*F9: PDM checker turned off at 1430 on Feb 25th ... tired of people changing the config files and not telling me.

*J1: Does not include weekend outage from 1438 1/11 to ~1800 1/12
*J2: Does not include being shut down from 1024 on 1/31
*J3: 99.97 (HOST1), 99.93 (HOST2), 99.46 (HOST3), 98.99 (HOST4)
     Misc. buffoonery caused pdm-checker to stop gathering data around noon on Jan 31st.


Router and server figures come from pinging a single point ... whereas
the various locations listed above are multiple machines that must all
show down at the same time to count as an outage. Pings are also sent to
machines behind routers, because sometimes the routers are pingable but
not forwarding traffic.
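
The "all down at the same time" rule can be sketched as a set
intersection. This is an illustrative Python sketch, not the real code;
the function name, the host names, and the use of plain integers as
minute stamps are all assumptions for the example.

```python
def location_down_minutes(down_by_host):
    """Minutes in which every monitored host at a location was down at once.

    down_by_host maps a host name to the minute stamps in which it failed.
    """
    if not down_by_host:
        return set()
    minute_sets = [set(minutes) for minutes in down_by_host.values()]
    # A minute counts against the location only if it is in every host's set.
    return set.intersection(*minute_sets)


# Hypothetical location with three monitored hosts.
sample = {
    "host-a": {1, 2, 3},
    "host-b": {2, 3, 4},
    "host-c": {3},
}
print(location_down_minutes(sample))  # only minute 3 is common to all hosts
```

A single flaky machine therefore never counts against its location; only
a minute that every host missed does.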

Servers: Generally good ... 

Routers: Generally good ... although some misc. wide-spread outages.



   ------- HOW THIS DATA IS ACTUALLY COLLECTED -------


It is very IMPORTANT to understand how this data was collected ...
and I'm somewhat reluctant to show figures to two decimal places above ...
but I believe it is a fairly accurate representation of how things went.

pingem runs "fping" ... which is a ping program that probes many hosts
in parallel ... against the above mentioned hosts (and others). fping
tries each host three times before deciding that it is down ... and only
then is a failure logged.
I.e. it is quite possible we may miss a temporary outage ...
but our "granularity" is one minute ... which is pretty fine.
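
The three-tries rule looks roughly like the Python sketch below. fping
does this internally, so this is only an illustration of the logic;
`probe` is a hypothetical callable standing in for a single ping attempt.

```python
def is_down(probe, host, tries=3):
    """Return True only if probe(host) fails on every one of `tries` attempts."""
    for _ in range(tries):
        if probe(host):
            return False  # any successful reply means the host is up
    return True  # every attempt failed, so a failure would be logged


def make_flaky_probe(failures_before_success):
    """Build a fake probe that fails a fixed number of times, then succeeds."""
    state = {"left": failures_before_success}

    def probe(host):
        if state["left"] > 0:
            state["left"] -= 1
            return False
        return True

    return probe


print(is_down(make_flaky_probe(2), "host-a"))  # False: third try succeeds
print(is_down(make_flaky_probe(3), "host-a"))  # True: all three tries fail
```

The retries keep a single lost packet from being logged as downtime; the
price is that a blip shorter than the polling interval can go unseen.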

For the Servers, we do an "rpcinfo" to ensure that each one is responding
to mountd and nfs requests ... following the same algorithm as stated above.
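
A check of this kind could be wrapped as in the Python sketch below. It
is not the actual pingem code. It assumes the standard `rpcinfo -u host
program` form (an RPC null call over UDP) and that rpcinfo exits non-zero
when the program does not answer; the function name, the 30-second
timeout, and the injectable `run` parameter are assumptions made so the
example can be exercised without a real server.

```python
import subprocess


def rpc_service_up(host, program, run=subprocess.run):
    """Probe an RPC program over UDP with `rpcinfo -u`; True if the call succeeds."""
    try:
        result = run(
            ["rpcinfo", "-u", host, program],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,  # assumed limit so a hung server cannot stall the check
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # rpcinfo missing or no answer in time: treat as down
    return result.returncode == 0


# Usage against a hypothetical server name:
# for service in ("mountd", "nfs"):
#     print(service, rpc_service_up("host1", service))
```

Treating a timeout the same as a refusal matches the spirit of the ping
test: either way the service was not usable at that moment.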

For the applications, an attempt is made to log in to and out of FCS,
and to run the pdmstatus command (for PDM). The former is attempted every
10 minutes, and the latter every minute.

All of this output (including per-minute timestamps) is logged ...
and on the 1st of the month these logs are rotated.
The raw data is in /home/mail1-local/home/swatch/local/logs/pingem/
I then run the logs through ~swatch/scripts/generate-stats, which
gives me output with stuff of "interest" ... which I then
scan manually to determine approximate minutes of downtime,
making some misc. arbitrary decisions along the way to get the stats.
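
The manual scan amounts to collapsing per-minute failures into outages
and adding up the minutes. The Python sketch below shows one way to do
that. It is not generate-stats itself, whose logic and log format are not
shown here; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def outages(down_stamps):
    """Collapse per-minute failure stamps into (start, end, minutes) intervals."""
    result = []
    for stamp in sorted(set(down_stamps)):
        if result and stamp - result[-1][1] == timedelta(minutes=1):
            start, _, count = result[-1]
            result[-1] = (start, stamp, count + 1)  # extend the current outage
        else:
            result.append((stamp, stamp, 1))  # gap found: start a new outage
    return result


# Hypothetical failure stamps for one host.
stamps = [
    datetime(1997, 6, 3, 10, 0),
    datetime(1997, 6, 3, 10, 1),
    datetime(1997, 6, 3, 10, 2),
    datetime(1997, 6, 3, 11, 30),
]
for start, end, minutes in outages(stamps):
    print(start, "to", end, "=", minutes, "minute(s)")
```

Grouping this way makes it easier to match an outage against a known
scheduled window before deciding whether to count it.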

We are only checking connectivity from the pingem host within BLD.
I.e. remote sites may/will experience more downtime because of
network problems we don't see here ... although we have a fairly
good handle on that by seeing if there is base-level connectivity.

You should not infer adequate network performance from the above;
these are very simple, low-level tests that basically see if the
machine is alive and a few services are running.