Network Data Collection at SLAC
Collect data via SNMP from:
- Bridges, routers, ethermeters, hubs and switches
- Data collected includes:
- # good packets, # kilobytes, pkt size distribution
- # errors (# of types of errors)
- # pkts dropped, discarded, buffer/controller overflows
- top-10 talkers & protocol distributions
Collect data via Ping (for response, pkt loss, connectivity) from:
- critical servers, router interfaces, ethermeters
- off-site collaborators’ nodes
Other Sources:
- Poll critical Unix network daemons & services (e.g. mail, WWW, name, font, NFS …)
- ARP caches
- appearance of new unregistered nodes
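As an illustration of the last two items, the sketch below compares ARP cache entries against a list of registered MAC addresses. The (ip, mac) entry format, the `registered_macs` set and the function name `scan_arp_entries` are assumptions made for this example and are not taken from the SLAC tools. It also flags IP addresses claimed by more than one MAC, which may indicate a duplicate IP address.

```python
from collections import defaultdict


def scan_arp_entries(entries, registered_macs):
    """Return (unregistered MACs, duplicate IPs) from (ip, mac) pairs.

    entries: iterable of (ip, mac) tuples, e.g. gathered from ARP caches.
    registered_macs: MAC addresses known to a (hypothetical) node registry.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in entries:
        macs_by_ip[ip].add(mac.lower())

    registered = {m.lower() for m in registered_macs}
    seen = set().union(*macs_by_ip.values())
    unregistered = sorted(seen - registered)

    # An IP address claimed by more than one MAC suggests a duplicate IP.
    duplicates = {
        ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1
    }
    return unregistered, duplicates
```

In practice the entries might be gathered from router ARP tables via SNMP, but how they are obtained is outside the scope of this sketch.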
Data Analysis at SLAC
Once a day (in the early morning), via batch jobs:
- The previous day’s data is analyzed and summarized into ASCII files (usually tabular) and graphs
- Long term graphs (fortnightly, monthly, 180 days) are updated
Ongoing analysis during the day consists of:
- Generating files of hourly graphs and other displays of data collected so far today.
- Bridge, router and ethermeter interface stats
- Top10 talkers and subnet protocol usage
Data Reduction at SLAC
Analysis generates thousands of reports, most of which are uninteresting
Reduction examines the analysis reports and extracts the exceptions, e.g.
- Duplicate IP addresses
- Appearance of new unregistered nodes
- Loss of connectivity
- Data values exceeding thresholds, e.g.
- CRC & alignment errors > 1 in 10000 packets
- total utilization on a subnet of > 10% for the day
- broadcast rate > 150 pkts/sec
- (shorts+collisions)/good_packets > 10%
- packet loss from onsite pings > 1% in a day
- bridge/router overflows and queue drops
- Creates exception reports (for display by WWW) with hypertext links to tables and plots with more information
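The threshold checks listed above can be sketched as a single function over one subnet's daily counters. This is a minimal illustration only: the dictionary keys and the function name `find_exceptions` are hypothetical, and the real reduction step works from the analysis reports rather than from a dictionary like this.

```python
def find_exceptions(stats):
    """Return the threshold violations for one subnet's daily counters.

    stats keys (all hypothetical names): good_packets, crc_align_errors,
    shorts, collisions, utilization_pct, broadcast_pps, ping_sent, ping_lost.
    """
    exceptions = []
    good = stats["good_packets"]
    if good > 0:
        # Error ratios are only meaningful when some traffic was seen.
        if stats["crc_align_errors"] / good > 1 / 10000:
            exceptions.append("CRC & alignment errors > 1 in 10000 packets")
        if (stats["shorts"] + stats["collisions"]) / good > 0.10:
            exceptions.append("(shorts+collisions)/good_packets > 10%")
    if stats["utilization_pct"] > 10:
        exceptions.append("total utilization > 10% for the day")
    if stats["broadcast_pps"] > 150:
        exceptions.append("broadcast rate > 150 pkts/sec")
    if stats["ping_sent"] > 0 and stats["ping_lost"] / stats["ping_sent"] > 0.01:
        exceptions.append("packet loss from onsite pings > 1% in a day")
    return exceptions
```

An empty list means the subnet produced no exceptions for the day, so nothing from it would need to appear in the exception report.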
Alert Notification
The daily WWW-visible exception reports are manually reviewed each working morning and used as input to the morning H. O. T. meeting
- 5-15 min open meeting of network ops & development, systems admins, help desk and other interested people
- covers: scheduled outages and installations, newly identified problems, outstanding/unresolved problems
In addition:
NMS maps show when a managed critical interface becomes inactive (goes red)
SNMP and ping-polling of critical interfaces results in:
- issuing of X-window pop-up windows
- phone pages being issued
- e-mail messages
Security intrusions result in:
- phone pages being issued by the pager system
Results
Service Level Expectations:
- Examples
- Ping response time for on-site network layer < 10 msec for 95% of samples
- Network reachability of critical nodes of >= 99%
- Sub-second response for trivial network services (name, font, network daemons (smtp, nfsrpc) …)
- 95% of trivial mail delivered on site in 10 minutes
- 95% of requests for SLAC WWW home page served in < 0.1 secs.
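Several of these expectations have the form "X% of samples below a limit". A minimal sketch of such a check follows; the function name `meets_expectation` and its parameters are assumptions for illustration, and the choice to treat an empty sample set as a failure is a design decision of this sketch rather than anything stated in the slides.

```python
def meets_expectation(samples, limit, required_fraction=0.95):
    """True if at least required_fraction of samples are strictly below limit."""
    if not samples:
        return False  # no data: cannot claim the expectation was met
    within = sum(1 for s in samples if s < limit)
    return within / len(samples) >= required_fraction
```

For the on-site ping expectation, for example, `samples` would be the day's round-trip times in milliseconds and `limit` would be 10.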