1. Minutes of the OpStats Working Group: March 1991 IETF


Chairpersons: Bernard Stockman of NORDUnet and Phill Gross of CNRI
Notetaker:    Dan Friedman of BBN Communications.



The OSWG (Operational Statistics Working Group) met for three sessions. 
The following report summarizes the proceedings. It is organized along the 
lines of "Accomplishments", "Issues" and "Process" rather than as a 
sequential narrative. At the request of the chairpersons, the minutes contain 
proposals to resolve some of the open issues: basically, a (concrete) cut at 
what we should do next.  

2. Summary of Accomplishments

Our main accomplishments were to agree upon objectives for the work and 
to take some steps towards realizing those objectives. The objectives are:

 - To define an architecture for providing Internet access to operational 
   statistics for any Regional network or for the NSFnet.

 - To classify the types of information that should be available.

 - To develop (or foster the development of) public domain software 
   providing this information. The aim here is to specify a baseline 
   capability that all the Regionals can support with minimal development 
   effort and minimal ongoing effort. (It is hoped that if they can do it 
   with minimal effort, they in fact will.)

Our progress in each of these areas is described next.

2.1.	Architecture

We selected a client/server architecture for providing Internet access to 
operational statistics, as shown in the figure.  

This architecture envisions that each NOC will have a server that provides 
locally collected information in a variety of forms (along the "raw <--> 
processed" continuum) for clients. High-level proposals for the client/server 
interaction and functionality for the "first release" of the software are 
discussed later in the minutes. 

2.2.	Classification of Opstats Information

We identified three classes of reports based upon prospective audiences. They 
are:

  - Monthly Reports (a.k.a. "Political Reports") aimed at Management.
  - Weekly Reports aimed at Engineering (i.e., planning)
  - Daily Reports aimed at Operations

2.3.	Development Plan

We decided that it was most important and easiest to address the 
management reports first, and therefore, we spent the most time focussing on 
them. We arrived at several key areas:

  - Offered Load (i.e. traffic at external interfaces)
  - Offered Load segmented by "Customer"
  - Offered Load segmented by protocol/application
  - Resource Utilization (Link/Router)
  - Availability

The first report came to be known as the "McDonald's Report" (N Billion 
Bytes/Packets Served). 

3.	Technical Issues

3.1.	Client/Server Interaction

The following was proposed for Client/Server Commands. (The initial 
proposal was put forth by Dan Long of NEARnet.)

Commands:
  - Login (with authentication)
  - Help -- Returns a description of the available data (names, a pointer
    to a map, gateways, interfaces, and variables) 
  - Format -- Defines retrieval format
  - Select/Retrieve -- Pose a query to server. (This generates a response 
    containing the data.) 
  - Exit.

Proposed Query Language:

  "SQL-like": SELECT <router interface> AND <variable> 
              FROM <startdate> TO <enddate> 
              AT <granularity> WITH <conditions-met>
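
For concreteness, here is a hypothetical query in this style. (The router, 
interface, and variable names are invented for illustration, not proposed.) 

    SELECT gw1.ethernet0 AND in_octets 
    FROM 910301 TO 910331 AT day 
    WITH utilization > 50 

This would ask the server for daily input octet counts on one external 
interface for the month of March, restricted to days on which utilization 
exceeded 50%.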

The authentication issue was considered important as some of the traffic 
information, i.e. who's talking how much to whom, will be sensitive.
We also felt that the "name/map" issue is important for the following 
reasons: It will be impossible to agree on a naming structure that is 
universally meaningful. Even if we could agree on such a convention, it will 
always be most convenient for the local network operators to maintain 
information using names that are meaningful to them. Therefore, the server 
should be permitted to deliver results using the internal names but must be 
able to provide file(s) that enable a person to figure out what the names mean. 

Notetaker's Proposal:
Maintain the following information in one or more files. Pointers to 
information are obtained by the Help command. 

Router names: 
Gives the name of the router as used in the statistics data. 
Gives a (human-supplied) description of the router's location, e.g. 
University XYZ, MegaBig International Corporate Headquarters, or some 
other information that enables an outsider to determine what role the 
router is playing in the network. This information embodies the 
knowledge contained in the network operators' heads. 

Net names: 
Provides the (internal) names of the networks attached to 
the routers' external interfaces. (Router names can be internal here since 
the router names file provides a mapping.) Gives associated IP 
addresses. 

Link names: 
An ASCII file containing backbone point-to-point links (using router names 
to specify endpoints). If the link also has an internal name that will be 
used when providing link information, give this name. Also gives 
linespeed. We still need to think of a way to specify a connection to a 
public data service. All data provided by the server is given using 
internal names.
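
As a sketch only, the files might look like the following (all names, 
addresses, and linespeeds are invented): 

    # Router names 
    gw1   "University XYZ, main campus" 
    gw2   "MegaBig International Corporate Headquarters" 

    # Net names (network, router, interface, IP address) 
    xyz-net   gw1   ethernet0   192.0.2.0 

    # Links (endpoints, internal name, linespeed) 
    gw1   gw2   link-12   1544kbps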

3.2.	Contents of Monthly Reports

We had three presentations on the Monthly Reports. (The groups were 
commended for their pioneering use of the 11PM-2AM time slot.) Members 
of the groups were 

 - Kannan Varadhan, Eric Carroll, Bill Norton, Vikas Aggarwal

 - Sue Hares, et al. (Sorry, that's all I have on the hardcopy.)

 - Charles Carvalho, Ross Veach, David O'Leary

The following is a synthesis of the presentations and attendant discussions:

3.2.1.	The McDonald's Report

The main issues here were: whether to provide packets or bytes or both and 
whether to provide input or output or both.

Notetaker's Opinion: 
I was convinced by the argument that, unless 
something is radically wrong with the network, differences between input 
and output should be "down in the noise", and the explanations for the 
differences will be too obscure for a management report. (If the network is 
really throwing away a large amount of traffic, we'll hear about it well 
before a management report has to be written.) So I vote for input only in the 
McDonald's Report.  More on bytes vs. packets later.

3.2.2.	Offered Load by Customer

There was agreement that this is useful. The main controversy was how 
customers should be identified in a publicly available report. 

Notetaker's Proposal: 
We present the cumulative distribution or density 
function of offered load vs. number of interfaces. That is: Sort the offered 
load (in decreasing order) by interface. Plot the function F(n), 
where F(n) is the percentage of total traffic offered to the top n 
interfaces, or the function f(n), where f(n) is the percentage of traffic 
offered by the n'th ranked interface. (An example appears toward the end of 
the minutes.)

I feel that the cumulative form is useful as an overview of how the traffic 
is distributed among users since it enables you to quickly pick off what 
fraction of the traffic comes from what number of "users." (It will be 
technically and politically difficult to resolve "user" below the level of 
"interface.") This graph will suggest more detailed explorations to people 
who have access to customer "names."
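
A minimal sketch of this computation, in Python (the per-interface byte 
counts are invented for illustration): 

    # Sketch: cumulative distribution F(n) and density f(n) of offered load.
    loads = {"if1": 9000000, "if2": 4000000, "if3": 2500000, "if4": 500000}

    ranked = sorted(loads.values(), reverse=True)  # decreasing offered load
    total = sum(ranked)

    f = [100.0 * x / total for x in ranked]  # f(n): share of n'th ranked interface
    F = []                                   # F(n): share of top n interfaces
    running = 0.0
    for share in f:
        running += share
        F.append(running)

    for n in range(1, len(ranked) + 1):
        print("rank %d: f(n) = %5.1f%%  F(n) = %5.1f%%" % (n, f[n-1], F[n-1]))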

3.2.3.	Offered Load by Protocol Type and Application

People seemed to agree that this is valuable and that pie charts are a good 
way to present the information (since there is no "natural" ordering for the 
elements of the X-axis, a.k.a. the "Category Axis" in spreadsheet lingo). "By 
protocol" means TCP, UDP, etc. "By application" means Telnet, FTP, SMTP, 
etc. It was also pointed out that it is potentially useful to do this both by 
packets and by bytes since the two profiles could be very different (e.g. FTP 
typically uses large packets, Telnet small packets, etc.) 

3.2.4.	Resource Utilization

Everyone agreed that the objectives of this report should be to provide some 
indication of whether the network has congestion and if/where it needs more 
capacity. There was considerable debate on exactly how often one would have 
to poll utilization to determine whether there is congestion, and also on 
exactly what summary statistics to present: averages, peaks, peak of peaks, 
peak of averages, averages of peaks, peaks of averages of peaks, and so on.

We seemed to focus more on link utilization than on router utilization, 
probably for two reasons: it is more difficult to standardize measures of 
router utilization, and link costs dominate router costs.

We kept looking for some underlying "physics" of networks to determine the 
collection interval. Here's one opinion. 

Notetaker's Opinion: 
It will be impractical to determine congestion solely 
from link utilization, since one would have to collect at a very small 
interval (certainly less than one minute). Therefore, we should instead 
estimate congestion by looking at dropped-packet statistics. 

We should use link utilization to capture information on network loading. 
The polling interval must be small enough to be significant with respect to 
variations in human activity, since this is the activity that drives 
variation in network loading. On the other hand, there is no need to make it 
smaller than an interval over which excessive delay would noticeably impact 
productivity. For example, people won't notice congestion if it only occurs 
for 10 seconds a day.

30 minutes is a good estimate for the time at which people remain in one 
activity and over which prolonged high delay will affect their productivity. 
To track 30 minute variations, we need to sample twice as frequently, i.e. 
every 15 minutes. 
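
To make the sampling argument concrete: with 15 minute samples, any 30 
minute episode of high load must fully contain at least one complete sample 
interval, so it cannot slip between samples.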

3.2.5.	Availability

We didn't have much time to get to this. There was discussion of presenting 
the information "By Customer" (e.g. Customers with Top N Total Outage 
Times) or just reporting the number of outages that last longer than a 
certain amount of time. 

Notetaker's Proposal:
We should omit Availability reports from the first deployment for several 
reasons. First, we didn't spend enough time to obtain consensus. Second, they 
can be politically sensitive. Third, outage data can be very tough to process. 
Think of trying to determine exactly how a network partition affects 
connectivity between different pairs of end users. It's an "N-Squared" 
problem. If we do want to address this, we should start with site, router, and 
external interface outages only, since these are O(N) problems.
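
To make the scale concrete: a network with, say, 500 external interfaces has 
on the order of 500*499/2, or roughly 125,000, end-to-end pairs whose 
connectivity a partition might affect, versus only 500 per-interface outage 
records.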

4.	Development Proposal

The following is a proposal for a "development/deployment" plan that tries 
to reach a reasonable compromise among functionality, burden on network 
operations resources, and "time to market." The discussion is segmented into 
three parts:

  - What information is to be available through the server
  - What the collection/storage requirements are
  - What presentation tools we should build
 
4.1.	Information Base

The Server piece should be the first thing we do; its goal is to provide 
access to data in a fairly raw form (to be described next). Presentation 
tools that use this as input can be developed in parallel if people want to, 
but we shouldn't put them on the critical path. 
We will have to provide the collection tools as well (unless every NOC is 
already collecting enough data to supply the information outlined below.) 
The capabilities of the "first release" are to support the

  - McDonald's Report
  - Offered Load by Interface Report
  - Offered Load by Application Report
  - Link Utilization Report
  - Congestion Report

The Availability Report is missing because it is hard to do and (based upon 
the level of discussion we had) seemed to be of lower priority.

In the first release, we provide a server and client that can deliver the 
following statistics for N specified days over a rolling three month interval:

  - Total Input Packets and Input Octets per day per external interface.
  - Total Input Packets and Octets across the network per day per 
    application.  (Note that this is NOT per interface.)
  - Mean, Standard Deviation, and Peak 15 minute utilization per day per 
    unidirectional link.
  - Peak discard percentages over fifteen minute intervals per link-direction 
    per day.

The Exchange Format between Server and Client should be ASCII-based 
because this enables people to quickly look at the data to see if it makes 
sense and because it enables quick, custom data reduction via AWK. (I have 
found both these capabilities to be useful in my own analyses of network data.)
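
As an illustration only (the field layout and names are invented, not a 
proposed format), a few records of a day's data might look like: 

    910315  interface    gw1.ethernet0  in_pkts 1843210  in_octets 412887310 
    910315  application  ftp            in_pkts 220110   in_octets 170221900 
    910315  link  gw1->gw2  avg_util 23.1  sd_util 9.7  peak_util 61.4  peak_drop 0.8 

One record per line with whitespace-separated fields is exactly the sort of 
layout that AWK reduces well. 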

The first Client that we write should simply retrieve the data in the 
exchange format and write it to disk. 

Rationale for this Base:

This information supports the reports described below and then some, so that 
presentation tools development will not be limited to these reports.
The three month collection interval is short enough to keep storage 
requirements under 5 Mbytes but long enough so that one can examine 
longer term trends by "dumping" the data a few times a year.  (These files 
should be highly compressible, easily 2:1, since they'll contain mainly ASCII 
numerals, repetitions of the names of entities, whitespace, colons, etc.) 
The ASCII-based format will enable us to develop interoperable tools more 
quickly. 

TBD:

  - The exact exchange format (no real opinion here other than that it be 
    ASCII-based).
  - The command structure. The proposed format seems to be an excellent 
    starting point.

4.2.	Collection/Storage Requirements:

Input bytes and packets per external interface must be collected frequently 
enough to prevent counter overflow. As they are collected, they can be added 
to running totals for the day. At the end of the day, the daily totals for 
each external interface are stored.

Input bytes and packets per application over all interfaces must also be 
collected frequently enough to prevent overflow. At the end of the day these 
can be aggregated into daily totals. (I guess you have to collect these per 
external interface, but they can be aggregated into a network-wide total as 
the day goes on.)

Per link interface per 15 minutes: bytes sent, packets sent, packets received.
(To get the drop rate, you have to correlate sent and received at the two ends 
of the link.) At the end of the day, store away the average utilization, the 
standard deviation, the peak utilization, and the peak drop percentage.
Assuming 10 octets per item for storage, I estimate that the necessary 3 month 
history can be maintained with <5 Mbytes for a network with 100 routers, 500 
external interfaces, and 200 links.
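
A back-of-the-envelope check of this estimate, in Python. The network 
dimensions come from the text above; the number of tracked applications (10) 
is my own assumption: 

    # Sketch: verify the 3-month storage estimate.
    DAYS = 92               # rolling three-month window
    OCTETS_PER_ITEM = 10

    ext_interfaces = 500
    links = 200
    applications = 10       # assumption: applications tracked network-wide

    items_per_day = (
        ext_interfaces * 2  # input packets + input octets per interface
        + applications * 2  # network-wide packets + octets per application
        + links * 2 * 4     # per link-direction: mean, sd, peak util, peak drop
    )

    total_bytes = items_per_day * OCTETS_PER_ITEM * DAYS
    print(items_per_day, "items/day,", total_bytes / 1e6, "Mbytes")
    # -> 2620 items/day, about 2.4 Mbytes: comfortably under 5 Mbytes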

4.3.	Reports/Presentation Tools:

My hunch is that standardization of presentation tools will come about based 
on who does the work first. (It's hard to argue with decent code that's in 
place: to wit, the entire TCP/IP phenomenon.) Here are some suggestions 
(and the reasoning) for what we should do first.

4.3.1.	McDonald's Report:

For an N day period, graph Total Input Bytes per day. Put the average packet 
length as a "note" on the graph. 

Reason: 
Bytes are a better measure of the "useful" load carried by the network, 
i.e. the information sent around by the applications; packets are really an 
artifact of the way we do things. As a network manager, I would be interested 
in the end-user volume of information. With the average packet length given, 
one can convert to packet volumes if need be.
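
For instance (with invented numbers), 4.2 billion input bytes at an average 
packet length of 300 bytes corresponds to about 14 million packets.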

For the same reason, I suggest that the next two reports be done in bytes as 
well. Note that the suggested initial information base will support 
comparable presentations by packets as well.

4.3.2.	Offered Load by Customer Report:

Based on total input bytes for an N day period: Graph the distribution (or 
density function) of total input bytes vs. external interfaces. The external 
interfaces should be put in decreasing order of offered load (in bytes). 
4.3.3.	Offered Load by Application Report

Based upon total input bytes for the N day period, present a pie chart of the 
distribution by application. 

4.3.4.	Link Utilization

The objective here is to provide some information on the utilization of the 
total set of links and on the "worst" link.
The input "data" we have to work with comprises two matrices:

 A(i,j) = average utilization of link i on day j
 P(i,j) = peak (15 minute) utilization of link i on day j.

Define TAVG(A(i)) = time average of A(i,j) (i.e. sum-over-j(A(i,j))/#days).
Define TAVG(P(i)) = time average of P(i,j) (i.e. sum-over-j(P(i,j))/#days).

I suggest that we order links by the TAVG(P(i)) measure, i.e. the "worst" 
link is the one that has the highest average peak utilization over the 
period. Graph the following:

  - A histogram of the collection of A(i,j) values, using 10% buckets on the 
    X-axis, i.e. plot the function F(n) where F(n) = percentage of A(i,j) 
    entries in the (n-1)*10% to n*10% range.

  - A comparable histogram of the P(i,j) values.

Histograms are useful for summarizing the data over all links over the entire 
period and can suggest further explorations. 
For the "worst" link (as defined above), plot, as a function of the day, its 
average utilization for the day and its peak utilization for the day. (Note 
that the data that we collect supports exploration of these time series for 
any link.) 

Note that the proposed initial information base will support such analyses for 
any subset of the links.
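
A minimal sketch of these computations, in Python; the A and P matrices hold 
invented utilization percentages for two links over a three day period: 

    # Sketch: link-utilization summaries from the A and P matrices above.
    A = [[12.0, 15.5, 9.8], [45.2, 51.0, 48.3]]   # A(i,j): average utilization
    P = [[30.1, 41.0, 22.5], [88.0, 92.5, 90.1]]  # P(i,j): peak utilization

    def tavg(row):
        # Time average over the period, e.g. TAVG(P(i)) = sum-over-j(P(i,j))/#days.
        return sum(row) / len(row)

    # The "worst" link is the one with the highest average peak utilization.
    worst = max(range(len(P)), key=lambda i: tavg(P[i]))
    print("worst link:", worst, "TAVG(P) =", tavg(P[worst]))

    # Histogram of the A(i,j) values in 10% buckets: F(n) = percentage of
    # entries in the (n-1)*10% to n*10% range.
    values = [a for row in A for a in row]
    buckets = [0] * 10
    for v in values:
        buckets[min(int(v // 10), 9)] += 1
    for n in range(1, 11):
        pct = 100.0 * buckets[n - 1] / len(values)
        print("%3d-%3d%%: %5.1f%%" % ((n - 1) * 10, n * 10, pct))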

4.3.5.	Congestion

The available data, as specified in Section 4.1, is 

  D(i,j) = peak drop rate (during any fifteen minute interval) for link i on 
  day j.

Plot a histogram of the D(i,j) values. For the "worst" link (as defined 
above), say link I, plot D(I,j) as a function of j.

5.	Presentations

In addition to the groups on the monthly reports, we had presentations from 
Bill Norton of Merit and Chris Meyers of Wash. U.

Chris proposed an exchange format. I'm guessing that the document is 
available on-line if you wish to review it. 

Bill discussed Merit's OpStats activities for NSFnet. He focussed on their 
presentation tools as well as the way that they internally organize the data 
(a tree structure of Unix files). One important point made during this 
discussion is that relational databases are not good for storing OpStats. 
(Performance is the issue.) This is unfortunate since many commercial DBMSs 
are relational in nature, and therefore, we cannot leverage their (usually 
substantial) report facilities. The idea of a "client/server" model grew out 
of Bill's presentation.

6.	Notable and Quotable

We had some discussion of how Network Managers use Management 
Reports and, therefore, what the reports need to present. One significant 
observation was that "Political Graphs don't have to make sense."
During Sue Hares' presentation of her group's work on the monthly reports, 
the KISS acronym was re-interpreted as Keep It Simple Sue.

Participants (who signed the list):
	
	Vikas Aggarwal <vikas@jvnc.net>	
	Bill Barns <barns@gateway.mitre.org>	
	Eric Carroll <eric@utcs.utoronto.edu>	
	Charles Carvalho <charles@salt.acc.com>
	Bob Collet /PN=Robert.D.Collet/O=US.SPRINT/ADMD=TELEMAIL/C=US/@SPRINT.COM
	Dale Finkelson <dmf@westie.unl.edu>
	Dan Friedman <danfriedman@bbn.com>
	Demi Getschko <demi@fpsp.fapesp.br>
	Dave Geurs <dgeurs@mot.com>
	Fred Gray <fred@homer.msfc.nasa.gov>
	Phill Gross <pgross@nri.reston.va.us>
	Olafur Gudmundsson <ogud@cs.umd.edu>
	Steven Hunter <hunter@es.net>
	Dale S Johnson <dsj@merit.edu>
	Dan Jordt <danj@nwnet.net>
	Tracy LaQuey Parker <tracy@utexas.edu>
	Nik Langrind <nik@shiva.com>
	Walt Lazear <lazear@gateway.mitre.org>
	Dave O'Leary <oleary@sura.net>
	Dan Long <long@nic.near.net>
	Garry Malkin <gmalkin@ftp.com>
	Lynn Monsanto <monsanto@sun.com>
	Don Morris <morris@ucar.edu>
	Bill Norton <wbn@merit.edu>
	Rehmi Post <rehmi@ftp.com>
	Joel Replogle <replogle@ncsa.uiuc.edu>
	Robert J. Reschly Jr. <reschly@brl.mil>
	Ron Roberts <roberts@jessica.stanford.edu>
	Manoel A Rodriques <manoel.rodrigues@att.com>
	Jim Sheridan <jsheridan@ibm.com>
	Brad Solomon <bsolomon@hobbes.msfc.nasa.gov>
	Osmund deSouza <desouza@osdpc.ho.atcom> ???
	Mike Spengler <mks@msc.edu>
	Bob Stewart <rlstewart@eng.xyplex.com>
	Roxanne Streeter <streeter@nsipo.nasa.gov>
	Kannan Varadhan <kannan@oal.net>
	Ross Veach <???????>
	Sue Wang <swang@ibm.com>
	Carol Ward <cward@spot.colorado.edu>
	Cathy Withbrodt <cjw@nersc.gov> ?????
	Wing Wong <ww14706@malta.sbi.com>