Reported by Claudio Topolcic/CNRI and Bernhard Stockman/NORDUnet

Minutes of the Operational Statistics Working Group (OPSTAT)

Monday's Session

The purposes of this meeting were:

  1. Review the current status of the OPSTATS activities

       o Bernhard's papers
       o Other related efforts, specifically, Susan Estrada's BOF

  2. Decide what can be progressed now and progress it

       o Model
       o Set of metrics (simple SNMP only)
       o Display formats
       o Simple collection, storage, and exchange

  3. Define what is still left to do

       o MIB for new SNMP variables
       o Exchange protocol
       o More sophisticated storage formats
       o Develop publicly available collection tools
       o Display formats for weekly and instantaneous reports

  4. Specific actions to be taken in this meeting were:

       o Decide polling period
       o Agree on what to progress
       o Edit Bernhard's papers, review on Thursday, submit as Internet

The model was presented for people who were new to the group.  A
fundamental part of this model is the agreement on a common minimal set
of metrics that will be collected.  It was noted that some of these may
be difficult to obtain.

It had been proposed that three report formats would be produced: a
monthly report, a weekly report, and an instantaneous display.  A format
for the monthly report had been agreed to.  It was described as a
``Macdonalds'' report because it would contain only total aggregates.
It was felt that this report would support management activities,
whereas the weekly report would support engineering planning, and the
instantaneous display would support problem resolution.  However, it was
realized that the real distinction was not the time frame but the degree
of aggregation of the data.  The data in the management reports would be
more aggregated than that in the engineering reports, regardless of the
time they covered.

Bernhard's documents described the data that would be collected from
each router, both for each of the router's interfaces and for the router
itself.  These are all MIB variables.  It was at first assumed that the
per-interface variables were specific to IP, but it was pointed out that
the loading data needs to be total, not IP specific, or the link loading
could not be determined.  It was also pointed out that the MIB interface
variables are multi-protocol anyway, so there is no problem.  However,
it was also pointed out that if the router variables are IP only, then
they do not give a measure of the router's loading.

It was noted that the loading information that is important is not
related to any interface, but to the links.  Links are occasionally
rehomed when interfaces fail.  Currently, the data is processed by hand
to compensate for such rehoming.  The documents do not make this
distinction and need to be clarified.

Dropping the ``storage requirements'' section of Bernhard's document was
considered, but it was decided to keep it in, since dropping it would
give the misimpression that the group hadn't thought about the problem.

It had been proposed that the client-server model not be covered in the
current documents.  The reason, in part, was that the original purpose
of the Working Group was to get the various network operators to produce
consistent reports that could be compared, not to exchange information,
and that exchanging information is not required very often.

The data storage format was discussed.  The format impacts what will be
stored and what can be done with it.  To reduce storage requirements,
several people proposed that raw data could be kept for some period of
time, and then aggregated somewhat and kept for some other period of
time, and then further aggregated.  The proposals differed in the time
periods and the form of aggregation.  However, it was pointed out that
engineering requirements tend to be common, so common non-aggregated
data will be useful, whereas management requirements tend to differ, so
common aggregated data is not useful.  In the end, it was realized that
how much data is retained, and for how long, are local decisions that
cannot be standardized.

The data format should support the process that the data will undergo.
The process was identified as:

  1. Collect status data about routers and interfaces.

  2. Collect ``resource'' data, for example, about the mapping of links
     to interfaces.

  3. Process the data to merge 1 and 2, decreasing the quantity of data
     but without loss of information.


  4. Produce reports from the above reduced data.

It was understood that the processing in step 3 would not lead to
sufficient reduction in quantity to address long term data storage
problems.  However, it was felt that this processing should not be
combined with the report generation.
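
The merge in step 3 can be illustrated with a minimal sketch.  The
sample layout, the mapping layout, and the names (samples, mapping,
link_for, merge) are all hypothetical; the minutes do not define them.
The sketch only shows how per-interface samples might be re-keyed by
link so that a rehomed link keeps one continuous series.

```python
from collections import defaultdict

# Hypothetical status samples: (timestamp, router, interface, octets)
samples = [
    (0, "r1", "if0", 1000),
    (60, "r1", "if0", 1200),
    (120, "r1", "if1", 900),   # the link was rehomed from if0 to if1
]

# Hypothetical resource data: (router, interface, valid_from, link)
mapping = [
    ("r1", "if0", 0, "link-A"),
    ("r1", "if1", 120, "link-A"),
]

def link_for(router, interface, timestamp):
    """Return the link homed on this interface at the given time, or None."""
    best = None
    for m_router, m_if, valid_from, link in mapping:
        if (m_router, m_if) == (router, interface) and valid_from <= timestamp:
            if best is None or valid_from > best[0]:
                best = (valid_from, link)
    return best[1] if best else None

def merge(samples):
    """Re-key interface samples by link; no sample values are altered."""
    per_link = defaultdict(list)
    for ts, router, interface, octets in samples:
        link = link_for(router, interface, ts)
        if link is not None:
            per_link[link].append((ts, octets))
    return dict(per_link)

merged = merge(samples)
```

In this example all three samples end up under the same link even
though they were collected from two different interfaces.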

Bernhard proposed a raw data format, which was discussed.  He will
incorporate suggestions into his document.

It was suggested that the monthly reports be based on a matrix that
identified all the variables that would be collected and processing
functions that could be applied to them.  This would not only clearly
delimit the scope of the report generation process, but would also allow
new variables to be added easily.  However, this approach would not
support functions that are based on multiple variables, and although the
matrix could be relatively full, any network operator might select only
a few possibilities, and worse, the different operators might select
different sets.

It was felt that the Working Group should recommend a specific polling
period.  Two were on the table: 5 minutes and 15 minutes.  Concern was
expressed that 5 minutes or less might result in excessive overhead or
be impossible to implement with a poller that polls one router at a
time.  For variables describing link loading, such as bytes transmitted,
the polling period is a function of the line speed.  A one minute
polling period will miss the interesting peaks of a T1 line, but will
show the individual packets on a 1200 baud line.  For variables not
describing link loading, such as packets dropped, the polling interval
can generally be very long, until the value changes, at which time the
polling period should be shortened to help identify the problem.  So it
may be that a 15 minute polling period is sufficient for anything other
than link utilization.  This discussion was deferred until the next
meeting on Thursday.
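
The idea of polling a non-loading variable slowly until its value
changes can be sketched as follows.  The two interval constants and the
function name are assumptions chosen for illustration; the group did not
specify them.

```python
LONG_INTERVAL = 15 * 60   # seconds; assumed "very long" polling interval
SHORT_INTERVAL = 60       # seconds; assumed shortened polling interval

def next_interval(previous_value, current_value):
    """Pick the next polling interval for a counter such as packets dropped.

    While the counter is unchanged the long interval is kept; as soon as
    it changes, polling is tightened to help identify the problem.
    """
    if current_value != previous_value:
        return SHORT_INTERVAL
    return LONG_INTERVAL
```

How long to stay at the short interval before relaxing again is left
open here, as it was in the discussion.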

Geoff Huston suggested a different approach.  He proposed that the link
utilization parameter that is most closely correlated to the clients'
dissatisfaction is the mean standard deviation of inter-packet arrival
times of evenly spaced (when transmitted) TCP packets.  He suggested
that this parameter explodes as soon as congestion appears.
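
A rough sketch of the quantity behind this proposal is given below.  It
computes the standard deviation of inter-packet arrival times for one
series of arrival timestamps.  How the timestamps are obtained, and how
results from several flows would be averaged into a mean, are not
described in the minutes and are left out; the function name is
hypothetical.

```python
from statistics import pstdev

def interarrival_stdev(arrival_times):
    """Standard deviation of the gaps between consecutive arrivals.

    arrival_times is a sorted list of timestamps in seconds.  Packets
    that arrive as evenly spaced as they were sent give a value of 0.
    """
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    if not gaps:
        return 0.0
    return pstdev(gaps)
```

Evenly spaced arrivals such as [0, 1, 2, 3] give zero, while any
bunching of packets raises the value.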

Thursday's Session

During the second OPSTAT session the storage format and the polling
periods were discussed in more detail.

The Storage Format

It was suggested that the header section be placed within the log-file.
However, it might be useful to support both separate and in-band
headers.  The need for multiple header sections within one log-file was
expressed.  When closing and reopening the same log-file, close and
start times need to be specified.  When changing the log source, a new
device needs to be specified.  Three delimiter pairs were suggested:


There are currently two storage formats: the version presented by
Bernhard Stockman and an earlier version produced by Chris Myers.
Chris Myers volunteered to produce a second version of his storage
format strawman.

The generic log data format is:

  timestamp, tag, delta_sample_interval, data1, data2, data3, ..., dataN

where the tag defines the logged variables.
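
A minimal sketch of reading one line in this generic format follows.
The encoding of the timestamp and of the data fields is not specified
here, so the sketch keeps every field as a string.  The names LogEntry
and parse_log_line are hypothetical.

```python
from typing import NamedTuple

class LogEntry(NamedTuple):
    timestamp: str
    tag: str
    delta_sample_interval: str
    data: list

def parse_log_line(line):
    """Split a comma-separated log line into its generic fields."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 4:
        raise ValueError("expected timestamp, tag, interval, and data")
    timestamp, tag, interval = fields[:3]
    # Everything after the third field is data1 .. dataN.
    return LogEntry(timestamp, tag, interval, fields[3:])
```

The meaning of each data field would be looked up from the tag, which
defines the logged variables.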

The Polling Period

The reason for polling is to obtain statistics that serve as a base for
trend analysis and capacity planning.  From the operational data it
shall be possible to derive engineering and management data.

A polling period of 15 minutes will not be sufficient to detect
variations in peak behavior.  It was suggested that a period of at most
1 minute would be needed.  Using such a tight polling period will create
a need to aggregate stored data.  Aggregation here means that, over a
period with logged entries, a new aggregated entry is computed from the
first and last of the previously logged entries in that aggregation
period.
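
As an illustration of aggregating from the first and last logged
entries, the sketch below assumes the logged value is a counter that
only increases and that timestamps are in seconds.  Both are
assumptions, as is the function name.

```python
def aggregate_counter(entries, period):
    """Collapse sorted (timestamp, counter) entries into one per period.

    Each result is (period_start, elapsed_seconds, counter_delta), taken
    from the first and last entry that fall inside the period.
    """
    buckets = {}
    for ts, counter in entries:
        start = ts - ts % period
        # Keep the first entry seen for the period, replace the last.
        first, _ = buckets.get(start, ((ts, counter), None))
        buckets[start] = (first, (ts, counter))
    result = []
    for start in sorted(buckets):
        (t0, c0), (t1, c1) = buckets[start]
        result.append((start, t1 - t0, c1 - c0))
    return result
```

Traffic counted between the last entry of one period and the first
entry of the next is not attributed to either period in this sketch.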

A method of displaying both average and peak-behaviors in the same
bar-diagram is to compute both the average value over some period and
the peak value during the same period.  The average and peak values are
then displayed in the same bar.

A problem here is how to aggregate peak values.  The new peak value
could be the peak of all the peaks, the average of all the peaks, etc.
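
The two candidate ways of aggregating peaks can be compared with a short
sketch.  The function names and the sample values are hypothetical.

```python
def summarize(samples):
    """Average and peak of the samples in one period (one bar)."""
    return sum(samples) / len(samples), max(samples)

def combine_peaks(peaks):
    """Two candidate ways to aggregate per-period peaks."""
    return {
        "peak_of_peaks": max(peaks),
        "average_of_peaks": sum(peaks) / len(peaks),
    }

# Three periods of utilization samples, each yielding one (average, peak).
bars = [summarize(p) for p in ([10, 40, 20], [30, 90, 60], [50, 20, 5])]
combined = combine_peaks([peak for _, peak in bars])
```

The peak of peaks preserves the single worst sample, while the average
of peaks smooths it away, which is the trade-off under discussion.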


Another reason for aggregation is that the polling period needed
differs depending on the reason for and source of the polling.

What is foreseen is that over a relatively short period, polled data
will be logged at the tightest polling period (1 minute).  Regularly,
these data will be pre-processed into the actual files being stored.
The pre-processing may include steps such as the computation of the
percentage of samples above a certain limit, the average of all samples
during the aggregation period, and cumulative histograms.  This
pre-processing will then not only serve to compact storage but also
provide some initial statistical processing.
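
The pre-processing steps named above might look like the following
sketch.  The limit, the histogram bin edges, and the function name are
assumptions for illustration only.

```python
def preprocess(samples, limit, bin_edges):
    """Reduce the samples of one aggregation period to summary values.

    Returns the percentage of samples above the limit, the average of
    all samples, and a cumulative histogram giving, for each bin edge,
    the number of samples less than or equal to that edge.
    """
    count = len(samples)
    above = sum(1 for s in samples if s > limit)
    return {
        "percent_above": 100.0 * above / count,
        "average": sum(samples) / count,
        "cumulative": [sum(1 for s in samples if s <= edge)
                       for edge in bin_edges],
    }

summary = preprocess([10, 20, 30, 40], limit=25, bin_edges=[15, 35, 45])
```

The summary takes the place of the individual samples in storage, which
is where the compaction comes from.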

Recommendation on polling period:

    Basic polling period    1 minute (60 seconds).

Recommendation on aggregation periods:

Over a

    24 hour period        aggregate to 15 minutes,
    1 month period        aggregate to 1 hour,
    1 year period         aggregate to 1 day

Aggregation is the computation of new average and maximum values for the
aggregation period based on the previous aggregation period data.
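
A minimal sketch of this step, for example from four 15 minute entries
to one 1 hour entry, is shown below.  It assumes each entry is an
(average, maximum) pair and that all entries cover periods of equal
length, so that averaging the averages is valid.  The function name is
hypothetical.

```python
def aggregate_tier(entries, group_size):
    """Combine consecutive (average, maximum) entries into coarser ones."""
    result = []
    for i in range(0, len(entries), group_size):
        group = entries[i:i + group_size]
        average = sum(avg for avg, _ in group) / len(group)
        maximum = max(mx for _, mx in group)
        result.append((average, maximum))
    return result

quarter_hours = [(10, 20), (20, 50), (30, 35), (40, 45)]
hourly = aggregate_tier(quarter_hours, 4)
```

The same function could be applied again with a group size of 24 to go
from 1 hour entries to 1 day entries.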

Recommendation for saving periods of logged and aggregated data:

    15 minute aggregation period      saved 1 week.
    1 hour aggregation period         saved 1 month.
    1 day aggregation period          saved 1 year.

Finally it was decided that, as the current document will not contain
the protocol specification of the client-server model, it will be
sufficient to put the coming RFC into the informational track.


