1. Minutes of the OpStats Working Group: March 1991 IETF

Chairpersons: Bernard Stockman of NORDUnet and Phill Gross of CNRI
Notetaker: Dan Friedman of BBN Communications

The OSWG (Operational Statistics Working Group) met for three sessions. The following report summarizes the proceedings. It is organized along the lines of "Accomplishments", "Issues" and "Process" rather than as a sequential narrative. At the request of the chairpersons, the minutes contain proposals to resolve some of the open issues: basically, a (concrete) cut at what we should do next.

2. Summary of Accomplishments

Our main accomplishments were to agree upon objectives for the work and to take some steps towards realizing those objectives. The objectives are:

- To define an architecture for providing Internet access to operational statistics for any Regional or the NSFnet.
- To classify the types of information that should be available.
- To develop (or foster the development of) public domain software providing this information. The aim here is to specify a baseline capability that all the Regionals can support with minimal development effort and minimal ongoing effort. (It is hoped that if they can do it with minimal effort, they in fact will.)

Our progress in each of these areas is described next.

2.1. Architecture

We selected a client/server architecture for providing Internet access to operational statistics, as shown in the figure. This architecture envisions that each NOC will have a server that provides locally collected information in a variety of forms (along the "raw <--> processed" continuum) for clients. High level proposals for the client/server interaction and functionality for the "first release" of the software are discussed later in the minutes.

2.2. Classification of Opstats Information

We identified three classes of reports based upon prospective audiences. They are:

- Monthly Reports (a.k.a. "Political Reports") aimed at Management
- Weekly Reports aimed at Engineering (i.e. planning)
- Daily Reports aimed at Operations

2.3. Development Plan

We decided that it was most important and easiest to address the management reports first, and therefore we spent the most time focussing on them. We arrived at several key areas:

- Offered Load (i.e. traffic at external interfaces)
- Offered Load segmented by "Customer"
- Offered Load segmented by protocol/application
- Resource Utilization (Link/Router)
- Availability

The first report came to be known as the "McDonald's Report" (N Billion Bytes/Packets Served).

3. Technical Issues

3.1. Client/Server Interaction

The following was proposed for Client/Server Commands. (The initial proposal was put forth by Dan Long of NEARnet.)

Commands:

- Login (with authentication)
- Help -- Returns a description of the available data (names, a pointer to a map, gateways, interfaces, and variables)
- Format -- Defines retrieval format
- Select/Retrieve -- Pose a query to server. (This generates a response containing the data.)
- Exit

Proposed Query Language: "SQL-like":

    SELECT <router interface> AND <variable>
    FROM <startdate> TO <enddate>
    AT <granularity>
    WITH <conditions-met>

The authentication issue was considered important because some of the traffic information, i.e. who's talking how much to whom, will be sensitive.

We also felt that the "name/map" issue is important for the following reasons: It will be impossible to agree on a naming structure that is universally meaningful. Even if we could agree on such a convention, it will always be most convenient for the local network operators to maintain information using names that are meaningful to them. Therefore, the server should be permitted to deliver results using the internal names but must be able to provide file(s) that enable a person to figure out what the names mean.

Notetaker's Proposal: Maintain the following information in one or more files. Pointers to information are obtained by the Help command.
a) Router names: Gives the name of the router as used in the statistics data. Gives a (human-supplied) description of the router's location, e.g. University XYZ, MegaBig International Corporate Headquarters, or some other information that enables an outsider to determine what role the router is playing in the network. This information embodies the knowledge contained in the network operators' heads.

b) Net Names: Provides the (internal) names of the networks attached to the routers' external interfaces. (Router names can be internal here since the information in a) provides a mapping.) Gives associated IP addresses.

c) ASCII file containing backbone point-to-point links (using router names to specify endpoints). If the link also has an internal name that will be used when providing link information, give this name. Also gives linespeed. Need to think of a way to specify a connection to a public data service.

All data provided by the server is given using internal names.

3.2. Contents of Monthly Reports

We had three presentations on the Monthly Reports. (The groups were commended for their pioneering use of the 11PM-2AM time slot.) Members of the groups were:

- Kannan Varadh? (Photocopy blurred here), Eric Carroll, Bill Norton, Vikas Aggarwal
- Sue Hares, et al. (Sorry, that's all I have on the hardcopy.)
- Charles Carvalho, Ross Veach, David O'Leary

The following is a synthesis of the presentations and attendant discussions:

3.2.1. The McDonald's Report

The main issues here were whether to provide packets or bytes or both, and whether to provide input or output or both.

Notetaker's Opinion: I was convinced by the argument that, unless something is radically wrong with the network, differences between input and output should be "down in the noise", and the explanations for the differences will be too obscure for a management report. (If the network is really throwing away a large amount of traffic, we'll hear about it well before a management report has to be written.)
So I vote for input only in the McDonald's Report. More on bytes vs. packets later.

3.2.2. Offered Load by Customer

There was agreement that this is useful. The main controversy was how customers should be identified in a publicly available report.

Notetaker's Proposal: We present the cumulative distribution or density function of offered load vs. number of interfaces. That is: Sort the offered load (in decreasing order) by interface. Plot the function F(n), where F(n) is the percentage of total traffic offered by the top n interfaces, or the function f(n), where f(n) is the percentage of traffic offered by the n'th ranked interface. (An example appears toward the end of the minutes.)

I feel that the cumulative is useful as an overview of how the traffic is distributed among users since it enables you to quickly pick off what fraction of the traffic comes from what number of "users." (It will be technically and politically difficult to resolve "user" below the level of "interface.") This graph will suggest more detailed explorations to people who have access to customer "names."

3.2.3. Offered Load by Protocol Type and Application

People seemed to agree that this is valuable and that pie charts are a good way to present the information (since there is no "natural" ordering for the elements of the X-axis, a.k.a. "Category Axis" in spreadsheet lingo). "By protocol" means TCP, UDP etc. "By application" means Telnet, FTP, SMTP etc. It was also pointed out that it is potentially useful to do this both by packets and by bytes since the two profiles could be very different (e.g. FTP typically uses large packets, Telnet small packets etc.).

3.2.4. Resource Utilization

Everyone agreed that the objectives of this report should be to provide some indication of whether the network has congestion and if/where it needs more capacity.
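As an aside on the ranking proposed in section 3.2.2: the density f(n) and cumulative F(n) are simple to compute once per-interface input octets are in hand. The following is a minimal sketch only. The function name, the dictionary input, and the sample interface names are illustrative assumptions and are not part of the minutes.

```python
def load_distribution(octets_by_interface):
    """Return (density, cumulative) percentage lists, ranked by offered load.

    octets_by_interface is a hypothetical mapping of internal interface
    name -> total input octets for the reporting period.
    density[n-1]    is f(n): percent of traffic from the n'th ranked interface.
    cumulative[n-1] is F(n): percent of traffic from the top n interfaces.
    """
    loads = sorted(octets_by_interface.values(), reverse=True)
    total = sum(loads)
    if total == 0:
        return [], []
    density = [100.0 * load / total for load in loads]
    cumulative = []
    running = 0
    for load in loads:
        running += load
        cumulative.append(100.0 * running / total)
    return density, cumulative


# Illustrative data only; the interface names are made up.
density, cumulative = load_distribution({"if-a": 50, "if-b": 30, "if-c": 20})
```

Because only ranks appear on the X-axis, a plot of either list does not reveal customer names.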
There was considerable debate on exactly how often one would have to poll utilization to determine whether there is congestion and also on exactly what summary statistics to present: averages, peaks, peak of peaks, peak of averages, averages of peaks, peaks of averages of peaks.....

We seemed to focus more on link utilization than on router utilization, probably for two reasons. It is more difficult to standardize measures of router utilization, and link costs dominate router costs.

We kept looking for some underlying "physics" of networks to determine the collection interval. Here's one opinion.

Notetaker's Opinion: It will be impractical to determine congestion solely from link utilization, since one would have to collect at a very small interval (certainly less than one minute). Therefore, we should estimate congestion by looking at dropped packet statistics.

We should use link utilization to capture information on network loading. The polling interval must be small enough to be significant with respect to variations in human activity, since this is the activity that drives variation in network loading. On the other hand, there is no need to make it smaller than an interval over which excessive delay would noticeably impact productivity. For example, people won't notice congestion if it only occurs for 10 seconds a day. 30 minutes is a good estimate for the time over which people remain in one activity and over which prolonged high delay will affect their productivity. To track 30 minute variations, we need to sample twice as frequently, i.e. every 15 minutes.

3.2.5. Availability

We didn't have much time to get to this. There was discussion of presenting the information "By Customer" (e.g. Customers with Top N Total Outage Times) or just reporting on # outages that last longer than a certain amount of time.

Notetaker's Proposal: We should omit Availability reports from the first deployment for several reasons.
First, we didn't spend enough time to obtain consensus. Second, they can be politically sensitive. Third, outage data can be very tough to process. Think of trying to determine exactly how a network partition affects connectivity between different pairs of end users. It's an "N-Squared" problem. If we do want to address this, we should start with site, router, and external interface outages only, since these are O(N) problems.

4. Development Proposal

The following is a proposal for a "development/deployment" plan that tries to reach a reasonable compromise among functionality, burden on network operations resources, and "time to market." The discussion is segmented into three parts:

- What information is to be available through the server
- What are the collection/storage requirements
- What presentation tools should we build

4.1. Information Base

The goal of the Server piece is to provide access to data in a fairly raw form (to be described next), and it should be the first thing we do. Presentation tools that use this as input can be developed in parallel if people want to, but we shouldn't put them on the critical path. We will have to provide the collection tools as well (unless every NOC is already collecting enough data to supply the information outlined below).

The capabilities of the "first release" are to support the:

- McDonald's Report
- Offered Load by Interface Report
- Offered Load by Application Report
- Link Utilization Report
- Congestion Report

The Availability Report is missing because it is hard to do and (based upon the level of discussion we had) seemed to be of lower priority.

In the first release, we provide a server and client that can deliver the following statistics. For N specified days over a rolling three month interval:

- Total Input Packets and Input Octets per day per external interface.
- Total Input Packets and Octets across the network per day per application. (Note that this is NOT per interface.)
- Mean, Standard Deviation, and Peak 15 minute utilization per day per (unidirectional) link.
- Peak discard percentages over fifteen minute intervals per link-direction per day.

The Exchange Format between Server and Client should be ASCII-based because this enables people to quickly look at the data to see if it makes sense and because it enables quick, custom data reduction via AWK. (I have found both these capabilities to be useful in my own analyses of network data.) The first Client that we write should simply retrieve the data in the exchange format and write it to disk.

Rationale for this Base: This information supports the reports described below and then some, so that presentation tools development will not be limited to these reports. The three month collection interval is short enough to keep storage requirements under 5 Mbytes but long enough so that one can examine longer term trends by "dumping" the data a few times a year. (These files should be highly compressible, easily 2:1, since they'll contain mainly ASCII numerals, repetitions of the names of entities, and whitespace, colons etc.) The ASCII-based format will enable us to develop interoperable tools more quickly.

TBD:

- The exact exchange format (no real opinion here other than that it be ASCII-based).
- The command structure. The proposed format seems to be an excellent starting point.

4.2. Collection/Storage Requirements:

Input bytes and packets per external interface must be collected frequently enough to prevent counter overflow. As they are collected, they can be added to running totals for the day. At the end of the day, the daily totals for each external interface are stored.

Input bytes and packets per application over all interfaces must also be collected frequently enough to prevent overflow. At the end of the day these can be aggregated into daily totals. (I guess you have to collect these per external interface, but they can be aggregated into a network-wide total as the day goes on.)
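The running-total scheme for interface counters can be sketched as follows. This is only an illustration under stated assumptions: the minutes do not give a counter width, so a 32-bit wrapping counter is assumed here, and the function names are made up for the example. The sketch also assumes polls are frequent enough that a counter wraps at most once between two consecutive samples, which is exactly the "frequently enough to prevent counter overflow" condition.

```python
COUNTER_MODULUS = 2 ** 32  # assumption: 32-bit counters that wrap to zero


def counter_delta(previous, current, modulus=COUNTER_MODULUS):
    """Increase between two consecutive polls, allowing for one wrap."""
    return (current - previous) % modulus


def daily_total(samples, modulus=COUNTER_MODULUS):
    """Sum the per-poll increases over one day's time-ordered samples."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        total += counter_delta(previous, current, modulus)
    return total


# Illustrative samples only: the counter wraps between the first two polls.
octets_today = daily_total([4294967000, 200, 1200])
```

If a counter can wrap more than once between polls, the increase is undercounted and cannot be recovered afterwards, so the polling interval has to be chosen from the link speed and the counter width.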
Per link interface per 15 minutes: bytes sent, packets sent, packets received. (To get the drop rate, you have to correlate sent and received at the two ends of the link.) At the end of the day, store away the average utilization, the standard deviation, the peak utilization, and the peak drop percentage.

Assuming 10 octets per item for storage, I estimate that the necessary 3 month history can be maintained with <5 Mbytes for a network with 100 routers, 500 external interfaces, and 200 links.

4.3. Reports/Presentation Tools:

My hunch is that standardization of presentation tools will come about based on who does the work first. (It's hard to argue with decent code that's in place: to wit, the entire TCP/IP phenomenon.) Here are some suggestions (and the reasoning) for what we should do first.

4.3.1. McDonald's Report:

For an N day period, graph Total Input Bytes per day. Put the average packet length as a "note" on the graph.

Reason: Bytes is a better measure of the "useful" load carried by the network, i.e. the information sent around by the applications; packets are really an artifice of the way we do things. As a network manager, I would be interested in the end-user volume of information. By putting the average packet length on the graph, one can convert to packet volumes if need be.

For the same reason, I suggest that the next two reports be done in bytes as well. Note that the suggested initial information base will support comparable presentations by packets as well.

4.3.2. Offered Load by Customer Report:

Based on total input bytes for an N day period: Graph the distribution (or density function) of total input bytes vs. external interfaces as shown below. The external interfaces should be put in decreasing order of offered load (in bytes).

4.3.3. Offered Load by Application Report

Based upon total input bytes for the N day period, present a pie chart of the distribution by application.

4.3.4. Link Utilization

The objective here is to provide some information on the utilization of the total set of links and on the "worst" link. The input "data" we have to work with comprises two matrices:

    A(i,j) = average utilization of link i on day j
    P(i,j) = peak (15 minute) utilization of link i on day j

Define TAVG(A(i)) = time average of A(i,j) (i.e. sum-over-j(A(i,j))/#days).
Define TAVG(P(i)) = time average of P(i,j) (i.e. sum-over-j(P(i,j))/#days).

I suggest that we order links by the TAVG(P(i)) measure, i.e. the "worst" link is the one that has the highest average peak utilization over the period.

Graph the following:

- A histogram of the collection of A(i,j) values, using 10% buckets on the X-axis, i.e. plot the function F(n) where F(n) = percentage of A(i,j) entries in the (n-1)*10% -- n*10% range.
- A comparable histogram of the P(i,j).

Histograms are useful for summarizing the data over all links over the entire period and can suggest further explorations.

For the "worst link" (as defined above), plot as a function of day its average utilization for the day and its peak utilization for the day. (Note that the data that we collect supports exploration of these time series for any link.) Note that the proposed initial information base will support such analyses for any subset of the links.

4.3.5. Congestion

The available data as specified in section is D(i,j) = peak drop rate (during any fifteen minute interval) for link i on day j.

Plot a histogram of D(i,j). For the "worst" link (as defined above), say link I, plot D(I,j) as a function of j.

5. Presentations

In addition to the groups on the monthly reports, we had presentations from Bill Norton of Merit and Chris Meyers of Wash. U.

Chris proposed an exchange format. I'm guessing that the document is available on-line if you wish to review it.

Bill discussed Merit's OpStats activities for NSFnet.
He focussed on their presentation tools as well as the way that they internally organize the data (a tree structure of Unix files). One important point made during this discussion is that relational databases are not good for storing OpStats. (Performance is the issue.) This is unfortunate since many commercial DBMSs are relational in nature, and therefore we cannot leverage their (usually substantial) report facilities. The idea of a "client/server" model grew out of Bill's presentation.

6. Notable and Quotable

We had some discussion of how Network Managers use Management Reports and, therefore, what the reports need to present. One significant observation was that "Political Graphs don't have to make sense."

During Sue Hares' presentation of her group's work on the monthly reports, the KISS acronym was re-interpreted as Keep It Simple Sue.