Excerpt from Difference Equations
7.1 Introduction
Chapter 6 served as a brief review of the theory of runs, that is, the consecutive occurrence of events in a sequence of Bernoulli trials. During this process we developed a family of difference equations denoted by Ri,k(n). From this family of equations, we were able to compute the probability distribution of the number of failure runs of a given length. With access to hydrology data from the state of Texas, we were able to verify that the distribution of drought lengths observed in Texas in the early to middle 20th century was what we might predict from the Ri,k(n) model. This was a gratifying finding, since we were particularly concerned about whether the assumption of independence across Bernoulli trials was appropriate for the model.
The purpose of this chapter is to further elaborate the difference equation approach to run theory by providing a specific application to the prediction of droughts. The T[K,L](n) model mentioned in the previous chapter will be the focus of the mathematical developments of this chapter. In order to confirm the ability of a difference equation model to predict droughts, this system of difference equations must be motivated and solved.
7.2 Introduction to the T[K,L](n) Model
One might wonder why, after the derivation of the complicated Ri,k(n) model in the previous chapter, an additional family of difference equations should be considered. The reason is that although the distribution of the number of runs of specified lengths is useful, there are other, more useful quantities specified by the order statistics (for example, the minimum and maximum run lengths). Knowledge of the likelihood of occurrence of the worst run length and the best run length helps workers to plan for the worst and best contingencies. However, the Ri,k(n) model does not directly reveal the probability distribution of these order statistics. In this section we will derive a model based on the use of difference equations that will predict the probability distribution of the shortest and longest run lengths using the prospective perspective. That is, we will identify the probability distribution of the minimum and maximum run lengths before the sequence has occurred, and therefore make these predictions without knowledge of the number of successes and failures.
7.2.1 Definition and assumptions of the T[K,L](n) model
We will assume that we observe a sequence of independent Bernoulli trials with fixed probability of success p. However, we are not interested in the total number of runs that occur in this sequence; instead we concentrate on the failure runs. Thus, we are interested in addressing the following question: what is the probability that, in a future sequence of n Bernoulli trials, the minimum failure run length and the maximum failure run length both lie between length K and length L, where 0 ≤ K ≤ L ≤ n?
Before we proceed with this, however, some statements on order statistics are required. Order statistics are measures of the relative ranks of a collection of observations. For example, consider a collection of random variables such as the annual rainfall in a region. Let the annual rainfall in the ith year be xi, for i = 1 to n. With only this knowledge, it is possible to index the rainfall as x1, x2, x3,…,xn. However, it is also possible to order them using a different framework. One alternative is to order these observations from the smallest to the largest, creating the new sequence x[1], x[2], x[3],…,x[n], where x[1] ≤ x[2] ≤ x[3] ≤…≤ x[n]. These ordered values are called the order statistics of the sample. Identify x[1] as the minimum or smallest observation in the sequence, and x[n] as the maximum or largest observation.
Events involving these order statistics require careful thought and attention. For example, if the minimum observation of a sequence of random variables is greater than a constant c, then all of the observations are greater than c: the minimum is the smallest of the observations, so every observation is at least as large as the minimum and is therefore greater than c. Setting a lower bound for the minimum sets a lower bound for the entire set of observations.
A similar line of reasoning applies to the maximum of the observations. If the maximum of a set of observations is less than a constant c, then all of the observations must be less than c. Thus if we know that for a collection of observations the minimum is greater than or equal to K and the maximum is less than or equal to L, then all of the observations are “trapped” between K and L.
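The trapping argument can be checked numerically. The following is a minimal sketch in Python; the rainfall values, the bounds, and the helper name trapped are hypothetical and serve only as illustration.

```python
def trapped(observations, K, L):
    """True when every observation lies in the closed interval [K, L]."""
    return all(K <= x <= L for x in observations)

# Hypothetical annual rainfall totals x1, ..., xn (illustrative values only).
rainfall = [31.2, 18.5, 27.9, 22.4, 35.0]
K, L = 18.0, 35.0

# Order statistics: x[1] <= x[2] <= ... <= x[n].
ordered = sorted(rainfall)
minimum, maximum = ordered[0], ordered[-1]

# Bounding the two extremes is equivalent to bounding every observation.
by_extremes = (minimum >= K) and (maximum <= L)
assert by_extremes == trapped(rainfall, K, L)
```

Only the two extreme order statistics need to be examined; the remaining observations are bounded automatically.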
Define T[K,L](n) as
T[K,L](n) = P[the minimum failure run length is ≥ K and the maximum failure run length is ≤ L in n trials]
T[K,L](n) is the probability that all failure run lengths are trapped between length K and length L in a sequence of n Bernoulli trials.[1] For a simple example, T[5,7](9) is the probability that in 9 Bernoulli trials all failure runs have lengths of either 5, 6, or 7. The corresponding event consists of the following enumerated sequences, where F denotes a failure and H a success:
FFFFFHHHH FFFFFFHHH FFFFFFFHH
HFFFFFHHH HFFFFFFHH HFFFFFFFH
HHFFFFFHH HHFFFFFFH HHFFFFFFF
HHHFFFFFH HHHFFFFFF
HHHHFFFFF
Note that all sequences containing a failure run of length one, two, three, four, eight, or nine have been excluded. The total probability of the sequences depicted above is T[5,7](9). Notice also that the sequence of all heads (no failures) is not among them.
Although the Ri,k(n) and T[K,L](n) models share a common foundation in the Bernoulli model, they compute two different probabilities. The Ri,k(n) model focuses on failure runs of a specific length, and computes the probability distribution of the number of failure runs of that length. The T[K,L](n) model focuses not on a single failure run length, but on whether all failure runs fall within a certain interval of run lengths. It is useful to think of T[K,L](n) as trapping failure run lengths within intervals.
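The enumeration for T[5,7](9) can be reproduced by brute force. The sketch below is not the difference equation solution developed in this chapter; it simply lists every sequence of n trials and keeps those whose failure runs are all trapped in [K, L]. The function names are hypothetical. It assumes that a sequence with no failures is assigned a failure run length of 0, so that it qualifies only when K = 0, which agrees with the exclusion of the all-heads sequence here.

```python
from itertools import product

def failure_runs(seq):
    """Lengths of the maximal runs of 'F' in a string of 'H' and 'F'."""
    return [len(run) for run in seq.split("H") if run]

def trapped_sequences(K, L, n):
    """All length-n sequences whose failure run lengths all lie in [K, L].

    Assumption: a sequence with no failures is given run length 0,
    so it is kept only when K == 0.
    """
    kept = []
    for outcome in product("HF", repeat=n):
        seq = "".join(outcome)
        runs = failure_runs(seq) or [0]
        if all(K <= r <= L for r in runs):
            kept.append(seq)
    return kept

def T(K, L, n, q):
    """Brute-force T[K,L](n) for failure probability q (success p = 1 - q)."""
    p = 1.0 - q
    return sum(q ** s.count("F") * p ** s.count("H")
               for s in trapped_sequences(K, L, n))

events = trapped_sequences(5, 7, 9)
print(len(events))  # 12, matching the twelve sequences listed above
```

Because the enumeration visits all 2^n sequences, it is practical only for small n, but it provides an independent check on any closed-form or recursive solution.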
7.2.2 Motivation for the T[K,L](n) model
The goal of this chapter is to formulate T[K,L](n) in terms of K, L, q, and n. By completing this computation, the user has access to the following probabilities for n consecutive Bernoulli trials:
T[K,K](n) = P[all failure runs have the same run length (= K)]
T[0,L](n) = P[all failure run lengths are ≤ L in length]
1 – T[0,L](n) = P[at least one failure run length is greater than length L]
Further investigation reveals that this model provides important information involving order statistics. Observe that T[K,n](n) is the probability that the minimum failure run length is greater than or equal to K.
Similarly, T[0,L](n), the probability that all failure run lengths are less than or equal to L, is the probability that the maximum failure run length is ≤ L. T[0,L-1](n) is the probability that the maximum failure run length is less than or equal to L – 1. Thus T[0,L](n) – T[0,L-1](n) is the probability that the maximum failure run length is exactly L. With this formulation, the expected maximum failure run length E[MF(n)] and its variance Var[MF(n)] can be computed as
E[MF(n)] = Σ_{L=1}^{n} L [T[0,L](n) – T[0,L-1](n)]
Var[MF(n)] = Σ_{L=1}^{n} L² [T[0,L](n) – T[0,L-1](n)] – (E[MF(n)])²
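An analogous computation for the minimum failure run length is available. Observe that T[K,n](n) is the probability that the minimum failure run length is ≥ K. Then T[K+1,n](n) is the probability that the minimum failure run length is ≥ K+1, and T[K,n](n) – T[K+1,n](n) is the probability that the minimum failure run length is exactly K. We can therefore compute the expected value of the minimum failure run length mF(n), E[mF(n)], and its variance Var[mF(n)] as
E[mF(n)] = Σ_{K=1}^{n} K [T[K,n](n) – T[K+1,n](n)]
Var[mF(n)] = Σ_{K=1}^{n} K² [T[K,n](n) – T[K+1,n](n)] – (E[mF(n)])²
where T[n+1,n](n) is taken to be 0, since no failure run can be longer than n trials.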
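The differences of T described above give the probability distributions of the maximum and minimum failure run lengths, and the means and variances then follow from the usual moment formulas. The sketch below illustrates this for small n, with T evaluated by brute-force enumeration rather than by the difference equations of this chapter. The function names are hypothetical, and the code assumes that a sequence with no failures has maximum and minimum failure run length 0.

```python
from itertools import product

def T(K, L, n, q):
    """Brute-force T[K,L](n); a run-free sequence gets run length 0 (assumption)."""
    total = 0.0
    for outcome in product("HF", repeat=n):
        seq = "".join(outcome)
        runs = [len(r) for r in seq.split("H") if r] or [0]
        if K <= min(runs) and max(runs) <= L:
            total += q ** seq.count("F") * (1.0 - q) ** seq.count("H")
    return total

def moments(pmf):
    """Mean and variance of a distribution given as {value: probability}."""
    mean = sum(x * w for x, w in pmf.items())
    var = sum(x * x * w for x, w in pmf.items()) - mean ** 2
    return mean, var

def max_run_pmf(n, q):
    """P[maximum failure run length = L] = T[0,L](n) - T[0,L-1](n)."""
    return {L: T(0, L, n, q) - (T(0, L - 1, n, q) if L > 0 else 0.0)
            for L in range(n + 1)}

def min_run_pmf(n, q):
    """P[minimum failure run length = K] = T[K,n](n) - T[K+1,n](n)."""
    return {K: T(K, n, n, q) - (T(K + 1, n, n, q) if K < n else 0.0)
            for K in range(n + 1)}

n, q = 8, 0.3  # illustrative values only
print("maximum run: mean, variance =", moments(max_run_pmf(n, q)))
print("minimum run: mean, variance =", moments(min_run_pmf(n, q)))
```

Each distribution telescopes to T[0,n](n) = 1, which gives a quick consistency check on the computed probabilities.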
7.2.3 An application in hydrology
Consider the following formulation of a problem in water management. Workers in hydrology have long suspected that run theory with a basis in Bernoulli trials may be an appropriate foundation from which to predict sequences of hydrologic events. The hydrologist Yevjevich [1] was among the first to attempt to predict properties of droughts using the geometric probability distribution, defining a drought of k years as k consecutive years in which water resources are inadequate. In this development, the hallmark of events of hydrologic significance is consecutive occurrences of the event. We define a drought (negative run) as a sequence of consecutive years of inadequate water resources. We will also assume that the year to year availability of adequate water resources can be approximated by a sequence of Bernoulli trials in which only one of two possible outcomes (success with probability p or failure with probability q) may occur. These probabilities remain the same from year to year, and knowledge of the result of a previous year provides no information about the water resource availability of any following year. This model has been employed in modeling hydrologic phenomena [2].
If q is the probability of inadequate water resources, then T[K,L](n) is the probability that all droughts that occur in n trials are between K and L years in length. If T[K,L](n) can be formulated in terms of K, L, q, and n, quantities to which the hydrologist has access, the following probabilities for n consecutive years may be identified:
T[K,L](n) = P[all drought lengths are between K and L years long, inclusive]
1 – T[0,L](n) = P[at least one drought > L years in length]
Probabilities for the order statistics involving the maximum and minimum drought length are particularly noteworthy.
P[minimum drought length ≥ K years] = T[K,n](n)
P[maximum drought length ≤ L years] = T[0,L](n)
Access to these probabilities provides useful information for predicting extremes in drought lengths. The notion of order statistics can be further developed by observing that T[0,L](n) is the probability that all droughts in n years are ≤ L years long, i.e., the maximum drought length is ≤ L years in length. T[0,L-1](n) is the probability that the maximum drought length is ≤ L-1 years in length. Thus T[0,L](n) – T[0,L-1](n) is the probability that the maximum drought length is exactly L years in length.
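For planning horizons of many years, T[0,L](n) can be evaluated without listing sequences. The sketch below uses a standard dynamic-programming recursion on the length of the current drought. It is offered as an independent illustration and is not the difference equation solution derived in this chapter. The function name and the values q = 0.3, n = 50, and L = 5 are hypothetical.

```python
def prob_max_drought_at_most(L, n, q):
    """P[no drought longer than L years in n years], i.e. T[0,L](n).

    state[r] is the probability that the years observed so far contain no
    drought longer than L years and currently end in a drought of r years.
    """
    p = 1.0 - q
    state = [1.0] + [0.0] * L
    for _ in range(n):
        new = [0.0] * (L + 1)
        new[0] = p * sum(state)        # an adequate year ends any drought
        for r in range(L):
            new[r + 1] = q * state[r]  # an inadequate year extends the drought
        state = new                    # droughts reaching L + 1 years are dropped
    return sum(state)

q, n, L = 0.3, 50, 5  # illustrative values only
at_most_L = prob_max_drought_at_most(L, n, q)
exactly_L = at_most_L - prob_max_drought_at_most(L - 1, n, q)
print("P[maximum drought <= L years] =", at_most_L)
print("P[at least one drought > L years] =", 1.0 - at_most_L)
print("P[maximum drought exactly L years] =", exactly_L)
```

The recursion takes on the order of n × L steps, so horizons of decades or centuries are inexpensive to evaluate.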
[1] A Bernoulli trial was defined in Chapter 2.