NOTES ON DATA REDUCTION


 J. A. Rupley, Tucson, Arizona


INTRODUCTION:

 In order to derive conclusions from quantitative measurements, there
must be some form of data reduction.  This can be as simple as a
comparison by eye of two curves drawn through the data.  If, however,
the data set is large and complex, for example with more than one
independent variable, and if the questions posed are detailed or involve
a complicated nonlinear model, then visual or graphical methods are less
satisfactory than a computer-based analysis.  The latter is now widely used.

 The intent of this short introduction is to show that a sophisticated
computer program can be handled easily, that its use saves time and
effort, that it can treat a more complicated model than can be handled
graphically, and that it provides information such as estimates of
uncertainties in the parameters that is difficult or impossible to
obtain from graphical methods.  

 These comments, although general, are addressed to the solution of a
particular problem, an analysis of enzyme kinetic data.  These are
initial rate measurements made on the lactate-dehydrogenase-catalyzed
reaction of pyruvate with NADH, in the presence and absence of lactate
as inhibitor.  The results of the computer fit are the following:  (1)
values of the kinetic constants V, KmA, KmB, KmAB, KmQ/KmPQ, and
KPInhib.  The defining equation is given below.  The first five
constants are those that can be evaluated by the standard graphical
methods of primary and secondary reciprocal plots, as described in
introductory textbooks.  The constant KPInhib is the dissociation
constant for the dead-end complex LDH-NADH-lactate, which is included in
the mechanism fit by the computer program but cannot be included in a
mechanism on which the graphical methods are based.  (2) Estimates of
the standard deviations of the kinetic constants.  These are needed for
an understanding of the reliability and significance of the values
calculated for the kinetic constants.  (3) A list of the coordinates of
points suitable for construction of the lines of the reciprocal plots of
the standard graphical methods.


REMARKS ON FITTING OF A MODEL TO DATA

 In a typical data reduction, a particular model to be tested is fit to
a set of data points under some criterion for best-fit.  The ith data
point of a set of N data points consists of a single value for the
dependent variable, Yobserved(i), measured for corresponding single
values for the one or more independent variables, Xobserved(i).  The
commonly used least squares criterion for quality of fit is the minimum
value of the weighted sum of the squares of the deviations between the
observed values of Y and the values of Y calculated according to the
model.

Working from the model to be fit to the data, one develops an equation
relating the dependent variable Y to the independent variables X and to
a set of M variable parameters p, for each of the N data points.  For the
ith data point:

     Ymodel(i) = F(Xobserved(i); p(j), j=1,M)        eq. 1

If the model predicts a linear relationship between Y and a single
independent variable X :

     Ymodel(i) = p(1) + p(2) * Xobserved(i)          eq. 2

The constants p(1) and p(2) of equation (2) are the Y-axis intercept and
the slope, respectively, and of course are the same for all data points
(for all pairs of values Y(i) and X(i)).

 Fitting of a model to data consists of finding the values of the M
variable parameters p that give best agreement between the N pairs
of values of Ymodel(i) and Yobserved(i).  Best agreement can be defined
as the minimum value of the least squares function y:

          N
     y = SUM (Yobserved(i) - Ymodel(i))^2 * W(i)     eq. 3
         i=1


The factor W(i) of equation (3) is the normalized reciprocal variance
(the statistical weighting) of the ith data point, and it can be set at
unity if the data points are all of equal estimated uncertainty.
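
 Equations (2) and (3) can be written out directly.  The following
Python fragment is a minimal sketch, not part of the fitting program;
the names model_line and least_squares_y and the four data points are
made up for illustration.  It evaluates the least squares function y
for two trial parameter sets, one close to the data and one far from it.

```python
def model_line(x, p):
    """Equation (2): straight line with intercept p[0] and slope p[1]."""
    return p[0] + p[1] * x


def least_squares_y(p, x_obs, y_obs, weights=None, model=model_line):
    """Equation (3): weighted sum of squared deviations between the
    observed Y values and the Y values calculated from the model."""
    if weights is None:
        # Equal estimated uncertainty for all points: W(i) = 1.
        weights = [1.0] * len(x_obs)
    return sum(w * (y - model(x, p)) ** 2
               for x, y, w in zip(x_obs, y_obs, weights))


# Made-up illustrative data, roughly following Y = 1 + 2 * X.
x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.1, 2.9, 5.2, 6.8]

y_good = least_squares_y([1.0, 2.0], x_obs, y_obs)  # trial near the data
y_poor = least_squares_y([0.0, 1.0], x_obs, y_obs)  # trial far from the data
```

The trial with intercept 1 and slope 2 gives the smaller value of y.
Fitting amounts to searching for the parameter values that make y as
small as possible.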

 Combining equations (1) and (3), one sees that the least squares
function y of equation (3) is a function of the full set of N data
points and a set of M variable parameters:

  y = f(Yobserved(i), Xobserved(i), i=1,N; p(j), j=1,M)   eq. 4

 The fitting problem therefore consists of finding the minimum value of
the least squares function y.  For a given set of data, y depends only
on the M variable parameters p (the data values Yobserved(i) and
Xobserved(i) in equation (4) are constant in the fitting).  There are
several methods commonly used to find the minimum of y and thus evaluate
the best-fit values of the parameters p. The more useful of these can
handle nonlinear model functions F (equation (1)) of essentially arbitrary
mathematical form.  The rate law for lactate dehydrogenase, given below,
is an example of a nonlinear model function.

 In the simplex method used for solving the fitting problem, one
constructs an M-dimensional polyhedron with M + 1 vertices (the
simplex).  Each dimension of the simplex corresponds to a variable
parameter of equation (4).  Each vertex of the simplex is a point in the
M-dimensional space, which is called "parameter space" or "factor
space."  The M coordinates of each vertex are values of the M parameters.
Thus each vertex of the simplex has an associated value of the least
squares function y. The starting simplex is constructed to be so large
as to include within it the point corresponding to the minimum value of
y. This minimum point has as its coordinates the best-fit values of
the parameters.

 The minimization process shrinks the simplex about the minimum point,
even though the coordinates of the minimum are not known beforehand,
until the vertices of the simplex are so close together, and the values
of y at the vertices so nearly equal, that an exit test is satisfied.
The exit test is set so that a
desired level of accuracy is obtained.  The values of the M parameters
averaged over all the vertices, i.e., the parameter values for the centroid
of the simplex, serve as reliable estimates of the best-fit parameter
values (those for the least squares function minimum), because the
minimum point is known to be inside the shrunken simplex and thus near
the centroid.
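
 The simplex procedure can be sketched in a few lines of Python.  This
is a simplified illustration in the spirit of Nelder and Mead (1965),
not the code of the fitting program.  The function names, the step and
tolerance values, and the test data (exact points on the line
Y = 1 + 2 * X) are assumptions made for the example.

```python
def sum_of_squares(p, x_obs, y_obs):
    """Least squares function y for a straight line, with W(i) = 1."""
    return sum((y - (p[0] + p[1] * x)) ** 2 for x, y in zip(x_obs, y_obs))


def simplex_minimize(f, start, step=1.0, ftol=1e-12, xtol=1e-7, max_iter=5000):
    """Shrink a simplex of M + 1 vertices about the minimum of f."""
    m = len(start)
    # Starting simplex: the start point plus one vertex displaced along each axis.
    simplex = [list(start)]
    for j in range(m):
        vertex = list(start)
        vertex[j] += step
        simplex.append(vertex)
    values = [f(v) for v in simplex]

    for _ in range(max_iter):
        # Order the vertices from best (lowest y) to worst (highest y).
        order = sorted(range(m + 1), key=lambda k: values[k])
        simplex = [simplex[k] for k in order]
        values = [values[k] for k in order]

        # Exit test: vertices close together and y values nearly equal.
        spread = max(abs(v[j] - simplex[0][j]) for v in simplex for j in range(m))
        if values[-1] - values[0] <= ftol and spread <= xtol:
            break

        # Centroid of all vertices except the worst one.
        center = [sum(v[j] for v in simplex[:-1]) / m for j in range(m)]

        def along(vertex, t):
            # Point on the line from the centroid (t = 0) to vertex (t = 1).
            return [center[j] + t * (vertex[j] - center[j]) for j in range(m)]

        reflected = along(simplex[-1], -1.0)
        f_reflected = f(reflected)
        if f_reflected < values[0]:
            # Reflection gave a new best point: try to expand further.
            expanded = along(simplex[-1], -2.0)
            f_expanded = f(expanded)
            if f_expanded < f_reflected:
                simplex[-1], values[-1] = expanded, f_expanded
            else:
                simplex[-1], values[-1] = reflected, f_reflected
        elif f_reflected < values[-2]:
            simplex[-1], values[-1] = reflected, f_reflected
        else:
            # Keep the better of worst and reflected, then contract toward centroid.
            if f_reflected < values[-1]:
                simplex[-1], values[-1] = reflected, f_reflected
            contracted = along(simplex[-1], 0.5)
            f_contracted = f(contracted)
            if f_contracted < values[-1]:
                simplex[-1], values[-1] = contracted, f_contracted
            else:
                # Contraction failed: shrink all vertices toward the best one.
                best = simplex[0]
                for k in range(1, m + 1):
                    simplex[k] = [best[j] + 0.5 * (simplex[k][j] - best[j])
                                  for j in range(m)]
                    values[k] = f(simplex[k])

    # Parameter values averaged over all vertices (centroid of the simplex).
    centroid = [sum(v[j] for v in simplex) / (m + 1) for j in range(m)]
    return centroid, f(centroid)


x_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
y_obs = [1.0 + 2.0 * x for x in x_obs]
# Start (0, 0) with step 5 gives a simplex that encloses the minimum at (1, 2).
best_fit, y_min = simplex_minimize(
    lambda p: sum_of_squares(p, x_obs, y_obs), [0.0, 0.0], step=5.0)
```

The returned parameter values are those of the centroid of the final
simplex, as described above.  For a nonlinear model only the function
passed as f changes.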

 We generally want to estimate the uncertainties in the parameter values
obtained for a model fit to a particular set of data points.  To this end,
one calculates standard deviations of the parameters.

 There are likely to be large uncertainties in the parameters if there
are few data points or if there are large deviations between Ymodel and
Yobserved.  As a rule, one should have 5 to 10 times as many data points
as parameters.

 The first try at estimating uncertainties of the parameters can fail.
The calculation involves matrix inversion, the use of differences
between nearly equal large numbers, and the approximation of a complex
surface by a simple quadratic function.  It may be necessary to change
certain test values and then to repeat the calculation of the standard
deviations.  In particular, if a parameter is close to a bound, so that
expansion of the simplex in that dimension is not possible, then that
parameter should be fixed in the quadratic fit.
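
 One common way to carry out such a calculation is sketched below in
Python.  It is an illustration under stated assumptions, not the
procedure of the fitting program.  The surface of y near the minimum is
approximated by a quadratic, whose matrix of second derivatives is
estimated here by finite differences (a substitute for a quadratic fit
to simplex points).  The matrix is inverted, and the variances of the
parameters are taken as 2 * ymin/(N - M) times the diagonal elements of
the inverse.  The data, the step size h, and the best-fit values (found
beforehand for this straight-line example) are made up for illustration.

```python
import math


def sum_of_squares(p, x_obs, y_obs):
    """Least squares function y for a straight line, with W(i) = 1."""
    return sum((y - (p[0] + p[1] * x)) ** 2 for x, y in zip(x_obs, y_obs))


def hessian(f, p, h=1e-3):
    """Second derivatives of f at p by central finite differences."""
    m = len(p)
    hess = [[0.0] * m for _ in range(m)]
    for j in range(m):
        for k in range(m):
            def shifted(dj, dk):
                q = list(p)
                q[j] += dj * h
                q[k] += dk * h
                return f(q)
            # Differences between nearly equal numbers: h must not be too small.
            hess[j][k] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return hess


def invert(a):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def standard_deviations(f, p_best, n_points, h=1e-3):
    """Standard deviations of the parameters from the quadratic approximation."""
    m = len(p_best)
    variance = f(p_best) / (n_points - m)   # residual variance, N - M degrees of freedom
    inv_h = invert(hessian(f, p_best, h))
    return [math.sqrt(2.0 * variance * inv_h[j][j]) for j in range(m)]


x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.1, 2.9, 5.2, 6.8]       # made-up illustrative data
p_best = [1.09, 1.94]              # best-fit intercept and slope for these data
sd = standard_deviations(lambda p: sum_of_squares(p, x_obs, y_obs),
                         p_best, len(x_obs))
```

With only four points and two parameters the standard deviations are a
sizable fraction of the parameter values, which illustrates the rule
given above on the number of data points.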

 All fitting methods can fail.  We will not discuss problems with
bounds, local minima, ill-behaved functions, poor quality data,
physically unreasonable best fits, etc.  References given below should be
read for more complete discussions of the fitting problem.

 For additional discussion of fitting by use of the simplex
method, see "Notes on the fitting program", and the article by
Nelder and Mead (1965).


THE FUNCTION FIT TO THE DATA

 The function to be fit to data obtained in steady-state kinetic
experiments with lactate dehydrogenase is for an ordered ternary-complex
pathway with dead-end complexes (EAP and EQB):  

                    EAP
                     |   KPInhib = ea * p/eap
                     |
          ----------EA-----------
   k1 * a | k-1             k-2 | k2 * b
          |                     |
          |                     |
          E               EAB <---> EPQ                 eq. 5
          |                     |
          |                     |
       k4 | k-4 * q     k-3 * p | k3
          ----------EQ-----------
                     |
                     |   KBInhib = eq * b/eqb
                    EQB

where E is lactate dehydrogenase, A is NADH, B is pyruvate, P is
lactate, and Q is NAD.  Lower case letters denote reactant or product
concentrations.

 This pathway is more complicated than the one without dead-end
complexes, which is the basis of the standard graphical methods of
analysis of two-substrate--two-product reactions:

          -----------EA----------
          |                     |
          |                     |
          E               EAB <---> EPQ                 eq. 6
          |                     |
          |                     |
          -----------EQ----------

 For the reaction in the direction of pyruvate reduction by NADH, with
lactate as product inhibitor, the rate law for the pathway of equation
(5) is:

     vo  =  V / [ 1  +  KmQ/KmPQ * p  +  KmA/a       eq. 7

               +  KmB/b * (1 + KmQ/KmPQ * p) * (1 + 1/KPInhib * p)

               +  KmAB/(a * b) * (1 + KmQ/KmPQ * p)

               +  k3/(k3 + k4) * 1/KBInhib * b ]

 The presence in the pathway of equation (5) of the dead-end complexes
EAP and EQB leads to a significantly more complicated rate law than is
found for the simpler pathway of equation (6); compare equation (7) with
the following equation, for the "bare" compulsory order pathway without
dead-end complexes (the pathway of equation (6)):

     vo  =  V / [ 1  +  KmQ/KmPQ * p  +  KmA/a       eq. 8

               +  KmB/b * (1 + KmQ/KmPQ * p)

               +  KmAB/(a * b) * (1 + KmQ/KmPQ * p) ]

 Measurements of the initial rate, vo, made as a function of the
concentrations of NADH, pyruvate, and lactate are fit with equation (7),
by use of the program "ldhfit".  The parameters evaluated in the fitting
are those of equation (7) and are listed in Table I.

 Several points should be noted regarding equations (7) and (8):  (1) The
last term of equation (7), containing the equilibrium constant KBInhib,
probably can be neglected under initial rate conditions with q=0; in the
fitting this is recognized by fixing the coefficient of the last term
(parameter 7 of Table I) at a small value, 1E-10, which effectively
eliminates the term.  (2) With this change, equations (7) and (8)
are identical except for the appearance in one term of equation (7) of a
factor containing the equilibrium constant KPInhib, which is for
dissociation of the dead-end complex EAP defined in the pathway of
equation (5).
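
 The relation between equations (7) and (8) can be checked numerically.
The Python sketch below codes the two rate laws term by term.  The
function names, the argument names, and the values of the constants and
concentrations are made up for illustration and are not results of a
fit.  The argument parm7 is read here as the coefficient of b in the
last term of equation (7), that is, parameter 7 of Table I.

```python
def rate_eq7(a, b, p, V, KmA, KmB, KmAB, KmQ_KmPQ, inv_KPInhib, parm7=1e-10):
    """Equation (7): ordered pathway with dead-end complexes EAP and EQB."""
    product_factor = 1.0 + KmQ_KmPQ * p
    denominator = (1.0 + KmQ_KmPQ * p + KmA / a
                   + KmB / b * product_factor * (1.0 + inv_KPInhib * p)
                   + KmAB / (a * b) * product_factor
                   + parm7 * b)
    return V / denominator


def rate_eq8(a, b, p, V, KmA, KmB, KmAB, KmQ_KmPQ):
    """Equation (8): ordered pathway without dead-end complexes."""
    product_factor = 1.0 + KmQ_KmPQ * p
    denominator = (1.0 + KmQ_KmPQ * p + KmA / a
                   + KmB / b * product_factor
                   + KmAB / (a * b) * product_factor)
    return V / denominator


# Made-up constants and concentrations, in arbitrary consistent units.
consts = dict(V=1.0, KmA=0.01, KmB=0.1, KmAB=0.002, KmQ_KmPQ=0.05)
a, b, p = 0.1, 1.0, 10.0

v7 = rate_eq7(a, b, p, inv_KPInhib=0.02, **consts)
v8 = rate_eq8(a, b, p, **consts)
# With 1/KPInhib = 0 and parm7 = 0, equation (7) reduces to equation (8).
v7_reduced = rate_eq7(a, b, p, inv_KPInhib=0.0, parm7=0.0, **consts)
```

In this example the dead-end complex EAP lowers the calculated rate when
lactate is present (v7 is less than v8), and the two equations agree
when the KPInhib factor and the last term are removed.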


REFERENCES:

A simplex method for function minimization.
J.A. Nelder and R. Mead (1965). Computer J. 7, 308.

Digital computer user's handbook.
M. Klerer and G.A. Korn (1967). McGraw-Hill, New York.

Data analysis in biochemistry and biophysics.
M.E. Magar (1972). Academic, New York.

The solution of the general least squares problem with special
reference to high-speed computers.
R.H. Moore and R.K. Ziegler (1960). Los Alamos Scientific
Laboratory Report LA-2367.


TABLE I:

                         EQUATION (7)
FITTING                  RATE LAW
PARAMETERS               PARAMETERS

1                        V

2                        KmA  =  KmNADH

3                        KmB  =  KmPyruvate

4                        KmAB =  KmNADH-Pyruvate

5                        KmQ/KmPQ = KmNAD/KmLactate-NAD

6                        1/KPInhib = 1/KInhibLactate

7                        k3/(k3 + k4) * 1/KBInhib
                              =  k3/(k3 + k4) * 1/KInhibPyruvate
                              parm(7) approx. equal to 0 at t=0


INDEPENDENT VARIABLES:

a = [NADH]     b = [Pyruvate]    p = [Lactate]    q = [NAD]
                                                  q = 0 at t=0


DEPENDENT VARIABLE:

vo = initial rate of conversion of pyruvate to lactate