NOTES ON DATA REDUCTION

J. A. Rupley, Tucson, Arizona

INTRODUCTION:

In order to derive conclusions from quantitative measurements, there must be some form of data reduction. This can be as simple as a comparison by eye of two curves drawn through the data. If, however, the data set is large and complex, for example with more than one independent variable, and if the questions posed are detailed or involve a complicated nonlinear model, then visual or graphical methods are less satisfactory than a computer-based analysis. The latter is now widely used.

The intent of this short introduction is to show that a sophisticated computer program can be handled easily, that its use saves time and effort, that it can treat a more complicated model than can be handled graphically, and that it provides information, such as estimates of uncertainties in the parameters, that is difficult or impossible to obtain from graphical methods.

These comments, although general, are addressed to the solution of a particular problem, an analysis of enzyme kinetic data. These are initial rate measurements made on the lactate-dehydrogenase-catalyzed reaction of pyruvate with NADH, in the presence and absence of lactate as inhibitor.

The results of the computer fit are the following:

(1) Values of the kinetic constants V, KmA, KmB, KmAB, KmQ/KmPQ, and KPInhib. The defining equation is given below. The first five constants are those that can be evaluated by the standard graphical methods of primary and secondary reciprocal plots, as described in introductory textbooks. The constant KPInhib is the dissociation constant for the dead-end complex LDH-NADH-lactate, which is included in the mechanism fit by the computer program but cannot be included in a mechanism on which the graphical methods are based.
(2) Estimates of the standard deviations of the kinetic constants. These are needed for an understanding of the reliability and significance of the values calculated for the kinetic constants.
(3) A list of the coordinates of points suitable for construction of the lines of the reciprocal plots of the standard graphical methods.

REMARKS ON FITTING OF A MODEL TO DATA

In a typical data reduction, a particular model to be tested is fit to a set of data points under some criterion for best-fit. The ith data point of a set of N data points consists of a single value for the dependent variable, Yobserved(i), measured for corresponding single values for the one or more independent variables, Xobserved(i). The commonly-used least squares criterion for quality of fit is the minimum value of the weighted sum of the squares of the deviations between the observed values of Y and the values of Y calculated according to the model.

Working from the model to be fit to the data, one develops an equation relating the dependent variable Y to the independent variables X and to a set of M variable parameters p, for each of the N data points. For the ith data point:

    Ymodel(i) = F(Xobserved(i); p(j), j=1,M)                    eq. 1

If the model predicts a linear relationship between Y and a single independent variable X:

    Ymodel(i) = p(1) + p(2) * Xobserved(i)                      eq. 2

The constants p(1) and p(2) of equation (2) are the Y-axis intercept and the slope, respectively, and of course are the same for all data points (for all pairs of values Y(i) and X(i)).

Fitting of a model to data consists of finding the values of the M variable parameters p that give best agreement between the N pairs of values of Ymodel(i) and Yobserved(i). Best agreement can be defined as the minimum value of the least squares function y:

         N
    y = SUM (Yobserved(i) - Ymodel(i))^2 * W(i)                 eq. 3
        i=1

The factor W(i) of equation (3) is the normalized reciprocal variance (the statistical weighting) of the ith data point, and it can be set at unity if the data points are all of equal estimated uncertainty.
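As an illustration of equations (2) and (3), the following minimal Python sketch evaluates the least squares function y for a straight-line model. The function names (linear_model, least_squares) and the data values are hypothetical and chosen only for this example; they are not taken from the fitting program.

```python
def linear_model(x, p):
    """Equation (2): Ymodel = p(1) + p(2) * X, with p stored 0-based."""
    return p[0] + p[1] * x


def least_squares(p, x_obs, y_obs, weights=None, model=linear_model):
    """Equation (3): weighted sum of squared deviations for parameters p."""
    if weights is None:
        # All points of equal estimated uncertainty: W(i) = 1.
        weights = [1.0] * len(y_obs)
    return sum(
        w * (y - model(x, p)) ** 2
        for x, y, w in zip(x_obs, y_obs, weights)
    )


# Hypothetical data lying close to the line Y = 1 + 2 * X.
x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.1, 2.9, 5.2, 6.8]

y_good = least_squares([1.0, 2.0], x_obs, y_obs)  # parameters near the data
y_poor = least_squares([0.0, 1.0], x_obs, y_obs)  # parameters far from the data
```

Parameters that describe the data well give a small value of y (y_good), and parameters that describe it badly give a large one (y_poor). The fitting problem is the search for the parameters that make y as small as possible.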
Combining equations (1) and (3), one sees that the least squares function y of equation (3) is a function of the full set of N data points and a set of M variable parameters:

    y = f(Yobserved(i), Xobserved(i), i=1,N; p(j), j=1,M)       eq. 4

The fitting problem therefore consists of finding the minimum value of the least squares function y. For a given set of data, y depends only on the M variable parameters p (the data points Yobserved(i) -- Xobserved(i) in equation (4) are constant in the fitting).

There are several methods commonly used to find the minimum of y and thus evaluate the best-fit values of the parameters p. The more useful of these can handle nonlinear model functions F (equation (1)) of essentially arbitrary mathematical form. The rate law for lactate dehydrogenase, given below, is an example of a nonlinear model function.

In the simplex method used for solving the fitting problem, one constructs an M-dimensional polyhedron with M + 1 vertices (the simplex). Each dimension of the simplex corresponds to a variable parameter of equation (4). Each vertex of the simplex is a point in the M-dimensional space, which is called "parameter space" or "factor space." The M coordinates of each vertex are values of the M parameters. Thus each vertex of the simplex has an associated value of the least squares function y.

The starting simplex is constructed to be so large as to include within it the point corresponding to the minimum value of y. This minimum point has as its coordinates the best-fit values of the parameters. The minimization process shrinks the simplex about the minimum point, even though the coordinates of the minimum are not known beforehand, until the vertices of the simplex are so close together and so nearly equal that an exit test is satisfied. The exit test is set so that a desired level of accuracy is obtained.
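The simplex procedure described above can be sketched in a few lines of Python. This is a textbook variant of the Nelder and Mead scheme (reflection, expansion, contraction, shrinkage), written for illustration under the assumption of unit weights and a straight-line model; it is not the code of the fitting program, and the names (nelder_mead, sum_of_squares) and data values are hypothetical.

```python
def nelder_mead(f, start, step, tol=1e-12, tol_x=1e-7, max_iter=2000):
    """Minimize f over M parameters; return the average of the final vertices."""
    m = len(start)
    # M + 1 vertices: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for j in range(m):
        vertex = list(start)
        vertex[j] += step
        simplex.append(vertex)
    values = [f(v) for v in simplex]

    for _ in range(max_iter):
        # Order the vertices from best (lowest y) to worst (highest y).
        order = sorted(range(m + 1), key=lambda k: values[k])
        simplex = [simplex[k] for k in order]
        values = [values[k] for k in order]

        # Exit test: vertices close together and function values nearly equal.
        spread = max(abs(v[j] - simplex[0][j]) for v in simplex for j in range(m))
        if values[-1] - values[0] < tol and spread < tol_x:
            break

        # Centroid of every vertex except the worst one.
        center = [sum(v[j] for v in simplex[:-1]) / m for j in range(m)]
        worst = simplex[-1]

        def along(t):
            # Point on the line through the worst vertex and the centroid.
            return [c + t * (c - w) for c, w in zip(center, worst)]

        trial, f_trial = along(1.0), f(along(1.0))          # reflection
        if f_trial < values[0]:
            expanded = along(2.0)                            # expansion
            f_expanded = f(expanded)
            if f_expanded < f_trial:
                trial, f_trial = expanded, f_expanded
            simplex[-1], values[-1] = trial, f_trial
        elif f_trial < values[-2]:
            simplex[-1], values[-1] = trial, f_trial
        else:
            # Contract outside if the reflection beat the worst vertex, else inside.
            outside = f_trial < values[-1]
            contracted = along(0.5 if outside else -0.5)
            f_contracted = f(contracted)
            if f_contracted < (f_trial if outside else values[-1]):
                simplex[-1], values[-1] = contracted, f_contracted
            else:
                # Shrink every vertex halfway toward the best vertex.
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, v)]
                    for v in simplex[1:]
                ]
                values = [values[0]] + [f(v) for v in simplex[1:]]

    return [sum(v[j] for v in simplex) / (m + 1) for j in range(m)]


# Hypothetical data and the least squares function of equation (3), W(i) = 1.
x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.1, 2.9, 5.2, 6.8]


def sum_of_squares(p):
    return sum((y - (p[0] + p[1] * x)) ** 2 for x, y in zip(x_obs, y_obs))


# Starting simplex (0,0), (5,0), (0,5) is large enough to enclose the minimum.
fit = nelder_mead(sum_of_squares, start=[0.0, 0.0], step=5.0)
```

For these four points the ordinary straight-line least squares formulas give an intercept of 1.09 and a slope of 1.94, and the sketch should reproduce those values closely. The advantage of the simplex search is that sum_of_squares can be replaced by the least squares function of any nonlinear model without changing the minimizer.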
The values of the M parameters averaged over all the vertices, i.e., the parameter values for the centroid of the simplex, serve as reliable estimates of the best-fit parameter values (those for the least squares function minimum), because the minimum point is known to be inside the shrunken simplex and thus near the centroid.

We generally want to estimate the uncertainties in the parameter values obtained for a model fit to a particular set of data points. To this end, one calculates standard deviations of the parameters. There are likely to be large uncertainties in the parameters if there are few data points or if there are large deviations between Ymodel and Yobserved. As a rule, one should have 5 to 10 times as many data points as parameters.

The first try at estimating uncertainties of the parameters can fail. The calculation involves matrix inversion, the use of differences between nearly equal large numbers, and the approximation of a complex surface by a simple quadratic function. It may be necessary to change certain test values and then to repeat the calculation of the standard deviations. In particular, if a parameter is close to a bound, so that expansion of the simplex in that dimension is not possible, then that parameter should be fixed in the quadratic fit.

All fitting methods can fail. We will not discuss problems with bounds, local minima, ill-behaved functions, poor quality data, physically unreasonable best fits, etc. References given below should be read for more complete discussions of the fitting problem.

For additional discussion of fitting by use of the simplex method, see "Notes on the fitting program", and the article by Nelder and Mead (1965).
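The quadratic approximation mentioned above can be illustrated for a two-parameter fit. The sketch below estimates the second derivatives of the least squares function y at its minimum by central differences, inverts the resulting 2 x 2 matrix H, and takes the parameter variances as the diagonal of 2 * s^2 * inverse(H), with s^2 = y(minimum)/(N - M). This is the standard least squares result and is an assumption here; the fitting program may organize the calculation differently. The names (std_devs_two_params, sum_of_squares) and data values are hypothetical.

```python
import math


def sum_of_squares(p, x_obs, y_obs):
    """Least squares function y of equation (3) for a straight line, W(i) = 1."""
    return sum((y - (p[0] + p[1] * x)) ** 2 for x, y in zip(x_obs, y_obs))


def std_devs_two_params(f, p_best, n_points, h=1e-3):
    """Standard deviations of two parameters from the curvature of f at p_best."""
    def at(d0, d1):
        return f([p_best[0] + d0, p_best[1] + d1])

    f0 = at(0.0, 0.0)
    # Second derivatives by central differences; note that these are
    # differences between nearly equal numbers, as cautioned in the text.
    h00 = (at(h, 0.0) - 2.0 * f0 + at(-h, 0.0)) / h**2
    h11 = (at(0.0, h) - 2.0 * f0 + at(0.0, -h)) / h**2
    h01 = (at(h, h) - at(h, -h) - at(-h, h) + at(-h, -h)) / (4.0 * h**2)

    # Inversion of the 2 x 2 matrix [[h00, h01], [h01, h11]]; a determinant
    # near zero signals that the parameters are not separately determined.
    det = h00 * h11 - h01**2
    variance_of_fit = f0 / (n_points - 2)  # y(minimum) / (N - M), with M = 2
    var0 = 2.0 * variance_of_fit * h11 / det
    var1 = 2.0 * variance_of_fit * h00 / det
    return math.sqrt(var0), math.sqrt(var1)


# Hypothetical data; p_best holds the least squares intercept and slope.
x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.1, 2.9, 5.2, 6.8]
p_best = [1.09, 1.94]

sd_intercept, sd_slope = std_devs_two_params(
    lambda p: sum_of_squares(p, x_obs, y_obs), p_best, len(x_obs)
)
```

For a straight line the least squares surface is exactly quadratic, so the result can be checked against the usual textbook formulas for the standard errors of intercept and slope (about 0.169 and 0.091 for these points). For a nonlinear model the quadratic form is only an approximation near the minimum, which is one reason the calculation can fail and may need to be repeated with a different step h.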
THE FUNCTION FIT TO THE DATA

The function to be fit to data obtained in steady-state kinetic experiments with lactate dehydrogenase is for an ordered ternary-complex pathway with dead-end complexes (EAP and EQB):

                        EAP
                         |   KPInhib = ea * p/eap
                         |
              ----------EA-----------
       k1 * a |  k-1           k-2  | k2 * b
              |                     |
              |                     |
              E                    EAB <---> EPQ                eq. 5
              |                     |
              |                     |
           k4 | k-4 * q     k-3 * p | k3
              ----------EQ-----------
                         |
                         |   KBInhib = eq * b/eqb
                        EQB

where E is lactate dehydrogenase, A is NADH, B is pyruvate, P is lactate, and Q is NAD. Lower case letters denote reactant or product concentrations.

This pathway is more complicated than the one without dead-end complexes, which is the basis of the standard graphical methods of analysis of two-substrate--two-product reactions:

              -----------EA----------
              |                     |
              |                     |
              E                    EAB <---> EPQ                eq. 6
              |                     |
              |                     |
              -----------EQ----------

For the reaction in the direction of pyruvate reduction by NADH, with lactate as the product inhibitor, the rate law for the pathway of equation (5) is:

    vo = V / [ 1 + KmQ/KmPQ * p + KmA/a                         eq. 7
               + KmB/b * (1 + KmQ/KmPQ * p) * (1 + 1/KPInhib * p)
               + KmAB/(a * b) * (1 + KmQ/KmPQ * p)
               + k3/(k3 + k4) * 1/KBInhib * b ]

The presence in the pathway of equation (5) of the dead-end complexes EAP and EQB leads to a significantly more complicated rate law than is found for the simpler pathway of equation (6); compare equation (7) with the following equation, for the "bare" compulsory order pathway without dead-end complexes (the pathway of equation (6)):

    vo = V / [ 1 + KmQ/KmPQ * p + KmA/a                         eq. 8
               + KmB/b * (1 + KmQ/KmPQ * p)
               + KmAB/(a * b) * (1 + KmQ/KmPQ * p) ]

Measurements of the initial rate, vo, made as a function of the concentrations of NADH, pyruvate, and lactate are fit with equation (7), by use of the program "ldhfit". The parameters evaluated in the fitting are those of equation (7) and are listed in Table I.
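Equations (7) and (8) translate directly into model functions of the kind required by equation (1). The following Python sketch is an illustration only and is not the code of "ldhfit"; the function names (rate_eq7, rate_eq8) are hypothetical, the seven parameters are taken in the order of Table I, and the numerical values are invented for the example.

```python
def rate_eq7(a, b, p, parm):
    """Initial rate vo from equation (7); parm holds parameters 1 to 7 of Table I."""
    V, KmA, KmB, KmAB, KmQ_KmPQ, inv_KPInhib, parm7 = parm
    product_factor = 1.0 + KmQ_KmPQ * p
    denominator = (
        1.0 + KmQ_KmPQ * p + KmA / a
        + KmB / b * product_factor * (1.0 + inv_KPInhib * p)
        + KmAB / (a * b) * product_factor
        + parm7 * b  # parm7 = k3/(k3 + k4) * 1/KBInhib
    )
    return V / denominator


def rate_eq8(a, b, p, parm):
    """Initial rate vo from equation (8), the pathway without dead-end complexes."""
    V, KmA, KmB, KmAB, KmQ_KmPQ = parm[:5]
    product_factor = 1.0 + KmQ_KmPQ * p
    denominator = (
        1.0 + KmQ_KmPQ * p + KmA / a
        + KmB / b * product_factor
        + KmAB / (a * b) * product_factor
    )
    return V / denominator


# Hypothetical parameter values and concentrations, in arbitrary units.
parm = [1.0, 0.5, 0.2, 0.1, 0.3, 2.0, 0.0]
a, b, p = 0.4, 0.6, 1.5

vo_with_dead_end = rate_eq7(a, b, p, parm)
vo_bare = rate_eq8(a, b, p, parm)

# With parameters 6 and 7 set to zero, equation (7) reduces to equation (8).
vo_reduced = rate_eq7(a, b, p, parm[:5] + [0.0, 0.0])
```

With a nonzero value of 1/KPInhib and lactate present, the rate from equation (7) falls below that from equation (8), which is the additional inhibition contributed by the dead-end complex EAP.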
Several points should be noted regarding equations (7) and (8):

(1) The last term of equation (7), containing the equilibrium constant KBInhib, probably can be neglected under initial rate conditions with q=0; in the fitting this is recognized by setting the last term at a small value, 1E-10, which eliminates it.
(2) With this change, equations (7) and (8) are identical except for the appearance in one term of equation (7) of a factor containing the equilibrium constant KPInhib, which is for dissociation of the dead-end complex EAP defined in the pathway of equation (5).

REFERENCES:

A simplex method for function minimization. J.A. Nelder and R. Mead (1965). Computer J. 7, 308.
Digital computer user's handbook. M. Klerer and G.A. Korn (1967). McGraw-Hill, New York.
Data analysis in biochemistry and biophysics. M.E. Magar (1972). Academic, New York.
The solution of the general least squares problem with special reference to high-speed computers. R.H. Moore and R.K. Ziegler (1960). Los Alamos Scientific Laboratory Report LA-2367.

TABLE I: EQUATION (7)

FITTING        RATE LAW PARAMETERS
PARAMETERS

    1          V
    2          KmA = KmNADH
    3          KmB = KmPyruvate
    4          KmAB = KmNADH-Pyruvate
    5          KmQ/KmPQ = KmNAD/KmLactate-NAD
    6          1/KPInhib = 1/KInhibLactate
    7          k3/(k3 + k4) * 1/KInhibPyruvate        parm(7) approx. equal to 0 at t=0

INDEPENDENT VARIABLES:

    a = [NADH]
    b = [Pyruvate]
    p = [Lactate]
    q = [NAD]          q = 0 at t=0

DEPENDENT VARIABLE:

    vo = initial rate of conversion of pyruvate to lactate