###########################################################################
###########################################################################
NOTE! NOTE!
This file, written for "old" S, has not been adapted for R.
Also, it uses graphics applications that may not be
available for Mac OS X:
"brush" - SGI Irix application
"xgobi" or successor "ggobi"
Nevertheless, the discussion of the methods remains accurate and,
IMHO, worth a read, even if long.
###########################################################################
###########################################################################
###########################################################################
#################### TUTORIAL ON S AND FITTING OF DATA ##################
###########################################################################
###########################################################################
#################### INTRODUCTION ######################################
###########################################################################
This tutorial uses the S language and environment to fit and display
data. S is an integrated suite of software for statistical analysis,
computation, and the display and manipulation of data. S is built
around operations on vectors and arrays. S has a powerful programming
language and a development environment. There are various versions of
S. The one installed on this computer is from the 1992 AT&T sources.
A commercial version of S, called S-Plus, for workstations and PCs, is
available from Statistical Sciences and is used by several University
of Arizona Departments.
The user should be familiar with the basic operations of S. For an
introductory description of S, see:
"Notes on S-PLUS", by Bill Venables and Dave Smith, 1992.
Hard copies and computer readable copies of "Notes on S-PLUS" have been
made available. A new user of S should work through the sample S
session on pages 61-63. A copy of these commands is in the file
"venables.s".
For a full description of S and its use, see:
The S "Blue Book": R. A. Becker, J. M. Chambers, and A. R.
Wilks, "The New S Language", Wadsworth, 1988.
The S "White Book": J. M. Chambers and T. J. Hastie,
"Statistical Models in S", Wadsworth, 1992.
In the first chapter of each book the authors work through examples
that illustrate the use of S. Both books have been made available for
short-term reference (checkout for two hours during the day or
overnight if after 4pm).
Steady-state initial-rate data measured for the enzyme lactate
dehydrogenase (LDH) with NADH and pyruvate as substrates are used as a
test set for this tutorial. The model for the LDH reaction is a
compulsory-order mechanism, with binary LDH-NADH and LDH-NAD complexes
and ternary substrate complexes, LDH-NADH-pyruvate and
LDH-NAD-lactate. There are also ternary abortive complexes,
LDH-NADH-lactate and LDH-NAD-pyruvate, formed from reduced cofactor and
reduced substrate and oxidized cofactor and oxidized substrate,
respectively. These are not on the reaction path and when present
inhibit the reaction.
The LDH reaction is described by the mechanism:
LDH + NADH <--> LDH-NADH
LDH-NADH + pyruvate <--> LDH-NADH-pyruvate
(LDH-NADH-pyruvate == LDH-NAD-lactate)
LDH-NAD-lactate <--> LDH-NAD + lactate
LDH-NAD <--> LDH + NAD
LDH-NADH + lactate <--> LDH-NADH-lactate
LDH-NAD + pyruvate <--> LDH-NAD-pyruvate
which gives the rate equation:
v0 = Vmax / (1 + C1 * p + KmA/a +
KmB/b * (1 + C1 * p) * (1 + C2 * p) +
KmAB/(a * b) * (1 + C1 * p) + C3 * b) eq. 1
where:
v0 is the initial rate of reaction.
a, b, and p are the concentrations of NADH,
pyruvate, and lactate, respectively.
Vmax, KmA, KmB, and KmAB have the usual
meaning as kinetic constants for a
two-substrate reaction.
C1 = KmQ/KmPQ is the kinetic constant that
describes inhibition by the first product.
C2 = 1/KiP is the equilibrium constant that
describes formation of the first abortive
complex, LDH-NADH-lactate.
C3 is the kinetic constant that describes
formation of the LDH-NAD-pyruvate abortive
complex. C3 can be shown to be unimportant
for these data, and so can be fixed at a very
small value or eliminated from the rate
equation.
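For reference, eq. 1 transcribes directly into code. The sketch below is Python rather than S (the tutorial codes the same model in the S function ldhfn() for use with simpfit()); the parameter values used in the comments are illustrative only, not the fitted LDH constants.

```python
def v0_rate(a, b, p, Vmax, KmA, KmB, KmAB, C1, C2, C3=0.0):
    """Initial rate from eq. 1: v0 = Vmax / denominator."""
    denom = (1.0
             + C1 * p
             + KmA / a
             + (KmB / b) * (1.0 + C1 * p) * (1.0 + C2 * p)
             + (KmAB / (a * b)) * (1.0 + C1 * p)
             + C3 * b)          # C3 term is negligible for these data
    return Vmax / denom

# With no product (p = 0) and C3 = 0 the equation reduces to the
# familiar two-substrate form: Vmax / (1 + KmA/a + KmB/b + KmAB/(a*b)).
```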
We use here two methods of fitting data to a model that is nonlinear
in the parameters. Each method minimizes a test function, the
commonly-used sum of residuals squared, so obtaining a least-squares
best estimate of the parameter values.
The simplex method does this by trying systematically various values
for each parameter, i.e. by a guided search over a region of parameter
space. The search is organized around a polytope, the simplex, with
one more vertex than there are parameters (a triangle for two
parameters, a tetrahedron for three). The starting simplex is made so
large as to be sure to include the best estimate of the parameters.
During the search the simplex is reflected, expanded, and contracted,
shrinking about the (unknown) best estimate until it is arbitrarily
small. The average of the parameter values at the vertices then is
taken as the best estimate of the parameter values for the model. The
simplex method is particularly effective in handling ill-conditioned
functions, such as those with singularities or discontinuities, and it
easily handles bounds on the parameter values.
Gradient methods locate the best estimate by adjusting the current
estimate of the parameter values according to the gradient of the
least-squares function. Iteration (repetition of the gradient
calculation and adjustment of the parameters) moves the current
estimate arbitrarily close to the best estimate of the parameters,
where the gradient is zero. Gradient methods are expected to be faster
than a trial-and-error search like the simplex, but they are less able
to handle ill-conditioned functions.
For a description of the simplex method of function optimization, see
the paper by Nelder and Mead, Computer J. 7, 308, 1965. For general
discussion of the simplex and gradient methods of fitting data and
function optimization, see Press et al, "Numerical Recipes in C", 2nd
edition, Cambridge, 1992.
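The contrast between the two strategies can be illustrated on a toy
one-parameter least-squares problem. The sketch below is in Python
rather than S, and it uses a simple bracketing (ternary) search in
place of the full Nelder-Mead simplex; it is meant only to show the
two ideas (shrinking a region around the minimum vs. stepping along
the gradient toward the zero-gradient point), not the actual
algorithms of simpfit() or nls().

```python
# Fit y = theta * x by least squares; the exact answer is
# theta* = sum(x*y) / sum(x*x).
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.0]

def ssq(theta):
    """Sum of squared residuals to be minimized."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

# 1. Search: shrink a bracket [lo, hi] around the minimum.
#    Ternary search works because ssq is unimodal in theta.
lo, hi = -10.0, 10.0
for _ in range(200):
    m1 = lo + (hi - lo) / 3.0
    m2 = hi - (hi - lo) / 3.0
    if ssq(m1) < ssq(m2):
        hi = m2
    else:
        lo = m1
theta_search = (lo + hi) / 2.0

# 2. Gradient: Newton step theta -= grad/hess; because ssq is
#    quadratic in theta, one step lands on the zero-gradient point.
theta_grad = 0.0
for _ in range(5):
    grad = -2.0 * sum(x * (y - theta_grad * x) for x, y in zip(xs, ys))
    hess = 2.0 * sum(x * x for x in xs)
    theta_grad -= grad / hess

theta_exact = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```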
###########################################################################
#################### NONLINEAR FIT -- SIMPLEX SEARCH ####################
###########################################################################
The simplex fitting program simpfit() must be given the following: a
set of data to be fit to the model, a function describing the model,
and initial estimates for the parameters of the model. The matrix
ldhfn.dat has initial rate measurements (v0) for lactate dehydrogenase
(LDH). The v0 values are for varied concentrations of first substrate,
NADH (A), second substrate, pyruvate (B), and first product released,
lactate (P). The model described in eq. (1) is coded in the function
ldhfn(). ldhfn() is written for use with the data matrix ldhfn.dat.
The matrix ldhfn.p has parameter estimates that describe the vertices
of a starting simplex for the function and data of ldhfn() and
ldhfn.dat.
Please look at function, data and initial parameter
estimates.
ldhfn
ldhfn.dat
ldhfn.p
You may want to look at the online documentation for
simpfit().
? simpfit
Run the simplex fitting, storing the results in
ldh.simpfit.out.
ldh.simpfit.out<-simpfit(ldhfn,ldhfn.p,,,ldhfn.dat)
simpfit() returns a list of values that describe the
fit, including best estimates of the parameters (Vmax,
etc.), their standard deviations, the unscaled
covariance matrix, the variance of the sample, etc.
To view these results.
ldh.simpfit.out
Further discussion of the results is deferred until
after introduction of another method of fitting the
data.
###########################################################################
#################### NONLINEAR FIT -- GRADIENT METHOD ###################
###########################################################################
The function simpfit() does not return a list compatible with the
summary and plotting functions of the 1992 version of S. Also,
although functions like ldhfn() for some particular model may not be
difficult to code, still it is necessary to have written such a
separate model function in order to use simpfit().
Generally it should be more convenient, faster, and easier to fit the
data by use of a standard 1992 S function, nls(). This function
carries out a gradient search for the best values of the parameters.
nls() may not handle ill-conditioned functions or parameter bounds and
other frills. If these are a problem, then perhaps resort to
simpfit().
nls() and related functions use data frames, a structure introduced
with the 1992 S. Their usefulness will become apparent.
Construct a data frame from the LDH data in ldhfn.dat.
Change the names of column headers so that they do not
contain the characters "[]". The use of attach()
allows addressing the columns of a matrix or data frame
by name. Create a new data frame without the unneeded
columns (weights and v0calc).
xxxdf<-as.data.frame(ldhfn.dat)
names(xxxdf)<-c("v0","v0calc","weight","a","b","p")
attach(xxxdf)
objects(2)
ldh.df<-data.frame(v0,a,b,p)
A data frame can be used like a matrix.
xxxdf
ldhfn.dat
xxxdf[1,5]
ldhfn.dat[1,5]
But a data frame is _not_ a matrix. (It is really a
list.)
as.list(xxxdf)
as.list(ldhfn.dat)
Cleanup.
search()
detach(2)
rm(xxxdf)
View the ldhfn.p parameter estimates used for the
simplex fit.
ldhfn.p
Introduce the same values into ldh.df. C3 for
the LDH-NAD-pyruvate abortive complex does not
contribute. It was fixed at a value 1e-10 that removed
it from the simplex fit. We remove it entirely from
this fit.
parameters(ldh.df)<-list(Vmax=1e+00,KmA=1e-03,KmB=1e-02,KmAB=1e-03,C1=1e-03,C2=1e-06)
View the results.
ldh.df
Read the documentation on nls().
?nls
Carry out a fit using nls() with the rate law described
in eq. (1).
ldh.out1<-nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), ldh.df)
Note that no separate function needed to be written
(the model is expressed on the command line and nls()
estimates gradients numerically). Also, the fit is
faster than for the simplex method (a gradient search
is expected to be faster than a search like the simplex
that samples all or much of a region of parameter
space).
Look at the results.
as.list(ldh.out1)
summary(ldh.out1)
residuals(ldh.out1)
fitted.values(ldh.out1)
ldh.df$v0-fitted.values(ldh.out1) - residuals(ldh.out1)
coef(ldh.out1)
summary(ldh.out1)$parameters
sigma = standard error of the fit
= standard deviation
= sqrt(variance)
= sqrt ( (sum residuals^2)/degrees_of_freedom )
= rmssq = rmsd = rmse
summary(ldh.out1)$sigma
sqrt(sum(residuals(ldh.out1)^2)/summary(ldh.out1)$df[2])
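As a sanity check on the identities above, here is the same
arithmetic on a small invented residual vector (Python sketch; the
numbers are arbitrary, not the LDH residuals):

```python
import math

residuals = [0.5, -1.0, 1.5, -0.5]   # hypothetical residuals
df = 2                               # e.g. 4 points minus 2 parameters

ssq = sum(r * r for r in residuals)  # sum of squared residuals
variance = ssq / df                  # variance of the fit
sigma = math.sqrt(variance)          # standard error of the fit
```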
Compare the parameter estimates from nls() with the
parameter estimates from simpfit(). The difference in
the estimates is 1 part in 10^5 or less.
(ldh.simpfit.out$coef[1:6] - coef(ldh.out1))/coef(ldh.out1)
Compare the standard errors for the parameters obtained
from nls() with those from simpfit(). The standard
errors for each parameter are close, but as is
reasonable, different fitting methods lead to slightly
different descriptions of the least-squares surface
about the minimum. The fitting methods define the
minimum more accurately than the shape of the LSQ
surface about it.
ldh.simpfit.out$std.err
summary(ldh.out1)$parameters[,"Std. Error"]
(ldh.simpfit.out$std.err - summary(ldh.out1)$parameters[,2])/ldh.simpfit.out$std.err
The starting estimates of the parameters can be specified in other
ways. They can be given as an argument of the function, or, as done
below, as an array C of length 6. The array indices in the model
function can even be data values (not done below).
nls(v0 ~ C[1]/(1 + C[5]*p + C[2]/a + (C[3]/b)*(1 + C[5]*p)*(1 + C[6]*p) + (C[4]/(a*b))*(1 + C[5]*p)), data=ldh.df,start=list(C=c(1e+00,1e-03,1e-02,1e-03,1e-03,1e-06)))
###########################################################################
#################### QUALITY OF THE NONLINEAR FIT #######################
###########################################################################
We now ask about the quality of the fit and of the data.
###########################################################################
1. Are the residuals (deviations between measured and calculated values
= v0 - v0calc) normally distributed? Least-squares fitting assumes
this, and failure of the assumption may be cause for concern.
The function qqnorm() sorts the data (here the residuals) and plots
them against the corresponding theoretical values of the quantiles for
the normal distribution. If the points in the plot approximate a
straight line, then the assumption of normality is supported. (For a
sample x, the quantile corresponding to a probability P is the value X
such that the fraction of the sample x having values less than X is P,
e.g., for P=.5, X = median of the sample).
The function qqline() draws a line through the 1st and 3rd quartiles of
the data, which include the central (most probable) part of the
distribution and of course the majority of the sample points.
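The idea behind qqnorm() can be sketched outside S as well. The
Python fragment below uses the standard-library NormalDist; the
plotting positions (i - 0.5)/n are one common choice and may differ
in detail from S's ppoints(). It builds the theoretical quantiles and
checks linearity with a correlation coefficient; the sample values
are invented for illustration.

```python
from statistics import NormalDist

sample = [-1.3, -0.4, -0.1, 0.2, 0.5, 1.1]   # hypothetical residuals
n = len(sample)
nd = NormalDist()   # standard normal distribution

# Theoretical quantiles at probabilities (i + 0.5)/n, the analogue of
# ppoints(); a qq plot is sorted(sample) against these.
theo = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
obs = sorted(sample)

# For roughly normal data the points fall near a straight line; a
# Pearson correlation close to 1 quantifies that.
mt = sum(theo) / n
mo = sum(obs) / n
num = sum((t - mt) * (o - mo) for t, o in zip(theo, obs))
den = (sum((t - mt) ** 2 for t in theo) *
       sum((o - mo) ** 2 for o in obs)) ** 0.5
corr = num / den
```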
Display the results of qqnorm() and qqline() operating
on the residuals of the fit done with nls().
x11()
qqnorm(resid(ldh.out1))
qqline(resid(ldh.out1))
The data in this case satisfy the Gaussian assumption.
There is some deviation for the tails of the
distribution. To identify the tail points, construct a
sorted list of the residuals with the ordinal in the
data set attached.
attach(ldh.df)
x<-cbind(sort.list(resid(ldh.out1)),resid(ldh.out1),a,b,p)
x[x[,1],]
No pattern is apparent.
Can identify points by point-and-click.
xx<-residuals(ldh.out1)
plot(qnorm(ppoints(length(xx)))[order(order(xx))],xx)
identify(qnorm(ppoints(length(xx)))[order(order(xx))],xx)
Left mouse to label.
Right mouse to exit.
Cleanup.
rm(x, xx)
detach(2)
###########################################################################
2. Is each parameter significant in influencing the fit? What is the
confidence level for each of the parameters being different from 0?
First display the values of the parameters, their std. errors, and
their values of t = parameter value/std. error. Then evaluate the
probability corresponding to this value of t and the number of degrees
of freedom of the fit. This probability is 1 - confidence level. Note
that the parameter estimates obtained in a particular experiment are a
single sample drawn from a population of possible results. The
associated standard errors can be used to estimate the variance of this
population of parameter values. Use of Student's t distribution takes
into account the lack of knowledge of the true variance and the fact
that the standard errors themselves have uncertainty. For an
experiment with a large number of data points (degrees of freedom), the
t and normal distributions are the same.
For the lowest (least significant) t value:
summary(ldh.out1)$parameters
summary(ldh.out1)$df
1-pt(2.554559,20)
1-pt(min(summary(ldh.out1)$parameters[,3]),summary(ldh.out1)$df[2])
The result, 0.009445276, shows that the least
well-defined parameter differs from zero at the 0.99
confidence level.
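The remark above about t versus normal can be made concrete. In the
Python sketch below, the t tail probability 0.009445276 is the value
quoted from the S run; the normal tail at the same t value is
computed from the standard library and is smaller, reflecting the
heavier tails of t at 20 degrees of freedom.

```python
from statistics import NormalDist

t_value = 2.554559
t_tail_df20 = 0.009445276   # 1 - pt(2.554559, 20), from the S run above

# Tail probability of the standard normal at the same point:
normal_tail = 1.0 - NormalDist().cdf(t_value)

# With only 20 degrees of freedom the t distribution has heavier
# tails, so its tail probability exceeds the normal one; for large
# df the two distributions agree.
```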
###########################################################################
3. Is the std. error of the fit reasonable from an experimental view?
Compare the std. error of the fit to the min, max, and
mean of the quantity fit.
attach(ldh.df)
range(v0)
mean(v0)
summary(ldh.out1)$sigma
x<-c(min(v0), mean(v0), max(v0))
cbind(x,summary(ldh.out1)$sigma/x*100)
detach(2)
rm(x)
The standard error of the fit is 1.5 to 6 percent of
the v0 values. This percentage error is reasonable, if
anything small for a steady-state kinetic experiment.
###########################################################################
4. Are the same parameter estimates (the same fit) obtained with
different starting values of the parameters? If not, then the fit is
suspect.
For convenience, construct a data frame from ldh.df
without the starting parameter estimates.
xxx.df<-ldh.df
parameters(xxx.df)<-NULL
Carry out the fit with the previously-used starting
parameter estimates and with estimates uniformly
increased and decreased by powers of 2 or 10.
Previous estimates:
nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-03,"C2"=1e-06))
2x smaller -- result = as for previous:
nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=.5e+00,"KmA"=.5e-03,"KmB"=.5e-02,"KmAB"=.5e-03,"C1"=.5e-03,"C2"=.5e-06))
4x smaller -- result = as for previous:
nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=.25e+00,"KmA"=.25e-03,"KmB"=.25e-02,"KmAB"=.25e-03,"C1"=.25e-03,"C2"=.25e-06))
10x smaller -- result = no fit, singular gradient matrix:
nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=1e-01,"KmA"=1e-04,"KmB"=1e-03,"KmAB"=1e-04,"C1"=1e-04,"C2"=1e-07))
10x larger -- result = as for previous:
nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=1e+01,"KmA"=1e-02,"KmB"=1e-01,"KmAB"=1e-02,"C1"=1e-02,"C2"=1e-05))
100x larger -- result = no fit, singular gradient matrix:
nls(v0 ~ Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p)), xxx.df, list("Vmax"=1e+02,"KmA"=1e-01,"KmB"=1e-00,"KmAB"=1e-01,"C1"=1e-01,"C2"=1e-04))
Clearly, a range of starting estimates is acceptable,
and the fit is independent of the starting point within
this range. The gradient algorithm of nls() cannot
handle a rough LSQ surface, and starting estimates far
from the fitted values (10x smaller or 100x larger) are
not accommodated.
The simplex algorithm of simpfit() is less sensitive.
For a range of 100x above to 100x below the previous
estimates:
x.porig<-ldhfn.p[1,]
x.p<-x.porig*c(rep(1.e2,6),1)
x.pfactor<-c(rep(1.e-4,6),1)
x.out<-simpfit(ldhfn,x.p,x.pfactor,,ldhfn.dat)
x.out
(x.out$coef - ldh.simpfit.out$coef)/x.out$coef
(x.out$std.err - ldh.simpfit.out$std.err)/x.out$std.err
(x.out$cov.unscaled - ldh.simpfit.out$cov.unscaled)/x.out$cov.unscaled
rm(x.out, x.pfactor, x.p, x.porig)
Note that a rather large number of iterations was
needed (1763 vs. 447 for the previous fit), so this fit
is slow and memory growth is large. The results agree
with the previous fit within a small tolerance. Common
sense should allow guessing initial parameter estimates
closer than 100x above or below the best-fit values.
###########################################################################
5. What is the effect of leaving parameters out of the fit?
(xxx.df<-ldh.df)
x.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-3,"C2"=1e-6))
x.c2.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p) + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-3))
x.c1.out<-nls(v0 ~ (Vmax/(1 + KmA/a + (KmB/b)*(1 + C2*p) + (KmAB/(a*b)))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C2"=1e-6))
x.kmab.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"C1"=1e-3,"C2"=1e-6))
x.kmb.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmAB"=1e-03,"C1"=1e-3,"C2"=1e-6))
x.kma.out<-nls(v0 ~ (Vmax/(1 + C1*p + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("Vmax"=1e+00,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-3,"C2"=1e-6))
x.vmax.out<-nls(v0 ~ (1/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p))), xxx.df,list("KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03,"C1"=1e-3,"C2"=1e-6))
y<-list(x.out,x.c2.out,x.c1.out,x.kmab.out)
names(y)<-list("x.out","x.c2.out","x.c1.out","x.kmab.out")
for (i in seq(along=y)){print(names(y)[[i]]);print(summary(y[[i]])$parameters)}
for (i in seq(along=y)){print(names(y)[[i]]);print(summary(y[[i]])$sigma)}
rm(x.c1.out,x.c2.out,x.kma.out,x.kmab.out,x.out,xxx.df)
Removal of Vmax or KmB resulted in singular gradients and thus no fit.
Removal of any one of the other parameters was less drastic. The
values of fit$sigma, to be compared to the full-fit value 0.001616097,
are: x.c2.out=0.002042659, x.c1.out=0.0021374, x.kmab.out=0.002025575,
x.kma.out=0.006302654. Looking at the smallest
increase in sigma, for KmAB:
F(1,20) = increase in SSQ/mean residual SSQ
= (0.002025575^2*21 - 0.001616097^2*20)/(0.001616097^2)
= 12.98991
Although it is somewhat improper to use this F statistic to draw
a conclusion about a nonlinear model from a linear approximation of it,
the high probability corresponding to the above F value,
pf(12.98991, 1, 20) = 0.9982288, suggests that KmAB and thus also
the other more "well-defined" parameters are all significant.
To examine the quality of the linear approximation, try profile().
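The F arithmetic can be reproduced directly from the sigma values
quoted above (Python sketch):

```python
sigma_full = 0.001616097   # full model, 20 degrees of freedom
sigma_red = 0.002025575    # model without KmAB, 21 degrees of freedom
df_full, df_red = 20, 21

# F(1, 20) = (increase in SSQ) / (mean residual SSQ of the full fit)
ssq_full = sigma_full ** 2 * df_full
ssq_red = sigma_red ** 2 * df_red
F = (ssq_red - ssq_full) / (sigma_full ** 2)
```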
###########################################################################
6. What is the effect of introducing the C3 parameter into the fit?
yyy.df<-ldh.df
parameters(yyy.df)<-list(Vmax=0.2541638,KmA=0.03251107,KmB=0.3056165,KmAB=0.002343344,C1=0.003951771,C2=0.006755385,C3=1.e-10)
y.c3.out<-nls(v0 ~ (Vmax/(1 + C1*p + KmA/a + (KmB/b)*(1 + C1*p)*(1 + C2*p) + (KmAB/(a*b))*(1 + C1*p) + C3*p)), yyy.df, list(C3=1.e-3))
summary(y.c3.out)
sum(y.c3.out$res^2) - sum(ldh.out1$res^2)
rm(yyy.df,y.c3.out)
The t value for C3 is -0.000186992 (effectively zero).
The reduction in SSQ = 7.173252e-14 on introduction of
the C3 parameter. The reduction ranges from
3.392665e-05 to 7.819570e-04 for the four parameters
described above.
Apparently, C3 has an insignificant effect upon the
fit, in agreement with the earlier decision to omit C3
or fix it at a very small value.
###########################################################################
#################### GRAPHICS DISPLAYS ##################################
###########################################################################
The quality of the data and of the results of the fits can be examined
by suitable graphics displays. Examination of multidimensional data,
i.e., the dependence of a response on more than one variable, requires
special strategies.
###########################################################################
1. Xgobi is a UNIX program for interactive dynamic graphics and data
visualization. Hard copy and computer-readable versions of the xgobi
manual are available. The S function xgobi() starts up the UNIX
program with data from an S object (data frame, matrix, or set of
vectors).
Look at online help.
!man xgobi
?xgobi
Construct a data frame from the LDH data and the
calculated values of the initial rates.
ldh.plot.df<-as.data.frame(cbind(ldh.df,v0calc=(ldh.df$v0 - ldh.out1$res)))
Start xgobi().
xgobi(ldh.plot.df)
With focus in the xgobi window, examine the data.
a. select xyplot, select y=v0 and x=p, select brush, color p=0 data red.
b. select xyplot, select y=v0, fix y, cycle.
the v0 vs v0calc plot shows that the residuals are small.
c. transform v0, v0calc, a, b to inverse, then cycle again.
1/v0-1/a and 1/v0-1/b are Lineweaver-Burk plots.
d. select rotate, select pause, select 1/v0, 1/a, 1/b.
learn to rotate the data cloud with the mouse.
the p=0 data define a surface with a slight twist.
replace v0 by v0calc, and see better the twist in the plane.
the twist reflects the contribution of the KmAB/(a*b) term.
e. cycle quickly between v0 and v0calc to get a feeling for
deviations between measured and calculated values.
f. select rock, deselect pause, use mouse to change view.
g. this is a 5-dimensional data set. select tour and all 5 variables.
although not useful for these data, tour can help discover
patterns in multidimensional data and can isolate the important
variables by principal component analysis or projection
pursuit.
there are sample data supplied with xgobi that illustrate the
use of tour and its dependents.
h. other options: dotplots, rescaling, erasing data, adding lines,
postscript plots, etc.
Quit.
###########################################################################
2. S on an sgi machine has a mouse- and menu-based facility for two- and
three-dimensional display of data.
Read documentation.
? brush
? iris4d
Start the iris device driver and brush.
brush(ldh.plot.df)
With focus in the iris4d window,
right-mouse brings up menus.
left mouse highlights.
middle mouse removes highlights.
etc.
Brush comes up in scatterplot mode. For these data
there are 25 cells, each a plot of one variable against
another. Labels for the variables are along the
diagonal.
Try the following:
a. focus in a cell with something vs p.
brush p=0 data red by clicking left with brush box over portion
of data.
b. focus in some cell of interest.
blowup the plot with right-mouse menu selection. then reset.
c. identify points by number, by selecting highlight and placing the
cursor on the number list on the right of the display,
or by selecting identify and putting the cursor on a point in
a cell.
d. change the size of the datum symbol, by selecting appropriate
option and clicking middle mouse for larger, left for smaller.
Prepare to go into rotation mode. Rotation mode is
similar to rotation mode of xgobi, but perhaps a bit
snazzier.
With right-mouse menu and submenus, delete variables v0calc and p.
Select rotation from menu.
a. select velocity rotation mode, and play with data display.
position of cursor about center functions like a joy stick in
changing _rate_ of movement of data cube.
left and middle mouse give left and right rotation of cube.
b. select mouse rotation mode.
as in a., except cursor changes position of cube instead of
velocity of movement of cube.
c. deselect variable v0 and select v0calc, etc., etc..
Quit.
To get a display of (1/v0, 1/a, 1/b, 1/v0calc, p), one
must restart brush with appropriately modified data.
attach(ldh.plot.df)
brush(cbind("1/v0"=1/v0, "1/v0calc"=1/v0calc, "1/a"=1/a, "1/b"=1/b, "p"=p))
In scatterplot mode:
a. brush p=0 data red
b. go to cell 1/v0 vs 1/b, select blowup.
note that both the slope and intercept for p=150 (white data)
differ from those for p=0 (lowest line of red data). this is
diagnostic of a compulsory ordered ternary complex mechanism.
Select rotation mode.
a. select v0calc (or v0), 1/a, 1/b as variables
b. select mouse or velocity rotation
c. the effect of p=150 on slope and intercept is easily seen.
d. the twist in the plane of the data is easily seen
Quit.
Clean up.
detach(2)
###########################################################################
3. Plots made with S functions. S has a rich plotting language. The
following are relatively unsophisticated examples.
Start the X11 device driver (or the iris4d driver) and
for convenience attach a data frame.
iris4d() or x11()
attach(ldh.plot.df)
For scatterplots like that obtained with brush():
pairs( ~ v0 + a + b + p)
pairs( ~ I(1/v0calc) + I(1/a) + I(1/b) + p)
For plots of a single variable, first eliminate data
with p>0.
nop<-ldh.plot.df[ldh.plot.df$p==0,]
Make a reciprocal plot of v0 for varied [NADH], and add
points for calculated values.
plot(1/nop$a,1/nop$v0)
points(1/nop$a,1/nop$v0calc,pch="+")
Plot the cumulative distribution function for each
variable, which for a numeric variable is: sort(x) vs
ppoints(x), where ppoints(x) is a vector of
probabilities. plot() with a data frame object is the
same as plot.data.frame().
plot(ldh.plot.df)
Plot the cumulative distribution function for a single
variable.
plot(ppoints(v0), sort(v0))
plot(ppoints(v0-v0calc), sort(v0-v0calc))
Make a quantile plot of the residuals.
qqnorm(v0-v0calc)
qqline(v0-v0calc)
Box plots also display the distribution of a variable.
boxplot(v0,v0calc)
boxplot(v0-v0calc)
S can produce "3-d" plots == perspective plots. All
duplicate values of the independent variables must be
removed. Also, include data only for p=0, to simplify
the surface.
nop<-ldh.plot.df[ldh.plot.df$p==0,]
nop.uniq <- nop[!duplicated(paste(nop[, "a"], nop[, "b"])),]
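For readers outside S, the effect of the !duplicated(paste(...))
idiom above (keep the first row for each (a, b) pair) can be sketched
in Python; the rows below are invented, not the LDH data:

```python
rows = [
    {"a": 1.0, "b": 2.0, "v0": 0.10},
    {"a": 1.0, "b": 2.0, "v0": 0.11},   # duplicate (a, b): dropped
    {"a": 1.0, "b": 3.0, "v0": 0.15},
    {"a": 2.0, "b": 2.0, "v0": 0.20},
]

seen = set()
unique_rows = []
for row in rows:
    key = (row["a"], row["b"])
    if key not in seen:        # first occurrence wins, as with !duplicated()
        seen.add(key)
        unique_rows.append(row)
```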
Values of z (v0 or 1/v0) are interpolated over a grid of
x and y points (a and b, or their reciprocals).
i<-interp(1/nop.uniq$a, 1/nop.uniq$b, 1/nop.uniq$v0)
j<-interp(1/nop.uniq$a, 1/nop.uniq$b, 1/nop.uniq$v0calc)
k<-interp(1/nop.uniq$a, 1/nop.uniq$b, 1/nop.uniq$v0 - 1/nop.uniq$v0calc)
l<-interp(nop.uniq$a, nop.uniq$b, nop.uniq$v0)
m<-interp(nop.uniq$a, nop.uniq$b, nop.uniq$v0calc)
n<-interp(nop.uniq$a, nop.uniq$b, nop.uniq$v0 - nop.uniq$v0calc)
Draw perspective plots.
persp(i)
persp(j)
persp(k)
persp(l)
persp(m)
persp(n)
Draw contour plots.
contour(i,nint=30)
contour(j,nint=30)
contour(k,nint=30)
contour(l,nint=30)
contour(m,nint=30)
contour(n,nint=30)
Clean up.
detach(2)
rm(nop, nop.uniq, i, j, k, l, m, n)
###########################################################################
#################### LINEAR FITS ########################################
###########################################################################
S has powerful operators to handle linear models.
The LDH rate equation can be thrown into linear form if the denominator
terms with product are deleted, i.e., if measurements are only for p=0.
Make the new data frame.
attach(ldh.df)
objects(2)
ldh.nop.df<-data.frame(v0,a,b)[1:20,]
For reference carry out a nonlinear fit for
ldh.nop.df.
ldh.nop.out1<-nls(v0 ~ (Vmax/(1 + KmA/a + KmB/b + KmAB/(a*b))), ldh.nop.df,list("Vmax"=1e+00,"KmA"=1e-03,"KmB"=1e-02,"KmAB"=1e-03))
And now a linear fit of the reciprocal rate equation
with lm().
lm.out1<-lm(1/v0 ~ I(1/a)*I(1/b),data=ldh.nop.df)
And weight the linear fit by the theoretical v0^4.
lm.wt.v04.out1<-lm(1/v0 ~ I(1/a)*I(1/b),data=ldh.nop.df, weights=ldh.nop.df$v0^4)
Compare the weighted linear fit and the nonlinear fit.
x<-coef(lm.wt.v04.out1)
(coef(ldh.nop.out1) - x/c(x[1]^2,x[1],x[1],x[1]))/coef(ldh.nop.out1)
The differences in parameter estimates range from 0.25
to 1.3 percent. This is plausible, given the weighting
introduced, which is itself subject to the uncertainty
of the sample.
Compare the unweighted linear fit and the nonlinear
fit.
x<-coef(lm.out1)
(coef(ldh.nop.out1) - x/c(x[1]^2,x[1],x[1],x[1]))/coef(ldh.nop.out1)
The differences range from 7.6 to 28 percent, as
expected because of the incorrect weighting. It is not
nice to do an unweighted reciprocal plot.
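The v0^4 weighting follows from first-order error propagation: if v0
carries a roughly constant error, the error of 1/v0 scales as 1/v0^2,
so var(1/v0) scales as 1/v0^4 and the least-squares weight should be
proportional to v0^4. A small numeric check of the propagation step
(Python sketch; the values are arbitrary):

```python
# If v0 carries a small error delta, first-order propagation gives
# 1/(v0 + delta) - 1/v0 ~= -delta / v0**2, so sd(1/v0) ~= sd(v0)/v0**2
# and var(1/v0) is proportional to 1/v0**4 -- hence weights of v0**4.
v0 = 0.25
delta = 1e-6

exact = 1.0 / (v0 + delta) - 1.0 / v0
approx = -delta / v0 ** 2

rel_err = abs(exact - approx) / abs(approx)
```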
A difficulty with the linear fit to a reciprocal model is that the
estimates of the standard error in the parameters are for (1/Vmax) and
(KmX/Vmax), X={A, B, AB}. Typically one wants uncertainties for Vmax
and KmX, as are obtained with the nonlinear model of the nls() fit.
The uncertainties for Vmax and KmX are calculated as:
var(z) = sum_over_i sum_over_j d(z)/d(P_i)*d(z)/d(P_j)*covar_i_j
where P_i are the linear model parameters, (1/Vmax) and (KmX/Vmax).
Covar_i_j is the covariance matrix for the linear model. Var(z) is the
variance to be calculated for the parameter Vmax or KmX. z is a
function expressing the parameter Vmax or KmX in terms of the linear
model parameters, (1/Vmax) and (KmX/Vmax). The calculation is
appropriately carried out within a function.
lm.to.nls <- function(lmout) {
#for linear fit to LDH data
#compute values and std.error of Vmax and KmA, etc.
#from values and std.errors of 1/Vmax and KmA/Vmax, etc.
#
#value: data frame with estimate, std.error and t value for each parameter
#argument: data frame returned by lm()
#
std.error<-numeric(4)
parameters<-numeric(4)
sigmasq<-sum(summary(lmout)$residuals^2)/16	#df = 20 points - 4 parameters
cij<-summary(lmout)$cov*sigmasq
x<-coef(lmout)
parameters[1]<-1/x[1]
std.error[1]<-sqrt(1/x[1]^4*cij[1,1])
for (i in 2:4) {
parameters[i]<-x[i]/x[1]
std.error[i]<-sqrt((1/x[1])^2*cij[i,i] + (-x[i]/(x[1]^2))^2*cij[1,1] + 2*(1/x[1])*(-x[i]/(x[1]^2))*cij[1,i])
}
data.frame(parameters = parameters, std.error = std.error, t = parameters/std.error)
}
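The propagation formula can be checked numerically for the case
z = P_2/P_1, which is the form of each KmX = (KmX/Vmax)/(1/Vmax). The
Python sketch below uses an invented parameter vector and covariance
matrix, not the LDH fit:

```python
# Hypothetical linear-model estimates and covariance matrix:
P1, P2 = 2.0, 3.0                      # e.g. 1/Vmax and KmA/Vmax
C = [[0.01, 0.002],
     [0.002, 0.04]]                    # covar_i_j

# z = P2 / P1, with gradient dz/dP1 = -P2/P1**2, dz/dP2 = 1/P1.
g = [-P2 / P1 ** 2, 1.0 / P1]

# var(z) = sum_over_i sum_over_j g_i * g_j * C_ij
var_z = sum(g[i] * g[j] * C[i][j] for i in range(2) for j in range(2))
```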
Run this function, lm.to.nls(), on the results of the
weighted linear fit and compare with the results of the
nonlinear fit.
x<-lm.to.nls(lm.wt.v04.out1)
x
summary(ldh.nop.out1)$parameters
(summary(ldh.nop.out1)$parameters - x)/x
The ldh.nop.out1 values for Vmax and KmA are close to
the values calculated from lm.wt.v04.out1. The largest
differences are in KmAB: 1.3 percent in the parameter
estimate, 3 percent in the std. error, and 4.5 percent
in t.
Analysis of variance, with the function aov(), can be
used to assess the linear model.
summary(aov(1/v0 ~ I(1/a) + I(1/b) + I(1/a):I(1/b),data=ldh.nop.df, weights=ldh.nop.df$v0^4))
The term with the coefficient (KmAB/Vmax), which is the
least significant factor, enters into the fit at the
.005 significance level.
The significance level for the (KmAB/Vmax) parameter
estimated from the t value of the weighted linear fit
is .0025.
When viewing the surface, 1/v0 = f(1/a, 1/b), a slight
twist was seen. We can examine the origin of this
twist.
Carry out a linear fit without the term
(KmAB/Vmax)*(1/a)*(1/b).
lm.wt.nokmab.out1<-lm(1/v0 ~ I(1/a) + I(1/b),data=ldh.nop.df, weights=ldh.nop.df$v0^4)
Construct data frames for use in graphics displays.
ldh.nop.plot.df<-as.data.frame(cbind(ldh.nop.df,v0calc=(ldh.nop.df$v0 - ldh.nop.out1$res)))
lm.nokmab.plot.df<-as.data.frame(cbind(ldh.nop.df, v0calc=lm.wt.nokmab.out1$fitted.values))
Display of the results with xgobi shows that the plane
of the calculated values is not twisted for
lm.nokmab.plot.df and is twisted for ldh.nop.plot.df,
showing that the twist comes from a contribution of the
cross product term 1/a * 1/b.