## Abstract

Quantum chemistry is a field of science that has undergone unprecedented advances in the last 50 years. From the pioneering work of Boys in the 1950s, quantum chemistry has evolved from being regarded as a specialized and esoteric discipline to a widely used tool that underpins much of the current research in chemistry today. This achievement was recognized with the award of the 1998 Nobel Prize in Chemistry to John Pople and Walter Kohn. As the new millennium unfolds, quantum chemistry stands at the forefront of an exciting new era. Quantitative calculations on systems of the magnitude of proteins are becoming a realistic possibility, an achievement that would have been unimaginable to the early pioneers of quantum chemistry. In this article we will describe ongoing work towards this goal, focusing on the calculation of protein infrared amide bands directly with quantum chemical methods.

## 1. Introduction

The ultimate goal of quantum chemists is to be able to model any chemical problem with an accuracy comparable with (or greater than!) experimental measurements without any *a priori* information except a few fundamental constants. While this aim may appear unrealistic, it is in principle possible, and in fact the mathematical framework for achieving it is well established. The reason why the fumehoods and lasers in chemistry departments have not yet been removed and replaced with computers is the computational cost of these calculations soon becomes prohibitive as the size of the system of interest increases. This places limits on the type of problem that can be studied. The simplest quantum chemical method Hartree–Fock (HF) theory scales formally as *O*(*n*^{4}), where *n* is a measure of the molecular size. In practice, this *O*(*n*^{4}) scaling is reduced through the use of integral cut-offs. For example, a typical HF calculation on the benzene molecule would take a few seconds on a modern computer. A similar calculation on the C_{60} molecule would take a few hours and a calculation on a model of a (15, 0) nanotube approximately 31 Å long would take days. However, HF theory does not achieve the desired accuracy. Methods that approach the desired accuracy, such as coupled cluster theory, scale as *O*(*n*^{6}) or greater. With these methods, calculations on a C_{60} molecule and nanotube would take months and years, respectively. This is further exacerbated since many properties require multiple energy evaluations or derivatives of the energy to be computed.

The introduction of density functional theory (DFT; Hohenberg & Kohn 1964) revolutionized quantum chemistry, since it predicts accurately a wide range of molecular properties at a relatively low computational cost. This vastly increased the breadth of problems that can be studied, and quantum chemistry is now used routinely in many branches of chemistry. Despite some well-known deficiencies, DFT is by far the most widely used quantum chemistry method. Currently, there are a number of interesting developments in wave function-based approaches, such as local correlation and explicitly correlated methods (Klopper *et al*. 2006), which greatly reduce the cost of these methods to become competitive with DFT. However, to match the versatility and efficiency of DFT remains a difficult challenge.

Despite these advances, the application of quantum chemistry to calculate the properties of large biomolecules like proteins directly has seemed a remote possibility. This is particularly frustrating since quantum chemistry could contribute to solving many of the major problems in molecular biology. Currently, most information about the structure and properties of proteins has been gleaned from classical force fields and molecular dynamics simulations. These approaches also have their limitations. The accuracy relies on the parametrization of the force fields and classical methods cannot describe chemical processes, such as making or breaking bonds. One solution is the QM/MM approach, which is a hybrid method in which a small fraction of the system of particular interest is treated using quantum mechanics, for example, DFT, with the remainder treated at the molecular mechanics (classical) level. The success of the QM/MM framework relies on being able to identify a small region of the system which is of chemical importance, for example, an active site. In computing protein spectroscopy, large regions or many small distinct areas of the protein may be important, and the application of QM/MM methods is problematic. Furthermore, derivatives of the energy may be required. While the availability of increasingly powerful computers will aid the application of quantum chemical methods to biological problems, it does not represent a solution and a continued effort to develop new approaches to tackle such problems is required.

## 2. Amide I band

Despite being one of the most widely studied problems in molecular biology, the folding of a protein into its unique native state remains a poorly understood process. Recent developments in infrared (IR) spectroscopy have begun to provide a glimpse into the mechanism of protein folding (Gilmanshin *et al*. 1997). The amide I band arises from the stretching of the carbonyl bond, and results in a distinct profile between 1600 and 1700 cm^{−1} which reflects the secondary structure of the protein. In general, α-helices are characterized by a strong band at 1650–1660 cm^{−1}. β-Sheets show peaks at low (1620–1640 cm^{−1}) and high (1680 cm^{−1}) frequencies. Turns show peaks at 1670 and 1685 cm^{−1} and disordered structure exhibits a peak at 1640–1650 cm^{−1}. Figure 1 shows amide I bands of the predominantly α-helical protein carbonmonoxyglobin (PDB code 1MBC) and the mixed α-helix and β-sheet protein flaxodoxin (PDB code 5NLL).

The early events in protein folding occur on the nanosecond to microsecond time scale. The introduction of temperature jump (T-jump), techniques with IR spectroscopy (Callender *et al*. 1998; Dyer *et al*. 1998) and the discovery of fast-folding proteins (Myers & Oas 2001) has allowed these events to be studied directly. In the T-jump technique, initial conditions are chosen to establish an equilibrium between native and denatured forms. A subsequent increase in temperature perturbs the equilibrium towards a more folded or unfolded state. From these experiments, information on folding and unfolding changes on the nanosecond time scale can be gleaned (Gilmanshin *et al*. 1997). Two-dimensional IR spectroscopy is a relatively new technique that provides another avenue of investigation (Hamm *et al*. 1998). The principle of two-dimensional IR is similar to two-dimensional NMR, but applied to the amide I mode instead of spin transitions. Two-dimensional IR provides additional information to the one-dimensional measurements. This information is contained in the off-diagonal peaks which appear when the corresponding resonances are coupled. Since this coupling is determined by the geometry of the peptide chain, the conformation of the peptide backbone can be probed (Woutersen & Hamm 2002*a*). Two-dimensional IR has been used to study the structure of small peptides and, recently, the breaking of a hydrogen bond and opening of a β-turn have been monitored (Kolano *et al*. 2006). To exploit these experiments fully requires a direct link between the spectroscopy and the atomistic detail of the molecular conformation. In order to bridge this gap, theoretical models are required which can decipher the structural fingerprint contained within the spectral profile.

## 3. Computation of protein amide I bands

Currently, calculations of the amide I band of proteins adopt a simplified model of the amide I vibration where the carbonyl group is treated as an oscillating dipole. A Hamiltonian matrix is constructed where the diagonal elements are given by the frequencies of the individual amide oscillators and the off-diagonal elements describe the coupling between the different oscillators. The coupled amide I modes of the protein are obtained from diagonalization of this Hamiltonian matrix. The success of this approach relies on the parametrization or calculation of the matrix elements. The amide I site frequencies have been correlated with the electrostatic potential (Ham *et al*. 2003*a*) and electric field (Schmidt *et al*. 2004) at the atoms of the amide group. The off-diagonal elements are computed within the framework of the transition dipole coupling (TDC) method (Krimm & Abe 1972; Torii & Tasumi 1992). More recent changes to the model have been the inclusion of higher-order multipole moments in the transition charge coupling model (Moran & Mukamel 2002; Woutersen & Hamm 2002*b*), the Hessian matrix reconstruction method (Choi *et al*. 2003; Ham *et al*. 2003*a*,*b*) and nearest neighbour coupling maps to account for short range through bond interactions (Jansen *et al*. 2006). These approaches have been applied, most notably by Cho and co-workers, successfully to study a wide range of problems (e.g. Choi *et al*. 2002, 2003, 2007). The contribution of TDC to the coupling in amide I vibrations in model peptides has been assessed (Kubelka *et al*. 2006). The results showed that TDC alone cannot account for the vibrational interactions. The amide I band of large polypeptides has also been studied using the property transfer algorithm in which parameters from DFT calculations on small systems are transferred to larger structurally related molecules (Bour *et al*. 1997, 2002; Kubelka & Keiderling 2001).

Despite these successes, it remains desirable to compute protein IR spectroscopy directly using quantum chemical methods. The current floating oscillator approaches are based on a simplified model of the amide vibration. The subsequent process of improving its accuracy largely involves reintroducing some of the physics lost by the original model, or more elaborate parametrization schemes. Ultimately, it can become difficult to achieve greater accuracy. Direct quantum chemical calculations represent a very different approach to the problem, which is based on making a number of approximations to the *exact* model in order to make the calculations computationally tractable. These approximations can be quantified, and relaxed in the future when computational resources allow. Through these calculations, the individual environment of the amide group and through bond couplings are described rigorously, leading to quantitatively accurate amide I band profiles. Furthermore, the amide bands can be reliably resolved with respect to the different secondary structure elements yielding a direct probe of the relationship between the spectral band and the underlying secondary structure. However, these calculations represent a formidable challenge to quantum chemistry, and for many years have been considered unfeasible.

## 4. Quantum chemical vibrational analysis

The vast majority of quantum chemical calculations of IR spectroscopy exploit the harmonic approximation. Within the harmonic approximation, vibrational modes (*u*_{i}) and frequencies (*λ*_{i}) are obtained from the eigenvectors and eigenvalues of the Hessian matrix (** H**) in mass-weighted coordinates(4.1)where the Hessian matrix contains the second derivatives of the electronic energy with respect to the nuclear coordinates:(4.2)For a molecule comprising

*N*atoms, the Hessian matrix has dimensions 3

*N*×3

*N*. The IR intensities are evaluated through the derivative of the dipole moment with respect to the normal coordinates. The advantage of the harmonic approach is that only second derivatives are required, while higher-order derivatives are necessary for anharmonic frequencies. The error in the computed harmonic frequencies can be large. Within DFT, the computed harmonic frequencies are often in good agreement with experiment either fortuitously or by design (Lin

*et al*. 2004). It is also common to scale harmonic frequencies to give better agreement with experiment (Scott & Radom 1996).

The majority of the computational effort in evaluation of harmonic frequencies is the computation of the energy second derivatives. There are two general approaches to their computation. The derivatives can be evaluated numerically within a finite difference approach, which simply requires multiple single point energy evaluations. While computationally simple, this approach is prohibitively expensive for even moderate size molecules due to the very large number of energy evaluations required. Alternatively, for analytical derivatives the energy expression is formally differentiated and the resulting expression evaluated directly. The analytical evaluation of the energy second derivatives can be achieved efficiently via the coupled-perturbed Hartee–Fock (or coupled-perturbed Kohn–Sham) equations using the iterative procedure introduced by Pople *et al*. (1979). These schemes have been implemented in many of the modern quantum chemistry software packages. Despite these advances, the computation of vibrational frequencies for molecules with over 100 atoms remains difficult.

In recent years, one of the major advances in computing hardware is the availability of multiple processor machines. Quantum chemists have been quick to exploit this technology, and the parallelization of the analytical derivative code within the Q-CHEM software has been described in detail (Korambath *et al*. 2002). In this scheme, the most time-consuming steps, namely the coupled-perturbed Kohn–Sham calculation and contraction of the density matrix with the electron repulsion integrals are distributed. The only replicated step is the evaluation of the electron repulsion integrals, in particular, the second derivatives of these integrals. For standard harmonic frequency calculations, this replicated step represents a relatively small part of the total workload.

A number of groups have reported harmonic frequency calculations on large polypeptides. For example, DFT calculations of the frequencies of a chain of 10 formamide molecules and a 10 residue polyglycine chain to study through bond and through space vibrational couplings have been reported (Kobko & Dannenberg 2003). Keiderling and and co-workers have reported DFT calculations on model β-sheet and α-helical fragments with up to 11 residues (Kubelka & Keiderling 2001; Bour *et al*. 2002; Huang *et al*. 2004; Kubelka *et al*. 2006). The IR spectroscopy of the penta-peptide [Leu]enkephalin has been computed using DFT (Watson & Hirst 2004). These calculations are quantitatively accurate and can be used to probe the conformation of the peptide. Larger peptides can be studied using semi-empirical methods, and α-helices (Ham *et al*. 2004) and antiparallel β-sheets (Lee & Cho 2004) have been studied using the AM1 method.

## 5. Partial Hessian approximation

In order to study the IR spectroscopy of proteins or large polypeptides, it is necessary to make further approximations to the harmonic frequencies. One feature of the amide I band is that it is largely localized on the backbone carbonyl groups. We recently exploited this property of the amide I mode in a partial Hessian approach to computing amide I bands (Besley & Metcalf 2007). Constructing a Hessian matrix comprising only the derivatives of the energy with respect to the displacement of the amide carbon, oxygen and nitrogen atoms introduced an error of approximately 10 cm^{−1} from the full Hessian values for a test set of polypeptides. Furthermore, the corresponding band profiles were reproduced well. Figure 2 shows amide I band profiles for three penta-peptides computed using full and partial Hessians using DFT with the EDF1 exchange–correlation functional (Adamson *et al*. 1998) and 6-31G^{*} basis set. The partial Hessian approximation neglects the coupling between the vibrations of the amide I mode and the remaining vibrational modes of the polypeptide. However, since the derivatives computed correspond to the whole system, effects such as hydrogen bonding (Besley 2004*a*) are included explicitly. An alternative physical picture of the approximation is that the atoms not included within the partial Hessian have an infinite mass, but since they are not moving in the amide I vibration the error in the computed frequencies is small.

This approximation reduces the dimensions of the Hessian matrix from 3*N*×3*N* to 9*N*_{res}×9*N*_{res}, where *N*_{res} is the number of residues (or backbone carbonyl groups). We have implemented this partial Hessian methodology within the analytical frequency code of a developmental version of the Q-Chem software package (Kong *et al*. 2006). The partial Hessian approximation leads to substantial computational savings. For the penta-glycine, penta-alanine and penta-valine calculations shown, the approximation reduces the calculation time by approximately 35, 48 and 41%, respectively. This saving grows as the number of residues and size of the side chains grows. For, polypeptides comprising residues with large side chains, such a tryptophan, the calculation time can be reduced by over 90% (Besley & Metcalf 2007).

## 6. Integral cut-offs

The partial Hessian approximation in conjunction with the parallelization of the analytical derivatives code means that the time for calculations of very large polypeptides can be greatly reduced. For DFT within the partial Hessian framework, the remaining bottleneck in the calculations is the computation of the second derivatives of the Coulomb and exchange–correlation integrals. In the current parallel implementation, this work is not distributed. For standard full Hessian calculations, this is not a major overhead. However, in the partial Hessian framework, it becomes significant in terms of computational time and memory required. Consequently, it is necessary to make the evaluation of the second derivatives of the Coulomb integrals and integrals over the exchange–correlation functional as efficient as possible. The algorithms for evaluating these integrals are different, and the integrals over the exchange–correlation functional use quadrature. In standard quantum calculations, integral cut-offs or thresholds are used routinely to assess *a priori* whether an integral will make a significant contribution to the energy or property of interest. Non-significant integrals can be neglected, greatly reducing the cost of the calculation. Since we are interested in a specific vibrational mode and are neglecting the coupling of this mode with the majority of other vibrational modes, we have re-examined the choice of integral thresholds.

Q-Chem has an established process of integral screening controlled by threshold parameters. For our modified cut-offs, the cut-off for the neglect of the Coulomb integrals is reduced from the default value of 11 to 4. It should be noted that for HF calculations in which this parameter controls screening of Coulomb and exchange, such a low value is not appropriate. The corresponding parameter for the exchange–correlation integrals is also reduced from 11 to 4. In addition, the integral threshold for the coupled-perturbed Kohn–Sham procedure is reduced from 6 to 4. These modified cut-offs are only used in determining the Hessian and not in the preceding DFT calculation to determine the energy and Kohn–Sham orbitals. For more details about integral screening in Q-Chem, the reader is referred to the Q-Chem user manual (www.qchem.com).

Table 1 shows the effect of the relaxed integral thresholds on the computed amide I vibrational modes of four penta-peptides. The size of the error introduced in both the computed frequencies and intensities is small. In particular, it is much less than the error introduced through the partial Hessian approximation and will have a negligible effect on the computed band profiles. The reduction in the calculation time compared with the partial Hessian calculation with standard integral cut-offs is substantial, ranging from 25 to over 40%. These timings are for the serial implementation of the code, and the corresponding values for a parallel run will be greater. In conjunction with the time saved, there is also a reduction in the memory required for the calculation. This allows the amide I band of much larger polypeptides to be studied with DFT.

## 7. Illustrative calculations

The following calculations illustrate the type of system that can be studied with the new partial Hessian implementation, and are performed using DFT with the EDF1 exchange–correlation functional. Harmonic frequencies should be computed at a stationary point. However, full geometry optimization of protein fragments in the gas phase results in the fragments losing their secondary structure identity. In our calculations we adopt a compromise. Since we are working within the partial Hessian approximation, the structures are optimized under the constraint that the position of all atoms except the carbonyl carbon and oxygen atoms are fixed. Earlier work by other authors has also used restricted geometry optimization to study the amide I band of polypeptides (Kubelka & Keiderling 2001). Spectra are generated by representing each vibrational mode with a Gaussian function with full width at half maximum of 10 cm^{−1}.

### (a) Model helices

Figure 3 shows computed amide I bands for model 3_{10} polyglycine helices with 12 and 16 residues and the 6-31G^{*} basis set. The helices were constructed with initial *ϕ* and *ψ* angles of −48, −26° and −58, −26°. The amide I band of an 11-residue polyanaline helix has been computed using DFT with the 6-31G^{*} basis set. This calculation has a total of 930 basis functions and was reported to take approximately 34 CPU days (Kubelka *et al*. 2006). The 16-residue helix calculation considered here comprises 1060 basis functions and takes less than 3 days, which is a very small fraction of the time for an equivalent full Hessian calculation. For all helices, the amide I band is dominated by an intense band centred between 1700 and 1720 cm^{−1}, which occurs at a lower frequency for the longer helix. Earlier work has shown this band to arise from an in-phase C=O stretch delocalized over many carbonyl groups (Besley & Metcalf 2007). The location of this band agrees well with previous full Hessian calculations. The discrepancy with the location of helical band in proteins is likely to be due to the presence of the remaining protein and solvent. The shapes of the computed amide I bands differ between the *ϕ*=−48° and *ϕ*=−58° helices. For the *ϕ*=−48° helices, distinct bands are observed on both high- and low-frequency sides of the central band, while for the *ϕ*=−58° helices a distinct shoulder is only observed on the high-frequency side. While direct comparison with calculations on the polyanaline helix cannot be made, it is encouraging to note that the spread of frequencies of the amide I modes computed with the partial Hessian method is closer to the full Hessian calculations than those obtained with TDC approaches (Kubelka *et al*. 2006).

### (b) Agitoxin 2 (1AGT)

The protein agitoxin 2 has three distinct secondary structure elements, an α-helix, β-sheet and a turn (figure 4). The largest fragment is the β-sheet which comprises 190 atoms and is significantly larger than the model 3_{10} helices studied in the previous section. Consequently, the 3-21G basis set has been used. However, in the future we hope to use a more substantial basis set to study this system. The computed spectra for the secondary structure elements are in good agreement with the characteristic features observed in experiment. The computed spectrum of the α-helix has the typical spectral profile of an α-helix. It has an intense broad band which is centred at 1642 cm^{−1}, a smaller weaker band is also evident at approximately 1600 cm^{−1}. The computed spectrum for the β-sheet fragment shows two distinct bands at 1639 and 1682 cm^{−1}. This is consistent with experiment which shows β-sheets to contribute to bands at high and low wavelength. The computed spectrum for the turn has two peaks at similar wavelength to the β-sheet, but with weaker intensity. There are also two further peaks at higher (over 1700 cm^{−1}) and lower (less than 1600 cm^{−1}) frequencies. There are some qualitative differences between the present spectra and those presented in an earlier study (Besley & Metcalf 2007). Firstly, in the current work DFT is used and this makes it unnecessary to shift the computed spectra to agree with experiment. Secondly, the previous work computed the spectra directly for the PDB structure and omitted the geometry optimization step. This results in some significant differences in the spectra, in particular for the β-sheet.

## 8. Other spectroscopies

Other spectroscopic techniques are also commonly used to probe protein secondary structure. Electronic circular dichroism (CD) spectroscopy measures the differential absorption of circularly polarized light. Similar to the amide I band, CD spectra of proteins are characteristic of secondary structure, in particular, the intensity at 220 nm correlates with the helical content of the protein. Furthermore, CD spectroscopy can also be measured on the nanosecond time scale (Zhang *et al*. 1993). Current calculations of protein CD are based on the so-called matrix method which relies on parameters drawn from excited state calculations on small model amides (Besley & Hirst 1998, 1999; Hirst & Besley 1999). Direct calculation of the electronic excited states and CD spectra of large polypeptides using time-dependent DFT should be possible with current computing resources. However, extension to study an entire protein is unlikely to be possible for some time. One solution may be to focus on the CD spectrum of a secondary structure element in the environment of the surrounding protein. Simplified models of the remaining protein, such as point charges, may be adopted, or alternatively the protein may be included within the time-dependent DFT excited state calculation (Besley 2004*b*).

## 9. Conclusions and outlook

In the future, quantum chemical calculations will play a prominent role in the study and understanding of biological processes. This will be fuelled by the increasing availability of high-performance computing resources in conjunction with the development of improved theoretical models and algorithms. One of the most important problems in molecular biology is the folding of a protein into its unique native state. Recent advances in spectroscopic techniques that can be measured on a time scale appropriate to probe the early events in protein folding can provide valuable insight into this problem. However, to exploit these experiments fully, requires a synergy between experiment and theory. Accurate calculations from first principles can provide a connection between the experimental spectra and the underlying molecular structure.

Computing quantitatively accurate spectroscopy of proteins is a difficult challenge for quantum chemistry. For the case of the IR amide I band, exploiting the local nature of the vibration through a partial Hessian approximation in conjunction with optimized integral cut-offs can reduce greatly the cost of DFT calculations of the amide I band profile. Using this approach, much larger polypeptides can be studied, and we have shown that calculations of model 3_{10} helices and fragments of the protein agitoxin 2 reproduce the characteristic features observed in experiment. The direct calculation of the spectroscopy of large molecules on the scale of proteins is an area of research that is likely to develop rapidly in the near future. We are currently working on the calculation of much larger systems. The amide I band will also be affected by solvent, and we are currently exploring the inclusion of solvent within the calculations. It is becoming possible to perform MD simulations for much longer time scales, and it is possible to envisage the combination of such simulations with accurate calculations of the amide I profile to simulate the spectroscopy of a dynamical process. In summary, quantum chemical calculations of proteins and their spectroscopy are becoming increasing feasible and hold great promise to provide a platform to improve our understanding of biological processes.

## Acknowledgments

The author is grateful to the Engineering and Physical Sciences Research Council for the award of an Advanced Research Fellowship (GR/R77636), and to the University of Nottingham for access to its high-performance computing facility.

## Footnotes

One contribution of 20 to a Triennial Issue ‘Chemistry and engineering’.

- © 2007 The Royal Society