## Abstract

A model has been designed to predict the phase which forms in water for a non-ionic surfactant, at a given concentration and temperature. The full phase diagram is generated by selecting enough data points to cover the region of interest. The model estimates the probability for each one of 10 possible phases and selects the one with the highest likelihood. The probabilities are based on the recursive partitioning of a dataset of 10 000 known observations. The model covers alkyl chain length and branching, ethoxylate head length and number, and end capping of one or more of the ethoxylate chains. The relationship between chemical structure, shape and phase behaviour is discussed.

This article is part of the themed issue ‘Soft interfacial materials: from fundamentals to formulation’.

## 1. Introduction

The relationship between non-ionic surfactant chemical structure and liquid crystal phase behaviour has previously been studied by the use of a carefully designed array of pure surfactants [1,2]. This work showed that the phase behaviour depended upon chemical structure and provided a dataset that included hydrocarbon chain length and branching in addition to ethoxylate head length, for single and double head groups. The use of a packing parameter to describe the shape of each surfactant was supported by measured data and calculated results for the full array of surfactants [1–3].

The prediction of surfactant phase behaviour is an important target for formulation engineering; however, the modelling of surfactant properties has proven to be rather difficult, even for the simpler properties such as critical micelle concentration (CMC), hydrophilic-lipophilic balance (HLB) and interfacial tension [4]. On the face of it, phase prediction should be possible if the free energy of each possible phase can be estimated, the phase with the lowest energy being the one most likely to form [5]. The thermodynamics of liquid crystal phase behaviour, however, is not simple [6]. Reviews of methods such as simulation [7] show how computationally intensive these methods are, and how they are limited by the need to define or find suitable field equations.

These approaches are not ideally suited to the needs of industry where surfactants are used extensively in the design of new products, and where the ability to rapidly and easily generate phase diagrams would be of great value.

The use of black box methods such as neural networks to predict surfactant phase behaviour [8,9] has indicated that working models can be developed. At the present time, however, there does not appear to be a suitable model for the generation of non-ionic surfactant phase diagrams in water.

In this work, we set out to design a simple method for the generation of phase diagrams for ethoxylated surfactants that contain hydrocarbon tails and ethoxylated head groups. The idea was to use the packing parameter [10] as a description of the surfactant and to use this with temperature and concentration to allow the most likely phase from simple probability theory to be estimated. Knowledge of the phase at each point in the phase diagram would then allow the full phase behaviour to be determined. The packing parameter is derived from the shape of three parts of the surfactant, and so a method of estimating this from simple atom counts on the tail and head groups would lead to a simple procedure.

The structure of a pure surfactant is relatively easy to define; however, most commercial surfactants are polydisperse and therefore contain many structures. In this work, pure surfactants were used to generate the model, although the eventual goal would be to work towards polydisperse samples.

Polydisperse surfactants have been shown to produce the same phase diagrams as monodisperse surfactants [11]; however, the presence of ethoxylate distributions, which are skewed towards the lower end, and which contain excess alcohol, effectively reduces the mean EO number on the surfactant. The polydisperse surfactants studied, which had a carbon chain length of 12, acted like monodisperse samples with one EO unit fewer than expected.

One of the benefits of generating a model would be to examine how easily polydisperse samples could be described.

## 2. Material and methods

### (a) Datasets of surfactant shape

The tail length, tail volume and the area occupied by the head group of 39 monodisperse surfactants were taken from the literature [1–3,12–16]. Melting point values for the hexagonal phase of 27 surfactants were taken from the literature [1]. As there was overlap between the AL/V values and H1 melting point values in only 10 cases, some simple models were developed to calculate both of these properties. The models were developed by multiple linear regression using JMP software [17]. The length and volume of the tail fragments of each surfactant were calculated using the methods of previous authors [1]; however, an equation to calculate the area of the head group was not available. A simple correlation was therefore developed.

The calculated values of the head area were incorporated into the AL/V formula which, in turn, allowed the packing parameter of new surfactants to be estimated.

Table 1 shows the list of surfactants used, the literature values for AL/V and the calculated values for the new model.

### (b) Phase diagram data

Phase diagram data were collected from the literature. Each diagram was converted into an array of 441 data points representing steps of 5% w/w across the concentration range from 0 to 100% and temperature steps of 5°C from 0 to 100°C. Each combination of surfactant, concentration and temperature was assigned one of 10 possible phases. The phases were solid (s), spherical micelles (L1), inverse spherical micelles (L2), lamellar sheets (Lα), cubic phase (I1), hexagonal phase (H1), bicontinuous cubic phase (V1), inverse bicontinuous (V2), miscible (M) and a separated phase that for convenience included several options such as L1, L2 or Lα with water (water+L).
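The gridding described above can be sketched in a few lines of Python. This is only an illustration of the point count; the names `make_grid`, `empty_diagram` and `PHASES` are hypothetical and are not taken from the original work.

```python
# Illustrative grid only; all names here are assumptions, not the author's code.
CONCENTRATIONS = list(range(0, 101, 5))   # % w/w, 0 to 100 in steps of 5
TEMPERATURES = list(range(0, 101, 5))     # degrees C, 0 to 100 in steps of 5

# The 10 phase labels listed in the text ("La" stands for the lamellar phase).
PHASES = ["s", "L1", "L2", "La", "I1", "H1", "V1", "V2", "M", "water+L"]


def make_grid():
    """Return every (concentration, temperature) pair on the 5 % / 5 degree grid."""
    return [(c, t) for c in CONCENTRATIONS for t in TEMPERATURES]


def empty_diagram():
    """Map each grid point to a phase label, initially unassigned (None)."""
    return {point: None for point in make_grid()}


grid = make_grid()
print(len(grid))        # 21 x 21 grid points per surfactant
print(len(grid) * 23)   # total grid points for 23 surfactants
```

Each surfactant contributes 21 × 21 = 441 points, which is consistent with the total of 10 143 points quoted for 23 surfactants.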

The total number of data points for the 23 surfactants (table 1) was 10 143. Phase diagram data for the polydisperse samples oleyl E5, oleyl E10.7, oleyl E19.2 were also collected; however, these data were not used in the development of the model.

### (c) Model generation and statistics

The observed surfactant phase was modelled using recursive partitioning of the input data AL/V, concentration and temperature. The AL/V values were those generated from the calculation of AL/V with the modified results for the Y-shaped surfactants as shown in table 1. The data were fitted with fivefold cross-validation in the commercial software package JMP [17]. As the model predicts the most likely phase from the calculated probability of 10 options, the output was assessed using mosaic plots for each surfactant. This compares the known phase with the generated phase and gives equal weighting to a success or failure, thus allowing an *R*^{2} value to be assigned to the generated diagram for each surfactant, or the overall model. The *R*^{2} value for the overall model was 0.85. A mosaic plot is shown in figure 1.

Recursive partitioning generates a series of splits in the dataset by comparing groups of data separated by one of the input parameters. As each split is generated, the model improves in *R*^{2}. Cross-validation is carried out to prevent overfitting of the data. The progress of the model can be seen in figure 2 which shows a plot of *R*^{2} with number of splits.
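As a rough illustration of recursive partitioning on the three inputs (AL/V, concentration, temperature), the sketch below grows a small classification tree in plain Python. It is not the JMP implementation: the Gini impurity criterion, the depth limit and all function names are assumptions chosen for brevity, the toy data are invented, and cross-validation is omitted.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) that most reduces impurity, or None."""
    best, best_score, n = None, gini(labels), len(rows)
    for j in range(len(rows[0])):
        # Skip the smallest value so that both sides of a split are non-empty.
        for threshold in sorted({r[j] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[j] < threshold]
            right = [y for r, y in zip(rows, labels) if r[j] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (j, threshold), score
    return best


def grow(rows, labels, depth=0, max_depth=3):
    """Split recursively; each leaf stores the observed phase proportions."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        n = len(labels)
        return {"probs": {k: v / n for k, v in Counter(labels).items()}}
    j, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] < threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] >= threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the if/then rules to a leaf and return its most probable phase."""
    while "probs" not in tree:
        side = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[side]
    return max(tree["probs"], key=tree["probs"].get)


# Invented toy data: (AL/V, concentration % w/w, temperature C) -> phase label.
rows = [(1.2, 10, 20), (1.2, 10, 80), (1.2, 60, 20), (1.2, 60, 80),
        (2.0, 10, 20), (2.0, 10, 80), (2.0, 60, 20), (2.0, 60, 80)]
labels = ["L1", "water+L", "H1", "water+L", "L1", "L1", "H1", "H1"]
tree = grow(rows, labels)
print(predict(tree, (1.2, 60, 20)))
```

Each internal node is an if/then rule on one input, and each leaf holds the proportion of every phase observed in that group, which is the quantity used to select the most likely phase.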

As an example of the use of the model the predicted and measured phase diagrams for the methyl end capped 10 ethoxylate of diheptyl glycerol are presented in figure 3.

## 3. Results and discussion

### (a) Molecular description

The use of a critical packing parameter was the first choice for the development of a model. The phase diagrams can be placed in order by analysing the shapes of each phase and comparing them with one another. For example, the melting point of the H1 transition, the cloud point and the L-alpha phase transitions are correlated with each other and also with the ethoxylate number of the linear surfactants. The literature provided 31 values for the H1 melting point transition and 26 values of measured AL/V; however, there were only a few examples where these were common to the same surfactant. For this reason, a model was created to predict AL/V, so that comparisons within the dataset could be made. The model estimates the area of the head group using simple atom number counts as input parameters. It was developed using standard least-squares methods. Leverage plots based on linear regression were used to assess the fit between the data and the model. Fisher *F*-statistics were used at a level less than 5% for both the individual parameters and the overall model to check for correlation between these and the response. The correlation for the head group area was:

5.66 − (16.1 × number of carbon chains) + (14.7 × number of tail glyceryl) + (10.6 × number of EO chains) + (2.5 × tail length) − (18.7 × number of tail C=C) + (2.1 × number of EO groups) + 1.9 × (number of EO groups − (3 × end cap carbon atoms)).
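The correlation quoted above can be evaluated directly. The sketch below is a minimal transcription of its coefficients; the function and argument names are hypothetical, the units of the result are not stated in the text, and treating tail length as a carbon count and AL/V as the product A × L divided by V are assumptions based on the wording here rather than confirmed definitions.

```python
def head_group_area(n_carbon_chains, n_tail_glyceryl, n_eo_chains,
                    tail_length, n_tail_double_bonds, n_eo_groups,
                    n_end_cap_carbons):
    """Evaluate the head group area correlation with the quoted coefficients."""
    return (5.66
            - 16.1 * n_carbon_chains
            + 14.7 * n_tail_glyceryl
            + 10.6 * n_eo_chains
            + 2.5 * tail_length
            - 18.7 * n_tail_double_bonds
            + 2.1 * n_eo_groups
            + 1.9 * (n_eo_groups - 3 * n_end_cap_carbons))


def al_over_v(area, length, volume):
    """Shape factor AL/V, assumed here to be area * length / volume."""
    return area * length / volume


# Hypothetical linear surfactant: one C12 chain, one uncapped chain of 6 EO groups.
area = head_group_area(
    n_carbon_chains=1, n_tail_glyceryl=0, n_eo_chains=1,
    tail_length=12, n_tail_double_bonds=0, n_eo_groups=6,
    n_end_cap_carbons=0,
)
print(round(area, 2))
```

The tail length and tail volume needed by `al_over_v` would come from the literature methods cited in the text, which are not reproduced here.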

Head group areas estimated from this model were used to calculate new values for AL/V and compared with literature values using a line of best fit.

A comparison between predicted and measured AL/V results can be seen in figure 4.

If we now plot AL/V against the measured H1 melting point transition, we see that there is a very strong correlation for the linear and V-shaped surfactants. Unfortunately, the Y-shaped surfactants are part of a different correlation, although one that appears to be linear. There are only six examples of Y-shaped surfactants on this graph; however, they appear to have shape factors that are 0.824 higher than the linear or V-shaped surfactants. If we subtract this value from the calculated AL/V estimates for the Y-shaped surfactants, then all of the surfactants fall within the confidence limits of the linear/V-shaped surfactant plot.

For modelling purposes, the adoption of modified values for the Y-shaped surfactants was necessary in order to develop a predictive model. The measured values of the head group areas for the Y-shaped surfactants were made either by X-ray diffraction or by analysis of interfacial tension data. There is no obvious reason why the Y-shaped surfactants should be on a different slope from the linear ones (figure 5).
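The adjustment for Y-shaped surfactants amounts to a constant offset. A minimal sketch, assuming the shape class of each surfactant is known in advance (the function name and the shape labels are hypothetical):

```python
Y_SHAPE_OFFSET = 0.824  # offset quoted in the text for Y-shaped surfactants


def modelling_al_v(calculated_al_v, shape):
    """Return the AL/V value used for modelling.

    shape is one of "linear", "V" or "Y" (labels chosen for this sketch).
    Only Y-shaped surfactants are shifted; the others pass through unchanged.
    """
    if shape not in {"linear", "V", "Y"}:
        raise ValueError(f"unknown shape: {shape!r}")
    if shape == "Y":
        return calculated_al_v - Y_SHAPE_OFFSET
    return calculated_al_v


print(modelling_al_v(2.0, "linear"))
print(modelling_al_v(2.0, "Y"))
```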

### (b) Modelling method

Historically, it has proven difficult to model non-ionic surfactant phase diagrams. The energy difference between phases is generally small, and transitions are initiated by concentration differences of a few per cent or temperature differences of a few degrees. For this reason, it is inherently difficult to succeed with molecular modelling. Phase boundaries are complex curves and their shape is not easy to either generalize or describe mathematically. Nonlinear methods such as neural networks are also limited by the on/off nature of many of the transitions.

It has, however, been shown that the behaviour of the phases is governed by shape and that there are distinct trends that can be mapped as the surfactants change in their critical packing parameter [18]. This indicated that a rule-based system of modelling, such as recursive partitioning, was the obvious way to approach the problem. It is well known that the behaviour of surfactants can be described by various heuristic statements. A rule of thumb can often be transcribed into an if/then statement, and so the problem of modelling phase diagrams potentially could be reduced to a series of these rules.

As an example, it is known that cloud points increase as the ethoxylate head length increases and decrease as the tail length increases. If the cloud point is known for a given surfactant, then it can be surmised that surfactants with the same tail but shorter head groups will also be above their cloud points at and above the given temperature. Similarly, any surfactant with a longer head group will not be above its cloud point at any temperature below the given cloud point. The data in a phase diagram therefore naturally split into easily separable groups. Simple statistics can be used to find the groups and define the conditions that apply. The method chosen here was to use recursive partitioning to separate the 10 000 point dataset into groups which were described as if/then statements. As an example, figure 6 shows the separation of the complete set into two groups, in this case split by the temperature of the sample. In one group, the samples are all above 55°C and in the other, they are below this temperature. The high temperature samples, as would be expected, have a high incidence of samples above their cloud points: 69.9% of the samples are in this state. In the group below 55°C, the most common state is an L1 phase (42%).

As the samples are further split, the probability of each phase being present in each group is determined. The probabilities can then be extracted and used to develop a formula for each phase. These formulae can be used for predictions of new surfactant behaviour.

The model was constructed with 436 such splits.
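The step from a split to a prediction can be illustrated with a single temperature rule. The sketch below computes the phase proportions on each side of a 55°C threshold and picks the most probable phase; the data points are invented, the function names are hypothetical, and assigning points at exactly 55°C to the upper group is an assumption.

```python
from collections import Counter


def phase_probabilities(labels):
    """Proportion of each phase label within one group of observations."""
    n = len(labels)
    return {phase: count / n for phase, count in Counter(labels).items()}


def split_by_temperature(points, threshold=55):
    """Split (temperature, phase) pairs into groups below and at/above threshold."""
    below = [phase for t, phase in points if t < threshold]
    above = [phase for t, phase in points if t >= threshold]
    return phase_probabilities(below), phase_probabilities(above)


def most_likely(probabilities):
    """Phase with the highest estimated probability in a group."""
    return max(probabilities, key=probabilities.get)


# Invented observations: (temperature in C, observed phase).
points = [(20, "L1"), (20, "L1"), (40, "H1"),
          (60, "water+L"), (80, "water+L"), (90, "L2")]
cool, hot = split_by_temperature(points)
print(most_likely(cool), most_likely(hot))
```

Repeating this on each subgroup, with further rules on concentration and AL/V, gives the nested if/then statements from which the phase probabilities are read.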

### (c) Surfactant shape

Small changes in the shape of the surfactant lead to large changes in the observed phase diagram. This is an important aspect of the physics that controls the system. It creates a challenge for the prediction of phase behaviour, because the description of shape requires a very accurate knowledge of the structure. This model was created entirely from monodisperse samples. Where other workers compared monodisperse with polydisperse samples [11], it was found that the polydisperse sample behaved like a monodisperse sample with one ethoxylate group fewer than expected based on the measured peak of the distribution of chains. Phase diagrams are available for oleyl ethoxylates [19] with a distribution of ethoxylate chains that were prepared from high-purity oleyl alcohol. The phase diagram for the 5E sample can be reproduced reasonably accurately using our model if the chain length is assumed to be 3E. The use of this model to predict polydisperse samples should, however, be carried out with caution as it is not known how to calculate a suitable mean from the distribution of ethoxylate chains. One might expect it to be less critical for longer ethoxylate chains, and this seems to be the case with the oleyl E10.7 sample from the same paper.

If we compare the predicted diagram of oleyl E10.7 with the measured one, then we see that the model has correctly identified all eight of the phases present and also placed them in the correct regions of the diagram (figure 7). The model has been less successful in locating the phase boundaries. This may be a consequence of the shape being in a region where there is a limited amount of modelling data, in addition to the polydispersity of the sample.

### (d) Limitations of the model

Two observations can be made from the success of this model. First, it confirms our belief that surfactant phases are strongly correlated with molecular shape. Second, the packing parameter had to be modified to cope with Y-shaped surfactants. As this model only deals with symmetrical molecules, we might expect to see other similar deviations with unsymmetrical ones, because asymmetry may lead to angular or overlap effects at the interface.

The model does not include groups such as propylene or butylene oxide, benzene rings, etc. It should be straightforward to include these groups in a new model but of course monodisperse phase diagrams would be needed.

There is evidence that polydisperse samples can be represented as a monodisperse equivalent that has the same phase diagram. A major challenge in soft solid research is to explain the behaviour of mixtures of surfactants. Previous reports in the literature [14] indicate that monodisperse samples can be mixed and their behaviour understood by the use of packing parameters. If mixing rules can be devised for monodisperse samples, then models such as these might be the key to extending that knowledge to commercial polydisperse samples.

### (e) Overfitting

Recursive partitioning relies on statistical comparisons between samples. Where large datasets are used there is the possibility of overfitting. As group size increases the likelihood of finding a difference also increases. The use of fivefold cross-validation is helpful but it is worthwhile to examine the influence of each variable. If we remodel the data using only the temperature and concentration as input variables, then the model fit *R*^{2} value reduces to 0.36. Clearly, this is a very low value; however, it was generated without any knowledge of the surfactant structure. Cloud point behaviour tends to occur in the top left of the diagram and solid phases in the bottom right. These generalizations are encoded in the result of 0.36. If we now introduce the surfactant shape as a random value of AL/V, rather than the measured values, then we find that the *R*^{2}-value for a model using integral values between 1 and 4 is 0.46. This improvement in the model compared with the model without any structural data represents overfitting. Fortunately, it is very different to the model with measured values of AL/V which produced an *R*^{2}-value of 0.85.

The introduction of a random number between 0.5 and 4 to describe each surfactant was carried out several times. While this produced a range of results, it was not possible to randomly produce a model that could have been described as predictive. It is not possible to rule out overfitting; however, it is clear that the surfactant shape, as described by AL/V, is critical to the success of the model.
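The randomisation check described above can be outlined as follows. This is a sketch under stated assumptions: `fit_and_score` is a hypothetical callable standing in for refitting the partition model and returning its *R*^{2}, the uniform distribution is assumed, and the surfactant names are placeholders.

```python
import random


def randomise_shape(al_v_by_surfactant, low=0.5, high=4.0, rng=None):
    """Give each surfactant one random AL/V value drawn uniformly from [low, high]."""
    rng = rng or random.Random()
    return {name: rng.uniform(low, high) for name in al_v_by_surfactant}


def control_scores(al_v_by_surfactant, fit_and_score, repeats=5, seed=0):
    """Refit with random shape values several times and collect the scores.

    fit_and_score is a placeholder for the real modelling step; it receives the
    randomised {surfactant: AL/V} mapping and returns a goodness-of-fit value.
    """
    rng = random.Random(seed)
    return [fit_and_score(randomise_shape(al_v_by_surfactant, rng=rng))
            for _ in range(repeats)]


# Placeholder inputs; the real AL/V values are those listed in table 1.
al_v = {"surfactant_1": 1.4, "surfactant_2": 2.1, "surfactant_3": 3.0}
randomised = randomise_shape(al_v, rng=random.Random(1))
print(sorted(randomised))
```

Comparing the scores from such control runs with the score obtained from the real AL/V values indicates how much of the fit depends on the shape descriptor rather than on chance partitions.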

## 4. Conclusion

Surfactant phases for simple ethoxylated surfactants can be predicted from the chemical structure. Phase diagram data contain partitions in the data that can be extracted using simple statistics. This on/off behaviour would be difficult to model using linear methods.

A modification of the packing parameter was required to allow Y-shaped surfactants to be modelled alongside linear and V-shaped surfactants.

Very complicated phase diagrams arise from small changes in simple shapes. While this appears to be a major mathematical problem, it has been shown for a limited dataset that the problem can be resolved. Representing the structure of polydisperse samples should be possible although rules will be required to translate the distribution of chains into an equivalent single structure.

## Data accessibility

Data are available from the author.

## Competing interests

I declare there are no competing interests.

## Funding

All work reported was funded by Syngenta.

## Footnotes

One contribution of 15 to a discussion meeting issue ‘Soft interfacial materials: from fundamentals to formulation’.

- Accepted February 3, 2016.

- © 2016 The Author(s)

Published by the Royal Society. All rights reserved.