# Systematic Sampling

F. Yates

## Abstract

This paper gives an account of the results of an investigation into one-dimensional systematic sampling, i.e. the sampling of sequences of quantitative values by the use of sampling points equally spaced along the sequence. New methods, using what are termed partial systematic samples, are evolved for estimating the systematic sampling error from short sections of sequences of completely enumerated numerical material. This gets over the difficulty, which previously existed, that the only estimates of the systematic sampling error of a numerical sequence, even when completely enumerated, were those provided by the actual deviations of the systematic samples of the whole sequence. Such deviations are few in number and by no means independent. Simple end-corrections are proposed for eliminating the errors, due to trend, which are otherwise inherent in randomly located systematic samples. It is demonstrated that it is impossible to make any fully reliable estimate of the sampling error from the systematic sampling results themselves, though if the continuous components of variation are not too marked, the sum of sets of terms taken alternately positive and negative, with suitable end adjustments, will provide a moderately satisfactory estimate, which will always be an over-estimate provided there are no periodicities. This estimate is substantially better than the customary estimate based on successive differences. In other cases supplementary sampling is required to furnish an estimate of error, and methods are described whereby estimates can be derived from supplementary samples at half-spacing, or at half and quarter spacing. The performance of systematic sampling is investigated theoretically for certain mathematical functions, and also by the numerical analysis of certain numerical sequences. 
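
The end-corrections mentioned above can be illustrated with a small sketch. The abstract does not state the weights, so the form used here is an assumption: it is the adjustment usually attributed to Yates in later textbook accounts, in which the weights of the first and last sample values are shifted by equal and opposite amounts. The function names, the 0-based `start`, and the requirement that the sequence length be a multiple of the spacing are likewise choices made for this illustration only.

```python
from statistics import fmean


def systematic_sample(values, spacing, start):
    """Take every `spacing`-th value, beginning at 0-based index `start`."""
    return values[start::spacing]


def plain_mean(values, spacing, start):
    """Unweighted mean of one systematic sample."""
    return fmean(systematic_sample(values, spacing, start))


def end_corrected_mean(values, spacing, start):
    """Sample mean with the two end points reweighted to cancel a linear trend.

    Assumed form (not quoted from the abstract): with i the 1-based position
    of the first sampling point, k the spacing and n the sample size, the
    first value gains weight (2i - k - 1) / (2(n - 1)k) and the last value
    loses the same amount.
    """
    if len(values) % spacing != 0:
        raise ValueError("sequence length must be a multiple of the spacing")
    sample = systematic_sample(values, spacing, start)
    n = len(sample)
    if n < 2:
        raise ValueError("at least two sampling points are needed")
    i = start + 1  # 1-based position within the first sampling interval
    shift = (2 * i - spacing - 1) / (2 * (n - 1) * spacing)
    return fmean(sample) + shift * (sample[0] - sample[-1])
```

For `values = [3 + 2 * j for j in range(12)]` and a spacing of 4, the plain means of the four possible samples are 11, 13, 15 and 17, whereas every end-corrected mean equals the true mean of 14. The correction is exact only for a strictly linear trend; for other material it is at best an approximation.
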
The mathematical functions investigated are (1) the two-valued function, $f(x) = 0$ or $1$, corresponding to sampling for attributes, (2) the normal error function, which corresponds to sampling for density with material normally distributed about a point in a line, and (3) the one-term autoregressive function $y_{r+1} = b y_r + a_{r+1}$. In the case of the two-valued function the relative performance of systematic and random samples is shown to depend on the lengths of the intervals of the function relative to the sampling interval. If these are small all forms of sampling are about of equal accuracy, but if they are large, systematic sampling is on the average twice as accurate as random sampling with one point per block, which is again twice as accurate as random sampling with two points per block. Similar results hold for the autoregressive function when $b \rightarrow 1$. In the case of the normal function, numerical analysis shows that systematic sampling over the whole of the curve is remarkably accurate in determining the integral of the curve. Mathematical reasons why this should be so are put forward. The sampling of part of the curve by systematic sampling is also investigated, and is used to demonstrate the value of end-corrections. The effect on the sampling errors of departures of actual density distributions from the normal form due to random variations in the material is evaluated. Numerical analyses are made of five numerical sequences: (1) 288 altitudes at 0.1 mile intervals along a grid line of a 1 in. O.S. map, (2) yields of 96 rows of potatoes, (3) 192 daily maximum screen temperature readings, (4) 192 soil temperature readings (9 a.m.) at 4 in., (5) 192 similar readings at 12 in.
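
The accuracy claimed for systematic sampling of the whole normal curve can be checked numerically with a short sketch. It assumes a normal density with unit standard deviation, with the spacing and offset of the sampling points expressed in standard deviations. The names `normal_density`, `systematic_integral` and `half_width` are hypothetical, and the truncation at twelve standard deviations is a convenience of this illustration, not something taken from the paper.

```python
import math


def normal_density(x):
    """Standard normal density (zero mean, unit standard deviation)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def systematic_integral(spacing, offset, half_width=12.0):
    """Estimate the integral of the normal curve from equally spaced ordinates.

    Sampling points lie at offset + j * spacing for every integer j that
    keeps the point inside [-half_width, half_width]; the estimate is the
    spacing times the sum of the ordinates at those points.
    """
    first = math.ceil((-half_width - offset) / spacing)
    last = math.floor((half_width - offset) / spacing)
    total = sum(normal_density(offset + j * spacing) for j in range(first, last + 1))
    return spacing * total
```

With a spacing of one standard deviation the estimate differs from the true integral of 1 by less than 1e-7 wherever the sampling points happen to fall. At a spacing of two standard deviations and offset 0 the error grows to roughly 1.4 per cent, so the accuracy depends strongly on the spacing relative to the spread of the curve.
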
These analyses confirm the findings of the theoretical part of the investigation, and show that for these types of material the gain in precision with systematic sampling over stratified random sampling of the same intensity with one point per block is of the same order as the gain in precision with stratified random sampling with one point per block over stratified random sampling of the same intensity with two points per block, though the former tends to be larger in material of the more continuous type. The actual average ratios of the variances for the five sequences range from 1.26 to 2.99 in the first case, and 1.31 to 1.90 in the second. The relation between the gain in precision and the gain in efficiency is evaluated. The latter is always smaller owing to decrease in accuracy per point for a given method of sampling with decrease in intensity. Consideration of the relation between sampling costs and the losses due to errors in the sampling results shows, however, that with a more precise method of sampling greater accuracy should be demanded in the results. The danger of using systematic sampling in material about which nothing is known, or on material which may be subject to periodicities, is stressed, as is the importance in large-scale sampling investigations of making a preliminary investigation before instituting systematic sampling and of arranging for adequate control of error in the form of error estimates, with supplementary observations if necessary, in systematic sampling or stratified random sampling with one point per block. Control of this type should of course also be employed in stratified random sampling with two or more points per block, but in this case no special provisions are necessary, since valid estimates of error are always available from the sampling results themselves.
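
For a completely enumerated sequence, the three error variances compared above can be computed exactly rather than estimated. The following is a minimal sketch under stated assumptions, not the paper's own computation. Blocks are taken as consecutive runs of the sequence, the two points in a double-width block are assumed to be drawn without replacement, and the systematic figure applies no end-corrections. All function names are hypothetical.

```python
from statistics import fmean, pvariance, variance


def blocks(values, size):
    """Split the sequence into consecutive blocks of equal size."""
    if len(values) % size != 0:
        raise ValueError("sequence length must be a multiple of the block size")
    return [values[i:i + size] for i in range(0, len(values), size)]


def systematic_variance(values, spacing):
    """Mean squared error of the means of all possible systematic samples."""
    if len(values) % spacing != 0:
        raise ValueError("sequence length must be a multiple of the spacing")
    overall = fmean(values)
    means = [fmean(values[start::spacing]) for start in range(spacing)]
    return fmean([(m - overall) ** 2 for m in means])


def one_point_per_block_variance(values, spacing):
    """Variance of the mean with one random point in each block of `spacing`."""
    strata = blocks(values, spacing)
    return sum(pvariance(s) for s in strata) / len(strata) ** 2


def two_points_per_block_variance(values, spacing):
    """Variance of the mean with two random points in each block of 2 * spacing.

    Assumes the two points are drawn without replacement, so each block mean
    has variance (S^2 / 2) * (1 - 2 / M), with M the block size and S^2 the
    block variance on M - 1 degrees of freedom.
    """
    size = 2 * spacing
    strata = blocks(values, size)
    per_block = [variance(s) / 2 * (1 - 2 / size) for s in strata]
    return sum(per_block) / len(strata) ** 2
```

Dividing one of these variances by another gives a ratio of the kind quoted for the five sequences. For a purely linear sequence such as `list(range(24))` at spacing 4, the uncorrected systematic variance (1.25) exceeds the one-point-per-block variance (about 0.21), which is the trend error that end-corrections are intended to remove.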