In this work, empirical models describing stochastic sampling error (Δ) are reported based upon analytical findings elicited from three common probability density functions (PDFs): the Gaussian, representing any real-valued, randomly changing variable x with mean μ and standard deviation σ; the Poisson, representing counting data, i.e., any integral-valued entity’s count x (cells, clumps of cells or colony-forming units, molecules, mutations, etc.) per tested volume, area, length of time, etc., with population mean μ and standard deviation √μ; and the binomial, representing the number of successful occurrences of something (x<sup>+</sup>) out of n observations or sub-samplings. These data were generated in such a way as to simulate what should be observed in practice while avoiding other forms of experimental error. Based upon analyses of 10<sup>4</sup> Δ measurements, we show that the average Δ (Δ̄) is proportional to either σ<sub>x̄</sub>•μ<sup>-1</sup> (Gaussian) or (n•μ)<sup>-1/2</sup> (Poisson and binomial). The average proportionality constants associated with these disparate populations were also nearly identical. Moreover, since σ = √μ for any Poisson process, (n•μ)<sup>-1/2</sup> = σ<sub>x̄</sub>•μ<sup>-1</sup>. In a similar vein, we have empirically demonstrated that binomial-associated Δ̄ were also proportional to σ<sub>x̄</sub>•μ<sup>-1</sup>. Furthermore, we established that, when all Δ̄ were plotted against either (n•μ)<sup>-1/2</sup> or σ<sub>x̄</sub>•μ<sup>-1</sup>, there was only one relationship with a slope A (0.767 ± 0.0990) and a near-zero intercept. This latter finding also argues that all Δ̄, regardless of parent PDF, are proportional to σ<sub>x̄</sub>•μ<sup>-1</sup>, which is the coefficient of variation for a population of sample means (C<sub>V</sub>[x̄]). Lastly, we establish that the proportionality constant A is equivalent to the coefficient of variation associated with the Δ measurement itself (C<sub>V</sub>[Δ<sub>j</sub>]) and, therefore, Δ̄ = C<sub>V</sub>[Δ<sub>j</sub>] × C<sub>V</sub>[x̄]. These results are noteworthy inasmuch as they provide a straightforward empirical link between stochastic sampling error and the aforementioned C<sub>V</sub>s.
Finally, we demonstrate that all attendant empirical measures of Δ are reasonably small when an environmental microbiome was well-sampled: n = 16 - 18 observations with μ ∼ 3 isolates per observation. These colony-counting results were supported by the fact that the two major isolates’ relative abundance was reproducible in the four most probable composition observations from one common population.
There are various analytical procedures for enumerating organisms in environmental samples which diverge in their experimental approach yet are mathematically inter-related. Thus, if V represents the sample volume and V e the volume occupied by a test entity of interest (e.g., colony forming units or CFUs), the probability that one particular V e will not contain this entity at concentration δ [
\[ \frac{V/V_e - V\cdot\delta}{V/V_e} = 1 - V_e\cdot\delta ; \]
i.e., \( V/V_e \) is the maximum possible number of entities in V, and \( V\cdot\delta \) is approximately the actual number of objects present.
Assuming that many V e aliquots have been combined to generate V, the probability that no organism will be contained in V is [
\[ P^- = \left[ 1 - V_e\cdot\delta \right]^{V/V_e} ; \]
therefore
\[ \ln[P^-] = \frac{V}{V_e}\cdot\ln\left[ 1 - V_e\cdot\delta \right] . \]
Since
\[ \ln[1-\psi] \approx -\psi - \frac{\psi^2}{2} - \frac{\psi^3}{3} - \frac{\psi^4}{4} - \cdots \]
then, if ψ = V e ⋅ δ ,
\[ \ln[P^-] \approx \frac{V}{V_e}\left( -V_e\delta - \frac{V_e^2\delta^2}{2} - \frac{V_e^3\delta^3}{3} - \frac{V_e^4\delta^4}{4} - \cdots \right) = -V\cdot\delta\left( 1 + \frac{V_e\delta}{2} + \frac{V_e^2\delta^2}{3} + \frac{V_e^3\delta^3}{4} + \cdots \right) . \]
For V e → 0 (e.g., E. coli [
\[ \ln[P^-] \approx -V\cdot\delta \]
\[ P^- = \exp[-V\cdot\delta] = \exp[-\mu] ; \]
therefore
\[ P^+ = 1 - P^- = 1 - \exp[-V\cdot\delta] = 1 - \exp[-\mu] . \quad (1) \]
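Equation (1) is simple to evaluate numerically; the following minimal Python sketch (the function name is ours, not from the original work) computes P⁺ for a given test volume and concentration.

```python
import math

def p_positive(V, delta):
    """Probability that a test volume V contains at least one entity
    present at concentration delta; Equation (1), P+ = 1 - exp(-mu)."""
    mu = V * delta            # expected number of entities in V
    return 1.0 - math.exp(-mu)

# e.g., mu = V * delta = 0.5 entities per tested volume
p = p_positive(V=1.0, delta=0.5)   # ~0.393
```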
In certain circumstances it is only possible to determine an organism’s δ by diluting the sample to such an extent that only a fraction of the n “technical” replicates tested are positive ( x + ) for the presence of the entity, or microbe, in question [
For example, were one to obtain a food sample containing ~14 CFU of a particular organism per 50 g, the cells would typically be washed from the food matrix, concentrated to a few mL (e.g., via centrifugation), and brought up to some appropriate volume (say 40 mL = Vsample) with media [
For MPN-based organism detection and subsequent enumeration, the number of positive occurrences of growth in any j-th experiment out of n observations, \( x_j^+ = \sum_{i=1}^{n}\theta_{ij} \) (θ = either 1 [presence] or 0 [absence]), can be estimated as
\[ x^+ \approx n\cdot P^+ = n\left( 1 - \exp[-V\cdot\delta] \right) \quad (2) \]
whereupon \( x^+ \) is integral (“=ROUND( n ⋅ P⁺, 0)” in Excel). The probability of observing \( x^+ \) successes out of n Bernoulli trials [
\[ P_b = \frac{n!}{x^+!\,(n-x^+)!}\,(P^-)^{\,n-x^+}(P^+)^{\,x^+} , \]
which is also known as the binomial PDF. Since n ⋅ P + = the population average (real) [
\[ P_b = \frac{n!}{x^+!\,(n-x^+)!}\left( 1-\frac{\mu^+}{n} \right)^{n-x^+}\left( \frac{\mu^+}{n} \right)^{x^+} . \quad (3) \]
The multiple-dilution MPN calculation itself is determined by finding the value of δ at the maximum of the product of the \( P_b \)s from all l-th dilutions ( \( \prod_l P_{b,l} \) ), which is easily achieved by adding the scaled sum of all dilutions’ \( \partial_\delta P_b \div P_b \) values to an initial guess for δ (i.e.,
\[ \delta_{m+1} = \delta_m + \lambda_m \times \sum_l \left\{ \partial_\delta P_{b,l} \div P_{b,l} \right\}_m = \delta_m + \lambda_m \times \sum_l \left\{ \left( x_l^+ - n + \frac{x_l^+}{\exp\left[ V\cdot\delta_m\cdot 0.1^l \right] - 1} \right)\cdot V\cdot 0.1^l \right\} \]
for any particular ℓ-th one-to-ten dilution and m iterations; λ is a scaling function that changes monotonically with m), then solving for the MPN recursively [
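As a concreteness check, the gradient-ascent recursion above can be sketched in a few lines of Python. For a single dilution (l = 0), the iteration should converge to the closed-form solution δ = ln[n ÷ (n − x⁺)] ÷ V given later in the text. A fixed, conservative step λ is used here in place of the paper’s monotonically changing λ_m, and all function names are ours.

```python
import math

def mpn_score(delta, x_pos, n, V, l=0):
    """Scaled d(ln P_b)/d(delta) for the l-th one-to-ten dilution:
    the summand of the iteration formula above."""
    a = 0.1 ** l
    return (x_pos - n + x_pos / (math.exp(V * delta * a) - 1.0)) * V * a

def mpn_estimate(x_pos, n, V, delta0=0.1, lam=1e-3, iters=20000):
    """Gradient ascent on ln P_b with a fixed small step lam
    (a stand-in for the monotonically changing lambda_m)."""
    delta = delta0
    for _ in range(iters):
        delta += lam * mpn_score(delta, x_pos, n, V)
    return delta

# single dilution: 4 positive tubes out of n = 10, V = 1 mL;
# closed form for one dilution is delta = ln(n / (n - x+)) / V
d = mpn_estimate(x_pos=4, n=10, V=1.0)
```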
At the limit n → ∞, Equation (3) simplifies to what is known as the Poisson PDF
\[ P_P = \frac{\mu^x \exp[-\mu]}{x!} . \quad (4) \]
Under these circumstances, x is the observed and μ is the population average number of counts in/on the tested volume, surface, chosen time period, etc. This PDF is applicable to all analytical systems involving, essentially, the counting of objects. However this PDF is applied, the most conspicuous aspect [
\[ \sigma^2 = \sum_{x=0}^{\infty}(x-\mu)^2 P_P = \mu \]
equals the population mean ( μ or first moment)
\[ \mu = \sum_{x=0}^{\infty} x\cdot P_P . \]
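These two moment identities are easy to verify directly from Equation (4); the short Python sketch below (helper names are ours) sums the series numerically.

```python
import math

def poisson_pmf(x, mu):
    """Poisson PDF, Equation (4)."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

def poisson_moments(mu, x_max=100):
    """First moment and variance computed directly from the PDF;
    the truncated tail beyond x_max is negligible for small mu."""
    mean = sum(x * poisson_pmf(x, mu) for x in range(x_max))
    var = sum((x - mean) ** 2 * poisson_pmf(x, mu) for x in range(x_max))
    return mean, var

m, v = poisson_moments(4.0)   # both approach mu = 4
```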
The last probability density function utilized in this stochastic sampling exercise is also related to P b , Equation (3). This is the Gaussian PDF which we use to quantitatively examine the effects of n and σ (fixed μ ) on the variability of sample means ( x ¯ ) which have been created by randomly sampling from a population of real-valued variables (x; e.g., doubling time [
\[ P_G = \frac{\text{Area}}{\sigma\sqrt{2\pi}}\exp\left[ -\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^2 \right] ; \quad (5) \]
in this relationship the Area term ( \( \approx \Delta x\cdot\sum_{k=1}^{K} f_k \) for large K) is the approximate area under the fitting function f (frequently taken to be 1, since \( \Delta x \) is often 1 and \( \sum f \) is always ≈ 1). There are several derivations of \( P_G \), but none is as persuasive as the fact that this PDF is simple and has been experimentally shown to be the most likely probability distribution associated with most experimental observations [
The original purpose of our sampling-related investigations [
Herein we model stochastic sampling errors associated with all the aforementioned PDFs and empirically demonstrate that the resultant mathematical models are, in part, a consequence of the “central limit theorem” [
All counting data were created by multiplying Equation (4) by 360 in order to produce a large number of integral-valued repeats (=ROUND ( 360 ⋅ P P , 0)) for any particular count x: e.g., for μ = 1 particle per test volume, area, length of time, etc., there would be, most probably, 132 repeats of x = 0, 132 repeats of x = 1, 66 repeats of x = 2, 22 repeats of x = 3, 6 repeats of x = 4 and 1 repeat of x = 5 entities per test. From this pool of 360 counts for each μ, an n number of x values were randomly selected based upon random number tables created with Mathematica.
which generates n random numbers between 1 and 360. Thus, 100 such random number sets were utilized for each of the twenty-five n (= 3, 6, 9, 12, 24) × μ (= 1, 2, 4, 8, 16) combinations. Briefly, each procedure involved arranging the aforementioned 360 x values (one set for each μ) in one column of a spreadsheet, followed by filling in n adjacent columns with formulae which refer to the calculated x values but where each row’s reference number was taken from the Mathematica-generated random numbers, Equation (6), next in sequence. MPN- and Gaussian-based data arrays were treated in an identical fashion. The formula (P-I: normalized deviations of \( s_j \) from \( \sigma = \sqrt{\mu} \)) for calculating our empirical measure of Poisson stochastic sampling error (Δ) was
\[ \Delta_j = \frac{\left| \sqrt{\mu} - s_j \right|}{\mu} \quad (7) \]
whereupon the \( s_j \) term is the experimental standard deviation \( s_j = \sqrt{(n-1)^{-1}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2} \) (or “=STDEV.S( \( x_{ij} \)-array )” in Excel) for each j-th ( j = 1, 2, ⋯, J; J = 100) experiment and i-th ( i = 1, 2, ⋯, n ) x. The averages across J = 100 experiments, regardless of formulation, were symbolized as \( \bar{\Delta} \) ( \( = J^{-1}\cdot\sum_{j=1}^{J}\Delta_j \), or “=AVERAGE( \( \Delta_j \)-array )”). A second form of the Poisson-based measure of Δ was also calculated (P-II: normalized deviations of \( \bar{x}_j \) from the known μ) from these same data
\[ \Delta_j = \frac{\left| \mu - \bar{x}_j \right|}{\mu} . \quad (8) \]
Here \( \bar{x}_j \) is the observed arithmetic mean for each j-th counting experiment.
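The Monte Carlo procedure described above can be sketched as follows: build the 360-count pool from ROUND( 360·P_P, 0), draw n counts at random J = 100 times, and average the Equation (8) errors. In this sketch Python’s random module stands in for the Mathematica random-number tables (Equation (6)), and the function names are ours.

```python
import math
import random

def poisson_pool(mu, size=360):
    """Discrete pool of ~`size` counts built from ROUND(size * P_P, 0),
    mimicking the spreadsheet column described in the text."""
    pool, x = [], 0
    while True:
        p = mu ** x * math.exp(-mu) / math.factorial(x)
        reps = round(size * p)
        if x > mu and reps == 0:
            break                      # past the mode, tail exhausted
        pool.extend([x] * reps)
        x += 1
    return pool

def delta_bar(mu, n, J=100, seed=1):
    """Average P-II sampling error, Equation (8), over J experiments
    of n random selections each."""
    random.seed(seed)
    pool = poisson_pool(mu)
    total = 0.0
    for _ in range(J):
        xbar = sum(random.choice(pool) for _ in range(n)) / n
        total += abs(mu - xbar) / mu
    return total / J

db = delta_bar(mu=4, n=6)   # should shrink roughly as (n * mu) ** -0.5
```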
All MPN data were created by multiplying Equation (1) by 360 to produce the number (“=ROUND( 360 ⋅ P⁺, 0)”) of positive responses (θ = 1) for any particular level of V ⋅ δ (= μ); e.g., for μ = 0.1 entity per volume tested there would be 34 repeats of θ = 1 and 326 repeats of θ = 0. From such a column of 360 θ values (one column for each μ), n were randomly selected based upon Mathematica tables, Equation (6), and treated similarly to the Poisson data above. Thus, for each combination of n (= 3, 6, 9, 12, or 24) × μ (= 0.1, 0.2, 0.4, 0.8, 1.6), 100
random n-selections were performed. The formula for calculating our empirical measure of MPN sampling error was
\[ \Delta_j = \frac{\left| n\cdot P^+ - \sum_{i=1}^{n}\theta_{ij} \right|}{n\cdot P^+} = \frac{\left| \mu^+ - x_j^+ \right|}{\mu^+} ; \quad (9) \]
where θ = either a “1” (a positive occurrence) or a “0” (a negative occurrence). As before, the average \( \Delta_j \) across J = 100 experiments (each of n observations) = \( \bar{\Delta} \). The MPN value \( \bar{x}_j^+ = \ln[ n \div ( n - x_j^+ ) ] \) provides the average MPN or CFU per sample; a rearrangement of Equation (2).
All Gaussian PDF data were produced by multiplying Equation (5) ( Δ x = 1 ) by 360 producing an integral number of observations (“=ROUND ( 360 ⋅ P G , 0)”) for each value of x as a function of μ (fixed at 20) and σ (= 1, 1.5, 2, 3, 4). For instance, for σ = 1 there would be 2 repeats of x = 17, 19 repeats of x = 18, 87 repeats of x = 19, 144 repeats of x = 20, 87 repeats of x = 21, 19 repeats of x = 22, and 2 repeats of x = 23. From this column of 360 values of x, n (= 3, 6, 9, 12, or
24) were randomly selected based upon Equation (6) and treated identically to the Poisson and MPN data sets. Thus, for each combination of n × σ, 100 n-based selections were performed. The formula for calculating our empirical measure of Gaussian sampling error, similar to Equation (7), was
\[ \Delta_j = \frac{\left| \sigma - s_j \right|}{\mu} . \quad (10) \]
As usual, the average \( \Delta_j \) across J = 100 such sets of experiments, each of n observations, = \( \bar{\Delta} \).
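A parallel sketch for the Gaussian case (Equation (10)): for brevity, this draws directly from random.gauss rather than building the 360-value spreadsheet pool, which is just a discretized version of the same distribution. The function name is ours.

```python
import math
import random

def gaussian_delta_bar(mu, sigma, n, J=100, seed=1):
    """Average Gaussian sampling error, Equation (10): the mean of
    |sigma - s_j| / mu over J experiments of n random draws each."""
    random.seed(seed)
    total = 0.0
    for _ in range(J):
        xs = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s_j = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        total += abs(sigma - s_j) / mu
    return total / J

# expected to scale as sigma / (sqrt(n) * mu), i.e., sigma_xbar / mu
db = gaussian_delta_bar(mu=20, sigma=2, n=6)
```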
All curve-fitting was based upon a modified Gauss-Newton algorithm by least squares [
\[ CL = t\cdot s_{f_k} = t\sqrt{ s_{\pi_1}^2\left[ \partial_{\pi_1} f_k \right]^2 + s_{\pi_2}^2\left[ \partial_{\pi_2} f_k \right]^2 + 2\cdot s_{\pi_1\pi_2}^2\cdot\partial_{\pi_1} f_k\cdot\partial_{\pi_2} f_k } \]
where, for any particular fitting parameter ω, \( s_\omega = \sqrt{ s_Y^2\cdot\left[ Z^T Z \right]^{-1}_{\omega\omega} } \) = “asymptotic standard error” [
\[ CL = t_{0.01}\cdot s_{f_k} = t_{0.01}\sqrt{ s_Y^2\left( Z_k\left[ Z^T Z \right]^{-1} Z_k^T \right) } . \]
In all the above relationships Z is the partial first derivative matrix of f k with respect to the parameters π 1 and π 2 (i.e., a 2-parameter fit) such that
\[ Z = \begin{bmatrix} \partial_{\pi_1} f_1 & \partial_{\pi_2} f_1 \\ \partial_{\pi_1} f_2 & \partial_{\pi_2} f_2 \\ \vdots & \vdots \\ \partial_{\pi_1} f_K & \partial_{\pi_2} f_K \end{bmatrix} , \]
\( Z^T \) is the transpose of Z, \( Z_k = \left[ \partial_{\pi_1} f_k \;\; \partial_{\pi_2} f_k \right] \) (K row vectors), and \( s_Y^2\cdot\left[ Z^T Z \right]^{-1} \) is the variance-covariance matrix [
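For a one-parameter fit of the form f = A·X (the slope-through-the-origin relationship reported for Δ̄), this machinery collapses to a scalar: Z is the K × 1 column of X_k values, so \( [Z^TZ]^{-1} = 1/\sum X^2 \) and \( s_A = \sqrt{s_Y^2/\sum X^2} \). A minimal Python sketch; the X-Y numbers below are made up for illustration, not taken from the paper’s figures.

```python
import math

def fit_through_origin(X, Y):
    """Least-squares slope A for f = A*X, plus its asymptotic standard
    error s_A = sqrt(s_Y^2 / sum(X^2)).  Here Z is the K x 1 column of
    dA(f_k) = X_k, so [Z^T Z]^-1 reduces to 1 / sum(X^2)."""
    sxx = sum(x * x for x in X)
    A = sum(x * y for x, y in zip(X, Y)) / sxx
    K, M = len(X), 1                      # K rows, M fitting parameters
    rss = sum((y - A * x) ** 2 for x, y in zip(X, Y))
    s_Y2 = rss / (K - M)                  # residual variance
    return A, math.sqrt(s_Y2 / sxx)

# hypothetical Delta-bar vs (n*mu)^(-1/2)-style data
X = [0.10, 0.14, 0.20, 0.29, 0.41]
Y = [0.081, 0.102, 0.158, 0.224, 0.310]
A, ase = fit_through_origin(X, Y)
```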
For the food microbiome sampling experiment ~25 g of commercial, pre-thawed (~15 min at room temperature), frozen vegetables were washed with a volume of phosphate buffered saline (PBS; 10 mM Na2HPO4 + 2 mM NaH2PO4 + 137 mM NaCl; pH 7.4 ± 0.2; Boston BioProducts, 159 Chestnut Street, Ashland, MA 01721) equivalent to double the mass of the sample. In order to assist in the detachment of plant tissue-bound cells, 0.075% [w/v] Tween-20 (Sigma-Aldrich, 3050 Spruce St., St. Louis, MO 63103) was added to the PBS and filter sterilized. All washing was performed in sanitized plastic zip-lock bags wherein the formerly frozen vegetables and buffer wash were gently agitated at 80 rpm for approximately 20 min and immediately passed through a 40 μm nylon filter (BD Falcon; Becton Dickinson Biosciences, Bedford, MA) to remove large particles.
Directly sampled washes (5 mL Control = Observation I [cultured at 30˚C] and III [cultured at 37˚C]) as well as hollow fiber microfilter-concentrated (each 5 mL sample was diluted to ~100 mL PBS + Tween, concentrated, then washed with another 100 mL buffer, and eluted with ~5 mLs PBS + Tween = Observation II [cultured at 30˚C] and IV [cultured at 37˚C]) samples were collected and enumerated using the 6 × 6 drop plate method [
Each colony (n total) was carefully removed from the agar plate’s surface using a Rainin L20 tip, dispersed into 200 μL BHI in a 96-well plate and incubated at 30˚C for 16 - 24 hours. These cultures were restreaked onto solid media and incubated at 30˚C overnight. One colony from each of the original n plates was selected, suspended in 25 μL of Ultra PrepMan (Applied Biosystems, Foster City, CA) in a PCR tube and heated in a thermocycler at 99˚C for 15 min. Upon cooling, samples were centrifuged for 10 min to separate the DNA solution from the cell debris. A sample of supernatant was transferred to a new tube for the DNA amplification step (end-point PCR). Once the 16S rRNA “gene” amplification, sequencing reactions (EubA and EubB primers) and Sanger sequencing were performed, DNA sequences were edited, and contigs assembled using Sequencher software as explained in detail previously [
Completely homologous relationships to the Poisson and MPN findings were also noted with Gaussian-based data (
The counting results alluded to above (P-I, P-II, & MPN) are similar to those observed previously: [
\[ \bar{\Delta} \propto \frac{1}{\sqrt{n\cdot\mu}} = \frac{\sqrt{\mu}}{\sqrt{n}\cdot\mu} = \frac{\sigma}{\sqrt{n}}\cdot\frac{1}{\mu} = \frac{\sigma_{\bar{x}}}{\mu} = C_V[\bar{x}] \quad (11) \]
because \( \sigma = \sqrt{\mu} \). We have simplified the expression by utilizing the term \( \sigma_{\bar{x}} \) [
In
The equality in Equation (11) is also visually confirmed by the results shown in
\[ \frac{\partial\bar{\Delta}}{\partial\left[ (n\cdot\mu)^{-1/2} \right]} = \frac{\partial\bar{\Delta}}{\partial\left[ \sigma_{\bar{x}}\div\mu \right]} . \]
Since the combined data in
\[ \frac{\bar{\Delta}}{(n\cdot\mu)^{-1/2}} = \frac{\bar{\Delta}}{\sigma_{\bar{x}}\div\mu} \]
therefore cross-multiplying gives
\[ \bar{\Delta}\cdot\sigma_{\bar{x}}\div\mu = \bar{\Delta}\cdot(n\cdot\mu)^{-1/2} \]
and dividing both sides by Δ ¯ produces the equality
\[ \sigma_{\bar{x}}\div\mu = (n\cdot\mu)^{-1/2} . \]
All sampling error-related findings are summarized in
Lastly, all these assertions are substantiated by the observation (
\[ \frac{\partial s_{\Delta_j}}{\partial\bar{\Delta}} = \frac{\partial\bar{\Delta}}{\partial X} \quad (12) \]
where X = either \( (n\cdot\mu)^{-1/2} \) or \( \sigma_{\bar{x}}\div\mu \). Since \( s_{\Delta_j} \) in
\[ \frac{s_{\Delta_j}}{\bar{\Delta}} = \frac{\bar{\Delta}}{X} . \]
Substituting \( A\cdot X \) for \( \bar{\Delta} \) gives
\[ \frac{s_{\Delta_j}}{A\cdot X} = \frac{A\cdot X}{X} \]
\[ \frac{s_{\Delta_j}}{X} = A^2 \]
\[ s_{\Delta_j} = A^2\cdot X = A\left( A\cdot X \right) = A\cdot\bar{\Delta} \]
and therefore
\[ \frac{s_{\Delta_j}}{\bar{\Delta}} = C_V[\Delta_j] = A . \]
The above equality establishes that the coefficient of variation associated with Δ ¯ ( C V [ Δ j ] ) is equivalent to the proportionality constant A seen in Figures 1-3 and
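This equality can be spot-checked by simulation: generate many \( \Delta_j \) values via Equation (8) and compare their standard deviation with their mean. The sketch below uses a textbook inverse-CDF Poisson sampler in place of the spreadsheet pool; the names and parameter choices are ours.

```python
import math
import random

def delta_stats(mu, n, J=400, seed=2):
    """Mean and standard deviation of the P-II errors Delta_j
    (Equation (8)) across J simulated experiments."""
    random.seed(seed)

    def pois():
        # inverse-CDF sampling of a Poisson(mu) count
        u, x, p = random.random(), 0, math.exp(-mu)
        c = p
        while u > c:
            x += 1
            p *= mu / x
            c += p
        return x

    deltas = []
    for _ in range(J):
        xbar = sum(pois() for _ in range(n)) / n
        deltas.append(abs(mu - xbar) / mu)
    dbar = sum(deltas) / J
    s = math.sqrt(sum((d - dbar) ** 2 for d in deltas) / (J - 1))
    return dbar, s

# the ratio s / dbar estimates C_V[Delta_j], i.e., the constant A
dbar, s = delta_stats(mu=8, n=12)
```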
Based upon these results, the estimation of C V [ x ¯ ] (i.e., s x ¯ ÷ x ¯ ) should be germane in determining if data have been appropriately sampled.
on commercially available, frozen vegetables were sufficiently sampled using n = 16 - 18 inasmuch as the \( C_V[\bar{x}] \)-values associated with the normalized colony counts (CFU g−1 averaged across all l dilutions = \( \bar{x}_l \) ÷ 0.007 mL per drop ÷ 0.5^l dilution factor × 57.2 mL total original sample volume ÷ 28.6 g total frozen vegetable mass) were appropriately small (ranging between ca. 2% and 4%). In a
similar vein, it is pertinent that the observed (s) and calculated ( \( \sqrt{\bar{x}_l} \) ) standard deviations associated with the counts per drop were equivalent, since the average deviation ( \( |s - \sqrt{\bar{x}_l}| \) ) from ideality varied only 15.7% ± 3.54% ( ± \( s_{\bar{x}} \) ). Lastly, it is also significant that the dilution factors calculated from the ratios of average plate counts ( \( \bar{x}_l \div \bar{x}_{l-1} \) ) were very close to ½ (average 0.523 ± 0.0172), which also argues for a minimized Δ.
Across the 4 observational sets (I, II, III, and IV) depicted in
We have performed analyses associated with empirical stochastic sampling errors linked to data generated from 3 common probability density functions. We have used these to describe the limiting behavior of Δ by generating models which suggest a generalized, and facile, mathematical solution. Based upon all our experiments, the common algebraic solution, regardless of parent distribution, is that experimental sampling errors are proportional to σ x ¯ ÷ μ . This generalized relationship is intuitively reasonable inasmuch as this is the C V for any population of sample means ( C V [ x ¯ ] ) and describes how closely x ¯ values approach μ as n increases. The proportionality constant for all these findings was found to be mathematically related to C V [ Δ j ] or ∂ s Δ j / ∂ Δ ¯ , which is the coefficient of variation associated with the error measurement itself. Lastly, using estimates of these sampling-associated errors ( C V [ x ¯ ] ~ s x ¯ ÷ x ¯ ), we show that when a test microbiome was sufficiently sampled, several measures of stochastic sampling error were reasonably small for both counting and DNA sequence-based results.
The authors declare no conflicts of interest regarding the publication of this paper.
Irwin, P.L., He, Y. and Chen, C.-Y. (2019) Deconvolution of the Error Associated with Random Sampling. Advances in Pure Mathematics, 9, 205-227. https://doi.org/10.4236/apm.2019.93010
Indices = i ( = 1 , 2 , ⋯ , n ) observations per experiment; j ( = 1 , 2 , ⋯ , J = 100 ) experiments with n observations each; k ( = 1 , 2 , ⋯ , K ) rows of X-Y values; l ( = 1 , 2 , ⋯ , L ) dilutions; m ( = 1 , 2 , ⋯ , M ) iterations; p ( = 1 , 2 , ⋯ , P ) parameters
Δ j = j th experimental measure of sampling error out of J = 100 experiments: Equations (7)-(10).
Δ ¯ = average sampling error in J = 100 observations of Δ j
A = proportionality constant associated with Δ ¯ curve-fitting to n, μ (or σ)
s Δ j = standard deviation associated with Δ j measurement; for this work there are 25 ( n × μ or n × σ for the Gaussian populations) such s Δ j for each PDF type (2 types of Poisson, MPN or binomial, Gaussian)
μ = for either Poisson PDF or MPN assays ( μ = V ⋅ δ ), the population average number of biological entities, or other analytes, per test; for Gaussian PDF, the population’s average of any real-valued, randomly changing variable
V = the sample volume to be tested
V e = volume of the biological entity, or other analyte, being tested
δ = concentration of the biological entity (count ÷ V) or other analyte
μ + = population average number of positive growth responses (MPN) out of n observations; μ + = n ⋅ P +
σ + = the standard deviation associated with the probability density of x + ; the Gaussian approximations for σ + are plotted in
P − = probability that V e will NOT contain the biological entity, or other analyte, being tested
P + = probability that V e will contain the biological entity, or other analyte, being tested; P + = 1 − P − ; Equation (1)
∂ X f [ X ] = ∂ f [ X ] / ∂ X
x i j = for Poisson populations, the i th observation’s number of counts per tested volume, surface area, etc. for each j th experiment; for Gaussian populations, any real-valued, randomly changing variable
\( \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} \)
x j + = j th experiment’s number of positive growth responses out of n observations; x j + = ∑ i = 1 n θ i j where θ = 1 (positive) or 0 (negative)
\( \bar{x}_j^+ \) = j-th experiment’s number of positive counts in V volume; \( \bar{x}_j^+ = \ln[ n \div ( n - x_j^+ ) ] \); the x-bar symbol is used here because this relation contains a parameter, \( x_j^+ \), which is the result of a summation across all \( \theta_{ij} \); it just isn’t normalized to n
n = number of technical replicates in each j th experiment; for MPN, number of observations each of volume V; for Poisson populations we have found [
σ = population standard deviation associated with μ
σ x ¯ = standard deviation of a population of sample means ( x ¯ ); the formula for the σ x ¯ statistic can be derived from the propagation of errors method [
\[ \sigma_{\bar{x}} = \sqrt{ \left( \frac{\partial\bar{x}}{\partial x_1} \right)^2\sigma_{x_1}^2 + \left( \frac{\partial\bar{x}}{\partial x_2} \right)^2\sigma_{x_2}^2 + \cdots + \left( \frac{\partial\bar{x}}{\partial x_n} \right)^2\sigma_{x_n}^2 } = \sqrt{\frac{n\sigma^2}{n^2}} = \frac{\sigma}{\sqrt{n}} \]
since
\[ \frac{\partial\bar{x}}{\partial x_1} = \frac{\partial\bar{x}}{\partial x_2} = \cdots = \frac{\partial\bar{x}}{\partial x_n} = \frac{1}{n} \]
and
\[ \sigma_{x_1}^2 = \sigma_{x_2}^2 = \cdots = \sigma_{x_n}^2 = \sigma^2 . \]
s j = any jth experiment’s estimation of population standard deviation
\( s_{\bar{x}} \) = estimation of \( \sigma_{\bar{x}} \) from a limited number of \( \bar{x}_j \); \( s_{\bar{x}} = s_j \div \sqrt{n} \)
C V [ x ¯ ] = coefficient of variation for a population of means; C V [ x ¯ ] = σ x ¯ ÷ μ x ¯ = σ x ¯ ÷ μ estimated as s x ¯ ÷ x ¯
\( C_V[x] \) = coefficient of variation for any set of observations x; \( C_V[x] = \sigma \div \mu \), estimated as \( s \div \bar{x} \)
\( C_V[\Delta_j] = \partial s_{\Delta_j} / \partial\bar{\Delta} \approx s_{\Delta_j} \div \bar{\Delta} \) if the \( s_{\Delta_j} \) vs. \( \bar{\Delta} \) intercept ≈ 0
CLT = central limit theorem: the mean ( μ x ¯ ) of a population of observed means ( x ¯ ) will be approximately equal to the mean of the sampled population (μ) and the standard deviation of this population of means will be approximately equal to σ x ¯ ; Equation (5) with x = x ¯ , μ = μ x ¯ = μ , and σ = σ x ¯
PDF = probability density function or probability distribution function
P b = binomial PDF: Equation (3)
P P = Poisson PDF: Equation (4)
P G = Gaussian PDF: Equation (5)
CL = confidence limit = t-statistic × s f k = t ⋅ s f k
ASE = asymptotic standard error [
ASE = \( s_\omega = \sqrt{ s_Y^2\cdot\left[ Z^T Z \right]^{-1}_{\omega\omega} } \); \( s_Y^2 \) = residual sum of squares ÷ (K − M), where M = the number of fitting parameters \( \pi_p \) ( p = 1, 2, ⋯, P )
s f k = kth row standard error of fitting function fk; s f k = s Y 2 ( Z k [ Z T Z ] − 1 Z k T )
Z = partial first derivative matrix of f k with respect to associated fitting parameters π 1 , π 2 , ⋯ , π P
Z T = transposition of Z
\( Z_k = \left[ \partial_{\pi_1} f_k \;\; \partial_{\pi_2} f_k \right] \) for \( f_k = f[ X_k ; \pi_p ] \)