This paper analyzes the effect of subgroup size on x-bar chart characteristics using sample influx (SIF) into a forensic science laboratory (FSL). The characteristics studied include changes in out-of-control points (OCP), the upper control limit, UCLx, and zonal demarcations. Multi-rules were used to identify the number of out-of-control points, Nocp, as violations using five control chart rules applied separately. A sensitivity analysis on Nocp was applied for subgroup size, k, and for the number of sigma above the mean value used to determine the upper control limit, UCLx. A FORTRAN code was implemented to create x-bar control charts and capture OCP and other control-chart characteristics with increasing k from 2 to 25. For each value of k, a complete series of average values, Q(p), of specific length, Nsg, was created, on which statistical analysis was conducted and compared to the original SIF data, S(t). The variation of the number of out-of-control points or violations, Nocp, with increasing k for different control-chart rules was determined to follow a decaying exponential function, $N_{ocp} = Ae^{-\alpha k}$, for which the goodness of fit was established; the R2 value approached unity for Rules #4 and #5 only. The goodness of fit was established to be the new criterion for the rational subgroup-size range, for Rules #5 and #4 only, which involve a count of 6 consecutive points decreasing and 8 consecutive points above the selected control limit (σ/3 above the grand mean), respectively. Using this criterion, the rational subgroup range was established to be 4 ≤ k ≤ 20 for the two x-bar control chart rules.
The forensic science laboratory (FSL) studied comprises three disciplines (forensic biology and DNA, forensic chemistry, and forensic toxicology), which receive, examine, analyze and report on evidence submitted in criminal cases from all over Tanzania. The FSL is the sole legally authorized laboratory to examine evidence submitted by any Governmental agency investigating a criminal offense. Crime scenes are the primary point of detection, collection, and preservation of evidence before submission to the FSL. All of the methods and processes employed at a crime scene are geared toward stopping the clock, i.e., keeping the scene and the evidence as close as possible to their state when the crime was committed until they can be recorded and documented. The evidence received is viewed in the context of the crime, the persons involved, and the environment (both physical and temporal). Science applied at the scene can generate data for later analysis, provide quicker answers, and facilitate better analysis in the laboratory. Currently, the sample influx is overwhelming, which requires statistical analysis to support decision making.
The FSL studied is part of a dynamic multi-disciplinary organization devoted to the ideals of excellence, which provides leadership to advance forensic science and its application to the police investigation unit, the legal system and other services offered by the organization. Currently, the FSL has undertaken studies to debottleneck its challenges, including extended turnaround time (TAT), the large number of case files submitted, the large number of evidence items or samples submitted per case file (defined in this study as sample influx data), financial and administrative hurdles, and human resource challenges. The SIF data shows strong variations with time, from one case file to another, and among the three disciplines, posing a challenge to FSL performance.
This paper focuses on the statistical time series analysis techniques on SIF data and introduces application of x-bar control chart technique as a tool for identifying uncommon occurrences of high sample influx so that causes can be identified and action can be taken to keep the sample management process under control.
Higher SIF to FSL affects the sample management process. Demand on human resource, funds for reagents and other consumables, high equipment utilization rate and repair, administrative burdens (documentation, quality control) are among the effects of high SIF to the FSL management. Moreover, high demand on utilities (water, electricity, compressed gases, cooling and storage facilities, etc.) and increased laboratory space requirements necessitate a closer look at SIF data so that decisions can be made based on scientific evidence. Other effects of high SIF include high report volumes and writing and review time, payment of extra hours and demand for expert witness sessions’ preparation time for analysts.
Antagonistic scenarios prevail between crime scene exhibit collection and FSL sample analysis and reporting. While crime scene investigation demands collection of as many items of evidence as possible (which leads to elevated SIF), the FSL enters a state of management surge in trying to accommodate high SIF, leading to extended TAT, which accounts for a large share of the complaints from its clients (investigation, prosecution and judiciary).
The SIF data can be analyzed using time series analysis techniques. Statistical analysis techniques tend to unfold hidden details of the inherent process that are usually contained in the time series. Those details are necessary in order to build critical understanding of the system or process being assessed. With increased understanding of the process, control action and decision making become easy for elimination of the causes of process instability.
There exist several time series analysis techniques employed in analyzing engineering systems. This study used x-bar control charts to signal problems in crime scene management techniques, investigator skills, changes in the national or regional crime profile, the need for re-planning of human resources, space, equipment and technology acquisition, the budget review process, or funding re-allocation. The x-bar and range charts are the most common control charts used for continuous data and are well known as fundamental tools for displaying the range of variability inherent to a process [
Application of several rules to the same set of data is referred to as multi-rule analysis. The rules can be implemented separately (as in this work) or in combination [
Research on statistical process control (SPC) is extensive in industry and medical laboratories, but has not focused on forensic science laboratories for cost and quality improvements. Since FSL reports contribute strongly to the judiciary system, the use of control charts as a quality control tool is an important research area. The purpose of this research was to provide a scientific basis for choosing a rational subgroup size, given an acceptable range of 2 to 25. None of the earlier researchers dealt in depth with the scientific identification of rational subgroup size.
The primary use of x-bar control charts is to help in determining whether or not the process in question is stable [
Other rules use the zones to test for process stability, called zone tests (Rule #2 to #4). The zone tests are valuable tests for enhancing the ability of control charts to detect small shifts quickly. In this study, Rule #2 to #4 were implemented based on zone tests while Rule #5 which is based on trending behavior was also implemented in order to build understanding on the ability of the control charts. The first step in using these tests is to divide the control chart into zones, by dividing the area between the average and the upper control limit into three equally spaced areas. The locations of the lines depend on standard deviation and a factor of the latter to be added to the grand mean. This is then repeated for the area between the average and the lower control limit.
The x-bar control charts consist of three zones, that is, A, B, and C. There is zone A for the top half of the chart and a zone A for the bottom half of the chart. The same is true for zones B and C. The charts are normally based on 3 sigma limits of the variable being plotted. This method works perfectly for a normally distributed process [
The limits are determined by estimating the short-term variation in the process, which are then used in defining process stability (or process control). The short-term variation provides a good model (or estimate, or prediction) of the longer-term variations because if short term variation remains under control, eventually the long term variations will be under control or stable. The short-termism arises from the choice of the subgroup size. This is the most critical component towards effective use of these control charts, yet one of the most overlooked. This paper examines the effect of subgroup size on the performance of the control chart as a statistical tool.
This paper presents a new criterion for choosing the subgroup size for the data in hand. As stated above, each subgroup represents a snapshot of the process at a given point in time. The x-axis of the x-bar control chart is time-based, so that the chart shows a history of the exhibit or sample receiving process. X-bar charts are efficient at detecting relatively large shifts in the process average, typically shifts of ±1.5 sigma or larger. The larger the subgroup, the more sensitive the chart will be to shifts in the process, because the standard deviation of the subgroup averages narrows as the subgroup size grows, provided that a rational subgroup can be formed.
Different researchers use subgroup sizes depending on convenience of data collection, and limitations posed by literature values of subgroup size. The key to successful control charts is based on formation of rational subgroups. Control charts rely upon properly selected subgroups to estimate the short-term variations in the process. The short-term variations are then used to predict the longer-term variation defined by the control limits, which differentiate between common and special causes of variations. A rational subgroup is simply a sample in which all of the items are produced under conditions in which only random effects are responsible for the observed variation. This study critically investigates the ability to identify process instability at a wide range of subgroup sizes for the same data set, that is, SIF data.
This paper deals with an approach for choosing the proper subgroup size for control charts. Other researchers used ANOVA for testing that the process mean is in control and Bartlett’s test for testing that the process variance is in control [
The purpose of x-bar control charts is to detect significant process changes when they occur. In general, charts that display averages of data like x-bar charts are more useful than charts of individual data points. Charts of individuals are not nearly as sensitive as charts of averages at detecting process changes quickly. X-bar charts are far superior at detecting process shifts in a timely manner, and the subgroup size is a crucial element in ensuring that appropriate chart signals are produced [
Often, the subgroup size is selected without much thought. A subgroup size of 5 seems to be a common choice. If the subgroup size is not large enough, then meaningful process shifts may go undetected. Based on the limitations over batch completion time, a value of k = 11 was used for saccharification temperature, pH and Brix data control charts [
The SIF data comprised details of each case file received in each calendar year from January to December. Each request submission, referred to as a case file, contains a different number of samples or items of evidence. Since the sample influx data was collected from sample receiving datasheets recorded as the samples were being received, the subgroups are formed from observations taken in a time-ordered sequence, i.e., from a time series of sample influx, or SIF, denoted as S(t). In other words, subgroups were formed using a snapshot of the process over a small window of time, and the order of the subgroups shows how those snapshots vary in time. Given that 629 case files were received in 260 working days (SIF2014, the year with the most case files), at an average of roughly 3 case files per day (629/260 ≈ 2.4, rounded up), a value of k = 2 or 3 spans a time window of one day. Thus, values of k higher than 3 are recommended for SIF data. At the other extreme, the maximum value of k = 25 is equivalent to 8.33 working days, which is within 2 weeks. Thus, most of the values of k used in this study cover a time window of 1 to 2 weeks, which can be too long, increasing the chance that special causes occur within a subgroup. For sample influx data, one week or 5 working days should be sufficient, that is, k = 5 × 3 = 15 maximum. In this case, analysis of k = 2 to 15, which gives a time window of up to 5 working days, is recommended. The details of the SIF data used in this study, including the number of case files received at the FSL and the corresponding statistics, are summarized in
This study mainly focused on characterizing the stability of the system as the subgroup size was increased from 2 to 25. Given a time series of sample influx data, S(t), of length Npt, as the subgroup size, k, is changed, the number of subgroups, Nsg, for which average values are determined and compared with control limits changes as per Equation (1):

$$N_{sg} = \left\lfloor \frac{N_{pt}}{k} \right\rfloor \tag{1}$$
Since only complete columns of subgroups can be processed, the incomplete subgroups were truncated leading to slight variations in the grand mean,
SIF data source | SIF2009 | SIF2014 | SIF2015 |
---|---|---|---|
N (case files) | 360 | 629 | 503 |
Mean (samples/case file) | 14.72 | 13.31 | 12.03 |
Median | 3 | 3 | 3 |
Mode | 1 | 3 | 3 |
Std. deviation | 34.17 | 44.35 | 36.13 |
Skewness | 5.68 | 7.48 | 7.08 |
Kurtosis | 41.18 | 73.42 | 63.41 |
Minimum | 1 | 1 | 1 |
Maximum | 343 | 618 | 400 |
Total number of samples | 5300 | 8370 | 6073 |
25th percentile | 1 | 2 | 2 |
50th percentile | 3 | 3 | 3 |
75th percentile | 14 | 4.5 | 5 |
k | Nsg | UCLx | XA | XB | Grand mean | XB (lower) | XA (lower) | LCLx | Nocp | Pocp |
---|---|---|---|---|---|---|---|---|---|---|
5 | 124 | 36.9 | 29.1 | 21.284 | 13.5 | 5.7 | −2.1 | −9.96 | 12 | 9.67% |
10 | 62 | 35.0 | 27.9 | 20.664 | 13.5 | 6.3 | −0.9 | −8.10 | 8 | 12.90% |
15 | 41 | 33.8 | 27.2 | 20.296 | 13.6 | 6.8 | 0.1 | −6.63 | 4 | 9.75% |
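The subgrouping step described before the tables (dropping the incomplete tail of the series and averaging each complete subgroup) can be illustrated with a short Python sketch. This is not the FORTRAN code used in the study: the function name `subgroup_averages`, the integer-division form assumed for Equation (1), and the sample values are illustrative assumptions.

```python
def subgroup_averages(s, k):
    """Return the averages Q(p) of complete, consecutive subgroups of size k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Assumed form of Equation (1): only complete subgroups are kept,
    # so the trailing len(s) % k observations are dropped.
    n_sg = len(s) // k
    return [sum(s[p * k:(p + 1) * k]) / k for p in range(n_sg)]


# Hypothetical sample-influx values (samples per case file), in order of receipt.
s_t = [1, 3, 3, 14, 1, 2, 40, 3, 1, 5, 3]
q_p = subgroup_averages(s_t, 5)
print(len(q_p), q_p)
```

With 11 observations and k = 5, two complete subgroups are formed and the last observation is left out, which is why the grand mean of the averages can differ slightly from the mean of the raw series.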
The main purpose of the x-bar control chart is to identify and count the out-of-control points (OCP), denoted as Nocp, observed above the upper control limit, UCLx, which is set using Equation (2):

$$UCL_x = \bar{\bar{X}} + n\sigma \tag{2}$$

where $\bar{\bar{X}}$ is the grand mean, σ is the sample sigma, and n is the number of multiples of sample sigma above the grand mean used to determine the upper control limit. For a normal distribution, n = 3. However, for a distribution far from normal, it is wise to establish the coefficient n before a control chart is used to assess the stability of the process. In the literature, the value of n used to set the control limits is usually stated in the rule to be used, such as 2of32s as introduced in the Westgard Rules.

The number of times any rule was violated, counted as OCP, was established based on the selected value of n = 1.0 for setting the control limits, as per Equation (3):

$$UCL_x = \bar{\bar{X}} + \sigma \tag{3}$$

This led to three zones separated by the lines XA and XB, as per Equations (4) and (5):

$$X_A = \bar{\bar{X}} + \tfrac{2}{3}\sigma \tag{4}$$

and

$$X_B = \bar{\bar{X}} + \tfrac{1}{3}\sigma \tag{5}$$

The percentage of out-of-control points, Pocp, for different control-chart interpretation rules for a given number of subgroups, Nsg (that is, for each value of k), was determined using Equation (6):

$$P_{ocp} = \frac{N_{ocp}}{N_{sg}} \times 100\% \tag{6}$$
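The control lines and the out-of-control percentage described above can be sketched in Python as follows. The sketch assumes that the upper control limit sits n sample standard deviations above the grand mean, that the zone lines XA and XB split the distance between the grand mean and UCLx into three equal parts, and that sigma is the sample standard deviation of the subgroup averages; the source does not state which sigma estimate was used, so that choice and the function names are assumptions.

```python
from statistics import mean, stdev


def chart_lines(q, n=1.0):
    """Grand mean, zone lines and upper control limit for subgroup averages q."""
    grand_mean = mean(q)
    sigma = stdev(q)  # assumption: sample standard deviation of the averages
    ucl = grand_mean + n * sigma
    step = (ucl - grand_mean) / 3.0  # three equally spaced zones C, B, A
    return {
        "mean": grand_mean,
        "XB": grand_mean + step,      # lower edge of zone B
        "XA": grand_mean + 2 * step,  # lower edge of zone A
        "UCLx": ucl,
    }


def percent_ocp(n_ocp, n_sg):
    """Percentage of subgroup averages counted as out of control."""
    return 100.0 * n_ocp / n_sg


lines = chart_lines([2.0, 4.0, 6.0])
print(lines)
# Nocp = 12 and Nsg = 124 are the tabulated values for k = 5.
print(round(percent_ocp(12, 124), 2))
```

Setting `n=3.0` gives the conventional 3-sigma limit, so the same function can be used to compare the n = 1 choice with the usual Shewhart limit.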
Literature shows that there are times when control limits are set using 3σ, 2σ or 1σ [
Detailed analysis of control charts uses a collection of rules to assess conditions that lead to the process (from which a time series originates) being denoted as out of control or unstable [
Rule #1 signifies process control rule where a violation was counted as Nocp when a subgroup average exceeds the upper control limit set as per Equation (3), where n = 1, which is different from a usual action or rejection limit on Shewhart control chart that uses Equation (2) with n = 3. The decision to use n = 1 was reached through sensitivity analysis for SIF data, as shown in
Rules | Condition assessed |
---|---|
Rule #1 | A point falls outside UCLx, denoted as 11s. |
Rule #2 | At least 2 points out of 3 are in zone A. |
Rule #3 | 3 out of 5 consecutive points on a control chart fall above XB. |
Rule #4 | 8 consecutive points above XB (in Zone B, Zone A or beyond). |
Rule #5 | 6 consecutive points decreasing or increasing. In this case only the decreasing scenario was used. |

Rule #2 was implemented using a count of the times at least 2 points out of 3 exceed XA. This count was established by summing the cases where all three points (Rall3), the first and second points (R1&2), the first and third points (R1&3), or the second and third points (R2&3) were observed to exceed XA. Whenever Rule #2 is violated, the count of violations is increased by unity, such that the total number of violations can be expressed as per Equation (7):

$$N_{ocp}^{(2)} = R_{all3} + R_{1\&2} + R_{1\&3} + R_{2\&3} \tag{7}$$
where the exponent denotes rule number.
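The Rule #2 count can be sketched in Python as below. This is a minimal illustration, not the study's FORTRAN code. It assumes that windows of three consecutive subgroup averages slide forward one point at a time, that "exceed" means strictly greater than XA, and that each window falls into at most one of the four cases, so a violating window adds exactly one to the total. The function name and the test values are hypothetical.

```python
def rule2_violations(q, xa):
    """Count windows of 3 consecutive averages with at least 2 above xa."""
    counts = {"Rall3": 0, "R1&2": 0, "R1&3": 0, "R2&3": 0}
    for i in range(len(q) - 2):
        a, b, c = (v > xa for v in q[i:i + 3])
        if a and b and c:
            counts["Rall3"] += 1
        elif a and b:
            counts["R1&2"] += 1
        elif a and c:
            counts["R1&3"] += 1
        elif b and c:
            counts["R2&3"] += 1
    # The total is the sum of the four mutually exclusive cases.
    return sum(counts.values()), counts


# Hypothetical subgroup averages and zone line.
total, by_case = rule2_violations([5, 9, 9, 1, 9, 1, 1], xa=8)
print(total, by_case)
```

Because the windows overlap, a single high average can contribute to more than one violating window.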
Rule #3 was implemented by counting the number of times the following scenarios were detected among the subgroup average values: the 1st to 3rd points (R1to3), the 2nd to 4th points (R2to4), or the 3rd to 5th points (R3to5) are above XB. The rule is violated whenever any of these scenarios is observed, so the total number of possible violations is the sum of the three possibilities, defined using Equation (8):

$$N_{ocp}^{(3)} = R_{1to3} + R_{2to4} + R_{3to5} \tag{8}$$
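A minimal Python sketch of this Rule #3 count is given below. It assumes that windows of five consecutive subgroup averages advance one point at a time and that the three scenarios are tallied independently within each window and then summed; both are assumptions about the implementation, and the function name is hypothetical.

```python
def rule3_violations(q, xb):
    """Sum of the three scenarios (points 1-3, 2-4, 3-5 above xb) over all windows of 5."""
    r1to3 = r2to4 = r3to5 = 0
    for i in range(len(q) - 4):
        above = [v > xb for v in q[i:i + 5]]
        r1to3 += all(above[0:3])  # 1st to 3rd points above XB
        r2to4 += all(above[1:4])  # 2nd to 4th points above XB
        r3to5 += all(above[2:5])  # 3rd to 5th points above XB
    return r1to3 + r2to4 + r3to5


# Hypothetical subgroup averages and zone line.
print(rule3_violations([9, 9, 9, 1, 1, 1], xb=5))
print(rule3_violations([9] * 7, xb=5))
```

For seven averages that all lie above XB, this way of counting returns 9, which is more than the number of points. That behavior is consistent with the remark later in the text that Nocp can exceed Nsg for the zone and trend rules.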
Rule #4 was implemented by assessing when eight consecutive average values of the subgroups were above XB [
Let
where symbol “Λ” represents an “AND” operator, then Rule #4 is violated and Nocp is increased by 1, until all cases where the condition is fulfilled are counted. This is denoted as 8x in multi-rule implementation. When condition stipulated in Equation (9) is fulfilled, the number of OCP or violations are counted, denoted as Rall8, expressed as per Equation (10):
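The Rule #4 test, in which eight consecutive subgroup averages must all lie above XB, can be sketched in Python as follows. The sketch assumes a window of eight points that slides forward one point at a time, with every window that satisfies the AND condition adding one to the count; the function name and the `run` parameter are hypothetical.

```python
def rule4_violations(q, xb, run=8):
    """Count windows of `run` consecutive averages that all lie above xb."""
    return sum(
        all(v > xb for v in q[i:i + run])  # AND over the whole window
        for i in range(len(q) - run + 1)
    )


# Hypothetical subgroup averages: ten points above XB give three overlapping windows.
print(rule4_violations([9] * 10, xb=5))
```

A series shorter than eight points yields no windows, so the count is zero by construction.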
Rule #5 was implemented by assuming that a sequence of Q(p) values, such that
This implies that six consecutive points in Q(p) series steadily decreases [
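The trend test of Rule #5 can be sketched in Python as below. It assumes that "steadily decreasing" means each of six consecutive subgroup averages is strictly smaller than the one before it, and that the six-point window slides forward one point at a time; ties are treated as breaking the trend. The function name is hypothetical.

```python
def rule5_violations(q, run=6):
    """Count windows of `run` consecutive averages that are strictly decreasing."""
    count = 0
    for i in range(len(q) - run + 1):
        window = q[i:i + run]
        if all(x > y for x, y in zip(window, window[1:])):
            count += 1
    return count


# Hypothetical subgroup averages: seven falling points contain two windows of six.
print(rule5_violations([10, 9, 8, 7, 6, 5, 4]))
```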
Since all the rules were applied separately to the sample influx data,
Based on results shown in
was implemented to read the original time series, S(t), create subgroups automatically, and create the upper and lower control limits, followed by calculating the averages for each subgroup and performing violations detection.
While the x-bar control chart rules might be used differently in different applications, it is important to note that these rules are intended to provide evidence of out-of-control process and not conclusive proof. Once out-of-control points are observed in the data, causes are investigated in the real or physical system, remedies made and observation on the effect of remedial action investigated once again.
With such a wide range of variations in control chart characteristics, a FORTRAN code was created to read the time series data and detect instability using the above rules, mainly testing the effect of subgroup size, k, starting from 2 and increasing to 25. The parameters assessed for OCP were related to k using power and exponential functions of different coefficients and indices.
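The sweep over subgroup sizes can be sketched in Python as follows. This is an illustration of the procedure rather than the FORTRAN code itself. Only Rule #1 is counted, the limit is set one sample standard deviation of the subgroup averages above the grand mean (an assumed sigma estimate), and the input series is synthetic, not FSL data.

```python
from statistics import mean, stdev


def subgroup_averages(s, k):
    """Averages of complete, consecutive subgroups of size k."""
    n_sg = len(s) // k
    return [sum(s[p * k:(p + 1) * k]) / k for p in range(n_sg)]


def rule1_violations(q, n=1.0):
    """Number of subgroup averages above the upper control limit."""
    ucl = mean(q) + n * stdev(q)  # assumption: sigma taken from the averages
    return sum(v > ucl for v in q)


def sweep(s, k_values=range(2, 26)):
    """Return (k, Nsg, Nocp) for each subgroup size."""
    rows = []
    for k in k_values:
        q = subgroup_averages(s, k)
        rows.append((k, len(q), rule1_violations(q)))
    return rows


# Synthetic, deterministic series with occasional spikes (not real SIF data).
s_t = [1 + (7 * i) % 5 + (150 if i % 47 == 0 else 0) for i in range(620)]
rows = sweep(s_t)
for k, n_sg, n_ocp in rows:
    print(k, n_sg, n_ocp)
```

The resulting table of k against Nocp is the kind of input that can then be related to k with power or exponential functions.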
The sample influx data recorded chronologically for 629 case files received into the FSL in the year 2009, 2014 and 2015 is presented in
When the data was tested for the underlying nature of its distribution, it was evident that the SIF data is not normally distributed. The probability distribution functions (PDFs) show highly positively skewed distributions with skewness = 5.69, 7.48 and 7.06 and very high kurtosis values = 41.18, 73.42, and 63.17, for SIF2009, SIF2014 and SIF2015, respectively, as shown in
are required) which manifests in the peak at S(t) = 3 samples for SIF2014 and SIF2015. The peaks at S(t) = 1 signify case files where a single sample was submitted to the FSL, which occur at the highest frequency, especially for SIF2009. Despite the difference in the number of case files received, the SIF2014 and SIF2015 data sets show behavior similar to SIF2009; none of the three is normally distributed.
The effect of sub-group size was initially investigated by plotting control charts at selected interval of subgroup sizes. In each case, different control limits and zonal demarcations of the x-bar charts were identified.
The resulting changes in the control chart parameters are listed in
Other characteristics of the control chart that depend on the subgroup size include
that SPx drops from 46.87 at k = 5 to 40.39 at k = 15. Further increase in k will result into even narrower area between units since the standard deviation decreases continuously.
Based on the nature of S(t) and Q(p), that is the number of samples per case file, negative control limits were excluded in the analysis, and only x-bar, UCLx, XA, and XB were used in detecting violations.
While
Several series of the subgroup average values, Q(p), equal in number to Nsg, were determined and recorded for further statistical analysis. The probability density functions (PDFs) at selected values of subgroup sizes, k = 2, 5, 10, 15, 20 and 25, are plotted in
Control chart parameters | SIF2009 | SIF2014 | SIF2015 |
---|---|---|---|
UCLx | 21.85 | 36.91 | 30.84 |
Grand mean | 14.41 | 13.47 | 12.14 |
LCLx | 6.96 | −9.96 | −6.56 |
SPx | 14.89 | 46.87 | 37.40 |
N | 370 | 629 | 503 |
k | 5 | 5 | 5 |
Nsg | 74 | 124 | 100 |
Nocp | 13 | 12 | 8 |
Despite the similarity in the shape of the PDFs, they differ in terms of frequency (the vertical axis), for which the minimum observed frequency increases with k from 1% when k = 2 to 4% when k = 25, showing that Q(p) approaches a normal distribution as k increases. There is also a shift along the horizontal axis as k increases, with the tail at lower values of Q(p) diminishing. Such an observation has been reported in the literature, especially for data exhibiting a normal distribution.
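The narrowing of the Q(p) distribution with increasing k can be illustrated with a small Python sketch. The series below is synthetic and deterministic, chosen only to mimic a low baseline with occasional spikes; it is not FSL data, and the function names are hypothetical. For each k the sketch compares the population standard deviation and maximum of the subgroup averages with those of the raw values that entered the subgroups.

```python
from statistics import pstdev


def subgroup_averages(s, k):
    """Averages of complete, consecutive subgroups of size k."""
    n_sg = len(s) // k
    return [sum(s[p * k:(p + 1) * k]) / k for p in range(n_sg)]


def spread_summary(s, k):
    """Return (Nsg, std of raw values used, std of averages, max of averages)."""
    q = subgroup_averages(s, k)
    used = s[:len(q) * k]  # raw values that fall in complete subgroups
    return len(q), pstdev(used), pstdev(q), max(q)


# Synthetic skewed series (assumption for illustration only).
s_t = [1 + (7 * i) % 5 + (150 if i % 47 == 0 else 0) for i in range(620)]
for k in (2, 5, 10, 15, 20, 25):
    n_sg, std_raw, std_avg, max_avg = spread_summary(s_t, k)
    print(k, n_sg, round(std_raw, 2), round(std_avg, 2), round(max_avg, 2))
```

The standard deviation of the averages can never exceed that of the raw values they were computed from, and the largest average can never exceed the largest raw value, which is the averaging effect at work.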
It is evident that increasing k leads to more uniformity among the Q(p) values due to the averaging effect as subgroup size increases. Moreover, the span and the maximum value of Q(p) decrease with k. The changes in statistics between the original time series data and Q(p) can be seen by comparing the statistical values as shown in
Series | k | Nsg | Mean | Std. deviation | Skewness (Sk) | Mean + σ | Maximum |
---|---|---|---|---|---|---|---|
S(t) | - | 620 | 15.2 | 37.2 | 6.85 | 52.4 | 620 |
Q(p) | 2 | 310 | 13.57 | 34.28 | 5.12 | 47.85 | 329.5 |
Q(p) | 5 | 124 | 13.57 | 25.31 | 3.84 | 38.87 | 186.4 |
Q(p) | 10 | 62 | 13.57 | 20.06 | 3.04 | 33.63 | 119.3 |
Q(p) | 15 | 41 | 13.56 | 17.47 | 2.22 | 31.03 | 82.4 |
Q(p) | 20 | 31 | 13.57 | 16.84 | 2.73 | 30.41 | 83.8 |
Q(p) | 25 | 24 | 13.58 | 14.95 | 1.44 | 28.53 | 54.0 |

It was observed that as the subgroup size increases, the standard deviation of the distribution of averages decreases. Specifically, the relationship shown in
However, Rules #2 and #3 show a poor relationship between the number of violations and subgroup size, as the fluctuations were observed to increase with k. Thus, further analysis was conducted for Rules #1, #4 and #5.
It should be noted that Rules #2, #3, #4 and #5 can lead to Nocp higher than Nsg due to the fact that one point can be counted several times as long as the neighboring points lead to violation of the rule. Thus, Pocp was not determined for the Rules #2 to #5.
The number of violations for a given control chart (prescribed by k and Nsg) analyzed using Rules #1, #4, and #5 were counted for each value of subgroup size, k. A preselected value of k that leads to rational subgroups is a prerequisite before performing analysis of number of violations, and identification of effective remedial action. However, the Nocp and hence the effectiveness of the control chart in bringing tangible remedial action depends strongly on k. Further analysis revealed that the two quantities (Nocp and k) were exponentially related as depicted in
The goodness of fit, expressed using R2 values, which were closer to unity, with exponential functions generated, as summarized in
The fitted functions take the form $N_{ocp} = Ae^{-\alpha k}$, where A and α are constants depending on the data set and the interval of subgroup size.
In
showing that the criterion applies well for SIF data.
The same procedure was used to test Rule #5 for goodness of fit between Nocp and k as a criterion for selecting the rational subgroup range. Again, exponential relationship with higher R2 values was revealed for SIF data, as shown in
Thus, a good fit of an exponential relationship on a log-log plot for number of violations, Nocp, versus subgroup size, k, was established as a criterion for choosing a proper value of k, when R2 value approached 1.0 or lies between 0.9 and unity for the two rules. Investigations for the behavior of the x-bar chart for the rest of the rules require further research work.
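The goodness-of-fit criterion can be sketched in Python as follows. The sketch fits a decaying exponential in k to Nocp by ordinary least squares on ln(Nocp) against k and reports R2 on that logarithmic scale. The fitting method used in the study is not stated, so this linearized fit is an assumption, and the counts in the example are hypothetical rather than taken from the SIF data.

```python
import math


def fit_exponential(k_values, nocp_values):
    """Fit Nocp = A * exp(-alpha * k); return (A, alpha, R2 on the log scale)."""
    xs = list(k_values)
    ys = [math.log(v) for v in nocp_values]  # requires every Nocp > 0
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(intercept), -slope, r2


# Hypothetical violation counts for k = 4..11 (not the study's results).
ks = list(range(4, 12))
nocp = [40, 33, 27, 22, 18, 15, 12, 10]
a, alpha, r2 = fit_exponential(ks, nocp)
print(round(a, 2), round(alpha, 3), round(r2, 4))
```

Repeating the fit over different intervals of k and keeping the interval where R2 lies between 0.9 and unity mirrors the selection criterion described above. Subgroup sizes with Nocp = 0 have to be left out of a logarithmic fit of this kind.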
Rule No. | Range of k | Exponential equation | R2 value | SIF data set |
---|---|---|---|---|
Rule #1 | 2 ≤ k ≤ 25 | | 0.9579 | SIF2014 |
Rule #4 | 2 ≤ k ≤ 25 | | 0.9705 | SIF2014 |
Rule #5 | 2 ≤ k ≤ 25 | | 0.9880 | SIF2014 |
Rule #4 | 4 ≤ k ≤ 20 | | 0.9915 | SIF2009 |
Rule #4 | 4 ≤ k ≤ 20 | | 0.9842 | SIF2014 |
Rule #4 | 4 ≤ k ≤ 20 | | 0.99 | SIF2015 |
Rule #5 | 4 ≤ k ≤ 20 | | 0.9915 | SIF2009 |
Rule #5 | 4 ≤ k ≤ 20 | | 0.9951 | SIF2014 |
Rule #5 | 4 ≤ k ≤ 20 | | 0.9952 | SIF2015 |

Sample influx data exhibits complex behavior with sudden spikes, leading to sudden surges in requirements for extra FSL resources. This necessitated detailed analysis to enable proper management of samples throughout the year. The method employed in this study, i.e., the x-bar control chart, has proved to be effective in identifying uncommon changes in the sample reception process. The control charts varied widely with changes in subgroup size, necessitating the use of computer software to characterize the charts and relate the results to the actual situation. The rational subgroup size was established to range from k = 4 to k = 20, over which the exponential functions between Nocp and k exhibited a high goodness of fit, R2, compared to other regions of k between 2 and 25. Implementation of multi-rules allowed detailed analysis of the behavior of the SIF data, in addition to statistical analysis of subgroup averages. With a proper choice of subgroup size, x-bar control charts are capable of identifying uncommon changes in the sample influx at the FSL. The charts behave differently at different values of k, with varying Nocp, Pocp, UCLx and SPx, although the shape of the distribution of Q(p) values remains the same. As k increases, the statistical analysis of Q(p) reveals a tendency to shrink both vertically and horizontally, with a decrease in skewness and standard deviation. The goodness of fit between Nocp and k was established to be the new criterion for the rational subgroup-size range observed for Rules #4 and #5, which involve a count of 8 consecutive points above the selected control limit and 6 consecutive decreasing points, respectively. Using this criterion, the rational subgroup range was established to be 4 ≤ k ≤ 20 for the two x-bar control chart rules. The exponential variation of Nocp with k for different x-bar control chart rules is a new finding established in this study. The values of k over which the exponential functions fit the Nocp data well are suggested to be the rational choices of subgroup size.
The author is grateful to the management of the Government Chemist Laboratory Authority (GCLA) for support during the course of this study.
Manyele, S.V. (2017) Analysis of the Effect of Subgroup Size on the X-Bar Control Chart Using Forensic Science Laboratory Sample Influx Data. Engineering, 9, 434-456. https://doi.org/10.4236/eng.2017.95026