Unauthorized use of energy is the major source of non-technical energy losses in developing countries. Gas theft, as one kind of energy theft, is a growing problem in a number of countries, particularly developing ones. This study addresses gas theft by using Geographic Information System (GIS) capabilities (spatial analysis) to bring external factors into current gas theft detection methods, improve data mining processes, and offer some management solutions. To this end, two types of data were collected and analyzed: internal data, such as reported instances of gas theft and some customer properties, and external data, such as demographic data. To analyze and model gas theft and the relationships between variables, we used hotspot analysis, Ordinary Least Squares (OLS) regression and Geographically Weighted Regression (GWR) with ArcGIS tools. The clustering test indicated that gas theft is not a random phenomenon across Tabriz and that there are underlying factors. Mapping clusters with hotspot techniques identified the locations of clusters and areas at risk. The regression analysis clearly illustrated the importance of external factors. Based on the results, we recommend a conceptual GIS framework for selecting high-risk areas as a data subset for meter data analysis. The results of this research are important for GIS-based spatial analysis and can serve as a basis for future research.
Unauthorized use of energy is the major source of non-technical energy losses in developing countries. Because of poor underlying infrastructure, detecting and controlling the potential causes of non-technical losses tends to be very difficult in developing countries [
Natural gas utility companies in Iran detect gas theft through meter readers and service personnel and, in some cases, through Meter Data Analytics (MDA) on monthly records in the Customer Information System (CIS). Inadequate company control procedures for meter tampering and theft, the ease of theft, a low level of detection and a high level of poverty are some of the important factors that can lead to gas theft [
Today, GIS plays a significant role in analysis and modeling: it is used to examine data and reveal relationships, patterns, trends and anomalies that cannot be detected directly on maps [
The study area was the city of Tabriz, located in northwestern Iran and the center of East Azerbaijan province (
In this research the census blocks were used as the spatial mapping unit. There were a total of 7260 blocks and each block had its population and other joined attributes in a related table.
To achieve the intended goals in the study, two types of data sources were collected and analyzed: internal data such as reported instances of gas theft, and some customer properties, and external data such as some demographic data. In order to model the gas theft we employed the following dataset based on GIS statistical analysis.
・ Reported gas theft cases (2012-2013, EAGC): point locations.
・ G-rates (meter capacity) and unit counts for customers.
・ Census blocks data (2012, Statistical Centre of Iran): population, high education rate, unemployment rate, immigrant rate, income level and type of buildings (non-apartment ratio).
・ Administrative borders (2014, Tabriz Municipality).
Detected gas theft cases were collected for a two-year period from EAGAS utility agent reports and then geocoded by x and y coordinates as a map layer of the study area. The meter capacity (G-Rate) and number of units for each customer were extracted from the EAGAS CIS (Customer Information System). G-Rate and Units are two important technical indicators that we used in the spatial analysis of gas theft. Census block data were collected from the Statistical Centre of Iran. Census-block-level data contain comprehensive demographic information, but for the purpose of this study we selected population, education, immigration, unemployment, income and type of buildings as the main socio-economic factors for the spatial analysis of gas theft.
Within this research we developed a GIS-based dataset as the basis for GIS spatial analysis. We used census-block-level data because it provides a rich set of variables that might help explain gas theft volumes. The gas theft data were in Excel tables, so we converted them in ArcGIS to a GIS point shape format with attributes. Gas theft incidents are points with the required customer attributes, and for the purpose of analysis we needed to join them to adjacent census blocks to represent the number of gas thefts and the mean attributes (mean G-Rate and mean unit count) in each census block. To better analyze the points, we also aggregated points lying within 30 meters of each other (
Our research makes use of GIS spatial statistics capabilities for exploring and mapping spatial relationships in gas theft. Spatial statistics comprise a set of techniques for describing and modeling spatial data [
Variable | Numerator | Denominator |
---|---|---|
G-Rate Mean | Sum of G-rates per block | Number of cases per block |
Units Mean | Sum of units per block | Number of cases per block |
Population | - | - |
Low Income Level Ratio | Xi - Min level | Max level - Min level |
High Education Rate | Number of highly educated people | Number of people above 18 |
Immigrant Rate | Number of immigrants | Total Population |
Unemployment Rate | Number of unemployed | Labor Force |
Non-apartment Ratio | Number of Non apartment Buildings | Number of Households |
model spatial correlation of some factors for gas theft.
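The block-level variables in the table above can be illustrated with a minimal sketch. It assumes theft cases are already joined to census blocks and stored as `(block_id, g_rate, units)` tuples; that layout, the function names `block_means` and `min_max_ratio`, and the sample values are hypothetical and not taken from the study's dataset.

```python
from collections import defaultdict


def block_means(cases):
    """Aggregate theft cases to blocks: case count, mean G-rate, mean unit count.

    `cases` is an iterable of (block_id, g_rate, units) tuples (assumed layout).
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum of G-rates, sum of units
    for block_id, g_rate, units in cases:
        entry = sums[block_id]
        entry[0] += 1
        entry[1] += g_rate
        entry[2] += units
    return {
        block_id: {"cases": n, "g_rate_mean": g / n, "units_mean": u / n}
        for block_id, (n, g, u) in sums.items()
    }


def min_max_ratio(value, min_level, max_level):
    """Rescale a level to [0, 1] as (Xi - Min level) / (Max level - Min level)."""
    if max_level == min_level:
        raise ValueError("max_level and min_level must differ")
    return (value - min_level) / (max_level - min_level)


# Made-up sample: two cases in block "b1", one case in block "b2".
cases = [("b1", 4.0, 1), ("b1", 6.0, 3), ("b2", 10.0, 8)]
summary = block_means(cases)
```

The rate variables (immigrant, unemployment, high education, non-apartment) follow the same numerator-over-denominator pattern and are omitted here.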
To examine clustering and better understand the patterns of gas theft, we applied the Average Nearest Neighbor tool in ArcGIS to the gas theft incident points. The Average Nearest Neighbor tool calculates the distance between each feature and its nearest neighbor, then computes the average of all nearest neighbor distances [
$$ANN = \frac{\bar{D}_O}{\bar{D}_E} \tag{1}$$

where $\bar{D}_O$ is the observed mean distance between each feature and its nearest neighbor;

$$\bar{D}_O = \frac{\sum_{i=1}^{n} d_i}{n} \tag{2}$$

and $\bar{D}_E$ is the expected mean distance for the features given a random pattern;

$$\bar{D}_E = \frac{0.5}{\sqrt{n / A}} \tag{3}$$
In the above equations, $d_i$ is the distance between feature $i$ and its nearest neighboring feature (census block), $n$ is the total number of features, and $A$ is the area of a minimum enclosing rectangle around all blocks, or a user-defined area value.
The average nearest neighbor z-score is calculated as;

$$z = \frac{\bar{D}_O - \bar{D}_E}{SE} \tag{4}$$

where;

$$SE = \frac{0.26136}{\sqrt{n^2 / A}} \tag{5}$$
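Equations (1) to (5) can be sketched in plain Python as follows. This is an illustrative brute-force version, not the ArcGIS implementation. The function name `average_nearest_neighbor`, the point layout and the sample area values are assumptions.

```python
import math


def average_nearest_neighbor(points, area):
    """Return (ANN ratio, z-score) for a list of (x, y) points.

    `area` is the study-area size A, e.g. the minimum enclosing rectangle.
    """
    n = len(points)
    if n < 2 or area <= 0:
        raise ValueError("need at least two points and a positive area")

    # d_i: distance from each feature to its nearest neighbor (O(n^2) search).
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        ))

    observed = sum(nearest) / n                    # Equation (2)
    expected = 0.5 / math.sqrt(n / area)           # Equation (3)
    std_error = 0.26136 / math.sqrt(n * n / area)  # Equation (5)
    ratio = observed / expected                    # Equation (1)
    z_score = (observed - expected) / std_error    # Equation (4)
    return ratio, z_score


# Four points packed into a corner of a 10 x 10 study area (made-up values).
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
ratio, z_score = average_nearest_neighbor(clustered, area=100.0)
```

A ratio below 1 with a negative z-score points to clustering, while a ratio above 1 points to dispersion.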
After the clustering test, we used hotspot analysis to identify the locations of statistically significant hotspots and cold spots in the data. The Hot Spot Analysis tool calculates the Getis-Ord Gi* statistic for each feature in a dataset. This tool looks at each feature within the context of neighboring features. To be a statistically significant hotspot, a feature must have a high value and be surrounded by other features with high values as well [
$$G_i = \frac{\sum_{j=1}^{n} W_{ij} X_j}{\sum_{j=1}^{n} X_j} \tag{6}$$
The other form of local Getis-Ord is Getis-Ord G* [
$$G_i^{*} = \frac{\sum_{j=1}^{n} W_{i,j} X_j - \bar{X} \sum_{j=1}^{n} W_{i,j}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} W_{i,j}^{2} - \left( \sum_{j=1}^{n} W_{i,j} \right)^{2}}{n - 1}}} \tag{7}$$

where $X_j$ is the variable for location $j$, $W_{i,j}$ is the spatial weight between locations $i$ and $j$, $n$ is equal to the total number of locations and;

$$\bar{X} = \frac{\sum_{j=1}^{n} X_j}{n} \tag{8}$$

$$S = \sqrt{\frac{\sum_{j=1}^{n} X_j^{2}}{n} - \left( \bar{X} \right)^{2}} \tag{9}$$
The Gi* statistic returned for each location is a z-score. A positive value indicates clustering of high values and a negative value indicates a cluster of low values.
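A minimal sketch of Equations (7) to (9) is given below. It assumes binary fixed-distance-band weights (1 inside the band, including the feature itself, and 0 outside). That weighting scheme, the function name `getis_ord_gi_star` and the sample values are illustrative assumptions and are not the configuration used in the study.

```python
import math


def getis_ord_gi_star(values, coords, distance_band):
    """Return the Gi* z-score of every feature using binary distance-band weights."""
    n = len(values)
    mean = sum(values) / n                                     # Equation (8)
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)  # Equation (9)
    if s == 0:
        raise ValueError("all values are identical; Gi* is undefined")

    z_scores = []
    for xi, yi in coords:
        # W_ij = 1 if j lies within the band of i (i itself included), else 0.
        w = [
            1.0 if math.hypot(xi - xj, yi - yj) <= distance_band else 0.0
            for xj, yj in coords
        ]
        sum_w = sum(w)
        sum_w2 = sum(wj * wj for wj in w)
        numerator = sum(wj * xj for wj, xj in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        # The denominator is zero when the band covers every feature.
        z_scores.append(numerator / denominator if denominator > 0 else 0.0)
    return z_scores


# Six blocks along a line: high theft counts on the left, low on the right.
coords = [(float(i), 0.0) for i in range(6)]
counts = [10, 9, 8, 1, 1, 0]
gi_star = getis_ord_gi_star(counts, coords, distance_band=1.0)
```

In this toy example the left end receives a positive z-score (toward a hotspot) and the right end a negative one (toward a cold spot).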
In order to analyze the relationships between variables we used Ordinary Least Squares regression (OLS) and Geographically Weighted Regression (GWR) analysis with ArcGIS tools. OLS is a global model [
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + E \tag{10}$$
In Equation (10), $y$ is the dependent variable, the $x$ terms are the explanatory variables, the $\beta$ terms are regression coefficients, and $E$ is the random error (residual). A variable might be a strong predictor of gas theft in some locations of the study area but a weak predictor in others. For this reason we used GWR. For each location, $i = 1, \cdots, n$, the GWR model is as below [
$$y_i = \beta_{i0} + \sum_{k=1}^{p-1} \beta_{ik} x_{ik} + E_i \tag{11}$$
In Equation (11), $y_i$ is the dependent variable at location $i$, $x_{ik}$ is the value of the $k$th covariate at location $i$, $\beta_{i0}$ is the intercept, $\beta_{ik}$ is the regression coefficient for the $k$th covariate, $p$ is the number of regression terms, and $E_i$ is the random error at location $i$. There is a distinction between regression terms and regression coefficients: because each of the $n$ locations has its own set of $p$ coefficients, the number of regression coefficients is $np$.
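The idea behind Equation (11) can be sketched as one weighted least-squares fit per location. The Gaussian distance-decay kernel, the fixed `bandwidth`, the helper names and the toy data below are assumptions made for illustration; the ArcGIS GWR tool offers its own kernel and bandwidth options, which are not reproduced here. With all weights equal to 1 the same routine reduces to the global OLS fit of Equation (10).

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def weighted_least_squares(rows, y, weights):
    """Return [b0, b1, ...] minimizing the weighted sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(design[0])
    xtwx = [[sum(w * d[a] * d[b] for d, w in zip(design, weights))
             for b in range(p)] for a in range(p)]
    xtwy = [sum(w * d[a] * yi for d, yi, w in zip(design, y, weights))
            for a in range(p)]
    return solve_linear_system(xtwx, xtwy)


def gwr_coefficients(coords, rows, y, bandwidth):
    """Fit one local regression per location with Gaussian kernel weights."""
    local = []
    for xi, yi in coords:
        weights = [
            math.exp(-0.5 * (math.hypot(xi - xj, yi - yj) / bandwidth) ** 2)
            for xj, yj in coords
        ]
        local.append(weighted_least_squares(rows, y, weights))
    return local


# Toy data: the covariate's effect is 1 in the west and 3 in the east.
coords = [(float(x), 0.0) for x in (0, 1, 2, 3, 10, 11, 12, 13)]
rows = [[1.0], [2.0], [3.0], [4.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0, 2.0, 3.0, 4.0, 3.0, 6.0, 9.0, 12.0]
local = gwr_coefficients(coords, rows, y, bandwidth=2.0)
global_fit = weighted_least_squares(rows, y, [1.0] * len(y))
```

Here the local slope is close to 1 at the western end and close to 3 at the eastern end, whereas the single global slope averages the two regimes. This is the kind of non-stationarity GWR is meant to capture.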
The implemented GWR Analysis consists of four steps:
1) Exploring the Data and OLS Regression: We used a scatter plot matrix and histograms to examine the relationships among all variables and to identify extreme data values. The first step in identifying relationships between variables is to run an Ordinary Least Squares (OLS) linear regression, which models a dependent variable in terms of its relationships to a set of explanatory variables. In this research, after removing outliers based on the scatter plot matrix, we ran an OLS regression on 8 variables: G-Rate Mean, Units Mean, Population, Low Income Level Ratio, Education Rate, Immigrant Rate, Unemployment Rate and Non-apartment Ratio. The OLS report suggested that the residuals should be tested for spatial autocorrelation, because OLS results are unreliable if the residuals are spatially autocorrelated. Moran’s I is an appropriate test for this purpose; the tool, available under Spatial Statistics Tools / Analyzing Patterns / Spatial Autocorrelation, measures the level of spatial autocorrelation in the residuals.
2) Model Development: The OLS regression is the starting point for model development and for performing a proper GWR analysis. To find a combination of variables that gives a properly specified OLS model, we used the Exploratory Regression tool. Exploratory regression evaluates all possible combinations of the variables and determines predictor importance. It tests all variable combinations for redundancy, completeness, significance, bias and performance.
3) Performing the GWR: After finding the best combination of variables, we ran the GWR with that combination to create a local model for understanding or predicting gas theft by fitting a regression equation to every block in the dataset.
4) Prediction of Gas Theft: Finally, after verifying the model, we predicted the future behavior of gas theft in the study area with the GWR tool.
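The residual check in step 1 relies on global Moran's I. A minimal sketch is shown below, assuming binary distance-band weights with no self-neighbors. The function name `morans_i`, the weighting scheme and the sample residuals are illustrative assumptions, and the significance test (z-score and p-value) reported by ArcGIS is omitted.

```python
import math


def morans_i(values, coords, distance_band):
    """Global Moran's I with binary distance-band weights (w_ii = 0)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]

    weight_sum = 0.0
    cross_sum = 0.0
    for i, (xi, yi) in enumerate(coords):
        for j, (xj, yj) in enumerate(coords):
            if i != j and math.hypot(xi - xj, yi - yj) <= distance_band:
                weight_sum += 1.0
                cross_sum += dev[i] * dev[j]

    squared_dev = sum(d * d for d in dev)
    if weight_sum == 0 or squared_dev == 0:
        raise ValueError("no neighbors within the band, or constant values")
    return (n / weight_sum) * (cross_sum / squared_dev)


# Residuals on six blocks along a line (made-up numbers).
line = [(float(i), 0.0) for i in range(6)]
clustered_residuals = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]    # similar values adjoin
alternating_residuals = [1.0, 5.0, 1.0, 5.0, 1.0, 5.0]  # neighbors differ
i_clustered = morans_i(clustered_residuals, line, distance_band=1.0)
i_alternating = morans_i(alternating_residuals, line, distance_band=1.0)
```

Values well above the expected value under randomness, $-1/(n-1)$, suggest clustered residuals and therefore a misspecified model; values near it suggest randomly distributed residuals.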
The Average Nearest Neighbor analysis of gas theft cases indicates statistically significant clustering. To validate the results, we also applied the Average Nearest Neighbor tool to the parcel layer and compared its result with the clustering of individual cases. The z-scores for parcels and for individual cases are quite different. For gas theft cases, the z-score is −15.5, and there is less than a 1% likelihood that this clustered pattern is the result of random chance (
After aggregating the data and selecting an appropriate distance band for the polygon data, we used the Hot Spot Analysis (Getis-Ord Gi*) tool to map patterns of gas theft. The result of the Hot Spot Analysis tool is a new map in which each feature is symbolized according to whether it is part of a statistically significant hotspot, part of a statistically significant cold spot, or not part of any statistically significant cluster. The red areas are hotspots, that is, areas where high numbers of gas thefts are surrounded by other areas with high numbers of gas thefts. The blue areas are cold spots, that is, areas where low numbers of gas thefts are surrounded by other areas with low numbers of gas thefts. Statistical significance is based on the p-values and z-scores calculated by the Hot Spot Analysis tool.
The results of regression analysis showed a global relationship between some variables (
Model performance according to adjusted R-squared is 0.48. This indicates that the implemented model (the explanatory variables modeled using linear regression) explains approximately 48 percent of the variation in the dependent variable. To put it differently, our model explains approximately 48 percent of the gas theft story.
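For reference, the standard adjusted R-squared for an OLS model with $k$ explanatory variables fitted to $n$ observations is $1 - (1 - R^2)(n - 1)/(n - k - 1)$. The short sketch below uses made-up inputs rather than the study's figures, and the function name `adjusted_r_squared` is hypothetical.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Penalize R-squared for the number of explanatory variables in an OLS model."""
    dof = n_observations - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_observations - 1) / dof


# Made-up example: R-squared of 0.50 from 101 observations and 5 predictors.
example = adjusted_r_squared(0.50, 101, 5)
```

The adjustment matters most when the number of predictors is large relative to the number of observations; with thousands of blocks and a handful of variables the two measures are nearly identical.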
Regression models with statistically significant non-stationarity are often good candidates for Geographically Weighted Regression (GWR) analysis. In the OLS map (
have statistically significant spatial autocorrelation for residuals (
The OLS regression, however, fits a single global regression to the whole dataset. Given the spatial variation of the variables (non-stationarity) and the complexity of the study area, we cannot use OLS for modeling or predicting gas theft; we need a model that can explain the differences locally. OLS nevertheless provides global statistics for selecting the best combination of explanatory variables as a starting point for a local regression model. The outputs of the GWR tool include a feature class (map) and a report of overall model results. The GWR analysis revealed a stronger relationship between the variables and gas theft. Model performance according to adjusted R-squared is 0.6 (
As mentioned previously, over- and under-predictions for a well-specified regression model are randomly distributed. Examining the spatial pattern of the GWR residuals (Spatial Autocorrelation, Moran’s I) indicates that the GWR model is well specified (
As with any regression, GWR can be used to predict. The outputs obtained from GWR for the prediction of gas theft include two maps: predicted gas theft values per block and predicted gas theft hotspots (
Item | Value |
---|---|
Neighbors | 458 |
Residual Squares | 2561.85 |
Effective Number | 313.0145 |
Sigma | 0.607 |
AICc | 13530.006 |
R2 | 0.6226 |
R2 Adjusted | 0.6056 |

The clustering test indicated that gas theft is not a random phenomenon across Tabriz and that there are underlying factors. Mapping clusters with hotspot techniques identified the locations of clusters and areas at risk. The regression analysis clearly illustrated the importance of external factors. Comparing the OLS and GWR results revealed that the impact of the factors differs from one location to another. For example, the impact of education is more distinct in the northern and southern parts of the city, whereas the impact of population density is almost the same in many neighborhoods. The accuracy of the model depends strongly on the number and combination of factors (data sources), so the model can be improved by adding more data sources. The results provide strong evidence of spatial relationships that can help build a better understanding of gas theft.
Today, the data mining approaches offered by many vendors no longer need to process all data; they first prioritize data for analysis based on a number of data sources and then execute data mining. In this research, we used gas theft instances together with third-party and internal data to examine spatial aspects of gas theft, and identified priority areas for further analysis and investigation. Data integration and the use of many data sources are fundamental to today’s analytic energy theft detection methods. We therefore drew on several different data sources to reveal spatial aspects of gas theft. We believe that GIS can serve utilities as a complete repository that holds spatial and non-spatial data from external and internal sources in one integrated space. The interoperability of GIS with other technologies and systems is an opportunity that leading utilities already exploit. In this research, we tried to show some capabilities of GIS for integration with utility systems, specifically for revenue protection programs. A utility company that wishes to take full advantage of GIS in revenue protection needs to develop an enterprise GIS. Planning the deployment of an enterprise GIS is a critical managerial activity for utilities moving from a traditional to a smart utility. To sum up, the findings show a meaningful spatial correlation; together with GIS capabilities, it allows external and internal data sources to be used more effectively for proactive analytics. Therefore, based on the results and the literature review, we recommend a conceptual GIS framework for selecting high-risk areas as a data subset for meter data analysis. The results of this research are important for GIS-based spatial analysis and can serve as a basis for future research.
The authors would like to thank EA Gas Company for providing the dataset, and Professor Jean-Fabrice Lebraty from University of Lyon, IAE, for his support and remarkable comments on an early version of this paper.
Touhidi, S.R.R. and Davoudi, I. (2018) Spatial Analysis Applied for Gas Theft Modelling in Tabriz City, Iran. Journal of Geoscience and Environment Protection, 6, 1-19. https://doi.org/10.4236/gep.2018.62001