2016 Eighth International Conference on Knowledge and Systems Engineering (KSE)
978-1-4673-8929-7/16/$31.00 ©2016 IEEE

Standardization procedure for automatic environmental data: a case study in Hanoi, Vietnam

Linh Nguyen Duc, Man Duc Chuc, Bui Quang Hung, Nguyen Thi Nhat Thanh
Center of Multidisciplinary Integrated Technology for Field Monitoring, University of Engineering and Technology, Vietnam National University
Hanoi, Vietnam
linhnd@fimo.edu.vn

Abstract - In Vietnam, environmental data collected from ground-based stations may contain abnormal or missing values due to several problems during operation, e.g. sensor problems. This paper proposes a standardization procedure which tries to detect unusual values and fill in missing data. Experiments were conducted on PM10 data. Two datasets measured in 01/2011 and 01/2012 at Nguyen Van Cu station in Hanoi, Vietnam are used for the experiments. In the abnormal detection process, unusual data can be reported to the data analyzers at ground stations for judging. In the missing filling process, the first dataset is used as the training dataset to construct regression models for predicting missing data, and the second dataset is used as the testing dataset. In the worst case, supposing 100% of PM10 is missing, Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) are 51 μg/m3 and 45% respectively. The correlation coefficient (R) between original PM10 data and predicted PM10 data is 0.56. In addition, different scenarios taking account of the percentage of missing data in the whole testing dataset are also considered. Experimental results showed that it is best to perform the missing filling process on datasets that contain 10% to 30% of missing data. For this case, RMSE ranges from 15 to 25 μg/m3 and MAPE varies from 5% to 13%.

Keywords: environmental data, abnormal detection, missing filling, PM10

I. INTRODUCTION

Environmental monitoring data is a dataset obtained by measuring one or more indicators of physical properties, chemical and biological components of the environment, according to a preset plan which covers time, space, methods and measurement process, to provide field information reliably and accurately.

Ground-based environmental data can be used in various real life applications such as air pollution modeling and healthcare studies [11]. For example, the healthcare sector can use the data to analyze and assess the impact of physical, chemical and biological factors on dermatological, respiratory or epidemic diseases [12]. The data can also help managers in the decision making process to create appropriate solutions to limit the severely decreasing air quality in Vietnam at present.

In Vietnam, there are two systems of environmental monitoring stations, both managed by the Ministry of Natural Resources and Environment [2]. Most of the stations are automated. The stations measure meteorological indicators and air pollution indicators hourly. Measured data is stored in local memory and transferred to the main center daily or weekly. There are many abnormal values and many gaps in the data due to problems during operation, such as sensor problems and station maintenance. Furthermore, the data has not undergone any fixing or recovering process. This creates obstacles for researchers who use the data in their studies. Currently, the authorities mainly use traditional statistical tools, e.g. Microsoft Excel, which may result in more processing time, especially when the data volume is huge. Additionally, detecting abnormal data or filling in missing data by hand is very time and cost consuming. Thus an automatic tool is needed to help the authorities and researchers work with the data.

Current problems appearing in the measured data at the ground stations are described below:
- The data is not consistent: data is not stored in a commonly standardized output. The data is stored in different structures using different units of measurement, column names, and date and time formats. This causes a lot of difficulties in analyzing the data.
- Noisy data: occurring in several cases such as equipment failure, transmission errors and unidentified errors.
- Missing data: data is missing in some situations, such as when the monitoring modules break unexpectedly, during power failure, or when the position of the measuring devices is changed.

In this paper, we address the second and third problems. The proposed standardization procedure helps in the synthesis, cleaning and missing filling of data, to save time and effort for managers and researchers when working with the data.

II. DATA

As mentioned before, in Vietnam there exist two automatic air monitoring networks which are managed by the Ministry of Natural Resources and Environment. The first is the monitoring network of meteorological and environment parameters (10 stations), the second is a network of national environmental monitoring stations (7 stations). The monitoring stations measure data hourly. The air pollution parameters commonly measured at all of the stations include carbon monoxide (CO), nitric oxide (NO), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3) and PM. In addition, these stations also measure meteorological information such as wind speed, wind direction, temperature, relative humidity, barometer, radiation and inner temperature.

In Hanoi, there are three air monitoring stations: one is located at Phao Dai Lang, Dong Da, and the other two are located at 556 Nguyen Van Cu and Ho Chi Minh Mausoleum. In this study, we used data from Nguyen Van Cu station for analysis. The station was launched in 2009 and is regularly maintained. This ground station is located in the Centre for Environmental Monitoring (CEM), has the most stable operation, and its data could be representative for the Hanoi area.

Particulate matter (PM) is solid and liquid particles suspended in the atmosphere. PM includes both organic and inorganic particles such as dust, pollen, soot, smoke, and liquid droplets. These particles vary greatly in size, composition, and origin. PM can be divided into three categories based on diameter: PM10, PM2.5 and PM1. Dust monitoring data includes PM10, PM2.5 and PM1, and PM10 is the main focus of this study. The PM10 data was collected from 01/01/2011 to 31/01/2011 and from 01/01/2012 to 31/01/2012 at the Nguyen Van Cu station, Hanoi. This is close to the time when Nguyen Van Cu station was set up, and data quality is guaranteed. For the period 2013-2016, the monitoring modules were not well maintained, resulting in more errors and missing data.

A. Structure and volume

The two datasets collected from Nguyen Van Cu station consist of 15 indicators: wind speed, wind direction, temperature, relative humidity, barometer, radiation, inner temperature, NO, NO2, SO2, CO, O3, PM10, PM2.5 and PM1. The data is stored in Microsoft Excel format (.xls). Data of each day is saved in a separate file. Thus the two datasets contain 62 files corresponding to 62 days, from 01/01/2011 to 31/01/2011 and from 01/01/2012 to 31/01/2012 (Table 1). The total number of records for each dataset is 744.

Table 1. Statistics on data structure and volume in 01/2011 and 01/2012

Time      Number of xls files   Monitoring indicators
01/2011   31                    Wind speed, wind direction, temperature, relative humidity, barometer, radiation, inner temperature, NO, NO2, SO2, CO, O3, PM10, PM2.5 and PM1
01/2012   31                    Same as 01/2011

B. Missing status

The first dataset, collected in 01/2011, has a low missing rate, i.e. about 2% for PM and 0% for the other indicators. The second dataset, collected in 01/2012, has a higher missing rate, i.e. 23% for SO2 and 37.4% for O3, but the PM indicators in this dataset were fully recorded (Table 2).

Table 2. Number of missing records by indicator in the two datasets

Indicators   SO2       O3        PM10     PM2.5    PM1
01/2011      0         0         15/744   15/744   15/744
01/2012      170/744   278/744   0        0        0

According to these statistics, the first dataset (01/2011) is used as the training dataset because its PM10 data is nearly complete and the monitoring data of the other indicators has high completeness. The second dataset (01/2012) is used as the test dataset.

III. METHODOLOGY

Based on the characteristics of the data, we propose a standardized procedure for automatic environmental data (Fig. 1) as described below:

1) Data collection: collect data from the stations, then build a common dataset defined by a conventional structure. The aim is to create a dataset with a standard data structure that simplifies the process of managing and analyzing data. If the dataset structure is not correct, collect the data again; once it is correct, go to the Data overview step.

2) Data overview (based on statistics): use statistical methods to extract the statistical characteristics and trends of the data and prescreen it against reality. This is just to get an overview of the data and a feel for whether the data is noisy or missing. This step helps us assess the quality of the existing data. If the dataset has good quality, proceed to the next step. If not, go back to the data source in the first step.

3) Noise detecting: remove data based on a data reliability range or using correlation analysis methods. This detects the days that have abnormal observational data and suggests unusual data to the analysts, who make the decision on the data. If the detected days are not noisy data, re-evaluate the noise detecting method; otherwise go to the next step.

4) Fill in missing data: use correlation analysis between the target indicator and other indicators to build linear regression models. The models are used to predict values for the missing records of the target indicator. If the dataset has been filled correctly, finish the process; otherwise re-evaluate the missing filling method.

Fig. 1. Data processing framework proposed (Start, Data collection, Data overview, Noise detecting, Fill in missing, Finish; each stage is followed by an evaluation that either reports problems or proceeds to the next stage).

IV. EXPERIMENTS AND RESULTS

A. Data Collection and Data Overview

Based on the data and basic statistical indicators, we can draw some conclusions on the PM10 data from the two datasets (Table 3).

Table 3. Statistical indicators calculated on the two datasets (PM10, μg/m3)

Month     Mean     Median   Mode    Q1      Q3
01/2011   141.37   129.68   40.91   56.07   210.41
01/2012   87.18    75.39    97.22   49.61   113.61

Overall, the average PM10 concentrations range from 85 to 140 μg/m3. This is close to the QCVN 05:2013/BTNMT standard, which states that the standard of air pollution in Vietnam for PM10 is 150 μg/m3. In general, the statistical indicators of PM10 in the second dataset have lower values than those of the first dataset. A previous study conducted in Hanoi showed that the averages of monitoring indicators are often higher in winter and lower in summer [1]. The maximum PM10 value is often observed in the period from October to January, with average PM10 values ranging from 100 to 150 μg/m3. This is similar to the above statistical data.

The previous study also showed the evolution of air pollution levels in 05/2003 and 09/2003 in Hanoi [1]. During these days, the air pollution level tends to rise during peak hours, from 7:00 to 9:00 and from 18:00 to 20:00. Furthermore, the highest peaks of air pollution level in the morning are often similar to those in the evening. This is because of the high volume of vehicles appearing during peak hours every day. Average values of PM10 for each hour, calculated from the data of each month, showed an agreement with this general trend (Fig. 2). Similar evaluation methods were applied to other indicators such as NO2, SO2 and CO. The results showed that the two datasets are reliable and follow general trends that were reported in the literature. This allows the following steps to be conducted.

Fig. 2. PM10 daily trend in 01/2011 and 01/2012.

B. Noise detecting

Noise removing aims to detect potentially abnormal data on a daily basis. This can be based on constructing a reliable data range, on correlation analysis, or on a combination of both methods.

The confidence interval can be used to determine a reliable range of values, which is used to remove noisy data. This method requires analysts to have long experience of working with observational data in order to construct a good data range. Through research and environmental reports [2, 3, 4, 5, 6, 7] we propose a range of reliable values for PM10 of [0-400] μg/m3. Applying the proposed range to the training dataset yields the potentially abnormal records described in Table 4.

Table 4. Records with values outside the confidence interval in 01/2011

Datetime           Observation value of PM10
12/01/2011 10:00   490
17/01/2011 08:00   420.656
17/01/2011 17:00   462.044
17/01/2011 18:00   425.139

Correlation analysis is another way to detect abnormal data. We propose to detect potentially abnormal data based on analysis of the correlation between daily data and monthly average data. First, the average value of each hour in a day is calculated from the data observed at that hour over a month. Thus for each month, 24 average PM10 values corresponding to the 24 hours in a day are constructed. These values are considered to represent the daily trend of PM10 for the month. Correlation analysis is then conducted between the PM10 data measured on a particular day in the month and the average PM10 values. If the correlation coefficient is low then the data is considered to be noisy or abnormal. Specifically, the range of [-0.3; 0.3] is used to filter out potentially abnormal data for further analysis and evaluation. The range of [-0.3; 0.3] was chosen because it corresponds to negligible correlation according to Mukaka [14]. Besides, evaluating abnormal data requires professional experience in meteorology and environment, plus further assessment of the origin of the pollution, such as traffic, industrial zones and the surrounding area at the measured time, and of the status of the measurement equipment at that time. Applying the proposed range of [-0.3; 0.3] to the training dataset, the days listed in Table 5 have low correlation coefficients.

Table 5. Dates with low correlation coefficient between the day and the monthly average in the training dataset

Date observation   Correlation coefficient
03/01/2011         -0.2829
04/01/2011         0.2108
09/01/2011         -0.0953
11/01/2011         0.1110
13/01/2011         0.1502
17/01/2011         0.2299
19/01/2011         -0.2411
23/01/2011         -0.0405

C. Filling missing

Previous studies show that some environmental indicators have significant correlations [9, 13]. This means that missing PM10 data can be recovered from suitable environmental parameters by constructing linear regression models. In this study, we build linear regression models using the training dataset. The models are used to predict missing PM10 values in the testing dataset. Table 6 shows the correlations between PM10 and other environmental indicators derived from the training dataset.

Table 6. Correlation between PM10 and other environmental indicators in the training dataset

Indicator    Correlation with PM10
WindSpd      0.04982
WindDir      0.03815
Temp         0.08365
RH           0.34409
Barometer    0.03855
Radiation    -0.0124
InnerTemp    0.02089
NO           0.23985
NO2          0.59005
SO2          0.53962
CO           0.44486
O3           0.09338

From the table, three indicators have a high correlation with PM10: NO2, SO2 and CO. Seven linear regression models are constructed to predict PM10 from these three parameters. As described before, 15 records in the training dataset have missing PM10 values. To ensure completeness of the data for building the regression models, these records are removed, resulting in a training dataset containing 725 records. After building the seven linear regression models, we validated the predicted PM10 values of each model against the actual PM10 values. Assuming 100% of PM10 is missing, R2, RMSE and MAPE are used to quantitatively assess the performance of each model (Table 7).

Table 7. Validation results of linear regression models on the training dataset (100% PM10 missing)

Parameter for model   R2     RMSE*   MAPE*
SO2                   0.3    75.6    80
NO2                   0.35   72.6    74.7
CO                    0.2    80.5    87.5
SO2, NO2              0.43   67.9    68.9
SO2, CO               0.4    69.5    71.8
NO2, CO               0.35   72.6    74.7
SO2, NO2, CO          0.43   67.6    68.8
* Predicted PM10 and Actual PM10

Based on the results, priorities are set for each model when applying them to real life problems, as in Table 8.

Table 8. Models ordered by priority (highest first)

Parameter for model   Linear regression equation
SO2, NO2, CO          Y = -8.98 + 2.02*SO2 + 1.35*NO2 + 0.011*CO
SO2, NO2              Y = 0.79 + 1.87*SO2 + 1.80*NO2
SO2, CO               Y = -1.95 + 2.59*SO2 + 0.028*CO
NO2, CO               Y = 20.5 + 2.51*NO2 0.0004*CO
NO2                   Y = 20.2 + 2.5*NO2
SO2                   Y = 52.9 + 3.01*SO2
CO                    Y = 42.5 + 0.04*CO

In practice, NO2, SO2 and CO are not always available. They can be missing just like PM10. Therefore, deciding which models to use requires understanding of the data to be recovered. Table 9 lists the suggested models according to the status of the data.

Table 9. Cases of missing data and suggested linear regression models

Record status                           Linear regression model
No missing SO2, NO2, CO                 SO2, NO2, CO
Records missing SO2                     NO2, CO
Records missing NO2                     SO2, CO
Records missing CO                      SO2, NO2
Records missing SO2, NO2                CO
Records missing SO2, CO                 NO2
Records missing NO2, CO                 SO2
Records missing all SO2, NO2, CO        Can not predict

Next, we validated the models on the testing dataset. The dataset has no missing PM10, CO or NO2, but SO2 is missing in 170/744 records. This is a good basis for the assessment process. Assuming 100% of PM10 data is missing from the testing dataset, the results showed that the correlation coefficient between the predicted and actual PM10 values is 0.56. RMSE and MAPE are 51 μg/m3 and 45% respectively (Table 10). This result is acceptable because it ensures data completeness and the R, RMSE and MAPE are at a medium level. The MAPE value in this case is smaller than the MAPE in Table 7 because two linear regression models were applied to predict PM10, so the error rate is smaller than with only one model.

Table 10. Results after filling missing PM10 in the testing dataset, assuming that 100% of PM10 data is missing in 01/2012

Number of records {NO2, SO2, CO}   Number of records {NO2, CO}   Correlation coefficient*   RMSE*   MAPE*
574                                170                           0.56                       51.4    45.3
* Predicted PM10 and Actual PM10

A test to evaluate the impact of the missing data rate was also performed. Different PM10 missing rates are assumed: 10%, 20%, 30%, 40%, and 50%. For each missing rate, 10 datasets are randomly generated. The average results of the 10 assessments performed on the 10 datasets are reported in Table 11.

Table 11. PM10 missing filling results for different missing rates in the testing dataset (744 records in total)

Missing percent            10%     20%     30%     40%     50%
Correlation coefficient*   0.94    0.91    0.86    0.78    0.75
RMSE*                      15.75   20.77   24.92   34.29   36.8
MAPE*                      4.93    9.10    13.46   18.04   23.06
* Predicted PM10 and Actual PM10

In general, the results differ significantly across missing rates. Specifically, when 10% of PM10 is missing, R, RMSE and MAPE are 0.94, 15.75 μg/m3 and 4.9%, which is the best recovery result in the test. With 20% to 30% of the data missing, MAPE and RMSE range from 9% to 13% and from 20 to 25 μg/m3 respectively. When the missing rate is higher than 30%, MAPE increases from 18% to 23%. The worst case is the 50% missing rate, with RMSE of 36.8 μg/m3 and R = 0.75. From the results, it is observed that it is better to perform the missing filling process when the data has a missing rate of 30% or less. Overall, these results show the potential of applying the method to real life problems.

V. CONCLUSION

In this paper, we propose a framework for automatic environmental data, from data collection to building a dataset which ensures a standardized structure and acceptable quality. The proposed workflow includes different stages: data collection, data overview, noise detecting, missing filling and evaluation. Different techniques at each stage are also introduced and experimentally evaluated. Although the framework is an overall process, analysts can customize every step in it. Besides, the framework still has some unresolved issues, which include the historical knowledge needed for noise removal and the completeness of the other environmental indicators used to estimate missing PM10 values. In future work, we plan to exploit meteorology or weather stations in the same area to employ more environmental indicators and improve the overall quality of the workflow.

ACKNOWLEDGMENT

REFERENCES

[1] Pham Duy Hien. Current status and laws of changes of air quality in Hanoi, 03/2006.
[2] The Ministry of Natural Resources and Environment Vietnam. National environmental report in 2013.
[3] The Ministry of Natural Resources and Environment Vietnam. National environmental report in 2010.
[4] Ngo Tho Hung. Urban Air Quality Modelling and Management in Hanoi, Vietnam. PhD Thesis, AARHUS University, 2010.
[5] Clean Air Initiative for Asian Cities (CAI-Asia) Center. Viet Nam: Air Quality Profile, 2010 Edition.
[6] Cao Dung Hai, Nguyen Thi Kim Oanh. Effects of local, regional meteorology and emission sources on mass and compositions of particulate matter in Hanoi. Atmospheric Environment, Volume 78, October 2013, Pages 105–112.
[7] Nguyen Tran Huong Giang, Nguyen Thi Kim Oanh. Roadside levels and traffic emission rates of PM2.5 and BTEX in Ho Chi Minh City, Vietnam. Atmospheric Environment, Volume 94, September 2014, Pages 806–816.
[8] Dang Manh Doan, Tran Thi Dieu Hang, Phan Ban Mai. Institute of Meteorology, Hydrology and Environment. The situation of air pollution in Hanoi and recommendations to reduce pollution, 2007.
[9] Jung-Moon Yoo, Yu-Ri Lee, Dongchul Kim, Myeong-Jae Jeong, William R. Stockwell, Prasun K. Kundu, Soo-Min Oh, Dong-Bin Shin, Suk-Jo Lee. New indices for wet scavenging of air pollutants (O3, CO, NO2, SO2, and PM10) by summertime rain. Atmospheric Environment, Volume 82, January 2014, Pages 226–237.
[10] Ping Wang, Junji Cao, Xuexi Tie, Gehui Wang, Guohui Li, Tafeng Hu, Yaoting Wu, Yunsheng Xu, Gongdi Xu, Youzhi Zhao, Wenci Ding, Huikun Liu, Rujin Huang, Changlin Zhan. Impact of Meteorological Parameters and Gaseous Pollutants on PM2.5 and PM10 Mass Concentrations during 2010 in Xi'an, China. Aerosol and Air Quality Research, 15: 1844–1854, 2015.
[11] Gharehchahi E, Mahvi AH, Amini H, Nabizadeh R, Akhlaghi AA, Shamsipour M, et al. Health impact assessment of air pollution in Shiraz, Iran: a two-part study. J Environ Health Sci Eng 2013; 11: –
[12] Brauer M, Amann M, Burnett RT, Cohen A, Dentener F, Ezzati M, et al. Exposure assessment for estimation of the global burden of disease attributable to outdoor air pollution. Environ Sci Technol 2012; 46: 652–660.
[13] Dragan M. Marković, Dragan A. Marković, Anka Jovanović, Lazar Lazić, Zoran Mijić. Determination of O3, NO2, SO2, CO and PM10 measured in Belgrade urban area. Environmental Monitoring and Assessment, October 2008, Volume 145, Issue 1, pp 349-359.
[14] M. M. Mukaka. A Guide to Appropriate Use of Correlation Coefficient in Medical Research. Malawi Medical Journal, Vol 24, No 3, 2012, pp 69-71.
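The two noise screening rules of Section IV.B (the reliable range of [0-400] for PM10 and the [-0.3; 0.3] correlation band between a day and the monthly hourly average) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the study: the function names, the (days x 24) array layout, the use of NaN for missing records, and the synthetic data are all hypothetical, and the Pearson coefficient is assumed as the correlation measure.

```python
import numpy as np

RELIABLE_RANGE = (0.0, 400.0)  # reliable PM10 range from Section IV.B
LOW_CORR = 0.3                 # |r| <= 0.3 is treated as negligible, following [14]


def out_of_range_records(pm10, low=RELIABLE_RANGE[0], high=RELIABLE_RANGE[1]):
    """Return (day_index, hour) pairs whose PM10 value lies outside [low, high].

    pm10 is a (days, 24) array of hourly values (hypothetical layout).
    NaN marks a missing record and is never flagged by this check.
    """
    pm10 = np.asarray(pm10, dtype=float)
    with np.errstate(invalid="ignore"):
        mask = (pm10 < low) | (pm10 > high)
    return [(int(d), int(h)) for d, h in np.argwhere(mask)]


def low_correlation_days(pm10, threshold=LOW_CORR):
    """Return {day_index: r} for days whose hourly profile correlates weakly
    (|r| <= threshold) with the monthly hour-of-day average."""
    pm10 = np.asarray(pm10, dtype=float)
    monthly_profile = np.nanmean(pm10, axis=0)  # 24 hourly averages
    flagged = {}
    for day, values in enumerate(pm10):
        ok = ~np.isnan(values)
        if ok.sum() < 3 or np.std(values[ok]) == 0:
            continue  # too little information to judge this day
        r = np.corrcoef(values[ok], monthly_profile[ok])[0, 1]
        if abs(r) <= threshold:
            flagged[day] = float(r)
    return flagged


# Synthetic demonstration data (not measurements): ten days that share a
# profile with morning and evening peaks, scaled slightly from day to day.
hours = np.arange(24)
base = 80 + 60 * np.exp(-((hours - 8) ** 2) / 4) + 60 * np.exp(-((hours - 19) ** 2) / 4)
scales = 1 + 0.05 * np.arange(10)
pm10 = base[None, :] * scales[:, None]

pm10[3] = np.where(hours % 2 == 0, 120.0, 100.0)  # day with no daily trend
pm10[5, 8] = 490.0                                 # single out-of-range spike
pm10[6, 2] = np.nan                                # one missing record

range_hits = out_of_range_records(pm10)
corr_hits = low_correlation_days(pm10)
print("out of range:", range_hits)
print("low correlation days:", sorted(corr_hits))
```

Both functions only suggest candidates. As the paper notes, the flagged records and days still need to be judged by analysts with knowledge of the station and its surroundings.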