PART FOUR - MAINTAINING THE DATA

CHAPTER 13
IMPORTING DATA

The most important component of any data management system is the data in it. Manual and automated entry, importing, and careful checking are critical in ensuring that the data in the system can be trusted, at least to the level of its intended use. For many data management projects, the bulk of the work is finding, organizing, and inputting the data, and then keeping up with importing new data as it comes along. The cost of implementing the technology to store the data should be secondary. The EDMS can be a great time-saver, and should more than pay for its cost in the time saved and the greater quality achieved using an organized database system. The time savings and quality improvement will be much greater if the EDMS facilitates efficient data importing and checking.

MANUAL ENTRY

Sometimes there is no way to get data into the system other than transcribing it from hard copy, usually by typing it in. This process is slow and error prone, but if it is the only way, and if the data is important enough to justify it, then it must be done. The challenge is to do the entry cost-effectively while maintaining a sufficient level of data quality.

Historical entry

Often the bulk of manual entry is for historical data, usually data in hard-copy files. It can be found in old laboratory reports, reports submitted to regulators, and many other places.

DATA SELECTION - WHAT'S REALLY IMPORTANT?

Before embarking on a manual entry project, it is important to place a value on the data to be entered. The importance of the data and the cost to enter it must be balanced. It is not unusual for a data entry project for a large site, where an effort is made to locate and input a comprehensive set of data for the life of the facility, to cost tens or hundreds of thousands of dollars. The decision to proceed should not be taken lightly.

LOCATING AND ORGANIZING DATA

The next step, and often the most difficult, is to find the data. This is often complicated by the fact that over time many different people or even different organizations may have worked on the project, and the data may be scattered across many different locations. It may even be difficult to locate people who know or can find out what happened in the past. It is important to locate as much of this historical data as possible; the portion selected as described in the previous section can then be inventoried and input.

Once the data has been found, it should be inventoried. On small projects this can be done in word processor or spreadsheet files. For larger projects it is appropriate to build a database just to track documents and other items containing the data, or to include this information in the EDMS. Either way, a list should be made of all of the data that might be entered. This list should be updated as decisions are made about what data is to be entered, and then updated again as the data is entered and checked. If the data inventory is stored in the EDMS, it should be set up so that after the data is imported it can be tracked back to the original source documents to help answer questions about the origin of the data.

TOOLS TO HELP WITH CORRECT ENTRY

There are a number of ways to enter the data, and these options provide various levels of assistance in getting clean data into the system.
Entry and review process – Probably the most common approach used in the environmental industry is manual entry followed by visual review. In this process, someone types in the data, which is then printed out in a format similar to the one that was used for input. A second person then compares every piece of data between the two pieces of paper and marks any inconsistencies. These are then remedied in the database, and the corrections checked. The end result, if done conscientiously, is reliable data. The process is tedious for those involved, and care should be taken that those doing it keep up their attention to detail, or quality goes down. Often it is best to mix this work with other work, since it is hard to do accurately for days on end. Some people are better at it than others, and some like it more than others. (Most don't like it very much.)

Double entry – Another approach is to have the data entered twice, by two different people, and then have special software compare the two copies. Data that does not match is then entered again. This technique is not as widely used in the environmental industry as the previous one, perhaps because existing EDMS software does not make it easy to do, and perhaps also because the human checking in the previous approach sounds more reliable. A simple sketch of such a comparison appears below.

Scanning and OCR – Hardware and software are widely available to scan hard copy documents into digital format and then convert them into editable text using optical character recognition (OCR). The tools to do this have improved immensely over the last few years, such that error rates are down to just a few errors per page. Unfortunately, the highest error rates are with older documents and with numbers, both of which are important in historical entry of environmental data. Also, because the formats of old documents are widely variable, it is difficult to fit the data into a database structure after it has been scanned. These problems are most likely to be overcome, from the point of view of environmental data entry, when there is a large amount of data in a consistent format, with the pages in good condition. Unless you have this situation, scanning probably won't work, although the approach has been known to succeed on some projects. After scanning, a checking step is required to maintain quality.

Voice entry – As with scanning, voice recognition has taken great strides in recent years. Systems are available that do a reasonable job of converting a continuous stream of spoken words into a word processing document. Voice recognition is also starting to be used for on-screen navigation, especially for the handicapped. It is probably too soon to tell whether this technology will have a large impact on data entry.

Offshore entry – There are a number of organizations in countries outside the United States, especially Mexico and India, that specialize in high-volume data entry. They have been very successful in some industries, such as processing loan applications. Again, the availability of a large number of documents in the same format seems to be the key to success in this approach, and a post-entry checking step is required.

Figure 55 - Form entry of analysis data

Form entry vs. spreadsheet entry – EDMS programs usually provide a form-based system for entering data, and the form usually has fields for all the data at each level, such as site, station, sample, and analysis. Figure 55 shows an example of this type of form. This is usually best for entering a small amount of data.
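To make the double-entry approach concrete, here is a minimal sketch in Python. It assumes a hypothetical layout in which both typists keyed the records, in the same order, into CSV files with identical columns; any field-level difference is reported so the affected values can be entered again. This is an illustration of the technique, not the workings of any particular EDMS product.

```python
# Compare two independently keyed copies of the same data and report
# every field-level mismatch for re-entry. Rows are matched by position,
# so both files must list the records in the same order.
import csv

def compare_double_entry(path_a, path_b):
    """Return a list of (row number, column, value A, value B) mismatches."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a = list(csv.DictReader(fa))
        rows_b = list(csv.DictReader(fb))
    mismatches = []
    if len(rows_a) != len(rows_b):
        mismatches.append((None, "row count", len(rows_a), len(rows_b)))
    for i, (ra, rb) in enumerate(zip(rows_a, rows_b), start=1):
        for column in ra:
            if ra[column].strip() != rb.get(column, "").strip():
                mismatches.append((i, column, ra[column], rb.get(column)))
    return mismatches

for row, column, a, b in compare_double_entry("entry1.csv", "entry2.csv"):
    print(f"Row {row}, {column}: first entry '{a}' vs. second entry '{b}'")
```

An empty result means the two copies agree; anything reported goes back to the typists, which is the whole of the quality mechanism in this approach.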
For larger data entry projects, it may be useful to make a customized form that matches the source documents to simplify input. Another common approach is to enter the data into a spreadsheet, and then use the import tool of the EDMS to check and import the data. Figure 56 shows this approach. This has two benefits. First, the EDMS may have better data checking and cleanup tools as part of the import than it does for form entry. Second, the person entering the data into the spreadsheet doesn't necessarily need a license for the EDMS software, which can save the project money. Sometimes it is helpful to create spreadsheet templates, filling in repeated items like station names, dates, and parameter lists with cut and paste in one step, and then entering the results in a second step.

Ongoing entry

There may be situations where data needs to be manually entered on an ongoing basis. This is becoming less common, since most sources of data now involve a computerized step, so there is usually a way to import the data electronically. If not, the approaches described above can be used.

ELECTRONIC IMPORT

The majority of data placed into the EDMS is usually in digital format in some form or other before it is brought into the system. The implementers of the system should provide a data transfer standard (DTS) so that the electronic data deliverables (EDDs) created by the laboratory for the EDMS contain the appropriate data elements in a format suitable for easy import. An example DTS is shown in Appendix C.

Figure 56 - Spreadsheet entry of analysis data

Automated import routines should be provided in the EDMS so that data in the specified format (or formats, if the system supports more than one) can be easily brought into the system and checked for consistency. Data review tracking options and procedures must be provided. In addition, if it is found that a significant amount of digital data exists in other formats, then imports for those formats should be provided. In some cases, importing those files may require operator involvement if, for example, the file is a spreadsheet of sample and analytical data but does not contain site or station information. These situations usually must be addressed on a case-by-case basis.

Historical entry

Electronic entry of historical data involves several issues, including selecting, locating, and organizing the data, and format and content issues.

Data selection, location, and organization – The same issues exist here as in manual input in terms of prioritizing what data will be brought into the EDMS. Then it is necessary to locate and catalog the data, whatever format and medium it is in, such as on a hard drive or on diskettes.

Format issues – Importing historical data in digital format involves figuring out what is in the files and how it is formatted, and then finding a way to import it, either interactively using queries or automatically with a menu-driven system. Most modern data management programs can read a variety of file formats, including text files, spreadsheets, word processing documents, and so on. Usually the data needs to be organized and reformatted before it can be merged with other data already in the EDMS. This can be done either in its native format, such as in a spreadsheet, or after importing it into the database program and organizing it there. If each file is in a different format, there can be a big manual component to this. If there are a lot of data files in the same format, it may be possible to automate the process to a large degree, as in the sketch below.
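The following sketch shows what that automation might look like for a folder of same-format historical files. The column names are assumptions for illustration, not the layout of any particular EDMS or DTS; files that don't match the expected layout are set aside for manual handling, and each staged row is tagged with its source file so it can be traced back to the original document, as recommended earlier for the data inventory.

```python
# Stage rows from many historical CSV files that share one assumed format.
import csv
from pathlib import Path

EXPECTED_COLUMNS = ["station", "sample_date", "parameter", "value", "units"]

def load_historical_files(folder):
    records = []
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            missing = [c for c in EXPECTED_COLUMNS
                       if c not in (reader.fieldnames or [])]
            if missing:
                # a differently formatted file needs manual attention
                print(f"{path.name}: skipped, missing columns {missing}")
                continue
            for row in reader:
                row["source_file"] = path.name  # track back to the source
                records.append(row)
    return records

records = load_historical_files("historical_data")
print(f"{len(records)} rows staged for import")
```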
Content issues – It is very important that the people responsible for importing the data have a detailed understanding of the content of the data being imported. This includes knowing where the data was acquired and when, how it is organized, and other details like detection limits, flags, and units, if they are not in the data files. Great care must be exercised here, because details like these often change over time, frequently with little or no documentation, and are important in interpreting the data.

Ongoing entry

The EDMS should provide the capability to import analytical data in the format(s) specified in the data transfer standard. This import capability must be robust and complete, and the software and import procedures must address data selection, format, and content issues, and special issues such as field data, along with consistency checking as described in a later section.

Data selection – For current data in a standard format, importing may not be very time-consuming, but it may still be necessary to prioritize data import for various projects. The return on the time invested is the key factor.

Format and content issues – It may be necessary to provide other import formats in addition to those in the data transfer standard. Project staff members will identify when other formats need to be implemented. The content issues for ongoing entry may be smaller than for historical data, since the people involved in creating the files are more likely to be available to provide guidance, but care must still be taken to understand the data in order to get it in correctly.

Field data – In the sampling process for environmental data there is often a field component and a laboratory component. More and more, field data is being gathered electronically, and it is sometimes possible to move this data digitally into the EDMS. Some hard copy information is usually still required, such as a chain of custody to accompany the samples, but this can be generated in the field and printed there. The EDMS needs to be able to associate the field data arriving from one route with the laboratory data from another route so both types of data are assigned to the correct sample.

Understanding duplicated and superseded data

Environmental projects generate duplicated data in a variety of ways. Particular care should be taken with duplicated data at the Samples and Analyses levels. Duplicate samples are usually the result of the quality assurance process, where a certain number of duplicates of various types are taken and analyzed to check the quality of the sampling and analysis processes. QC samples are described in more detail in Chapter 15.

A sample can also be reanalyzed, resulting in duplicated results at the Analyses level. These results can be represented in two ways: either as the original result plus the reanalysis, or as a superseded (replaced) original result plus the new, unsuperseded result. The latter is more useful for selection purposes, because the user can easily choose to see just the most current (unsuperseded) data, whereas selecting on reanalysis status is less helpful because not all samples will have been reanalyzed. Examples of data at these two levels, and the various fields that can be involved in the duplications, are shown in Figure 57. A minimal sketch of the superseded representation follows.
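This sketch uses made-up values loosely following the Figure 57 example; the field names are illustrative, not a prescribed schema. The point is that each reanalysis is stored as its own row, and "current data" is simply the rows whose superseded flag is not set.

```python
# Superseded-result representation: every run is kept, replaced runs are
# flagged, and selecting current data is a simple filter.
analyses = [
    {"sample_no": 1, "parameter": "Naphthalene", "value": 120.0,
     "dilution": 1, "superseded": True},    # original run, replaced
    {"sample_no": 1, "parameter": "Naphthalene", "value": 118.0,
     "dilution": 50, "superseded": True},   # first reanalysis, replaced
    {"sample_no": 1, "parameter": "Naphthalene", "value": 119.0,
     "dilution": 10, "superseded": False},  # current, reportable result
]

current = [a for a in analyses if not a["superseded"]]
for a in current:
    print(a["parameter"], a["value"])
```

Note that a "reanalyzed" flag on the new rows would not support this query as cleanly, since samples that were never reanalyzed carry no such flag at all.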
Obtaining clean data from laboratories

Having an accurate, comprehensive, historical database for a facility provides a variety of benefits, but requires that consistency be enforced when data is being added to the database. Matching analytical data coming from laboratories with previous data in a database can be a time-consuming process.

Figure 57 - Duplicate and superseded data (example records for water samples and analyses: duplicate field samples at the Samples level, distinguished by a Duplicate number and QC code such as Original, Field Dup., and Split, and an original naphthalene result plus reanalyses at different dilution factors at the Analyses level, with the Superseded flag and Reportable Result field identifying the current value)

Variation in station names, spelling of constituent names, abbreviation of units, and problems with other data elements can result in data that does not tie in with historical data, or, even worse, does not get imported at all because of referential integrity constraints. An alternative is a time-consuming data checking and cleanup process with each data deliverable, which is standard operating procedure for many projects.

WORKING WITH LABS - STANDARDIZING DELIVERABLES

The process of getting the data from the laboratory in a consistent, usable format is a key element of a successful data management system. Appendix C contains a data transfer standard (DTS) that can be used to inform the lab how to deliver data. EDDs should be in the same format every time, with all of the information necessary to successfully import the data into the database and tie it to field samples, if they are already there. Problems with EDDs fall into two general areas: 1) data format problems and 2) data content problems. In addition, if data is gathered in the field (pH, turbidity, water level, etc.), then that data must be tied to the laboratory data once the data administrator has received both data sets.

Data format problems fall into two areas: 1) file format and 2) data organization. The DTS can help with both of these by defining the formats (text file, Excel spreadsheet, etc.) acceptable to the data management system, and the columns of data in the file (data elements, order, width, etc.).

Data content problems are more difficult, because they involve consistency between what the lab is generating and what is already in the database. Variation in station names (is it "MW1" or "MW-1"?), spelling of constituent names, abbreviation of units, and problems with other data elements can result in data that does not tie in with historical data. Even worse, the data may not get imported at all because of referential integrity constraints defined in the data management system.

Figure 58 - Export laboratory reference file

USING REFERENCE FILES AND A CLOSED-LOOP SYSTEM

While project managers expect their laboratories to provide them with "clean" data, on most projects it is difficult for the laboratory to deliver data that is consistent with data already in the database.
What is needed is a way for the project personnel to keep the laboratory updated with information on the various data elements that must be matched in order for the data to import properly. The laboratory then needs a way to efficiently check its electronic data deliverable (EDD) against this information prior to delivering it to the user. When this is done, project personnel can import the data cleanly, with minimal impact on the data generation process at the laboratory. It is possible to implement a system that cuts the time to import a laboratory deliverable by a factor of five to ten over traditional methods.

The process involves a DTS, as described in Appendix C, to define how the data is to be delivered, and a closed-loop reference file system in which the laboratory compares the data it is about to deliver to a reference file provided by the database user. Users employ their database software to create the reference file, which is then sent to the laboratory. The laboratory prepares the EDD in the usual way, following the DTS, and then uses the database software to do a test import against the reference file. If the EDD imports successfully, the laboratory sends it to the user. If it does not, the laboratory can make changes to the file, test it again, and, once successful, send it to the user. Users can then import this file with a minimum of effort because consistency problems have been eliminated before they receive it. This results in significant time-savings over the life of a project.

If the database tracks which laboratories are associated with which sites, then the creation of the reference file can start with selection of the laboratory. An example screen to start the process is shown in Figure 58. In this example, the software knows which sites are associated with the laboratory, and also knows the name to be used for the reference file. The user selects the laboratory, confirms the file name, and clicks on Create File. The file can then be sent to the laboratory via email or on a disk. This process is done any time there are significant changes to the database that might affect the laboratory, such as installation of new stations (for that laboratory's sites) or changes to the lookup tables.

There are many benefits to having a centralized, open database available to project personnel, but for this to work effectively the data in the database must be accurate and consistent. Achieving this consistency can be a time-consuming process. By using a comprehensive data transfer standard, and the closed-loop system described above, this time can be minimized. In one organization the average time to import a laboratory deliverable was reduced from 30 minutes to 5 minutes using this process.

Another major benefit of this process is higher data quality. This increase in quality comes from two sources. The first is that there will be fewer errors in the data deliverable, and consequently fewer errors in the database, because a whole class of errors related to data mismatches has been eliminated. The second is a consequence of the increased efficiency of the import process: the data administrator has more time to scrutinize the data during and after import, making it easier to eliminate many other errors that would have been missed without this scrutiny.
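As an illustration of the export step behind a screen like Figure 58, the sketch below pulls the valid stations, parameters, and units for a laboratory's sites out of a database and writes them to a reference file. The table and column names are hypothetical; a real EDMS would have its own schema and its own reference file format.

```python
# Create a laboratory reference file from (assumed) EDMS lookup tables.
import csv
import sqlite3

def export_reference_file(db_path, lab_id, out_path):
    con = sqlite3.connect(db_path)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["element_type", "value"])
        # stations for the sites assigned to this laboratory
        for (name,) in con.execute(
                "SELECT s.station_name FROM stations s "
                "JOIN site_labs sl ON sl.site_id = s.site_id "
                "WHERE sl.lab_id = ?", (lab_id,)):
            writer.writerow(["station", name])
        # lookup values every deliverable must match
        for table, label in [("parameters", "parameter"), ("units", "unit")]:
            for (name,) in con.execute(f"SELECT name FROM {table}"):
                writer.writerow([label, name])
    con.close()

export_reference_file("edms.db", lab_id=1, out_path="lab1_reference.csv")
```

The laboratory's test import then reduces to checking each EDD value against the lists in this file before the deliverable ever leaves the lab.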
Automated checking

Effective importing of laboratory and other data should include data checking prior to import, to identify errors and to assist with their resolution before the data is placed in the system. Data checking spans a range of activities from consistency checking through verification and validation. Performing all of the checks won't ensure that no bad data ever gets into the database, but it will cut down significantly on the number of errors. The verification and validation components are discussed in more detail in Chapter 16.

The consistency checks should include evaluation of key data elements, including referential integrity (existence of parents); valid site (project) and station (well); valid parameters, units, and flags; handling of duplicate results (same station, sample date and depth, and parameter); reasonable values for each parameter; comparison with like data; and comparison with previous data.

The software importing the data should perform all of the data checks and report on the results before importing the data. It's not helpful to have it give up after finding one error, since there may well be more, and it might as well find and flag all of them so you can fix them all at once. Unfortunately, this is not always possible. For example, valid station names are associated with a specific site, so if the site in the import file is wrong, or hasn't been entered in the sites table, then the program can't check the station names. Once the program has a valid site, though, it should be able to perform the rest of the checks before stopping. Of course, all of this assumes that the file being imported is in a format that matches what the software is looking for. If the site name is in the column where the result values should be, the import should fail, unless the software is smart enough to straighten it out for you.

Figure 59 shows an example of a screen where the user is asked what software-assisted data checking should be performed, and how to handle specific situations resulting from the checking.

Figure 59 - Screen for software-assisted data checking

Figure 60 - Screen for editing data prior to import

You might want to look at the data prior to importing it, and Figure 60 shows an example of a screen to help you do this. If edits are made to the laboratory deliverable, it is important that a record be kept of these changes for future reference.

REFERENTIAL INTEGRITY CHECKING

A properly designed EDMS program based on the relational model should require that a parent entry exist before related child entries can be imported. (Surprisingly, not all do.) This means that a site must exist before stations for that site can be entered, and so on through stations, samples, and analyses. Relationships with lookups should also be enforced, meaning that values related to a lookup, such as sample matrix, must be present and match entries in the lookup table. This helps ensure that "orphan" data does not exist in the tables. Unfortunately, the database system itself, such as Access, usually doesn't give you much help when referential integrity problems occur. It fails to import the record(s) and provides an error message that may or may not give you useful information about what happened. Usually it is the job of the application software running within the database system to check the data and provide more detailed information about problems.
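The "find and flag everything before stopping" strategy described above might look like the following sketch. The reference sets and field names are assumptions standing in for the EDMS lookup tables; note how station checking is skipped when no valid site is available, as discussed above, while the remaining checks still run.

```python
# Consistency checking that accumulates all errors instead of failing fast.
VALID_STATIONS = {"MW-1", "MW-2"}
VALID_PARAMETERS = {"Naphthalene", "Field pH"}
VALID_UNITS = {"ug/L", "pH units"}

def check_edd(rows, site_exists=True):
    errors = []
    if not site_exists:
        # stations belong to a site, so they can't be verified without one
        errors.append("Site not found; station names not checked")
    seen_keys = set()
    for i, row in enumerate(rows, start=1):
        if site_exists and row["station"] not in VALID_STATIONS:
            errors.append(f"Row {i}: unknown station '{row['station']}'")
        if row["parameter"] not in VALID_PARAMETERS:
            errors.append(f"Row {i}: unknown parameter '{row['parameter']}'")
        if row["units"] not in VALID_UNITS:
            errors.append(f"Row {i}: unknown units '{row['units']}'")
        # duplicate result: same station, sample date, depth, and parameter
        key = (row["station"], row["sample_date"],
               row.get("depth"), row["parameter"])
        if key in seen_keys:
            errors.append(f"Row {i}: duplicate result for {key}")
        seen_keys.add(key)
    return errors  # an empty list means the batch is ready to import
```

Range checks ("reasonable values for each parameter") and comparisons with like and previous data would slot into the same loop, each appending to the same error list.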
CHECKING SITES AND STATIONS

When data is obtained from the lab it must contain information about the sites and stations associated with the data. It is usually not a good idea to add this data to the main data tables automatically based on the lab data file, because it is too easy to get bad records in these two tables and then have the data being imported associated with those bad records. In our experience, it is more likely that the lab has misspelled the station name than that you really drilled a new well, although obviously this is not always the case. It is better to enter the sites and stations first, and then associate the samples and analyses with that data during import. The import should then check to make sure the sites and stations are there, and tell you if they aren't, so you can do something about it.

On many projects the sample information follows two paths. The samples and field data are gathered in the field. The samples go to the laboratory for analysis, and that data arrives in the electronic data deliverable (EDD) from the laboratory. The field data may arrive directly from the field, or may be input by the laboratory.

[...] of data check-in (chain of custody forms, sampling dates, entire data set received), data entry into the database, checking the imported data against the import file, and querying and comparing the data to the current data set and to historical data. A data review flag should be provided in the database to allow users to track the progress of the data review, and to use the data [...]
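As a small illustration of associating the two data paths described above, the following sketch matches field measurements to laboratory results using a shared sample identifier. All names and values are hypothetical; the essential point is that both routes must carry the same sample key for the association to work.

```python
# Field data and lab data arrive by separate routes; join them on the
# sample identifier both routes carry.
field_data = {
    "2000-001": {"field_ph": 6.8, "water_level_ft": 12.4},
    "2000-002": {"field_ph": 6.9, "water_level_ft": 12.4},
}
lab_results = [
    {"lab_id": "2000-001", "parameter": "Naphthalene", "value": 119.0},
    {"lab_id": "2000-003", "parameter": "Naphthalene", "value": 2.1},
]

for result in lab_results:
    field = field_data.get(result["lab_id"])
    if field is None:
        # flag for follow-up rather than guessing at the association
        print(f"Sample {result['lab_id']}: lab data arrived, no field data yet")
    else:
        result.update(field)  # attach field measurements to the same sample
```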
Figure 63 - Screen to configure content-specific filtering

Figure 64 - Screens showing results of a successful and an unsuccessful import

TRACKING IMPORTS

Part of the administration of the data management task should include keeping records of the import process. After trying an import, the software should notify you of the result (see Figure 64). Records should be kept of both unsuccessful and successful imports. [...]

[...] depending on the status of the review. The details of how data review is accomplished, and by whom, must be worked out on a project-by-project basis. The software should provide a data storage location and manipulation routines for information about the data review status of each analytical value in the system. The system should allow the storage of data with different levels of data checking. A method should [...]

[...] third-party validation of the laboratory data. Flags in the data review table can be used to indicate if this procedure has been performed, while still tracking the data import and review steps required for entry into the data management system. It is also possible for various levels of data review to be performed prior to importing the data. Laboratories and consultants could be asked to provide data [...] level of checking, and this information brought in with the data. The following table shows some typical review codes that might be associated with analytical data:

Data Review Code    Data Review Status
0                   Imported
1                   Vintage (historical) data
2                   Data entry checked
3                   Sampler error checked
4                   Laboratory error checked
5                   Consistent with like data
6                   Consistent with previous data
7                   In-house validation
8                   Third-party validation
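A sketch of what import tracking and initial review status might look like in practice is shown below. The log table layout, file names, and row counts are assumptions for illustration; the review code constant follows the table above, with newly imported analyses starting at code 0 and being promoted as each review step is completed.

```python
# Record each import attempt, successful or not, in a simple log table.
import datetime
import sqlite3

REVIEW_IMPORTED = 0  # code 0: imported, no review steps completed yet

def log_import(con, edd_file, row_count, succeeded, message=""):
    con.execute(
        "CREATE TABLE IF NOT EXISTS import_log "
        "(imported_at TEXT, edd_file TEXT, rows INTEGER, "
        " succeeded INTEGER, message TEXT)")
    con.execute(
        "INSERT INTO import_log VALUES (?, ?, ?, ?, ?)",
        (datetime.datetime.now().isoformat(), edd_file,
         row_count, int(succeeded), message))
    con.commit()

con = sqlite3.connect("edms.db")
log_import(con, "lab1_aug2000.csv", row_count=42, succeeded=True)
# failed attempts get logged too, with the checker's error report attached
log_import(con, "lab2_aug2000.csv", row_count=0, succeeded=False,
           message="Row 3: unknown station 'MW1'")
```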