PART TWO - SYSTEM DESIGN AND IMPLEMENTATION

© 2002 by CRC Press LLC

CHAPTER 5
GENERAL DESIGN ISSUES

The success of a data management task usually depends on the tool used for that task. The psychologist Abraham Maslow is often credited with the saying, “When all you have is a hammer, everything looks like a nail.” This is as true in data management as in anything else. People who like to use a word processor or a spreadsheet program are likely to use the tool they are familiar with to manage their data. But just as a hammer is not the right tool to tighten a screw, a spreadsheet is not the right tool to manage a large and complicated database. A database management program should be used instead. This section discusses the design of the database management tool, and how the design can influence the success of the project.

DATABASE MANAGEMENT SOFTWARE

Database management programs fall into two categories, desktop and client-server. The use of the two different types and decisions about where the data will be located are discussed in the next section. This section will discuss database applications themselves and briefly discuss the features and benefits of the programs. The major database software vendors put a large amount of effort into expanding and improving their products, so these descriptions are a snapshot in time. For an overview of desktop and Web-based database software, see Ross et al. (2001).

Older database systems were, for the most part, based on dBase, or at least the dBase file format. dBase started in the early days of DOS, and was originally released as dBase II because that sounded more mature than calling it 1.0. If anyone tells you that they have been doing databases since dBase 1 you know they are bluffing. dBase was an interpreted application, meaning that the code was translated into machine instructions each time it was run rather than compiled once in advance, which was slow on those early computers.
This created a market for dBase compilers, of which FoxPro was the most popular. Both used a similar data format in which each data table was called a database file or .dbf. Relationships were defined in code, rather than as part of the data model. Much data has been, and still is in some cases, managed in this format. These files were designed for single-user desktop use, although locking capabilities were added in later versions of the software to allow shared use.

Nowadays Microsoft Access dominates the desktop database market. This program provides a good combination of ease of use for beginners and power for experts. It is widely available, either as a stand-alone product or as part of the Microsoft Office suite. Additional information on Access can be found in books by Dunn (1994), Jennings (1995), and others, and especially in journals such as PC Magazine and Access/Visual Basic Advisor. Access has a feature that is common to almost all successful database programs, which is a programming language that allows users to automate tasks, or even build complete programs using the database software. In the case of Access, there are actually two programming models, a macro language that is falling out of favor, and a programming language. The programming language is called Visual Basic for Applications (VBA), and is a fairly complete development environment.

Since Access is a desktop database, it has limitations relative to larger systems. Experience has shown that for practical use, the software starts to have performance problems when the largest table in a database reaches a half million to a million records. Access allows multiple users to share a database, and no programming is required to implement this, but a dozen or so concurrent users is an upper limit for heavy-usage scenarios.
An alternative to Access is Paradox from Corel (www.corel.com). This is a programmable, relational database system, and is available as part of the Corel Office Suite. Paradox is a capable tool suitable for a complex database project, but the greater acceptance of Access makes Paradox an unlikely choice in the environmental business, where file sharing is common and Access is widespread.

The next step up from Access for many organizations is Microsoft SQL Server. This is a full-scale client-server system with robust security and a larger capacity than Access. It is moderately priced and relatively easy to use (for enterprise software), and raises the practical capacity to several million records. It is easy to attach an Access front end (user interface) to a SQL Server back end (data storage), so the transition to SQL Server is relatively easy when the data outgrows Access. This connection can be done using ODBC (Open DataBase Connectivity) or other connection methods.

For even larger installations, Oracle or IBM’s DB2 offer industrial-strength data storage, but with a price and learning curve to match. These can also be connected to the desktop using connection methods like ODBC, and one front-end application can be set up to talk to data in these databases, as well as to data in Access. Using this approach it is possible to create one user interface that can work with data in all of the different database systems.

A new category of database software that is beginning to appear is open-source software. Open-source software refers to programs where the source code for the software is freely available, and the software itself is often free as well. This type of software is popular for Internet applications, and includes such popular programs as the Linux operating system and Apache Web server. Two open-source database programs are PostgreSQL and MySQL (Jepson, 2001). These programs are not yet as robust as commercial database systems, but are improving rapidly.
They are available in commercial, supported versions as well as the free open-source versions, so they are starting to become options for enterprise database storage. And you can’t beat the price.

Another new category of database software is Web-based programs. These programs run in a browser rather than on the desktop, and are paid for with a monthly fee. Current versions of these programs are limited to a flat-file design, which makes them unsuitable for the complex, relational nature of most environmental data, but they might have application in some aspects of Web data delivery. Examples of this type of software include QuickBase from the authors of the popular Quicken and QuickBooks financial software (www.quickbase.com), and Caspio Bridge (www.caspio.com).

DATABASE LOCATION OPTIONS

A key decision in designing a data management system is where the data will reside. Related to this are a variety of issues, including what hardware and software will provide the necessary functionality, who will be responsible for data entry and editing, and who will be responsible for backup of the database.

Stand-alone

The simplest design for a database location is stand-alone. In this design, the data and the software to manage it reside on the computer of one user. That computer may or may not be on a network, but all of the functionality of the database system is local to that machine. The hardware and software requirements of a system like this are modest, requiring only one computer and one license for the database management software. The software does not need to provide access for more than one user at a time. One person is in control of the computer, software, and data.

For small projects, especially one-person projects, this type of design is often adequate. For larger projects where many people need access to the data, the single individual keeping the data can become a bottleneck.
This is particularly true when the retrievals required are large or complicated. The person responsible for the data can end up spending most or all of his or her time responding to data requests from users. When the data management requirements grow beyond that point, the stand-alone system no longer meets the needs of the project team, and a better design is required.

Shared file

Generally the next step beyond a stand-alone system is a shared file system. In a shared file system, the server (or any computer) stores the database on its hard drive like any other file. Clients access the file using database software on their computers the same way they would open any other file on the server. The operating system on the server makes the file available. The database software on the client computer is responsible for handling access to the database by multiple users.

An example of this design would be a system in which multiple users have Microsoft Access on their computers, and the database file, which has an extension of .mdb, resides on a server, which could be running Windows 95/98/ME or NT/2000/XP. When one or more users are working in the database file, their copy of Access maintains a second file on the server called a lock file. This file, which has an extension of .ldb, keeps track of which users are using the database and what objects in the database may be locked at any particular time. This design works well for a modest number of users in the database at once, providing adequate performance for a dozen or so users at any given time, and for databases up to a few hundred thousand records.

Client-server

When the load on the database increases to the point where a shared file system no longer provides adequate performance, the next step is a client-server system. In this design, a data manager program runs on the server, providing access to the data through a system process.
One computer is designated the server, and it holds the data management software and the data itself. This system may also be used as the workstation of an individual user, but in high-volume situations this is not recommended. More commonly, the server computer is placed on a network with the computers of the users, which are referred to as clients. The software on the server works with software on the client computers to provide access to the data.

The following diagram covers the internal workings of a client-server EDMS. It contains two parts, the Access component at the top and the SQL Server part at the bottom. In discussing the EDMS, the Access component is sometimes called the user interface, since that is the part of the system that users see, but in fact both Access and SQL Server have user interfaces. The Access user interface has been customized to make it easy for the EDMS users to manage the data in ways useful to them. The SQL Server user interface is used without modifications (as provided by Microsoft) for data administration tasks. Between these user interfaces are a number of pieces that work together to provide the data management capabilities of the system.

Figure 17 - Client-server EDMS data flow diagram (components shown include the Access user interface, Access queries/modules, Access attachments, the security system with read/write and read-only paths, the server tables, the server volume, and the SQL Server / Oracle user interface)

Discussion of this diagram will start at the bottom and work toward the top, since this order starts with the least complicated parts (from the user’s perspective, not the complexity of the software code) and moves to the more complicated parts. That means starting on the SQL Server side and working toward the client side. This sequence provides the most orderly view of the system. In this diagram, the part with a gray background represents SQL Server, and the rest of the box is Access.

The basic foundation of the server side is the SQL Server volume, which is actually a file on the server hard drive that contains the data. The size of this volume is set when the database is set up, and can be changed by the administrator as necessary. Unlike many computer files, it will not grow automatically as data is added. Someone needs to monitor the volume size and the amount of data and keep them in sync. The software works this way because the location and structure of the file on the hard drive is carefully managed by the SQL Server software to provide maximum performance (speed of query execution).

The database tables are within the SQL Server volume. These tables are similar in function and appearance to the tables in Access. They contain all of the data in the system, usually in normalized database form. The data in the tables is manipulated through SQL queries passed to the SQL Server software via the ODBC link from the clients. Triggers and stored procedures can also be stored in the SQL Server volume to increase performance in the client-server system and to enforce referential integrity. If they wish, users can see the tables in the SQL Server volume through the database window in Access, but their ability to manipulate the data depends on the privileges that they have through the security system.
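As a rough illustration of how a trigger stored with the tables can enforce referential integrity on the server, the following Python sketch uses SQLite (included in the Python standard library) as a stand-in for SQL Server. The table names Stations and Samples, the trigger name, and the helper try_insert are hypothetical and were chosen only for this example. SQL Server's own trigger syntax is different, so this should be read as a sketch of the idea rather than of any particular product.

```python
import sqlite3

# In-memory database standing in for the server volume (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stations (StationName TEXT PRIMARY KEY);
CREATE TABLE Samples (
    SampleNumber INTEGER PRIMARY KEY,
    StationName  TEXT NOT NULL
);

-- Reject any sample whose station is not already in the parent table.
CREATE TRIGGER samples_need_station
BEFORE INSERT ON Samples
BEGIN
    SELECT RAISE(ABORT, 'sample refers to an unknown station')
    WHERE NOT EXISTS (
        SELECT 1 FROM Stations WHERE StationName = NEW.StationName
    );
END;

INSERT INTO Stations VALUES ('MW-1');
""")


def try_insert(connection, sample_number, station_name):
    """Attempt to add a sample; return True if the server accepted it."""
    try:
        connection.execute(
            "INSERT INTO Samples (SampleNumber, StationName) VALUES (?, ?)",
            (sample_number, station_name),
        )
        return True
    except sqlite3.Error:
        # The trigger aborted the statement, so the row was not stored.
        return False


print(try_insert(conn, 1, "MW-1"))   # station exists in Stations
print(try_insert(conn, 2, "MW-99"))  # station is missing from Stations
```

Because the rule lives with the tables rather than in each client program, it applies no matter which front end submits the insert.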
A system administrator should back up data in the SQL Server tables on a regular basis (at least daily).

The interface between the EDMS and the SQL Server tables is through a security system that is partly implemented in SQL Server and partly in Access. Most users should have read-only permission for working with the data; that is, they will be able to view data but not change it. A small group of users, called data administrators, should be able to modify data, which will include importing and entering data, changing data, and deleting data.

The actual connection between Access and SQL Server is done through attachments. Attachments in Access allow Access to see data that is located somewhere other than the current Access .mdb file as if it were in the current database. This is the usual way of connecting to SQL Server, and also allows us to provide the flexibility of attaching to other data sources.

Once the attachments are in place, the client interaction with the database is through Access queries, either alone or in combination with modules, which are programs written in VBA. Various queries provide access to different subsets of the data for different purposes. Modules are used when it is necessary to work with the data procedurally, that is, line by line, instead of the whole query as a set.

Distributed

The database designs described above were geared toward an organization where the data management is centralized, that is, where the data management activities are performed in one central location, usually on a local area network (LAN). With environmental data management, this is not always the case. Often the data for a particular facility must be available both at the facility and at the central office. The situation becomes even more complicated when the central office must manage and share data with multiple remote sites. This requires that some or all of the data be made available at multiple locations.
The following sections describe three ways to do this: wide-area networks, distributed databases with replication, and remote communication with subsets. The factors that determine which solution is best for an organization include the amount of data to be managed, how fresh the data needs to be at the remote locations, whether full-time communication between the facilities is available, and the speed of that communication.

Wide-area networks – In situations where a full-time, high-speed communication link is or can be made available (at a cost which is reasonable for the project), a wide-area network (WAN) is often the best choice. From the point of view of the people using it, the WAN hardware and software connect the computers just as with a LAN. The difference is that instead of all of the computers being connected directly through a local Ethernet or Token Ring network, some are connected through long-distance data lines of some sort.

Often there are LANs at the different locations connected together over the WAN. The connection between the LANs is usually done with routers on either end of the long-distance line. The router looks at the data traffic passing over the network, and data packets which have a destination on the other LAN are routed across the WAN to that LAN, where they continue on to their destination.

There are several options for the long-distance lines forming the WAN between the LANs. This discussion will cover some popular existing and emerging technologies. This is a rapidly changing industry, and new technologies are appearing regularly.

At the high end of connectivity options are dedicated leased-line solutions such as T1 (or in cases of very high data volume, T3) or frame relay. These services are connected full-time, and provide high to moderate speeds ranging from 56 kilobits per second (kbps) to 1 megabit per second (Mbps) or more. These services can cost $1000 per month or more.
This is proven technology, and is available, at a cost, to nearly any facility.

Recently, newer digital services have become available for some areas. Integrated Services Digital Network (ISDN) provides 128 kbps for around $100 per month. Digital Subscriber Line (DSL) provides connectivity ranging in speed from 256 kbps to 1.5 Mbps or more. Prices can be as low as $40 per month, so this service can be a real bargain, but service is limited to a fairly short distance from the telephone company central office, so it is not available to many locations. Cable modems promise to give DSL a run for its money. Cable service is not widely available right now, especially for business locations, and when it is, it will focus more on residential than business customers, since that is where cable is currently connected.

Another option is standard telephone lines and analog modems. This is sometimes called POTS (plain old telephone system). This provides 56 kbps, or more with modem pooling, and the connection is made on demand. The cost is relatively low ($15 to $30 per month) and the service is available everywhere.

In order to have WAN-level connectivity, you should have a full-time connection of about 1 Mbps or faster. If the connection speed available is less than this, another approach should be used.

Distributed databases with replication – There are several situations where a client-server connection over a WAN is not the best solution. One is where the connection speed is too low for real-time access. The second is where the data volume is extremely high, and local copies make more sense. In these situations, distributed databases can make sense. In this design, copies of the database are placed on local servers at each facility, and users work with the local copies of the data. This is an efficient use of computer resources, but raises the very important issue of currency of the data.
When data is entered or changed in one copy, the other copy is no longer the most current, and, at some point, the changes must be transferred between the copies. Most high-end database programs, and now some low-end ones, can do this automatically at specified intervals. This is called replication. Generally, the database manager software is smart enough to move only the changed data (sometimes called “dirty” records) rather than the whole database. Problems can occur when users make simultaneous changes to the same records at different locations, so this approach rapidly becomes complicated.

Remote communication with subsets – Often it is valuable for users to be able to work with part of the database remotely. This is particularly useful when the communication line is slow. In this scenario, users call in with their remote computers and attach to the main database, and either work with it that way or download subsets for local use. In some software this is as easy as selecting data using the standard selection screen, then instructing the EDMS to create a subset. This subset can be created on the user’s computer, then the user can hang up, attach to the local subset, then use the EDMS in the usual way, working with the subset. This works for retrieving data, but not as well for data entry or editing, unless a way is provided for data entered into the subset to be uploaded to the main database.

Internet/intranet

Tools are now available to provide access to data stored in relational database managers using a Web browser interface. At the time of this writing, in the opinion of the author, these tools are not yet ready for use as the primary interface in a sophisticated EDMS package. Specifically, the technology to provide an intuitive user interface with real-time feedback is too costly or impossible to build with current Web development tools.
Vendors are working on implementing that capability, but the technology is not currently ready for prime time, at least for everyday users, and the current technology of choice is client-server.

It is now feasible, however, to provide a Web interface with limited functionality for specific applications. For example, it is not difficult to provide a public page with summaries of environmental data for plants. The more “canned” the retrieval is, the easier it is to implement in a browser interface, although allowing some user selection is not difficult.

In the near future, tools like Dynamic HTML, Active Server Pages, and better Java applets, combined with universal high-speed connections, will make it much easier to provide an interactive user interface, hosted in a Web browser. At that time, EDMS vendors will certainly provide this type of interface to their clients’ databases.

The following figure shows a view of three different spectra provided by the Internet and related technologies. There are probably other ways of looking at it, but this view provides a framework for discussing products and services and their presentation to users. In these days of multi-tiered applications, this diagram is somewhat of an over-simplification, but it serves the purpose here.

Figure 18 - The Internet spectrum (each spectrum runs from Local to Global. Applications: stand-alone, shared files, client-server, Web-enabled, Web-based. Data: proprietary, commercial, public domain. Users: desktops, laptops, PDAs etc., public portals.)

The overall range of the diagram in Figure 18 is from Local on the left to Global on the right. This range is divided into three spectra for this discussion. The three spectra, which are separate but not unrelated, are applications, data, and users.

Applications – Desktop computer usage started with stand-alone applications. A program and the data that it used were installed on one computer, which was not attached to others.
With the advent of local area networks (LANs), and in some organizations wide-area networks (WANs), it became possible to share files, with the application running on the local desktop and with the data residing on a file server. As software evolved and data volumes grew, software was installed on both the local machine (client) and the server, with the user interface operating locally, and data storage, retrieval, and sometimes business logic operating on the server.

With the advent of the Internet and the World Wide Web, sharing on a much broader scale is possible. The application can reside either on the client computer and communicate with the Web, or it can run on a Web server. The first type of application can be called Web-enabled. An example of this is an email program that resides locally, but talks through the Web. Another example would be a virus-scanning program that operates locally but goes to the Web to update its virus signature files. The second type of application can be called Web-based. An example of this would be a browser-based stock trading application.

Many commercial applications still operate in the range between stand-alone and client-server. There is now a movement of commercial software to the right in this spectrum, to Web-enabled or Web-based applications, probably starting with Web-enabling current programs, and then perhaps evolving to a thin-client, browser-based model over time. This migration can be done with various parts of the applications at different rates depending on the costs and benefits at each stage. New technologies like Microsoft’s .NET initiative are helping accelerate this trend.

Data – Most environmental database users currently work mainly with data that they generate themselves. Their base map data is usually based on CAD drawings that they create, and the rest of their project data comes from field measurements and laboratory analyses, which the data manager (or their client) pays for and owns.
This puts them to the left of the spectrum in the above figure. Many vendors now offer both base map and other data, either on the Web or on CD-ROM, which might be of value to many users. Likewise, government agencies are making more and more data, spatial and non-spatial, available, often for free. As vendors evolve their software and Web presence, they can work toward integrating this data into their offerings. For example, software could be used to load a USGS or Census Bureau base map, and then display sites of environmental concern obtained from the EPA. Several software companies provide tools to make it possible to serve up this type of data from a modified Web server. Revenue can be obtained from purchase or rental of the application, as well as from access to the data.

Users – The World Wide Web has opened up a whole new world of options for computing platforms. These range from the traditional desktop computers and laptops through personal digital assistants (PDAs), which may be connected via wireless modem, to Web portals and other public access devices.

Desktops and laptops can run present and future software, and, as most are becoming connected to the Internet, will be able to support any of the computing models discussed above. PDAs and other portable devices promise to provide a high level of portability and connectivity, which may require re-thinking data delivery and display. Already there are companies that integrate global positioning systems (GPS) with PDAs and map data to show you where you are. Other possible applications include field data gathering and delivery, and a number of organizations provide this capability.

Web portals include public Internet access (such as in libraries and coffee shops) as well as other Internet-enabled devices like public phones. This brings up the possibility that applications (and data) may run on a device not owned by or controlled by the client, and argues for a thin-client approach.
This is all food for thought as we try to envision the evolution of environmental software products and services (see Chapter 27). What is clear is that the options for delivery of applications and data have broadened significantly, and must be considered in planning for future needs.

Multi-tiered

The evolution of the Internet and distributed computing has led to a new deployment model called “multi-tiered.” The three most common tiers are the presentation level, the business logic level, and the data storage level. Each level might run on a different computer. For example, the presentation level displayed to the user might run on a client computer, using either client-server software or a Web browser. The business logic level might enforce the data integrity and other rules of the database, and could reside on a server or Web server computer. Finally, the data itself could reside on a database server computer. Separating the tiers can provide benefits for both the design and operation of the system.

DISTRIBUTED VS. CENTRALIZED DATABASES

An important decision in implementing a data management system for an organization performing environmental projects for multiple sites is whether the databases should be distributed or centralized. This is particularly true when the requirements for various uses of the data are taken into consideration. This issue will be discussed here from two perspectives. The first perspective to be discussed will be that of the data, and the second will be that of the organization.

From the perspective of the data and the applications, the options of distributed vs. centralized databases are illustrated in Figures 19 and 20. Clearly it is easier for an application to be connected to a centralized, open database than a diverse assortment of data sources. The downside is the effort required to set up and maintain a centralized data repository.
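The trade-off between scattered sources and one open database can be sketched in a few lines of Python. In this minimal example, two made-up sources in different formats (a comma-separated laboratory deliverable and a tab-delimited field file) each need their own reader, which represents the setup effort. Once the data is loaded into a single table, any application can retrieve it with the same query. The file contents, the Results table, and the function names are all assumptions made for illustration, and SQLite stands in for whatever central database is actually used.

```python
import csv
import io
import sqlite3

# Two hypothetical data sources in different formats.
lab_csv = "station,parameter,value\nMW-1,Benzene,0.5\nMW-2,Benzene,1.2\n"
field_txt = "MW-1\tpH\t7.1\nMW-2\tpH\t6.8\n"


def read_lab(text):
    """Reader for the comma-separated lab file, which has a header row."""
    rows = csv.DictReader(io.StringIO(text))
    return [(r["station"], r["parameter"], float(r["value"])) for r in rows]


def read_field(text):
    """Reader for the tab-delimited field file, which has no header row."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    return [(station, parameter, float(value)) for station, parameter, value in rows]


# The setup effort: one reader per source, all loading one central table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Results (Station TEXT, Parameter TEXT, Value REAL)")
for reader, source in ((read_lab, lab_csv), (read_field, field_txt)):
    conn.executemany("INSERT INTO Results VALUES (?, ?, ?)", reader(source))


def results_for(station):
    """The payoff: mapping, reporting, or graphing code all ask the same way."""
    cursor = conn.execute(
        "SELECT Parameter, Value FROM Results WHERE Station = ? ORDER BY Parameter",
        (station,),
    )
    return cursor.fetchall()


print(results_for("MW-1"))
```

Without the central table, every application would need its own copy of both readers, and a new source format would mean changing each application rather than adding one loader.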
Figure 19 - Connection to diverse, distributed data sources (applications: mapping, statistics, project management, validation, Web access, reporting, graphing, planning. Data sources: GIS coverages, lab deliverables, CAD files, spreadsheets, legacy systems, ASCII files, word processor files, chain of custody, field notebooks, regulatory reports, hard copy files.)

Figure 20 - Connection to a centralized open database (the same applications, all connected to one centralized open database)

[...] benefits can be realized.

Figure 22 - Example of a simplified logical data model

THE DATA MODEL

The data model for a data management system is the structure of the tables and fields that contain the data. Creating a robust data model is one of the most important steps in building a successful data management system (Walls, 1999). If you are building a data management system from scratch, [...]

[...] developed and marketed by Geotech Computer Systems, Inc. called Enviro Data. Because this is a working system that has managed hundreds of databases and millions of records of site environmental investigation and monitoring data, it seems like a good starting point for discussing the issues related to a data model for storing this type of data.

Figure 23 - Table and field display [...]

[...] relationships are one-to-many rather than many-to-many, since one-to-many relationships are handled well by the relational data model, and many-to-many are not. (Many-to-many relationships can be handled in the relational data model. They require adding another table to track the links between the two tables. This table is sometimes called a join table. We don’t have to worry about that here.)
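For readers who do want to see one, a join table can be sketched briefly. The Python example below uses SQLite from the standard library. The Stations, Programs, and StationPrograms tables and their contents are hypothetical and are not taken from the Enviro Data model. The point is only that the third table holds one row per link, which turns a single many-to-many relationship into two one-to-many relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stations (StationID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Programs (ProgramID INTEGER PRIMARY KEY, Name TEXT);

-- The join table: one row for each station/program pairing.
CREATE TABLE StationPrograms (
    StationID INTEGER REFERENCES Stations (StationID),
    ProgramID INTEGER REFERENCES Programs (ProgramID),
    PRIMARY KEY (StationID, ProgramID)
);

INSERT INTO Stations VALUES (1, 'MW-1'), (2, 'MW-2');
INSERT INTO Programs VALUES (10, 'Quarterly'), (20, 'Annual');
INSERT INTO StationPrograms VALUES (1, 10), (1, 20), (2, 10);
""")


def programs_for(station_name):
    """List the programs a station belongs to, following the join table."""
    rows = conn.execute(
        """
        SELECT p.Name
        FROM Programs AS p
        JOIN StationPrograms AS sp ON sp.ProgramID = p.ProgramID
        JOIN Stations AS s ON s.StationID = sp.StationID
        WHERE s.Name = ?
        ORDER BY p.Name
        """,
        (station_name,),
    ).fetchall()
    return [name for (name,) in rows]


def stations_for(program_name):
    """List the stations in a program, reading the same links the other way."""
    rows = conn.execute(
        """
        SELECT s.Name
        FROM Stations AS s
        JOIN StationPrograms AS sp ON sp.StationID = s.StationID
        JOIN Programs AS p ON p.ProgramID = sp.ProgramID
        WHERE p.Name = ?
        ORDER BY s.Name
        """,
        (program_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(programs_for("MW-1"))
print(stations_for("Quarterly"))
```

One station can appear in several programs and one program can cover several stations, yet each of the three tables still takes part only in one-to-many relationships.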
[...] management of ODBC connections for all of the drivers that are installed. The second part consists of individual drivers for specific data sources. There are two kinds of ODBC drivers, single-tier and multi-tier. The single-tier drivers provide both the communication and data manipulation capabilities, and the data management software for that specific format itself is not required. Examples of single-tier [...]

[...] resources and network database server resources are available. This networking software is described in more detail in the next section.

Database management software

The next software element in the database system is the database management software itself. Examples of this software are Microsoft Access, FoxPro, and Paradox. This software can be used by itself to manage the data, or with the help of a customized [...]

[...] of the software.

Figure 24 - Simple screen for tracking database activity

Figure 25 - Output of activity log data

Database maintenance

There are a number of activities that must be performed on an ongoing or at least occasional basis to keep an EDMS up and running. These include:

Backup – Backing up data in the database is discussed in Chapter 15, but must be kept in mind as part [...]

[...] names, and files in DOS and Windows usually have a base name and an extension separated by a period, such as Mydata.dbf. The extension usually tells you what type of file it is. Older database systems often stored their data in the format of dBase, with an extension of .dbf. Access stores its data and programs in files with the extension of .mdb for Microsoft DataBase, and can store many tables and other [...]
[...] requirements of any of the Microsoft Office programs. At the time of this writing, the minimum and recommended computer specifications for adequate performance using the data management system are as shown in Figure 26.

Figure 26 (table, truncated in this copy). Items listed: computer, hard drive, memory, removable storage, display, network, peripherals. Minimum values that survive: computer, 200 megahertz Pentium processor; hard drive, adequate for software and local data storage, [...]

[...] appropriate), database management software, and the application.

Operating system

Most systems used for data management run one of the Microsoft operating systems: Windows 95, 98, ME, or NT/2000/XP. All of these systems can run the same client data management software and perform pretty much the same. Apple Macintosh systems are present in some places, but are used mostly for graphic design and education, and have [...]

[...] another file, and then you can’t easily work with all of your data. If you store your data in a standalone database manager program like Access, when your data grows you can relatively easily migrate to a more powerful database manager like SQL Server or Oracle. The ability of software and hardware to handle tasks of different sizes is called scalability, and this requirement should be part of your planning.

[...] series of one-to-many (also known as parent-child or hierarchical) relationships. [...]