PRACTICAL SQL
A Beginner’s Guide to Storytelling with Data
by Anthony DeBarros
San Francisco
Estadísticos e-Books & Papers

PRACTICAL SQL. Copyright © 2018 by Anthony DeBarros.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

ISBN-10: 1-59327-827-6
ISBN-13: 978-1-59327-827-4

Publisher: William Pollock
Production Editor: Janelle Ludowise
Cover Illustration: Josh Ellingson
Interior Design: Octopod Studios
Developmental Editors: Liz Chadwick and Annie Choi
Technical Reviewer: Josh Berkus
Copyeditor: Anne Marie Walker
Compositor: Janelle Ludowise
Proofreader: James Fraleigh

For information on distribution, translations, or bulk sales, please contact No Starch Press, Inc. directly:
No Starch Press, Inc.
245 8th Street, San Francisco, CA 94103
phone: 1.415.863.9900; info@nostarch.com
www.nostarch.com

Library of Congress Cataloging-in-Publication Data
Names: DeBarros, Anthony, author.
Title: Practical SQL : a beginner's guide to storytelling with data / Anthony DeBarros.
Description: San Francisco : No Starch Press, 2018 | Includes index.
Identifiers: LCCN 2018000030 (print) | LCCN 2017043947 (ebook) | ISBN 9781593278458 (epub) | ISBN 1593278454 (epub) | ISBN 9781593278274 (paperback) | ISBN 1593278276 (paperback) | ISBN 9781593278458 (ebook)
Subjects: LCSH: SQL (Computer program language) | Database design | BISAC: COMPUTERS / Programming Languages / SQL | COMPUTERS / Database Management / General | COMPUTERS / Database Management / Data Mining
Classification: LCC QA76.73.S67 (print) | LCC QA76.73.S67 D44 2018 (ebook) | DDC 005.75/6 dc23
LC record available at https://lccn.loc.gov/2018000030

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be
the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.

About the Author
Anthony DeBarros is an award-winning journalist who has combined avid interests in data analysis, coding, and storytelling for much of his career. He spent more than 25 years with the Gannett company, including the Poughkeepsie Journal, USA TODAY, and Gannett Digital. He is currently senior vice president for content and product development for a publishing and events firm and lives and works in the Washington, D.C., area.

About the Technical Reviewer
Josh Berkus is a “hacker emeritus” for the PostgreSQL Project, where he served on the Core Team for 13 years. He was also a database consultant for 15 years, working with PostgreSQL, MySQL, CitusDB, Redis, CouchDB, Hadoop, and Microsoft SQL Server. Josh currently works as a Kubernetes community manager at Red Hat, Inc.

BRIEF CONTENTS
Foreword by Sarah Frostenson
Acknowledgments
Introduction
Chapter 1: Creating Your First Database and Table
Chapter 2: Beginning Data Exploration with SELECT
Chapter 3: Understanding Data Types
Chapter 4: Importing and Exporting Data
Chapter 5: Basic Math and Stats with SQL
Chapter 6: Joining Tables in a Relational Database
Chapter 7: Table Design That Works for You
Chapter 8: Extracting Information by Grouping and Summarizing
Chapter 9: Inspecting and Modifying Data
Chapter 10: Statistical Functions in SQL
Chapter 11: Working with Dates and Times
Chapter 12: Advanced Query Techniques
Chapter 13: Mining Text to Find Meaningful Data
Chapter 14: Analyzing Spatial Data with PostGIS
Chapter 15: Saving Time with Views, Functions, and Triggers
Chapter 16: Using PostgreSQL from the Command Line
Chapter 17: Maintaining Your Database
Chapter 18: Identifying and Telling the Story Behind Your Data
Appendix: Additional PostgreSQL Resources
Index

CONTENTS IN DETAIL
FOREWORD by Sarah Frostenson
ACKNOWLEDGMENTS
INTRODUCTION
  What Is SQL?
  Why Use SQL?
  About This Book
  Using the Book’s Code Examples
  Using PostgreSQL
  Installing PostgreSQL
  Working with pgAdmin
  Alternatives to pgAdmin
  Wrapping Up
CREATING YOUR FIRST DATABASE AND TABLE
  Creating a Database
  Executing SQL in pgAdmin
  Connecting to the Analysis Database
  Creating a Table
  The CREATE TABLE Statement
  Making the teachers Table
  Inserting Rows into a Table
  The INSERT Statement
  Viewing the Data
  When Code Goes Bad
  Formatting SQL for Readability
  Wrapping Up
  Try It Yourself
BEGINNING DATA EXPLORATION WITH SELECT
  Basic SELECT Syntax
  Querying a Subset of Columns
  Using DISTINCT to Find Unique Values
  Sorting Data with ORDER BY
  Filtering Rows with WHERE
  Using LIKE and ILIKE with WHERE
  Combining Operators with AND and OR
  Putting It All Together
  Wrapping Up
  Try It Yourself
UNDERSTANDING DATA TYPES
  Characters
  Numbers
  Integers
  Auto-Incrementing Integers
  Decimal Numbers
  Choosing Your Number Data Type
  Dates and Times
  Using the interval Data Type in Calculations
  Miscellaneous Types
  Transforming Values from One Type to Another with CAST
  CAST Shortcut Notation
  Wrapping Up
  Try It Yourself
IMPORTING AND EXPORTING DATA

  indexes, 108
  inserting rows, 8–9
  key columns, 74
  modifying with ALTER statement, 137–138
  naming, 94, 96
  querying multiple tables using joins, 77
  relationships,
  size, 314
  temporary tables, 50
  viewing data,
tablefunc module, 203
table relationships
  many-to-many, 85
  one-to-many, 84
  one-to-one, 84
temporary table
  declaring, 50
  removing with DROP TABLE, 51
text data types, 24–26
  char, 24
  text, 25
  varchar, 6, 24
text operations
  case formatting, 212
  concatenation, 143
  escaping characters, 219
  extracting and replacing characters, 213–214
  formatting as timestamp, 173
  formatting with functions, 212–214
  matching patterns with regular expressions, 214
  removing characters, 213
  sorting, 16
text files, delimited. See delimited text files
text qualifier
  ignoring delimiters with, 41
  specifying with QUOTE option in COPY, 43
tilde-asterisk case-insensitive matching operator (~*), 228
tilde case-sensitive matching operator (~), 228
time data types
  interval, 32, 172
  matching with regular expression, 215
  time, 32, 172
  timestamp, 32, 172
timestamp, 32, 172
  calculations with, 180
  creating from components, 174–175, 225
  extracting components from, 173–174
  finding current date and time, 175–176
  formatting display, 187
  subtracting to find interval, 187
  timestamptz shorthand, 172
  with time zone, 32, 172
  within transactions, 176
time zones
  AT TIME ZONE keywords, 179
  automatic conversion of, 173, 175
  finding server setting, 177–178
  including in timestamp, 32, 173, 226
  setting, 178–180
  setting server default, 320
  standard name database, 33
  viewing names of, 177
  working with, 177
to_char() function, 187
to_tsquery() function, 232
to_tsvector() function, 231
transaction blocks, 149–151
  COMMIT, 149
  definition, 149
  ROLLBACK, 149
  START TRANSACTION, 149
  visibility to other users, 151
transactions, 149
  with time functions, 176
triggers, 267, 282
  BEFORE INSERT statement, 288
  CREATE TRIGGER statement, 285
  FOR EACH ROW statement, 285
  FOR EACH STATEMENT statement, 285
  NEW and OLD variables, 284
  RETURN statement, 285
  testing, 285, 288
trim_county() user function, 281
trim() function, 213
true (Boolean value), 74
ts_headline() function, 235
tsquery data type, 232
ts_rank_cd() function, 237
ts_rank() function, 237
tsvector data type, 231

U
uncorrelated subquery, 192
underscore wildcard for pattern matching (_), 19
UNIQUE constraint, 76, 105–106
Universally Unique Identifier (UUID), 35, 98
unnest() function, 68
unstructured data, 211
  parsing with regular expressions, 216, 222
UPDATE statement
  definition, 138
  PostgreSQL syntax, 139
  SET clause, 138
  using across tables, 138, 145, 192
  with CASE statement, 226
update_personal_days() user function, 279
upper() function, 212
USA TODAY, xxiii
U.S. Census
  2010 Decennial Census data, 43
    calculating population change, 89
    county shapefile analysis, 259
    description of columns, 45–47
    finding total population, 64
    importing data, 43–44
    racial categories, 60
    short form, 60
  2011–2015 American Community Survey
    description of columns, 156
    estimates and margin of error, 157
    importing data, 156
  apportionment of U.S. House of Representatives, 44
  methodologies compared, 157, 328
U.S. Department of Agriculture, 130
  farmers’ market data, 250
U.S. Federal Bureau of Investigation (FBI)
  crime report data, 167
UTC (Coordinated Universal Time), 33, 174
UTC offset, 33, 179, 187
UTF-8, 16
UUID (Universally Unique Identifier), 35, 98

V
VACUUM command, 314
  ANALYZE option, 317
  autovacuum process, 316
  editing server setting, 319
  FULL option, 318
  monitoring table size, 314
  pg_stat_all_tables view, 317
  running manually, 318
  time of last vacuum, 317
  VERBOSE option, 318
VALUES clause with INSERT,
varchar data type, 6, 24
views, 267
  advantage of using, 268
  creating, 269–271
  deleting data with, 275
  dropping, 269
  inserting data with, 273–274
  inserting, updating, deleting data, 271
  LOCAL CHECK OPTION, 272, 273
  materialized, 268
  pg_stat_all_tables, 317
  queries in, 269
  retrieving specific columns, 271
  updating data with, 274

W
well-known text (WKT), 244
  extended, 248
  order of coordinates, 245
WHEN clause, 208
  in CASE statement, 227
WHERE clause, 17
  in UPDATE statement, 138
  filtering rows with, 17–19
  with DELETE FROM statement, 147
  with EXISTS clause, 139, 192
  with ILIKE operator, 19–20
  with IS NULL keywords, 133
  with LIKE operator, 19–20, 143
  with regular expressions, 228
whole numbers, 27
wildcard
  asterisk (*) in SELECT statement, 12
  percent sign (%), 19
  underscore (_), 19
window functions
  definition of, 164
  OVER clause, 164
  PARTITION BY clause, 165
WITH
  as Common Table Expression, 200
  options with COPY, 42
WKT (well-known text), 244
  extended, 248
  order of coordinates, 245
working tables, 148

X
XML, 35

Z
ZIP Codes, 135
  loss of leading zeros, 135
  repairing botched, 143

Practical SQL is set in New Baskerville, Futura, Dogma, and TheSansMono Condensed.

RESOURCES
Visit https://www.nostarch.com/practicalSQL/ for resources, errata, and more information.

More no-nonsense books from NO STARCH PRESS

THE BOOK OF R
A First Course in Programming and Statistics
by TILMAN M. DAVIES
JULY 2016, 832 pp., $49.95
ISBN 978-1-59327-651-5
color insert

DATA VISUALIZATION WITH JAVASCRIPT
by STEPHEN A. THOMAS
MARCH 2015, 384 pp., $39.95
ISBN 978-1-59327-605-8
full color

PYTHON CRASH COURSE
A Hands-On, Project-Based Introduction to Programming
by ERIC MATTHES
NOVEMBER 2015, 560 pp., $39.95
ISBN 978-1-59327-603-4

STATISTICS DONE WRONG
The Woefully Complete Guide
by ALEX REINHART
MARCH 2015, 176 pp., $24.95
ISBN 978-1-59327-620-1

THE MANGA GUIDE TO DATABASES
by MANA TAKAHASHI, SHOKO AZUMA, and TREND-PRO CO., LTD.
JANUARY 2009, 224 pp., $19.95
ISBN 978-1-59327-190-9

DOING MATH WITH PYTHON
Use Programming to Explore Algebra, Statistics, Calculus, and More!
by AMIT SAHA
AUGUST 2015, 264 pp., $29.95
ISBN 978-1-59327-640-9

PHONE: 1.800.420.7240 or 1.415.863.9900
EMAIL: SALES@NOSTARCH.COM
WEB: WWW.NOSTARCH.COM

FIND THE STORY IN YOUR DATA
This book uses PostgreSQL but is applicable to MySQL, Microsoft SQL Server, and other database systems.

SQL (Structured Query Language) is a popular programming language used to create, manage, and query databases. Whether you’re a marketing analyst, a journalist, or a researcher mapping neurons in the brain of a fruit fly, you’ll benefit from using SQL to tell the story hidden in your data.

Practical SQL is a fast-paced, plain-English introduction to programming with SQL. Following a primer on SQL language basics and database fundamentals, you’ll learn how to use the pgAdmin interface and PostgreSQL database system to define, organize, and analyze real-world data sets, such as crime statistics and U.S. Census demographics.

Next, you’ll learn how to create databases using your own data, write queries to perform calculations, and handle common roadblocks when dealing with public data. With the help of easy-to-follow exercises in each chapter, you’ll discover how to build powerful databases and find meaning in your data sets. You’ll also learn how to:
• Define the right data types for your information
• Aggregate, sort, and filter data to find patterns
• Identify and clean up any errors in your data
• Search text for meaningful data
• Create advanced queries and automate tedious tasks

Organizing and analyzing data doesn’t have to be dry and complicated. Find the story in your data with Practical SQL.

ABOUT THE AUTHOR
Anthony DeBarros is an award-winning data journalist whose career spans 30 years at news organizations including USA TODAY and Gannett’s Poughkeepsie Journal. He holds a master’s degree in information systems from Marist College.

THE FINEST IN GEEK ENTERTAINMENT™
www.nostarch.com

... https://www.nostarch.com/practicalSQL/

CREATE DATABASE analysis;
Listing 1-1: Creating a database named analysis

This statement creates a database on your server named analysis...

colleges and universities. Many of our projects included as much as 20 years’ worth of data, and one of my main tasks was to import all that data into a SQL database and analyze it. I had to calculate...

stay on track. Practical SQL starts with the basics of databases, queries, tables, and data that are common to SQL across many database systems. Chapters 13 to 17 cover topics more specific to