Data Analysis Using SQL and Excel®, Second Edition
Gordon S. Linoff

Published by John Wiley & Sons, Inc., 10475 Crosspoint Boulevard, Indianapolis, IN 46256, www.wiley.com

Copyright © 2016 by John Wiley & Sons, Inc., Indianapolis, Indiana. Published simultaneously in Canada.

ISBN: 978-1-119-02143-8
ISBN: 978-1-119-02145-2 (ebk)
ISBN: 978-1-119-02144-5 (ebk)

Manufactured in the United States of America

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for
damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2015950486

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Excel is a registered trademark of Microsoft Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

To Giuseppe—for twenty five years, five books, and counting

About the Author

Gordon S. Linoff has been working with databases, big data, and data mining for almost longer than he can remember. With decades of experience on the practice of using data effectively, he is a recognized expert in the field of data mining. Gordon started using spreadsheets while a student at MIT, on the original Compaq Portable, the world’s
first luggable computer. Not very many years later, he managed a development group at the now‐defunct Thinking Machines Corporation, tasked with building a massively parallel relational database for decision support. After Thinking Machines’ demise, he founded Data Miners in 1998 with his friend and former colleague Michael J. A. Berry (who left in 2012). Since then, he has worked on a wide diversity of projects across many different companies. He has taught hundreds of classes around the world on data mining and survival analysis through SAS Institute, a leader in statistical and business analytics software. He is also an avid contributor to Stack Overflow, particularly on questions related to databases, having the highest score in 2014. Together with Michael Berry, Gordon has written several influential books on data mining, including Data Mining Techniques for Marketing, Sales, and Customer Support, the first book on data mining to achieve a third edition. Gordon lives in New York with Giuseppe Scalia, his partner of 25 years.

Index

% (modulus operator), 112 A ACID (database properties, ADD_MONTHS( ) SQL function, 716 addresses, 380–381 email addresses, 381–382 aggregation conditional, 454–455 filtering and, 461–462 indexes and, 676 string, 456–458 string concatenation, 455–456 aliases, tables, 23 analytic functions See window functions animation in charts, 247–254 area charts, 63–64 stacked, 215 array functions, 156, 579 ASCII( ) SQL function, 372, 711 association rules, 465–466, 480–481 chi square, 491–496 heterogeneous, 496–499 different left- and right-hand sides, 499–502 item sets, 466 examples, 469–470 household combinations, 476–478 large, 471–473 multiple purchases of product, 478–480 product group combinations, 470–471 size, 473–475 two-way combinations, 466–469 one-way, 483–489 probability distribution, 481–483 product attributes and, 502 right-hand side, 502–503 sequential, 466, 503–506 two-way, 489–499 calculating, 489–490 chi-square and, 491–496 zero-way, 
481, 483 average time between orders, 388–390 average truncated tenure, 295–296 average value chart, 538–540 averages, 105 AUC (area under curve), 546–552 moving averages, trend line, 574–576 numeric variables and, 317–324 standard deviation, 105–107 standard error, 106 AVG( ) SQL function, 407, 435 B B-trees (index), 668–670 balanced samples, 113–115 Bank Identification Number (BIN), 66 bar charts (Excel) in cells, 57–59, 214 character-based, 57–58 conditional formatting, 58–59 stacked, 213 Bayes, Thomas, 550, 556 bell curve See normal distribution best fit line, 415–416 averages, 568 coefficients, calculation, 584–585 errors, 567–568 expected value, 565–568 formula, 565 inverse model, 568–569 LINEST( ) Excel function, 580–581 OLS (ordinary least squares), 563 price elasticity, 587–592 properties, 563–569 R² value, 581–584 regression, 562 residuals, 567–568 scatter plots, 571–572 trend lines, 571–576 weighted, 594–596 charts, 596 Solver, 597–599 SQL and, 596–597 big data, binary response models, 519–520 BINOM.DIST, 120, 122, 124–125 blanks, 709–710 BTRIM( ) SQL function, 709 bubble charts, non-numeric axes and, 471–472 weighted best fit line, 596–597 C Calendar table, 13, 203–204, 647, 697–701 calendars, 198–199 cardinality, 11 Cartesian products, 26 CASE expression, 33–34 CAST( ) SQL function, 205, 395, 397, 454, 712, 713, 719, 723, 724 CAT( ) SQL function, 706 CEILING( ) SQL function, 444 censoring, 266–268 interval censoring, 268 left-censoring, 266, 268 right-censoring, 266 census demographics block groups, 178 blocks, 178 census tracts, 178 county wealth, 181–183 income similarity/dissimilarity, 163–167 median income, 161–162 proportions of wealthy/poor, 162–163 values of wealth, distribution, 183–184 zip code comparison, orders, 167–172 Central Limit Theorem, 105–107 Ceres, 564 CHAR( ) Excel function, 148–149 character strings, 10 character-based bar charts, 57–58 CHARINDEX( ) SQL function, 216, 705 Chart wizard, 53–55 charts, 
histograms, 68–72 counts, 72–74 cumulative, 74–75 numeric values, 75–79 charts (Excel), 51 animation, 247–254 area charts, 63–64 bar charts in cells, 57–59 character-based, 57–58 conditional formatting, 58–59 clustered index, 672 column charts, 51–52 creating, 53–55 data, inserting, 51–53 formatting, 55–57 queries, 59 side-by-side columns, 59–60 stacked and normalized columns, 60 stacked columns, 60 composite (multi-column) index, 679–683 line charts, 63 link charts, 117–118 scatter plots, 64–65 sparklines, 65–68 X-Y charts, 64–65 CHECKSUM( ) SQL function, 111, 650 CHIDIST( ) function, 135 chi-square, 132–134, 138–140, 466, 498 association rules, 491–506 calculation, 134 SQL and, 139–140 degrees of freedom, 135–136 deviation, 133 dimensions, 141 distribution, 134–135 income similarity measurement, 163–167 multidimensional, 141–143 queries, 138–139 SQL, 141–143 versus lift (association rules), 493–495 SQL and, 135–137 COALESCE( ) SQL function, 206–207, 385, 425 collation, 84 COLUMN( ) Excel function, 533 column charts, 51–57 copies data, 54 creating, 53–55 data, inserting, 51–53 formatting, 55–57 queries, 59 side-by-side columns, 59–60 stacked and normalized columns, 60 stacked columns, 60 columns (tables), 7–8 alias, 23 appending, 19 comparing values, 86 date time columns, 640–641 foreign keys, 11, 27–29 maximum values, 79–80 minimum values, 79–80 mode, 80–81 numeric values, 9–10 partitioning, primary key, selecting, 18 summarizing all columns, 93 single column, 90–93 COMBIN( ) Excel function, 120 combinatorics, 116–122 competing risks, 342–352 expected churn, 344 hazard probabilities and, 345–346 involuntary churn, 343 migration churn, 344 survival, 346–352 voluntary churn, 343 CONCAT( ) SQL function, 706 concatenation, 705–706 aggregate string concatenation, 455–456 conditional aggregation, 33–34, 435, 454–455 conditional expressions, multiple, 686 conditional formatting bar charts, 58–59 cells, 532–5333 conditional probability, 553 conditional survival, 285–287 
confidence association rules, 483–484 charts, 322–323 statistics, 100–101 counting and, 122–123 hazards, 297–298 ratios, 129–131 confidence interval (statistics), 101 consecutive days of purchase, 391–393 constant, 711–712 constant hazard, 276–277 CONVERT( ) SQL function, 714 convert to string, 713–714 converting number to string, 723–724 convex conic quadratics, 600 correct classification matrix, 531–532 CORREL( ) Excel function, 208, 601 correlated subquery, 43–44, 691–693, 699 correlation coefficient, 208 count, customers, active, SQL and, 246–247 COUNT( ) SQL function, 24, 209, 691–694 COUNT(DISTINCT) SQL function, 25, 428–429, 440, 442 COUNTIF( ) Excel function, 156 counting, 115–116 combinatorics, 116–122 confidence and, 122–123 Null Hypothesis, 122–123 probability, 125–126 counts customers, 240–241 active, 239 different, 241–242 tenure segment, 242–246 date comparison, 205–210 dollars, 214–215 number of units measure, 211 products, 211–214 covering indexes, 674–675 Cox, David, 318 Cox proportional hazards regression, 258–259, 317–318 credit card numbers, 66, 649–650 CROSS JOIN operator, 369, 424–426 cross-joins, 26–27 CTE (common table expression), 36, 271–272, 624 cumulative distribution, 124 cumulative events (survival analysis), 417–420 cumulative gains chart, 543–546, 563–564 CURDATE( ) SQL function, 712 current, 712–713 CURRENT_DATE( ) SQL function, 712, 713 CURRENT_TIMESTAMP( ) SQL function, 226–227, 713 customer retention, 274–276 customer signatures, 609 ad hoc analysis and, 616 building data lookup, 625–628 driving table, 622–625 initial transaction, 628–629 pivoting, 629–637 summarizing, 637–639 customer-centric business metrics repository, 616–617 data mining modeling and, 616 data page/cache, 660–662 data sources current customer snapshot and, 612–613 external data, 614–615 initial customer information, 613 neighbors, 615 self-reported information, 614 transaction summaries, 615–616 design columns, 617–619 prediction model sets, 617 
profiling model sets, 617 time frames, 619–622 longitudinal information, 610–611 scoring models and, 616 customer tenure, 315–316 averages, 317–324 confidence bounds and, 322–324 hazard probabilities, 333–335 hazard ratios calculating with SQL, 326–327 calculating with SQL and Excel, 326 interpreting, 324–325 reasons for using, 327 SQL, 321–322 SQL and Excel, 320–321 customer value calculations, 298–305 estimated future revenue, 300–303, 305–308 estimated revenue, 299–300, 303–305 customers counting, 383–386 definition, 611–612 identifying, 368–378 D data See also big data character strings, 10 copying, to Excel, 54 data analysis definition, data exploration, 50 data flow operators AGGREGATE, 19 APPEND, 19 CASE, 33–34 CROSSJOIN, 19–20, 27–29, 424 FILTER, 18–19 IN, 34–35 JOIN, 20 LOOKUP, 19 OUTPUT, 18 READ, 18 SELECT, 18 SORT, 20 UNION, 19 UNION ALL, 33 data mining, addresses, 642 algorithms, 640 data mining models, 507–508, 616 categories, multiple, 514–515 directed, 508, 509 evaluation, 515 look-alikes, 515–521 lookup models AUC, 542–546 binary classification option, 526–528 Naïve Bayesian comparison, 558–559 order size and, 528–534 popular product, 522–524 prediction and, 525–526 probability of response, 534–540 profiling lookup models, 525–526 ROC charts, 540–546 model sets, 509–511 binary columns, 510 category columns, 510 date-time columns, 510 numeric columns, 510 prediction, 511–513 profiling, 511–513 target, 510 text columns, 511 naïve Bayesian models calculating, 549–555 cumulative gains, 557–558 lookup model comparison, 558–559 probabilities and, 546–549 scoring, 555–557 nearest-neighbor, 521–522 numeric value, estimates, 515 score set, 511 similarity models, 513 yes-or-no models, 513–514 propensity scores and, 514 data models, logical data models, partitioning, physical, data types, 9–10 dates, 199 INTERVAL, 202–203 times, 199 databases, 2–3 date in, 198–199 design, differences, 664–665 document databases, graph-based, relational, 
2–4 SQL, 2–3 time in, 198–199 dataflows, 16–18 edges, 16 nodes, 16 datasets combining, 19 naming conventions, 14–15 purchases, 14 sorting, 20 subscription, 13 DATE( ) SQL function, 712 date time columns, 199–200, 204–205, 640–641 DATEADD( ) SQL function, 225–229, 236, 245, 269, 333, 716, 717 DATE_ADD( ) SQL function, 716 DATEDIFF( ) SQL function, 218, 221, 224, 231, 262, 717, 718 DATE_FORMAT( ) SQL function, 713 DATENAME( ) SQL function, 215, 217, 224, 236, 715 DATEPART( ) SQL function, 204–205, 231, 715, 719 DATETIME, 199 date/time, 197–199 Calendar table, 203–204 comparisons, 224 components, extracting, 199 converting, to standard format, 201–202 count comparison, 205–210 data types, 199 dates without times, 204–205 DOWs (days of the week), 215–218 milestones, 221–225 next date, 225–229 time between, 218–221 duration, 202–203 time between, 218–221 functions COALESCE( ), 206–207 DAY( ), 199 EXTRACT( ), 199 HOUR( ), 199 MINUTE( ), 199 MONTH( ), 199 NOW( ), 201 SECOND( ), 199 YEAR( ), 199 intervals, 202–203 storage, 200 year-over-year comparisons, 229–230 by month, 231–239 by week, 231 date/time functions, 711 adding/subtracting days, 715–716 adding/subtracting months, 716–717 constant, 711–712 convert to string, 713–714 current, 712–713 day of week, 715 different between dates, 717–718 extracting date, 718–719 year, month, day of month, 714 DATE_TRUNC( ) SQL function, 718 DAY( ) SQL function, 199, 714 day of week, 715 DAYNAME( ) SQL function, 715 DAYOFWEEK( ) SQL function, 715 DAYS( ) SQL function, 717 DBMS_RANDOM.VALUE( ) SQL function, 722 decile, 546 DENSE_RANK( ) SQL function, 36, 529–530 degree symbol, 148–149 descriptive languages versus procedural, design, directed data mining, models, 509 distance, Euclidian method, 149–151 distribution, 124 probability, association rules and, 481–483 DMA (Designated Marketing Area), 177 document databases, domain (email), 382, 647–648 DOWs (days of the week), 215–218 milestones, 221–225 next date, 225–229 time 
between, 218–221 driving tables, 622–625 duration, 202–203 time between, 218–221 E edit distance, 379 email addresses, 381–382 entity-relationship diagrams, 6, 10–12 cardinality, 11 equijoins, 29–30 nonequijoins, 31 error bars (charts), 322–324 errors, best-fit-line regression, 567–568 Euclidian method for distance calculation, 149–151 evidence models See Naïve Bayesian models Excel charts, 2, 51 animation, 247–254 area charts, 63–64 bar charts in cells, 57–59 column charts, 51–57 data, copying to, 54 line charts, 63 link charts, 117–118 maps, 188–190 scatter plots, 64–65 sparklines, 65–68 X-Y charts, 64–65 EXISTS operator, 45–46 EXP( ) SQL function, 720–721 expected churn, 344 expected value calculation, 565–567 exponential function, 720–721 exponential curve, 277 exponential survival function, 280 exponential trend curves, 572–573 expressions conditional, multiple, 686 EXISTS, 460–461 EXTRACT( ) SQL function, 199, 714, 715, 718 extracting date, 718–719 F failure (survival analysis) See cumulative events FETCH FIRST clause, 71 filtering, 18–19 conditional aggregation and, 461–462 FIRST_VALUE( ) SQL function, 408 FIND( ) SQL function, 705 FIPS county codes, 177 FLOOR( ) SQL function, 444, 721 FOR XML PATH, 455–458 FORECAST( ) Excel function, 567 forecasting, 308–314 foreign keys, 11, 27–29 formatting charts, 55–57 conditional formatting, 58–59 queries, 39 FULL OUTER JOIN, 20, 31, 206 full table scan, 658 full text index, 671–672 functions Excel AND( ), 156 array functions, 156 CHIDIST( ), 135 CORREL( ), 208, 601 COUNTIF( ), 156 FORECAST( ), 567 HOUR( ), 199 IF( ), 156, 579 INTERCEPT( ), 565, 601 LEFT( ), 23 LINEST( ), 579, 580–581 LOGEST( ), 579 MINUTE( ), 199 MINVERSE( ), 579 MMULT( ), 579 NOW( ), 201 OR( ), 156 SECOND( ), 199 SLOPE( ), 565, 601 SUM( ), 156, 579 SUMIF( ), 156 SUMPRODUCT( ), 156 TRANSPOSE( ), 579 VLOOKUP( ), 286–287, 472 date/time, 711 adding/subtracting days, 715–716 adding/subtracting months, 716–717 ADD_MONTHS( ), 716 CAST( 
), 712, 713, 719 constant, 711–712 CONVERT( ), 714 convert to string, 713–714 CURDATE( ), 712 current, 712–713 CURRENT_DATE( ), 712, 713 CURRENT_TIMESTAMP( ), 713 DATE( ), 712 DATEADD( ), 716, 717 DATE_ADD( ), 716 DATEDIFF( ), 717, 718 DATE_FORMAT( ), 713 DATENAME( ), 715 DATEPART( ), 715, 719 DATE_TRUNC( ), 718 DAY( ), 714 day of week, 715 DAYNAME( ), 715 DAYOFWEEK( ), 715 DAYS( ), 717 different between dates, 717–718 EXTRACT( ), 714, 715, 718 extracting date, 718–719 GETDATE( ), 713 INTCK( ), 718 INTNX( ), 716 MONTH( ), 714 MONTHS_BETWEEN( ), 717, 718 PUT( ), 713, 715 REPLACE( ), 713 TIMESTAMPDIFF( ), 718 TO_CHAR( ), 713, 715 TODAY( ), 713 TRUNC( ), 713, 719 WEEKDAY( ), 715 year, month, day of month, 714 YEAR( ), 714 DAY( ), 199 EXTRACT( ), 199 FLOOR( ), 444 greatest/least, 724–725 LEAD( ), 392 least/greatest, 724–725 list table columns, 727–728 LOG( ), 274 mathematical CAST( ), 723, 724 converting number to string, 723–724 DBMS_RANDOM.VALUE( ), 722 EXP( ), 720–721 exponential function, 720–721 FLOOR( ), 721 GREATEST( ), 724 LEAST( ), 724 left padding integers, 722–723 LPAD( ), 722, 723 MOD( ), 719 natural logs, 720–721 power, 720 POWER( ), 720 PUT( ), 723 PUTN( ), 723 RAND( ), 722 random numbers, 721–722 remainder/modulo, 719 RIGHT( ), 722, 723 SELECT( ), 725–728 TO_CHAR( ), 723 MIN( ), 156 MOD( ), 148–149 MONTH( ), 199 NTILE( ), 396 NULLIF( ), 290 RAND( ), 110–111 RANK( ), 395–396 results in multiple rows, 726–727 results with one row, 725–726 SQL ASCII( ), 372 average of integers, 729 CEILING( ), 444 CHARINDEX( ), 216 COALESCE( ), 206–207, 385, 425 COUNT( ), 209, 691–694 strings, 704–705 ASCII( ), 711 blanks, 709–710 BTRIM( ), 709 CAT( ), 706 CHARINDEX( ), 705 CONCAT( ), 706 FIND( ), 705 INSTR( ), 704–705 LEFT( ), 710 LEN( ), 707 length, 706–707 LENGTH( ), 706, 707 LOCATE( ), 704, 705 LTRIM( ), 709 position, 704–705 POSITION( ), 705 POSSTR( ), 704 RANK( ), 711 REPLACE( ), 708, 709 RIGHT( ), 709–710 RXCHANGE( ), 709 SUBSTR( ), 707–708, 710 SUBSTRING( ), 707, 
708 substring replacement, 708–709 substrings, 707–708 SUBSTRN( ), 708, 710 TRIM( ), 709 window functions, 694–701, 728 YEAR( ), 199 function-based indexes, 672–673 G Gauss, Carl Friederich, 564 Gaussian distribution, 101–104 geocoding, 145 geographic information, 640 ancillary information, 181 hierarchies, 172–188 catchment areas, 179 census hierarchies, 178 counties, 177 DMAs (designated marketing areas), 177 electoral districts, 179 school districts, 179 zip codes, 172–176 zip+2, 179 zip+4, 179 IP address lookups, 180 mobile devices, 181 self-reported addresses and, 180 GETDATE( ) SQL function, 226, 713 GIS (geographic information systems), graph-based databases, GREATEST( ) SQL function, 724 GROUP BY clause, 30–35, 38–41, 689–691 H Hadoop, Halley, Edmund, 256–257 hardware, SQL, HASHBYTES( ) SQL function, 644 hash indexes, 670 hazard, constant, 276–277 survival and, 279–280 unobserved heterogeneity, 277 hazard calculation confidence, 297–298 empirical hazards estimation, 315 hazard probability, 264–265 long term, 313 SQL, 335–336 stop flag, 261–262 tenure, 262–264 censoring, 266–268 survival calculation, 269–271 time and, 265–266 time-zero covariates, 280–287 hazard probabilities, 256, 264–265 competing risk and, 345–346 customer tenure, 333–335 examples, 259–260 left truncation, 328–330 effects, 330–331 repairing, 331–333 survival over time, 287–293 point estimate, 269 time windows, 338–339 hazard ratios calculating with SQL, 326–327 calculating with SQL and Excel, 326 interpreting, 324–325 reasons for using, 327 heterogeneous associations, 496–499 histograms, 68–79 counts, 72–74 cumulative, 74–75 market basket analysis and, 431–433 numeric values equal-sized ranges, 77–79 numeric techniques, 75–76 string techniques, 77 Hive, HOUR( ) Excel function, 199 I IBM DB2 date/time functions, 712–718 integer averages, 729 least and greatest functions, 724 mathematical functions, 719–723 results functions, 725, 726 string functions, 704, 706–711 table columns 
functions, 727 window functions, 728 IF( ) Excel function, 156, 579 IN operator, 34–35 correlated subqueries, 43–44 EXISTS operator, 45–46 as join, 42–43 NOT EXISTS operator, 45–46 NOT IN operator, 44–45 indexes, 667 aggregation, 676 B-trees, 668–670 clustered, 672 composite, 679–683 covering indexes, 674 equality, 673–675 full text, 671–672 hash indexes, 670 index lookup, 658–659 indicator variables, 33–34, 38, 283, 642–643 infant mortality rate, 259 INFORMATION_SCHEMA COLUMNS, 93 inverted, 671–672 limitations, 676–678 ORDER BY clause, 675–676 R-trees, 670–671 spatial indexes, 670–671 WHERE clause, 673, 675 inner joins, 31 INSTR( ) SQL function, 704–705 INTCK( ) SQL function, 718 INTERCEPT( ) Excel function, 565, 601 INTERVAL data type, 199, 202–203 intervals (duration), 202–203 INTNX( ) SQL function, 716 inverse model, 568–569 involuntary churn, 343 IP address, 180–181 ISO 8601 (date format), 200–201 item sets, 466 examples, 469–470 household combinations, 476–478 large, 471–473 multiple purchases of product, 478–480 product group combinations, 470–471 size, 473–475 two-way combinations, 466–469 J JOIN operator, 20, 25–32, 422–423 joining tables, 19–20, 25–26, 458–460 cross-joins, 26–27 equijoins, 29–30 nonequijoins, 31 inner joins, 31 lookup joins, 27–29 outer joins, 31–32 self-joins, 30 IN operator and, 42–43 JSON, 457 K key-value pairs, 4–5 L LAG( ) SQL function, 391, 621, 697 latitude definition, 146–147 degrees, 147–149 distance, Euclidian method, 149–151 measurement, 147–149 scatter plots and, 155–160 LEAD( ) SQL function, 392, 621, 697 LEAST( ) SQL function, 724 LEFT( ) SQL function, 23, 66–67, 83, 710 left padding integers, 722–723 left truncation, 328–342 effects, 330–331 repairing, 331–333 time windowing and, 337–342 LEN( ) SQL function, 77, 82–85, 707 length, 706–707 LENGTH( ) SQL function, 706, 707 Levenshtein, Vladimir, 379 Levenshtein distance, 379 life expectancy, 256–258 lift association rules, 483–485, 487–488, 490 models, 544 
LIKE, 66–67, 677, 691 line charts, 63 linear regression, 561 See also best fit line coefficients, 566 LINEST( ) Excel function, 579, 580–581 link charts, 117–118 local part (email), 382 LOCATE( ) SQL function, 704, 705 location See also zip code tables census demographics chi-square, 163–167 income similarity/ dissimilarity, 163–167 median income, 161–162 proportions of wealthy/ poor, 162–163 distance between accurate method, 151–152 Euclidian method, 149–151 zip codes all, 152–154 nearest, 154–155 geocoding, 145 latitude definition, 146–147 degrees, 147–149 measurement, 147–149 longitude definition, 146–147 degrees, 147–149 measurement, 147–149 ZCTAs, 145 LOG( ) SQL function, 271–272 logarithmic trend curves, 572–573 LOGEST( ) Excel function, 579 logical data models, longitude See latitude longitudinal information, 616 look-alike models, 521–527 lookup joins, 27–29 lookup models, 528–546 lookup models (data mining) AUC, 542–546 binary classification option, 526–528 order size and, 528–534 popular product, 522–524 prediction and, 525–526 probability of response, 534–540 profiling lookup models, 525–526 ROC charts, 540–546 lookup tables customer dimension lookup tables, 627–628 fixed, 625–626 LPAD( ) SQL function, 722, 723 LTRIM( ) SQL function, 82, 709 M many-to-many relationships, 11 MapPoint, 188–190 MapReduce, maps Excel, reasons for, 188–190 MapPoint, 190 web-based, 190–191 market basket analysis, 421–422 best customers, 444–445 customers best, 444–445 one-time, 440–443 one-time customers, 440–443 order size consistency, 437–439 price changes, 435–437 products best customers, 444–445 duplicates, 426–431 geographic distribution, 448–451 multiples, 433–435 one-time customers, 440–443 shipping, 423–426 which customers have, 453 residual value, 445–448 scatter plots and, 422–423 units, histogram, 431–433 WHERE clause, 451–452 MATCH( ) Excel function mathematical functions converting number to string, 723–724 exponential function, 720–721 floor, 721 left padding 
integers, 722–723 natural logs, 720–721 power, 720 random numbers, 721–722 remainder/modulo, 719 MAX( ) SQL function, 25, 441–442, 698–699, 700–701 metadata, migration churn, 344 MIN( ) SQL function, 25, 156 MINUTE( ) Excel function, 199 MINVERSE( ) Excel function, 579 MLE (maximum likelihood estimation), 318 MMULT( ) Excel function, 579 MOD( ) SQL function, 148–149, 719 mode (statistics), 80–81 model set, 515–517 modulus (%) operator, 112 monotonical decrease, 276 MONTH( ) SQL function, 199, 714 MONTHS_BETWEEN( ) SQL function, 717, 718 Monty Hall Paradox, 102 moving average, best fit line, 574–576 multiple conditional expressions, 686 multiple regression, 600 Excel, 601–602 Solver and, 604 SQL, 605–607 three input variables, 603–604 MySQL, 4, 665 date/time functions, 712–718 integer averages, 729 least and greatest functions, 724 mathematical functions, 719–723 results functions, 725, 726 string functions, 704–711 table columns functions, 727 window functions, 728 N Naïve Bayesian models calculating, 549–555 cumulative gains, 557–558 generalization, 553–555 lookup model comparison, 558–559 one variable, 551–553 probabilities, 546–547 conditional, 547 likelihood, 548–549 odds, 548 scoring, 555–557 naming conventions, datasets, 14–15 naming variables, subqueries for, 37–40 natural logs, 720–721 nearest neighbor models, 527–528 nested aggregations, 428 NEWID( ) SQL function, 111–115 nodes, dataflows, 16 nonequijoins, 31 nonstationarity, 534, 537–538 normal distribution, 101–104, 562–569 NORMDIST( ) Excel function, 108–109 NORMSDIST( ) Excel function, 103, 109 NoSQL, 4–5 NOT EXISTS operator, 45–46, 687–688, 692–693 NOT IN operator, 44–45, 687 NOW( ) Excel function, 201 NTILE( ) SQL function, 396 Null Hypothesis, 98–100 counting and, 122–123 NULL values, 8–9, 23, 376–377 NULLIF( ) SQL function, 86, 290, 322 number format (Excel), 170, 175, 176, 202, 235, 411, 423 numeric values, dates, 10 date-times, 10 integers, 10 real numbers, 10 O object function, 600 
odds, 554 OFFSET( ) Excel function, 155, 605 OLS (ordinary least squares), 563 dwarf planets, 564 one-at-a-time relationships, 11 one-to-one relationships, 11 one-way association rules, 483–485 evaluation information and, 486–488 generating, 485–486 product groups and, 488–489 optimization engine, OR( ) Excel function, 156 Oracle, 665 date/time functions, 712–719 integer averages, 729 least and greatest functions, 724 mathematical functions, 719–723 results functions, 725, 726 string functions, 705–711 table columns functions, 727 window functions, 728 ORDER BY clause, 22–24 weekday, 216 order notation, 656–657 outer joins, 31–32 OVER See window functions overfitting in models, 574 P parallel full table scan, 658 partial indexes, 673 partitioning, PARTITION BY See window functions PERCENTILE_CONT( ) SQL function, 396 PERCENTILE_DIST( ) SQL function, 396 performance, 663–665 improvement, 665–667 indexes, 662, 667 aggregation, 676 B-trees, 668–670, 672 composite, 679–683 covering indexes, 674 equality, 673–675 full text, 671–672 hash indexes, 670 inverted, 671–672 limitations, 676–678 ORDER BY clause, 675–676 R-trees, 670–671 spatial indexes, 670–671 WHERE clause, 673, 675 LEFT OUTER JOIN and, 684–685 OR and, 683–684 parallel processing, 663 processing engine, 663 queries conditions, 667 DISTINCT keyword, 666 storage management, 660–663 query engines full table scan, 658 index lookup, 659–660 order notation, 656–657 parallel full table scan, 658–659 physical data models, PI( ) Excel function, 149 Piazzi, Giuseppe, 564 pivoting, 629–630 channel pivot, 632–633 order line information pivot, 634–637 payment type, 630–632 values into columns, 635–643 year pivot, 633–634 point estimate, survival and, 269, 293–294 polynomial trend curves, 573–574 position, 704–705 POSITION( ) SQL function, 705 POSSTR( ) SQL function, 704 Postgres, date/time functions, 712–718 integer averages, 729 least and greatest functions, 724 mathematical functions, 719–723 results 
functions, 725, 726 string functions, 705–711 table columns functions, 727 window functions, 728 POWER( ) function, 720 power trend curves, 572–573 primary key columns, 9, 673 principal components, 604 Prizm codes (Claritas), 621 probability (statistics), 100–101, 125–126 distribution, association rules and, 481–483 procedural languages versus descriptive, Proportional Hazards Regression See Cox proportional hazards regression proportional stratified samples, 112–113 proportionality assumption, 318 proportions difference of, 131–132 standard error, 128–129 purchases dataset, 14 days in a row, 391–393 intervals, 390–391 span of time, 386 PUT( ) SQL function, 713, 715, 723 PUTN( ) SQL function, 723 p-value, 99, 105, 109–110, 124–125, 134 Q quantiles, 394–397 queries basics, 22–24 column charts, 60 formatting, 39 optimization engine and, result sets, 21 set-within-sets, 421 subqueries, 36–37 IN operator, 42–46 summary handling, 40–41 UNION ALL operator, 46 variable naming, 37–40 summary queries, 24–25 quintiles, 394–397, 438–439 R R² value, best fit line, 581–584 RADIANS( ) Excel function, 149 RAND( ) SQL function, 110–111, 722 random numbers, 721–722 random samples, 110–115 balanced, 113–115 repeatable, 111–112 stratified, 112–113 RANK( ) SQL function, 36, 395–396, 439, 444, 529–530, 711 ranking functions (SQL), 35–36, 396 ratios difference of proportions, 131–132 standard deviation, 128 confidence, 129–131 standard error of a proportion, 128–129 statistics and, 128 R-trees (index), 670–671 regression, 561–562 best-fit-line, 562 averages, 568 errors, 567–568 expected value, 565–567 formula, 565 inverse model, 568–569 LINEST( ) Excel function, 577–581 OLS, 563 price elasticity, 587–592 R² value, 170, 581–584 residuals, 567–568 trend lines, 571–576 weighted, 594–599 linear, 561, 562 coefficients, 566 multiple, 600 Excel, 601–602 Solver, 604 SQL, 605–607 three input variables, 603–604 weighted, 592–599 relational algebra, 20–21 relational databases, 2–4 
relationships entity-relationship diagrams, 11 many-to-many, 11 one-at-a-time, 11 zero/one-to-one, 11 remainder/modulo, 719 repeatable random samples, 111–112 repeated events, 367–368 REPLACE( ) SQL function, 227, 486, 708, 709, 713 residual value, 421, 445–448 best-fit-line regression, 567–568 RFM (recency, frequency, monetary) analysis, 393–403 cell calculation, 398–399 customer migration and, 400–403 dimensions, 394–398 frequency, 396–397 monetary, 397–398 quantiles, 394 recency, 394–396 limits, 403 RIGHT( ) SQL function, 709–710, 722, 723 right censoring, 337–342 ROC (receiver operating characteristics), 546–552 ROW_NUMBER( ) SQL function, 35–36, 407, 448, 529–530, 632, 635 ROW( ) Excel function, 533 rows (tables), filtering, 18–19 RXCHANGE( ) SQL function, 709 S sampling balanced samples, 113–115 margin of error, 97 proportional stratified samples, 112–113 random samples, 110–111 repeatable random samples, 111–112 SAS proc sql date/time functions, 712–719 integer averages, 729 least and greatest functions, 725 mathematical functions, 719–723 results functions, 725, 727 string functions, 705–711 table columns functions, 728 window functions, 728 scatter plots, 64–65 best fit line, 571–572 latitude/longitude, 155–160 log scales, 432 market basket analysis and, 422–423 multiple series, 524 non-numeric axes and, 471–472 points, labeling, 176 using clip-art, 387 SCF (Sectional Center Facility), 16–17 score set, 517 seasonality, 625–626 SECOND( ) Excel function, 199 SELECT statement, 5, 22 self-joins, 30 sequential association rules, 466, 503–506 set-within-a-set queries, 421 ShopKo, 455 signatures, customer signatures, 609 similarity models, 519 SLOPE( ) Excel function, 565, 578, 601 Solver, 562, 597–601 regression, multiple, 604 sparklines, 65–68 SQL (Structured Query Language), chi-square and, 135–137 databases, 2–3 date/time functions, 712–719 functions, LEFT( ), 23 hardware, integer averages, 729 least and greatest functions, 725
mathematical functions, 719–724 multiple regression, 605–607 MySQL, ORDER BY clause, 22–24 performance and (See performance) Postgres, queries basics, 22–24 chi-square, 141–143 formatting, 39 optimization engine and, result sets, 21 subqueries, 36–46 summary queries, 24–25 results functions, 725, 727 SELECT statement, 22 SQLite, string functions, 705–711 survival analysis, 271–272 column values, 272–274 dimensions, 274 table columns functions, 728 WHERE clause, 23 window functions, 35–36, 391–392, 406–409, 448–449, 694–701, 728 standard deviation, 105–107, 323 confidence, 129–131 ratios, 128 standard error, 106 standard error of a proportion, 128–129 statistics, 97 approaches, 107–110 averages, 105 standard deviation, 105–107 chi-square, 132–134, 138–140 calculation, 134 distribution, 134–135 SQL and, 135–137 confidence, 100–101 distribution, 124 cumulative distribution, 124 Monty Hall paradox, 102 normal distribution, 101–104 Null Hypothesis, 98–100 probability, 100–101, 125–126 ratios and, 128 survival analysis and, 258 z-score, 103–104 STDEV( ) Excel function, 602 STDDEV( ) SQL function, 106 stop flag (hazard calculation), 261–262 storage, date/time, 200 stratification, 316–317 averages, 317–324 strings Levenshtein, distance, 379 patterns addresses, 642 credit card numbers, 66, 643–644 email address, 641–642 product descriptions, 642–643 values aggregation, 456–458 case sensitivity, 82–83 characters, 83–85 concatenation, 455–456 histogram of length, 82 spaces, 82 structured data, subqueries, 36–37 IN operator, 42–46 summary handling, 40–41 UNION ALL operator, 46 variable naming, 37–40 subscription data, 13 SUBSTITUTE( ) Excel function, 95 SUBSTR( ) SQL function, 707–708, 710 SUBSTRING( ) SQL function, 83–87, 707, 708 substring replacement, 708–709 substrings, 707–708 SUBSTRN( ) SQL function, 708, 710 SUM( ) Excel function, 156, 579 SUM( ) SQL function, 25, 33, 446–447, 695–696 SUM(SUM)( ) SQL function, 446–447 SUMIF( ) Excel function, 156 summaries basic, 637 
complex, 637–639 customer behaviors declining usage, 650–653 time series, 644–648 weekend shoppers, 648–650 strings, patterns, 641–644 summary queries, 24–25, 40–41 SUMPRODUCT( ) Excel function, 156 support (association rules), 483–484 survival analysis See also hazard probabilities changes over time, 287–293 competing risk and, 346–352 conditional, 285–287 constant hazard and, 276–280 Cox proportional hazards regression, 258–259 curves, 347–352 customer lifetime average, 295–296 customer retention calculation, 274–276 customer tenure, 316–317 averages, 317–324 confidence bounds and, 322–324 SQL, 321–322 SQL and Excel, 320–321 customer value calculations, 298–299 estimated future revenue, 300–303, 305–308 estimated revenue, 299–300, 303–305 hazard probability, left truncation, 328–336 hazard ratios calculating with SQL, 326–327 calculating with SQL and Excel, 326 interpreting, 324–325 reasons for using, 327 life expectancy, 256–258 longitudinal studies, 258–259 past, 290–293 point estimate, 269, 293–294 ratio, 284–285 in SQL, 271–272 column values product, 272–274 dimensions, 274 stratification, 316–317 tenure, 269–271 customer half-life, 294–295 median customer tenure, 294–295 time-dependent covariates, 353–357 T tables, aliases, 23 best practices, 14–16 Calendar, 13–16, 203–204 Campaigns, 11, 14 Cartesian product, 26 columns, 7–8 appending, 19 comparing values, 86–90 foreign keys, 11, 27–29 maximum values, 79–80 minimum values, 79–80 mode, 80–81 numeric values, 9–10 partitioning, primary key, selecting, 18 summarizing, 90–95 Customers, 11, 14 joining, 19–20, 25–26 cross-joins, 26–27 equijoins, 29–30 inner joins, 31 lookup joins, 27–29 nonequijoins, 31 outer joins, 31–32 self-joins, 30 metadata, NULL values, 8–9 OrderLines, 11, 14 Orders, 11, 14 output, 18 Products, 11, 14 rows, filtering, 18–19 Subscribers, 13 zip code, 12–13 ZipCensus, 7, 11–12 ZipCounty, 11–12 tenure (hazard calculation), 262–264 average truncated tenure, 295–296 censoring, 266–268
survival calculation, 269–271 customer half-life, 294–295 median customer tenure, 294–295 time and, 265–266 time-zero covariates, 280–287 tercile, 444–445 text to columns wizard (Excel), 54 TEXT( ) Excel function, 175, 176, 201, 215, 238 thrashing, 678 time, 197–198 See also date/time average time between orders, 388–390 components, extracting, 199 data types, 199 databases, 198–199 duration, 202–203 functions DAY( ), 199 EXTRACT( ), 199 HOUR( ), 199 MINUTE( ), 199 MONTH( ), 199 NOW( ), 201 SECOND( ), 199 YEAR( ), 199 intervals, 202–203, 390–391 to next event, 416–420 span of purchases, 386 storage, 200 tenure (hazard calculation), 265–266 time zones, 203 time windowing, 336 left truncation and, 337–342 right censoring and, 337–342 time-dependent covariates, 353–366 cohort-based approach, 358–361 event effect, 361–366 scenarios, 353–356 survival forecasts and, 356–357 TIMESTAMPDIFF( ) SQL function, 718 time-zero covariates, 280–287 TO_CHAR( ) SQL function, 713, 715, 723 TODAY( ) SQL function, 713 TRANSPOSE( ) Excel function, 579 TREND( ) Excel function, 578 trend lines best fit, 571–572 logarithmic, 572–573 moving average, 574–576 polynomial, 573–574 power, 572–573 TRIM( ) SQL function, 709 TRUNC( ) SQL function, 713, 719 two-way association rules calculating, 489–490 chi-square and, 491–496 heterogeneous, 496–499 U UNION ALL operator, 33, 46, 209 UNION operator, 19 unobserved heterogeneity, 277 V values aggregating, 19 earliest/latest calculation, 404–413 first year/last year, 413–415 key-value pairs, 4–5 looking up, 19 VALUE( ) SQL function, 457 variables, naming, subqueries for, 37–40 vertical partitioning, VLOOKUP( ) Excel function, 286–287, 472 voluntary churn, 343 W WEEKDAY( ) Excel function, 223 WEEKDAY( ) SQL function, 715 weighted linear regression, 592–599 WHERE clause, 22–24, 33–45, 451–453 window functions, 35–36, 391–392, 406–409, 448–449, 694–701 controlling window event, 273 ranking functions, 396 See also specific ranking
functions wizards, Chart, 53–55 X-Y-Z X-Y charts, 64–65 XY Labeler, 176, 471–472 XML, 455–458 year, month, day of month, 714 YEAR( ) Excel function, 199 YEAR( ) SQL function, 199, 714 yes-or-no models, 519–520 ZCTA (zip code tabulation areas), 145 zero/one-to-one relationships, 11 zero-way association rules, 481, 483 zip code tables, 12–13 all within area, 152–154 census demographics, 167–172 chi-square and, 163–167 income similarity/dissimilarity, 163–167 median income, 161–162 wealthy/poor proportion, 162–163 distance calculation accurate method, 151–152 Euclidian method, 149–151 hierarchies, 172–176, 185–188 catchment areas, 179 census, 178 counties, 177 DMAs, 177 electoral districts, 179 school districts, 179 zip+2, 179 zip+4, 179 nearest, Excel and, 154–155 scatter plot maps, 155–160 state boundaries, 191–194 wealthiest zip code, 185 z-score, 103–104