Chapter 23: The SIMILARITY Procedure

Example 23.3: Sliding Similarity Analysis

This example illustrates how to use sliding similarity analysis to compare two time sequences. The SASHELP.WORKERS data set contains two similar time series variables (ELECTRIC and MASONRY), which represent employment over time. The following statements create an example data set that contains two time series of differing lengths, where the variable MASONRY has the first 12 and last 7 observations set to missing to simulate the lack of data associated with the target series:

   data workers;
      set sashelp.workers;
      /* keep MASONRY only from JAN1978 through DEC1981 */
      if '01JAN1978'D <= date < '01JAN1982'D then masonry = masonry;
      else masonry = .;
   run;

The goal of sliding similarity analysis is to find the slide index that corresponds to the most similar subsequence of the input series when compared to the target sequence. The following statements perform sliding similarity analysis on the example data set:

   proc similarity data=workers out=_NULL_ print=(slides summary);
      id date interval=month;
      input electric;
      target masonry / slide=index measure=msqrdev
                       expand=(localabs=3 globalabs=3)
                       compress=(localabs=3 globalabs=3);
   run;

The DATA=WORKERS option specifies that the input data set WORK.WORKERS is to be used in the analysis. The OUT=_NULL_ option specifies that no output time series data set is to be created. The PRINT=(SLIDES SUMMARY) option specifies that the ODS tables related to the sliding similarity measures and their summary be produced. The INPUT statement specifies that the input variable is ELECTRIC. The TARGET statement specifies that the target variable is MASONRY and that the similarity measure be computed using mean squared deviation (MEASURE=MSQRDEV). The SLIDE=INDEX option specifies observation index sliding. The COMPRESS=(LOCALABS=3 GLOBALABS=3) option limits local and global absolute compression to 3. The EXPAND=(LOCALABS=3 GLOBALABS=3) option limits local and global absolute expansion to 3.
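As a rough illustration of what observation index sliding computes, the following DATA step is a minimal sketch that slides the nonmissing MASONRY values along ELECTRIC and computes an unwarped mean squared deviation at each slide index. It is not the algorithm that PROC SIMILARITY uses: it ignores the EXPAND= and COMPRESS= warping limits, and the data set name SLIDE_MSD, the variable names SLIDE and MSD, and the array bound of 1000 are assumptions made only for this sketch.

```sas
/* Sketch only: unwarped sliding mean squared deviation.              */
/* SLIDE_MSD, SLIDE, MSD, and the bound 1000 are hypothetical names   */
/* and values chosen for illustration.                                */
data slide_msd (keep=slide msd);
   array inp[1000] _temporary_;   /* input sequence (ELECTRIC)        */
   array tar[1000] _temporary_;   /* target sequence (MASONRY)        */

   /* Load the full input series and the target window used in the   */
   /* example (JAN1978 through DEC1981).                              */
   do i = 1 to nobs;
      set sashelp.workers nobs=nobs point=i;
      inp[i] = electric;
      if '01JAN1978'd <= date < '01JAN1982'd then do;
         ntar + 1;
         tar[ntar] = masonry;
      end;
   end;

   /* Slide the target along the input one observation at a time.    */
   do slide = 0 to nobs - ntar;
      sse = 0;
      do j = 1 to ntar;
         sse + (inp[slide + j] - tar[j])**2;
      end;
      msd = sse / ntar;            /* mean squared deviation          */
      output;
   end;
   stop;                           /* required when POINT= is used    */
run;

proc print data=slide_msd;
run;
```

Because warping is omitted, the values need not match the measures that the procedure reports. The sketch shows only how a slide index maps the target onto successive subsequences of the input.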
Output 23.3.1 Summary of the Slide Measures

   The SIMILARITY Procedure

   Slide Measures Summary for Input=ELECTRIC and Target=MASONRY

                          Slide      Slide
                         Target      Input      Slide       Slide
    Slide              Sequence   Sequence    Warping     Minimum
    Index     DATE       Length     Length     Amount     Measure

        0   JAN1977          48         51          3    497.6737
        1   FEB1977          48         51          1    482.6777
        2   MAR1977          48         51          0    474.1251
        3   APR1977          48         51          0    490.7792
        4   MAY1977          48         51         -2    533.0788
        5   JUN1977          48         51         -3    605.8198
        6   JUL1977          48         51         -3    701.7138
        7   AUG1977          48         51          3    646.5918
        8   SEP1977          48         51          3    616.3258
        9   OCT1977          48         51          3    510.9836
       10   NOV1977          48         51          3    382.1434
       11   DEC1977          48         51          3    340.4702
       12   JAN1978          48         51          2    327.0572
       13   FEB1978          48         51          1    322.5460
       14   MAR1978          48         51          0    325.2689
       15   APR1978          48         51         -1    351.4161
       16   MAY1978          48         51         -2    398.0490
       17   JUN1978          48         50         -3    471.6931
       18   JUL1978          48         49         -3    590.8089
       19   AUG1978          48         48          0    595.2538
       20   SEP1978          48         47         -1    689.2233
       21   OCT1978          48         46         -2    745.8891
       22   NOV1978          48         45         -3    679.1907

Output 23.3.2 Minimum Measure

   Minimum Measure Summary

   Input
   Variable     MASONRY

   ELECTRIC    322.5460

This analysis results in 23 slides based on the observation index. The minimum measure (322.5460) occurs at slide index 13, which corresponds to the time value FEB1978. Note that the original data set SASHELP.WORKERS was modified so that the target data begin at the time value JAN1978. This similarity analysis supports the belief, based on time series cross-correlation analysis, that ELECTRIC lags MASONRY by one month, despite the lack of target data (MASONRY).

The goal of seasonal sliding similarity measures is to find the seasonal slide index that corresponds to the most similar seasonal subsequence of the input series when compared to the target sequence.
The following statements repeat the preceding similarity analysis on the example data set with seasonal sliding:

   proc similarity data=workers out=_NULL_ print=(slides summary);
      id date interval=month;
      input electric;
      target masonry / slide=season measure=msqrdev;
   run;

Output 23.3.3 Summary of the Seasonal Slide Measures

   The SIMILARITY Procedure

   Slide Measures Summary for Input=ELECTRIC and Target=MASONRY

                          Slide      Slide
                         Target      Input      Slide       Slide
    Slide              Sequence   Sequence    Warping     Minimum
    Index     DATE       Length     Length     Amount     Measure

        0   JAN1977          48         48          0    1040.086
       12   JAN1978          48         48          0     641.927

Output 23.3.4 Seasonal Minimum Measure

   Minimum Measure Summary

   Input
   Variable     MASONRY

   ELECTRIC    641.9273

This analysis differs from the previous analysis in that the slides are performed based on the seasonal index (SLIDE=SEASON) with no warping. With a seasonality of 12, two seasonal slides are considered, at slide indices 0 and 12. The minimum measure (641.9273) occurs at slide index 12, which corresponds to the time value JAN1978. Note that the original data set SASHELP.WORKERS was modified so that the target data begin at the time value JAN1978. This similarity analysis supports the belief, based on seasonal decomposition analysis, that ELECTRIC and MASONRY have similar seasonal properties, despite the lack of target data (MASONRY).

Example 23.4: Searching for Historical Analogies

This example illustrates how to search for historical analogies by using seasonal sliding similarity analysis of transactional time-stamped data. The SASHELP.TIMEDATA data set contains the variable VOLUME, which represents activity over time.
The following statements create an example data set that contains two time series of differing lengths, where the variable HISTORY represents the historical activity and RECENT represents the more recent activity:

   data timedata;
      set sashelp.timedata;
      drop volume;
      recent  = .;
      history = volume;
      /* observations on or after 20AUG2000 form the recent series */
      if datetime >= '20AUG2000:00:00:00'DT then do;
         recent  = volume;
         history = .;
      end;
   run;

The goal of seasonal sliding similarity measures is to find the seasonal slide index that corresponds to the most similar seasonal subsequence of the input series when compared to the target sequence. The following statements perform similarity analysis on the example data set with seasonal sliding:

   proc similarity data=timedata out=_NULL_
                   outsequence=sequences outsum=summary;
      id datetime interval=dtday accumulate=total
         start='27JUL1997:00:00:00'dt
         end='21OCT2000:11:59:59'DT;
      input history / normalize=absolute;
      target recent / slide=season normalize=absolute measure=mabsdev;
   run;

The DATA=TIMEDATA option specifies that the input data set WORK.TIMEDATA be used in the analysis. The OUT=_NULL_ option specifies that no output time series data set is to be created. The OUTSEQUENCE=SEQUENCES and OUTSUM=SUMMARY options specify the output sequences and summary data sets, respectively. The ID statement specifies that the time ID variable is DATETIME, which is to be accumulated on a daily basis (INTERVAL=DTDAY) by summing the transactions (ACCUMULATE=TOTAL). The ID statement also specifies that the data are accumulated on weekly boundaries, starting with the week of 27JUL1997 and ending with the week of 15OCT2000 (START='27JUL1997:00:00:00'DT END='21OCT2000:11:59:59'DT). The INPUT statement specifies that the input variable is HISTORY, which is to be normalized by using absolute normalization (NORMALIZE=ABSOLUTE).
The TARGET statement specifies that the target variable is RECENT, which is to be normalized by using absolute normalization (NORMALIZE=ABSOLUTE), and that the similarity measure be computed by using mean absolute deviation (MEASURE=MABSDEV). The SLIDE=SEASON option specifies season index sliding.

To illustrate the results of the similarity analysis, the output sequence data set must be subset by using the output summary data set. The following statements store the summary measure in the macro variable MEASURE and keep only the sequences of the slide whose measure matches it:

   /* store the measure from the summary data set in &MEASURE */
   data _NULL_;
      set summary;
      call symput('MEASURE', left(trim(putn(recent,'BEST20.'))));
   run;

   /* keep only the slide whose measure matches &MEASURE */
   data result;
      set sequences;
      by _SLIDE_;
      retain flag 0;
      if first._SLIDE_ then do;
         if (&measure - 0.00001 < _SIM_ < &measure + 0.00001) then flag = 1;
      end;
      if flag then output;
      if last._SLIDE_ then flag = 0;
   run;

The following statements generate a cross series plot of the results:

   proc timeseries data=result out=_NULL_ crossplot=series;
      id datetime interval=dtday;
      var _TARSEQ_;
      crossvar _INPSEQ_;
   run;

The cross series plot illustrates that the historical analogy most similar to the most recent time series data, which started on 20AUG2000, occurred on 02AUG1998.

Output 23.4.1 Cross Series Plot of the Historical Time Series

Chapter 24: The SIMLIN Procedure

Contents

   Overview: SIMLIN Procedure
   Getting Started: SIMLIN Procedure
      Prediction and Simulation
   Syntax: SIMLIN Procedure
      Functional Summary
      PROC SIMLIN Statement
      BY Statement
      ENDOGENOUS Statement
      EXOGENOUS Statement
      ID Statement
      LAGGED Statement
      OUTPUT Statement
   Details: SIMLIN Procedure
      Defining the Structural Form
      Computing the Reduced Form
      Dynamic Multipliers
      Multipliers for Higher Order Lags
      EST= Data Set
      DATA= Data Set
      OUTEST= Data Set
      OUT= Data Set
      Printed Output
      ODS Table Names
   Examples: SIMLIN Procedure
      Example 24.1: Simulating Klein’s Model I
      Example 24.2: Multipliers for a Third-Order System
   References

Overview: SIMLIN Procedure

The SIMLIN procedure reads the coefficients for a set of linear structural equations, which are usually produced by the SYSLIN procedure. PROC SIMLIN then computes the reduced form and, if input data are given, uses the reduced form equations to generate predicted values. PROC SIMLIN is especially useful when dealing with sets of structural difference equations. The SIMLIN procedure can perform simulation or forecasting of the endogenous variables.

The SIMLIN procedure can be applied only to models that are:

   - linear with respect to the parameters
   - linear with respect to the variables
   - square (as many equations as endogenous variables)
   - nonsingular (the coefficients of the endogenous variables form an invertible matrix)

Getting Started: SIMLIN Procedure

The SIMLIN procedure processes the coefficients in a data set created by the SYSLIN procedure using the OUTEST= option or by another regression procedure such as PROC REG. To use PROC SIMLIN, you must first produce the coefficient data set and then specify this data set in the EST= option of the PROC SIMLIN statement. You must also tell PROC SIMLIN which variables are endogenous and which variables are exogenous. List the endogenous variables in an ENDOGENOUS statement, and list the exogenous variables in an EXOGENOUS statement.
The following example illustrates the creation of an OUTEST= data set with PROC SYSLIN and the computation and printing of the reduced form coefficients for the model with PROC SIMLIN.

   proc syslin data=in outest=e;
      model y1 = y2 x1;
      model y2 = y1 x2;
   run;

   proc simlin est=e;
      endogenous y1 y2;
      exogenous x1 x2;
   run;

If the model contains lagged endogenous variables, you must also use a LAGGED statement to tell PROC SIMLIN which variables contain lagged values, which endogenous variables they are lags of, and the number of periods of lagging. For dynamic models, the TOTAL and INTERIM= options can be used in the PROC SIMLIN statement to compute and print total and interim multipliers. (See "Dynamic Multipliers" later in this section for an explanation of multipliers.)

In the following example, the variables Y1LAG1, Y2LAG1, and Y2LAG2 contain lagged values of the endogenous variables Y1 and Y2. Y1LAG1 and Y2LAG1 contain values of Y1 and Y2 for the previous observation, while Y2LAG2 contains two-period lags of Y2. The LAGGED statement specifies the lagged relationships, and the TOTAL and INTERIM= options request multiplier analysis. The INTERIM=2 option prints matrices showing the impact that changes to the exogenous variables have on the endogenous variables after 1 and 2 periods.

   data in;
      set in;
      y1lag1 = lag(y1);
      y2lag1 = lag(y2);
      y2lag2 = lag2(y2);
   run;

   proc syslin data=in outest=e;
      model y1 = y2 y1lag1 y2lag2 x1;
      model y2 = y1 y2lag1 x2;
   run;

   proc simlin est=e total interim=2;
      endogenous y1 y2;
      exogenous x1 x2;
      lagged y1lag1 y1 1 y2lag1 y2 1 y2lag2 y2 2;
   run;

After the reduced form of the model is computed, the model can be simulated by specifying an input data set in the PROC SIMLIN statement and using an OUTPUT statement to write the simulation results to an output data set. The following example modifies the PROC SIMLIN step from the preceding example to simulate the model and store the results in an output data set.

   proc simlin est=e total interim=2 data=in;
      endogenous y1 y2;
      exogenous x1 x2;
      lagged y1lag1 y1 1 y2lag1 y2 1 y2lag2 y2 2;
      output out=sim predicted=y1hat y2hat
                     residual=y1resid y2resid;
   run;

Prediction and Simulation

If an input data set is specified with the DATA= option in the PROC SIMLIN statement, the procedure reads the data and uses the reduced form equations to compute predicted and residual values for each of the endogenous variables. (If no data set is specified with the DATA= option, no simulation of the system is performed, and only the reduced form and multipliers are computed.)

The character of the prediction is based on the START= value. Until PROC SIMLIN encounters the START= observation, actual endogenous values are found and fed into the lagged endogenous terms. Once the START= observation is reached, dynamic simulation begins, where predicted values are fed into lagged endogenous terms until the end of the data set is reached.

The predicted and residual values generated here are different from those produced by the SYSLIN procedure since PROC SYSLIN uses the structural form with actual endogenous values.
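To make the role of the START= value concrete, the following statements are a minimal sketch that builds a small artificial two-equation system, estimates it with PROC SYSLIN, and begins dynamic simulation at observation 20. The generated data, the coefficient values, the seed, the START=20 value, and the output data set name SIM_DYN are all assumptions chosen for illustration; they do not come from the examples in this chapter.

```sas
/* Sketch only: artificial data for a two-equation system.            */
/* All coefficients, the seed, and the series length are hypothetical.*/
data in;
   call streaminit(2024);
   y1lag1 = 0;
   y2lag1 = 0;
   do t = 1 to 40;
      x1 = rand('normal');
      x2 = rand('normal');
      /* predetermined parts of the two structural equations          */
      a = 0.5*y1lag1 + x1 + 0.1*rand('normal');
      b = 0.4*y2lag1 + x2 + 0.1*rand('normal');
      /* solve y1 = 0.3*y2 + a and y2 = 0.2*y1 + b simultaneously     */
      y1 = (a + 0.3*b) / 0.94;
      y2 = 0.2*y1 + b;
      output;
      y1lag1 = y1;
      y2lag1 = y2;
   end;
   drop a b;
run;

/* Estimate the structural equations and save the coefficients.       */
proc syslin data=in outest=e;
   model y1 = y2 y1lag1 x1;
   model y2 = y1 y2lag1 x2;
run;

/* Actual lagged values are used before observation 20; from          */
/* observation 20 onward, predicted values feed the lagged terms.     */
proc simlin est=e data=in start=20;
   endogenous y1 y2;
   exogenous x1 x2;
   lagged y1lag1 y1 1 y2lag1 y2 1;
   output out=sim_dyn predicted=y1hat y2hat
                      residual=y1resid y2resid;
run;
```

Comparing the residuals before and after the START= observation in SIM_DYN is one way to see the shift from predictions that use actual lagged values to predictions that use previously predicted values.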