Chapter 23: The SIMILARITY Procedure

If the ACCUMULATE=TOTAL option is specified, the data are accumulated as follows:

   01MAR1999      40
   01APR1999       .
   01MAY1999      90

If the ACCUMULATE=AVERAGE option is specified, the data are accumulated as follows:

   01MAR1999      20
   01APR1999       .
   01MAY1999      30

If the ACCUMULATE=MINIMUM option is specified, the data are accumulated as follows:

   01MAR1999      10
   01APR1999       .
   01MAY1999      20

If the ACCUMULATE=MEDIAN option is specified, the data are accumulated as follows:

   01MAR1999      20
   01APR1999       .
   01MAY1999      20

If the ACCUMULATE=MAXIMUM option is specified, the data are accumulated as follows:

   01MAR1999      30
   01APR1999       .
   01MAY1999      50

If the ACCUMULATE=FIRST option is specified, the data are accumulated as follows:

   01MAR1999      10
   01APR1999       .
   01MAY1999      50

If the ACCUMULATE=LAST option is specified, the data are accumulated as follows:

   01MAR1999      30
   01APR1999       .
   01MAY1999      20

If the ACCUMULATE=STDDEV option is specified, the data are accumulated as follows:

   01MAR1999   14.14
   01APR1999       .
   01MAY1999   17.32

As the preceding examples show, the accumulated time series can have missing values even though the data set observations contain none.

Missing Value Interpretation

Sometimes missing values should be interpreted as unknown values. At other times the value is in fact known, such as when missing values are created by accumulation and the absence of observations should be interpreted as a zero value. In either case, the SETMISSING= option in the ID, INPUT, or TARGET statement specifies how missing values are interpreted. The SETMISSING=0 option should be used when missing observations are to be treated as zero values. In other cases, missing values should be interpreted as global values, such as the minimum or maximum value of the accumulated series. The accumulated and interpreted time series is used in subsequent analyses.
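The accumulation and missing value interpretation steps described above can be sketched as follows. This is only an illustration under stated assumptions: the data set WORK.EVENTS, the variables X and Y, and the observation values are hypothetical, and INTERVAL=MONTH is assumed to be the accumulation frequency. The X values were chosen so that their monthly totals agree with the ACCUMULATE=TOTAL example (40 for March, nothing for April, 90 for May).

```sas
/* Hypothetical transactional data: observations in March and May, none in April */
data work.events;
   input date : date9. x y;
   format date date9.;
   datalines;
19MAR1999 10 12
19MAR1999 30 25
11MAY1999 50 45
12MAY1999 20 30
23MAY1999 20 15
;
run;

/* Accumulate to monthly totals and interpret the empty month as zero */
proc similarity data=work.events out=work.working;
   id date interval=month accumulate=total setmissing=0;
   input x;
   target y / measure=absdev;
run;
```

With SETMISSING=0, the April value of each working series would be expected to be 0 rather than missing. Omitting the option leaves the accumulated missing value in place.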
Zero Value Interpretation

When querying certain databases for time-stamped data based on a particular time range, time periods that contain no data are sometimes assigned zero values. For certain analyses, it is more desirable to treat these values as missing. Often, these beginning or ending zero values need to be interpreted as missing values. The ZEROMISS= option in the ID, INPUT, or TARGET statement specifies that the beginning, ending, or both the beginning and ending zero values are to be interpreted as missing values.

Time Series Transformation

Transformations are useful when you want to stabilize the time series before computing the similarity measures. There are four built-in transformations available, for strictly positive series only. Let $y_t > 0$ be the original time series, and let $w_t$ be the transformed series. The transformations are defined as follows:

Log
   is the logarithmic transformation,
   $w_t = \ln(y_t)$

Logistic
   is the logistic transformation,
   $w_t = \ln\left( c y_t / (1 - c y_t) \right)$
   where the scaling factor $c$ is
   $c = (1 - e^{-6}) \, 10^{-\mathrm{ceil}(\log_{10}(\max(y_t)))}$
   and $\mathrm{ceil}(x)$ is the smallest integer greater than or equal to $x$.

Square root
   is the square root transformation,
   $w_t = \sqrt{y_t}$

Box-Cox
   is the Box-Cox transformation,
   $w_t = \begin{cases} (y_t^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \ln(y_t), & \lambda = 0 \end{cases}$

User-Defined
   is the transformation computed by a user-defined subroutine that is created by using the FCMP procedure, where User-Defined is the subroutine name.

Other time series transformations can be performed prior to invoking the SIMILARITY procedure by using the SAS/ETS EXPAND procedure or the DATA step.
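As a minimal sketch of transforming a series in the DATA step before invoking the procedure, the following statements apply the Box-Cox formula given above. The data set names WORK.SERIES and WORK.BOXCOX, the variables Y and W, and the value of the macro variable LAMBDA are all hypothetical and chosen only for illustration.

```sas
/* Hypothetical Box-Cox parameter; 0 selects the logarithmic case */
%let lambda = 0.5;

data work.boxcox;
   set work.series;              /* hypothetical input data set with variable y */
   if y > 0 then do;
      if &lambda = 0 then w = log(y);
      else w = (y**&lambda - 1) / &lambda;
   end;
   else w = .;                   /* defined only for strictly positive values */
run;
```

The transformed variable W could then be named in an INPUT or TARGET statement without a TRANSFORM= option.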
Time Series Differencing

After optionally transforming the series, the accumulated series can be simply or seasonally differenced by using the DIF= and SDIF= options in the INPUT or TARGET statement. Simple and seasonal differencing are useful when you want to detrend or deseasonalize the time series before computing the similarity measures.

For example, suppose $y_t$ is a monthly time series. The following examples of the DIF= and SDIF= options demonstrate how to simply and seasonally difference the time series:

   DIF=(1,3)    specifies first, then third, order differencing
   SDIF=(1,3)   specifies first, then third, order seasonal differencing

Additionally, assuming that $y_t$ is strictly positive, the TRANSFORM= option can be combined with the DIF= and SDIF= options in the INPUT or TARGET statement.

Time Series Missing Value Trimming

In some instances, missing values should be interpreted as unknown observations. At other times the value is in fact known and should be interpreted as zero. This is the case when missing values are created by accumulation and a missing observation should be interpreted as having no value (meaning a value of zero). In either case, the SETMISSING= option in the ID, INPUT, or TARGET statement specifies how missing observations are treated.

By default, missing values at the beginning and end of the data set are trimmed from the data set prior to analysis. This is equivalent to specifying TRIMMISS=BOTH.

Time Series Descriptive Statistics

After a series has been optionally accumulated and transformed, with missing values interpreted, descriptive statistics can be computed for the resulting working series by specifying the PRINT=DESCSTATS option. This option produces an ODS table that contains the sum, mean, minimum, maximum, and standard deviation of the working series.
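The options discussed in the preceding sections might be combined as in the following sketch. It rests on several assumptions: the data set WORK.MONTHLY and its variables are hypothetical and are generated only to make the example self-contained, INTERVAL=MONTH is assumed as the series frequency, and first-order simple and seasonal differencing are picked arbitrarily. The option names are taken from the text above; check the exact spellings against the procedure syntax before relying on them.

```sas
/* Hypothetical strictly positive monthly series with trend and seasonality */
data work.monthly;
   do i = 0 to 47;
      date = intnx('month', '01JAN1999'd, i);
      x = 100 + 2*i + 10*sin(2*constant('pi')*i/12);
      y = 120 + 3*i + 15*sin(2*constant('pi')*i/12);
      output;
   end;
   format date date9.;
   drop i;
run;

/* Log-transform, difference, trim missing values, and print descriptive statistics */
proc similarity data=work.monthly print=descstats;
   id date interval=month accumulate=total;
   input x / transform=log dif=(1) sdif=(1) trimmiss=both;
   target y / transform=log dif=(1) sdif=(1) trimmiss=both measure=absdev;
run;
```

Because the log transformation is requested, both series must be strictly positive, which the generated data satisfy.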
Input and Target Sequences

After the input and target working series are formed, they can be treated as two ordered sequences. Given an input time sequence, $x_i$, for $i = 1$ to $N_x$, where $i$ is the input sequence index, and a target time sequence, $y_j$, for $j = 1$ to $N_y$, where $j$ is the target sequence index, these sequences are analyzed for similarity.

Sliding Sequences

Similarity measures can be computed between the target sequence and any contiguous subsequence of the input time series. There are three types of sequence sliding:

   no sliding
   slide by time index
   slide by season index

For more information, see Leonard, Elsheimer, and Sloan (2008).

Time Warping

Time warping allows for the comparison between target and input sequences of differing lengths by compressing or expanding the input sequence with respect to the target sequence while respecting the order of the sequence elements. For more information, see Leonard, Elsheimer, and Sloan (2008).

Sequence Normalization

The working (input or target) sequence can be normalized prior to further analysis. Let $q_i$ be the original sequence with mean $\mu_q$ and standard deviation $\sigma_q$, and let $r_i$ be the normalized sequence. The normalizations are defined as follows:

Standard
   is the standard normalization,
   $r_i = (q_i - \mu_q) / \sigma_q$

Absolute
   is the absolute normalization,
   $r_i = (q_i - \min(q_i)) / (\max(q_i) - \min(q_i))$

User-defined
   is a user-defined normalization created by the FCMP procedure.

Sequence Scaling

The working input sequence can be scaled to the working target sequence. Sequence scaling is applied after normalization. Let $y_j$ be the working target sequence with mean $\mu_y$ and standard deviation $\sigma_y$. Let $x_i$ be the working input sequence and let $q_i$ be the scaled sequence.
The scaling is defined as follows:

Standard
   is the standard scaling,
   $q_i = (x_i - \mu_y) / \sigma_y$

Absolute
   is the absolute scaling,
   $q_i = (x_i - \min(y_j)) / (\max(y_j) - \min(y_j))$

User-defined
   is a user-defined scaling created by the FCMP procedure.

Similarity Measures

The working input sequence can be compared to the working target sequence to create a similarity measure. For more information, see Leonard, Elsheimer, and Sloan (2008).

User-Defined Functions and Subroutines

A user-defined routine can be written in the SAS language by using the FCMP procedure, or in the C language by using both the FCMP procedure and the PROTO procedure. The SIMILARITY procedure cannot use C language routines directly. The procedure can use only SAS language routines, which might or might not call C language routines. Creating user-defined routines is described more completely in the FCMP procedure and PROTO procedure documentation. The FCMP and PROTO procedures are part of Base SAS software.

The SAS language provides integrated memory management and exception handling, such as operations on missing values. The C language provides flexibility and allows the integration of existing C language libraries. However, proper memory management and exception handling are solely the responsibility of the user. Additionally, the support for standard C libraries is restricted. If you have a choice, it is highly recommended that you write user-defined functions and subroutines in the SAS language by using the FCMP procedure.

For each of the tasks previously described, the following sections describe the required subroutine or function signature and provide examples of using a user-defined routine with the SIMILARITY procedure.

Time Series Transformations

A user-defined transformation subroutine has the following signature:

   SUBROUTINE <SUBROUTINE-NAME> ( <ARRAY-NAME>[ * ] );

where the array-name is the time series to be transformed.
For example, to duplicate the functionality of the built-in TRANSFORM=LOG option in the INPUT and TARGET statements, the following SAS statements create a user-defined version of this transformation called MYTRANSFORM and store this subroutine in the catalog SASUSER.MYSIMILAR.

   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine mytransform( series[ * ] );
         outargs series;                  /* the series is modified in place */
         length = DIM(series);
         do i = 1 to length;
            value = series[i];
            if value > 0 then do;
               series[i] = log( value );
            end;
            else do;
               series[i] = .;             /* log is undefined for nonpositive values */
            end;
         end;
      endsub;
   run;

This user-defined subroutine can be specified in the TRANSFORM= option in the INPUT or TARGET statement as follows:

   options cmplib = sasuser.mysimilar;

   proc similarity ;
      input myinput / transform=mytransform;
      target mytarget / transform=mytransform;
   run;

Sequence Normalizations

A user-defined normalization subroutine has the following signature:

   SUBROUTINE <SUBROUTINE-NAME> ( <ARRAY-NAME>[ * ] );

where the array-name is the sequence to be normalized.

For example, to duplicate the functionality of the built-in NORMALIZE=ABSOLUTE option in the INPUT and TARGET statements, the following SAS statements create a user-defined version of this normalization called MYNORMALIZE and store this subroutine in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine mynormalize( sequence[ * ] );
         outargs sequence;                /* the sequence is modified in place */
         length = DIM(sequence);
         minimum = .;
         maximum = .;
         /* first pass: find the minimum and maximum of the nonmissing values */
         do i = 1 to length;
            value = sequence[i];
            if nmiss(minimum) | nmiss(maximum) then do;
               minimum = value;
               maximum = value;
            end;
            if nmiss(value) = 0 then do;
               if value < minimum then minimum = value;
               if value > maximum then maximum = value;
            end;
         end;
         /* second pass: rescale each nonmissing value by the observed range */
         do i = 1 to length;
            value = sequence[i];
            if nmiss( value ) | minimum > maximum then do;
               sequence[i] = .;
            end;
            else do;
               sequence[i] = (value - minimum) / (maximum - minimum);
            end;
         end;
      endsub;
   run;

This user-defined subroutine can be specified in the NORMALIZE= option in the INPUT or TARGET statement as follows:

   options cmplib = sasuser.mysimilar;

   proc similarity ;
      input myinput / normalize=mynormalize;
      target mytarget / normalize=mynormalize;
   run;

Sequence Scaling

A user-defined scaling subroutine has the following signature:

   SUBROUTINE <SUBROUTINE-NAME> ( <ARRAY-NAME>[ * ], <ARRAY-NAME>[ * ] );

where the first array-name is the target sequence and the second array-name is the input sequence to be scaled.

For example, to duplicate the functionality of the built-in SCALE=ABSOLUTE option in the INPUT statement, the following SAS statements create a user-defined version of this scaling called MYSCALE and store this subroutine in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine myscale( target[ * ], input[ * ] );
         outargs input;                   /* only the input sequence is modified */
         length = DIM(target);
         minimum = .;
         maximum = .;
         /* find the minimum and maximum of the nonmissing target values */
         do i = 1 to length;
            value = target[i];
            if nmiss(minimum) | nmiss(maximum) then do;
               minimum = value;
               maximum = value;
            end;
            if nmiss(value) = 0 then do;
               if value < minimum then minimum = value;
               if value > maximum then maximum = value;
            end;
         end;
         /* scale the input; loop over its own length, which can differ from the target's */
         length = DIM(input);
         do i = 1 to length;
            value = input[i];
            if nmiss( value ) | minimum > maximum then do;
               input[i] = .;
            end;
            else do;
               input[i] = (value - minimum) / (maximum - minimum);
            end;
         end;
      endsub;
   run;

This user-defined subroutine can be specified in the SCALE= option in the INPUT statement as follows:

   options cmplib=sasuser.mysimilar;

   proc similarity ;
      input myinput / scale=myscale;
   run;

Similarity Measures

A user-defined similarity measure function has the following signature:

   FUNCTION <FUNCTION-NAME> ( <ARRAY-NAME>[ * ], <ARRAY-NAME>[ * ] );

where the first array-name is the target sequence and the second array-name is the input sequence. The return value of the function is the similarity measure associated with the target sequence and the input sequence.

For example, to duplicate the functionality of the built-in MEASURE=ABSDEV option in the TARGET statement with no warping, the following SAS statements create a user-defined version of this measure called MYMEASURE and store this function in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      function mymeasure( target[ * ], input[ * ] );
         /* compare only the positions that exist in both sequences */
         length = min(DIM(target), DIM(input));
         sum = 0;
         num = 0;
         do i = 1 to length;
            x = input[i];
            w = target[i];
            if nmiss(x) = 0 & nmiss(w) = 0 then do;
               d = x - w;
               sum = sum + abs(d);
               num = num + 1;             /* count the nonmissing pairs */
            end;
         end;
         /* no usable pairs: the measure is undefined */
         if num <= 0 then return(.);
         return(sum);
      endsub;
   run;

This user-defined function can be specified in the MEASURE= option in the TARGET statement as follows:

   options cmplib=sasuser.mysimilar;

   proc similarity ;
      target mytarget / measure=mymeasure;
   run;

For another example, to duplicate the functionality of the built-in MEASURE=SQRDEV and MEASURE=ABSDEV options by using the C language, the following SAS statements create user-defined C language versions of these measures called DTW_SQRDEV_C and DTW_ABSDEV_C and store these functions in the catalog SASUSER.CSIMIL.CFUNCS. DTW refers to dynamic time warping. These C language functions can then be called by SAS language functions and subroutines.

   proc proto package=sasuser.csimil.cfuncs;
      mapmiss double = 999999999;
      double dtw_sqrdev_c( double * target / iotype=input,