... ) clustered on TID. Clustering is based on a hashed or tree structured organization. A selection index on attribute A of relation R is a base relation F(A, TID) clustered on A.

Let R1 and R2 be two relations, not necessarily distinct, and let TID1 and TID2 be identifiers of tuples of R1 and R2, respectively. A join index on relations R1 and R2 is a relation of couples (TID1, TID2), where each couple indicates two tuples matching a join predicate. Intuitively, a join index is an abstraction of the join of two relations. A join index can be implemented by two base relations F(TID1, TID2), one clustered on TID1 and the other on TID2. Join indices are uniquely designed to optimize joins. The join predicate associated with a join index may be quite general and include several attributes of both relations. Furthermore, more than one join index can be defined between any two relations. The identification of various join indices between two relations is based on the associated join predicate. Thus, the join of relations R1 and R2 on the predicate (R1.A = R2.A and R1.B = R2.B) can be captured as either a single join index, on the multi-attribute join predicate, or two join indices, one on (R1.A = R2.A) and the other on (R1.B = R2.B).
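The definition above can be illustrated with a minimal Python sketch. It assumes relations are held in memory as dictionaries keyed by TID and builds the index with a nested loop; the function names `build_join_index` and `cluster`, and the sample tuples, are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def build_join_index(r1, r2, predicate):
    """Return the couples (TID1, TID2) whose tuples satisfy the join predicate."""
    return [
        (tid1, tid2)
        for tid1, row1 in r1.items()
        for tid2, row2 in r2.items()
        if predicate(row1, row2)
    ]

def cluster(couples, position):
    """Group the couples on one of their two TIDs (0 for TID1, 1 for TID2)."""
    clustered = defaultdict(list)
    for couple in couples:
        clustered[couple[position]].append(couple[1 - position])
    return dict(clustered)

# Made-up relations, keyed by TID.
r1 = {1: {"A": "x", "B": 10}, 2: {"A": "y", "B": 20}}
r2 = {7: {"A": "x", "B": 10}, 8: {"A": "x", "B": 99}, 9: {"A": "y", "B": 20}}

# One join index on the single-attribute predicate, one on the multi-attribute one.
ji_a = build_join_index(r1, r2, lambda s, t: s["A"] == t["A"])
ji_ab = build_join_index(r1, r2, lambda s, t: s["A"] == t["A"] and s["B"] == t["B"])

# The two copies of the same join index, one clustered on each TID.
by_tid1 = cluster(ji_a, 0)
by_tid2 = cluster(ji_a, 1)
```

Keeping both clustered copies lets the join be driven from either relation, at the price of maintaining two structures on update.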
The choice between the alternatives is a database design decision based on join frequencies, update overhead, etc.

Let us consider the following relational database schema (key attributes are bold):

CUSTOMER (cname, city, age, job)
ORDER (cname, pname, qty, date)
PART (pname, weight, price, spname)

A (partial) physical schema for this database, based on the storage model described above, is (clustered attributes are bold):

C_PC (CID, cname, city, age, job)
City_IND (city, CID)
Age_IND (age, CID)
O_PC (OID, cname, pname, qty, date)
Cname_IND (cname, OID)
CID_JI (CID, OID)
OID_JI (OID, CID)

C_PC and O_PC are primary copies of CUSTOMER and ORDER relations. City_IND and Age_IND are selection indices on CUSTOMER. Cname_IND is a selection index on ORDER. CID_JI and OID_JI are join indices between CUSTOMER and ORDER for the join predicate (CUSTOMER.cname = ORDER.cname).

3. Optimization of Non-Recursive Queries

The objective of query optimization is to select an access plan for an input query that optimizes a given cost function. This cost function typically refers to machine resources such as disk accesses, CPU time, and possibly communication time (for a distributed database system). The query optimizer is in charge of decisions regarding the ordering of database operations, and the choice of the access paths to the data, the algorithms for performing database operations, and the intermediate relations to be materialized. These decisions are undertaken based on the physical database schema and related statistics. A set of decisions that lead to an execution plan can be captured by a processing tree [Krishnamurthy86]. A processing tree (PT) is a tree in which a leaf is a base relation and a non-leaf node is an intermediate relation materialized by applying an internal database operation. Internal database operations implement efficiently relational algebra operations using specific access paths and algorithms. Examples of internal database operations are exact-match select, sort-merge join, n-ary pipelined join, semi-join, etc. The application of algebraic transformation rules [Jarke84] permits generation of many candidate PT's for a single query.
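As an illustration of the processing tree just defined, the following Python sketch represents a PT as a small recursive data structure. The class name `PTNode`, the helper functions, and the intermediate relation names T1 and T2 are assumptions made for this example only; the base relations and operation names come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PTNode:
    """A node of a processing tree: a leaf is a base relation, a non-leaf node
    is an intermediate relation produced by an internal database operation."""
    name: str
    operation: Optional[str] = None   # None for a base relation
    left: Optional["PTNode"] = None
    right: Optional["PTNode"] = None  # stays None for unary operations

    def is_leaf(self) -> bool:
        return self.operation is None

def base_relations(node):
    """List the base relations (leaves) of a PT from left to right."""
    if node.is_leaf():
        return [node.name]
    leaves = []
    for child in (node.left, node.right):
        if child is not None:
            leaves.extend(base_relations(child))
    return leaves

def count_operations(node):
    """Count the internal database operations (non-leaf nodes) of a PT."""
    if node.is_leaf():
        return 0
    children = [c for c in (node.left, node.right) if c is not None]
    return 1 + sum(count_operations(c) for c in children)

# One candidate PT: select customers, then join the result with the orders.
selected = PTNode("T1", "exact-match select", left=PTNode("C_PC"))
plan = PTNode("T2", "sort-merge join", left=selected, right=PTNode("O_PC"))
```

A different choice of operations or operand order over the same leaves yields another candidate PT for the same query.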
The optimization problem can be formulated as finding the PT of minimal cost among all equivalent PT's. Traditional query optimization algorithms [Selinger79] perform an exhaustive search of the solution space, defined as the set of all equivalent PT's, for a given query. The estimation of the cost of a PT is obtained by computing the sum of the costs of the individual internal database operations in the PT. The cost of an internal operation is itself a monotonic function of the operand cardinalities. If the operand relations are intermediate relations then their cardinalities must also be estimated. Therefore, for each operation in the PT, two numbers must be predicted: (1) the individual cost of the operation and (2) the cardinality of its result based on the selectivity of the conditions [Selinger79, Piatetsky84].

The possible PT's for executing an SPJ query are essentially generated by permutation of the join ordering. With n relations, there are n! possible permutations. The complexity of exhaustive search is therefore prohibitive when n is large (e.g., n > 10). The use of dynamic programming and heuristics, as in [Selinger79], reduces this complexity to 2^n, which is still significant. To handle the case of complex queries involving a large number of relations, the optimization algorithm must be more efficient. The complexity of the optimization algorithm can be further reduced by imposing restrictions on the class of PT's [Ibaraki84], limiting the generality of the cost function [Krishnamurthy86], or using a probabilistic hill-climbing algorithm [Ioannidis87].

Assuming that the solution space is searched by an efficient algorithm, we now illustrate the possible PT's that can be produced based on the storage model with join indices. The addition of join indices in the storage model enlarges the solution space for optimization. Join indices should be considered by the query optimizer as any other join method, and used only when they lead to the optimal PT. In [Valduriez87], we give a precise specification of the join algorithm using join index, denoted by JOINJI, and its cost. This algorithm takes as input two base relations R1(TID1, A1, B1, ...
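To make the search over join orderings described above concrete (for n = 10, n! is 3,628,800 while 2^n is 1,024), here is a minimal Python sketch of dynamic programming over subsets of relations. It is not the algorithm of [Selinger79]: it only considers left-deep orderings, and the cost model (cost of a join taken as the product of the operand cardinalities, result cardinality obtained by multiplying independent selectivities) is an assumption chosen for simplicity. The cardinalities and selectivities are made up.

```python
from itertools import combinations

def subset_cardinality(subset, cards, selectivity):
    """Estimated cardinality of the join of a set of relations (assumed model:
    product of cardinalities times the selectivity of each applicable predicate)."""
    size = 1.0
    for relation in subset:
        size *= cards[relation]
    for a, b in combinations(sorted(subset), 2):
        size *= selectivity.get((a, b), 1.0)
    return size

def best_left_deep_order(cards, selectivity):
    """Return (total cost, join order) of the cheapest left-deep ordering.

    best[S] holds the cheapest way to join the subset S, so each of the 2^n
    subsets is solved once instead of enumerating all n! permutations."""
    relations = sorted(cards)
    best = {frozenset([r]): (0.0, (r,)) for r in relations}
    for k in range(2, len(relations) + 1):
        for combo in combinations(relations, k):
            subset = frozenset(combo)
            candidates = []
            for last in combo:
                rest = subset - {last}
                rest_cost, rest_order = best[rest]
                # Assumed join cost: a monotonic function of operand cardinalities.
                join_cost = subset_cardinality(rest, cards, selectivity) * cards[last]
                candidates.append((rest_cost + join_cost, rest_order + (last,)))
            best[subset] = min(candidates)
    return best[frozenset(relations)]

# Made-up statistics; selectivity keys are pairs of names in sorted order.
cards = {"CUSTOMER": 100, "ORDER": 1000, "PART": 50}
selectivity = {("CUSTOMER", "ORDER"): 0.001, ("ORDER", "PART"): 0.02}
total_cost, order = best_left_deep_order(cards, selectivity)
```

The total cost of an ordering is the sum of the costs of its joins, and each join cost depends on the estimated cardinality of the intermediate result that feeds it, which mirrors the two predictions named in the text.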
of common subexpression elimination [GM82], which appears particularly useful when flattening occurs. A simple technique using a hill-climbing method is easy to superimpose on the proposed strategy, but more ambitious techniques provide a topic for future research. Further, an extrapolation of common subexpression in logic queries can be seen in the following example: let both goals P(a, b, X) and P(a, Y, c) occur in a query. Then it is conceivable that computing P(a, Y, X) once and restricting the result for each of the cases may be more efficient.

Acknowledgments: We are grateful to Shamim Naqvi for inspiring discussions during the development of an earlier version of this paper.

References:

[AU79] Aho, A. and J. Ullman, Universality of Data Retrieval Languages, Proc. POPL Conf., San Antonio, TX, 1979.
[B40] Birkhoff, G., "Lattice Theory", American Mathematical Society, 1940.
[BMSU85] Bancilhon, F., D. Maier, Y. Sagiv and J. Ullman, Magic Sets and other Strange Ways to Implement Logic Programs, Proc. 5th ACM SIGMOD-SIGACT Symposium on Principles of Database Systems, pp. 1-16, 1986.
[BR86] Bancilhon, F., and R. Ramakrishnan, An Amateur's Introduction to Recursive Query Processing Strategies, Proc. 1986 ACM-SIGMOD Intl. Conf. on Mgt. of Data, pp. 16-52, 1986.
[D82] Daniels, D., et al., "An Introduction to Distributed Query Compilation in R*", Proc. of Second International Conf. on Distributed Databases, Berlin, Sept. 1982.
[GM82] Grant, J. and Minker, J., On Optimizing the Evaluation of a Set of Expressions, Int. Journal of Computer and Information Science, 11, 3 (1982), 179-189.
[IW87] Ioannidis, Y. E., Wong, E., Query Optimization by Simulated Annealing, SIGMOD 87, San Francisco.
[KBZ86] Krishnamurthy, R., Boral, H., Zaniolo, C., Optimization of Nonrecursive Queries, Proc.
of 12th VLDB, Kyoto, Japan, 1986.
[KRS87] Krishnamurthy, R., Ramakrishnan, R., Shmueli, O., "Testing for Safety and Effective Computability", Manuscript in Preparation.
[KT81] Kellogg, C., and Travis, L., Reasoning with data in a deductively augmented database system, in Advances in Database Theory: Vol 1, H. Gallaire, J. Minker, and J. Nicolas eds., Plenum Press, New York, 1981, pp. 261-298.
[Ll84] Lloyd, J. W., Foundations of Logic Programming, Springer Verlag, 1984.
[M84] Maier, D., The Theory of Relational Databases, (pp. 542-553), Comp. Science Press, 1984.
[Na86] Naish, L., Negation and Control in Prolog, Journal of Logic Programming, to appear.
[Sel79] Selinger, P. G., et al., Access Path Selection in a Relational Database Management System, Proc. 1979 ACM-SIGMOD Intl. Conf. on Mgt. of Data, pp. 23-34, 1979.
[SZ86] Saccà, D. and C. Zaniolo, The Generalized Counting Method for Recursive Logic Queries, Proc. ICDT '86, Int. Conf. on Database Theory, Rome, Italy, 1986.
[TZ86] Tsur, S. and C. Zaniolo, LDL: A Logic-Based Data Language, Proc. of 12th VLDB, Kyoto, Japan, 1986.
[U85] Ullman, J. D., Implementation of logical query languages for databases, TODS, 10, 3, (1985), 289-321.
[UV85] Ullman, J. D. and A. Van Gelder, Testing Applicability of Top-Down Capture Rules, Stanford Univ. Report STAN-CS-85-146, 1985.
[V86] Villarreal, M., "Evaluation of an O(N**2) Method for Query Optimization", MS Thesis, Dept. of Computer Science, Univ. of Texas at Austin, Austin, TX.
[Z85] Zaniolo, C., The representation and deductive retrieval of complex objects, Proc. of 11th VLDB, pp. 458-469, 1985.
[Z86] Zaniolo, C., Safety and Compilation of Non-Recursive Horn Clauses, Proc. First Int. Conf. on Expert Database Systems, Charleston, S.C., 1986.

OPTIMIZATION OF COMPLEX DATABASE QUERIES USING JOIN INDICES

Patrick Valduriez
Microelectronics and Computer Technology Corporation
3500 West Balcones Center Drive
Austin, Texas 78759

ABSTRACT

New application areas of database systems require efficient support of complex queries. Such queries typically involve a large number of relations and may be recursive. Therefore, they tend to use the join operator more extensively.
A join index is a simple data structure that can improve significantly the performance of joins when incorporated in the database system storage model. Thus, as any other access method, it should be considered as an alternative join method by the query optimizer. In this paper, we elaborate on the use of join indices for the optimization of both non-recursive and recursive queries. In particular, we show that the incorporation of join indices in the storage model enlarges the solution space searched by the query optimizer and thus offers additional opportunities for increasing performance.

1. Introduction

Relational database technology can well be extended to support new application areas, such as deductive database systems [Gallaire84]. Compared to the traditional applications of relational database systems, these applications require the support of more complex queries. Those queries generally involve a large number of relations and may be recursive. Therefore, the quality of the query optimization module (query optimizer) becomes a key issue to the success of database systems. The ideal goal of a query optimizer is to select the optimal access plan to the relevant data for an input query.

Most of the work on traditional query optimization [Jarke84] has concentrated on select-project-join (SPJ) queries, for they are the most frequent ones in traditional data processing (business) applications. Furthermore, emphasis has been given to the optimization of joins [Ibaraki84] because join remains the most costly operator. When complex queries are considered, the join operator is used even more extensively for both non-recursive queries [Krishnamurthy86] and recursive queries [Valduriez86a].

In [Valduriez87], we proposed a simple data structure, called a join index, that improves significantly the performance of joins. In this paper, we elaborate on the use of join indices in the context of non-recursive and recursive queries. We view a join index as an alternative join method that should be considered by the query optimizer as any other access method.

In general, a query optimizer maps a query expressed on conceptual relations into an access plan, i.e., a low-level program expressed on the physical schema.
The physical schema itself is based on the storage model, the set of data structures available in the database system. The incorporation of join indices in the storage model enlarges the solution space searched by the query optimizer, and thus offers additional opportunities for increasing performance.

Join indices could be used in many different storage models. However, in order to simplify our discussion regarding query optimization, we present the integration of join indices in a simple storage model with single attribute clustering and selection indices. Then we illustrate the impact of the storage model with join indices on the optimization of non-recursive queries, assumed to be SPJ queries. In particular, efficient access plans, where the most complex (and costly) part of the query can be performed through indices, can be generated by the query optimizer. Finally, we illustrate the use of join indices in the optimization of recursive queries, where a recursive query is mapped into a program of relational algebra enriched with a transitive closure operator.

2. Storage Model with Join Indices

The storage model prescribes the storage structures and related algorithms that are supported by the database system to map the conceptual schema into the physical schema. In a relational system implemented on a disk-based architecture, conceptual relations can be mapped into base relations on the basis of two functions, partitioning and replicating. All the tuples of a base relation are clustered based on the value of one attribute. We assume that each conceptual tuple is assigned a surrogate for tuple identity, called a TID (tuple identifier). A TID is a value unique for all tuples of a relation. It is created by the system when a tuple is instantiated. TID's permit efficient updates and reorganizations of base relations, since references do not involve physical pointers. The partitioning function maps a relation into one or more base relations, where a base relation corresponds to a TID together with an attribute, several attributes, or all the conceptual relation's attributes.
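The partitioning function described above can be sketched in a few lines of Python. This is only an in-memory illustration under simplifying assumptions: TIDs are consecutive integers, each base relation is a dictionary keyed by TID, and the function names `assign_tids` and `partition`, the attribute grouping, and the sample tuples are all hypothetical.

```python
from itertools import count

def assign_tids(rows):
    """Give every conceptual tuple a system-generated surrogate (its TID)."""
    return dict(zip(count(1), rows))

def partition(relation, attribute_groups):
    """Map a relation into one base relation per attribute group; every base
    relation pairs the TID with the attributes of its group."""
    return [
        {tid: {attr: row[attr] for attr in group} for tid, row in relation.items()}
        for group in attribute_groups
    ]

# Made-up CUSTOMER tuples.
customer = assign_tids([
    {"cname": "Smith", "city": "Austin", "age": 40, "job": "engineer"},
    {"cname": "Jones", "city": "Paris", "age": 35, "job": "teacher"},
])

# Two base relations over the same TIDs.
by_identity, by_profile = partition(customer, [("cname", "city"), ("age", "job")])

# A projection on city reads only the first base relation.
cities = {tid: row["city"] for tid, row in by_identity.items()}
```

Because both base relations share the TID, a conceptual tuple can be reassembled by matching on it.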
The rationale for a partitioning function is the optimization of projection, by storing together attributes with high affinity, i.e., frequently accessed together. The replicating function replicates one or more attributes associated with the TID of the relation into one or more base relations. The primary use of replicated attributes is for optimizing selections based on those attributes. Another use is for increased reliability provided by those additional data copies. In this paper, we assume a simple storage model ... used to deflect the reading beam very fast. As a result, it is much faster to retrieve information from tracks that are located near the current location of the reading head. We call this a span access capability. The span access capability of optical disks has implications for scheduling algorithms and data structures that are appropriate for optical disks, as well as significant impact on retrieval performance [Christodoulakis87a]. In [Christodoulakis87] we also derive exact analytic cost estimates as well as approximations that are cheaper to evaluate, for the retrieval of records and longer objects such as text, images, voice, and documents (possibly crossing block boundaries) from CAV optical disks. These estimates may be used by query optimizers of traditional or multimedia databases.

Retrieval Performance of CLV Optical Disks

Constant Linear Velocity (CLV) optical disks have different characteristics than the CAV optical disks. CLV optical disks vary the rotational speed so that the unit length of the track which is read passes under the reading mechanism in constant time, which is independent of the location of the track. This has implications on the rotational delay cost which, in CLV disks, depends on the track location. This also implies that, in CLV disks, the number of sectors per track varies (outside tracks have more sectors).
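The dependence on track location can be made concrete with a small Python sketch. It is not the analytic model of the cited work: it assumes an idealized disk with circular tracks, a fixed sector length along the track, and an average rotational delay of half a revolution. The parameter names and numeric values are hypothetical.

```python
import math

def rotation_time(radius_m, linear_velocity_m_s):
    """Time of one revolution when the track passes the head at constant linear velocity."""
    return 2 * math.pi * radius_m / linear_velocity_m_s

def average_rotational_delay(radius_m, linear_velocity_m_s):
    """Assumed average delay: half a revolution at the given track."""
    return rotation_time(radius_m, linear_velocity_m_s) / 2

def sectors_per_track(radius_m, sector_length_m):
    """Whole sectors that fit on a track when sectors have a fixed length."""
    return int(2 * math.pi * radius_m // sector_length_m)

# Hypothetical inner and outer track radii, velocity, and sector length.
inner, outer = 0.025, 0.058
velocity, sector_length = 1.25, 0.0005

inner_delay = average_rotational_delay(inner, velocity)
outer_delay = average_rotational_delay(outer, velocity)
inner_sectors = sectors_per_track(inner, sector_length)
outer_sectors = sectors_per_track(outer, sector_length)
```

Under these assumptions both the rotational delay and the track capacity grow linearly with the radius, which is why a cost estimate would need to know where a file is placed on the disk.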
The latter (variable capacity of a track) has many fundamental implications on selection of data structures that are desirable for CLV optical disks and the parameters of their implementation, for the selection of access paths to be supported for databases stored on CLV disks, as well as for the retrieval performance and the optimal query processing strategy to be chosen. (These implications are studied in detail in [Christodoulakis87b], in which it is shown that these decisions depend on the location of data placement on the disk.) Analytic cost estimates for the performance of retrieval of records and objects from CLV disks are also derived in [Christodoulakis87b]. These estimates may be used by traditional or multimedia query optimizers. It is shown that the optimal query processing strategy depends on the location of files on the CLV disk. This implies that query optimizers may have to maintain information about the location of files on the disk.

Estimation of Selectivities in Text

In multimedia information systems much of the content specification will be done by specifying a pattern of text words. Queries based on the content of images are difficult to specify, and image access methods are very expensive. Voice content is transformed to text content if a good voice recognition device is available. Thus accurate estimation of text selectivities is important in query optimization in multimedia objects.

There is another important reason why accurate estimation of text selectivities is important. Frequently the user wants to have a fast feedback of how many objects qualify in his query. If too many objects qualify, the user may want to restrict the set of qualifying objects by adding more conjunctive terms. If too few objects qualify, the user may want to increase the number of objects that he receives by adding more disjunctive terms. (Tradeoffs of precision versus recall are extensively described in the information retrieval bibliography.)

Although such statistics may be found by traversing an index on text (possibly several times for complicated queries), indexes may not be the desirable text access methods in several environments [Haskin81]. Given a set of stop words (words that appear too frequently in English to be of a practical value in content addressability), it is easy to give an analytic formula that calculates the average number of words that qualify in a text query [Christodoulakis and Ng87]. This analytic formula uses the fact that the distribution of words in a long piece of text is Zipf with known parameters. However, the average number of documents may not be a good enough estimate (in some cases) for query optimization or for giving an estimate of the size of the response to the user [Christodoulakis84]. More detailed estimates will have to consider selectivities of individual words and queries. This can be done using sampling. A sampling strategy looks at some blocks of text, counts the number of occurrences of a particular word or text pattern, and based on this extrapolates the probability distribution of the number of pattern occurrences to the whole database. A potential problem with this approach is that in order to be confident about the statistics a large portion of the file may have to be scanned. Instead of blocks of the actual text file, blocks of the text signatures could be used when signatures are used as text access methods. Since more information exists in blocks of signatures than in blocks of the...
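The sampling strategy described above can be sketched as follows in Python. This is a deliberately simplified illustration: it assumes text blocks are plain strings, draws a uniform random sample of blocks, and extrapolates only the expected total number of occurrences rather than a full probability distribution. The function name `estimate_occurrences` and the sample blocks are made up.

```python
import random

def estimate_occurrences(blocks, word, sample_size, seed=0):
    """Estimate how often a word occurs in all blocks from a random sample of blocks."""
    rng = random.Random(seed)                  # fixed seed keeps the sketch repeatable
    sample = rng.sample(blocks, sample_size)   # blocks drawn without replacement
    hits = sum(block.lower().split().count(word.lower()) for block in sample)
    # Scale the mean count per sampled block up to the whole file.
    return hits / sample_size * len(blocks)

# Made-up text blocks.
blocks = [
    "optical disk query",
    "query optimizer query",
    "text index",
    "disk",
]

full_scan = estimate_occurrences(blocks, "query", sample_size=len(blocks))
partial = estimate_occurrences(blocks, "query", sample_size=2)
```

When the sample covers every block the estimate equals the exact count; smaller samples trade accuracy for fewer block reads, which is the confidence problem the text points out.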