XML 代價估計方法研究綜述_第1頁
XML 代價估計方法研究綜述_第2頁
XML 代價估計方法研究綜述_第3頁
XML 代價估計方法研究綜述_第4頁
XML 代價估計方法研究綜述_第5頁
已閱讀5頁,還剩40頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)

文檔簡介

1、XML 代價估計方法研究綜述代價估計方法研究綜述 XML Group Outlinen研究代價估計的必要性nxml中代價估計研究的難點n背景知識介紹n現(xiàn)有方法分類研究代價估計的必要性n查詢優(yōu)化的基礎(chǔ)n不同分支對查詢目標的選擇率不同,如果選擇性低的分支先于選擇性高的分支執(zhí)行,就會有效的減少中間結(jié)果從而提供查詢效率n結(jié)構(gòu)連接在XML中是一個很重要的操作,連接的順序的選擇需要計算不同連接順序的代價n及早給用戶提供反饋信息xml中代價估計研究的難點nXML數(shù)據(jù)的不規(guī)則性是對傳統(tǒng)統(tǒng)計信息方法的重要挑戰(zhàn),具體表現(xiàn)在以下幾個方面:n路徑依賴性,n如同為name結(jié)點,在person下和在city下出現(xiàn)意義就完

2、全不同n結(jié)構(gòu)相關(guān)性n嵌套在不同祖先下面的同類結(jié)點的個數(shù)差別可能很大,如book結(jié)點下的author個數(shù)是不確定的n值相關(guān)性n/purchase/Itemtype=bookprice100nXML的有序性制約了轉(zhuǎn)換規(guī)則的靈活性n所有這些問題,都使得在xml中采用傳統(tǒng)的代價估計方法不切實際,會帶來很大的誤差。針對xml數(shù)據(jù)的特點,我們應(yīng)該尋求一種新的代價估計方法。 Background Knowledge nXML Data ModelnA large, node-labeled tree T(V,E)d0a1a2a3p4p5n6t14k15y131999y16t17k18k19n7b9p8y20

3、t21k22y24t25k26n10b12p9t23t26200220012000Example XML Document TreeBackground Knowledge nXML Data ModelnA large, node-labeled tree T(V,E)d0a1a2a3p4p5n6t14k15y131999y16t17k18k19n7b9p8y20t21k22y24t25k26n10b12p9t23t26200220012000Example XML DocumentBackground KnowledgenXML Query ModelnA node-labeled que

4、ry tree TQnEach node of TQ is labeled with a variable name qi in Q nEach edge (qi,qj) of TQ is annotated with an XPath expression path (qi,qj) that describes the specific structural constraints specified in QQuery Fragment: for $b in doc(“bib.xml”)/bib/book$a in $b/author $t in $b/title bibbookautho

5、rtitleroot$b$a$tTwig Query Model現(xiàn)有估計方法的分類n路徑選擇性計算nTwig匹配個數(shù)統(tǒng)計n值謂詞選擇率估計n結(jié)構(gòu)連接代價估計 Overviewdata treedifferent summarization methods based on:pathall m-length pathF/B-stabilityLoreMcHugh et al, VLDB99pruningPathTree, Markov TableAboulnaga et al, VLDB01XSketchPolyzotis et al, VLDB02typeXSketchesPolyzotis

6、et al, SIGMOD02+ value+ twigXSketchesPolyzotis et al, SIGMOD04Region Code2D Model1D ModelStatiXFreire et al, SIGMOD02PH HistogramWu Y EDBT 2002Interva&Position ModelH.Lu SIGMOD 2003路徑選擇性計算n問題描述 /a/bc/d/eg/e/f/*/*/e/f/a/b/*/*/e/f/c/d/e/g/e/f/a/b/*/*/e/f/c/d/e/g/e/fPath Tree & Markov Table nAs

7、hraf Aboulnaga, Alaa R. Alameldeen, and Jeffry F.Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. VLDB 2001 Path TreenConstructionnStart from an original path treenCount of nodes in absolute pathsnAggregate the tree to fit in the limited spacenSibling-*nE

8、stimationnMatch the query against the path tree, matching * as the last resort.nNeed to decide whether to use total count or average count.An XML document and its path tree A1B2C1D1D1E3Estimation:count(A/B/D)=1 count(A/C/D)=1 Aggregate Path TreeA1C9B13G 10F15H6K12E5D7K11J4I2Aggregate Path TreeA1C9B1

9、3G 10F15H6K12E5D7K11J4I2紅色結(jié)點表示那些不頻繁出現(xiàn)的結(jié)點,需要刪除,從而保證path樹能夠放入內(nèi)存Sibling-* SummarizationA1C9B13*F15*K*f=23n=2683把查詢和path 樹匹配需要決定用總數(shù)還是平均數(shù)估計方法Estimation:count(/C/H/K)=11 count(/C/*/K)=23 Markov Tablen構(gòu)造一個表,存儲長度=2。n當查詢路徑長度=m時,可以直接從表中讀出結(jié)果,否則,用公式計算。n例如,m3,查詢?yōu)?A/B/C/D,則n當表不能裝入內(nèi)存的時候,進行剪枝操作。f(B/C/D)f(B/C)f(A/B/

10、C/D) f(A/B/C)Markov Table : Exampleabbcddeeefda1a/b2b2a/c1c1b/d2d3c/d1e3c/e3f1c/f1abdde12213cf11m=2b2c/e3d3*3/3e3c/*2/2a/b2*/*1/1b/d2suffix-*Q1: a/c/e( / )( / )( / / )( / )(*/*)3( )(*)f c ef c ef a c ef a cff cfTwig 匹配個數(shù)統(tǒng)計n問題描述nTwig Query (basic building block of declarative query languages for XML)

11、FOR $f IN document(“person.xml”)/department/faculty FOR $r in $f/RA, FOR $t in $f/TADepartmentFacultyRATA$f$r$tXSketch方法n N.Polyzotis, M.Garofalakis. Statistical Synopses for Graph-Structured XML Databases, In Proc.of ACM SIGMOD Conf.2002 nN. Polyzotis and M. Garofalakis. Structure and value synopse

12、s for xml data graphs. In Proceedings of 28th VLDB Conf., 2002.nN.Polyzotis, M.Garofalakis, and Yannis Iosnnidis. Selectivity Estimation for XML Twigs. In Proc.of the 20th Intl.Conf. on Data Engineering,2004nN. Polyzotis and M. Garofalakis and Y.Ioannidis. Approximate XML Query Answers. In Proc .ACM

13、 SIGMOD 2004d0a1a2a3p4p5n6t14k15y131999 y16t17k18k19n7b9p8y20t21k22y24t25k26n10b12p9t23t26200220012000D(1)A(3)N(3)P(4)B(2)Y(4)K(6)T(6)H(Y)B/FB/FB/FB/FBB/FFFXSketch方法B/F stabilityn Definition:n Let U,V be sets of elements in an XML data Tree T. We say that V is backward-stable(B-stale) with respect t

14、o U, iff for each vV there exists a u U such that edge (u,v) is in T.nSimilarly,U is said to be forward-stable(F-stable) with respect to V,iff for each u U there exists a v V such that the edge(u,v) is in G.D(1)A(3)N(3)P(4)B(2)Y(4)K(6)T(6)H(Y)B/FB/FB/FB/FBB/FFFXSketch方法How to estimate the cardinalit

15、y of xpath query such as A/P/K or A/p ?Utilizing edge stability,We can give an accurate match number:Count(A/P/K)=6Count(A/p)=3But what about A/P/T which doesnt conform to backward stability?2 assumptions Path independence Backward-edge uniformityCount(A/P/T)=f(A/P/T)*count(T)f(A/P/T)=f(A/P)*f(P/T|A

16、/P)=f(P/T|A/P) =f(P/T)count(P)/(count(B)+count(P)XSketch方法(處理值謂詞)v1v3v5v2v4BBFFSTN(v5)Dep(v5)=v1,v4v1v4Histogram at v5H(1, 4)= count(v11v2/v3v44/v5)v3 v5 count(v11v2/v33v44/v5)=H(1, 4)*count(v33)/count(v3)A1Edge-Value IndependenceAcross Non-Satble Edgesf(u/v|v) f(u/v)A2Value-Independence Outside Cor

17、relation ScopeXSkecth方法(處理twig)nProblem with the former modelnLet us see a simple examplefor t0 in A, t1 in t0/B , t2 in t0/Cra1a2b1c1b2c2ra1a2b1c1b2c21001010010100100 1010RA(2)B(110)C(110)B/FB/FB/F(a)(b)(c) Structural XSketch2000 binding tuples on the first document10100 tuples on the second docume

18、nt XSkecth方法(處理twig)nSinge-path XSketch model dose not store detailed enough information to capture the correlation patterns between multiple path expressions.nIf node A records a two-dimensional distribution fA(b,c) for the fraction of elements in A that have b children in B and c children in C.CBC

19、CElementsfp10010a10.510100a20.52(0.5100100.510100)CBCCElementsfp100100a10.51010a20.52(0.51001000.51010)XSkecth方法(處理twig)CKCYElementsfp21p40.25P54(0.2521100.25110.51)5CPCN21210.2521110.511p8,p9Another ExampleEdge-distribution: fP(CY,CK,CP,CN) PYPKAPANStatix方法n Freire J, Jayant R, Ramanath M, Roy P an

20、d Simeon J. StatiX: Making XML Count. In: Proc. of ACM SIGMOD 2002, USA. Statix方法nConstructionnevery instance node is assigned a unique id(type id+sequential node id ) during parsing and validation, meanwhile gathering statisticsnConstructing the histogramsnHistogramnBuild histogram for each typenMi

21、ght contain histogram describing the distribution of the parents of a typenValue histogram can be built for types that are defined in terms of base types (eg. integer)nEstimationnConvert the query into SQL (i.e., relational join on IDs)nUse standard histogram multiplicationStatiX : ExampleFOR $s in

22、document (“imdb.xml”)/show, $r in $s/reviewWHERE $s/year 1992RETURN $s/title, $rSELECT title, reviewFROM Show, ReviewWHERE Show.year 1992 AND Show.ID = Review.ParIDSTAT Show CARD: 5 ID: 16 STAT Show_Year VALUE: 19902001 BUCKET_NUM: 2 BUCKETS: 1990, 1994): 3; 1994, 2001): 2STAT Review CARD: 8 ID: 303

23、8 PARENT HISTOGRAM Show BUCKET_NUM: 3 BUCKETS: 1, 2): 4; 2, 3): 1; 3, 5): 3 STAT Aka CARD:6 ID:6-12 PARENT HISTOGRAM Show_Episode BUCKET_NUM: 3 BUCKETS: 1, 2): 1; 2, 6): 0; 6, 12): 7 31/28/5=2.4SummaryQuery/*BranchValueCorrelationSchemaStatiXSimple Twig + ValueNYYYNYPath TreeSimple PathNYNNNNMarkov

24、TableSimple PathNYNNNNXSketchSimple TwigNNYNNNXSketchesSimple Twig + ValueNNYYY (mD methods)N結(jié)構(gòu)連接代價估計n問題描述n給定任意的祖先節(jié)點集合A和后代節(jié)點集合D,計算A/D結(jié)果集合的大小,估計匹配的祖先后代的結(jié)果個數(shù) n當同一祖先有多個后代,或者同一后代有多個祖先時,節(jié)點對個數(shù)可能遠遠大于祖先或后代的個數(shù)。n應(yīng)用n連接順序的選擇結(jié)構(gòu)連接代價估計nExisted workn2D model: pH histogram Wu Y, Patel J and Jagadish H. V. Estimating

25、 Answer Sizes for XML Queries. In: Proc. of the EDBT 2002. nInterval and Positional model W.Wang, H.Jiang, H.Lu, and J.F.Xu. Containment join Size Estimation:Models and Methods. In:Proc. Of ACM SIGMOD 2003 Absolute Region Code (start, end) blah 1234 5678 0000 contactnamephoneblahofficehomemobile1234

26、56780000(1,16)(2,4)(3,3)(5,15)(6,8)(7,7)(9,11)(10,10)(12,14)(13,13)234a.start d.start and d.end a.endpH HistogramForbidden Regions For a nodeR1和R2是結(jié)點A的ForbiddenRegion 兩個結(jié)點的(start,end)滿足:(1)no overlap (2)nested 不可能有部分重疊的(partiallyoverlap)情況 pH HistogramEstPA=HistP1A1/4HistP2A+HistP2B+HistP2C+HistP2E+

27、1/2(HistP2D+HistP2F)Ancestor-based Join EstimationEstPA=HistP2A1/4HistP1A+HistP1F+HistP2G+HistP1HDescendant-based Join EstimationpH Histogram ( 1 4 )ADDAAAAEAHAHGHHHIHA35 51 3 71021203 00 0 00000est = 3*(12+1+2+0.25*5) = 48.75AGIHstartendstartstartendendFor off-diagonal cells:pH for ApH for DInterva

28、l & Position ModelsnMap into new problems under 2 newly proposed models.nInterval model and Position modelnHistogram based method that is adaptive to the data: PL histogramnAssumption: 1D uniform of the descendant setnEstimation is robust when correlation is lownSampling based method that captur

29、es the correlation: IM Sampling and PM SamplingnAssumption: Height of the XML data tree is a small constantnBoth have theoretical bounds on the accuracy of the estimationInterval ModelnInterval ModelnMap elements to 1D line segments (points)nIMA(A): interval setnIMD(D): point setn|A SJ D|=| interval

30、-point pairs| a1a2a3IMA(A)022d1d2d3d4IMD(D)022a(start, end)a1(2, 7)a2(18, 21)a3(1, 22)I IMMA A( (A A) )d(start, end)d1(3, 4)d2(9, 10)d3(11, 12)d3(19, 20)I IMMD D( (D D) )interval setpoint set (2,7)a3a1a2d1d2d3d4(1,22)(3,4)(9,10) (11,12)(19,20)(18,21)Position ModelnPosition ModelnUse frequency tables to record coverage and starting point informationnPMA(S): coverage tablenPMD(S): start tablen|A SJ D|=nContainment

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
  • 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論