




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
DataStreamMiningandQueryingSlidestakenfromanexcellentTutorialonDataStreamMiningandQueryingbyMinosGarofalakis,
JohannesGehrkeandRajeevRastogi
AndbyMino’slecturesslidesfromthere:/cs286sp07/ProcessingDataStreams:MotivationAgrowingnumberofapplicationsgeneratestreamsofdataPerformancemeasurementsinnetworkmonitoringandtrafficmanagementCalldetailrecordsintelecommunicationsTransactionsinretailchains,ATMoperationsinbanksLogrecordsgeneratedbyWebServersSensornetworkdataApplicationcharacteristicsMassivevolumesofdata(severalterabytes)RecordsarriveatarapidrateGoal:Minepatterns,processqueriesandcomputestatisticsondatastreamsinreal-timeDataStreams:ComputationModelAdatastreamisa(massive)sequenceofelements:
StreamprocessingrequirementsSinglepass:EachrecordisexaminedatmostonceBoundedstorage:LimitedMemory(M)forstoringsynopsisReal-time:Perrecordprocessingtime(tomaintainsynopsis)mustbelowStreamProcessingEngine(Approximate)AnswerSynopsisinMemoryDataStreamsDataStreamProcessingAlgorithmsGenerally,algorithmscomputeapproximateanswersDifficulttocomputeanswersaccuratelywithlimitedmemoryApproximateanswers-DeterministicboundsAlgorithmsonlycomputeanapproximateanswer,butboundsonerrorApproximateanswers-ProbabilisticboundsAlgorithmscomputeanapproximateanswerwithhighprobabilityWithprobabilityatleast,thecomputedansweriswithinafactoroftheactualanswerSingle-passalgorithmsforprocessingstreamsalsoapplicableto(massive)terabytedatabases!
Sampling:BasicsIdea:AsmallrandomsampleSofthedataoftenwell-representsallthedataForafastapproxanswer,apply“modified”querytoSExample:selectaggfromRwhereR.eisodd
(n=12)
Ifaggisavg,returnaverageofoddelementsinSIfaggiscount,returnaverageoverallelementseinSofnifeisodd0ifeisevenUnbiased:Forexpressionsinvolvingcount,sum,avg:theestimatorisunbiased,i.e.,theexpectedvalueoftheansweristheactualanswerDatastream:935271658491SampleS:9518answer:5answer:12*3/4=9ProbabilisticGuaranteesExample:Actualansweriswithin5±1withprob0.9UseTailInequalitiestogiveprobabilisticboundsonreturnedanswerMarkovInequalityChebyshev’sInequalityHoeffding’sInequalityChernoffBoundTailInequalitiesGeneralboundsontailprobabilityofarandomvariable(thatis,probabilitythatarandomvariabledeviatesfarfromitsexpectation)
BasicInequalities:LetXbearandomvariablewithexpectationandvarianceVar[X].ThenforanyProbabilitydistributionTailprobabilityMarkov:Chebyshev:TailInequalitiesforSumsPossibletoderivestrongerboundsontailprobabilitiesforthesumofindependentrandomvariablesHoeffding’sInequality:LetX1,...,Xmbeindependentrandomvariableswith0<=Xi<=r.Letandbetheexpectationof.Then,foranyApplicationtoavgqueries:missizeofsubsetofsampleSsatisfyingpredicate(3inexample)risrangeofelementvaluesinsample(8inexample)TailInequalitiesforSums(Contd.)PossibletoderiveevenstrongerboundsontailprobabilitiesforthesumofindependentBernoullitrialsChernoffBound:LetX1,...,XmbeindependentBernoullitrialssuchthatPr[Xi=1]=p(Pr[Xi=0]=1-p).Letandbetheexpectationof.Then,forany,Applicationtocountqueries:missizeofsampleS(4inexample)pisfractionofoddelementsinstream(2/3inexample)
Remark:ChernoffboundresultsintighterboundsforcountqueriescomparedtoHoeffding’sinequalityComputingStreamSampleReservoirSampling[Vit85]:
MaintainsasampleSofafixed-sizeMAddeachnewelementtoSwithprobabilityM/n,wherenisthecurrentnumberofstreamelementsIfaddanelement,evictarandomelementfromSInsteadofflippingacoinforeachelement,determinethenumberofelementstoskipbeforethenexttobeaddedtoSConcisesampling[GM98]:DuplicatesinsampleSstoredas<value,count>pairs(thus,potentiallyboostingactualsamplesize)AddeachnewelementtoSwithprobability1/T(simplyincrementcountifelementalreadyinS)IfsamplesizeexceedsMSelectnewthresholdT’>TEvicteachelement(decrementcount)fromSwithprobability1-T/T’AddsubsequentelementstoSwithprobability1/T’StreamingModel:SpecialCases
Time-SeriesModelOnlyj-thupdateupdatesA[j](i.e.,A[j]:=c[j])
Cash-RegisterModelc[j]isalways>=0(i.e.,increment-only)Typically,c[j]=1,soweseeamulti-setofitemsinonepass
TurnstileModelMostgeneralstreamingmodelc[j]canbe>0or<0(i.e.,incrementordecrement)
ProblemdifficultyvariesdependingonthemodelE.g.,MIN/MAXinTime-Seriesvs.Turnstile!Linear-Projection(akaAMS)SketchSynopsesGoal:Buildsmall-spacesummaryfordistributionvectorf(i)(i=1,...,N)seenasastreamofi-valuesBasicConstruct:
RandomizedLinearProjectionoff()=projectontoinner/dotproductoff-vectorSimpletocomputeoverthestream:Addwheneverthei-thvalueisseenGenerate‘sinsmall(logN)spaceusingpseudo-randomgeneratorsTunableprobabilisticguaranteesonapproximationerrorDelete-Proof:Justsubtracttodeleteani-thvalueoccurrenceComposable:Simplyaddindependently-builtprojectionsDatastream:3,1,2,4,2,3,5,...Datastream:3,1,2,4,2,3,5,...f(1)f(2)f(3)f(4)f(5)11122where=vectorofrandomvaluesfromanappropriatedistributionExample:Binary-JoinCOUNTQueryProblem:ComputeanswerforthequeryCOUNT(RAS)Example:Exactsolution:tooexpensive,requiresO(N)space!N=sizeof(domain(A))DatastreamR.A:41241412032134DatastreamS.A:31242412213421=10(2+2+0+6)BasicAMSSketchingTechnique[AMS96]KeyIntuition:Userandomizedlinearprojectionsoff()todefinerandomvariableXsuchthatXiseasilycomputedoverthestream(insmallspace)E[X]=COUNT(RAS)Var[X]issmall
BasicIdea:Defineafamilyof4-wiseindependent{-1,+1}randomvariables
Pr[=+1]=Pr[=-1]=1/2Expectedvalueofeach,E[]=0Variablesare4-wiseindependentExpectedvalueofproductof4distinct=0Variablescanbegeneratedusingpseudo-randomgeneratorusingonlyO(logN)space(forseeding)!Probabilisticerrorguarantees(e.g.,actualansweris10±1withprobability0.9)AMSSketchConstructionComputerandomvariables:andSimplyaddtoXR(XS)wheneverthei-thvalueisobservedintheR.A(S.A)streamDefineX=XRXStobeestimateofCOUNTqueryExample:DatastreamR.A:412414DatastreamS.A:3124241202134122134213Binary-JoinAMSSketchingAnalysisExpectedvalueofX=COUNT(RAS)
Using4-wiseindependence,possibletoshowthat
isself-joinsizeofR(second/L2moment)01BoostingAccuracyChebyshev’sInequality:
BoostaccuracytobyaveragingoverseveralindependentcopiesofX(reducesvariance)
ByChebyshev:xxxAverageyBoostingConfidenceBoostconfidencetobytakingmedianof2log(1/)independentcopiesofYEachY=BernoulliTrial
Pr[|median(Y)-COUNT|COUNT](ByChernoffBound)=Pr[#failuresin2log(1/)trials>=log(1/)]yyycopiesmedian2log(1/)“FAILURE”:SummaryofBinary-JoinAMSSketchingStep1:Computerandomvariables:andStep2:DefineX=XRXSSteps3&4:AverageindependentcopiesofX;Returnmedianofaverages
MainTheorem(AGMS99):SketchingapproximatesCOUNTtowithinarelativeerrorofwithprobabilityusingspace
Remember:O(logN)spacefor“seeding”theconstructionofeachX
xxxAverageyxxxAverageyxxxAverageycopiescopiesmedian2log(1/)ASpecialCase:Self-joinSizeEstimateCOUNT(RAR)(originalAMSpaper)Second(L2)momentofdatadistribution,Giniindexofheterogeneity,measureofskewinthedata
Inthiscase,COUNT=SJ(R),sowegetan(e,d)-estimateusingspaceonly
Best-caseforAMSstreamingjoin-sizeestimationDistinctValueEstimationProblem:Findthenumberofdistinctvaluesinastreamofvalueswithdomain[0,...,N-1]Zerothfrequencymoment,L0(Hamming)streamnormStatistics:numberofspeciesorclassesinapopulationImportantforqueryoptimizersNetworkmonitoring:distinctdestinationIPaddresses,source/destinationpairs,requestedURLs,etc.Example(N=64)Datastream:305301751037Numberofdistinctvalues:5Assumeahashfunctionh(x)thatmapsincomingvaluesxin[0,…,N-1]uniformlyacross[0,…,2^L-1],whereL=O(logN)Letlsb(y)denotethepositionoftheleast-significant1bitinthebinaryrepresentationofyAvaluexismappedtolsb(h(x))MaintainHashSketch=BITMAParrayofLbits,initializedto0Foreachincomingvaluex,setBITMAP[lsb(h(x))]=1Hash(akaFM)SketchesforDistinctValueEstimation[FM85]x=5h(x)=101100lsb(h(x))=2000001BITMAP543210Hash(akaFM)SketchesforDistinctValueEstimation[FM85]Byuniformitythroughh(x):Prob[BITMAP[k]=1]=Prob[]=Assumingddistinctvalues:expectd/2tomaptoBITMAP[0],d/4tomaptoBITMAP[1],...LetR=positionofrightmostzeroinBITMAPUseasindicatoroflog(d)[FM85]provethatE[R]=,whereEstimated=Averageseveraliidinstances(differenthashfunctions)toreduceestimatorvariancefringeof0/1saroundlog(d)000001BITMAP000111111111position<<log(d)position>>log(d)0L-1[FM85]assume“ideal”hashfunctionsh(x)(N-wiseindependence) [AMS96]:pairwiseindependenceissufficienth(x)=,wherea,barerandombinaryvectorsin[0,…,2^L-1]Small-spaceestimatesfordistinctvaluesproposedbasedonFMideasDelete-Proof:Justusecountersinsteadofbitsinthesketchlocations+1forinserts,-1fordeletesComposable:Compon
溫馨提示
- 1. 本站所有資源如無(wú)特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 高層土方施工方案
- 樓板管口灌漿施工方案
- 房產(chǎn)委托代理合同
- 旅游酒店業(yè)智慧客房服務(wù)系統(tǒng)建設(shè)方案
- 橋梁基礎(chǔ)注漿施工方案
- 鐵藝別墅施工方案
- 冷凍機(jī)房施工方案
- 低壓柜施工方案
- phc靜壓樁施工方案
- 順德瀝青鋪路工程施工方案
- 23G409先張法預(yù)應(yīng)力混凝土管樁
- 2024年江蘇省中小學(xué)生金鑰匙科技競(jìng)賽(高中組)考試題庫(kù)(含答案)
- DBJ53/T-39-2020 云南省民用建筑節(jié)能設(shè)計(jì)標(biāo)準(zhǔn)
- 2023年山東春季高考數(shù)學(xué)試題
- 初中 初一 勞動(dòng)教育《舉辦一次家庭聚會(huì)》教學(xué)設(shè)計(jì)
- 人教版六年級(jí)上冊(cè)數(shù)學(xué)第六單元測(cè)試卷(百分?jǐn)?shù)(一))
- 《基礎(chǔ)英語(yǔ)》課件 Unit 1 Thinking as a Hobby
- 雅思大作文資料_十大類題材_解析詳細(xì)_應(yīng)有盡有(最好全部打印后看_非常全)
- 小學(xué)綜合實(shí)踐食品添加劑
- 電氣消防設(shè)計(jì)說明專篇
- GCP知識(shí)考核試題與答案
評(píng)論
0/150
提交評(píng)論