Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
Michael Mitzenmacher and Salil Vadhan

How Collaborations Arise
At a talk on Bloom filters, a hash-based data structure:
Salil: Your analysis assumes perfectly random hash functions. What do you use in your experiments?
Michael: In practice, it works even with standard hash functions.
Salil: Can you prove it?
Michael: Um...

Question
Why do simple hash functions work?
"Simple" = chosen from a pairwise (or k-wise) independent family; our results are more general.
"Work" = perform just like random hash functions in most real-world experiments.
Motivation: close the divide between theory and practice.

Applications
Potentially, wherever hashing is used:
- Bloom filters
- Power of two choices
- Linear probing
- Cuckoo hashing
- Many others

Review: Bloom Filters
Given a set S = {x1, x2, ..., xn} on a universe U, we want to answer queries of the form: is y in S?
A Bloom filter provides an answer
- in "constant" time (the time to hash),
- using a small amount of space,
- but with some probability of being wrong.

Bloom Filters
Start with an m-bit array B, filled with 0s.
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1 (example array state: B = 0100101001110110).
To check whether y is in S, check B at Hi(y) for each i: all k values must be 1.
A false positive is possible: all k values are 1, but y is not in S.
Parameters: n items, m = cn bits, k hash functions.

Power of Two Choices
Hash n items into n buckets. What is the maximum number of items (the load) in any bucket?
If buckets are chosen uniformly at random, a well-known result gives a maximum load of Θ(log n / log log n) w.h.p.
Suppose instead each ball can pick two bins independently and uniformly and goes into the bin with less load. The maximum load drops to log log n / log 2 + O(1) w.h.p.
With d ≥ 2 choices, the maximum load is log log n / log d + O(1) w.h.p.

Linear Probing
Hash elements into an array. If h(x) is already full, try h(x)+1, h(x)+2, ... until an empty spot is found, and place x there.
Performance metric: expected lookup time.

Not Really a New Question
"The power of two choices" = "balanced allocations": pairwise independent hash functions match the theory for random hash functions on real data.
Bloom filters: it was noted in the 1970s that pairwise independent hash functions match the theory for random hash functions on real data.
But the analysis depends on perfectly random hash functions, or on sophisticated, highly non-trivial hash functions.

Worst Case: Simple Hash Functions Don't Work!
Lower bounds show the result cannot hold for "worst case" inputs.
There exist pairwise independent hash families and inputs for which linear probing performance is provably worse than random [PPR 07].
There exist k-wise independent hash families and inputs for which Bloom filter performance is provably worse than random.
The question is open for other problems. The worst case does not match practice.

Random Data?
The analysis is usually trivial if the data is chosen independently and uniformly over a large universe: then all hashes appear "perfectly random".
But that is not a good model for real data.
We need an intermediate model between worst case and average case.

A Model for Data
Based on models of semi-random sources [SV 84, CG 85].
The data is a finite stream, modeled by a sequence of random variables X1, X2, ..., XT, each with range [N].
Each stream element has some entropy, conditioned on the values of the previous elements.
Correlations are possible, but each element has some unpredictability, even given the past.

Intuition
If each element has entropy, then extract that entropy to hash each element to a near-uniform location.
Extractors should provide near-uniform behavior.

Notions of Entropy
Max probability: p = max_x Pr[X = x]; min-entropy: H∞(X) = log(1/p); a block source has max probability p per block.
Collision probability: cp(X) = Σ_x Pr[X = x]²; Rényi entropy: H2(X) = log(1/cp(X)); a block source has collision probability p per block.
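These entropy measures can be computed directly from empirical frequencies. A minimal sketch in Python (the example distributions below are invented for illustration):

```python
import math
from collections import Counter

def collision_probability(samples):
    """cp(X) = sum_x Pr[X = x]^2, estimated from empirical frequencies."""
    n = len(samples)
    return sum((c / n) ** 2 for c in Counter(samples).values())

def renyi_entropy(samples):
    """H2(X) = log2(1 / cp(X))."""
    return math.log2(1.0 / collision_probability(samples))

def min_entropy(samples):
    """H_inf(X) = log2(1 / max_x Pr[X = x])."""
    p_max = max(Counter(samples).values()) / len(samples)
    return math.log2(1.0 / p_max)

# A uniform distribution over 8 values has H2 = H_inf = 3 bits.
uniform = list(range(8))
print(renyi_entropy(uniform))  # 3.0
print(min_entropy(uniform))    # 3.0

# A skewed distribution has less entropy; the measures agree within a factor of 2:
# H_inf <= H2 <= 2 * H_inf.
skewed = [0, 0, 0, 0, 1, 2, 3, 4]
print(min_entropy(skewed) <= renyi_entropy(skewed))  # True
```

For the skewed example, cp(X) = (1/2)² + 4·(1/8)² = 0.3125, so H2 ≈ 1.68 bits while H∞ = 1 bit, illustrating the factor-of-2 relationship.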

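The pairwise independent families the talk refers to have a classic concrete construction, h_{a,b}(x) = ((a·x + b) mod p) mod M, with p prime and a, b drawn at random. A brief sketch (the prime and range size below are arbitrary choices for illustration):

```python
import random

# Mersenne prime 2^31 - 1; the universe must fit below this modulus.
P = 2_147_483_647

def sample_hash(m, rng=random):
    """Draw one hash function from the family h_{a,b}(x) = ((a*x + b) mod P) mod m.
    With a != 0, this family is 2-universal (and close to pairwise independent,
    up to the rounding introduced by the final mod m)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = sample_hash(1024)
print(h(42) in range(1024))  # True: outputs always land in [0, 1024)
```

Sampling a and b once and reusing the resulting function is what makes the family cheap: two words of state and a multiply-add per evaluation.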
These notions of "entropy" agree within a factor of 2. We use collision probability / Rényi entropy.

Leftover Hash Lemma
Classical results apply [BBR 88, ILL 89, CG 85, Z 90].
Let H be a random hash function from a 2-universal hash family with range size M. If cp(X) ≤ 1/K, then (H, H(X)) is within statistical distance (1/2)·√(M/K) of (H, uniform).
... K ≥ 2^64 for pairwise independent hash functions, but a smaller K for 4-wise independence.

Open Problems
Improving our results:
- Other/better hash functions?
- Better analysis for 2- and 4-wise independent hash families?
Tightening the connection to practice:
- How to estimate the relevant entropy of data streams?
- Performance/theory of real-world hash functions?
- Generalize the model/analysis.
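The Leftover Hash Lemma's closeness-to-uniform guarantee is easy to evaluate numerically. A small sketch (the parameter values below are illustrative, not taken from the talk):

```python
import math

def lhl_epsilon(m_range, k_entropy):
    """Distance from uniform guaranteed by the Leftover Hash Lemma:
    if cp(X) <= 1/K and the hash range has size M, then (H, H(X)) is
    (1/2) * sqrt(M / K)-close to (H, uniform)."""
    return 0.5 * math.sqrt(m_range / k_entropy)

# Hashing into M = 2^20 buckets from a source with 64 bits of
# Renyi entropy (K = 2^64):
print(lhl_epsilon(2**20, 2**64))  # 2^-23, about 1.2e-07
```

The bound makes the entropy requirement concrete: to hash into M buckets with error ε, the source needs Rényi entropy roughly log M + 2·log(1/ε) bits.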

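The Bloom filter construction reviewed earlier can be sketched in a few lines. A minimal illustration (the k hash functions here are derived from seeded SHA-256 digests for simplicity, not the pairwise independent families under discussion):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m  # m-bit array, initially all 0s

    def _indexes(self, item):
        # Derive k array indexes from k seeded digests (illustrative choice).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        # Set B[a] = 1 for each of the k hash values a.
        for a in self._indexes(item):
            self.bits[a] = 1

    def __contains__(self, item):
        # Report membership only if all k positions are 1;
        # false positives are possible, false negatives are not.
        return all(self.bits[a] for a in self._indexes(item))

bf = BloomFilter(m=1024, k=4)
for x in ["alpha", "beta", "gamma"]:
    bf.add(x)
print("alpha" in bf)  # True: inserted items are never missed
```

With n items, m = cn bits, and k hash functions, only the bit array is stored; the false positive probability is governed by the fraction of bits set.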