Algorithms for Nearest Neighbor Search (university courseware, online)

Uploaded: 2023-11-30. Format: PPTX. Pages: 35. Size: 99.12 KB.


Algorithms for Nearest Neighbor Search
Piotr Indyk, MIT

Nearest Neighbor Search
- Given: a set P of n points in R^d
- Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P

Outline of this talk
- Variants
- Motivation
- Main memory algorithms:
  - quadtrees
  - k-d-trees
  - Locality-Sensitive Hashing
- Secondary storage algorithms:
  - R-tree (and its variants)
  - VA-file

Variants of nearest neighbor
- Near neighbor (range search): find one/all points in P within distance r from q
- Spatial join: given two sets P, Q, find all pairs p in P, q in Q, such that p is within distance r from q
- Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

Motivation
- Depends on the value of d:
  - low d: graphics, vision, GIS, etc.
  - high d:
    - similarity search in databases (text, images, etc.)
    - finding pairs of similar objects (e.g., copyright violation detection)
    - useful subroutine for clustering

Algorithms
- Main memory (Computational Geometry)
  - linear scan
  - tree-based: quadtree, k-d-tree
  - hashing-based: Locality-Sensitive Hashing
- Secondary storage (Databases)
  - R-tree (and numerous variants)
  - Vector Approximation File (VA-file)
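
As a baseline for everything that follows, linear scan can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and points stored as tuples; the function name `linear_scan_nearest` is hypothetical and not from the talk.

```python
import math

def linear_scan_nearest(points, q):
    """Return (distance, point) for the point in `points` closest to `q`."""
    best_dist, best_point = math.inf, None
    for p in points:
        dist = math.dist(p, q)  # Euclidean distance, O(d) per point
        if dist < best_dist:
            best_dist, best_point = dist, p
    return best_dist, best_point

P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
dist, nearest = linear_scan_nearest(P, (0.9, 1.2))
print(nearest)
```

Each query touches all n points and spends O(d) on each, which is the dn cost that the tree-based and hashing-based structures try to beat.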

Quadtree
- Simplest spatial structure on Earth!

Quadtree ctd.
- Split the space into 2^d equal subsquares
- Repeat until done:
  - only one pixel left
  - only one point left
  - only a few points left
- Variants:
  - split only one dimension at a time
  - k-d-trees (in a moment)

Range search
- Near neighbor (range search):
  - put the root on the stack
  - repeat
    - pop the node T from the stack
    - for each child C of T:
      - if C is a leaf, examine point(s) in C
      - if C intersects with the ball of radius r around q, add C to the stack
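
The construction and the stack-based range search above can be sketched as follows. This is a two-dimensional illustration under stated assumptions: the class name `Node`, the `leaf_size` and `max_depth` parameters, and the half-open assignment of points to subsquares are choices made here, not details from the talk. Leaves are examined when they are popped rather than when they are first seen as children, which visits the same cells.

```python
import math

class Node:
    """A square cell with center (cx, cy) and half side length `half`."""
    def __init__(self, cx, cy, half, points, leaf_size=1, depth=0, max_depth=16):
        self.cx, self.cy, self.half = cx, cy, half
        self.points, self.children = points, []
        # Stop when only a few points are left (or the cell is very small).
        if len(points) <= leaf_size or depth >= max_depth:
            return
        h = half / 2
        for dx in (-h, h):
            for dy in (-h, h):
                # Half-open rule: every point lands in exactly one subsquare.
                sub = [p for p in points
                       if (p[0] >= cx) == (dx > 0) and (p[1] >= cy) == (dy > 0)]
                if sub:  # empty subsquares are not stored
                    self.children.append(
                        Node(cx + dx, cy + dy, h, sub, leaf_size, depth + 1, max_depth))
        self.points = []  # internal nodes keep no points

    def intersects_ball(self, q, r):
        # Distance from q to the closest point of the square, compared with r.
        dx = max(abs(q[0] - self.cx) - self.half, 0.0)
        dy = max(abs(q[1] - self.cy) - self.half, 0.0)
        return math.hypot(dx, dy) <= r

def range_search(root, q, r):
    """Return all stored points within distance r of q."""
    found, stack = [], [root]           # put the root on the stack
    while stack:
        node = stack.pop()
        if not node.children:           # leaf: examine its point(s)
            found.extend(p for p in node.points if math.dist(p, q) <= r)
            continue
        for child in node.children:
            if child.intersects_ball(q, r):
                stack.append(child)
    return found

pts = [(1, 1), (2, 7), (6, 6), (7, 2), (6.5, 6.5)]
root = Node(4.0, 4.0, 4.0, pts)         # the root cell covers [0, 8] x [0, 8]
print(sorted(range_search(root, (6, 6), 1.0)))
```

The root cell is assumed to contain every point; choosing it is left to the caller in this sketch.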

Near neighbor ctd.
[Figure]

Nearest neighbor
- Start range search with r = ∞
- Whenever a point is found, update r
- Only investigate nodes with respect to current r
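
The shrinking-radius idea can be sketched on a small two-dimensional quadtree. The dict-based node layout, the helper names `build` and `box_distance`, and the choice to visit the closest child first are assumptions made for this illustration.

```python
import math

def build(points, box, leaf_size=1, depth=0):
    """Quadtree node as a dict; `box` is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    node = {"box": box, "points": points, "children": []}
    if len(points) <= leaf_size or depth >= 16:
        return node
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [((x0, y0, mx, my), False, False), ((mx, y0, x1, my), True, False),
                 ((x0, my, mx, y1), False, True), ((mx, my, x1, y1), True, True)]
    for sub_box, right, top in quadrants:
        sub = [p for p in points if (p[0] >= mx) == right and (p[1] >= my) == top]
        if sub:
            node["children"].append(build(sub, sub_box, leaf_size, depth + 1))
    node["points"] = []
    return node

def box_distance(box, q):
    """Distance from q to the closest point of the box (0 if q is inside)."""
    x0, y0, x1, y1 = box
    dx = max(x0 - q[0], 0.0, q[0] - x1)
    dy = max(y0 - q[1], 0.0, q[1] - y1)
    return math.hypot(dx, dy)

def nearest(root, q):
    r, best = math.inf, None            # start with r = infinity
    stack = [root]
    while stack:
        node = stack.pop()
        if box_distance(node["box"], q) >= r:
            continue                    # nothing in this cell can beat the current r
        for p in node["points"]:        # non-empty only at leaves
            d = math.dist(p, q)
            if d < r:
                r, best = d, p          # a point was found: update r
        # Push farther children first so that the closest child is popped next.
        for child in sorted(node["children"],
                            key=lambda c: box_distance(c["box"], q), reverse=True):
            stack.append(child)
    return r, best

pts = [(1, 1), (2, 7), (6, 6), (7, 2), (6.5, 6.5)]
root = build(pts, (0, 0, 8, 8))
print(nearest(root, (5, 5)))
```

Visiting the closest cell first tends to shrink r early, so more of the remaining cells fail the distance test and are skipped.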

Quadtree ctd.
- Simple data structure
- Versatile, easy to implement
- So why doesn't this talk end here?
  - Empty spaces: if the points form sparse clouds, it takes a while to reach them
  - Space exponential in dimension
  - Time exponential in dimension, e.g., points on the hypercube

Space issues: example
[Figure]

K-d-trees [Bentley'75]
- Main ideas:
  - only one-dimensional splits
  - instead of splitting in the middle, choose the split "carefully" (many variations)
  - near(est) neighbor queries: as for quadtrees
- Advantages:
  - no (or less) empty spaces
  - only linear space
- Exponential query time still possible
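
A k-d-tree along these lines can be sketched as below. Splitting at the median point and cycling through the coordinates is only one of the "many variations", chosen here as an assumption; the function names are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """One-dimensional splits: sort on one coordinate, split at the median point."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, q, best=(math.inf, None)):
    """Return (distance, point), skipping subtrees that cannot beat `best`."""
    if node is None:
        return best
    d = math.dist(node["point"], q)
    if d < best[0]:
        best = (d, node["point"])
    diff = q[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, q, best)
    if abs(diff) < best[0]:     # the ball around q crosses the splitting plane
        best = nearest(far, q, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
dist, point = nearest(tree, (9, 2))
print(point)
```

Each point is stored in exactly one node, which is where the linear space comes from. In high dimensions the pruning test fails for most nodes, so the query can still end up visiting nearly the whole tree.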

Exponential query time
- What does it mean exactly?
  - Unless we do something really stupid, query time is at most dn
  - Therefore, the actual query time is min[dn, exponential(d)]
- This is still quite bad though, when the dimension is around 20-30
- Unfortunately, it seems inevitable (both in theory and practice)

Approximate nearest neighbor
- Can do it using (augmented) k-d trees, by interrupting the search earlier [Arya et al.]
- Still exponential time (in the worst case)
- Try a different approach:
  - for exact queries, we can use binary search trees or hashing
  - can we adapt hashing to nearest neighbor search?

Locality-Sensitive Hashing [Indyk-Motwani'98]
- Hash functions are locality-sensitive if, for a random hash function h, for any pair of points p, q we have:
  - Pr[h(p)=h(q)] is "high" if p is "close" to q
  - Pr[h(p)=h(q)] is "low" if p is "far" from q

Do such functions exist?
- Consider the hypercube, i.e., points from {0,1}^d
- Hamming distance D(p,q) = number of positions on which p and q differ
- Define hash function h by choosing a set I of k random coordinates, and setting h(p) = projection of p on I

Example
- Take d=10, p=0101110010, k=2, I={2,5}
- Then h(p)=11

h's are locality-sensitive
- Pr[h(p)=h(q)] = (1-D(p,q)/d)^k
- We can vary the probability by changing k
[Figure: Pr versus distance, for k=1 and k=2]
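
The projection hash and its collision probability can be checked with a short sketch. It assumes points are bit strings and that the k coordinates are drawn independently (repeats allowed), which is the setting in which (1-D(p,q)/d)^k holds exactly; the point q and the function names are hypothetical.

```python
import itertools
import random

def sample_coordinates(d, k, rng):
    """Choose k random coordinates in 1..d, independently (repeats allowed)."""
    return [rng.randint(1, d) for _ in range(k)]

def h(p, I):
    """Projection of the bit string p on the coordinates in I (1-based)."""
    return "".join(p[i - 1] for i in I)

def hamming(p, q):
    """Number of positions on which p and q differ."""
    return sum(a != b for a, b in zip(p, q))

def collision_probability(p, q, k):
    """Exact Pr[h(p) = h(q)], by enumerating every possible choice of I."""
    d = len(p)
    choices = list(itertools.product(range(1, d + 1), repeat=k))
    return sum(h(p, I) == h(q, I) for I in choices) / len(choices)

p = "0101110010"
print(h(p, [2, 5]))                     # the example above: prints 11

q = "0111010011"                        # hypothetical second point, D(p, q) = 3
D, d, k = hamming(p, q), len(p), 2
print(collision_probability(p, q, k), (1 - D / d) ** k)

I = sample_coordinates(d, k, random.Random(0))
print(I, h(p, I), h(q, I))
```

The two numbers on the second printed line agree up to floating-point rounding. Raising k lowers the collision probability for every nonzero distance, and it does so faster for distant pairs than for close ones.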

How can we use LSH?
- Choose several h_1..h_l
- Initialize a hash array for each h_i
- Store each point p in the bucket h_i(p) of the i-th hash array, i=1...l
- In order to answer query q:
  - for each i=1..l, retrieve points in the bucket h_i(q)
  - return the closest point found
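
The indexing and query steps above can be sketched as a small class. This is an illustration for bit strings under Hamming distance; the class name `LSHIndex`, the use of Python dicts as hash arrays, and 0-based coordinates are assumptions made here.

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class LSHIndex:
    def __init__(self, points, k, l, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        # Choose h_1..h_l: each is a list of k random coordinates (0-based).
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        # One hash array (here a dict of buckets) for each h_i.
        self.tables = [defaultdict(list) for _ in range(l)]
        for p in points:
            for I, table in zip(self.coords, self.tables):
                table[self._h(p, I)].append(p)

    @staticmethod
    def _h(p, I):
        return "".join(p[i] for i in I)

    def query(self, q):
        """Return the closest point found in the l buckets h_i(q), or None."""
        candidates = set()
        for I, table in zip(self.coords, self.tables):
            candidates.update(table.get(self._h(q, I), ()))
        return min(candidates, key=lambda p: hamming(p, q), default=None)

points = ["0101110010", "1111100000", "0000011111"]
index = LSHIndex(points, k=3, l=5)
print(index.query("0101110010"))
print(index.query("0101110011"))
```

A query only computes distances to points that share a bucket with q in at least one table, so it may return a point that is not the true nearest neighbor, or nothing at all when every bucket is empty.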

What does this algorithm do?
- By proper choice of parameters k and l, we can make, for any p, the probability that h_i(p)=h_i(q) for some i look like this:
[Figure: probability versus distance]
- Can control:
  - position of the slope
  - how steep it is
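
The shape of that curve can be computed directly. The sketch below assumes the hypercube hash with single-function collision probability (1-D/d)^k and l independently chosen hash functions, which gives 1-(1-(1-D/d)^k)^l; the parameter values are arbitrary examples.

```python
def hit_probability(dist, d, k, l):
    """Pr[h_i(p) = h_i(q) for some i], for Hamming distance `dist` in {0,1}^d."""
    single = (1 - dist / d) ** k        # one hash function collides
    return 1 - (1 - single) ** l        # at least one of l independent ones does

d = 100
for k, l in [(5, 5), (10, 30), (20, 200)]:
    row = [round(hit_probability(D, d, k, l), 2) for D in (5, 20, 50)]
    print(k, l, row)
```

Larger k pushes the probability down at every distance and larger l pushes it back up, so together the two parameters set where the drop happens and how sharp it is.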

The LSH algorithm
- Therefore, we can solve (approximately) the near neighbor problem with given parameter r
- Worst-case analysis guarantees dn^{1/(1+ε)} query time
- Practical evaluation indicates much better behavior [GIM'99, HGI'00, Buh'00, BT'00]
- Drawbacks:
  - works best for Hamming distance (although can be generalized to Euclidean space)
  - requires radius r to be fixed in advance

Secondary storage
- Seek time is the same as the time needed to transfer hundreds of KBs
- Grouping the data is crucial
- Different approach required:
  - in main memory, any reduction in the number of inspected points was good
  - on disk, this is not the case!

Disk-based algorithms
- R-tree [Guttman'84]
  - departing point for many variations
  - over 600 citations (according to CiteSeer)
  - "optimistic" approach: try to answer queries in logarithmic time
- Vector Approximation File [WSB'98]
  - "pessimistic" approach: if we need to scan the whole data set, we better do it fast
- LSH works also on disk

R-tree
- "Bottom-up" approach (k-d-tree was "top-down"):
  - Start with a set of points/rectangles
  - Partition the set into groups of small cardinality
  - For each group, find the minimum rectangle containing objects from this group
  - Repeat
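
The bottom-up steps can be sketched in memory as follows. The talk does not say how the groups are chosen, so the grouping rule here (sort by the x coordinate of the box center, then cut into chunks of `group_size`) is purely an assumption, as are the function names; a disk-based R-tree would also size groups to fit a page.

```python
def mbr(rects):
    """Minimum bounding rectangle of rectangles given as (x0, y0, x1, y1)."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def build_rtree(rects, group_size=4):
    """Bottom-up build; returns the root as a (box, children) pair."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    # Start with the objects themselves: each is a node with no children.
    level = [(r, []) for r in rects]
    while len(level) > 1:
        # Assumed grouping rule: order by the x center of the box, then chunk.
        level.sort(key=lambda node: node[0][0] + node[0][2])
        groups = [level[i:i + group_size] for i in range(0, len(level), group_size)]
        # For each group, find the minimum rectangle containing its members.
        level = [(mbr([box for box, _ in g]), g) for g in groups]
    return level[0]

# Points are stored as degenerate rectangles (x, y, x, y).
pts = [(1, 1), (2, 7), (6, 6), (7, 2), (6.5, 6.5), (3, 3), (0, 5)]
rects = [(x, y, x, y) for x, y in pts]
root = build_rtree(rects, group_size=2)
print(root[0])
```

Near(est) neighbor queries can then reuse the stack-based search used for quadtrees, with the bounding rectangles playing the role of the cells.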

R-tree ctd.
[Figure]

R-tree ctd.
- Advantages:
  - Supports near(est) neighbor search (similar as before)
  - Works for points and rectangles
  - Avoids empty spaces
  - Many variants: X-tree, SS-tree, SR-tree, etc.
  - Works well for low dimensions
- Not great for high dimensions

VA-file [Weber, Schek, Blott'98]
- Approach:
  - In high-dimensional spaces, all tree-based indexing structures examine a large fraction of leaves
  - If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether
  - 1 seek = transfer of a few hundred KB

VA-file ctd.
- Natural question: how to speed up linear scan?
- Answer: use approximation
  - Use only i bits per dimension (and speed up the scan by a factor of 32/i)
  - Identify all points which could be returned as an answer
  - Verify the points using the original data set
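
The filter-and-verify idea can be sketched in memory as follows. It assumes coordinates in the unit cube [0, 1], a uniform grid with 2^bits cells per dimension, and Euclidean distance; the function names are hypothetical, and the published VA-file differs in details such as how cells are chosen and how the scan is ordered. The 32/i factor presumably refers to 32-bit coordinates in the original data.

```python
import math

def quantize(points, bits):
    """Approximate each coordinate in [0, 1] by a `bits`-bit cell index."""
    cells = 2 ** bits
    width = 1.0 / cells
    approx = [tuple(min(int(x / width), cells - 1) for x in p) for p in points]
    return approx, width

def bounds(cell, q, width):
    """Lower and upper bounds on the distance from q to any point in `cell`."""
    lower = upper = 0.0
    for c, x in zip(cell, q):
        a, b = c * width, (c + 1) * width         # the cell's interval
        lower += max(a - x, 0.0, x - b) ** 2
        upper += max(abs(x - a), abs(x - b)) ** 2
    return math.sqrt(lower), math.sqrt(upper)

def va_nearest(points, approx, width, q):
    # Pass 1: scan only the small approximations.
    bnds = [bounds(cell, q, width) for cell in approx]
    threshold = min(upper for _, upper in bnds)
    # Keep every point that could still be the answer.
    candidates = [i for i, (lower, _) in enumerate(bnds) if lower <= threshold]
    # Pass 2: verify the candidates using the original data set.
    best = min(candidates, key=lambda i: math.dist(points[i], q))
    return points[best], len(candidates)

pts = [(((i * 37) % 101) / 101, ((i * 53) % 89) / 89) for i in range(100)]
approx, width = quantize(pts, bits=4)
point, checked = va_nearest(pts, approx, width, (0.5, 0.5))
print(point, checked, len(pts))
```

The true nearest neighbor always survives the first pass, because its lower bound cannot exceed the smallest upper bound. Only the survivors require access to the full-precision data.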

Time to sum up
- "Curse of dimensionality" is indeed a curse
- In main memory, we can perform sublinear-time search using trees or hashing
- In secondary storage: linear scan
