




版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領
文檔簡介
Algorithms
for
Nearest
NeighborSearchPiotr
IndykMITNearest
Neighbor
SearchGiven:
a
set
P
of
n
points
in
Rd
Goal:
a
data
structure,
which
given
a
quepoint
q,
finds
the
nearest
neighbor
p
ofin
PpqOutline
of
this
talkVariantsMotivationMain
memory
algorithms:quadtreeskd-treesLocality
Sensitive
HashingSecondary
storage
algorithms:R-tree
(and
its
variants)VA-fileVariants
of
nearest
neighbor
Near
neighbor
(range
search):
find
one/alpoints
in
P
within
distance
r
from
q
Spatial
join:
given
two
sets
P,Q,
find
allpairs
p
in
P,
q
in
Q,
such
that
p
is
withindistance
r
from
q
Approximate
near
neighbor:
find
one/allpoints
p’
in
P,
whose
distance
to
q
is
atmost
(1+e)
times
the
distance
from
q
to
itsnearest
neighborMotivationDepends
on
the
value
of
d:low
d:
graphics,
vision,
GIS,
etchigh
d:similarity
search
in
databases
(text,
imagesfinding
pairs
of
similar
objects
(e.g.,
copyrviolation
detection)useful
subroutine
for
clusteringAlgorithmsMain
memory
(Computational
Geometry)linear
scantree-based:quadtreekd-treehashing-based:
Locality-Sensitive
HashingSecondary
storage
(Databases)R-tree
(and
numerous
variants)Vector
Approximation
File
(VA-file)QuadtreeSimplest
spatial
structure
on
Earth
!Quadtree
ctd.Split
the
space
into
2d
equal
subsquaresRepeat
until
done:only
one
pixel
leftonly
one
point
leftonly
a
few
points
leftVariants:split
only
one
dimension
at
a
timek-d-trees
(in
a
moment)Range
searchNear
neighbor
(range
search):put
the
root
on
the
stackrepeatpop
the
next
node
T
from
the
stackfor
each
child
C
of
T:if
C
is
a
leaf,
examine
point(s)
in
Cif
C
intersects
with
the
ball
of
radius
r
around
q,
add
Cthe
stackNear
neighbor
ctdNearest
neighborStart
range
search
with
r
=Whenever
a
point
is
found,
update
r
Only
investigate
nodes
with
respect
tocurrent
rQuadtree
ctd.Simple
data
structureVersatile,
easy
to
implementSo
why
doesn’t
this
talk
end
here
?Empty
spaces:
if
the
points
form
sparse
cloudit
takes
a
while
to
reach
themSpace
exponential
in
dimensionTime
exponential
in
dimension,
e.g.,
points
othe
hypercubeSpace
issues:
exampleK-d-trees
[Bentley’75]Main
ideas:only
one-dimensional
splitsinstead
of
splitting
in
the
middle,
choose
thsplit
“carefully”
(many
variations)near(est)
neighbor
queries:
as
for
quadtreesAdvantages:no
(or
less)
empty
spacesonly
linear
spaceExponential
query
time
still
possibleExponential
query
timeWhat
does
it
mean
exactly
?Unless
we
do
something
really
stupid,
query
time
ismost
dnTherefore,
the
actual
query
time
isMin[
dn,
exponential(d)
]
This
is
still
quite
bad
though,
when
the
dimensiois
around
20-30
Unfortunately,
it
seems
inevitable
(both
in
theoand
practice)Approximate
nearest
neighbor
Can
do
it
using
(augmented)
k-d
trees,
byinterrupting
search
earlier
[Arya
et
al’Still
exponential
time
(in
the
worst
caseTry
a
different
approach:for
exact
queries,
we
can
use
binary
searchtrees
or
hashingcan
we
adapt
hashing
to
nearest
neighborsearch
?Locality-Sensitive
Hashing[Indyk-Motwani’98]
Hash
functions
are
locality-sensitive,
ia
random
hash
random
function
h,
for
anypair
of
points
p,q
we
have:Pr[h(p)=h(q)]
is
“high”
if
p
is
“close”
tqPr[h(p)=h(q)]
is
“l(fā)ow”
if
p
is”far”
fromqDo
such
functions
exist
?Consider
the
hypercube,
i.e.,pointsfrom{0,1}dHamming
distance
D(p,q)=
#
positions
onwhich
p
and
q
differDefine
hash
function
h
by
choosing
a
set
Iof
k
random
coordinates,
and
settingh(p)
=projection
of
p
onIExampleTake–
d=10,
p=0101110010–
k=2,
I={2,5}Then
h(p)=11h’s
are
locality-sensitivePr[h(p)=h(q)]=(1-D(p,q)/d)kWe
can
vary
the
probability
by
changing
kk=1k=2distancedistancePrPrHow
can
we
use
LSH
?Choose
several
h1..hlInitialize
a
hash
array
for
each
hiStore
each
point
p
in
the
bucket
hi(p)
of
ti-th
hash
array,
i=1...lIn
order
to
answer
query
qfor
each
i=1..l,
retrieve
points
in
a
bucket
hreturn
the
closest
point
foundWhat
does
this
algorithm
do
?
By
proper
choice
of
parameters
k
and
l,
we
canmake,
for
any
p,
the
probability
thathi(p)=hi(q)
for
some
ilook
like
this:Can
control:Position
of
the
slopeHow
steep
it
isdistanceThe
LSH
algorithm
Therefore,
we
can
solve
(approximately)
the
nearneighbor
problem
with
given
parameter
rWorst-case
analysis
guarantees
dn1/(1+e)
query
ti
Practical
evaluation
indicates
much
better
beha[GIM’99,HGI’00,Buh’00,BT’00]Drawbacks:
works
best
for
Hamming
distance
(although
can
be
generalizeto
Euclidean
space)requires
radius
r
to
be
fixed
in
advanceSecondary
storage
Seek
time
same
as
time
needed
to
transferhundreds
of
KBsGrouping
the
data
is
crucialDifferent
approach
required:in
main
memory,
any
reduction
in
the
numberof
inspected
points
was
goodon
disk,
this
is
not
the
case
!Disk-based
algorithmsR-tree
[Guttman’84]departing
point
for
many
variationsover
600
citations
!
(according
to
CiteSeer)“optimistic”
approach:
try
to
answer
queries
inlogarithmic
timeVector
Approximation
File
[WSB’98]“pessimistic”
approach:
if
we
need
to
scan
the
whdata
set,
we
better
do
it
fastLSH
works
also
on
diskR-tree
“Bottom-up”
approach
(k-d-tree
was“top-down”)
:Start
with
a
set
of
points/rectanglesPartition
the
set
into
groups
of
small
cardinFor
each
group,
find
minimum
rectanglecontaining
objects
from
this
groupRepeatR-tree
ctd.R-tree
ctd.Advantages:Supports
near(est)
neighbor
search
(similarbefore)Works
for
points
and
rectanglesAvoids
empty
spacesMany
variants:
X-tree,
SS-tree,
SR-tree
etcWorks
well
for
low
dimensionsNot
so
great
for
high
dimensionsVA-file
[Weber,
Schek,Blott’98]Approach:In
high-dimensional
spaces,
all
tree-basedindexing
structures
examine
large
fraction
ofleavesIf
we
need
to
visit
so
many
nodes
anyway,
it
isbetter
to
scan
the
whole
data
set
and
avoidperforming
seeks
altogether1
seek
=
transfer
of
few
hundred
KBVA-file
ctd.
Natural
question:
how
to
speed-up
linearscan
?Answer:
use
approximationUse
only
i
bits
per
dimension
(and
speed-up
thscan
by
a
factor
of
32/i)Identify
all
points
which
could
be
returned
aan
answerVerify
the
points
using
original
data
setTime
to
sum
up“Curse
of
dimensionality”
is
indeed
a
curse
In
main
memory,
we
can
perform
sublinear-timesearch
using
trees
or
hashing
In
secondary
storage,
linear
scan
is
p
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經權益所有人同意不得將文件中的內容挪作商業(yè)或盈利用途。
- 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
- 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 橋梁工程延期整改措施
- 粵滬科版八年級物理上冊教案計劃
- 大學生創(chuàng)業(yè)火鍋店推廣渠道拓展方案范文
- 副校長教育科研協調計劃
- 教師教學行為師德警示心得體會
- 2025年教導處教學設備升級計劃
- 風電施工技術標準化方案和措施
- 十四五規(guī)劃人才培養(yǎng)心得體會
- 以小組合作之翼展初中數學課堂新程
- 以客戶價值為核心的產品規(guī)劃創(chuàng)新方法與實踐探究
- 蘇教版六年級科學下冊期末測試卷及答案
- 人教版高中物理(必修一)同步講義+練習4.6 超重和失重(含解析)
- 2022年江蘇省常州市強基計劃選拔數學試卷(附答案解析)
- 三年級數學下冊計算題大全(每日一練共18份)
- 山東省泰安市2023-2024學年高二下學期7月期末數學試題(原卷版+解析版)
- 2024年越南玻尿酸填充行業(yè)現狀及前景分析2024-2030
- JBT 14714-2024 鋰離子電池X射線檢測設備(正式版)
- 【欽州市S區(qū)居民飲用水安全現狀、問題及優(yōu)化建議探析8300字(論文)】
- 城市總體規(guī)劃專題研究報告總結
- 新課標小學生必背古詩75首(帶拼音)
- 高中數學知識
評論
0/150
提交評論