Special Forum: Big Data Courseware

Big Data vs Smart Model: Beauty and the Beast
Prof. Yike Guo
Department of Computing
Imperial College London

Model: Mathematical Representation of a Simplified Physical World
- Modelling is an essential and inseparable part of all scientific activity.
- A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.
- To understand the world or an object (called a target T), a model M is a simplified mathematical representation of it.
- A model is the result of abstraction from the observations made, and it is used to make predictions.
[Diagram labels: Human / Sensor; Human / Machine; Human / Machine]

No Model Is Perfect:
Inherent Uncertainty: These targets consist of a set of phenomena that are continuous in both time and space, and they typically produce rich signals. Because the target is continuous in both time and space, the signals are in principle infinite. But observations (e.g. sensor readings) are made at discrete points in time and space, so they are incomplete and approximate, which introduces the "uncertainty".

Overfitting or Underfitting: When learning a model from observations, such as a nonlinear regression model, we need to choose parameters such as K. Because the information from observations is partial, it is hard to make a perfect choice of K. This imperfection causes model error, such as underfitting (small K) and overfitting (large K).
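The effect of the order K can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: the target function sin(pi x), the eight sample points, the alternating +/-0.1 "sensor noise", and the orders 1, 3 and 7 are all hypothetical choices, not taken from the slides. It fits a K-order polynomial by least squares in plain Python and reports the error on the training observations and on unseen points of the true curve.

```python
import math

def fit_polynomial(xs, ys, k):
    """Least-squares fit of a K-order polynomial; returns coefficients c[0..k]."""
    n = k + 1
    # Normal equations (V^T V) c = V^T y, where V is the Vandermonde matrix.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs

def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def target(x):
    # Hypothetical continuous target T.
    return math.sin(math.pi * x)

# Discrete, noisy observations of the continuous target (assumed values).
train_x = [-1.0 + 2.0 * i / 7 for i in range(8)]
train_y = [target(x) + 0.1 * (-1) ** i for i, x in enumerate(train_x)]

# Unseen points of the true curve, used to judge generalisation.
test_x = [-1.0 + 2.0 * i / 100 for i in range(101)]
test_y = [target(x) for x in test_x]

train_mse, test_mse = {}, {}
for k in (1, 3, 7):
    coeffs = fit_polynomial(train_x, train_y, k)
    train_mse[k] = mse(coeffs, train_x, train_y)
    test_mse[k] = mse(coeffs, test_x, test_y)
    print(f"K={k}: train MSE={train_mse[k]:.6f}, test MSE={test_mse[k]:.6f}")
```

The training error can only shrink as K grows, and at K = 7 the polynomial passes through all eight noisy observations. Whether the error on unseen points also shrinks is a separate question, which is why K cannot be chosen from the training error alone.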
Simplification: From observations, we project from a multi-dimensional world a simplified model with significantly reduced dimensionality to focus on the features or properties we are interested in.

[Figure: Nonlinear regression: K-order polynomial]

So, Why Model?

George Box (statistician): "All models are wrong, but some are useful." Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. (1980)

Peter Norvig (Google): "All models are wrong, and increasingly you can succeed without them." (2008)

Chris Anderson (Wired): There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. (The Data Deluge Makes the Scientific Method Obsolete) (2012)

The Google Argument

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising; it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: if the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Model Free Sensor Informatics: Query Driven

Database table raw-data (filled by the sensor network):

time   id   temp
10am   1    20
10am   2    21
..     ..   …
10am   7    29

User workflow:
1. Extract all readings into a file
2. Run MATLAB/R/other data processing tools
3. Write output to a file/back to the database
4. Write data processing tools to process/aggregate the output (maybe using DB)
5. Decide new data to acquire
Repeat

Model-free sensing treats the sensory system as a database, and sensing as querying to fetch data from the physical world. One of the leading vendors [Crossbow] is bundling a query processor with their devices.

Wikisensing: A Model Free Sensor Informatics System Based on Big Data Architecture

Model Free Sensing is Super Inefficient
- Data misrepresentation without a model
- Latent information missing without a model
- High demand for computation/storage without a model
- Requires too much interoperability between sensors and analytics

Bayesian: Data Is Not the Enemy of Models, Rather a Great Supporter!

Bayesian probability is a formalism that allows us to reason about beliefs in models under conditions of uncertainty, based on the observations (data).

If we have observed that a particular event has happened, such as Britain coming 10th in the medal table at the 2004 Olympics, then there is no uncertainty about it.

However, suppose a is the statement "Britain sweeps the boards at the 2012 London Olympics, winning more than 30 Gold Medals!" made before the 28th of July. Since this is a statement about a future event, nobody can state with any certainty whether or not it is true. Different people may have different beliefs in the statement depending on their specific knowledge of factors that might affect its likelihood.

The beliefs in the model were changing daily based on the performance data available each day. By the 10th of August, most people's belief in this model should have been almost 80%.

Thus, in general, a person's subjective belief in a statement a will depend on some body of knowledge K. We write this as P(a|K). Henry's belief in a is different from Marcel's because they are using different K's. However, even if they were using the same K they might still have different beliefs in a.

The expression P(a|K) thus represents a belief measure. Sometimes, for simplicity, when K remains constant we just write P(a), but you must be aware that this is a simplification.

Model and Data Interaction: Bayesian Inference
- Bayes Rule: interaction between data and model
- Learning as a sequence of interactions
P(θ | Y) = p(Y | θ) p(θ) / p(Y)

Big Data Meets Smart Models: A Bayesian Approach towards Sensor Informatics
- We need models: a model is the representation of our knowledge so far
- Data: the observations which may revise our belief in the models we have
- Analysis: assessing our belief and updating our models to make them more believable
- Sensing: acquiring the data needed to update (enrich) models
- Models are learned from data (observations) by scientists (theoretical abstraction) or by machines (machine learning)
- Models are hypotheses (when making new observations)
- Models are knowledge (when belief is established)

Sensor Informatics:
- Sensing management: managing the "neediness": when and where to sense?
- Sensing analytics: managing model updating: how to enrich models with observations?
- Reasoning: decision making based on integration of trusted models

P(M | D) = P(D | M) P(M) / P(D)
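Bayes' rule P(M | D) = P(D | M) P(M) / P(D), applied repeatedly, gives "learning as a sequence of interactions": each posterior becomes the prior for the next observation. The sketch below assumes a hypothetical setting that is not in the slides: two candidate models of a sensor that reports "wet" or "dry", with made-up likelihoods and a uniform starting belief.

```python
def bayes_update(prior, likelihoods):
    """Return the posterior P(M | D) from a prior P(M) and likelihoods P(D | M)."""
    evidence = sum(prior[m] * likelihoods[m] for m in prior)  # P(D)
    return {m: prior[m] * likelihoods[m] / evidence for m in prior}

# Hypothetical models: probability that the sensor reads "wet" under each model.
p_wet = {"rainy_model": 0.8, "dry_model": 0.2}

# Initial belief P(M): no preference between the two models (an assumption).
belief = {"rainy_model": 0.5, "dry_model": 0.5}

observations = ["wet", "wet", "dry", "wet"]
history = [dict(belief)]
for obs in observations:
    likelihoods = {m: (p if obs == "wet" else 1.0 - p) for m, p in p_wet.items()}
    belief = bayes_update(belief, likelihoods)  # the posterior becomes the next prior
    history.append(dict(belief))

for step, b in enumerate(history):
    print(step, {m: round(p, 4) for m, p in b.items()})
```

The single "dry" reading pulls the belief in `rainy_model` back down, and the next "wet" reading restores it, which is the daily revision of belief the Olympic example describes.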
Surprising Event: When an Observation Does Not Fit a Known Model
- When the posterior and the prior (P(M|D) vs. P(M)) differ greatly -> surprise!
- How great is "great"? Surprise threshold α
- Measure: Kullback-Leibler divergence
- Other methods: significance level, Chebyshev's Theorem, …
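One way to make the threshold test concrete is to compute the Kullback-Leibler divergence from the prior to the posterior and compare it with α. The sketch below assumes discrete model beliefs, the direction KL(posterior || prior), natural logarithms, and a hypothetical threshold of 0.1; none of these specifics come from the slides.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0.0)

def is_surprising(prior, posterior, alpha):
    """Flag a surprise when the belief shift exceeds the threshold alpha."""
    return kl_divergence(posterior, prior) > alpha

prior = {"M1": 0.5, "M2": 0.5}
mild_posterior = {"M1": 0.55, "M2": 0.45}    # the observation barely moves the belief
strong_posterior = {"M1": 0.99, "M2": 0.01}  # the observation overturns the belief

ALPHA = 0.1  # hypothetical surprise threshold

for name, post in [("mild", mild_posterior), ("strong", strong_posterior)]:
    kl = kl_divergence(post, prior)
    print(f"{name}: KL={kl:.4f}, surprise={is_surprising(prior, post, ALPHA)}")
```

Choosing α is the practical difficulty: too low and every noisy reading triggers new sensing, too high and real model errors go unnoticed.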
Example: from the model, we get C(A, B) (e.g. a multivariate Gaussian distribution).
- A: 100 mm, B: 50 mm -> model consistent
- A: 100 mm, B: 500 mm -> surprise!

Using Compressive Sensing Technology to Optimize Observations

Camera example: Image -> Analog Signal -> Digital Data -> Compressed Data -> Information
- Why sense so much data and then throw it away?
- Why not sense information directly?

Compressive sensing: take advantage of sparseness to recover signals from under-determined systems using only a small number of measurements. Unobserved behavior (behavior not captured by the current model) is typically sparse.
- Reconstruction methods: L1-min, Bayesian CS.
- Sensing data is enough when we can recover the needed information through compressive sensing.
- Ψ: CS matrix built from the model
- Φ: placement matrix

How to Update the Model: Parameter Estimation

[Figure: ANSYS nodal solution, TEMP (AVG), STEP=360, SUB=1, TIME=1800, range 131.03 to 646.41; DEC 25 2011 21:15:23]

Estimate the parameter θ that maximizes the likelihood of the data given the model: θ* = argmax_θ P(D | M, θ)
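Maximum-likelihood estimation of θ can be sketched on a toy model. Everything specific here is an assumption for illustration: the model y = θ x with Gaussian noise of known standard deviation, and the four data points. Under those assumptions the maximizer has a closed form, which the code checks against the log-likelihood itself.

```python
import math

def log_likelihood(theta, xs, ys, sigma=1.0):
    """Gaussian log-likelihood of the data under the model y = theta * x + noise."""
    n = len(xs)
    squared_error = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - squared_error / (2.0 * sigma ** 2)

def mle_theta(xs, ys):
    """Closed-form maximizer of the likelihood for this one-parameter model."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical observations D.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

theta_hat = mle_theta(xs, ys)
print(f"theta_hat = {theta_hat:.4f}")
for theta in (theta_hat - 0.1, theta_hat, theta_hat + 0.1):
    print(f"theta={theta:.2f}: log-likelihood={log_likelihood(theta, xs, ys):.4f}")
```

For models without a closed form, such as a numerical simulation, the same objective would be maximized by an iterative optimizer instead.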
Model: An Example in Digital City

Modelling city life via causality: C(eA, eB) is used to predict the current value at location A when the value at another location B is given.
- Location: physical / logical locations with causality (through sensory cortex) (city areas A, B)
- Relationship: topology (geo-topology between A and B: diffusion structure)
- Event: events, which are the dynamics of the observable signal S = f(E) (e.g. heavy rainfall)

Digital City Model: Looking into the Details
- Ontologies are adopted to represent locations L, relationships R, events E, and signals S.
- Diffusion: an event e1 ∈ E in n1 causes another event e2 ∈ E in n2 when the two nodes n1, n2 in G are linked.
- System T = (L, R, E)
- Model M(T) = (G, ?, B)

Training for causality ?: use a Bayesian network to represent the conditional independencies between cause and target variables:
1. Gaussian Mixture Models (GMMs), estimated via expectation maximization (EM)
2. Gaussian Process with Bayesian inference
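Predicting the value at location A from the value at location B can be sketched with the simplest possible choice for C(eA, eB): a single bivariate Gaussian. This is a deliberate simplification of the GMM and Gaussian Process options listed above, and all numbers (rainfall means, variances, covariance) are hypothetical.

```python
def conditional_gaussian(mean_a, mean_b, var_a, var_b, cov_ab, b_value):
    """Mean and variance of A given B = b_value for a bivariate Gaussian."""
    gain = cov_ab / var_b                           # how strongly B informs A
    cond_mean = mean_a + gain * (b_value - mean_b)  # shift A by B's deviation
    cond_var = var_a - gain * cov_ab                # uncertainty left after seeing B
    return cond_mean, cond_var

# Hypothetical rainfall statistics (mm) for two linked city areas A and B.
MEAN_A, MEAN_B = 100.0, 50.0
VAR_A, VAR_B, COV_AB = 400.0, 100.0, 150.0

predicted_a, remaining_var = conditional_gaussian(
    MEAN_A, MEAN_B, VAR_A, VAR_B, COV_AB, b_value=60.0
)
print(f"Expected rainfall at A given B=60 mm: {predicted_a} mm (variance {remaining_var})")
```

An observation at A that falls far outside this conditional distribution is the kind of reading the surprise test is meant to flag.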
Model Driven Sensing: No Surprise!

When the surprise > surprise threshold:
- Diversity detected: identify the incorrect causality C(el, ep), which is sparse
- Compressive sensing approach: new observation -> measurement that could revise the model in model space to maximize the likelihood of the observations
- Focusing on diversity
[Diagram labels: Placement; Model Updating]

The dynamics of model update: Surprise -> Sensing -> Model Updating
- The goal of sensing: capturing surprise
- The goal of analysis: revising the model
- A model cannot overfit / underfit: when there is diversity, it can be updated -> consistent with the universe (target)

Model Update

It's Bayesian: P(M, ? | D) = P(D | M, ?) P(M, ?) / P(D)
T: target, M: model, ?: top-down parameter

When ? is fixed: P(M | D) = P(D | M) P(M) / P(D)
-> The difference between posterior and prior is the "surprise"
-> bottom-up attention
-> model update (data assimilation): combining observations of the current state of a system with the results from a model (the forecast) to produce an analysis. The model is then advanced in time and its result becomes the forecast in the next analysis cycle.
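The forecast/analysis cycle of data assimilation can be sketched for a single scalar state. The assumptions are loud ones: Gaussian errors, a scalar temperature-like state, a made-up `advance` function standing in for the numerical model, and an observation variance of 1.0. The update rule is the standard scalar Kalman-style blend, used here only as an illustration.

```python
def analysis_step(forecast, forecast_var, observation, obs_var):
    """Blend a model forecast with an observation (scalar Gaussian case)."""
    gain = forecast_var / (forecast_var + obs_var)  # weight given to the observation
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var

def advance(state, state_var):
    """Hypothetical model dynamics: relax towards 20.0 and add model error."""
    return 20.0 + 0.9 * (state - 20.0), 0.81 * state_var + 0.5

OBS_VAR = 1.0                    # assumed sensor noise variance
state, var = 25.0, 4.0           # initial analysis and its uncertainty
for obs in [22.0, 21.5, 21.0]:   # hypothetical sensor readings, one per cycle
    forecast, forecast_var = advance(state, var)  # the model produces the forecast
    state, var = analysis_step(forecast, forecast_var, obs, OBS_VAR)
    print(f"forecast={forecast:.3f} obs={obs} analysis={state:.3f} var={var:.3f}")
```

Each analysis lies between the forecast and the observation, weighted by their uncertainties, and then seeds the next forecast.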
When ? is updated: P(M, ?) = P(M | ?) P(?)
-> top-down attention (alertness)
-> model update

Adaptive Observation: Sensing and Numerical Modelling
CityGML Ontology -> GIS -> Geometry mesh

Building an Initial Model and Making Predictions by Simulation
Setting up boundary conditions, numerical schemes, model parameters, etc.

Simulation
24-building case (fine mesh, 600,000 nodes): 20 processors

Simulation
Moving vehicles and scalar dispersions in street canyons

Using Sensors to Verify the Prediction Results of the Model
Sensing: acquiring data to obtain the posterior of the model, in order to validate the model (if consistent) or update it.
P(M | D) = P(D | M) P(M) / P(D)
[Diagram labels: Data; sensing; Model; validate; update]

New WikiSensing: Elastic Sensing Environment for Large Scale Sensor Informatics
- Elastic sensing theory based on Bayesian inference
- Big Data architecture for large-scale sensory data management
- Ontology for background knowledge management
- Model-driven adaptive observation support
- Digital City and digital life applications

The Architecture of the New WikiSensing System

Ontology Used to Organise the Complex Knowledge Management
- Using ontology to represent the targets, signals, sensing methods, measurements, etc.
- Ontology to support flexible resolution
- Upper ontology for unified operation
[Diagram label: OntoSensor]

Conclusion
- Big data offers great opportunities for building smart models
- Big data provides a new methodology for model research
- New informatics comes from the closely coupled integration of the data and the model worlds
- Bayesian theory provides a natural foundation for such an integration
- Sensor informatics is a good example of such a paradigm
- A new uniform framework of sensor informatics can be developed based on Bayesian theory, where the dynamics of data and model capture the essence of building a sensory system
- We are developing the WikiSensing system to realise this paradigm

Thank you

Understanding Big Data
Haixun Wang

Data Explosion
- MB = 10^6 bytes: a typical book in text format
- GB = 10^9 bytes: a one-hour video is about 1 GB; data produced by a biology experiment in one day
- TB = 10^12 bytes: astronomy data in one night; the US Library of Congress has 1000 TB of data; the search log of Bing is 20 TB per day (2009)

The Arecibo Telescope
- World's largest radio telescope
- Diameter: 305 m (1,000 ft)
- Area: 18 acres
- Location: Arecibo, Puerto Rico
- The P-ALFA surveys: 800 terabytes in 5 years

Software Driven Telescope
- From few, large, expensive, directional dishes to many, small, cheap, omnidirectional antennae
- A large number of high-speed input streams (2 Gbps per antenna, 25,000 antennae in an area of 340 km in diameter)

Challenge 1: It's the data, stupid!
- Data size
- Data complexity
- Key/value store, column store, document store, graph systems

Big data drives tomorrow's economy.
- The value of big data lies in its degree of connectedness.
- Existing systems cannot handle the rich connectedness of big data.

RDBMS and Rich Relationships
- Performance of multi-way joins is very poor in RDBMS
- Managing data of rich connectedness requires multi-way joins in RDBMS

Trinity
- A general-purpose, distributed, in-memory graph system
- Online graph query processing
- Offline graph analytics

Trinity Performance Highlight
Online query processing:
- visiting 2.2 million users (3-hop neighborhood) on Facebook: <= 100 ms
- foundation for graph-based services, e.g., entity search
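A 3-hop neighborhood query is a bounded breadth-first traversal. The sketch below runs on a plain in-memory adjacency list in Python and is not Trinity's API; the tiny friendship graph and the function name `k_hop_neighborhood` are hypothetical. It shows why a graph store answers this query by following links, where a relational store would need one self-join per hop.

```python
from collections import deque

def k_hop_neighborhood(graph, start, k):
    """Return all nodes reachable from `start` in at most k hops (excluding start)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen

# Hypothetical friendship graph stored as an adjacency list.
friends = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}

three_hop = k_hop_neighborhood(friends, "a", 3)
print(sorted(three_hop))
```

Each node is visited once, so the cost grows with the size of the neighborhood rather than with the size of the whole graph.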
Offline graph analytics:
- one iteration on a 1-billion-node graph: <= 60 sec
- foundation for analytics, e.g., social analytics

People Search Demo

Multi-way Join vs. Graph Traversal
[Diagram labels: Company, Incident, Problem, …; ID columns (ID1, ID2, ID3, ID4, …); RDBMS; Trinity]

Challenge 2: Interpretation of Big Data
- IBM Watson: runs on 2,880 cores, 15 terabytes of RAM, and 80 kW of power
- A human brain: runs on a tuna fish sandwich and a glass of water

The Eternal Quest
[Chart. Axes: answering the question; understanding the question. Labels: unconstrained natural language, inferencing & reasoning, domain-specific language, simple calculation. Systems: Human (Turing Test), SIRI, Watson, Wolfram Alpha, Google/Bing, SQL, calculator]

Turning the Web into a Database

What you see when you look at my homepage …

Haixun Wang
Microsoft Research Asia
Email: haixunw @ microsoft . com
Tel: +86-10-58963289
Tel: +1-914-902-0749
I joined Microsoft Research Asia in 2009. I was with IBM T. J. Watson Research Center from 2000 to 2009. I received the B.S. and M.S. degrees in Computer Science from Shanghai Jiao Tong University in 1994 and 1996, and the Ph.D. degree in Computer Science from University of California, Los Angeles in June, 2000.

What a machine sees when it looks at my homepage …
- a JPEG image (a jpeg file)
- text in a big bold font
- 4 lines of text
- another dozen lines of text with two embedded URLs

Semantic Web?
- Number 1 trend in 2008 (Richard MacManus)
- The infrastructure to power the Semantic Web is already here. (Tim Berners-Lee)
- Unstructured information will give way to structured information, paving the road to intelligent computing. (Alex Iskold)

More data beats better algorithms
Banko and Brill 2001

[Chart: English-Spanish translation quality, Microsoft technical texts, 2001 to 2007. Y axis: mean translation quality (1 = incomprehensible, 4 = perfect). Series: MSRMT, Systran, Logos. Annotations: "Improve algorithms, scale system, and add data!"; "Rule-based system with expensive customizations for Microsoft"; "Off-the-shelf rule-based system"]
From Rick Rashid's talk: It's a data driven world, get over it!

Probase
- isA (concept, entities)
- isPropertyOf (attributes)
- Co-occurrence (isCEOof, LocatedIn, etc.)
- Concepts ("Spanish Artists")
- Entities ("Pablo Picasso")

Explicit vs. Latent Knowledge
- Abstract representations (such as clusters from latent analysis) that lack linguistic counterparts are hard to learn or validate and tend to lose information.
- Human language has evolved over millennia to have words for the important concepts; let's use them.
Halevy, Norvig, Pereira, "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, 2009.

What is interpretation?

Add Common Sense to Computing
Pablo Picasso, 25 Oct 1881, Spanish

Which is "kiki" and which is "bouba"?
[Diagram labels: sound, shape, zigzaggedness]

- China, India: country
- China, India, Brazil: emerging market
- body, taste, smell: wine
- "The engineer is eating an apple": IT company vs. fruit

Multiple Concepts
Obama's real-estate policy
- president, politician
- investment, property, asset, plan, document
- president, politician, investment, property, asset, plan, document

Multiple Concepts
- apple: software company, brand, fruit, juice
- adobe: brand, software company, material
- apple adobe: software company, software manufacturer, brand
- other labels: juice, material, brand, company, fruit
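The idea that several terms together narrow down the intended concept can be sketched with plain set intersection. This is a deliberately crude stand-in, not Probase's method: the concept sets are copied from the apple/adobe lists on the slide, the function name `shared_concepts` is hypothetical, and a real system would rank concepts probabilistically rather than intersect fixed sets.

```python
def shared_concepts(terms, concept_map):
    """Concepts that every term in `terms` belongs to (simple set intersection)."""
    concept_sets = [concept_map[term] for term in terms]
    return set.intersection(*concept_sets)

# Concept lists as they appear on the slide for each single term.
concepts = {
    "apple": {"software company", "brand", "fruit", "juice"},
    "adobe": {"brand", "software company", "material"},
}

common = shared_concepts(["apple", "adobe"], concepts)
print(sorted(common))
```

Seen alone, "apple" could be a fruit and "adobe" a material; seen together, only the company-related readings survive.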
Multiple Concepts
Obama's real-estate policy
- president, politician
- investment, property, asset, plan, document
- president, politician, investment, property, asset, plan, document, thing, issue, term, example

Example: (from B. Dolan)
Who assassinated Abraham Lincoln?

The far reaching implications
Scientific Method

Scientific Method

What really counts is understanding, or a mastery of some common vocabulary

How can big data help?
- A much more rapid cycle of hypothesis generation and testing
- General access to knowledge in science
- Autonomous experimentation, with an 'active learning' model

Technological Singularity
If machines could even slightly surpass human intellect, they could improve their own designs in ways unforeseen by their designers, and thus recursively augment themselves into far greater intelligences.

Thanks

Big Data Platform and Internet Application Services

Agenda
- Current problems and challenges
- Solutions from companies in China and abroad
- Tencent's approach in the big data domain

Agenda
Part 1: Current problems and challenges

Big Data Challenge (1): Massive Data Storage Technology
1. As PB-scale data evolves toward ZB scale, how can storage and computing costs be reduced?
   Data volume: 46 PB
   Number of machines: 5,600
2. Rapidly growing industrial-grade businesses pose new challenges for the timeliness and reliability of big data computing.

Big Data Challenge (2): Data Is Hard to Apply

Big Data Challenge (3): Accurate Recommendation Is Hard
1. Enterprise information overload (across the whole Internet)
2. Low recommendation precision
3. Evaluating recommendation results effectively
4. How to effectively collect data on users' active behavior

Agenda
Part 2: Solutions from companies in China and abroad

Domestic Case Study: Alipay Big Data Platform
Alipay Hadoop-related application services
- Hadoop open-source products: HBase, Mahout, Hive/Pig
- Dolphin technology: Fur Seal, Octopus, Starfish, Swordfish, Blue Whale, …
- Massive computation: a Hadoop-based massive storage and computing cluster that also provides one-stop management of computing and storage resources
- Distributed data mining: distributed data mining based on Mahout
- Data distribution center: provides batch data extraction and loading, plus near-real-time message and log distribution (using a client pull model)
- Real-time search over massive data: built on the integration of HBase and Solr, providing real-time query and full-text retrieval over data at the hundred-billion-record scale
- Stream computing framework: a streaming framework similar to M/R that allows applications to be implemented quickly and provides online data processing services
- Massive data query: based on Hive and Pig, provides a web-based visual query service over massive data

- Online news: Google News reports that recommendations increase articles viewed by 38% (Das et al. 2007).
- Movies: Netflix reports that over 60% of their rentals originate from recommendations (Thompson 2008).
- Amazon, which sells music, books, and movies: 35% of sales are reported to originate from recommendations (Lamere & Green 2008).
- Video, YouTub