Special Forum: Big Data Courseware

Big Data vs Smart Model: Beauty and the Beast
Prof. Yike Guo
Department of Computing, Imperial College London

Model: Mathematical Representation of a Simplified Physical World

• Modelling is an essential and inseparable part of all scientific activity.
• A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.
• To understand the world or an object (called a target T), a model M is a simplified mathematical representation of it.
• A model is the result of abstraction from the observations made, and it is used to give predictions.

Diagram labels: Human / Sensor; Human / Machine; Human / Machine.

No Model Is Perfect

• Inherent uncertainty: Targets consist of a set of phenomena that are continuous in both time and space, and they typically produce rich signals. Because of this continuity, the signals are in principle infinite. Observations (e.g. sensor readings), however, are made at discrete points in time and space, so they are incomplete and approximate, which introduces "uncertainty".
• Overfitting or underfitting: When learning a model from observations, such as a nonlinear regression model, we need to choose parameters such as K. Because the information from observations is partial, it is hard to make a perfect choice of K. This imperfection causes model error, such as underfitting (small K) or overfitting (large K).
• Simplification: From observations, we project a multi-dimensional world onto a simplified model with significantly reduced dimensionality, to focus on the features or properties we are interested in.

Figure label: Nonlinear regression, K-order polynomial.
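The effect of K can be illustrated with a small experiment. The sketch below is not from the slides: it assumes a sine-shaped target, Gaussian noise, twelve training points, and the helper names `fit_polynomial`, `predict`, and `mse`, all chosen for illustration. It fits K-order polynomials by least squares and reports the error on the noisy training points and on noise-free held-out points.

```python
import math
import random


def fit_polynomial(xs, ys, k):
    """Least-squares fit of a degree-k polynomial via the normal equations."""
    n = k + 1
    # Normal equations (A^T A) c = A^T y for the Vandermonde matrix A[i][j] = xs[i] ** j.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    aty = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    aug = [row + [b] for row, b in zip(ata, aty)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** j for j, c in enumerate(coeffs))


def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def target(x):
    # Assumed "true" target, used only to generate illustrative data.
    return math.sin(math.pi * x)


rng = random.Random(0)
train_x = [-1 + 2 * i / 11 for i in range(12)]
train_y = [target(x) + rng.gauss(0, 0.2) for x in train_x]  # noisy observations
test_x = [-0.95 + 1.9 * i / 49 for i in range(50)]
test_y = [target(x) for x in test_x]                        # noise-free reference

errors = {}
for k in (1, 3, 7):
    coeffs = fit_polynomial(train_x, train_y, k)
    errors[k] = (mse(coeffs, train_x, train_y), mse(coeffs, test_x, test_y))
    print(f"K={k}: train MSE={errors[k][0]:.4f}, test MSE={errors[k][1]:.4f}")
```

The training error can only go down as K grows, because each larger polynomial family contains the smaller one. The held-out error is the quantity to watch: a small K typically misses the shape of the target, while a large K tends to start fitting the noise.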

So, Why Model?

• George Box (statistician): "All models are wrong, but some are useful." Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us.
• Peter Norvig (Google): "All models are wrong, and increasingly you can succeed without them."
• Chris Anderson (Wired): There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. (The Data Deluge Makes the Scientific Method Obsolete)

Timeline labels on the slide: 1980, 2008, 2012.

The Google Argument

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later.

For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising; it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: if the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Model Free Sensor Informatics: Query Driven

Sensor network -> database table (raw data):

time   id   temp
10am   1    20
10am   2    21
..     ..   ..
10am   7    29

User workflow:
1. Extract all readings into a file.
2. Run MATLAB/R/other data processing tools.
3. Write output to a file/back to the database.
4. Write data processing tools to process/aggregate the output (maybe using DB).
5. Decide new data to acquire.
Repeat.

Model-free sensing treats the sensory system as a database, and sensing as querying to fetch data from the physical world. One of the leading vendors [Crossbow] is bundling a query processor with their devices.
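As a rough illustration of the "sensing as querying" idea, the sketch below loads the three readings visible on the slide into an in-memory SQLite table and aggregates them with a query. The table name `raw_data`, the column types, and the choice of SQLite are assumptions made for this example; the slides do not specify a database engine or schema.

```python
import sqlite3

# In-memory database standing in for the sensor network's raw-data table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_data (time TEXT, id INTEGER, temp REAL)")

# The three readings that are legible on the slide (sensor ids 1, 2 and 7).
readings = [("10am", 1, 20.0), ("10am", 2, 21.0), ("10am", 7, 29.0)]
conn.executemany("INSERT INTO raw_data VALUES (?, ?, ?)", readings)

# "Sensing as querying": the analysis step is just a query over stored readings.
(avg_temp,) = conn.execute(
    "SELECT AVG(temp) FROM raw_data WHERE time = ?", ("10am",)
).fetchone()
hot_sensors = [
    row[0]
    for row in conn.execute("SELECT id FROM raw_data WHERE temp > ? ORDER BY id", (25.0,))
]

print(f"average temperature at 10am: {avg_temp:.2f}")
print(f"sensors above 25 degrees: {hot_sensors}")
conn.close()
```

Note that every reading has to be stored before any question can be asked of it, which is the cost the later slides attribute to model-free sensing.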

Wikisensing: A Model Free Sensor Informatics System Based on Big Data Architecture

Model Free Sensing is Super Inefficient

• Data misrepresentation without a model
• Latent information missing without a model
• High demand for computation/storage without a model
• Requires too much interoperability between sensors and analytics

Bayesian: Data Is Not the Enemy of Models, Rather a Great Supporter!

Bayesian probability is a formalism that allows us to reason about beliefs in models under conditions of uncertainty, based on the observations (data).

If we have observed that a particular event has happened, such as Britain coming 10th in the medal table at the 2004 Olympics, then there is no uncertainty about it. However, suppose a is the statement "Britain sweeps the boards at the 2012 London Olympics, winning more than 30 gold medals!", made before 28 July.

Since this is a statement about a future event, nobody can state with any certainty whether or not it is true. Different people may have different beliefs in the statement depending on their specific knowledge of factors that might affect its likelihood.

Beliefs in the model were changing daily based on the performance data available each day. By 10 August, most people's belief in this model should have been almost 80%.

Thus, in general, a person's subjective belief in a statement a will depend on some body of knowledge K. We write this as P(a|K). Henry's belief in a is different from Marcel's because they are using different K's. However, even if they were using the same K they might still have different beliefs in a.

The expression P(a|K) thus represents a belief measure. Sometimes, for simplicity, when K remains constant we just write P(a), but you must be aware that this is a simplification.

Model and Data Interaction: Bayesian Inference

• Bayes rule: interaction between data and model

  P(θ | Y) = p(Y | θ) p(θ) / p(Y)

• Learning as a sequence of interactions
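"Learning as a sequence of interactions" can be sketched as repeated applications of Bayes' rule, with each posterior becoming the prior for the next observation. The code below is a minimal illustration: the two hypotheses, the starting belief of 0.2, and the daily likelihood values are invented numbers, not data from the talk.

```python
def bayes_update(prior, likelihood):
    """Return the posterior P(M | D) for each hypothesis M.

    prior[m] is P(M = m) and likelihood[m] is P(D | M = m).
    """
    evidence = sum(prior[m] * likelihood[m] for m in prior)  # P(D)
    return {m: prior[m] * likelihood[m] / evidence for m in prior}


# Hypothetical belief in a statement "a" before any data arrive.
belief = {"a": 0.2, "not a": 0.8}

# Hypothetical likelihoods of each day's observations under each hypothesis.
daily_likelihoods = [
    {"a": 0.6, "not a": 0.4},
    {"a": 0.7, "not a": 0.3},
    {"a": 0.8, "not a": 0.2},
]

history = [belief["a"]]
for likelihood in daily_likelihoods:
    belief = bayes_update(belief, likelihood)  # yesterday's posterior is today's prior
    history.append(belief["a"])

print([round(p, 3) for p in history])
```

Because every day's data in this toy example favour "a", the belief in "a" rises with each update; data that favoured "not a" would push it back down in the same way.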

Big Data Meets Smart Models: A Bayesian Approach towards Sensor Informatics

• We need models: a model is the representation of our knowledge so far.
• Data: the observations which may revise our belief in the models we have.
• Analysis: assessing our belief and updating our models to make them more believable.
• Sensing: acquiring the data needed to update (enrich) models.

Models are learned from data (observations) by scientists (theoretical abstraction) or by machines (machine learning):
• Models are hypotheses (when making new observations).
• Models are knowledge (when belief is established).

Sensor informatics:
• Sensing management: managing the "neediness" (when and where to sense).
• Sensing analytics: managing model updating (how to enrich models with observations).
• Reasoning: decision making based on the integration of trusted models.

P(M | D) = P(D | M) P(M) / P(D)

Surprising Event: When an Observation Does Not Fit a Known Model

• When the posterior and the prior (P(M|D) ~ P(M)) differ greatly -> surprise!
• How great is "great"? Surprise threshold α.
• Measure: Kullback-Leibler divergence.
• Other methods: significance level, Chebyshev's theorem, ...

From the model, we get C(A, B) (e.g. a multivariate Gaussian distribution):
• A: 100 mm, B: 50 mm -> model consistent
• A: 100 mm, B: 500 mm -> surprise!
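One way to make the surprise test concrete is to compute the Kullback-Leibler divergence from the prior to the posterior and compare it with the threshold α. The sketch below assumes a discrete set of three candidate models, the direction KL(posterior || prior), a threshold of 0.5 nats, and invented likelihood values; none of these numbers come from the slides.

```python
import math


def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of models."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def surprise(prior, likelihood):
    """Surprise of an observation: how far it moves the posterior from the prior."""
    return kl_divergence(posterior(prior, likelihood), prior)


ALPHA = 0.5                         # assumed surprise threshold
prior = [0.7, 0.2, 0.1]             # belief over three candidate models
expected_obs = [0.9, 0.5, 0.1]      # P(D | M) for an observation the favoured model predicts
unexpected_obs = [0.01, 0.1, 0.9]   # P(D | M) for an observation it does not predict

for name, likelihood in (("expected", expected_obs), ("unexpected", unexpected_obs)):
    s = surprise(prior, likelihood)
    verdict = "surprise!" if s > ALPHA else "model consistent"
    print(f"{name} observation: KL = {s:.3f} nats -> {verdict}")
```

An observation that the favoured model already predicts barely moves the belief, so the divergence stays small; an observation that only a low-prior model explains shifts the belief sharply and crosses the threshold.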

Camera example: Image -> Analog Signal -> Digital Data -> Compressed Data -> Information

• Why sense so much data and then throw it away?
• Why not sense information directly?

Using Compressive Sensing Technology to Optimize Observations

• Compressive sensing: take advantage of sparseness to solve for under-determined signals with just a small number of measurements.
• Unobserved behavior (behavior not captured by the current model) is typically sparse.
• Reconstruction methods: L1-min, Bayesian CS.
• Sensing data is enough when we can recover the needed information through compressive sensing.
• Ψ: CS matrix built from the model
• Φ: placement matrix
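For reference, the standard L1-minimisation (basis pursuit) formulation of compressive sensing is given below. Reading Ψ as the sparsifying matrix built from the model and Φ as the measurement (sensor placement) matrix is an assumption based on the slide labels; the symbols x, s, y, n, m, and k are introduced here for illustration.

```latex
\begin{aligned}
x &= \Psi s, && \text{$s$ has only $k \ll n$ nonzero entries (sparsity)} \\
y &= \Phi x = \Phi \Psi s, && y \in \mathbb{R}^{m},\ m \ll n \ \text{(few measurements)} \\
\hat{s} &= \arg\min_{s} \lVert s \rVert_{1} \quad \text{subject to} \quad y = \Phi \Psi s \\
\hat{x} &= \Psi \hat{s}
\end{aligned}
```

The system y = ΦΨs has fewer equations than unknowns, so it is under-determined; the L1 objective selects a sparse solution among the many that fit the measurements.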

How to Update Model: Parameter Estimation

[Figure: nodal solution contour plot of TEMP (AVG); STEP=360, SUB=1, TIME=1800, RSYS=0; dated DEC 25 2011 21:15:23; SMN = 131.03, SMX = 646.41; contour levels 131.03, 188.294, 245.559, 302.823, 360.088, 417.352, 474.617, 531.881, 589.146, 646.41]

Estimating parameter θ to maximize the likelihood of the data given the model.
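The formula itself did not survive on the slide. In standard notation, maximum likelihood estimation of θ can be written as follows; the data set D = {d_1, ..., d_N} and the second equality, which assumes independent observations, are additions made for illustration.

```latex
\hat{\theta}
  = \arg\max_{\theta} \; p(D \mid M, \theta)
  = \arg\max_{\theta} \sum_{i=1}^{N} \log p(d_i \mid M, \theta)
```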

Model: An Example in Digital City

Modelling city life via causality: C(eA, eB) is used to predict the current value at location A when the value at another location B is given.

• Location: physical / logical locations with causality (through the sensory cortex) (city areas A, B).
• Relationship: topology (geo topology between A and B: diffusion structure).
• Event: events, which are the dynamics of the observable signal S = f(E) (e.g. heavy rainfall).

Digital City Model: Looking into the Details

• Ontologies are adopted to represent locations L, relationships R, events E, and signals S.
• Diffusion: an event e1 ∈ E in n1 causes another event e2 ∈ E in n2 when two nodes n1, n2 in G are linked.
• System T = (L, R, E)
• Model M(T) = (G, ?, B)
• Training for causality ?: use a Bayesian network to represent the conditional independencies between cause and target variables:
  1. Gaussian Mixture Models (GMMs), estimated via expectation maximization (EM)
  2. Gaussian Process with Bayesian inference

Model Driven Sensing: No Surprise!

• When the surprise > surprise threshold: diversity detected.
• Identify the incorrect causality C(el, ep), which is sparse: compressive sensing approach.
• New observation -> measurement that could revise the model in model space to maximize the likelihood of the observations, focusing on diversity.
• Diagram labels: Placement, Model Updating.
• The dynamics of model update: Surprise -> Sensing -> Model Updating.
• The goal of sensing: capturing surprise.
• The goal of analysis: revising the model.
• A model cannot overfit / underfit: when there is diversity, it can be updated -> consistent with the universe (target).

Model Update

It's Bayesian: P(M, ? | D) = P(D | M, ?) P(M, ?) / P(D)
T: target, M: model, ?: top-down parameter

• When ? is fixed: P(M | D) = P(D | M) P(M) / P(D)
  -> The difference between posterior and prior is "surprise".
  -> Bottom-up attention -> model update (data assimilation): combining observations of the current state of a system with the results from a model (the forecast) to produce an analysis. The model is then advanced in time and its result becomes the forecast in the next analysis cycle.
• When ? is updated: P(M, ?) = P(M | ?) P(?)
  -> Top-down attention (alertness) -> model update.
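The forecast/analysis cycle of data assimilation can be sketched with a scalar, Kalman-style update. This is an illustrative stand-in rather than the method used in the talk: the Gaussian assumption, the persistence forecast in `advance`, and all numbers are hypothetical.

```python
def assimilate(forecast, forecast_var, observation, obs_var):
    """Combine a model forecast with an observation to produce an analysis.

    Both are treated as Gaussian estimates of the same scalar state;
    the gain weights them by their relative uncertainty.
    """
    gain = forecast_var / (forecast_var + obs_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


def advance(state, state_var, process_var=1.0):
    """Hypothetical model step: the state persists while uncertainty grows."""
    return state, state_var + process_var


state, var = 20.0, 4.0               # initial forecast and its variance
observations = [24.0, 23.0, 25.0]    # one sensor reading per analysis cycle

for obs in observations:
    state, var = assimilate(state, var, obs, obs_var=4.0)  # analysis
    print(f"analysis: {state:.2f} (variance {var:.2f})")
    state, var = advance(state, var)                       # forecast for the next cycle
```

Each cycle pulls the estimate towards the observation in proportion to how uncertain the forecast is, and the analysis then seeds the next forecast.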

Adaptive Observation: Sensing and Numerical Modelling

CityGML Ontology -> GIS -> Geometry mesh

Building an Initial Model and Making Predictions by Simulation

Setting up boundary conditions, numerical schemes, model parameters, etc.

Simulation: 24 Building Case (Fine Mesh, 600000 Nodes): 20 Processors

Simulation: Moving Vehicles and Scalar Dispersions in Street Canyons

Using Sensors to Verify the Prediction Results of the Model

Sensing: acquiring data to get the posterior of the model, in order to validate the model (if consistent) or update it.

P(M | D) = P(D | M) P(M) / P(D)

Diagram labels: Data, sensing, Model, validate, update.

New WikiSensing: Elastic Sensing Environment for Large Scale Sensor Informatics

• Elastic sensing theory based on Bayesian inference
• Big Data architecture for large scale sensory data management
• Ontology for background knowledge management
• Model driven adaptive observation support
• Digital City and digital life applications

The Architecture of the New WikiSensing System

Ontology Used to Organise the Complex Knowledge Management

• Using ontology to represent the targets, signals, sensing methods, measurements, etc.
• Ontology to support flexible resolution
• Upper ontology for unified operation
• OntoSensor

Conclusion

• Big data offers a great opportunity for building smart models.
• Big data provides a new methodology for model research.
• New informatics comes from the closely coupled integration of the data and the model worlds.
• Bayesian theory provides a natural foundation for such an integration.
• Sensor Informatics is a good example of such a paradigm.
• A new uniform framework of sensor informatics can be developed based on Bayesian theory, where the dynamics of data and model capture the essence of building a sensory system.
• We are developing the WikiSensing system to realise this paradigm.

Thank you

Understanding Big Data
Haixun Wang

Data Explosion

• MB = 10^6 bytes: a typical book in text format
• GB = 10^9 bytes: a one hour video is about 1 GB; data produced by a biology experiment in one day
• TB = 10^12 bytes: astronomy data in one night; the US Library of Congress has 1000 TB of data; the search log of Bing is 20 TB per day (2009)

The Arecibo Telescope

• World's largest radio telescope
• Diameter: 305 m (1,000 ft)
• Area: 18 acres
• Location: Arecibo, Puerto Rico
• The P-ALFA surveys: 800 terabytes in 5 years

Software Driven Telescope

• From few, large, expensive, directional dishes to many, small, cheap, omnidirectional antennae
• A large number of high-speed input streams (2 Gbps per antenna, 25,000 antennae in an area of 340 km in diameter)

Challenge 1: It's the data, stupid!

• Data size
• Data complexity: key/value store, column store, document store, graph systems

Big data drives tomorrow's economy.

• The value of big data lies in its degree of connectedness.
• Existing systems cannot handle the rich connectedness of big data.

RDBMS and Rich Relationships

• Performance of multi-way joins is very poor in RDBMS.
• Managing data of rich connectedness requires multi-way joins in RDBMS.

Trinity

• A general purpose, distributed, in-memory graph system
• Online graph query processing
• Offline graph analytics

Trinity Performance Highlight

• Online query processing:
  - visiting 2.2 million users (3-hop neighborhood) on Facebook: <= 100 ms
  - foundation for graph-based services, e.g., entity search
• Offline graph analytics:
  - one iteration on a 1 billion node graph: <= 60 sec
  - foundation for analytics, e.g., social analytics

People Search Demo

Multi-way Join vs. Graph Traversal

[Figure: Company, Incident, and Problem records linked through ID fields (ID1, ID2, ID3, ID4, ...), shown for an RDBMS and for Trinity.]
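The contrast with multi-way joins can be illustrated by a k-hop neighborhood query written as a breadth-first traversal over adjacency lists. The sketch below is a single-machine toy, not Trinity's API; the graph contents, the node naming scheme, and the function name `k_hop_neighborhood` are made up for the example.

```python
from collections import deque


def k_hop_neighborhood(graph, start, k):
    """Return all nodes reachable from `start` in at most k hops (excluding start)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical records linked by direct references instead of join keys.
graph = {
    "company:1": ["incident:1", "incident:2"],
    "incident:1": ["problem:3"],
    "incident:2": ["problem:4"],
    "problem:3": [],
    "problem:4": [],
}

print(sorted(k_hop_neighborhood(graph, "company:1", 1)))
print(sorted(k_hop_neighborhood(graph, "company:1", 2)))
```

Each hop follows stored references from the current frontier, so the work grows with the number of edges actually visited rather than with the size of the tables that a relational join would have to match.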

Challenge 2: Interpretation of Big Data

• IBM Watson:
  - Runs on 2,880 cores, 15 terabytes of RAM, and 80 kW of power
• A human brain:
  - Runs on a tuna fish sandwich and a glass of water

The Eternal Quest

[Figure: axes "understanding the question" and "answering the question"; labels: unconstrained natural language, inferencing & reasoning, domain specific language, simple calculation, SQL, calculator; systems shown: Human (Turing Test), SIRI, Watson, Wolfram Alpha, Google/Bing.]

Turning the Web into a Database

What you see when you look at my homepage ...

Haixun Wang
Microsoft Research Asia
Email: haixunw @ microsoft . com
Tel: +86-10-58963289
Tel: +1-914-902-0749

I joined Microsoft Research Asia in 2009. I was with IBM T. J. Watson Research Center from 2000 to 2009. I received the B.S. and M.S. degrees in Computer Science from Shanghai Jiao Tong University in 1994 and 1996, and the Ph.D. degree in Computer Science from the University of California, Los Angeles in June 2000.

What a machine sees when it looks at my homepage ...

• A JPEG image (a jpeg file)
• Text in big, bold font
• 4 lines of text
• Another dozen lines of text with two embedded URLs

Semantic Web?

• "Number 1 trend in 2008" (Richard MacManus)
• "The infrastructure to power the Semantic Web is already here." (Tim Berners-Lee)
• "Unstructured information will give way to structured information, paving the road to intelligent computing." (Alex Iskold)

More data beats better algorithms (Banko and Brill, 2001)

[Figure: mean translation quality (1 = incomprehensible, 4 = perfect) for English-Spanish translation of Microsoft technical texts, 2001 to 2007, y-axis ticks from 2 to 3.5. Series: Systran, MSRMT, Logos. Annotations: "Improve algorithms, scale system, and add data!"; "Rule-based system with expensive customizations for Microsoft"; "Off-the-shelf rule-based system".]

From Rick Rashid's talk: It's a data driven world, get over it!

Probase

• isA (concept, entities)
• isPropertyOf (attributes)
• Co-occurrence (isCEOof, LocatedIn, etc.)
• Concepts ("Spanish Artists")
• Entities ("Pablo Picasso")

Explicit vs. Latent Knowledge

• Abstract representations (such as clusters from latent analysis) that lack linguistic counterparts are hard to learn or validate and tend to lose information.
• Human language has evolved over millennia to have words for the important concepts; let's use them.

Halevy, Norvig, Pereira, "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, 2009.

What is interpretation?

Add Common Sense to Computing

• Pablo Picasso, 25 Oct 1881, Spanish
• Which is "kiki" and which is "bouba"? (sound, shape, zigzaggedness)
• China, India, Brazil: country, emerging market
• body, taste, smell: wine
• "The engineer is eating an apple": IT company, fruit

Multiple Concepts

• "Obama's real-estate policy": president, politician; investment, property, asset; plan, document
• apple, adobe. Concept labels shown: software company, brand, fruit, juice; brand, software company, material; software company, software manufacturer, brand; juice, material; brand, company, fruit
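The disambiguation idea behind these slides, that terms seen together select the concepts they share, can be sketched with plain set intersection. This is a deliberate simplification: a knowledge base of this kind would presumably weight concepts rather than treat them as unweighted sets, and the concept sets below are illustrative guesses drawn from the labels on the slide, not an actual isA lookup.

```python
# Illustrative isA concept sets for two ambiguous terms (assumed, not real lookups).
concepts = {
    "apple": {"software company", "brand", "fruit", "company"},
    "adobe": {"software company", "brand", "material"},
}


def shared_concepts(terms, concept_map):
    """Concepts that every term belongs to; co-occurring terms narrow the reading."""
    concept_sets = [concept_map[term] for term in terms]
    return set.intersection(*concept_sets)


print(sorted(shared_concepts(["apple"], concepts)))           # all readings of "apple"
print(sorted(shared_concepts(["apple", "adobe"], concepts)))  # readings both terms share
```

On its own, "apple" keeps both its fruit and its company readings; next to "adobe", only the concepts common to both terms remain.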

Multiple Concepts ("Obama's real-estate policy", continued): president, politician; investment, property, example; plan, document; thing, issue, term, asset

Example (from B. Dolan): Who assassinated Abraham Lincoln?

The Far Reaching Implications: Scientific Method

What really counts is understanding, or a mastery of some common vocabulary.

How can big data help?

• A much more rapid cycle of hypothesis generation and testing
• General access to knowledge in science
• Autonomous experimentation, with an 'active learning' model

Technological Singularity

If machines could even slightly surpass human intellect, they could improve their own designs in ways unforeseen by their designers, and thus recursively augment themselves into far greater intelligences.

Thanks

Big Data Platform and Internet Application Services

Agenda

• Current problems and challenges
• Solutions from companies in China and abroad
• Tencent's approach to big data

Part 1: Current Problems and Challenges

Big data challenge (1): massive data storage technology
1. As data grows from the PB scale towards the ZB scale, how can storage and computing costs be reduced? (Data volume: 46 PB; number of machines: 5,600)
2. The rapid growth of industrial-grade services poses new challenges for the timeliness and reliability of big data computing.

Big data challenge (2): data is hard to put to use

Big data challenge (3): precise recommendation is hard
1. Information overload for enterprises (across the whole Internet)
2. Low recommendation accuracy
3. Evaluating recommendation effectiveness
4. How to effectively collect data on users' active behavior

Part 2: Solutions from Companies in China and Abroad

Case study from China: the Alipay big data platform (Alipay Hadoop-related application services)

• Open-source Hadoop products: HBase, Mahout, Hive/Pig
• Technology codenames: Dolphin, Fur Seal, Octopus, Starfish, Swordfish, Blue Whale, ...
• Massive computation: a Hadoop-based massive storage and computing cluster that also provides one-stop management of computing and storage resources
• Distributed data mining: distributed data mining based on Mahout
• Data distribution center: batch data extraction and loading, plus near-real-time message and log distribution (using client-side pull)
• Real-time search over massive data: integrates HBase and Solr to provide real-time queries and full-text search over hundreds of billions of records
• Stream computing framework: an M/R-like streaming framework that enables applications to be implemented quickly and provides online data processing services
• Massive data query: based on Hive and Pig, provides visual querying of massive data through web pages

• Online news: Google News reports that recommendations increase articles viewed by 38% (Das et al. 2007).
• Movies: Netflix reports that over 60% of their rentals originate from recommendations (Thompson 2008).
• Amazon, which sells music, books, and movies: 35% of sales are reported to originate from recommendations (Lamere & Green 2008).
• Video, YouTub
