File Systems
Department of Computer Science and Technology, 2014.11.04
Operating Systems Special Topics Training 2014

Outline
- Background
  - The Rise of Big Data
- File System Basics
  - Fundamentals
  - Key Issues
- File System Optimization in the Real World
  - Example: GFS/HDFS
  - Optimization Techniques

Data growth (2010-2020)
- In 2010 the global digital universe reached the zettabyte scale for the first time: 1.227 ZB.
- In 2005 the figure was only 130 EB.
- By 2020 the digital universe is projected to reach 40 ZB.
- 40 ZB is about 57 times the number of grains of sand on all the beaches on Earth, or 5,247 GB of data for every person in the world.

Qmee: Online in 60 Seconds (figure)

Data type distribution (figure)
- Compared with traditional structured data, unstructured data and content data are growing rapidly and hold great value.

New development: Data-Intensive Computing as the 4th Paradigm
- A thousand years ago: Experimental Science. Description of natural phenomena.
- Last few hundred years: Theoretical Science. Newton's Laws, Maxwell's Equations, ...
- Last few decades: Computational Science. Simulation of complex phenomena.
- Today: Data-Intensive Science. Unify theory, experiment, and simulation.

Some other views (figure)

Hype Cycle for Big Data (figure)

Big Data Opportunity Heat Map (figure)

File System Fundamentals
- File system: a layer of the OS that provides a friendly way for users to use block devices
- Components
  - Disk management: collecting disk blocks into files
  - Naming: interface to find files by name, not by blocks
  - Protection: layers to keep data secure
  - Reliability/durability: keeping files durable despite crashes, media failures, attacks, etc.

File abstraction        | Disk abstraction
Byte-oriented           | Block-oriented
Names                   | Block #s
Access protection       | No protection
Consistency guarantees  | No guarantees beyond block write

File & Directory
- File: user-visible group of blocks arranged sequentially in logical space
- Directory: user-visible index mapping names to files, or a relation used for naming
  - Just a table of (file name, unique ID) pairs
  - The ID can be used to look up other file information
  - Often stored in files

What Gets Stored
- User data itself is the bulk of the file system's contents
- Also includes metadata on a drive-wide and per-file basis:
  - Drive-wide: available space, formatting info, character set, ...
  - Per-file: name, owner, modification date, physical layout, ...

High-Level Organization
- Files are organized in a "tree" structure made of nested directories
- One directory acts as the "root"
- "Links" (symlinks, shortcuts, etc.) provide a simple means of providing multiple access paths to one file
- Other file systems can be "mounted" and dropped in as sub-hierarchies (other drives, network shares)

Low-Level Organization (1/2)
- File data and metadata stored separately
- File descriptors + metadata stored in inodes
  - Large tree or table at a designated location on disk
  - Tells how to look up file contents
- Metadata may be replicated to increase system reliability
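The directory and inode slides above can be tied together with a small sketch. It assumes a toy in-memory inode table; the names `Inode`, `inode_table`, `ROOT_INO`, and `resolve` are hypothetical and chosen only for illustration. Each directory is a table of (file name, unique ID) pairs, and the ID leads to the inode that records the owner and the physical layout.

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    """Per-file metadata plus the block numbers that hold the contents."""
    ino: int
    is_dir: bool
    owner: str = "root"
    blocks: list = field(default_factory=list)   # physical layout of a regular file
    entries: dict = field(default_factory=dict)  # directories only: name -> inode number

# A real file system keeps this table at a designated location on disk;
# here a dict stands in for it.
inode_table = {
    0: Inode(0, True, entries={"home": 1}),
    1: Inode(1, True, entries={"notes.txt": 2}),
    2: Inode(2, False, owner="alice", blocks=[17, 4, 9]),
}
ROOT_INO = 0

def resolve(path):
    """Walk from the root, doing one (name -> ID) lookup per path component."""
    inode = inode_table[ROOT_INO]
    for name in [part for part in path.split("/") if part]:
        if not inode.is_dir:
            raise NotADirectoryError(name)
        if name not in inode.entries:
            raise FileNotFoundError(path)
        inode = inode_table[inode.entries[name]]
    return inode

found = resolve("/home/notes.txt")
print(found.owner, found.blocks)
```

Note that the block list `[17, 4, 9]` is deliberately not contiguous: the inode, not the directory, is what tells the file system where the contents live.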

Low-Level Organization (2/2)
- "Standard" read-write medium is a hard drive (other media: CD-ROM, tape, ...)
- Viewed as a sequential array of blocks
- Must address a ~1 KB chunk at a time
- Tree structure is "flattened" into blocks
- Overlapping reads/writes/deletes can cause fragmentation: files are often not stored in a linear layout
  - inodes store all block numbers related to a file

Fragmentation (figure: successive disk layouts)
  A B C (free space)
  A B C A (free space)
  A (free space) C A (free space)
  A D C A D (free)
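The four layouts in the fragmentation figure can be reproduced with a toy allocator. This is a minimal sketch under the assumption that a file simply receives the lowest-numbered free blocks; the helper names `allocate`, `delete`, and `show` are hypothetical. It shows why the inode has to record every block number: after the delete, file D ends up in two separate places.

```python
inodes = {}  # file name -> list of block numbers, in file order

def allocate(disk, name, nblocks):
    """Give `name` the lowest-numbered free blocks, contiguous or not."""
    free = [i for i, owner in enumerate(disk) if owner is None][:nblocks]
    if len(free) < nblocks:
        raise OSError("no space left on device")
    for i in free:
        disk[i] = name
    inodes.setdefault(name, []).extend(free)

def delete(disk, name):
    """Release every block listed in the file's inode."""
    for i in inodes.pop(name, []):
        disk[i] = None

def show(disk):
    return " ".join(owner or "." for owner in disk)

disk = [None] * 6
for name in "ABC":               # create A, B, C with one block each
    allocate(disk, name, 1)
states = [show(disk)]
allocate(disk, "A", 1)           # A grows; its new block lands after C
states.append(show(disk))
delete(disk, "B")                # B leaves a hole
states.append(show(disk))
allocate(disk, "D", 2)           # D fills the hole and spills past A
states.append(show(disk))

for state in states:
    print(state)
print(inodes)
```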

File System Requirements
- Naming
  - Should be flexible, e.g., allow multiple names for the same file
  - Support hierarchy for ease of use
- Persistence
  - Want to be sure data has been written to disk in case a crash occurs
- Sharing/Protection
  - Want to restrict who has access to files
  - Want to share files with other users

File System Requirements (cont'd)
- Speed & efficiency for different access patterns
  - Sequential access
  - Random access
  - Keyed access (not usually provided by the OS)
- Minimum space overhead
  - Disk space needed to store metadata is lost for user data
- Twist: all metadata that is required to do translation must be stored on disk
  - Translation scheme should minimize the number of additional accesses for a given access pattern
  - Harder than, say, page tables, where we assumed the page tables themselves are not subject to paging!

Key Issues
- Where to store file metadata?
  - On disk for local file systems
  - On dedicated server(s) for distributed/parallel file systems
- How to store file data?
  - As a whole on one disk
  - Split and stored on multiple disks
- How to guarantee reliability and efficiency?
  - Reliability: replication, RAID, dedicated supervisor, ...
  - Efficiency: replication, cache, hardware-specific space allocation, ...

How to set block size? (figure)
- Source: Tanenbaum, Modern Operating Systems
- Assumption: all files are 2 KB in size
- Question: why is the data rate corresponding to small block sizes low?
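One way to reason about the block-size question above is a back-of-the-envelope model. The disk parameters below (seek time, rotation time, track capacity) are illustrative assumptions, not values taken from the slide; only the 2 KB file size comes from it. Every block read pays a fixed seek plus rotational delay, so with small blocks almost all the time goes to positioning rather than transferring data. Large blocks transfer faster but waste space when files are only 2 KB.

```python
import math

# Assumed disk characteristics (illustrative only)
SEEK_MS = 5.0             # average seek time
ROTATION_MS = 8.33        # one full rotation
TRACK_BYTES = 1_000_000   # bytes per track

FILE_BYTES = 2 * 1024     # from the slide: all files are 2 KB

def data_rate_mb_s(block_bytes):
    """Throughput when every block costs a seek, half a rotation, and its transfer time."""
    transfer_ms = block_bytes / TRACK_BYTES * ROTATION_MS
    total_ms = SEEK_MS + ROTATION_MS / 2 + transfer_ms
    return (block_bytes / 1e6) / (total_ms / 1e3)

def space_utilization(block_bytes):
    """Fraction of allocated space that holds file data."""
    blocks_needed = math.ceil(FILE_BYTES / block_bytes)
    return FILE_BYTES / (blocks_needed * block_bytes)

for block in (512, 1024, 2048, 4096, 65536):
    print(f"{block:6d} B  rate={data_rate_mb_s(block):7.3f} MB/s  "
          f"utilization={space_utilization(block):6.1%}")
```

The two curves pull in opposite directions, which is why the choice of block size is a compromise rather than a single optimum.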

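Distributed File Systems
- Support access to files on remote servers
  - Uniform view of files
- Must support concurrency
  - Make varying guarantees about locking, who "wins" with concurrent writes, etc.
- Must gracefully handle dropped connections
- Can offer support for replication and local caching
- Different implementations sit in different places on the complexity/feature scale

Overview of distributed file systems
- Scalability: nodes must be able to join and leave in a hot-pluggable way.
- Concurrency: every cloud component must be designed to be safe in a concurrent environment.
- Reliability: every cloud component needs to know how the components it depends on can fail, and must be designed to handle each failure appropriately.
- Efficiency: the algorithms by which users share data in the cloud system should avoid performance bottlenecks; frequently used data needs replicas so that users can fetch it from a nearby copy in the shortest time; the interface to cloud services should be as simple as possible.
- Building blocks: naming service, metadata management, cache, replica, interface
- Examples: NFS, AFS, GFS/HDFS

Distributed file systems: naming service
- Establishes a relation between physical objects and logical objects
- Basic requirements
  - Location transparency: a single file namespace is used
  - Location independence: changing the physical location does not require changing the logical file name

Distributed file systems: metadata management
- Metadata: data about data
  - File name, file size, timestamps, control information, user, group, ...
- Two management modes
  - In-band mode: metadata is stored together with the data; inefficient, and operations on large data volumes easily become a bottleneck
  - Out-of-band mode: dedicated servers store the metadata

Distributed file systems: cache
- Purpose: performance, improving the efficiency of file operations
- What is cached
  - Metadata: improves concurrency
  - Data: reduces network traffic
- Where
  - Memory: fast, but costly
  - Disk: supports large files and offline use
- Cache consistency solutions
  - Client-initiated
  - Server-initiated

Distributed file systems: replica
- Purpose
  - Ensure reliability
  - Ensure availability
  - Achieve load balancing
- Requirement: replica locations are transparent to users
- Issue: consistency
  - Strong consistency
  - Weak consistency

Distributed file systems: interface
- Stateless service
  - The server records no state information; every request is self-contained
  - Request messages are large, processing takes longer, and lock operations are not supported
- Stateful service
  - The server records session information for requests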
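The difference between the stateless and stateful interfaces above can be sketched in a few lines. This is a toy model, not the protocol of any particular system; `FILES`, `stateless_read`, and `StatefulServer` are hypothetical names. In the stateless variant each request carries the file handle, offset, and length, so the server remembers nothing between calls; in the stateful variant the server keeps a per-session position.

```python
FILES = {"fh1": b"hello distributed world"}

def stateless_read(handle, offset, length):
    """Self-contained request: everything the server needs is in the arguments."""
    return FILES[handle][offset:offset + length]

class StatefulServer:
    """Keeps session state (the current file position) on the server side."""
    def __init__(self):
        self.sessions = {}   # session id -> [handle, position]
        self.next_id = 0

    def open(self, handle):
        self.next_id += 1
        self.sessions[self.next_id] = [handle, 0]
        return self.next_id

    def read(self, session, length):
        handle, position = self.sessions[session]
        data = FILES[handle][position:position + length]
        self.sessions[session][1] = position + len(data)
        return data

# Stateless: the client tracks the offset and repeats it in every request.
first = stateless_read("fh1", 0, 5)
second = stateless_read("fh1", 5, 5)

# Stateful: shorter requests, but a server restart loses `sessions`.
server = StatefulServer()
sid = server.open("fh1")
s_first = server.read(sid, 5)
s_second = server.read(sid, 5)
print(first, second, s_first, s_second)
```

The stateless form makes each message larger, as the slide notes, but any server replica can answer any request, which simplifies recovery after a crash.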

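Architecture choices: Scale Up (figure)

Architecture choices: Scale Out (figure)

Scale up vs. scale out

Factor                     | Scale-out (SAN/NAS)                 | Scale-up (DAS/SAN/NAS)
Hardware scaling           | Add hardware                        | Replace hardware
Hardware limits            | No hardware limit                   | Hardware-limited
Availability, reliability  | Higher                              | Lower
Management complexity      | More resources to manage            | Fewer resources to manage
Across geographic sites    | Yes                                 | No
NAS available              | Yes, NAS mechanisms are very common | Yes
SAN available              | Yes, by adding switches             | Yes
DAS available              | Limited                             | Yes
Disruption                 | Less                                | More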

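Distributed file system example: GFS/HDFS
- Product characteristics: built on low-cost PC servers + open-source Linux + Gigabit Ethernet + in-house software
- Highly scalable: a single cluster can reach tens of thousands of nodes and hundreds of PB of storage
- Combined with computation: moving computation to the nodes that hold the data improves compute performance; used mainly for data analysis
- Data reliability: multiple replicas protect the data, typically 3 replicas
- Files are split into fixed-size chunks
- One primary master, multiple shadow masters
- Multiple chunkservers
- Multiple clients
- HDFS: the open-source implementation of GFS

Google File System
- Why not use an existing file system?
  - Google's problems are different from anyone else's
- Assumptions
  - High component failure rates
    - Inexpensive commodity components fail all the time
  - "Modest" number of HUGE files
    - Just a few million
    - Each is 100 MB or larger; multi-GB files typical
  - Files are write-once, mostly appended to
    - Perhaps concurrently
  - Large streaming reads
  - High sustained throughput favored over low latency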

GFS Design Decisions
- Files stored in chunks
  - Fixed size (64 MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access, keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large data sets, streaming reads
- Familiar interface, but customized API
  - Simplify the problem; focus on apps
  - Add snapshot and record append operations
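The first three design decisions above (fixed 64 MB chunks, 3 replicas, a single master holding metadata) can be illustrated with a short sketch. It is not the GFS implementation: the `Master` class, its `locate` method, and the round-robin replica placement are assumptions made for the example. The point is that a client turns a byte offset into a chunk index locally and asks the master only for locations, never for data.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed chunk size from the slide
REPLICAS = 3                    # minimum replication from the slide

def chunks_for_range(offset, length):
    """Chunk indices touched by reading `length` (> 0) bytes starting at `offset`."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))

class Master:
    """Holds only metadata: (path, chunk index) -> chunkservers with a replica."""
    def __init__(self, chunkservers):
        if len(chunkservers) < REPLICAS:
            raise ValueError("need at least REPLICAS chunkservers")
        self.chunkservers = chunkservers
        self.locations = {}

    def locate(self, path, index):
        key = (path, index)
        if key not in self.locations:
            # Assumed policy: round-robin placement across chunkservers.
            start = len(self.locations) % len(self.chunkservers)
            self.locations[key] = [
                self.chunkservers[(start + i) % len(self.chunkservers)]
                for i in range(REPLICAS)
            ]
        return self.locations[key]

master = Master(["cs0", "cs1", "cs2", "cs3", "cs4"])
# A 10-byte read that straddles the boundary between chunk 0 and chunk 1.
for index in chunks_for_range(CHUNK_SIZE - 5, 10):
    print(index, master.locate("/logs/web.log", index))
```

Because chunks are large, a streaming read of a multi-GB file needs only a handful of master lookups, which is part of why a single master can keep up.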

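Optimization of Metadata Service
- Splitting the functions of a single master into
  - Multiple metadata servers
  - Multiple supervisors that are in charge of system monitoring, fault recovery, replica management, garbage collection

Metadata server implementation
- Basic principles
  - Automatic failure recovery and failover of the metadata service after a node crash are mandatory, so that the metadata service stays available as far as possible
  - To support diverse workloads, metadata servers must be scalable
  - Minimize the number of interactions between metadata nodes and other nodes to lower the load on metadata nodes
- Files are organized into a traditional tree
- Read-write locks
- De-duplicated control lists

Data server implementation
- Files are split into chunks of 32 MB
- Each chunk corresponds to one regular file in the Linux file system
- A 128-bit chunk ID is generated with a UUID algorithm
- The MD5 value of each chunk file's data is recorded to check the integrity of stored data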
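The data server rules above map directly onto Python's standard library. The sketch below is an assumption-laden illustration: it keeps chunks in a dictionary instead of one Linux file per chunk, and it uses `uuid.uuid4()` although the slide does not say which UUID variant is used. The helper names `split_into_chunks` and `verify` are hypothetical.

```python
import hashlib
import uuid

CHUNK_SIZE = 32 * 1024 * 1024   # 32 MB, from the slide

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Cut `data` into fixed-size chunks, each with a 128-bit ID and an MD5 digest."""
    order = []    # chunk IDs in file order
    chunks = {}   # chunk ID -> {"data": bytes, "md5": hex digest}
    for start in range(0, len(data), chunk_size):
        payload = data[start:start + chunk_size]
        chunk_id = uuid.uuid4().hex            # 128 bits as 32 hex characters
        chunks[chunk_id] = {
            "data": payload,
            "md5": hashlib.md5(payload).hexdigest(),
        }
        order.append(chunk_id)
    return order, chunks

def verify(chunk):
    """Recompute the MD5 and compare it with the recorded value."""
    return hashlib.md5(chunk["data"]).hexdigest() == chunk["md5"]

# A tiny chunk size keeps the demonstration small.
order, chunks = split_into_chunks(b"0123456789", chunk_size=4)
print(len(order), all(verify(chunks[cid]) for cid in order))

# Simulated corruption is caught by the checksum.
chunks[order[0]]["data"] = b"XXXX"
print(verify(chunks[order[0]]))
```

MD5 is adequate here for detecting accidental corruption; it is not a defense against deliberate tampering.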

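Supervisor Implementation (figure)

Small-file optimization based on inlining and hotness statistics
- In distributed file systems that separate data from metadata, small-file performance is limited mainly by network latency; a small-file optimization based on inlining and hotness statistics is proposed to improve it
- Results
  - With inline data, small-file performance improves by about 2x
  - Data migration balances the performance gain of inline data against the overhead it puts on the metadata server
- File inlining
  - For small files, the data is kept inside the metadata
  - When a file is opened, the data is sent to the client together with the metadata, which eliminates the time to compute the data location and the communication with the object servers
- Hotness-based migration of inline data
  - Frequent writes to inline data may increase the load on the metadata server
  - The client automatically tracks the write hotness of inline data
  - When inline data is migrated:
    - The file size exceeds a threshold
    - The hotness exceeds a defined threshold
- Figure: time (seconds) against number of files (1000, 2000, 3000), without inline data vs. with inline data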
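The two migration triggers listed above (size over a threshold, write hotness over a threshold) can be expressed compactly. The sketch below is a simplification under stated assumptions: both threshold values are invented, writes replace the whole file, and the hotness counter lives in the metadata record rather than on the client as the slide describes. `FileMeta`, `write`, and `open_file` are hypothetical names.

```python
INLINE_MAX_BYTES = 4096        # assumed size threshold
HOT_WRITES_PER_WINDOW = 100    # assumed hotness threshold

class FileMeta:
    def __init__(self, name):
        self.name = name
        self.inline_data = b""      # small-file contents stored with the metadata
        self.migrated = False       # True once the data lives on the object servers
        self.writes_in_window = 0   # write hotness counter

def should_migrate(meta, new_size):
    too_big = new_size > INLINE_MAX_BYTES
    too_hot = meta.writes_in_window > HOT_WRITES_PER_WINDOW
    return too_big or too_hot

def write(meta, data, object_store):
    """Whole-file write; moves the file out of the metadata when a trigger fires."""
    meta.writes_in_window += 1
    if not meta.migrated and should_migrate(meta, len(data)):
        meta.migrated = True
        meta.inline_data = b""
    if meta.migrated:
        object_store[meta.name] = data
    else:
        meta.inline_data = data

def open_file(meta, object_store):
    """Inline files come back with the metadata; migrated ones need a second fetch."""
    if meta.migrated:
        return meta, object_store[meta.name]
    return meta, meta.inline_data

store = {}
small = FileMeta("thumb.jpg")
write(small, b"tiny", store)
print(small.migrated, open_file(small, store)[1])
write(small, b"x" * 5000, store)     # exceeds the size threshold
print(small.migrated, len(store["thumb.jpg"]))
```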

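Massive file storage based on the Set model
- Goal: serve workloads with hundreds of billions of files, providing TB-scale data storage and fast operational support
- Ideas
  1. A data Set model: the Set is the unit of deployment, capacity expansion, and management.
  2. File index separated from data: the file index and the on-disk data index together locate file data; the on-disk data index is held entirely in memory for efficient I/O.
  3. A capacity-balancing scheduling algorithm across Sets: new capacity is scheduled according to Set state and space utilization, so that capacity stays balanced.
- Results: solves the problem of storing hundreds of billions of files for a photo album service; the album holds 500+ billion photos, grows by 300+ million photos per day, and occupies 100+ PB.
- Diagram (labels only): access layer; master; file index servers (Idx-master) holding (file name, chid, fid); storage Sets whose storage servers map fid -> offset; flows: update file index, store file data, read file data by (chid, fid)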
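The two-level lookup in the Set model above (a file index that yields (chid, fid), then an in-memory index inside the Set that yields an offset) can be sketched as follows. The slide does not define `chid`; the sketch assumes it names a container file inside a Set. `StorageSet`, `store`, and `fetch` are hypothetical names, and a `bytearray` stands in for the on-disk container.

```python
class StorageSet:
    """One Set: container files plus an all-in-memory index of where each fid lives."""
    def __init__(self):
        self.containers = {}   # chid -> bytearray standing in for a file on disk
        self.disk_index = {}   # (chid, fid) -> (offset, length), kept in memory

    def put(self, chid, fid, data):
        container = self.containers.setdefault(chid, bytearray())
        self.disk_index[(chid, fid)] = (len(container), len(data))
        container.extend(data)          # append-only write

    def get(self, chid, fid):
        offset, length = self.disk_index[(chid, fid)]   # no disk access needed
        return bytes(self.containers[chid][offset:offset + length])

file_index = {}        # file name -> (chid, fid), held by the index servers
storage = StorageSet()
next_fid = 0

def store(name, data, chid=0):
    global next_fid
    next_fid += 1
    storage.put(chid, next_fid, data)
    file_index[name] = (chid, next_fid)

def fetch(name):
    chid, fid = file_index[name]        # step 1: file index lookup
    return storage.get(chid, fid)       # step 2: in-memory offset lookup, one read

store("cat.jpg", b"meow-bytes")
store("dog.jpg", b"woof")
print(fetch("dog.jpg"), storage.disk_index[file_index["dog.jpg"]])
```

Keeping the offset index in memory means a read costs at most one disk access for the data itself, which is the efficiency claim made in idea 2.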

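High-performance distributed storage on solid-state drives
- Goal: for hundreds of services and trillions of small records with no hot spots, provide high-concurrency, low-latency storage at low cost
- Ideas
  1. A single-machine resource multiplexing model: the space and I/O resources of one machine are divided into small fixed-size units, with I/O scheduled fairly across units.
  2. A hybrid index gives a local data index with low memory overhead and efficient I/O: small records use a hash index to reduce the number of index entries; large records are indexed individually to improve I/O efficiency.
  3. Write optimization at the SSD application layer: a write cache gives low-latency responses; dynamic indexing and write merging turn highly concurrent small random writes into infrequent large block writes.
- Results: provides base data services for SNS with high data reliability and dense reads and writes: 10+ TB volume, small records (under 100 bytes), 400,000+ requests per second, long-tail access with no hot spots
- Diagram (labels only): access layer; SSD storage servers, each with 2 GB storage units, a shared-memory write cache, a hybrid index, and unit-fair read/write I/O scheduling
  - Read path: 1. read pending updates; 2. read the index; 3. read SSD data according to the index
  - Write path: 1. read pending updates; 2. pack them into fixed-length large data blocks and write; 3. update the index
  - Other labels: Get/Set/Del, metadata management, incremental synchronization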

IceFS: Separating Physical Structure
- Source: Physical Disentanglement in a Container-Based File System, OSDI 2014
- New abstraction: cube
  - Enables the grouping of files and directories inside a physically isolated container
- Benefits
  - Localized reaction to faults
  - Fast recovery
  - Concurrent file-system updates

Using New Media
- DRAM management
  - LRU block replacement
- Flash management
  - Segment = a set of blocks / erasing unit
  - Segment list (Free/Clean/Dirty)
  - Segment replacement (FIFO or LRU)
- Disk management
  - Power management by spin up/down
- Source: FLASHCACHE [HCSS'94]
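The LRU block replacement named under DRAM management above can be sketched with `collections.OrderedDict`. This is a generic LRU cache, not the FLASHCACHE code; the class name `LRUBlockCache` and the `read_block` callback (which stands in for the slower flash or disk tier) are assumptions made for the example.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Fixed-capacity block cache that evicts the least recently used block."""
    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block   # called on a miss to fetch from the slower tier
        self.blocks = OrderedDict()    # block number -> data, oldest first
        self.misses = 0

    def get(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)   # mark as most recently used
            return self.blocks[block_no]
        self.misses += 1
        data = self.read_block(block_no)
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # drop the least recently used
        return data

cache = LRUBlockCache(2, lambda n: f"block-{n}")
for block_no in (1, 2, 1, 3):    # the second access to 1 is a hit; 3 evicts 2
    cache.get(block_no)
print(list(cache.blocks), cache.misses)
```

The same structure would serve for the "Segment replacement (FIFO or LRU)" bullet: FIFO is obtained by simply omitting the `move_to_end` call.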

Using New Media
- To reduce the power consumption of disk
- NVCache
  - To reduce disk power consumption by combining adaptive disk spin-down algorithm
  - To extend spin-down periods by undertaking i…
