File Systems
Department of Computer Science and Technology
2014.11.04
Operating Systems Special Topic Training 2014

Outline
- Background
  - The Rising of Big Data
- File System Basis
  - Fundamentals
  - Key Issues
- File Systems Optimization in the Real World
  - Example: GFS/HDFS
  - Optimization Techniques

Data Growth (2010-2020)
- In 2010 the global digital universe reached the zettabyte scale for the first time: 1.227 ZB
- In 2005 the figure was only 130 EB
- By 2020 the digital universe is expected to reach 40 ZB
  - 40 ZB is equivalent to 57 times the number of grains of sand on all the beaches on Earth
  - That is 5,247 GB of data for every person in the world

Qmee: Online in 60 Seconds

Data type distribution
- Compared with traditional structured data, unstructured data and content data are growing rapidly and hold enormous value

New development: Data-Intensive Computing as the 4th Paradigm
- Thousand years ago: Experimental Science
  - Description of natural phenomena
- Last few hundred years: Theoretical Science
  - Newton's Laws, Maxwell's Equations...
- Last few decades: Computational Science
  - Simulation of complex phenomena
- Today: Data-Intensive Science
  - Unify theory, experiment, & simulation

Other viewpoints

Hype Cycle for Big Data

Big Data Opportunity Heat Map

File System Fundamentals
- File system: a layer of the OS that provides a friendly way for users to use block devices
- Components
  - Disk Management: collecting disk blocks into files
  - Naming: interface to find files by name, not by blocks
  - Protection: layers to keep data secure
  - Reliability/Durability: keeping files durable despite crashes, media failures, attacks, etc.

  File abstraction          Disk abstraction
  Byte-oriented             Block-oriented
  Names                     Block #s
  Access protection         No protection
  Consistency guarantees    No guarantees beyond block write

File & Directory
- File: user-visible group of blocks arranged sequentially in logical space
- Directory: user-visible index that maps names to files, or a relation used for naming
  - Just a table of (file name, unique ID) pairs
  - The ID can be used to look up other file information
  - Often stored in files
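
The directory-as-table idea above can be sketched in a few lines. This is a minimal illustration, not any particular file system's layout: the `Inode` fields, the ID value 7, and the file names are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Per-file information reached through the unique ID (illustrative fields)."""
    owner: str
    size: int
    blocks: list = field(default_factory=list)  # physical block numbers


class Directory:
    """A directory as a plain table of (file name, unique ID) pairs."""

    def __init__(self) -> None:
        self.entries = {}

    def add(self, name: str, inode_id: int) -> None:
        self.entries[name] = inode_id

    def lookup(self, name: str) -> int:
        return self.entries[name]


# The unique ID is the key into a separate table holding the other file information.
inode_table = {7: Inode(owner="alice", size=2048, blocks=[12, 13])}

root = Directory()
root.add("notes.txt", 7)
root.add("notes-link.txt", 7)  # a second name that resolves to the same file

inode = inode_table[root.lookup("notes.txt")]
print(inode.owner, inode.size, inode.blocks)
```

Because names map to IDs rather than to blocks, two names can point at the same file without duplicating any file information.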

What Gets Stored
- User data itself is the bulk of the file system's contents
- Also includes meta-data on a drive-wide and per-file basis:
  - Drive-wide: available space, formatting info, character set, ...
  - Per-file: name, owner, modification date, physical layout, ...

High-Level Organization
- Files are organized in a "tree" structure made of nested directories
- One directory acts as the "root"
- "Links" (symlinks, shortcuts, etc.) provide a simple means of providing multiple access paths to one file
- Other file systems can be "mounted" and dropped in as sub-hierarchies (other drives, network shares)

Low-Level Organization (1/2)
- File data and meta-data stored separately
- File descriptors + meta-data stored in inodes
  - Large tree or table at a designated location on disk
  - Tells how to look up file contents
- Meta-data may be replicated to increase system reliability

Low-Level Organization (2/2)
- "Standard" read-write medium is a hard drive (other media: CDROM, tape, ...)
- Viewed as a sequential array of blocks
- Must address ~1 KB chunk at a time
- Tree structure is "flattened" into blocks
- Overlapping reads/writes/deletes can cause fragmentation: files are often not stored in a linear layout
  - inodes store all block numbers related to a file

Fragmentation
  A | B | C | (free space)
  A | B | C | A | (free space)
  A | (free space) | C | A | (free space)
  A | D | C | A | D | (free)
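
The four layouts above can be reproduced with a toy allocator. This is a sketch under assumptions: a 12-block disk, files of 2 to 3 blocks, and a policy that always takes the lowest-numbered free blocks. None of these details come from the slides.

```python
def allocate(disk: list, name: str, nblocks: int) -> None:
    """Give `name` the lowest-numbered free blocks, contiguous or not."""
    free = [i for i, owner in enumerate(disk) if owner == "."]
    if len(free) < nblocks:
        raise RuntimeError("disk full")
    for i in free[:nblocks]:
        disk[i] = name


def delete(disk: list, name: str) -> None:
    """Free every block owned by `name`."""
    for i, owner in enumerate(disk):
        if owner == name:
            disk[i] = "."


def extents(disk: list, name: str) -> int:
    """Number of contiguous runs the file occupies (1 means unfragmented)."""
    runs, prev = 0, None
    for owner in disk:
        if owner == name and prev != name:
            runs += 1
        prev = owner
    return runs


disk = ["."] * 12  # "." marks a free block
history = []

allocate(disk, "A", 2)
allocate(disk, "B", 2)
allocate(disk, "C", 2)
history.append("".join(disk))

allocate(disk, "A", 2)  # A grows; its new blocks land after C
history.append("".join(disk))

delete(disk, "B")       # leaves a hole between A and C
history.append("".join(disk))

allocate(disk, "D", 3)  # D fills the hole, then spills past A
history.append("".join(disk))

for layout in history:
    print(layout)
```

After the last step both A and D occupy two separate runs of blocks, which is why the inode has to record every block number rather than a single start address.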

File System Requirements
- Naming
  - Should be flexible, e.g., allow multiple names for the same file
  - Support hierarchy for ease of use
- Persistence
  - Want to be sure data has been written to disk in case a crash occurs
- Sharing/Protection
  - Want to restrict who has access to files
  - Want to share files with other users

File System Requirements (cont'd)
- Speed & efficiency for different access patterns
  - Sequential access
  - Random access
  - Keyed access (not usually provided by OS)
- Minimum space overhead
  - Disk space needed to store metadata is lost for user data
- Twist: all metadata that is required to do translation must be stored on disk
  - Translation scheme should minimize the number of additional accesses for a given access pattern
  - Harder than, say, page tables, where we assumed the page tables themselves are not subject to paging!

Key Issues
- Where to store file metadata?
  - On disk for local file systems
  - On dedicated server(s) for distributed/parallel file systems
- How to store file data?
  - As a whole on one disk
  - Split and stored on multiple disks
- How to guarantee reliability and efficiency?
  - Reliability: replication, RAID, dedicated supervisor, ...
  - Efficiency: replication, cache, hardware-specific space allocation, ...
- How to set block size?
  - Source: Tanenbaum, Modern Operating Systems
  - Assumption: all files are 2 KB in size
  - Question: Why is the data rate low for small block sizes?
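
One way to approach the block-size question is to model the time to read a single block as seek time plus rotational delay plus transfer time. The sketch below does that. The disk parameters (5 ms seek, 8.33 ms per rotation, 1 MB per track) are assumptions chosen for illustration; only the 2 KB file size comes from the slide.

```python
import math

SEEK_MS = 5.0            # assumed average seek time
ROTATION_MS = 8.33       # assumed time for one full rotation
TRACK_BYTES = 1_000_000  # assumed bytes per track
FILE_BYTES = 2 * 1024    # the slide's assumption: every file is 2 KB


def read_time_ms(block_bytes: int) -> float:
    """Seek + average rotational delay (half a rotation) + transfer time."""
    transfer = block_bytes / TRACK_BYTES * ROTATION_MS
    return SEEK_MS + ROTATION_MS / 2 + transfer


def data_rate_mb_per_s(block_bytes: int) -> float:
    """Useful bytes delivered per second when reading one block at a time."""
    return (block_bytes / 1e6) / (read_time_ms(block_bytes) / 1e3)


def space_utilization(block_bytes: int) -> float:
    """Fraction of allocated space that holds file data for a 2 KB file."""
    blocks_needed = math.ceil(FILE_BYTES / block_bytes)
    return FILE_BYTES / (blocks_needed * block_bytes)


for kb in (1, 2, 4, 16, 64):
    size = kb * 1024
    print(f"{kb:>3} KB  {data_rate_mb_per_s(size):7.3f} MB/s  "
          f"{space_utilization(size):6.1%} of allocated space used")
```

Under these assumptions the seek and rotational delay (about 9 ms) dominate the read time of a small block, so each block pays nearly the full positioning cost for very little data. Larger blocks amortize that cost but waste space once the block exceeds the 2 KB file size.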
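
Distributed File Systems
- Support access to files on remote servers
  - Uniform view of files
- Must support concurrency
  - Make varying guarantees about locking, who "wins" with concurrent writes, etc.
- Must gracefully handle dropped connections
- Can offer support for replication and local caching
- Different implementations sit in different places on the complexity/feature scale

Overview of Distributed File Systems
- Scalability: nodes must be able to join and leave in a hot-pluggable manner
- Concurrency: every cloud component must be designed to be safe in a concurrent environment
- Reliability: every cloud component must know how the components it depends on can fail, and must be designed to handle each failure appropriately
- Efficiency: the algorithms by which users share data in the cloud system should avoid performance bottlenecks; frequently used data needs replicas so that users can fetch it from nearby in the shortest time; the interface to the cloud service should be as simple as possible
- Topics
  - Naming service
  - Metadata management
  - Cache
  - Replica
  - Interface
- Examples: NFS, AFS, GFS/HDFS

Distributed File Systems: Naming Service and Metadata Management
- Naming service
  - Establishes the relationship between physical objects and logical objects
  - Basic requirements
    - Location transparency: use a single file namespace
    - Location independence: changing the physical location does not require changing the logical file name
- Metadata management
  - Metadata: data about data
    - File name, file size, timestamps, control information, user, group, ...
  - Two management approaches
    - In-band mode: metadata is stored together with the data
      - Inefficient; operations on large data volumes easily become a bottleneck
    - Out-of-band mode: dedicated servers store the metadata

Distributed File Systems: Cache
- Purpose: performance, i.e., more efficient file access
- What is cached
  - Metadata: improves concurrency
  - Data: reduces network traffic
- Where
  - Memory: fast, but costly
  - Disk: supports large files and offline use
- Cache consistency solutions
  - Client-initiated
  - Server-initiated

Distributed File Systems: Replicas and Interface
- Replicas
  - Purpose
    - Ensure reliability
    - Ensure availability
    - Achieve load balancing
  - Requirement
    - Replica locations are transparent to users
  - Issue: consistency
    - Strong consistency
    - Weak consistency
- Interface
  - Stateless service
    - The server records no state; every request is self-contained
    - Request messages are large, processing takes longer, and lock operations are not supported
  - Stateful service
    - The server records session information for requests

Architecture Choice: Scale Up

Architecture Choice: Scale Out

Scale up vs. Scale out
  Scaling factor                 Scale-out (SAN/NAS)               Scale-up (DAS/SAN/NAS)
  Hardware scaling               Add hardware                      Replace hardware
  Hardware limits                No hardware limit                 Hardware-limited
  Availability, reliability      Higher                            Lower
  Management complexity          Resources [...], management [...] Fewer resources to manage
  Across geographic locations    Yes                               No
  NAS usable                     Yes, NAS mechanisms are common    Yes
  SAN usable                     Yes, by adding switches           Yes
  DAS usable                     Limited                           Yes
  Disruptiveness                 Lower                             Higher

Distributed File System Example: GFS/HDFS
- Product characteristics: built on low-cost PC servers + open-source Linux + gigabit Ethernet + in-house development
- Highly scalable: a single cluster can grow beyond ten thousand nodes, with several hundred PB of storage
- Coupled with computation: moving the computation to the nodes that hold the data improves compute performance; mainly used for data analysis
- Data reliability: multiple replicas protect the data, usually 3 replicas
- Files are split into fixed-size blocks (chunks)
- One primary master, multiple shadow masters
- Multiple chunkservers
- Multiple clients
- HDFS: the open-source implementation of GFS

Google File System (GFS)
- Why not use an existing file system?
  - Google's problems are different from anyone else's
- Assumptions
  - High component failure rates
    - Inexpensive commodity components fail all the time
  - "Modest" number of HUGE files
    - Just a few million
    - Each is 100 MB or larger; multi-GB files typical
  - Files are write-once, mostly appended to
    - Perhaps concurrently
  - Large streaming reads
  - High sustained throughput favored over low latency

GFS Design Decisions
- Files stored in chunks
  - Fixed size (64 MB)
- Reliability through replication
  - Each chunk replicated across 3+ chunkservers
- Single master to coordinate access, keep metadata
  - Simple centralized management
- No data caching
  - Little benefit due to large data sets, streaming reads
- Familiar interface, but customized API
  - Simplify the problem; focus on apps
  - Add snapshot and record append operations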
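
The fixed chunk size and 3-way replication listed above can be sketched as follows. Only the 64 MB chunk size and the replica count come from the slides. The round-robin `place_replicas` policy and the server names are hypothetical stand-ins, since the slides do not describe how the master chooses chunkservers.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size from the slide (64 MB)
REPLICAS = 3                   # each chunk is replicated across 3+ chunkservers


def chunk_index(offset: int) -> int:
    """Translate a byte offset within a file into a chunk index."""
    return offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> range:
    """Chunk indexes touched by reading `length` bytes starting at `offset`."""
    if length <= 0:
        return range(0)
    return range(chunk_index(offset), chunk_index(offset + length - 1) + 1)


def place_replicas(index: int, servers: list) -> list:
    """Hypothetical placement: REPLICAS consecutive servers, rotated by chunk index."""
    if len(servers) < REPLICAS:
        raise ValueError("need at least REPLICAS chunkservers")
    return [servers[(index + k) % len(servers)] for k in range(REPLICAS)]


MB = 1024 * 1024
servers = [f"cs{i}" for i in range(5)]  # made-up chunkserver names

# A 10 MB read starting at offset 60 MB crosses the first chunk boundary.
for idx in chunks_for_range(60 * MB, 10 * MB):
    print(idx, place_replicas(idx, servers))
```

Because the chunk size is fixed, a client can compute the chunk index locally and only needs the master to resolve that index to chunkserver locations.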
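
Optimization of Metadata Service
- Splitting the functions of a single master into
  - Multiple metadata servers
  - Multiple supervisors that are in charge of system monitoring, fault recovery, replica management, garbage collection

Metadata Server Implementation
- Basic principles:
  - Automatic failure recovery and failover of the metadata service after a node goes down must be implemented, keeping the metadata service available as far as possible
  - To support diverse workloads, the metadata servers must be scalable
  - Minimize the number of interactions between metadata nodes and other nodes to reduce the load on the metadata nodes
- Files are organized into a traditional tree
- Read-write locks
- Control lists with redundancy removed

Data Server Implementation
- Files are split into chunks of 32 MB; each chunk corresponds to one physical file in the Linux file system
- 128-bit chunk IDs are generated with a UUID algorithm
- The MD5 value of each chunk file's data is recorded to check the integrity of stored data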
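
A minimal sketch of the chunk ID and integrity check described above, using Python's standard `uuid` and `hashlib` modules. The slides do not say which UUID variant is used, so the random `uuid4` here is an assumption. The in-memory `store` dictionary merely stands in for chunk files on the data server.

```python
import hashlib
import uuid


def new_chunk_id() -> str:
    """A random 128-bit identifier rendered as 32 hex characters."""
    return uuid.uuid4().hex


def store_chunk(store: dict, data: bytes) -> str:
    """Save the chunk together with the MD5 digest of its contents."""
    chunk_id = new_chunk_id()
    store[chunk_id] = {"data": data, "md5": hashlib.md5(data).hexdigest()}
    return chunk_id


def verify_chunk(store: dict, chunk_id: str) -> bool:
    """Recompute the MD5 digest and compare it with the recorded one."""
    entry = store[chunk_id]
    return hashlib.md5(entry["data"]).hexdigest() == entry["md5"]


store = {}
cid = store_chunk(store, b"chunk payload")
intact = verify_chunk(store, cid)

store[cid]["data"] = b"chunk pay1oad"  # simulate silent corruption on disk
corrupted_detected = not verify_chunk(store, cid)

print(cid, intact, corrupted_detected)
```

MD5 is adequate here for detecting accidental corruption, which is the use the slide describes. It should not be relied on against deliberate tampering.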
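
Supervisor Implementation

Small-File Optimization Based on Inlining and Heat Statistics
- In a distributed file system that separates data from metadata, small-file access is limited mainly by network latency; a small-file optimization based on inlining and heat statistics is proposed to improve small-file performance
- Results:
  - With inline data, small-file performance improves by about 2x
  - Data migration balances the performance gain from inline data against the overhead it places on the metadata server
- File inlining
  - For small files, the data is kept in the metadata
  - When the file is opened, the data is sent to the client together with the metadata, which eliminates the time to compute the data location and the communication with the object [...]
- Inline data migration based on heat statistics
  - Frequent writes to inline data can increase the load on the metadata server
  - The client automatically computes the write heat of inline data
  - When to migrate inline data
    - File size exceeds the threshold
    - Heat exceeds the defined threshold
[Figure: Inline data. Time (seconds) vs. File Numbers (1000, 2000, 3000), without inline data vs. with inline data]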
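
The inlining and migration rules above can be sketched as a small state machine. Everything concrete here is an assumption made for illustration: the 4 KB size threshold, the heat threshold of 3 writes, the `FileMeta` fields, and the use of a plain dictionary as the object store. The slides give the two migration conditions but not their values.

```python
from dataclasses import dataclass
from typing import Optional

INLINE_MAX_BYTES = 4096  # assumed size threshold for keeping data inline
WRITE_HEAT_MAX = 3       # assumed write-heat threshold


@dataclass
class FileMeta:
    name: str
    inline_data: Optional[bytes] = b""  # None once the data has been migrated
    write_heat: int = 0                 # write counter kept by the client


def should_migrate(meta: FileMeta) -> bool:
    """Migrate when the file is too large or written too often."""
    if meta.inline_data is None:
        return False
    return len(meta.inline_data) > INLINE_MAX_BYTES or meta.write_heat > WRITE_HEAT_MAX


def write(meta: FileMeta, data: bytes, object_store: dict) -> None:
    meta.write_heat += 1
    if meta.inline_data is None:
        object_store[meta.name] = data  # already migrated: write to the object store
        return
    meta.inline_data = data
    if should_migrate(meta):
        object_store[meta.name] = meta.inline_data
        meta.inline_data = None


def open_file(meta: FileMeta, object_store: dict):
    """Return (data, round_trips) to show what inlining saves."""
    if meta.inline_data is not None:
        return meta.inline_data, 1     # data arrives together with the metadata
    return object_store[meta.name], 2  # metadata first, then the object store


objects = {}

small = FileMeta("small.txt")
write(small, b"x" * 100, objects)

large = FileMeta("large.bin")
write(large, b"x" * 5000, objects)

hot = FileMeta("hot.log")
for _ in range(4):
    write(hot, b"entry", objects)

print(open_file(small, objects)[1], open_file(large, objects)[1], open_file(hot, objects)[1])
```

The round-trip count is a simplification of the saving the slide describes: an inline file is served in a single exchange with the metadata server, while a migrated file needs a second exchange to fetch the data.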
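
Set-Model-Based Mass File Technology
- Goal: meet the requirements of hundreds of billions of files, providing TB-scale data [...] and fast operations support
- Ideas
  1. A data Set model: deployment, capacity expansion, and management are all done in units of a Set
  2. The file index is separated from the data: the file index and the on-disk data index together locate the file data, and the on-disk data index is held entirely in memory for efficient I/O
  3. A capacity-balancing scheduling algorithm across Sets: new capacity is scheduled according to Set state and space utilization, so that capacity stays balanced
- Results: solves the problem of hundreds of billions of files for the [...] photo album; the album holds 500+ billion photos, grows by 300+ million photos per day, and totals 100+ PB
[Figure: architecture. The access layer updates the file index <file name, chid, fid> held by the index servers (master, Idx-master) and stores or reads file data by <chid, fid>; within each storage Set, the storage servers map fid -> offset]

SSD-Based High-Performance Distributed [...] Technology
- Goal: serve hundreds of businesses and trillions of small records with no hot spots, at low cost, with high concurrency and low latency
- Ideas
  1. A single-machine resource multiplexing model: each machine's space and I/O resources are divided into fixed small units, with [...] between units and fair I/O
  2. Hybrid indexing provides a local data index with low memory overhead and efficient I/O: small records use a hash index to reduce the number of index entries, and large records are indexed individually to improve I/O efficiency
  3. SSD write optimization at the application layer: a write cache gives low-latency responses, and dynamic indexing plus write merging turn highly concurrent random writes of small records into infrequent large-block writes
- Results: provides basic data services for SNS with high data reliability and high-density reads and writes; [...] volume 10+ TB, small records (< 100 bytes), [...] volume 400,000+ per second (40+w/s), long-tail [...] with no hot spots
[Figure: read and write paths on an SSD storage server (2 GB storage units, shared-memory write cache, hybrid index, unit-fair read/write I/O scheduling). Read path: 1. read updates, 2. read index, 3. read SSD data according to the index. Write path: 1. read updates, 2. pack into fixed-size large data blocks and write, 3. update index. Other labels: storage access, Get/Set/Del, metadata management, incremental synchronization]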
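
The write path above (buffer updates, pack them into a fixed-size large block, then update the index) can be sketched as follows. This is an illustration under assumptions, not the system's actual design: the 64-byte block size is deliberately tiny so the behavior is easy to follow, a Python dictionary stands in for the hash index, and a list of byte strings stands in for the SSD.

```python
BLOCK_BYTES = 64  # assumed fixed block size, kept tiny for the example


class WriteMerger:
    """Collects small record writes and flushes them as one large block."""

    def __init__(self) -> None:
        self.buffer = bytearray()  # write cache for the block being built
        self.pending = {}          # key -> (offset, length) inside the buffer
        self.blocks = []           # flushed blocks, standing in for the SSD
        self.index = {}            # key -> (block_no, offset, length)

    def set(self, key: str, value: bytes) -> None:
        if len(value) > BLOCK_BYTES:
            raise ValueError("record larger than one block")
        if len(self.buffer) + len(value) > BLOCK_BYTES:
            self.flush()
        self.pending[key] = (len(self.buffer), len(value))
        self.buffer += value  # acknowledged from the cache, no device write yet

    def flush(self) -> None:
        if not self.buffer:
            return
        block_no = len(self.blocks)
        self.blocks.append(bytes(self.buffer))  # one large sequential write
        for key, (offset, length) in self.pending.items():
            self.index[key] = (block_no, offset, length)
        self.buffer.clear()
        self.pending.clear()

    def get(self, key: str) -> bytes:
        if key in self.pending:  # newest value is still in the write cache
            offset, length = self.pending[key]
            return bytes(self.buffer[offset:offset + length])
        block_no, offset, length = self.index[key]
        return self.blocks[block_no][offset:offset + length]


merger = WriteMerger()
for i in range(10):
    merger.set(f"k{i}", bytes([i]) * 16)  # ten 16-byte records

print(len(merger.blocks), merger.index["k5"], merger.get("k9"))
```

Ten small writes produce only two block writes at this point, with the last two records still held in the write cache until the next flush.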

IceFS: Separating Physical Structure
- Source: Physical Disentanglement in a Container-Based File System, OSDI 2014
- New abstraction: cube
  - Enables the grouping of files and directories inside a physically isolated container
- Benefits
  - Localized reaction to faults
  - Fast recovery
  - Concurrent file-system updates

Using New Media
- DRAM management
  - LRU block replacement
- Flash management
  - Segment = a set of blocks / erasing unit
  - Segment list (Free/Clean/Dirty)
  - Segment replacement (FIFO or LRU)
- Disk management
  - Power management by spin up/down
- Source: FLASHCACHE [HCSS'94]
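
LRU block replacement, as named for the DRAM tier above, can be sketched with `collections.OrderedDict`. This shows the replacement policy only. It is not taken from the FLASHCACHE paper, and the capacity of 2 blocks is an arbitrary choice for the example.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Fixed-capacity block cache that evicts the least recently used block."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last

    def get(self, block_no: int):
        """Return the cached block, or None on a miss."""
        if block_no not in self.blocks:
            return None
        self.blocks.move_to_end(block_no)  # mark as most recently used
        return self.blocks[block_no]

    def put(self, block_no: int, data: bytes):
        """Insert a block; return the evicted block number, if any."""
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            evicted, _ = self.blocks.popitem(last=False)
            return evicted
        return None


cache = LRUBlockCache(capacity=2)
cache.put(1, b"one")
cache.put(2, b"two")
cache.get(1)                    # block 1 becomes the most recently used
evicted = cache.put(3, b"three")
print(evicted, list(cache.blocks))
```

In a tiered design, the evicted block would be the candidate to write down to the next tier rather than simply being dropped.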

Using New Media
- To reduce the power consumption of disk
- NVCache
  - To reduce disk power consumption by combining an adaptive disk spin-down algorithm
  - To extend spin-down periods by undertaking i[...]