翻譯以.原文和在同一文件中前一致性或延遲性一種基于狀態(tài)機(jī)_第1頁(yè)
翻譯以.原文和在同一文件中前一致性或延遲性一種基于狀態(tài)機(jī)_第2頁(yè)
翻譯以.原文和在同一文件中前一致性或延遲性一種基于狀態(tài)機(jī)_第3頁(yè)
翻譯以.原文和在同一文件中前一致性或延遲性一種基于狀態(tài)機(jī)_第4頁(yè)
翻譯以.原文和在同一文件中前一致性或延遲性一種基于狀態(tài)機(jī)_第5頁(yè)
免費(fèi)預(yù)覽已結(jié)束,剩余42頁(yè)可下載查看

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

方法作者:XuWang,Hailong摘CAPPACELC宣稱(chēng)在分布式副本系統(tǒng)中存在一些性能評(píng)量的折RSMRSM-d,我們建立了概率模型去量化一致性和延遲性。一致性和延遲性都受d的影響,而d表示協(xié)關(guān)鍵詞:一致性,寫(xiě)沖突,延遲性, 引[4]進(jìn)一步解釋了CAP,并提出了一致性與可用性,一致性與延遲之間的兩種權(quán)衡。系統(tǒng)總是可寫(xiě)的并保持最終一致性,如Dynamo[5],Cassandra[6]。然而其他系統(tǒng)如Chubby[7],Zookeeper[8]必須有段很短的時(shí)間來(lái)從錯(cuò)誤中進(jìn)行恢復(fù)來(lái)保證各副本管弱一致性已經(jīng)在一些商業(yè)系統(tǒng)中運(yùn)用并且被許多最求高可用性和低延遲的開(kāi)發(fā)者bing搜索業(yè)務(wù),21.8%4.3%;由于延時(shí)從400ms增加到900ms,谷歌在搜索領(lǐng)域的業(yè)務(wù)量下降了25%;100ms的延遲增加使亞馬遜的銷(xiāo)售額下降了一個(gè)百分點(diǎn)[13]。因此在系統(tǒng)設(shè)計(jì)中延遲是一項(xiàng)很重要的考慮RSMs[14]Quorum[15]系統(tǒng)是分布式系統(tǒng)和數(shù)據(jù)庫(kù)領(lǐng)域中比較流行的兩種容WR兩種副本集合。W集合用于寫(xiě)操作,R用于讀操作。在隨機(jī)時(shí)間通過(guò)RSMsquorum系統(tǒng)提供RSMsquorumRSMs的寫(xiě)RSMs一致性和延遲之間,基于不一致性的定量結(jié)果做出一個(gè)合理理性的權(quán)衡。束[19,20]和限制副本的寫(xiě)分離[21,22]RSMs的寫(xiě)不在這篇論文中,提出了一個(gè)定量的概率分析方法來(lái)計(jì)算RSMs的不一致性和延一點(diǎn),調(diào)查和分析了三種代表性的RSM實(shí)現(xiàn),分別為非一致全序廣播[23,24],分調(diào)查和分析[27-29]。然后,我們提出了一種通用系統(tǒng)模型RSM-d,d的取值范圍為[1-n](n為副本數(shù)。D表示提交一個(gè)寫(xiě)操作之前協(xié)調(diào)者必須收到的ACKs數(shù)。模型表明(1)在d∈[1,[n/2]+1]時(shí)一致性隨著d的增加而增強(qiáng)(2)當(dāng)d大于[n/2]+1d的增加而單調(diào)遞增。第四,將寫(xiě)不一致性的量化結(jié)RSM-d的延遲聯(lián)系起來(lái),RSM-d的基礎(chǔ)上,我們?yōu)閷?xiě)沖突構(gòu)建了一個(gè)概率模型,寫(xiě)沖突是造成不一致性都受到一個(gè)共同因素的影響,這個(gè)共同因素就是d。 背RSMs的背景,RSMs致力于容錯(cuò)在分布式系統(tǒng)中得到ChandraToueg[30]已經(jīng)證明全序廣播和分布一致性問(wèn)題是可解決的,在這次研究中,我們也遵循這些假設(shè)。這就意味著我們討論的RSMs有2f+1個(gè)副本,能f個(gè)崩潰錯(cuò)誤,并且最終能夠偵測(cè)到崩潰的節(jié)點(diǎn)。然而,實(shí)際的錯(cuò)誤探測(cè)器在一定時(shí)間內(nèi)必須輸出是否崩潰的結(jié)果,而不是長(zhǎng)時(shí)間等待一個(gè)最終結(jié)論。[31]將在[23-29]調(diào)查中,RSMs通過(guò)非一致全序廣播,分布式一致和一致全序廣播實(shí)現(xiàn)。如圖2.1所示,這三種具有代表性的RSM實(shí)現(xiàn)時(shí)這樣實(shí)現(xiàn)的:一個(gè)獨(dú)一的序列號(hào)并且將其提交給所有的副本。當(dāng)寫(xiě)操作消息抵達(dá)副本,副本會(huì)記錄下這個(gè)寫(xiě)操作,并且根據(jù)序列號(hào)進(jìn)行提交,然后回復(fù)協(xié)調(diào)者。只要協(xié)調(diào)者受到第一個(gè)回復(fù),它就會(huì)向客戶(hù)端發(fā)送響應(yīng)信息。任意時(shí)刻如果協(xié)調(diào)者被認(rèn)RSMs的分布式一致,一旦協(xié)調(diào)者收到一個(gè)寫(xiě)操作命令,首先會(huì)1[n/2]+1n RSM-d系統(tǒng)模型,這個(gè)模型RSM模型的抽象表示。因RSMs,通過(guò)調(diào)查研究它們來(lái)獲取盡管已經(jīng)在第二部分介紹過(guò)了全序廣播算法,也僅是全序廣播綜合調(diào)查結(jié)果5外兩種是基于全序邏輯時(shí)鐘[16]的歷史交流和目的地協(xié)議實(shí)現(xiàn)。RSM和分布式一致的解決方案都是目的地協(xié)議。Paxos用于對(duì)單個(gè)寫(xiě)操作排序的協(xié)商,multi-paxos針3.1RSM-dPaxos算法進(jìn)行全序廣播。原因是:1)固定的定序器算法在所有的全序廣播里的延遲是最少的。2)Paxos在3)在動(dòng)態(tài)定序器中的標(biāo)示符,基于特權(quán)的算法,在通信Paxos算法相結(jié)合建立一個(gè)統(tǒng)一的系統(tǒng)RSM-d。這是切實(shí)可行的,因?yàn)槿驈V播和分布式一致是基本相等的。如圖2dd屬于[1;n]d副本也包括了協(xié)調(diào)器,也就是一果d=1,RSM-d降低為不均一全序廣播,協(xié)調(diào)器可以直接進(jìn)入提交階段。如果d=[n=2]+1,RSM-dPaxosd=n,RSM-d變成為一致的全序廣播。我們不只是涉及到了這三個(gè)代表性的RSMs,而且將深入探討其他的在過(guò)去被忽視的RSM-dRSMRSMs的假設(shè)。RSMs假設(shè)同時(shí)崩潰的效檢測(cè)器將開(kāi)始選擇新的協(xié)調(diào)器,并且只在歷史記錄的基礎(chǔ)上更改錯(cuò)誤。在RSM-d有多個(gè)協(xié)調(diào)器,如Multi-RingPaxos,他與Paxos相協(xié)調(diào)。 寫(xiě)沖突概RSM-d,如ACK數(shù)??≥[??/2]+1,所有的寫(xiě)計(jì)算被提交到他們普通序和障修復(fù)后,f??[??/2]有故障出現(xiàn),副本仍將保持一致性。然后如果??[??/21,ACK消息的圖4.1顯示了當(dāng)至少d節(jié)點(diǎn)接收到了“PROPOSE產(chǎn)生故障時(shí),寫(xiě)日志丟失的方案。假設(shè)????是一個(gè)寫(xiě)計(jì)算所有日志丟失的可能。如d(b)之下,????=????。4.1因?yàn)闅v史記錄只寶庫(kù)了日志歷史,所以????也是在歷史記錄中的一個(gè)寫(xiě)計(jì)算丟????= 如今我們不考慮條件(c,但(a)和(b)仍保留。這意味著我們?cè)试S提交的)????????=∑??????1????????[??????=????????????= ∑ ??????1???????? 現(xiàn)在我們進(jìn)一步的不考慮(b),并只考慮(a。因此我們應(yīng)該考慮到不同步的度作用是隨機(jī)數(shù)據(jù)下的????????。為了便捷,我們用以下公式表示:????=??????∑??????1?????d<x<n。明顯的可以看出方程式(2)可以被寫(xiě)成??????????。當(dāng)??????時(shí),寫(xiě)沖突的條件幾率也是如此。因此我們可以通過(guò)對(duì)每一個(gè)可能的??=??條件幾率之??????=∑??????????

=∑??????????????∑??????1????? 如上述的,在不同條件下協(xié)調(diào)器選擇和故障恢復(fù),等式(1(2)和(3)都代??????= =∑??????????4.3圖4.3顯示了寫(xiě)重復(fù)的場(chǎng)景。當(dāng)一個(gè)不準(zhǔn)確的故障檢測(cè)器對(duì)協(xié)調(diào)者存在錯(cuò)誤懷疑,至少存在一個(gè)調(diào)解節(jié)點(diǎn)。然而,如果dn2Pno表示沒(méi)有相同節(jié)點(diǎn)的回復(fù)確認(rèn)消息的兩個(gè)節(jié)點(diǎn)集合的概率。除去兩個(gè)協(xié)調(diào)者本身,其他兩個(gè)隨機(jī)(d1)節(jié)點(diǎn)集可能不重疊,且都是不同的節(jié)點(diǎn)。因此,Pno是兩個(gè)隨機(jī)(d1)節(jié)點(diǎn)集數(shù),沒(méi)有重疊和排除兩個(gè)協(xié)調(diào)者除以所有兩個(gè)隨機(jī)節(jié)點(diǎn)集(??? ??????

=??? ??? ???

?????????圖4.4所示,主要工作如下:在時(shí)距Te4.4分布函數(shù)分別為??和?????2???1是它們的兩個(gè)樣本,分別代表兩個(gè)連續(xù)心跳信息的延遲信息?!案北驹诙〞r(shí)器超時(shí)前沒(méi)有接收到新的心跳信息”意味著???2+????1+????2+????1+??,副本仍然認(rèn)為它崩潰了。兩個(gè)節(jié)點(diǎn)之間的這種錯(cuò)誤懷疑的概率由Pfs表示:????8=???????2+??>???1+=???????2????1>???=????8

??

??Var(T)T的方差。如果我們不知道T的分布,上述不等式可用????8的上限Pgfs中,對(duì)協(xié)調(diào)者錯(cuò)誤懷疑的概率可以通過(guò)加和n-f-1活躍副本的概率,乘以至少有一個(gè)活躍副本錯(cuò)誤懷疑協(xié)調(diào)者的n-f-1活躍副本來(lái)計(jì)算:

???

=∑(

) 1? 1?1?Pgfs的流傳式故障檢測(cè)器(由另一個(gè)符號(hào)Pgs-gfs表示)有一點(diǎn)不同: ??? ???1???8?????8=∑

) 1? Pwd表示寫(xiě)重復(fù)概率。寫(xiě)重復(fù)發(fā)生如果(1)協(xié)調(diào)者正常工作但(2)其他活認(rèn)信息不重疊,即,Pwd等于這三個(gè)事件的概率的乘積:????=1????? 但難以準(zhǔn)確計(jì)算Pno,因?yàn)樗Q于特定協(xié)調(diào)者選定的持續(xù)時(shí)間,故障恢復(fù)算法,[]??????=????+????=∑??????+????+1????? 雖然??????的最終的結(jié)果給出了,我們應(yīng)該在不同的情況下采用它。例如, 時(shí)正如圖5.1所示,我們已經(jīng)在即將使用的表5.2??????=????+??????+??????=??=????=????=0。若不考慮這種序。例如,????1≤????2≤?≤???????1。????k小的????????????????“COMMIT”消息從協(xié)調(diào)者到副本i所需時(shí)??????????????=?????1+?????? 當(dāng)d2????1=??????????????????????+?????? ??????????????????????+min(????)+????????+??5.1接下來(lái)我們關(guān)注信息的推遲而忽略登陸時(shí)間??????(或者可以認(rèn)為是時(shí)間非常短。因????和??????都是隨機(jī)變量T的樣本(T的定義在第IV部分,遵循概率密度函遵循概率密度函 ??????=????+??

???? ???????(f(t)IV部分)和累計(jì)分布函數(shù)G(t)。??????的pdf和cdf分別由??? 和????t1≤k≤n?1表示。直觀上來(lái)說(shuō),對(duì)于足夠小的正實(shí)數(shù)??,??????∈[t,t+ε],只要存在一個(gè)??????∈[tt+ε],使得??????取值(k-1)t,對(duì)于其他??????取值??1????????(??????∈[,+???= ) ?????[t, ???

???1???????? +ε ???2=???1??(????∈[t,t+ε]) ) ???? ???? +ε ??? ???=???1 )?????? ???1?????? +ε??? ???=???1

) ???1?????? +ε??? ???=???1 ) ???11? + ??? =??????????(????????[t,t+ε])=???1 ??? ???11? ???通過(guò)方程(7),我們可以猜想??????1????????≥1:????????+1????? = +min??????)???(?????1+min??????= ??????1以上兩個(gè)minLCi被抵消,因?yàn)樘峤浑A段沒(méi)有變化,并且????0=0f(t)t0時(shí),f(t)=0t0時(shí),g(t)=0,G(t)=0,G+∞=1。如果d=1,那么??????2????? =?????? = ???1 1? ???2= ???1? 0= 1? ???1???

1? ???1= 1? ?

????????00若假設(shè) 存在,那么?? =∫+∞ ??是一個(gè)常量。在這種情況下很容0以至于:??????2????? 如果??≥2

???????????1=0+∞1? ????????+1????? = )???????=∫ ???1 (???2) )?1(1? ???

??? ???

??? ?1 ( ) (1? ??= ) ? ? ???

???

???

=???

) 0

(1?

???

????類(lèi)似的,如果??

???? =??

??+1?

=(???1)∫???

?11? ??? ??≥讓???????=(???1

? (1?

?? ????????+1????? =????? ??≥ 顯然???????>0。所以方程(8)f(x)n的數(shù) 一致性和延遲之間的長(zhǎng)期運(yùn)行的RSMs,系統(tǒng)的一致性象征著有序提交的寫(xiě)操作比率,這能通過(guò)1-PwcE(Lw(d))進(jìn)行估算。在這部分,B是系統(tǒng)總效益(比如,收入或用戶(hù)體驗(yàn),BcBl是一致性和延遲對(duì)系統(tǒng)效益的影響值。舉例來(lái)說(shuō),B能看做網(wǎng)上商店的總收入,強(qiáng)一致性的事務(wù)性購(gòu)買(mǎi)會(huì)使收入增加,而高延遲將導(dǎo)致用戶(hù)減少和收入減少。ξ表示其他因素對(duì)總效益的????=????+??+??=??(1?????????)+??+??????????? +=??+??+??+?????????????+????????=?????????????+????????=??????????? ???????????+???????? ???????=??????????? ???其中??=??+??+??+??并且????=??????????+????????????????? 盡管??,??,??和??????1d的取值,對(duì)于一V(d)是上述等式的可變部分。V(d)取最小值時(shí)B(d)取最大值。????=??????????? ?????≤??????????? ?????′=????′=的d的值。B(1)=???????????1+???????? ??([2]+1)=???????????[2]+????=???????????通過(guò)上述方程組,我們能得到??,??,??d取值是一定的,而 實(shí)的失敗率,RNSanti-entropygossip-style故障檢測(cè)會(huì)使系統(tǒng)更復(fù)雜和更不確定。在本節(jié)中,我們關(guān)注的是驗(yàn)證我們的結(jié)果,從而減少RSM-d實(shí)驗(yàn)的假設(shè)變量數(shù)量的最低值,擴(kuò)大應(yīng)用范圍。因此不考慮anti-entropy和gossip-style的故障檢測(cè)。為????和????。因此我們估計(jì)??????=????????/10000000。與此同時(shí),我們計(jì)算延遲的平均值來(lái)估計(jì)????d的期望。s0.05(20ms,0.1(10ms)0.2(5ms被分別設(shè)置到(1000ms,2000ms)和(500ms,1000ms。在以上輸入?yún)?shù)下RMSE=0.0009%,stddev.=0.0052%.dn21,我們觀察到????和????為了證明方程(8均布fx)循=0:01;0:2;0:05;:1;:2,該分布已傳入相應(yīng)信息。我們執(zhí)行一個(gè)空寫(xiě)(沒(méi)有操作)保證日志記錄和寫(xiě)入提交的時(shí)間可以忽略。我們對(duì)每個(gè)??∈,9和??∈,??進(jìn)行測(cè)試,得到延遲時(shí)間的平均值并預(yù)測(cè)方程(8)給出的延遲期望值。在所有情況下將?????? ?????1 的觀測(cè)均0.013ms和t.d0.019ms,符合們對(duì)延遲的預(yù)期。dλ=0:01,Pc=2%,Te=1000ms,To=2000ms。7.1所示,xn(2<n<9)和相應(yīng)的應(yīng)答d(1<d<[n=2]),y軸表示的是寫(xiě)入沖突的概率。為了強(qiáng)調(diào)不同的差異,我們用log??????來(lái)代替??????。我們每步下降至少一個(gè)數(shù)量級(jí)。注意到當(dāng)d>[n/2]時(shí),寫(xiě)入沖突的概率??????為0,所以并d方程8揭示了寫(xiě)延遲與d的關(guān)系。在實(shí)驗(yàn)中,我們想要更直觀的展示出寫(xiě)延遲df(x)被設(shè)置為λ=0.01。同時(shí),還執(zhí)行一個(gè)空寫(xiě)入擾最終的結(jié)果,因?yàn)????1包括的所有值將被刪除?!蔥和相應(yīng)數(shù)量的ACKs數(shù)??∈[1,??/2],y軸表示代替????1的延遲????????????d的增大而增大。在這里我們關(guān)注??∈[2,??/2],實(shí)際上,在??∈[1,??]之間的任意值??????dd值的降低將導(dǎo)致n值的增大。原因是協(xié)調(diào)器會(huì)存在更多的ACKs去選擇最快的d。7.2d一致v.s.延正如第六節(jié)所定義的一致性等于1???????,現(xiàn)在我們結(jié)合以上兩個(gè)實(shí)驗(yàn)的結(jié)果,7.3所示,x軸表示代替????1的延遲??????,y軸表示1??????d3n3,5,7和其相應(yīng)∈[??/2]n和??1??/2],它表明,更強(qiáng)的一致性必然導(dǎo)致糟糕的7.3V.S.RSMs系統(tǒng)作為其銷(xiāo)售系統(tǒng)。雖然難的。但是,如果我們不想知道系統(tǒng)的精確值的好處????,我們只需要通過(guò)統(tǒng)計(jì)方法知道一致性效益比??和延遲效益比??,最后確定??′的最大收益值????′。價(jià)格是V。一方面,如果每個(gè)正常的(一致)購(gòu)買(mǎi)可以帶來(lái)20%的價(jià)格,和每一個(gè)異常(不一致)購(gòu)買(mǎi)帶來(lái)雙重?fù)p失的價(jià)格(如賠償訂單沖突)。產(chǎn)生效益的為20%????1????????2????1?1???????,因此收益率的一致性為??=2.2????。另一方面,用戶(hù)相1001%。也就是說(shuō),收益的比例延遲??為0.01%????。因此,我們有??=2.2????和??=0.01%????。7.4第六節(jié)表明,整體利益????的最大值為其變量????=????2.2??????d0.01%????????????? 有最小值。在??∈[1,??]+1]和??∈[2,9]2經(jīng)計(jì)算出????11????部分描述了變量整體利益????。例如,n=3,d=2,????值為0.125????45d=2時(shí)????有最小值,????Paxos實(shí)現(xiàn)。因 相關(guān)較具有代表性的實(shí)現(xiàn)是非一致全序廣播,PaxosRSMs已經(jīng)RSMs一致性已經(jīng)被大量研究,但依舊在最近吸引著人們的目光。根據(jù)CAP[3]和RSMs中,我們的工作通過(guò)考慮崩潰錯(cuò)誤時(shí)寫(xiě)沖突發(fā)生的概率來(lái)量化不一致 討Quorum系Quorum系統(tǒng)定義了兩類(lèi)副本集合,一類(lèi)是W,另一類(lèi)是讀R。它們能被配ONE(W=1),QUORUM(W=[n/2]+1ALL(W=n)或者是任意值。這和序,而RSMs為全序排列。所以一些在RSMs系統(tǒng)中引起寫(xiě)沖突的情況不一定適用于Quorum系統(tǒng)。在現(xiàn)今的實(shí)際quorum系統(tǒng)中,一般把第二種寫(xiě)看做一個(gè)新的寫(xiě)操RSMs在對(duì)同步提交寫(xiě)操作和副本歷史記錄狀態(tài)時(shí)懷疑協(xié)調(diào)者后會(huì)進(jìn)行協(xié)調(diào)者10現(xiàn)的變體,我們首先提出了一種叫做RSM-d的通用模型來(lái)統(tǒng)一描述大多數(shù)運(yùn)用參考.SummaryoftheAmazonEC2andAmazonRDSServiceDisruptionintheUSEastRegion.April2011J.Dean.Designs,lessons,andadvicefrombuildinglargedistributedsystems.KeynotefromLADIS2009.E.A.Brewer.TowardsRobustDistributedSystems.InPODC,pages7C7,D.J.Abadi.Consistencytradeoffsinmoderndistributeddatabasesystemdesign:CAPisonlypartofthestory.IEEEComputer,45(2):37C42,2012.G.DeCandia,D.Hastorun,M.Jampani,G.Kakulapati,A.Lakshman,A.Pilchin,S.Sivasubramanian,P.Vosshall,andW.Vogels.Dynamo:Amazonshighlyavailablekey-valuestore.InSOSP2007.TheApacheCassandraProject./[7]M.Burrows,”TheChubbylockserviceforloosely-coupleddistributedsystems,”inOSDI’06:Proceedingsofthe7thsymposiumonOperatingsystemsdesignandimplementation,2006,pp.P.Hunt,M.Konar,F.P.Junqueira,andB.Reed,ZooKeeper:Wait-freecoordinationforInternet-scalesystems,inUSENIXATC10:Proceedingsofthe2010USENIXAnnualTechnicalConference.USENIXAssociation,2010.W.Vogels.Eventuallyconsistent.Commun.ACM,52:40C44,JanuaryK.Birman.ReliableDistributedSystemsTechnologies,WebServicesandApplications.Springer,2005.E.SchurmanandJ.Brutlag.Performancerelatedchangesandtheiruserimpact.PresentedatVelocityWebPerformanceandOperationsConference,June2009.VelocityandtheBottomLine. /site/glinden/Home/StanfordDataMining.2006-11-29.ppt.29F.Schneider.Implementingfault-tolerantservicesusingthestatemachineapproach:atutorial.ACMComputingSurveys,22(4),1990.Merideth,MichaelG.,andMichaelK.Reiter.Selectedresultsfromthelatestdecadeofquorumsystemsresearch.Replication,SpringerBerlinHeidelberg,2010,185-206.Lamport,L.Time,clocks,andtheorderingofeventsinadistributedsystem.ACMCommunications,21(7),pp.558-565,1978.D.Malkhi,M.Reiter,A.Wool,andR.Wright.Probabilisticquorum.systems.InformationandCommunication,(170):184C206,2001.PeterBailis,ShivaramVenkataraman,JosephM.Hellerstein,MichaelFranklin,IonStoica.ProbabilisticallyBoundedStalenessforPracticalPartialQuorums.InVLDB2012S.S.Chawathe,H.Garcia-Molina,andJ.Widom.FlexibleConstraintManagementforAutonomousDistributedDatabases.IEEEDataEng.Bull.,17(2):23C27,1994.A.GuptaandS.Tiwari.Distributedconstraintmanagementforcollaborativeengineeringdatabases.InCIKM,pages655C664,1993.H.YuandA.Vahdat.DesignandEvaluationofaContinuousConsistencyModelforReplicatedServices.InOSDI,pages305C318,2000.S.Shah,K.Ramamritham,andP.J.Shenoy.ResilientandCoherencePreservingDisseminationofDynamicDataUsingCooperatingPeers.IEEETrans.Knowl.DataEng.,16(7):799C812,2004XavierDefago,AndreSchiper,andPeterUrban.Totalorderbroadcastandmulticastalgorithms:Taxonomyandsurvey.ACMComput.Surv.36,4(December2004)XavierDefago.Comparativeperformanceanalysisoforderingstrategiesinatomicbroadcastalgorithms.IEICETrans.onInformationandSystems.December2003L.Lamport.PaxosMadeSimple.ACMSIGACTNews,32(4):18C25,DecemberP.Marandi,M.Primi,N.Schiper,andF.Pedone.RingPaxos:Ahigh-throughputatomicbroadcastprotocol.InternationalConferenceonDependableSystemsandNetworks(DSN),2010,pp.527-536.L.Lamport.GeneralizedConsensusandPaxos.TechnicalReportMSRTR-2005-33,MicrosoftResearch,2005,B.KemmeandG.Alonso.Databasereplication:ataleofresearchcommunities.PVLDB,ParisaJaliliMarandiMarcoPrimiFernandoPedone.Multi-RingPaxos.InDSNChandra,T.,Toueg,S.:UnreliableFailureDetectorsforReliableDistributedSystems.JournaloftheACM43(2),225C267(1996).W.Chen.OntheQualityofServiceofFailureDetectors.IEEETransactionsonComputer.May2002.DiogoBecker,FlavioJunqueira,andMarcoSerafini.LeaderElectionforReplicatedServicesUsingApplicationScores.InMiddleware2011.B.Lampson.TheABCDsofPaxos.InProceedingsofthe20thACMSymposiumonPrinciplesofDistributedComputing(PODC01),page13.ACMPress,2001.IditKeidarandSergioRajsbaum.Onthecostoffault-tolerantconsensuswhentherearenofaults-atutorial.TechnicalReportMIT-LCS-TR-821,LaboratoryforComputerScience,MassachusettsInstituteTechnology,Cambridge,MA,02139,May2001.alsopublishedinSIGACTNews32(2)(June2001).A.Demers,D.Greene,C.Hauser,W.Irish,J.Larson,S.Shenker,H.Sturgis,D.Swinehart,andD.Terry.Epidemicalgorithmsforreplicateddatabasemaintenance.InPODC1987.MarinBertier,OlivierMarin,PierreSens,ImplementationandPerformanceEvaluationofanAdaptableFailureDetector,Proceedingsofthe2002InternationalConferenceonDependableSystemsandNetworks,p.354-363,June23-26,2002.M.RaynalandF.Tronel.GroupMembershipFailureDetection:ASimpleProtocolandItsProbabilisticAnalysis.DistributedSystemsEng.J.,vol.6,no.3,pp.95-102,R.vanRenesse,Y.Minsky,andM.Hayden.Agossip-stylefailuredetectionservice.InProceedingsofInternationalConferenceandDistributedSystemsPlatformsandOpenDistributedProcessing(IFIP),1998.J.Gray,P.Helland,P.ONeil,andD.Shasha.Thedangersofreplicationandasolution.InSIGMOD1996.B.F.Cooper,R.Ramakrishnan,U.Srivastava,A.Silberstein,P.Bohannon,H.-A.Jacobsen,N.Puz,D.Weaver,andR.Yerneni.PNUTS:Yahoo!sHostedDataServingPlatform.PVLDB,1:1277C1288,August2008.W.Lloyd,M.J.Freedmand,M.Kaminsky,andD.G.Andersen.Dontsettleforeventual:Scalablecausalconsistencyforwide-areastoragewithCOPS.InSOSP2011.ConsistencyorLatency?AQuantitativeAnalysisofReplicationSystemsBasedonReplicatedStateXuWang,HailongSun,TingDeng,JinpengSchoolofComputerScienceandEngineeringBeihangUniversityBeijing,ChinaEmail:{wangxu,sunhl,dengting,huaijp}—ExistingtheorieslikeCAPandPACELChaveclaimedthattherearetradeoffsbetweensomepairsofper-formancemeasuresindistributedreplicationsystems,suchasconsistencyandlatency.However,currentsystemstakeaveryvagueviewonhowtobalancethosetradeoffs,e.g.eventualconsistency.Inthiswork,weareconcernedwithprovidingaquantitativeanalysisonconsistencyandlatencyforwidely-usedreplicatedstatemachines(RSMs).BasedonourpresentedgenericRSMmodelcalledRSM-d,probabilisticmodelsarebuilttoquantifyconsistencyandlatency.Weshowthatbothareaffectedbyd,whichisthenumberofACKsreceivedbythecoordinatorbeforecommittingawriterequest.Andwefurtherdefineapayoffmodelthroughcombiningtheconsistencyandlatencymodels.Finally,withMonteCarlobasedsimulation,wevalidateourpresentedmodelsandshowtheeffectivenessofoursolutionsintermsofhowtoobtainanoptimaltradeoffbetweenconsistencyandlatency.—consistency;writeconflict;latency;replicatedPracticaldistributedsystemsinCloudandBigDatasolu-tionsoftenfacemanychallengessuchashighscalabilityandavailability,whicharebuiltonmassivecommoditycomput-ers,disks,networkdevicesandundercomplexmanagementtasks[1,2].Toservemoreusersandprovidehighenoughavailability,servicesanddataaretypicallyreplicatedacrossmultiplevirtualmachines(VMs),physicalmachinesandevengeographically-distributedclusters.TheCAPtheorem[3]showstheimpossibilityofobtainingallthreeofconsistency,availabilityandpartitiontolerancesimultaneously.AndPACELC[4]furtherinterpretsCAP,whichclaimstwotradeoffsincludingconsistency&availabilityandconsistency&latency.Ontheonehand,peoplemakeatradeoffbetweenconsistencyandavailability.Inordertoprovideextremelyhighavailability,manysystemsarealwayswritablewitheventualconsistency,suchasDynamo[5]andCassandra[6];whileothers,likeChubby[7]andZookeeper[8],mustexperienceashortperiodofunavailabilitytorecoverfromfailuresforguaranteeingstrongconsistencyofdifferentreplicas.Ontheotherhand,peoplefocusonthetradeoffbetweenconsistencyandlatency.Forexample,eventuallyconsistentreads,nomatterhowfasttheyare,cannotboundthestalenessasthenewestversionswillbeeventually[9];andstrongconsistencyleadstohigherlatencyfor

duetothewriteuniformity[10].Althoughweakconsistencyhasbeenusedinsomecommercialsystemsandisacceptablebymanypractitionersforhigheravailabilityandlowerlatency,itisnecessarytoboundtheinconsistencyandtoknowhowinconsistencytheweakconsistencyis.Sincelatencyimpactstheend-userexperienceofanIn-ternetapplication,ithas eanimportantsystemmetricformodernserviceproviders[11].Somestatisticalresultspre-sentedin[12]showthatend-usersareverysensitivetosystemlatency.Forinstance,atMicrosoftBing,a2-secondslowdownreducesqueries/userby1.8%andrevenue/userby4.3%;alatencyincreasefrom400msto900mswithGooglesearchresultsina25%dropoffinpagesearches;andhasa1%dropinsaleswitha100mslatencyincrease[13].Thereforelatencyisanimportantfactorinsystemdesign.Inthiswork,wefocusonthetradeoffbetweenconsistencyandlatencyindistributedreplicationprotocols.Twopopularfault-toleranceapproachesindistributedsys-temsanddatabasesarereplicatedstatemachines(RSMs)[14]andquorumsystems[15].RSMsdescribedesirablereplica-tionsemanticstomakeoperationscommittedintotalorder.QuorumsystemsdefinesetsofreplicasWandR,whereWisusedforwriteandRforread.Andtheycommitwritesbyvectorclocks[16]withsemanticreconciliation[5]incausalorder.Forquorum-likesystems,severalworks[17,18]boundreadstalenessfromtheaspectsofdataversions,staletimeandstalenessprobabilitytodescribetheinconsistency,andthendiscussthetradeoffbetweenconsistencyandlatency.However,littlehasbeendoneonquantifyinginconsistencyandlatencyforRSMs,whiletheyaredesignedtoproviderelativelystrongerconsistencythanquorumsystems.ForreadoperationsofRSMs,wecanborrowanalysistechniquesfromthequantitativereadstalenessinquorumsystems.ButforwriteoperationsofRSMs,weneedanewanalyticalandquantitativemethodtomeasurethedegreeofwriteinconsistency.ThenwecanrationallymakeatradeoffbetweenconsistencyandlatencyforRSMsbasedonthequantitativeresultofwriteAlthoughtherearesomeexistingalternativemodelstodescribewriteinconsistencyintheabsenceofcrashfailures,e.g.,writeconsistencyconstraint[19,20]andlimitedwritedivergencyofreplicas[21,22],wewanttoknowthedegreeofwriteinconsistencyforRSMswhenencountering978-1-4799-0181-4/13/$31.00?2013failures,ratherthanhowtosatisfyconstraintsorboundthemaximumdeviationofadataitem.ThereforewequantifywriteinconsistencyforRSMsbytheprobabilityofwriteconflictwhichrepresentstheprobabilityofawritecommittedinanabnormalorder.Inthispaper,wepresentaquantitativeprobabilisticanaly-sisapproachtomeasuringtheconsistencyandlatencyofRSM-s,andfurtherprovideasolutionformakingtradeoffbetweenconsistencyandlatencytoachievethemaximalsystembenefit.First,wehavesurveyedandanalyzedthreerepresentativeRSMimplementationsincludingnon-uniformtotalorderbroadcast[23,24],distributedconsensus(Paxos)[25,26]anduniformtotalorderbroadcast[23,24],aswellastheirvariants[27-29].ThenweproposeagenericsystemmodelRSM-d,whered∈[1,(nisthenumberofreplicas)representsthenumberofthatmustbereceivedbeforecommittingawrite.RSM-dcandescribethethreerepresentativeoneswhered=1,[n/2]+1,nrespectively.Second,wemeasurethewriteinconsistencyofRSM-dbasedonaprobabilisticmodel.Thewriteincon-sistencyisquantifiedbytheprobabilityofwriteconflictwhichrepresentstheprobabilityofanywritecommittedinanabnormalorder.Ourprobabilisticmodelshowsthat(1)consistencyincreaseswhendrisesifd∈[1,[n/2]+1]andstrongconsistencyisalwaysguaranteedifd∈[[n/2]+1,n].Third,weevaluatethelatencyofRSM-dthroughitsexpectationwhichshowsthatthelatencystrictlymono-tonicallyincreaseswithrespecttod.Fourth,combiningthequantitativeresultsofwriteinconsistencyandlatencyofRSM-d,weprovideasolutionfortradeoffbetweenconsistencyandlatencytoachievethemaximalsystembenefit.Finally,throughMonteCarlobasedeventdrivensimulations,wevalidateoursolutionfortradeoffbetweenconsistencyandlatencyfromtheaspectofoverallsystembenefit.WemakethecontributionsasWepresentagenericsystemmodelRSM-dforRSMstogiveanunifieddescriptionofmajorreplicationprotocolsusingRSMs;OnthebasisofRSM-d,webuildaprobabilisticmodelforwriteconflictthatisoneofthekeyfactorstocauseinconsistency,andaprobabilisticmodelforcharacterizinglatencyaswell.Wefindoutthatbothconsistencyandlatencyareaffectedbythecommonfactord.Wefurtherdefineapayoffmodeltomaketradeoffbetweenconsistencyandlatencyforachievingthemaximumsystembenefitthroughcombiningthewriteconflictmodelandlatencymodelfromtheperspectiveofaserviceprovider;WithMonteCarlobasedeventdrivensimulations,wevalidateourquantitativeresultsandshowtheeffectivenessofourpresentedsolutionsintermsofhowtoobtainanoptimaltradeoffbetweenconsistencyandlatency.Theremainderofthispaperisstructuredasfollows.SectionIIintroducesthebackground.SectionIIIpresentsoursystemmodel.SectionIVdescribeshowtoquantifytheprobabilityofwriteconflict.Thecalculationoflatency

presentedinSectionV.SectionVIshowsthetradeoffofconsistencyandlatency.SectionVIIprovidesourexperimentalresults.RelatedworkanddiscussionarepresentedinSectionInthissection,wepresentthebackgroundofRSMsarewidelyusedindistributedsystemswiththetargettotoleratefailures.Failuremodesfallintonon-Byzantine(fail-stop,crashandmessageloss)andByzantine(signedmessagesornot).Withthecomplexitythatatleast3f+1nodesfortoleratingfByzantinefailuresandthelowoccurrenceprobabilityofByzantinefailures,commonRSMsonlyconsidernon-Byzantineones.Andmessagelossfailurescanbeeasilyeliminatedbynetworkprotocols,sopractitionersusuallysaythatRSMscantoleratecrashfailures.ChandraandToueg[30]haveprovedthatthetotalorderbroadcastandthedistributedconsensusproblemsaresolvableandequivalenttoeachotherunderWfailuredetectorsandatmost[n/2]simultaneouscrashfailures.Inthiswork,wealsofollowtheseassumptions.ItmeansthatwediscussRSMswhichtolerateuptofcrashfailureswith2f+1replicasandcaneventuallydetectthecrashofnodes.However,practicalfailuredetectorsmustoutputaresult(crashornot)inaboundedtimeotherthanalong-waitingforeventualconclusions.[31]classifiesthequalityofserviceofpracticalfailuredetectorsintotwotypes:speed,i.e.,howfastafailuredetectordetectsacrash;andaccuracy,i.e.,howwellitavoidsfalsesuspicion.ThereforepracticalfailuredetectorsofRSMshaveaprobabilitytofalselysuspectonanormalAssurveyedin[23-29],RSMsaretypicallyimplementedbynon-uniformtotalorderbroadcast,distributedconsensusanduniformtotalorderbroadcast.AsshowninFigure1,thethreerepresentativeRSMimplementationsworkasfollows:Innon-uniformtotalorderbroadcast,onceacoordinatorreceivesawrite,itassignsthewriteauniquesequencenumberandthendirectlycommitsittoallreplicas.Whenthewritearrivesatareplica,thereplicawillrecordthewrite,andcommititaccordingtoitssequencenumberandthenrespondtothecoordinator.Oncethecoordinatorhasreceivedthefirstresponse,itwillreplytotheclient.Notethatanytimeacoordinatorissuspectedascrashedbyfailuredetectors,thenewcoordinatorwillbeelected[32]andrecoversfailuresbasedonthehistoricalcommitrecords.FordistributedconsensusbasedRSMs,onceacoordi-natorreceivesawrite,atfirstitpropagatesthewritewiththegivensequencenumbertoallreplicas.Afterloggingthewrite,everyreplicawillsendanacknowledgement(ACK)messagetothecoordinator.Untilatleast[n/2]+1ACKs(includingthecoordinatoritself)arereceivedbythecoordinator,itwillcommitthewrite.Whenthecoordinatorissuspectedascrashedbyfailuredetectors,thenewcoordinatorwillbeelectedandrecoverfailuresbasedonthehistoricallogrecords(notthehistoricalcommitrecords).Uniformtotalorderbroadcastisverysimilartodistribut-edconsensus,butthecoordinatormustwaitforallnACKstobereturned.Fig.1.ThreeRepresentativeRSMAssumingwithoutlossofgeneralitythecoordinatorsendsitselfanACKmessage,wecandiscoverthatanobviousdif-ferenceamongthethreerepresentativeRSMimplementationsisthenumberofACKs(denotedbyd)thatthecoordinatorwaitsforbeforecommittingawrite.Andthevaluesofdfornon-uniformtotalorderbroadcast,distributedconsensusanduniformtotalorderbroadcastare1,[n/2]+1andn,respectively.Thuswecaneasilyconcludethatthelatenciesofthemfromlowtohigharenon-uniformtotalorderbroadcast,distributedconsensusanduniformtotalorderbroadcast.Intermsofconsistency,distributedconsensus(Paxos)hasshoweditssafety(strongconsistency)in[33].Thatis,RSMsarealwaysconsistentif[n/2]+1≤d≤n.However,fornon-uniformtotalorderbroadcast,ifacommittingwriteislostinhistoricalcommitrecordsduetocrashfailures(i.e.,allcommittednodesincludingthecoordinatorcrash)orfailuredetectiononthecoordinatorhappens,thesequencenumberheldbythewriteorusedbythefalselysuspectedcoordinatorwillbereusedtokeepsystemliveness.Thesewillleadtowriteconflictandthenresultinwriteinconsistency.RSM-d:AGENERICRSMInthissection,wewillpresentoursystemmodelRSM-d,whichisthe ionofRSMimplementations.SinceourdesiredsystemmodelshouldcoverthethreerepresentativeRSMs,wehavesurveyedthemandtrytoderivethesystemAlthoughwepresent(non-uniformanduniform)totalorderbroadcastalgorithmsinsectionII,theyareonlyoneoffiveclasses.[23]providesacomprehensivesurveyfortotalorderbroadcast,consideringallfiveclassesoforderingmechanismsandbothnon-uniformanduniformalgorithms.Thefirstthreeorderingmechanisms:fixedsequencer(showninsectionII),movingsequencerandprivilege-basedarebuiltbasedonafixedsequenceroramovingtoken;Theothertwo:communicationhistoryanddestinationagreementareimplementedbasedonatotalorderlogicalclock[16].AndthesolutionofdistributedconsensusforRSMisthesamewiththedestinationagreement,wherePaxosisusedtomakeagreementontheorderforawrite,andmulti-paxos[25]forallwrites.Essentiallyweadda’sequencer’whichcancomparetheordersofanytwowritestocausalorder,andthengaintotalorder.Basedontheanalysisabove,weareconcernedwith

Fig.2.RSM-dtotalorderbroadcastwiththefixedsequencerandthePaxosalgorithm.Thereasonsare:(1)thefixedsequenceralgorithmhasthelowestlatencyamongalltotalorderbroadcastones(onpoint-to-pointnetworks)[24];(2)Paxosisprovedtobeoptimalingeneralconsensusalgorithms[34];(3)boththetokeninthemovingsequencerandprivilege-basedalgorithms,andthelogicalclockinthecommunicationhistoryanddesti-nationagreementalgorithms,canbeseenas”anotherfixedThenwecombinetotalorderbroadcastwiththefixedsequencerandthePaxosalgorithmtoanunifiedsystemmodelRSM-d,whichisfeasiblesincetotalorderbroadcastanddistributedconsensusareessentiallyequivalenttoeachother[30].AsdepictedinFigure2,RSM-dincludestheagreementphaseandthecommitphase.Intheagreementphase,anywritemustbeb

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論