LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Xiaonan Nie1  Qibin Liu1  Fangcheng Fu1  Shenhan Zhu1  Xupeng Miao2
Xiaoyang Li3  Yang Zhang3  Shouda Liu3  Bin Cui1

arXiv:2411.08446v1 [cs.DC] 13 Nov 2024

1Peking University  2Purdue University  3ByteDance
1{xiaonan.nie, 2101212782, ccchengff, shenhan.zhu, bin.cui}@.cn
2xupeng@  3{lixiaoyang.x, zhangyang.elfin, liushouda}@
Abstract
Larger transformer models consistently perform better on various tasks but require greater costs to scale up the model size. To enlarge models efficiently, the mixture-of-experts (MoE) architecture is widely adopted: it consists of a gate network and a series of experts, and keeps the training cost constant by routing the input data to a fixed number of experts instead of all of them. In existing large-scale MoE training systems, experts are distributed among different GPUs for parallelization, so input data requires additional all-to-all communications to access the target experts and conduct the corresponding computations. However, upon evaluating the training process of three mainstream MoE models on commonly used GPU clusters, we found that the all-to-all communication ratio averaged around 45%, which significantly hinders the efficiency and scalability of training MoE models. In this paper, we propose LSH-MoE, a communication-efficient MoE training framework using locality-sensitive hashing (LSH). We first present the problems of scaling MoE training in existing systems and highlight the potential of exploiting token similarity to facilitate data compression. Then, we introduce an efficient LSH-based compression technique, which utilizes cross-polytope hashing for rapid clustering and implements a residual-based error compensation scheme to alleviate the adverse impact of compression. To verify the effectiveness of our method, we conduct experiments on both language models (e.g., RoBERTa, GPT, and T5) and vision models (e.g., Swin) for pre-training and fine-tuning tasks. The results demonstrate that our method substantially outperforms its counterparts across different tasks, achieving a speedup of 1.28×-2.2×.
1 Introduction
In recent years, large-scale pre-trained models have significantly advanced the performance of deep learning across various complex tasks, including computer vision [8, 20], natural language processing [3, 7, 28], and multi-modal learning [19]. Commonly referred to as foundation models, these pre-trained models are primarily built on Transformer architectures [34] and undergo extensive pre-training on large datasets, utilizing substantial GPU resources. OpenAI has validated the scaling law for large language models [15] and suggests that increasing the model's parameter size, the volume of training data, and the duration of training can significantly enhance the model's performance. However, this approach results in a considerable rise in training costs, making the development of foundation models extremely expensive.
Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, and Bin Cui are with the School of Computer Science and Key Lab of High Confidence Software Technologies (MOE), Peking University. Bin Cui is also with the Institute of Computational Social Science, Peking University (Qingdao).
38th Conference on Neural Information Processing Systems (NeurIPS 2024).
To reduce the high computational costs, the sparse mixture-of-experts (MoE) architecture is often adopted, which comprises a sparse gate network and a series of expert networks. This architecture routes input data to only a subset of experts, resulting in sparse activation of the experts and thereby reducing the model's computational FLOPs (floating point operations) as well as training costs. Prominent models such as Google's Switch-Transformer [9], ST-MoE [41], Meta's Hash Layer [31], and Mistral AI's Mixtral models [14] have successfully implemented this design, demonstrating improvements in both performance and efficiency with MoE models.
Meanwhile, effectively scaling the training of MoE models across hundreds or even thousands of GPUs remains a significant challenge. Researchers from Google have proposed the expert parallelism approach [17], which replicates the gating network on each GPU and distributes different experts across multiple GPUs for parallel processing. Specifically, each input token is initially processed by the gating network to select the appropriate expert, after which it is routed to the designated experts via peer-to-peer (P2P) network communication. Once the designated experts complete their computation, the token is returned to the original GPU for further processing through an additional P2P communication. Since each GPU typically needs to exchange data with many other GPUs, these P2P transmissions result in an all-to-all communication pattern. Moreover, because the computation of the expert network relies on the outcomes of these communications, the communications cannot be effectively overlapped with ongoing computations. This dependency creates a significant performance bottleneck in model training across most commonly used GPU clusters. We conducted experiments on three widely used MoE models, including RoBERTa-MoE, GPT-MoE, and Swin-MoE, on four A100 servers, each with a cross-machine bandwidth of 200 Gb/s. The results, as shown in Figure 3, reveal that the time cost of all-to-all communication constitutes an average of 45% and can reach up to 67% of the total model training time.
Existing methods to improve distributed MoE training on bandwidth-limited clusters tackle communication challenges in various ways. TA-MoE [4] reduces cross-machine communication by adjusting the gating network to favor experts on the same server, while Pre-gated MoE [13] reduces the dependency between communication and computation through a pre-gating mechanism that plans token routing in advance. However, both approaches require modifications to the gating mechanism and model structure, limiting their universal applicability. DeepSpeed-MoE [29] introduces PR-MoE, which selects one expert plus a shared expert, halving the all-to-all communication load. SCoMoE [40] organizes all-to-all communication by structuring data transfers along different dimensions and controlling data volumes across network levels, and also clusters tokens to improve routing. However, none of these works consider reducing the all-to-all communication volume in MoE training by compressing the forward activations. Therefore, they can be integrated with our method for further improvement.
In this paper, we present LSH-MoE, a communication-efficient MoE training framework that leverages locality-sensitive hashing to group similar tokens. Our key contributions are as follows:
- We begin by identifying key challenges in scaling MoE training in existing systems, noting that all-to-all communication constitutes an average of 45% of the total training time. Additionally, we investigate the potential of using token similarity to facilitate data compression to reduce communication costs.
- We propose an efficient LSH-based compression technique that employs cross-polytope hashing for rapid clustering. This approach transmits only the clustering centroids, significantly reducing communication costs. To further enhance accuracy, we implement a residual-based error compensation scheme to mitigate the negative effects of compression.
- Through extensive experiments with language models (RoBERTa-MoE, GPT-MoE, and T5-MoE) and vision models (Swin-MoE), across both pre-training and fine-tuning tasks, we demonstrate that our method maintains model quality while achieving a speedup of 1.28×-2.2× in end-to-end training time.
2 Background
2.1 Mixture-of-Experts Architecture
To enhance the training efficiency of Transformer models, Fedus et al. (2022) [9] introduced an innovative paradigm, the sparse mixture-of-experts (MoE) architecture, illustrated in Figure 1.
Figure 1: Mixture-of-Experts on a single GPU.

Figure 2: Training Mixture-of-Experts on multiple GPUs with expert parallelism.
This architecture effectively balances parameter capacity and training costs, and comprises two key components: an expert network (E) and a sparse gate network (G). It is evident that MoE models, with an equal number of active parameters per input, can significantly surpass the performance of dense models. This breakthrough has also catalyzed further research and their application across various industries, as highlighted by numerous subsequent studies [5, 14, 22, 23, 25, 30, 39].
The expert network E is composed of multiple specialized and separate networks, commonly referred to as experts, denoted as {E_i}_{i=1}^{N}, where N represents the number of experts. Additionally, E_i(x) denotes the output produced when the input x is processed by the i-th expert. Each expert is trained to excel in a specific sub-task, such as in multi-task learning, or to handle specific segments of data, as seen in language modeling and multi-modal learning, thereby increasing the overall model capacity. In foundational models, the MoE layer often serves as a substitute for the traditional feed-forward network (FFN) layer. Within each MoE layer, each FFN function works as an individual expert, significantly enhancing the model's capability to process diverse and complex data inputs.
The gating network G plays a crucial role in the sparse MoE architecture. For example, in a K-way gated MoE system, the gating network outputs a set of integers, as in Equation 1, to determine which experts should be activated. This decision is based on the characteristics of the input itself, allowing for a dynamic and efficient allocation of computational resources. By processing each input token with only a selected subset of the expert network, the MoE model achieves computational sparsity, effectively decoupling parameter capacity from training costs.
G : ℝ^M → [1, N]^K    (1)
Through the integration of multiple specialized experts, as described by Equation 2, the sparse MoE model is capable of delivering more accurate and efficient predictions as f(x). This is achieved by leveraging the specialized knowledge embedded within each expert, combined with the strategic input allocation managed by the gating network.
f(x) = Σ_{i ∈ G(x)} E_i(x)    (2)
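For concreteness, the following is a minimal sketch of a sparse MoE layer implementing Equations 1 and 2 with a softmax top-K gate; the class name, the FFN expert shape, and the dense per-expert loop are illustrative choices rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal sketch of a sparse MoE layer (Equations 1 and 2); structural choices are illustrative."""
    def __init__(self, hidden_size, num_experts, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts)   # gating network G
        self.experts = nn.ModuleList(                     # expert networks E_1..E_N (FFN experts)
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts))

    def forward(self, x):                                  # x: [num_tokens, hidden_size]
        scores = F.softmax(self.gate(x), dim=-1)           # routing probabilities per expert
        weights, indices = scores.topk(self.top_k, dim=-1) # G: R^M -> [1, N]^K (Equation 1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                        # f(x) = sum_{i in G(x)} E_i(x) (Equation 2)
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

In a distributed setting, the per-expert loop above is replaced by the dispatch/combine communication described in Section 2.2.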
While MoE's primary advantage is decoupling parameter capacity from network cost, a key challenge lies in learning the gating parameters effectively, as the output's sparsity makes it non-differentiable. Consequently, much of the research in the MoE field has centered on developing methods for learning gating functions. These methods fall into three main categories, as outlined in [6]: routing via learnable weighting [9, 24, 30], deterministic hash routing [31], and reinforcement learning-based routing [2, 32, 33]. These approaches primarily differ in the design of the gating network G rather than the expert network E, and therefore all encounter similar scaling challenges.
2.2 Challenges of Scaling MoE Model Training
While MoE models were initially developed to facilitate efficient scaling during training, deploying these large-scale models in practical GPU-intensive environments poses significant challenges in distributed computing. Specifically, the MoE layer harbors a considerably higher number of parameters and requires additional memory, yet it maintains almost the same computational demands as the dense layer. This leads to a unique compute density, defined as the ratio of the layer's FLOPs (Floating Point Operations) to its number of parameters. Therefore, traditional parallelism methods such as tensor parallelism and pipeline parallelism are insufficient for achieving effective parallelism in the scenarios of MoE training.

Figure 3: Proportion of all-to-all communication time relative to total training duration across different configurations: (a) the baseline on 16 GPUs, (b) 32 GPUs (scaling the number of training servers, Figure 3(b)), and (c) 16 GPUs with the "-Wide" variants (scaling the parameter size of models by doubling the number of experts, Figure 3(c)).
To improve the efficiency and scalability of training large-scale MoE models, expert parallelism [17] has been introduced as a specialized model parallelism strategy. This approach distributes experts within an MoE layer across multiple GPUs, while leveraging data parallelism for replicating non-MoE layers, thus efficiently managing the training workload of MoE models. The workflow of distributed training for an MoE layer is depicted in Figure 2. Once the target expert for each token is determined, an all-to-all communication process is triggered to distribute tokens to their corresponding target experts for computation, denoted as E_i(x). Subsequently, another round of all-to-all communication is executed to gather the outputs from all experts, which produces the MoE layer's output (represented as f(x) in Equation 2). Subsequent operations involve executing the data-parallel non-MoE layers.
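To illustrate the two communication phases described above, here is a minimal sketch built on torch.distributed.all_to_all_single. It assumes one expert per GPU, an already-initialized process group, and tokens that have been grouped by target expert and padded to a fixed capacity; real systems handle capacity and load balancing more carefully.

```python
import torch
import torch.distributed as dist

def moe_dispatch_combine(tokens_per_rank, expert_fn):
    """Sketch of the two all-to-all phases of expert parallelism (one expert per GPU).

    tokens_per_rank: [world_size, capacity, hidden] -- tokens this rank routes to each peer,
                     already grouped by target expert and padded to a fixed capacity.
    expert_fn:       the expert network E_i hosted on this GPU.
    Assumes torch.distributed has been initialized (e.g., with the NCCL backend).
    """
    recv = torch.empty_like(tokens_per_rank)
    # First all-to-all: send each bucket to the GPU that hosts its target expert.
    dist.all_to_all_single(recv.flatten(0, 1), tokens_per_rank.flatten(0, 1))
    # Local expert computation E_i(x) on the tokens received from all peers.
    processed = expert_fn(recv.flatten(0, 1)).view_as(recv)
    # Second all-to-all: return the expert outputs to the GPUs that own the tokens.
    combined = torch.empty_like(processed)
    dist.all_to_all_single(combined.flatten(0, 1), processed.flatten(0, 1))
    return combined  # combined[r] holds outputs for the tokens originally sent to rank r
```

Because the expert computation cannot start before the first exchange finishes, neither all-to-all overlaps with useful compute, which is the bottleneck quantified below.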
We first profiled the training process of three popular MoE models employing expert parallelism (detailed in Table 1) on a cluster comprised of four A100 machines, each equipped with an interconnect RDMA bandwidth of 200 Gb/s. The proportion of all-to-all communication time relative to the total training duration is illustrated in Figure 3(a). We then double the number of machines and the number of experts to increase the model scale; the results are shown in Figure 3(b) and Figure 3(c), respectively. Our findings reveal that all-to-all communication accounted for a substantial portion of the total time: approximately 30% in GPT-MoE (15B), 40% in RoBERTa-MoE, and 70% in Swin-MoE-L, and this overhead remains nearly constant in larger models and at larger machine scales. These results highlight a significant bottleneck that hampers the scalability of the training process. Consequently, the duration of all-to-all communication substantially constrains training with expert parallelism, leading to reduced overall throughput and limiting the potential to scale up the number of experts effectively.
2.3 Locality-Sensitive Hashing Algorithms
Locality-Sensitive Hashing (LSH) is a probabilistic method primarily used for approximate nearest neighbor search in high-dimensional spaces. It reduces the dimensionality of data by mapping similar data to the same "buckets" with high probability using hash functions. This approach offers a substantial reduction in computational complexity, which is particularly beneficial for large-scale data applications. The key operations in LSH include:
Mapping Data into Buckets: The core of LSH is a family of hash functions that maximize the probability of nearby points in the original space staying close in the hashed space, while distant points are likely to end up in different buckets. Each hash function h is characterized by the property P[h(x) = h(y)] = 1 − d(x, y)/D, where d(x, y) is the distance between points x and y, and D denotes the diameter of the space. To map similar data into the same bucket, multiple hash functions from this family are selected based on the specific attributes of the data (e.g., Euclidean distance, cosine similarity) and the desired granularity of the buckets. Data points are then hashed by these functions, and each point is assigned to buckets according to its hash values, effectively categorizing similar items together for clustering.

Figure 5: Schematic of MoE training with Locality-Sensitive Hashing (LSH-MoE).
Calculating Cluster Centroids: By grouping data points into buckets as determined by their hash values, data points are effectively clustered. Each bucket represents a cluster of data points, and the centroid of each cluster is then calculated as the mean of all points within that cluster, formulated as C_j = (1/n_j) Σ_{x_i ∈ bucket_j} x_i, where C_j is the centroid of the j-th bucket, n_j is the number of points in the j-th bucket, and x_i are the data points in the bucket.
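As a concrete, simplified illustration of these two operations, the snippet below buckets tokens with a random-hyperplane sign hash and then averages each bucket. The hyperplane family and the helper names are assumptions chosen for illustration, not the cross-polytope hash the paper ultimately adopts (Section 3.2).

```python
import torch

def lsh_bucketize(x, num_hashes=8, seed=0):
    """Assign each row of x to a bucket with a random-hyperplane (sign) LSH family -- illustrative only."""
    g = torch.Generator(device=x.device).manual_seed(seed)
    planes = torch.randn(x.shape[-1], num_hashes, generator=g, device=x.device, dtype=x.dtype)
    bits = (x @ planes > 0).long()                          # [num_tokens, num_hashes] sign bits
    powers = 2 ** torch.arange(num_hashes, device=x.device)
    return (bits * powers).sum(-1)                          # one integer bucket id per token

def bucket_centroids(x, bucket_ids):
    """Compute the mean of all points sharing a bucket: C_j = (1/n_j) * sum of x_i in bucket j."""
    uniq, inverse = bucket_ids.unique(return_inverse=True)
    centroids = torch.zeros(len(uniq), x.shape[-1], device=x.device, dtype=x.dtype)
    centroids.index_add_(0, inverse, x)                     # sum the points of each bucket
    counts = torch.bincount(inverse, minlength=len(uniq)).clamp(min=1)
    return centroids / counts.unsqueeze(-1), inverse        # centroids and each token's cluster index
```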
3 Methodology
3.1 The Motivation of Token Similarity
To explore the potential for optimizing all-to-all communications in MoE training, we conducted an in-depth analysis of the data involved in these communications and identified a high degree of similarity, termed token similarity. Specifically, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the input tokens of all-to-all communications and observed a distinct clustering phenomenon, as illustrated in Figure 4.
Our analysis suggests that the observed similarity among tokens may stem from two primary factors:
- Data-Related Influences: The similarity is partially due to the nature of real-world data, which often adheres to Zipf's Law [18]. This results in a skewed distribution, with certain data elements appearing more frequently than others.
- Model-Structure-Related Influences: The design of the Transformer architecture [34], especially its attention mechanisms, significantly impacts token similarity. In models like BERT [7], attention layers are designed to capture and integrate contextual information across tokens, thus homogenizing token representations and emphasizing their shared semantic relationships at the sentence level.
Figure 4: Principal Component Analysis (PCA) visualization of input tokens involved in all-to-all communication.
3.2 LSH-MoE
Motivated by the token similarity observed in Section 3.1, we introduce LSH-MoE, a novel MoE training framework that integrates locality-sensitive hashing (LSH) for rapid clustering of input tokens. Our method transmits only the clustering centroids, significantly reducing communication volumes. To counteract the negative effects of compression, we also implement a residual-based error compensation scheme.
As depicted in Figure 5, LSH-MoE initially employs (1) an LSH-based clustering method to compress tokens into centroids for subsequent processing, effectively reducing communication overhead. It then sequentially executes (2) all-to-all communication, expert computation, and another (3) all-to-all communication to produce the processed outputs E(centroids). Finally, it introduces (4) a residual-based error compensation method to approximate the expert-processed results E(tokens) by integrating E(centroids) with the residuals. We also outline the workflow of our LSH-MoE framework in Algorithm 1 of Appendix A.1.
The key components of our LSH-MoE framework include an efficient LSH-based clustering algorithm for rapid processing and a residual-based error compensation scheme to minimize quality degradation.
Efficient LSH-based Clustering Algorithm. Since the data to be compressed (the input data for all-to-all communication) is generated dynamically and in real time, pre-compressing it or overlapping compression time with other processing tasks is not feasible. Consequently, selecting an efficient online compression algorithm is crucial. Traditional clustering algorithms, such as K-Means, often encounter computational challenges and efficiency limitations. Locality-sensitive hashing (LSH) addresses these issues by hashing similar data points into the same buckets, enabling faster similarity detection in high-dimensional spaces.
Numerous LSH algorithms have been developed, each employing a unique hashing approach for mapping data onto buckets. We conducted experiments to evaluate several popular hashing algorithms, including cross-polytope hashing and spherical hashing. Based on our evaluations in Section 4.5, we selected cross-polytope hashing as the optimal algorithm for our application. Cross-polytope hashing stands out for its method of mapping input vectors to the nearest vertex of a cross-polytope. This process is facilitated by applying randomly rotated cross-polytopes, which effectively segment the surface of the unit sphere. The algorithm can be mathematically represented as follows:
LSH(x) = argmax_{i ∈ {±1, ±2, ..., ±d}} |Rx|_i    (3)
where R is a random rotation matrix, d is the dimensionality of the space, and |Rx|_i denotes the absolute value of the i-th component of the rotated vector Rx.
This formula encapsulates how the input vector x is transformed by the rotation matrix R and then mapped to the nearest vertex of the cross-polytope by selecting the dimension i that maximizes the absolute value of the components of Rx. This method effectively segments the high-dimensional space and enhances clustering efficiency by rapidly identifying similar data points.
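A direct transcription of Equation 3 into code might look as follows. Here the random rotation R is realized as a random orthogonal matrix obtained from a QR decomposition, which is one common construction and an assumption on our part, since the paper does not spell out how R is drawn.

```python
import torch

def cross_polytope_hash(x, rotation=None, seed=0):
    """Map each row of x to a signed axis, i.e. LSH(x) = argmax_{i in {+-1,...,+-d}} |Rx|_i (Equation 3)."""
    d = x.shape[-1]
    if rotation is None:
        g = torch.Generator().manual_seed(seed)
        q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))   # random orthogonal rotation R
        rotation = q.to(x.device, x.dtype)
    rx = x @ rotation.T                                          # rotated vectors Rx
    idx = rx.abs().argmax(dim=-1)                                # axis with the largest |component|
    sign = rx.gather(-1, idx.unsqueeze(-1)).squeeze(-1) < 0      # True if the nearest vertex is -e_i
    return idx * 2 + sign.long()                                 # encode +-e_i as a bucket id in [0, 2d)
```

Tokens whose rotated representations peak on the same signed axis land in the same bucket and are subsequently averaged into a single centroid.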
Residual-based Error Compensation Scheme. In our LSH-MoE framework, we compress the intermediate activation values within the model network. Unlike gradient compression, this process does not tolerate errors well. Therefore, it is essential to minimize compression-induced errors to ensure minimal impact on model performance. To address this, we implement a novel residual-based error compensation strategy, outlined as follows:
1. We first capture the residual for each data point relative to its cluster centroid, defined by the equation:

   ΔCluster_j^k = X_j^k − cluster_j,   k = 1, 2, ..., N_j,    (4)

   where X_j^k denotes the k-th token assigned to the j-th cluster and N_j is the number of tokens in that cluster.
2. After the expert network computes outputs for the cluster centroids, the final step is to restore the processed result for each token by adding back the previously recorded residual:
Y_j ← {E(cluster_j) + ΔCluster_j^k | k = 1, 2, ..., N_j}.    (5)
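A minimal sketch of these two steps, assuming the per-token cluster indices come from the clustering stage (e.g., the bucket_centroids helper sketched in Section 2.3) and using expert_fn as a stand-in for the full send-compute-return path:

```python
import torch

def lsh_moe_compensation(tokens, centroids, inverse, expert_fn):
    """Residual-based error compensation around an expert call (Equations 4 and 5) -- illustrative sketch.

    tokens:    [num_tokens, hidden]    original inputs of the all-to-all
    centroids: [num_clusters, hidden]  cluster centers that are actually transmitted
    inverse:   [num_tokens]            cluster index of each token
    expert_fn: stand-in for "all-to-all -> expert computation -> all-to-all back"
    """
    residuals = tokens - centroids[inverse]    # Equation 4: residual of each token vs. its centroid
    e_centroids = expert_fn(centroids)         # only the centroids travel across the network
    return e_centroids[inverse] + residuals    # Equation 5: approximate E(tokens)
```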
This error compensation scheme effectively mitigates potential accuracy loss caused by data compression in all-to-all communication, ensuring the fidelity and robustness of the LSH-MoE framework. The experimental results in Section 4 show that implementing this compensation mechanism enables the model trained with LSH-MoE to achieve an accuracy comparable to that of a model trained without compression. This outcome highlights the effectiveness of our proposed error compensation strategy in preserving model performance despite the challenges posed by data compression in all-to-all communication.

Table 1: Models for evaluation, where "-" indicates that the values are different across layers.

| Model | #Layer | d_model | d_ffn | #Experts | #Params. (MoE) | #Params. (Total) |
|---|---|---|---|---|---|---|
| RoBERTa-MoE | 12 | 768 | 3072 | 16 | 302M | 394M |
| T5-MoE | 16 | 1024 | 16384 | 16 | 8594M | 9288M |
| GPT-MoE (15B) | 12 | 768 | 3072 | 512 | 14507M | 14629M |
| GPT-MoE (52B) | 24 | 1024 | 4096 | 512 | 51539M | 51740M |
| Swin-MoE-L | 24 | - | - | 32 | - | 946M |
3.3 Scalability Analysis of LSH-MoE
To effectively demonstrate the scalability of our approach, particularly in terms of its applicability to both larger models and larger computational clusters, we conducted a theoretical analysis. This analysis primarily focuses on the computation overhead and the communication costs associated with Mixture-of-Experts (MoE), specifically considering the all-to-all communication overhead. We derived the ratio of communication time to computation time, highlighting how this ratio evolves as both the scale of the servers and the model size increase. This relationship is crucial for understanding scalability and can be formally expressed as follows:
where k represents the number of experts activated per token, FLOPs and B_inter denote the GPU's computation ability and the network performance, w is the number of GPU servers, and h is the hidden size of the model. Additionally, scaling MoE models typically emphasizes increasing the number of layers and experts, while the growth in hidden size (h) tends to be gradual, as seen in models like Switch-Transformer [9]. Consequently, when both the model scale and the number of training servers grow, the proportion of all-to-all communication time remains nearly unchanged. This insight underpins the scalability of the LSH-MoE method, demonstrating its robustness in larger-scale settings and supporting its potential in future large-scale applications. For a detailed derivation, please refer to Appendix A.2.
4 Experiment
4.1 Implementation
Our LSH-MoE comprises a data compression/restoration component and a communication component. We utilize PyTorch 1.11 for developing the LSH clustering and NCCL for implementing the communication. Additionally, our method is framework-independent and can be easily applied to other MoE training frameworks such as Hetu-MoE [21, 26], DeepSpeed-MoE [29], and Tutel [12].
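Because the number of centroids produced by LSH clustering varies from step to step and from rank to rank, the NCCL all-to-all must handle variable split sizes. The paper does not describe this plumbing, so the sketch below shows one common pattern (exchange the per-rank counts first, then the rows); the function name and data layout are our own assumptions.

```python
import torch
import torch.distributed as dist

def all_to_all_variable(rows, send_counts, group=None):
    """Exchange a variable number of rows (e.g., LSH centroids) with every rank via NCCL all-to-all.

    rows:        [sum(send_counts), hidden] rows ordered by destination rank.
    send_counts: list[int], number of rows destined for each rank.
    Assumes torch.distributed has been initialized with the NCCL backend.
    """
    send_sizes = torch.tensor(send_counts, device=rows.device)
    recv_sizes = torch.empty_like(send_sizes)
    dist.all_to_all_single(recv_sizes, send_sizes, group=group)   # tell peers how many rows to expect
    out = rows.new_empty(int(recv_sizes.sum().item()), rows.shape[-1])
    dist.all_to_all_single(out, rows,
                           output_split_sizes=recv_sizes.tolist(),
                           input_split_sizes=send_counts,
                           group=group)                           # exchange the centroid rows themselves
    return out, recv_sizes.tolist()
```

The same pattern can be applied in reverse to return E(centroids) to the ranks that own the corresponding tokens.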
4.2 Benchmarks and Datasets
Our evaluations are conducted by scaling pre-trained models equipped with the MoE architecture across various application domains. This includes models like RoBERTa-MoE, T5-MoE, and GPT-MoE in natural language processing (NLP), as well as Swin-MoE in computer vision (CV). Among these models, RoBERTa-MoE and T5-MoE are evaluated on pre-training tasks, while GPT-MoE and Swin-MoE undergo fine-tuning evaluation based on their official open-sourced model checkpoints. We also evaluated the zero-shot accuracy of the pre-trained T5-MoE. Model configurations are detailed in Table 1.
1/facebookr