
Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar*+,1, Vincent Zhuang*+,1, Rishabh Agarwal*,1, Yi Su*,1, JD Co-Reyes1, Avi Singh1, Kate Baumli1, Shariq Iqbal1, Colton Bishop1, Rebecca Roelofs1, Lei M Zhang1, Kay McKinney1, Disha Shrivastava1, Cosmin Paduraru1, George Tucker1, Doina Precup1, Feryal Behbahani?,1 and Aleksandra Faust?,1

1Google DeepMind, *Equal Contribution, +Randomly ordered via coin flip, ?Jointly supervised.

arXiv:2409.12917v1 [cs.LG] 19 Sep 2024

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.

1. Introduction

Large language models (LLMs) have proven to be a useful tool in reasoning and scientific domains such as mathematical problem-solving and coding (Lozhkov et al., 2024; Shao et al., 2024; Team, 2024). An aspirational property of LLMs in such settings is to be able to implement algorithms: strategies that help the LLM to use computation and interaction to improve its response on the test-time query. Modern LLMs largely do not implement algorithms reliably: for instance, consider a problem setting that requires models to detect and revise (or "self-correct") their own responses to a given test-time query, so as to be able to eventually arrive at the best-possible final response. This sort of self-correction capability has been shown by several recent works to be severely lacking in current LLMs, especially in the absence of external input (also referred to as intrinsic self-correction) (Huang et al., 2023; Kamoi et al., 2024).

To make progress towards the eventual goal of teaching LLMs to implement algorithms to handle challenging inputs, we study a special instance of training LLMs to implement self-correction strategies to fix their mistakes "on-the-fly". This should be possible: on many queries where current LLMs fail, they still contain the underlying "knowledge" needed to arrive at the correct response but are unable to correctly elicit and draw inferences about their own knowledge when needed (Snell et al., 2024). For example, strong LLMs can often successfully complete a sub-part of a math proof when prompted with the remainder, but may not be able to complete it from scratch. In a similar vein, leveraging their previous responses should, in principle, enable LLMs to improve their subsequent ones.

Corresponding author(s): [vincentzhuang, aviralkumar, rishabhagarwal, yisumtv]@


[Figure 1 plots. Left panel, "Gemini 1.5 Flash: MATH": test accuracy (%) of direct generation vs. self-correction for the base model, STaR, SFT, and SCoRe, with Δ(SC − Direct) annotations of -11.2%, +0.4%, +1.8%, and +4.4% respectively. Right panel, "Scaling Inference Compute: MATH": self-consistency@K accuracy for parallel samples vs. sequential (self-correct) sampling as the number of samples K grows from 2^1 to 2^5.]

Figure 1 | Left: SCoRe achieves state-of-the-art self-correction performance on MATH. Right: SCoRe inference-time scaling: spending samples on sequential self-correction becomes more effective than spending them only on parallel direct samples (Section 6.2).

Nevertheless, self-correction has remained elusive, highlighting the need for going beyond existing training paradigms.

How can we instill LLMs with self-correction abilities? Prior attempts toward self-correcting LLMs either rely on prompt-engineering (Kim et al., 2023; Madaan et al., 2023) or fine-tuning models specifically for self-correction (Havrilla et al., 2024b; Qu et al., 2024; Welleck et al., 2023; Yuan et al., 2024). While the former class of approaches often fail to effectively perform meaningful intrinsic self-correction, existing fine-tuning based approaches require running multiple models upon inference, e.g., a separate verifier or refinement model (Havrilla et al., 2024b; Welleck et al., 2023), or require oracle "teacher" supervision to guide the process of self-correction (Qu et al., 2024), without which self-correction does not necessarily outperform independent uncorrelated attempts at the problem. We develop an approach that is effective at self-correction without any of the aforementioned requirements. Our approach, Self-Correction via Reinforcement Learning (SCoRe), trains only a single model that can both produce a response to a reasoning problem and also correct errors despite not receiving any oracle feedback. More importantly, SCoRe teaches this ability to models entirely by training on self-generated data, without any oracle.

We begin by studying the failure modes of existing fine-tuning based strategies in this setting. We observe that running supervised fine-tuning on multi-turn self-correction traces coupled with rejection sampling (i.e., a "multi-turn" variant of STaR (Zelikman et al., 2022)) often amplifies the model's bias to not make any error corrections. A minimal edit strategy appears somewhat optimal as it inhibits the model from learning to make correct responses worse in the second attempt, even though it does not instill self-correction abilities in the model. If the training dataset for SFT is altered to explicitly down-weight certain correction traces that only make minor edits, then the resulting training is able to avoid collapse. However, it suffers from the curse of distributional shift: a correction strategy learned by training on off-policy data does not necessarily enable the model to succeed at correcting its own mistakes.

How does SCoRe work? SCoRe addresses the aforementioned challenges with SFT by utilizing online multi-turn reinforcement learning (RL). Concretely, SCoRe runs multi-turn RL on self-generated data to avoid challenges with distribution mismatch between training and inference. To avoid the failure mode of learning a minimal edit strategy when training on on-policy data, we train SCoRe in two stages, with each stage regularizing the learning process to not collapse its behavior. The first stage replaces SFT in conventional LLM fine-tuning workflows by training a model initialization that optimizes correction performance while constraining the first attempt to be close to the base model. The second stage runs multi-turn RL to optimize reward at both attempts, while using a reward bonus term that encourages improving responses from the first attempt to the second. Both the initialization and the reward bonus ensure that the model cannot simply learn to produce the best first-attempt response and only minorly edit it. Overall, SCoRe is able to elicit knowledge from the base model to enable positive self-correction.
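To make the two-stage recipe concrete, the sketch below shows one plausible way to compute SCoRe-style per-rollout training signals. The specific bonus form α·(r2 − r1), the coefficient values, and the function names are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of SCoRe-style per-rollout training signals (not the
# paper's code). Assumptions: r1 and r2 are verifier rewards for the first and
# second attempt of a single rollout, kl_first_turn is a KL estimate between
# the trained policy and the base model on the first-turn response, and
# alpha / beta are hypothetical coefficients.

def stage1_objective(r2: float, kl_first_turn: float, beta: float = 0.1) -> float:
    """Stage I: optimize second-attempt correctness while keeping the
    first-attempt distribution close to the base model."""
    return r2 - beta * kl_first_turn


def stage2_rewards(r1: float, r2: float, alpha: float = 1.0) -> tuple[float, float]:
    """Stage II: reward both attempts, plus a progress bonus on the second
    attempt that pays for improving over the first attempt (and penalizes
    regressions), discouraging collapse onto 'repeat the first answer'."""
    bonus = alpha * (r2 - r1)
    return r1, r2 + bonus


# An incorrect-then-corrected rollout earns a larger second-attempt signal
# than merely repeating an already-correct answer.
print(stage2_rewards(r1=0.0, r2=1.0))  # (0.0, 2.0)
print(stage2_rewards(r1=1.0, r2=1.0))  # (1.0, 1.0)
```

In this sketch, fixing a mistake is worth more than restating a correct answer, which is exactly the collapse mode the bonus is meant to discourage.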

Our main contribution is SCoRe, a multi-turn RL approach for teaching LLMs how to correct their own mistakes. To the best of our knowledge, SCoRe is the first approach to attain significantly positive intrinsic self-correction: relative to base Gemini models, our method attains an absolute 15.6% gain on self-correction for reasoning problems from MATH (Hendrycks et al., 2021) and an absolute 9.1% gain on coding problems from HumanEval (Chen et al., 2021). We additionally motivate the design of SCoRe by extensively studying the failure modes of baseline approaches, which broadly indicate that reinforcement learning may play an essential role in self-learned self-correction.

2. Related Work

Prior works study self-correction for LLMs under a variety of assumptions and problem settings. The most prominent problem settings include problems where external input tokens from an environment are available, e.g., agentic tasks (Liu et al., 2023), code repair (Jain et al., 2024), and tool use (Chen et al., 2023). While self-correction with external feedback is possible with strong proprietary models (Pan et al., 2023), even the strongest models struggle in the substantially more challenging setting when no external input is available (Kamoi et al., 2024). This setting is called intrinsic self-correction. Prior work that attempts to amplify intrinsic correction abilities is largely based on prompting and fine-tuning.

Prompting for intrinsic self-correction. Recent work demonstrates that LLMs struggle to self-correct their reasoning errors without external feedback and that na?vely running self-correction can degrade performance (Huang et al., 2023; Qu et al., 2024; Tyen et al., 2024; Zheng et al., 2024). These experimental studies are at odds with prior work (Kim et al., 2023; Madaan et al., 2023; Shinn et al., 2023) and the discrepancy largely stems from mismatched assumptions on the setting (Kamoi et al., 2024). For example, Kim et al. (2023); Shinn et al. (2023) use oracle ground-truth answers during self-correction that may not be available generally. Madaan et al. (2023) use weak prompts for initial responses, thereby perhaps overestimating the improvement possible by self-correction. This indicates that there is no major work showing successful intrinsic self-correction via prompting alone. In the context of code self-repair, Olausson et al. (2023) show that even when strong models are prompted with some form of partial feedback, e.g., showing test-cases but not the desired outcomes on those test-cases, they are often unable to correct their mistakes. Sampling multiple responses in parallel attains much better results in Olausson et al. (2023).

Fine-tuning for intrinsic self-correction. To address the issues with prompting off-the-shelf models alone, several works run supervised fine-tuning (SFT) or weighted SFT on the LLM to generate a revision given an initial response. Nonetheless, typical works in this literature rely on oracle feedback: e.g., obtaining revisions directly from human annotators (Saunders et al., 2022) or stronger models (Qu et al., 2024; Ye et al., 2023). Our work aims to train for self-correction entirely without the use of bigger models or humans, when the learner itself is asked to generate its own training data. Similar to these prior works, we assume access to a reward function for evaluating model-generated outputs, similar to Akyürek et al. (2023); Welleck et al. (2023); Zhang et al. (2024). Other approaches build pipelines with multiple models for self-correction (e.g., GLoRE (Havrilla et al., 2024b), Self-correct (Welleck et al., 2023), or Akyürek et al. (2023); Paul et al. (2023)). While this can lead to good results, these pipelines do not quite tackle self-correction and require system design for serving multiple models at deployment.


[Figure 2 schematic. Standard training: prompt (x) → foundation model → response (y) → loss. Training via SCoRe (ours): prompt (x) and optional external input (p) → foundation model trained with SCoRe → first attempt (y1) → improved response (y2) → loss.]

The problem setting of self-correction. SCoRe trains a model to not just produce the best possible response, but instead aims to train the model to produce the best final response in the final attempt. In the second turn, extra input, in the form of an instruction asking the model to correct itself or other model-generated input, may be provided.

Problem. If n ≡ 2 (mod 7), then find the remainder when (n + 2)(n + 4)(n + 6) is divided by 7.

SCoRe turn 1 solution (incorrect). Since n ≡ 2 (mod 7), we can write n = 7k + 2 for some integer k. Substituting, we get

(n + 2)(n + 4)(n + 6) = (7k + 4)(7k + 6)(7k + 8)
≡ 4 · 6 · 8 (mod 7) ≡ 192 (mod 7)
≡ 1 (mod 7).

Final Answer: The final answer is 1. I hope it is correct.

Self-correction instruction. There might be an error in the solution above because of lack of understanding of the question. Please correct the error, if any, and rewrite the solution.

SCoRe turn 2 solution (correct). Since n ≡ 2 (mod 7), we can write n = 7k + 2 for some integer k. Substituting, we get

(n + 2)(n + 4)(n + 6) = (7k + 4)(7k + 6)(7k + 8) ≡ 4 · 6 · 8 (mod 7)
≡ 4 · 6 · 1 (mod 7) ≡ 24 (mod 7)
≡ 3 (mod 7).

Final Answer: The final answer is 3. I hope it is correct.

Figure 2 | An example trace and the problem setting of self-correction.

Multi-turn RL for LLMs. Our approach utilizes a multi-turn policy gradient approach for training for self-correction, which extends the single-turn approach of Ahmadian et al. (2024) and can be viewed as an instantiation of the hierarchical RL framework from Zhou et al. (2024). Generally, prior work at the intersection of LLMs and multi-turn RL builds value-based (Farebrother et al., 2024; Shani et al., 2024; Snell et al., 2022; Zhou et al., 2024), policy-based (Shao et al., 2024; Xiong et al., 2024), and model-based (Hong et al., 2024) approaches. While this line of work builds machinery to do RL (i.e., optimize rewards) in a multi-turn Markov decision process (MDP), our primary contribution in this paper is to devise a formalization for learning self-correction behavior instead of the RL machinery itself.

Self-correction with external feedback. Many works study self-correction with additional feedback from the environment, most commonly in the setting of code generation, where unit test results or compiler execution feedback are available (Chen et al., 2024; Jain et al., 2024; Olausson et al., 2023). Largely, these works prompt models to reason about code execution; Ni et al. (2024) propose a self-training method that leverages execution traces, though they only evaluate it on correcting a fixed dataset of errors.


3. Preliminaries and Problem Setup

Our goal is to develop an approach for training LLMs to improve their own predictions by entirely training on self-generated data. As discussed so far, we situate ourselves in the intrinsic self-correction setting (Huang et al., 2023), where models attempt to correct their initial responses without any external feedback. Concretely, given a dataset $\mathcal{D} = \{(x_i, y_i^*)\}$ of problems $x_i$ and oracle responses $y_i^*$, we will train an LLM policy $\pi_\theta(\cdot \mid [x, \hat{y}_{1:l}, p_{1:l}])$ that, given the problem $x$, previous $l$ model attempts $\hat{y}_{1:l}$ at the problem, and auxiliary instructions $p_{1:l}$ (e.g., an instruction to find a mistake and improve the response), solves the problem $x$ as correctly as possible. This formalism is akin to the multi-turn MDP in Qu et al. (2024). Moreover, we assume access to a reward function / verifier $\hat{r}(y, y^*)$ (such as a string-matching based answer-checking function) that evaluates the correctness of a response $y$ by comparing it against the oracle response $y^*$. Critically, we do not assume access to such a function at test time, and the model itself learns to deduce whether there was a mistake and corrects it, as is often the case in, e.g., mathematical reasoning problems. An example and overview of our problem setting is given in Figure 2.
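As a concrete illustration, the verifier $\hat{r}(y, y^*)$ assumed above can be as simple as a string-matching answer checker. The sketch below is a minimal, hypothetical version: the "Final Answer: The final answer is ..." parsing convention mirrors the traces in Figure 2 but is not the paper's actual evaluation code.

```python
import re

def extract_final_answer(response: str):
    """Hypothetical parser: pull the answer following 'Final Answer ... is'."""
    match = re.search(r"Final Answer.*?is\s*(.+?)\s*(?:\.|$)", response)
    return match.group(1).strip() if match else None

def verifier_reward(response: str, oracle_answer: str) -> float:
    """String-matching reward r_hat(y, y*): 1.0 if the parsed final answer
    exactly matches the oracle answer, else 0.0."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == oracle_answer.strip() else 0.0

print(verifier_reward("Final Answer: The final answer is 3. I hope it is correct.", "3"))  # 1.0
print(verifier_reward("Final Answer: The final answer is 1. I hope it is correct.", "3"))  # 0.0
```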

We aim to find a model $\pi(y \mid x)$ (which we will also refer to as a policy) mapping a sequence of input tokens $x$ to a sequence of output tokens $y$ that maximizes the correctness reward obtained from the verifier at the end of $l+1$ turns. Formally, this can be written as the following multi-step RL objective:

$$\max_{\pi_\theta} \; \mathbb{E}_{x, y^* \sim \mathcal{D},\; \hat{y}_{l+1} \sim \pi_\theta(\cdot \mid [x, \hat{y}_{0:l}, p_{1:l}])} \big[ \hat{r}(\hat{y}_{l+1}, y^*) \big]. \tag{1}$$

Crucially, note that unlike standard SFT or prevalent RL fine-tuning workflows that train the policy $\pi$ to directly produce an optimal response for an input $x$, Equation 1 trains $\pi$ over multiple turns/attempts simultaneously, where intermediate turn responses $\hat{y}_{1:l}$ are supervised indirectly with the final rewards.
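For intuition, with $l = 1$ (two attempts) a single Monte-Carlo sample of this objective can be obtained by rolling the policy out twice, re-prompting with a self-correction instruction between attempts, and scoring only the final attempt. The sketch below uses a stub in place of the LLM policy and a crude substring check in place of $\hat{r}$; the instruction wording follows Figure 2, and everything else is a placeholder.

```python
import random

SELF_CORRECTION_INSTRUCTION = (
    "There might be an error in the solution above because of lack of "
    "understanding of the question. Please correct the error, if any, "
    "and rewrite the solution."
)  # wording follows the instruction shown in Figure 2

def sample_response(context: str) -> str:
    """Stub standing in for sampling from the LLM policy pi_theta(. | context)."""
    return f"Final Answer: The final answer is {random.choice(['1', '3'])}."

def two_turn_rollout(problem: str, oracle_answer: str) -> float:
    """One Monte-Carlo sample of the Equation (1) objective with l = 1:
    only the second (final) attempt is scored, so the first attempt is
    supervised indirectly through the final reward."""
    y1 = sample_response(problem)                                  # first attempt y_1
    ctx = "\n\n".join([problem, y1, SELF_CORRECTION_INSTRUCTION])  # context [x, y_1, p_1]
    y2 = sample_response(ctx)                                      # second attempt y_2
    return 1.0 if oracle_answer in y2 else 0.0                     # crude stand-in for r_hat(y_2, y*)

print(two_turn_rollout("If n = 2 (mod 7), find the remainder of (n+2)(n+4)(n+6) mod 7.", "3"))
```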

A base RL approach for fine-tuning LLMs. Our RL toolkit is based on on-policy policy gradients. These methods, such as REINFORCE with a KL-divergence penalty against a fixed model (Ahmadian et al., 2024), are widely used in RL fine-tuning of LLMs, primarily in the setting of single-turn RL from human feedback. Formally, such policy gradient approaches train a policy $\pi_\theta(\cdot \mid x)$ to optimize:

$$\max_{\theta} \; \mathbb{E}_{x_t,\, y_t \sim \pi_\theta(\cdot \mid x_t)} \big[ \hat{r}(y_t, y^*) - \beta_1 D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x_t) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x_t) \big) \big], \tag{2}$$

where $\pi_{\mathrm{ref}}$ is a reference anchor policy, typically chosen to be a pre-trained or SFT policy.
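For readers less familiar with this objective, here is a minimal sketch of a single gradient step on Equation 2 using a toy categorical policy over a tiny vocabulary; the batch size, beta1 = 0.1, and the random rewards are arbitrary illustrative choices, not the paper's setup.

```python
# Toy illustration of the REINFORCE-with-KL objective in Equation 2 (not the
# paper's LLM training code): a categorical "policy" over a tiny vocabulary,
# an arbitrary batch size, and random 0/1 rewards standing in for the verifier.
import torch
from torch.distributions import Categorical, kl_divergence

vocab_size, batch = 8, 4
policy_logits = torch.randn(batch, vocab_size, requires_grad=True)  # pi_theta
ref_logits = torch.randn(batch, vocab_size)                         # pi_ref (frozen anchor)

pi_theta = Categorical(logits=policy_logits)
pi_ref = Categorical(logits=ref_logits)

y = pi_theta.sample()                          # sampled "responses" y_t
r_hat = torch.randint(0, 2, (batch,)).float()  # stand-in correctness rewards

beta1 = 0.1
# Surrogate loss: maximize E[r_hat * log pi_theta(y|x)] - beta1 * KL(pi_theta || pi_ref).
loss = -(r_hat * pi_theta.log_prob(y)).mean() + beta1 * kl_divergence(pi_theta, pi_ref).mean()
loss.backward()  # a gradient-ascent (optimizer) step on policy_logits would follow
```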

Metrics. For measuring self-correction performance, we report and analyze the following metrics: (1) Accuracy@t1: the model's accuracy at the first attempt; (2) Accuracy@t2: the model's accuracy at the second attempt; (3) Δ(t1, t2): the net improvement in model accuracy between the first and second attempts, which measures the efficacy of self-correction; (4) Δi→c(t1, t2): the fraction of problems that are incorrect in the first attempt but become correct at the second attempt, which measures how many new problems self-correction can solve; and (5) Δc→i(t1, t2): the fraction of problems that are correct in the first attempt but become incorrect at the second attempt, which measures how well the model is able to understand what makes a response correct.
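These metrics follow directly from per-problem correctness indicators at the two attempts; a minimal sketch (function and key names are ours):

```python
def self_correction_metrics(correct_t1: list[bool], correct_t2: list[bool]) -> dict[str, float]:
    """Compute the five self-correction metrics from per-problem correctness
    flags at the first and second attempts."""
    n = len(correct_t1)
    assert n == len(correct_t2) and n > 0
    acc_t1 = sum(correct_t1) / n
    acc_t2 = sum(correct_t2) / n
    delta_i_to_c = sum(1 for a, b in zip(correct_t1, correct_t2) if not a and b) / n
    delta_c_to_i = sum(1 for a, b in zip(correct_t1, correct_t2) if a and not b) / n
    return {
        "Accuracy@t1": acc_t1,
        "Accuracy@t2": acc_t2,
        "Delta(t1,t2)": acc_t2 - acc_t1,  # equals delta_i_to_c - delta_c_to_i
        "Delta_i2c(t1,t2)": delta_i_to_c,
        "Delta_c2i(t1,t2)": delta_c_to_i,
    }

# Example: 4 problems; one is fixed at the second attempt, one regresses.
print(self_correction_metrics([True, False, True, False], [True, True, False, False]))
```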

4. Supervised Fine-Tuning on Self-Generated Data is Insufficient for Self-Correction

Perhaps a natural approach to train for self-correction is to utilize some form of supervised fine-tuning on data collected from a base model. Variants of this recipe have been shown to scale well in single-turn reasoning problems (Havrilla et al., 2024a; Singh et al., 2023; Zelikman et al., 2022). Can such SFT-based approaches be effective for self-correction as well?


Table 1 | Self-correction performance after training on DSTaR and DSFT. For both approaches, we find that the gap between second-attempt and first-attempt performance (Δ(t1, t2)) is either overly negative or very small. In addition, both approaches erroneously modify a correct response to be incorrect, reflected in a high Δc→i(t1, t2) and a low Δi→c(t1, t2).

Method | Accuracy@t1 | Accuracy@t2 | Δ(t1, t2) | Δi→c(t1, t2) | Δc→i(t1, t2)
--- | --- | --- | --- | --- | ---
Base model | 52.6% | 41.4% | -11.2% | 4.6% | 15.8%
STaR (DSTaR) | 55.4% | 41.2% | -14.2% | 5.4% | 19.6%
Pair-SFT (DSFT) | 52.4% | 54.2% | 1.8% | 5.4% | 3.6%

Table 2 | Self-correction performance after training on the extended versions of DSTaR and DSFT (augmented with correct-to-correct traces). Performance improves for STaR, indicating that a higher-coverage dataset helps improve performance, but not for SFT, where training on traces in which both responses are correct forces the model to simply not make any changes to its first-attempt response, no matter how correct or incorrect that is.

Method | Accuracy@t1 | Accuracy@t2 | Δ(t1, t2) | Δi→c(t1, t2) | Δc→i(t1, t2)
--- | --- | --- | --- | --- | ---
Base model | 52.6% | 41.4% | -11.2% | 4.6% | 15.8%
STaR (extended DSTaR) | 53.6% | 54.0% | 0.4% | 2.6% | 2.2%
Pair-SFT (extended DSFT) | 55.0% | 55.0% | 0% | 0% | 0%

In this section, we perform an empirical study to answer this question. We study two approaches: STaR (Zelikman et al., 2022) and an approach akin to Welleck et al. (2023) that trains only one model. We do not use learned process or outcome verifiers to guide correction traces, so our setup differs from SFT in Snell et al. (2024). We find that such methods improve substantially compared to the base model's self-correction behavior, but still fail to attain a positive self-correction rate and produce a worse second attempt compared to their first attempt. By probing trained models, we find that these failures largely stem from supervised fine-tuning amplifying the initial bias of the base model, resulting in only minor changes to its first-attempt response. While these failures can be addressed if a different distribution over initial responses is used for training, doing so fails to induce effective self-correction behavior under the model's own response distribution. Either way, learning is affected by distribution shift or amplification of the base model's bias. These observations motivate the design of our method in Section 5.

4.1. Analysis Setup: Methods and Dataset Construction

Methods. We prompt off-the-shelf models to obtain a large number of two-turn self-correction traces. The STaR approach, analogous to ReST^EM (Singh et al., 2023), filters these trajectories to only retain those that successfully revise incorrect responses and runs SFT on the resulting dataset. In contrast, Welleck et al. (2023) use the base model data from above to construct sets of correct and incorrect responses and then generate "synthetic" repair traces by pairing incorrect responses with correct ones. We study a variant of their method we call Pair-SFT, which does not train a separate corrector model and does not augment this initial dataset with multi-turn traces.

Dataset construction. We perform our study on the MATH dataset, and generate self-correction traces by prompting Gemini 1.5 Flash (Reid et al., 2024) using temperature 1.0. We construct datasets for STaR and Pair-SFT as follows: (1) DSTaR consists of tuples (x_i, incorrect response, correct response), where the incorrect and correct responses appear within a single sequence of attempts from the current model; and (2) DSFT consists of tuples (x_i, incorrect response, y_i), where y_i is a random correct response for problem x_i, randomly sampled from the set of all first-turn and second-turn responses produced by the model.


(a) Histograms of edit distance ratios on MATH500; (b) STaR edit distance ratios; (c) Pair-SFT edit distance ratios.

Figure 3 | Edit distance between first-attempt and second-attempt responses obtained from fine-tuned models, our approach (SCoRe), and the base model. Observe that while training on self-generated error-correction traces inherits the bi-modal distribution of edits of the base model, SFT tends to be quite conservative.

We then ran supervised fine-tuning on both of these datasets: following Singh et al. (2023), we repeat 3 iterations of collecting and running SFT on DSTaR, but only 1 epoch on DSFT given the large dataset size.
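A schematic version of the two constructions, under the simplifying assumption that each base-model trace records the problem, both attempts, and their verifier outcomes (all field names are hypothetical):

```python
import random

# Each trace: a two-turn self-correction attempt from the base model, with
# verifier correctness flags for each turn (field names are hypothetical).
traces = [
    {"problem": "p1", "attempt1": "a", "ok1": False, "attempt2": "b", "ok2": True},
    {"problem": "p1", "attempt1": "c", "ok1": True,  "attempt2": "d", "ok2": True},
    {"problem": "p1", "attempt1": "e", "ok1": False, "attempt2": "f", "ok2": False},
]

# D_STaR: keep only traces whose second attempt successfully revises an
# incorrect first attempt (rejection sampling on whole traces).
d_star = [(t["problem"], t["attempt1"], t["attempt2"])
          for t in traces if not t["ok1"] and t["ok2"]]

# D_SFT (Pair-SFT): pool all first- and second-turn responses per problem,
# then pair each incorrect response with a randomly chosen correct one.
pool = [(t["problem"], t[f"attempt{k}"], t[f"ok{k}"]) for t in traces for k in (1, 2)]
correct, incorrect = {}, {}
for problem, response, ok in pool:
    (correct if ok else incorrect).setdefault(problem, []).append(response)

d_sft = [(problem, bad, random.choice(correct[problem]))
         for problem, bads in incorrect.items() if problem in correct
         for bad in bads]

print(len(d_star), len(d_sft))  # 1 3 for the toy traces above
```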

4.2. Empirical Findings

We plot the self-correction performance of the Gemini 1.5 Flash model before and after running fine-tuning on DSTaR (3 iterations) and DSFT in Table 1. We find that although Δ(t1, t2) is substantially higher for Pair-SFT relative to the base model, there is still little benefit to doing self-correction (1.8% gain). By considering Δi→c and Δc→i, we find that SFT mainly helps by reducing the number of correct problems that are mistakenly changed to incorrect after revision, and does not significantly increase the fraction of incorrect first attempts that are correctly repaired. This result is consistent with prior studies on intrinsic self-correction that have found negligible or even negative Δ(t1, t2) (Huang et al., 2023; Qu et al., 2024).

We also find that unlike Pair-SFT, training on DSTaR does not reduce Δc→i, indicating that the STaR policy does not have a clear understanding of when and when not to make modifications. We hypothesize that this discrepancy is due to the data distributions of DSFT and DSTaR: the former covers a much more diverse space of revision trajectories due to the nature of random pairing. Observing this, we also trained on an extended version of DSTaR (and also DSFT), which additionally presents more tuples with both correct responses. We would expect the addition of such "correct-to-correct" data to prevent the model from erroneously revising a correct response and, at the very least, restrict the modification of a correct response into only another correct response. As shown in Table 2, perhaps interestingly, we find that including such data has opposite effects on STaR and SFT: for STaR, inclusion of this data helps substantially, though it still results in barely any meaningful self-correction performance. On the other hand, for SFT, inclusion of this data overly biases the model to not change its answer at all.


Diving deeper: analyzing self-correction behavior. We also visualized how the STaR and SFT models edit their responses. In particular, we measured the edit distance ratio, defined as the edit distance between the responses normalized by the total length of both the responses, to summarize the extent to which models modify their first-attempt response. As shown in Figure 3a, while the base model sometimes makes substantially large edits to the original response, models fine-tuned on DSTaR and DSFT are overly conservative, and often make no edits at all. We will show in Section 5 that our proposed method SCoRe is able to avoid amplifying this bias of not making changes, without any explicit training for controlling how much to edit solutions.
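For reference, the edit distance ratio can be computed as below: a Levenshtein distance normalized by the combined length of the two responses. The character-level choice is our assumption; the paper only states the normalized-edit-distance definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def edit_distance_ratio(first_attempt: str, second_attempt: str) -> float:
    """Edit distance normalized by the total length of both responses:
    0.0 means the second attempt repeats the first verbatim; larger values
    mean heavier rewriting."""
    total = len(first_attempt) + len(second_attempt)
    return levenshtein(first_attempt, second_attempt) / total if total else 0.0

print(edit_distance_ratio("the final answer is 1", "the final answer is 1"))  # 0.0, no edits
print(edit_distance_ratio("the final answer is 1", "the final answer is 3"))  # small, one-character edit
```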

