2024-9-20
Training Language Models to Self-Correct via Reinforcement Learning
Aviral Kumar*,+,1, Vincent Zhuang*,+,1, Rishabh Agarwal*,1, Yi Su*,1, JD Co-Reyes1, Avi Singh1, Kate Baumli1, Shariq Iqbal1, Colton Bishop1, Rebecca Roelofs1, Lei M Zhang1, Kay McKinney1, Disha Shrivastava1, Cosmin Paduraru1, George Tucker1, Doina Precup1, Feryal Behbahani†,1 and Aleksandra Faust†,1
1Google DeepMind, *Equal Contribution, +Randomly ordered via coin flip, †Jointly supervised.
arXiv:2409.12917v1 [cs.LG] 19 Sep 2024
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time, as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse, and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
1. Introduction
Large language models (LLMs) have proven to be a useful tool in reasoning and scientific domains such as mathematical problem-solving and coding (Lozhkov et al., 2024; Shao et al., 2024; Team, 2024). An aspirational property of LLMs in such settings is to be able to implement algorithms: strategies that help the LLM to use computation and interaction to improve its response on the test-time query. Modern LLMs largely do not implement algorithms reliably: for instance, consider a problem setting that requires models to detect and revise (or "self-correct") their own responses to a given test-time query, so as to eventually arrive at the best-possible final response. This sort of self-correction capability has been shown by several recent works to be severely lacking in current LLMs, especially in the absence of external input (also referred to as intrinsic self-correction) (Huang et al., 2023; Kamoi et al., 2024).
To make progress towards the eventual goal of teaching LLMs to implement algorithms to handle challenging inputs, we study a special instance of training LLMs to implement self-correction strategies to fix their mistakes "on-the-fly". This should be possible: on many queries where current LLMs fail, they still contain the underlying "knowledge" needed to arrive at the correct response but are unable to correctly elicit and draw inferences about their own knowledge when needed (Snell et al., 2024). For example, strong LLMs can often successfully complete a sub-part of a math proof when prompted with the remainder, but may not be able to complete it from scratch. In a similar vein, leveraging their previous responses should, in principle, enable LLMs to improve their subsequent ones.
Corresponding author(s): [vincentzhuang, aviralkumar, rishabhagarwal, yisumtv]@
Figure 1 ∣ Left: SCoRe achieves state-of-the-art self-correction performance on MATH (Gemini 1.5 Flash test accuracy for direct generation vs. self-correction, with Δ(SC − Direct) of −11.2% for the base model, +0.4% for STaR, +1.8% for SFT, and +4.4% for SCoRe). Right: Scaling inference compute on MATH: spending samples on sequential self-correction becomes more effective than spending them only on parallel direct samples (Self-Consistency@K over the number of samples K; Section 6.2).
Nevertheless, self-correction has remained elusive, highlighting the need for going beyond existing training paradigms.
How can we instill LLMs with self-correction abilities? Prior attempts toward self-correcting LLMs either rely on prompt engineering (Kim et al., 2023; Madaan et al., 2023) or on fine-tuning models specifically for self-correction (Havrilla et al., 2024b; Qu et al., 2024; Welleck et al., 2023; Yuan et al., 2024). While the former class of approaches often fails to perform meaningful intrinsic self-correction, existing fine-tuning based approaches require running multiple models upon inference, e.g., a separate verifier or refinement model (Havrilla et al., 2024b; Welleck et al., 2023), or require oracle "teacher" supervision to guide the process of self-correction (Qu et al., 2024), without which self-correction does not necessarily outperform independent, uncorrelated attempts at the problem. We develop an approach that is effective at self-correction without any of the aforementioned requirements. Our approach, Self-Correction via Reinforcement Learning (SCoRe), trains only a single model that can both produce a response to a reasoning problem and correct errors despite not receiving any oracle feedback. More importantly, SCoRe teaches this ability to models entirely by training on self-generated data, without any oracle.
We begin by studying the failure modes of existing fine-tuning based strategies in this setting. We observe that running supervised fine-tuning on multi-turn self-correction traces coupled with rejection sampling (i.e., a "multi-turn" variant of STaR (Zelikman et al., 2022)) often amplifies the model's bias to not make any error corrections. A minimal-edit strategy appears somewhat optimal to the model, as it inhibits the model from making correct responses worse in the second attempt, even though it does not instill self-correction abilities in the model. If the training dataset for SFT is altered to explicitly down-weight correction traces that only make minor edits, then the resulting training is able to avoid this collapse. However, it suffers from the curse of distributional shift: a correction strategy learned by training on off-policy data does not necessarily enable the model to succeed at correcting its own mistakes.
How does SCoRe work? SCoRe addresses the aforementioned challenges with SFT by utilizing online multi-turn reinforcement learning (RL). Concretely, SCoRe runs multi-turn RL on self-generated data to avoid challenges with distribution mismatch between training and inference. To avoid the failure mode of learning a minimal-edit strategy when training on on-policy data, we train SCoRe in two stages, with each stage regularizing the learning process so that it does not collapse its behavior. The first stage replaces SFT in conventional LLM fine-tuning workflows by training a model initialization that optimizes correction performance while constraining the first attempt to be close to the base model. The second stage runs multi-turn RL to optimize reward at both attempts, while using a reward-bonus term that encourages improving responses from the first attempt to the second. Both the initialization and the reward bonus ensure that the model cannot simply learn to produce the best first-attempt response and only minorly edit it. Overall, SCoRe is able to elicit knowledge from the base model to enable positive self-correction.
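To make the two-stage recipe concrete, the sketch below shows one plausible way the per-episode training signal could be shaped. It is a minimal sketch under my own assumptions: the helper names, the bonus coefficient `alpha`, and the KL weight `beta1` are illustrative, not the paper's exact formulation or hyperparameters.

```python
# Illustrative sketch of SCoRe's two-stage training signal (not the authors' exact code).
# Assumptions: r1, r2 are verifier rewards r̂(y1, y*), r̂(y2, y*) in {0, 1};
# kl_first_attempt estimates KL(pi_theta(.|x) || pi_base(.|x)) on the first turn;
# alpha and beta1 are hypothetical coefficients.

def stage1_signal(r2: float, kl_first_attempt: float, beta1: float = 1.0) -> float:
    """Stage I: optimize second-attempt correctness while constraining the
    first attempt to stay close to the base model via a strong KL penalty."""
    return r2 - beta1 * kl_first_attempt

def stage2_signal(r1: float, r2: float, alpha: float = 0.5) -> float:
    """Stage II: optimize reward at both attempts, plus a shaped bonus that
    rewards improving from the first attempt to the second (and penalizes
    turning a correct first attempt into an incorrect second one)."""
    return r1 + r2 + alpha * (r2 - r1)
```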
Our main contribution is SCoRe, a multi-turn RL approach for teaching LLMs how to correct their own mistakes. To the best of our knowledge, SCoRe is the first approach to attain significantly positive intrinsic self-correction: relative to base Gemini models, our method attains an absolute 15.6% gain on self-correction for reasoning problems from MATH (Hendrycks et al., 2021) and an absolute 9.1% gain on coding problems from HumanEval (Chen et al., 2021). We additionally motivate the design of SCoRe by extensively studying the failure modes of baseline approaches, which broadly indicate that reinforcement learning may play an essential role in self-learned self-correction.
2. Related Work
Prior works study self-correction for LLMs under a variety of assumptions and problem settings. The most prominent problem settings include problems where external input tokens from an environment are available, e.g., agentic tasks (Liu et al., 2023), code repair (Jain et al., 2024), and tool use (Chen et al., 2023). While self-correction with external feedback is possible with strong proprietary models (Pan et al., 2023), even the strongest models struggle in the substantially more challenging setting when no external input is available (Kamoi et al., 2024). This setting is called intrinsic self-correction. Prior work that attempts to amplify intrinsic correction abilities is largely based on prompting and fine-tuning.
Prompting for intrinsic self-correction. Recent work demonstrates that LLMs struggle to self-correct their reasoning errors without external feedback and that naïvely running self-correction can degrade performance (Huang et al., 2023; Qu et al., 2024; Tyen et al., 2024; Zheng et al., 2024). These experimental studies are at odds with prior work (Kim et al., 2023; Madaan et al., 2023; Shinn et al., 2023), and the discrepancy largely stems from mismatched assumptions on the setting (Kamoi et al., 2024). For example, Kim et al. (2023); Shinn et al. (2023) use oracle ground-truth answers during self-correction that may not be available generally. Madaan et al. (2023) use weak prompts for initial responses, thereby perhaps overestimating the improvement possible by self-correction. This indicates that there is no major work showing successful intrinsic self-correction via prompting alone. In the context of code self-repair, Olausson et al. (2023) show that even when strong models are prompted with some form of partial feedback, e.g., showing test cases but not the desired outcomes on those test cases, they are often unable to correct their mistakes. Sampling multiple responses in parallel attains much better results in Olausson et al. (2023).
Fine-tuning for intrinsic self-correction. To address the issues with prompting off-the-shelf models alone, several works run supervised fine-tuning (SFT) or weighted SFT on the LLM to generate a revision given an initial response. Nonetheless, typical works in this literature rely on oracle feedback: e.g., obtaining revisions directly from human annotators (Saunders et al., 2022) or stronger models (Qu et al., 2024; Ye et al., 2023). Our work aims to train for self-correction entirely without the use of bigger models or humans, when the learner itself is asked to generate its own training data. Similar to these prior works, we assume access to a reward function for evaluating model-generated outputs, similar to Akyürek et al. (2023); Welleck et al. (2023); Zhang et al. (2024). Other approaches build pipelines with multiple models for self-correction (e.g., GLoRE (Havrilla et al., 2024b), Self-Correct (Welleck et al., 2023), or Akyürek et al. (2023); Paul et al. (2023)). While this can lead to good results, these pipelines do not quite tackle self-correction and require system design for serving multiple models at deployment.
[Figure 2, left schematic: standard training maps a prompt (x) to a response (y) and a loss for the foundation model; training via SCoRe (ours) produces an improved response (y2) conditioned on the prompt (x), the first attempt (y1), and optional external input (p).]
The problem setting of self-correction: SCoRe trains a model not just to produce the best possible response immediately, but to produce the best final response by the final attempt. In the second turn, extra input may be provided, e.g., an instruction asking the model to correct itself.
Problem. If n ≡ 2 (mod 7), then find the remainder when (n+2)(n+4)(n+6) is divided by 7.

SCoRe turn-1 solution (incorrect). Since n ≡ 2 (mod 7), we can write n = 7k + 2 for some integer k. Substituting, we get
(n+2)(n+4)(n+6) = (7k+4)(7k+6)(7k+8) ≡ 4 · 6 · 8 (mod 7) ≡ 192 (mod 7) ≡ 1 (mod 7).
Final Answer: The final answer is 1. I hope it is correct.

Self-correction instruction. There might be an error in the solution above because of lack of understanding of the question. Please correct the error, if any, and rewrite the solution.

SCoRe turn-2 solution (correct). Since n ≡ 2 (mod 7), we can write n = 7k + 2 for some integer k. Substituting, we get
(n+2)(n+4)(n+6) = (7k+4)(7k+6)(7k+8) ≡ 4 · 6 · 8 (mod 7) ≡ 4 · 6 · 1 (mod 7) ≡ 24 (mod 7) ≡ 3 (mod 7).
Final Answer: The final answer is 3. I hope it is correct.

Figure 2 ∣ An example trace and the problem setting of self-correction.
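As a quick numeric check on this trace (added here purely for illustration; the loop below is not part of the paper), the corrected remainder of 3 holds for every n ≡ 2 (mod 7):

```python
# Verify the Figure 2 example: for n ≡ 2 (mod 7), (n+2)(n+4)(n+6) leaves remainder 3 mod 7.
for k in range(10):              # check a few representatives n = 7k + 2
    n = 7 * k + 2
    assert (n + 2) * (n + 4) * (n + 6) % 7 == 3
print("remainder is 3 for all tested n ≡ 2 (mod 7)")
```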
Multi-turn RL for LLMs. Our approach utilizes a multi-turn policy-gradient approach for training for self-correction, which extends the single-turn approach of Ahmadian et al. (2024) and can be viewed as an instantiation of the hierarchical RL framework from Zhou et al. (2024). Generally, prior work at the intersection of LLMs and multi-turn RL builds value-based (Farebrother et al., 2024; Shani et al., 2024; Snell et al., 2022; Zhou et al., 2024), policy-based (Shao et al., 2024; Xiong et al., 2024), and model-based (Hong et al., 2024) approaches. While this line of work builds machinery to do RL (i.e., optimize rewards) in a multi-turn Markov decision process (MDP), our primary contribution in this paper is to devise a formulation for learning self-correction behavior, rather than the RL machinery itself.
Self-correction with external feedback. Many works study self-correction with additional feedback from the environment, most commonly in the setting of code generation, where unit-test results or compiler execution feedback are available (Chen et al., 2024; Jain et al., 2024; Olausson et al., 2023). Largely, these works prompt models to reason about code execution; Ni et al. (2024) propose a self-training method that leverages execution traces, though they only evaluate it on correcting a fixed dataset of errors.
3. Preliminaries and Problem Setup
Our goal is to develop an approach for training LLMs to improve their own predictions entirely by training on self-generated data. As discussed so far, we situate ourselves in the intrinsic self-correction setting (Huang et al., 2023), where models attempt to correct their initial responses without any external feedback. Concretely, given a dataset $\mathcal{D} = \{(x_i, y_i^*)\}_{i=1}^{N}$ of problems $x_i$ and oracle responses $y_i^*$, we will train an LLM policy $\pi_\theta(\cdot \mid [x, \hat{y}_{1:l}, p_{1:l}])$ that, given the problem $x$, the previous $l$ model attempts $\hat{y}_{1:l}$ at the problem, and auxiliary instructions $p_{1:l}$ (e.g., an instruction to find a mistake and improve the response), solves the problem $x$ as correctly as possible. This formalism is akin to the multi-turn MDP in Qu et al. (2024). Moreover, we assume access to a reward function / verifier $\hat{r}(y, y^*)$ (such as a string-matching based answer-checking function) that evaluates the correctness of a response $y$ by comparing it with the oracle response $y^*$. Critically, we do not assume access to such a function at test time: the model itself must learn to deduce whether there was a mistake and correct it, as is often the case in, e.g., mathematical reasoning problems. An example and overview of our problem setting is given in Figure 2.
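To make this conditioning concrete, here is a minimal sketch of how a two-turn context $[x, \hat{y}_1, p_1]$ could be assembled before querying the policy. The `generate` callable and the exact prompt formatting are assumptions for illustration (the correction instruction itself is quoted from Figure 2); this is not the paper's templates or API.

```python
# Minimal sketch: assembling the multi-turn context [x, ŷ_1:l, p_1:l] for a
# two-turn self-correction episode. generate() and the formatting are
# illustrative assumptions, not the paper's exact templates or API.

SELF_CORRECTION_INSTRUCTION = (
    "There might be an error in the solution above because of lack of "
    "understanding of the question. Please correct the error, if any, "
    "and rewrite the solution."
)

def two_turn_rollout(problem: str, generate) -> list[str]:
    """Sample a first attempt, then a revision conditioned on the problem,
    the first attempt, and the self-correction instruction."""
    context = problem
    y1 = generate(context)                       # first attempt ŷ_1
    context = f"{context}\n\n{y1}\n\n{SELF_CORRECTION_INSTRUCTION}"
    y2 = generate(context)                       # second attempt ŷ_2
    return [y1, y2]
```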
We aim to find a model $\pi(y \mid x)$ (which we will also refer to as a policy), mapping a sequence of input tokens $x$ to a sequence of output tokens $y$, that maximizes the correctness reward obtained from the verifier at the end of $l+1$ turns. Formally, this can be written as the following multi-step RL objective:

$$\max_{\pi_\theta}\ \mathbb{E}_{x, y^* \sim \mathcal{D},\ \hat{y}_{l+1} \sim \pi_\theta(\cdot \mid [x, \hat{y}_{1:l}, p_{1:l}])}\left[\hat{r}(\hat{y}_{l+1}, y^*)\right]. \quad (1)$$

Crucially, note that unlike standard SFT or prevalent RL fine-tuning workflows that train the policy $\pi_\theta$ to directly produce an optimal response for an input $x$, Equation 1 trains $\pi_\theta$ over multiple turns/attempts simultaneously, wherein intermediate-turn responses $\hat{y}_{1:l}$ are supervised indirectly with the final rewards.
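A hedged sketch of a Monte Carlo estimate of this objective for the two-turn case ($l = 1$), reusing the `two_turn_rollout` helper sketched earlier; the `verifier` argument stands in for $\hat{r}(\cdot, y^*)$, and the whole function is illustrative rather than the paper's implementation:

```python
# Sketch: Monte Carlo estimate of the multi-turn objective in Eq. (1) for l = 1.
# Only the final (second) attempt is scored by the verifier r̂(·, y*).
def estimate_multiturn_objective(dataset, generate, verifier, num_samples: int = 4) -> float:
    total, count = 0.0, 0
    for x, y_star in dataset:
        for _ in range(num_samples):
            y1, y2 = two_turn_rollout(x, generate)   # from the earlier sketch
            total += verifier(y2, y_star)            # r̂(ŷ_2, y*)
            count += 1
    return total / max(count, 1)
```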
A base RL approach for fine-tuning LLMs. Our RL toolkit is based on on-policy policy gradients. These methods, such as REINFORCE with a KL-divergence penalty against a fixed model (Ahmadian et al., 2024), are widely used in RL fine-tuning of LLMs, primarily in the setting of single-turn RL from human feedback. Formally, such policy-gradient approaches train a policy $\pi_\theta(\cdot \mid x)$ to optimize:

$$\max_{\theta}\ \mathbb{E}_{x_t,\ y_t \sim \pi_\theta(\cdot \mid x_t)}\left[\hat{r}(y_t, y^*) - \beta_1 D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x_t)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x_t)\big)\right], \quad (2)$$

where $\pi_{\mathrm{ref}}$ is a reference anchor policy, typically chosen to be a pre-trained or SFT policy.
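For reference, the sketch below shows one standard way to turn this objective into a REINFORCE-style loss with a per-sample KL penalty folded into the advantage. It is a generic single-turn example written under my own assumptions (PyTorch, summed sequence log-probabilities, an illustrative `beta` value), not the authors' training code.

```python
import torch

def reinforce_kl_loss(logprobs_policy: torch.Tensor,
                      logprobs_ref: torch.Tensor,
                      rewards: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """REINFORCE-style loss with a KL penalty toward a frozen reference policy.

    logprobs_policy: (B,) summed log-probs of sampled responses under pi_theta
    logprobs_ref:    (B,) summed log-probs of the same responses under pi_ref
    rewards:         (B,) scalar verifier rewards r̂(y, y*)
    beta:            KL coefficient beta_1 (illustrative value, an assumption)
    """
    # Single-sample estimate of KL(pi_theta || pi_ref) on the sampled responses.
    kl_estimate = logprobs_policy - logprobs_ref
    # Fold the KL penalty into the reward and detach, so gradients flow only
    # through log pi_theta(y | x) (the score-function estimator).
    advantage = (rewards - beta * kl_estimate).detach()
    return -(advantage * logprobs_policy).mean()
```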
Metrics. For measuring self-correction performance, we report and analyze the following metrics: (1) Accuracy@t1: the model's accuracy at the first attempt; (2) Accuracy@t2: the model's accuracy at the second attempt; (3) Δ(t1, t2): the net improvement in model accuracy between the first and second attempts, which measures the efficacy of self-correction; (4) Δi→c(t1, t2): the fraction of problems that are incorrect in the first attempt but become correct at the second attempt, which measures how many new problems self-correction can solve; and (5) Δc→i(t1, t2): the fraction of problems that are correct in the first attempt but become incorrect at the second attempt, which measures how well the model understands what makes a response correct.
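A small helper of my own (the boolean-flag input format is an assumption) showing how these five metrics follow from per-problem correctness at each attempt:

```python
# Sketch: computing the self-correction metrics from per-problem correctness flags.
# correct_t1[i] / correct_t2[i] indicate whether problem i was solved at attempt 1 / 2.
def self_correction_metrics(correct_t1: list[bool], correct_t2: list[bool]) -> dict:
    n = len(correct_t1)
    acc_t1 = sum(correct_t1) / n
    acc_t2 = sum(correct_t2) / n
    i_to_c = sum((not c1) and c2 for c1, c2 in zip(correct_t1, correct_t2)) / n
    c_to_i = sum(c1 and (not c2) for c1, c2 in zip(correct_t1, correct_t2)) / n
    return {
        "Accuracy@t1": acc_t1,
        "Accuracy@t2": acc_t2,
        "Delta(t1,t2)": acc_t2 - acc_t1,     # net improvement from self-correction
        "Delta_i->c(t1,t2)": i_to_c,         # newly solved by the second attempt
        "Delta_c->i(t1,t2)": c_to_i,         # correct answers broken by revision
    }
```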
4. Supervised Fine-Tuning on Self-Generated Data is Insufficient for Self-Correction
Perhaps a natural approach to train for self-correction is to utilize some form of supervised fine-tuning on data collected from a base model. Variants of this recipe have been shown to scale well in single-turn reasoning problems (Havrilla et al., 2024a; Singh et al., 2023; Zelikman et al., 2022). Can such SFT-based approaches be effective for self-correction as well?
Table 1 ∣ Self-correction performance after training on D_STaR and D_SFT. For both approaches, we find that the gap between second-attempt and first-attempt performance (Δ(t1, t2)) is either overly negative or very small. In addition, both approaches erroneously modify a correct response to be incorrect, reflected in a high Δc→i(t1, t2) and a low Δi→c(t1, t2).

| Method | Accuracy@t1 | Accuracy@t2 | Δ(t1, t2) | Δi→c(t1, t2) | Δc→i(t1, t2) |
| --- | --- | --- | --- | --- | --- |
| Base model | 52.6% | 41.4% | -11.2% | 4.6% | 15.8% |
| STaR (D_STaR) | 55.4% | 41.2% | -14.2% | 5.4% | 19.6% |
| Pair-SFT (D_SFT) | 52.4% | 54.2% | +1.8% | 5.4% | 3.6% |
Table 2 ∣ Self-correction performance after training on the extended datasets D⁺_STaR and D⁺_SFT (which additionally include correct-to-correct traces). Performance improves for STaR, indicating that a higher-coverage dataset helps, but not for SFT, where training on traces in which both responses are correct forces the model to simply not make any changes to its first-attempt response, no matter how correct or incorrect it is.

| Method | Accuracy@t1 | Accuracy@t2 | Δ(t1, t2) | Δi→c(t1, t2) | Δc→i(t1, t2) |
| --- | --- | --- | --- | --- | --- |
| Base model | 52.6% | 41.4% | -11.2% | 4.6% | 15.8% |
| STaR (D⁺_STaR) | 53.6% | 54.0% | +0.4% | 2.6% | 2.2% |
| Pair-SFT (D⁺_SFT) | 55.0% | 55.0% | 0% | 0% | 0% |
In this section, we perform an empirical study to answer this question. We study two approaches: STaR (Zelikman et al., 2022) and an approach akin to Welleck et al. (2023) that trains only one model. We do not use learned process or outcome verifiers to guide correction traces, so our setup differs from the SFT setup in Snell et al. (2024). We find that such methods improve substantially over the base model's self-correction behavior, but still fail to attain a meaningfully positive self-correction rate and can even produce a worse second attempt compared to their first. By probing the trained models, we find that these failures largely stem from supervised fine-tuning amplifying the initial bias of the base model, resulting in only minor changes to its first-attempt response. While these failures can be addressed if a different distribution over initial responses is used for training, doing so fails to induce effective self-correction behavior under the model's own response distribution. Either way, learning is affected by distribution shift or by amplification of the base model's bias. These observations motivate the design of our method in Section 5.
4.1. Analysis Setup: Methods and Dataset Construction
Methods. We prompt off-the-shelf models to obtain a large number of two-turn self-correction traces. The STaR approach, analogous to ReST^EM (Singh et al., 2023), filters these trajectories to only retain those that successfully revise incorrect responses and runs SFT on the resulting dataset. In contrast, Welleck et al. (2023) use the base-model data from above to construct sets of correct and incorrect responses and then generate "synthetic" repair traces by pairing incorrect responses with correct ones. We study a variant of their method we call Pair-SFT, which does not train a separate corrector model and does not augment the initial dataset with multi-turn traces.
Dataset construction. We perform our study on the MATH dataset, and generate self-correction traces by prompting Gemini 1.5 Flash (Reid et al., 2024) with temperature 1.0. We construct datasets for STaR and Pair-SFT as follows: (1) $\mathcal{D}_{\text{STaR}} := \{(x_i, \hat{y}_i^{-}, \hat{y}_i^{+})\}_{i=1}^{N}$, where $\hat{y}_i^{-}$ and $\hat{y}_i^{+}$ correspond to incorrect and correct responses appearing within a single sequence of attempts from the current model; and (2) $\mathcal{D}_{\text{SFT}} := \{(x_i, \hat{y}_i^{-}, y_i^{+})\}_{i=1}^{N}$, where $y_i^{+}$ is a random correct response for problem $x_i$, randomly sampled from the set of all first-turn and second-turn responses produced by the model. We then ran supervised fine-tuning on both of these datasets: following Singh et al. (2023), we repeat 3 iterations of data collection and SFT on D_STaR, but run only 1 epoch on D_SFT given its larger size.

Figure 3 ∣ Edit distance between first-attempt and second-attempt responses obtained from the fine-tuned models, our approach (SCoRe), and the base model: (a) histograms of edit-distance ratios on MATH500, (b) STaR edit-distance ratios, (c) Pair-SFT edit-distance ratios. Observe that while training on self-generated error-correction traces inherits the bi-modal distribution of edits of the base model, SFT tends to be quite conservative.
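The sketch below illustrates one way D_STaR and D_SFT could be assembled from sampled two-turn traces, following the description above; the trace tuple layout and the helper name are assumptions for illustration, not the paper's data pipeline.

```python
import random

# Sketch: building D_STaR and D_SFT from sampled two-turn traces.
# Each trace is assumed to be (x, y1, c1, y2, c2), where c1/c2 are booleans
# marking the correctness of each attempt; this layout is illustrative.
def build_datasets(traces):
    d_star, correct_pool, incorrect_pool = [], {}, {}
    for x, y1, c1, y2, c2 in traces:
        # D_STaR: keep traces that revise an incorrect attempt into a correct one.
        if not c1 and c2:
            d_star.append((x, y1, y2))
        # Pool correct / incorrect responses (from either turn) per problem.
        for y, c in ((y1, c1), (y2, c2)):
            (correct_pool if c else incorrect_pool).setdefault(x, []).append(y)
    # D_SFT: pair each incorrect response with a randomly chosen correct one.
    d_sft = [
        (x, y_bad, random.choice(correct_pool[x]))
        for x, bad_list in incorrect_pool.items() if x in correct_pool
        for y_bad in bad_list
    ]
    return d_star, d_sft
```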
4.2. Empirical Findings
We report the self-correction performance of Gemini 1.5 Flash before and after fine-tuning on D_STaR (3 iterations) and D_SFT in Table 1. We find that although Δ(t1, t2) is substantially higher for Pair-SFT relative to the base model, there is still little benefit to doing self-correction (a 1.8% gain). By considering Δi→c and Δc→i, we find that SFT mainly helps by reducing the number of correct responses that are mistakenly changed to incorrect after revision, and does not significantly increase the fraction of incorrect first attempts that are correctly repaired. This result is consistent with prior studies on intrinsic self-correction that have found negligible or even negative Δ(t1, t2) (Huang et al., 2023; Qu et al., 2024).
We also find that unlike Pair-SFT, training on D_STaR does not reduce Δc→i, indicating that the STaR policy does not have a clear understanding of when and when not to make modifications. We hypothesize that this discrepancy is due to the data distributions of D_SFT and D_STaR: the former covers a much more diverse space of revision trajectories due to the nature of random pairing. Observing this, we also trained on extended versions of these datasets, D⁺_STaR and D⁺_SFT, which additionally include tuples in which both responses are correct. We would expect the addition of such "correct-to-correct" data to prevent the model from erroneously revising a correct response and, at the very least, to restrict the modification of a correct response to only another correct response. As shown in Table 2, perhaps interestingly, we find that including such data has opposite effects on STaR and SFT: for STaR, inclusion of this data helps substantially, though it still results in barely any meaningful self-correction performance; for SFT, inclusion of this data overly biases the model to not change its answer at all.
Diving deeper: analyzing self-correction behavior. We also visualized how the STaR and SFT models edit their responses. In particular, we measured the edit-distance ratio, defined as the edit distance between the two responses normalized by the total length of both responses, to summarize the extent to which models modify their first-attempt response. As shown in Figure 3a, while the base model sometimes makes substantially large edits to the original response, models fine-tuned on D_STaR and D_SFT are overly conservative and often make no edits at all. We will show in Section 5 that our proposed method, SCoRe, is able to avoid amplifying this bias of not making changes, without any explicit training for controlling how much to edit solutions.