A Survey on Data Synthesis and Augmentation for Large Language Models

arXiv:2410.12896v1 [cs.CL] 16 Oct 2024

Ke Wang (onecall@), Hangzhou Innovation Institute, Beihang University
Jiahui Zhu (zhujh224@), Hangzhou Innovation Institute, Beihang University
Minjie Ren (rmj_rmj@), Hangzhou Innovation Institute, Beihang University
Zeming Liu (zmliu@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Shiwei Li (shiweili93@), Hangzhou Innovation Institute, Beihang University
Zongye Zhang (zhangzongye@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Chenkai Zhang (zhangchenkai@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Xiaoyu Wu (zf2306113@), Hangzhou Innovation Institute, Beihang University
Qiqi Zhan (zhanqiqi@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Qingjie Liu (qingjie.liu@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Yunhong Wang (yhwang@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

Abstract

The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.

1 Introduction

In recent years, large language models (LLMs) have demonstrated unparalleled capabilities across a wide array of tasks [9, 68, 166], firmly establishing themselves as the backbone of general artificial intelligence (AI) systems. These models achieve significant improvements in natural language processing [234, 262, 264], computer vision [100, 207, 239], and other research fields [36, 163, 229], consistently pushing the boundaries of what AI can achieve. The success of LLMs is largely attributed to their ability to capture intricate patterns and relationships within vast amounts of data, allowing them to perform complex tasks such as natural language inference [39, 134], visual question answering [151, 158], and vision-and-language navigation [125, 178] with remarkable proficiency.

However, the performance of LLMs is highly dependent on the quality and volume of the data they are trained on [2, 57, 58]. With the exponential growth in model size, now reaching billions or even trillions of parameters [105, 168, 268], there is an increasing demand for large-scale, diverse, and high-quality data to ensure robust generalization across various tasks and domains. Obtaining such data poses significant challenges due to the high costs of data collection and the problems introduced by privacy concerns. Additionally, the growth rate of high-quality data lags far behind the rapidly increasing size of training datasets. If this trend continues, the available data will eventually be depleted, implying that without significant improvements in data efficiency or the discovery of new data sources, the growth of LLMs may slow down considerably. Given these impending limitations, data synthesis and augmentation techniques become essential to extending the lifespan and generalization of LLMs. Traditional data synthesis and augmentation techniques [34, 98, 135, 194], such as image rotation, cropping, flipping, and rule-based natural language generation, have been widely used to address these data limitations. Although these approaches improve data diversity and address data scarcity to some extent, they still struggle to fully capture the complexities of real-world data [55], generate data at scale [233], and defend against adversarial examples [162], limiting their effectiveness for training LLMs.

Figure 1: Statistics of the publications related to LLM-oriented data synthesis and augmentation technologies, grouped by the publication year and venue.

To overcome these challenges, researchers have increasingly turned to LLM-oriented data synthesis and augmentation techniques, recognizing the ability of LLMs to model complex patterns from large datasets and generate synthetic data that closely mirrors real-world distributions while introducing valuable variations [37, 175, 260]. These studies reduce the reliance on manually curated datasets and enable the generation of high-quality, diverse data that meets the evolving demands of LLMs throughout their lifecycle and functions. To capture the breadth of these efforts, we collected papers related to LLM-oriented data synthesis and augmentation by searching Google Scholar using keywords such as "data synthesis," "data augmentation," and "large models." Figure 1 illustrates the publication trends by year and venue, reflecting the increasing interest in this field. As of October 2024, we identified 250 unique publications covering diverse research topics and venues. Summarizing these efforts provides critical insights into the progress and challenges that remain, offering a foundation for future research. Despite these advancements, several key challenges remain in LLM-oriented data synthesis and augmentation. The misuse of synthetic data poses risks, particularly in spreading misinformation and raising ethical concerns around manipulating public opinion. Additionally, synthetic data often introduces ambiguity when aligning AI models with human values, potentially leading to biased outcomes. Evaluating models trained on synthetic data is also complex, as traditional benchmarks may not fully capture the nuances of this data. Ensuring reliability is another concern, as biases and inaccuracies from original datasets can persist in synthetic data, limiting its generalization across domains. Moreover, the computational demands of LLMs, along with challenges in handling less common languages or novel instructions, complicate broader applications. Finally, the lack of a unified framework for organizing and comparing the methods proposed in both academia and industry remains a barrier for researchers navigating this rapidly evolving field.

This survey aims to address these gaps by providing a comprehensive overview of LLM-oriented data synthesis and augmentation techniques. As shown in Figure 2, unlike previous surveys [43, 140, 147, 214, 271], which primarily focus on applying these methods to support specific downstream tasks or particular stages of LLMs, our work emphasizes the direct role of LLM-oriented techniques in improving the overall performance of LLMs across various stages of their lifecycle and core functions. In contrast to the work [137], which focuses on practices for synthetic data generation to address challenges like data scarcity and privacy, our survey extends beyond practical guidance by categorizing methods aimed at improving LLM performance holistically. We examine not only data generation but also how these techniques enhance LLMs across all stages and functions, offering a more integrated, data-centric framework for advancing LLMs. Specifically, we systematically review and categorize existing research from two key perspectives: the lifecycle of LLMs (from pre-training to fine-tuning and application) and their core functions (understanding, logic, memory, and generation). By framing the discussion around these dual perspectives, we offer clearer insights into the development, interconnections, and practical applications of different approaches. Moreover, we identify critical challenges, explore emerging research directions, and highlight potential breakthroughs that could further drive advancements in LLM performance through data-centric methods.

The contributions of this survey are summarized as follows:

• First survey: To our knowledge, we present the first comprehensive survey focused on advancing LLMs through data synthesis and augmentation, systematically covering the entire lifecycle stages and core functions of LLMs. This survey provides an in-depth analysis of current methodologies and highlights the unique challenges at each stage.

• New taxonomy: We introduce an innovative organizational framework that categorizes existing research from two key perspectives: the lifecycle stages of LLMs and their core functions. This taxonomy offers a clearer understanding of the progression, interconnections, and applicability of different approaches, providing valuable insights into both the developmental and functional aspects of LLM-oriented data synthesis and augmentation.

• New frontiers: We identify critical challenges and explore emerging research directions and potential breakthroughs in LLM-oriented data synthesis and augmentation. This discussion aims to inspire future research and guide developments in data-centric techniques for LLM advancement.

• Abundant resources: We organize and maintain a dedicated repository to support ongoing research and collaboration in LLM-oriented data synthesis and augmentation. This resource includes a curated collection of related papers, multiple leaderboards tracking the latest advancements, and regular updates to foster innovation, guide future research directions, and accelerate breakthroughs in the field.

Figure 2: A comparison between existing surveys on data synthesis and augmentation techniques and our work. Previous surveys primarily focus on LLM-based data synthesis and augmentation methods aimed at supporting downstream tasks. In contrast, our work emphasizes LLM-oriented data synthesis and augmentation, systematically covering the full lifecycle of LLMs, from data preparation to applications, and addressing core LLM functions such as understanding and generation, with the ultimate goal of improving LLMs themselves through data-centric techniques.

By offering a comprehensive overview of LLM-oriented data synthesis and augmentation approaches, this survey aims to clarify the current state of the field and inspire future research directions that can further enhance LLM capabilities through data synthesis and augmentation methodologies.

We organize the remainder of this survey as follows: Section 2 categorizes the primary areas of LLM-oriented data synthesis and augmentation, providing an overview of the foundational techniques. Section 3 discusses the current LLM-oriented data synthesis and augmentation methods from the perspective of the full lifecycle of LLMs, detailing how these techniques are employed at different stages of model development. In Section 4, we review these methods from the viewpoint of core LLM functions, exploring how data synthesis and augmentation enhance key capabilities such as understanding, logic, memory, and generation. Section 5 delves into the evaluation strategies for LLM-oriented data synthesis and augmentation, addressing benchmarks, evaluation metrics, and leaderboards used to assess and compare the effectiveness of existing approaches. Finally, Section 6 provides insights into challenges and emerging trends in LLM-oriented data synthesis and augmentation, offering recommendations for future research directions that can contribute to the continued advancement of LLMs through data synthesis and augmentation methodologies.

2 Taxonomy

Data generation methods play a pivotal role in addressing data scarcity and imbalance, thereby improving model performance and generalization. As shown in Fig. 4, we summarize the development and evolution of data augmentation and synthesis techniques in recent years. This section primarily introduces the current classification of data generation methods, distinguishing between data augmentation, which enhances existing data samples through transformations, and data synthesis, which creates entirely new samples from scratch or based on generative models. The two methods differ in how they acquire data but share the aim of expanding datasets. Furthermore, data augmentation and synthesis methods can be categorized into subclasses along multiple dimensions. Each approach has unique strengths and applications, enabling researchers to tailor their data generation strategies to specific needs and goals.

Figure 3: The main content flow and categorization of this survey. (The original figure is a taxonomy tree covering Taxonomy (§2): data augmentation and data synthesis; the full lifecycle of LLMs (§3): data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications; functionality (§4): understanding, logic, memory, and generation; and challenges and limitations (§5), with representative methods cited under each branch.)

2.1 Data Augmentation

Data augmentation, a type of generation approach from data to data, generally involves manipulating the original data to increase its diversity and quantity without significantly altering its essential characteristics. Techniques used in data augmentation are designed to enhance the richness of existing data samples through transformations or perturbations. Across different modalities, data augmentation techniques often exhibit similarities. For instance, in image data, augmentation operations encompass mosaic [90], flipping [184], copy-pasting [61], adding noise [149], pairing [84], and so forth. Similarly, in text data, augmentation operations involve synonym replacement [95], copy-pasting [185], etc. Moreover, to cater to the demands of multimodal learning, existing research has addressed cross-modal information alignment during data augmentation. MixGen [75] generates new training samples by linearly interpolating images and concatenating text sequences from two existing image-text pairs, so that the semantic relationship within the newly generated image-text pair remains consistent and matched. Recently, in the rapidly advancing landscape of LLMs, data augmentation has emerged as a cornerstone for bolstering model performance through the diversification of training exemplars, circumventing the necessity for extensive additional data gathering. From a data-centric perspective, we systematically categorize existing research on data augmentation into three distinct categories: data labeling [3, 63, 94, 136, 198, 275], data reformation [45, 51, 143, 237], and co-annotation [11, 43, 116].
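To make the MixGen-style operation concrete, the following is a minimal sketch, assuming toy NumPy arrays stand in for real image tensors and a fixed mixing ratio; it illustrates the idea rather than the released MixGen implementation.

import numpy as np

def mixgen_pair(image_a, image_b, caption_a, caption_b, lam=0.5):
    # Linearly interpolate the two images and concatenate the two captions,
    # producing one new image-text training pair (MixGen-style).
    mixed_image = lam * image_a + (1.0 - lam) * image_b
    mixed_caption = f"{caption_a} {caption_b}"
    return mixed_image, mixed_caption

if __name__ == "__main__":
    # Toy 4x4 "images" stand in for real image tensors.
    img_a = np.zeros((4, 4), dtype=np.float32)
    img_b = np.ones((4, 4), dtype=np.float32)
    img, cap = mixgen_pair(img_a, img_b, "a black square.", "a white square.")
    print(img.mean(), cap)  # 0.5  a black square. a white square.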

2.1.1 Data Labeling. Data labeling endeavors to leverage the comprehensive language understanding capabilities of LLMs to annotate vast unlabeled datasets. This methodology is particularly beneficial in fields that possess a substantial unlabeled data corpus, encompassing domains such as cross-lingual processing and multimodal learning [3, 63, 275], where the automation of annotation can significantly expedite the data preparation process. Recent research studies the zero-shot annotation ability of LLMs, such as GPT-4's ability to label political Twitter data [198]. Moreover, Khan et al. [94] focus on visual question answering (VQA) tasks, generating pseudo-label data from unlabeled images with the SelTDA framework.
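The zero-shot labeling pattern described above can be sketched as follows; this is a minimal illustration, not a specific published pipeline, and call_llm is a placeholder for any general-purpose model API, with an illustrative sentiment label set.

from typing import List

LABELS = ["positive", "negative", "neutral"]  # illustrative label set

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call (e.g., a chat-completion endpoint).
    # It returns a fixed answer here so the script runs end to end.
    return "neutral"

def zero_shot_label(text: str) -> str:
    prompt = (
        "Classify the sentiment of the following text as one of "
        f"{', '.join(LABELS)}. Reply with the label only.\n\nText: {text}\nLabel:"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in LABELS else "unknown"  # guard against free-form replies

def label_corpus(texts: List[str]) -> List[dict]:
    # Annotate an unlabeled corpus with LLM-generated labels.
    return [{"text": t, "label": zero_shot_label(t)} for t in texts]

if __name__ == "__main__":
    print(label_corpus(["The new policy was announced today."]))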

2.1.2 Data Reformation. Data reformation involves transforming and restructuring existing data into a broader spectrum of variations, thereby facilitating more fine-grained data augmentation [45, 51]. This approach aims to enrich the training landscape with diverse yet pertinent examples, enhancing the model's robustness and generalization capabilities. Classic methods such as rotation [92], color channel transformation [64], and synonym replacement [95] are commonly used. Recently, approaches utilizing LLMs have also emerged. For example, Chen et al. [27] propose Disco, an approach that harnesses LLMs to produce large-scale, high-quality counterfactual data.

2.1.3 Co-Annotation. Co-annotation designates the collaborative effort between human annotators and LLMs in the annotation process [11]. By integrating the strengths of both annotation methodologies, co-annotation not only mitigates annotation costs but also concurrently enhances annotation performance, fostering a more efficient and effective approach to data annotation. Li et al. [116] introduce CoAnnotating, a framework that strategically assigns data points for annotation either to humans or to LLMs, based on an assessment of the LLM's annotation uncertainty.

2.2 Data Synthesis

Data synthesis, on the other hand, aims to create entirely new data, either from scratch or with generative models, such that the synthetic data is similar to the distribution of real data. In recent years, with the explosion and advancements in generative AI [13, 41, 42, 78, 139, 161, 169], there have been significant strides in the quality and generation efficiency of synthetic data. Based on the requirements of LMs, this paper categorizes data synthesis methods into three main types: general model distillation [22, 53, 120, 263, 266], domain model distillation [108, 145, 146, 215], and model self-improvement [54, 150, 210, 248].

2.2.1 General Model Distillation. Among these, general model distillation involves leveraging powerful general models, typically featuring larger parameter counts and superior performance, such as StableVicuna, ChatGPT, and GPT-4, to generate datasets that can enhance the capabilities of weaker models. There are various ways to employ these powerful models, such as using predefined templates to generate tiny stories [53] and leveraging the LLMs themselves to evaluate the quality of the generated data. Phi-1 and its successors [67, 120] have demonstrated that a small amount of high-quality data, obtained by comprehensively generating textbooks and exercises with GPT-3.5, can also train a powerful model. Other methods have achieved performance improvements by generating instruction datasets and fine-tuning models after improving the quality of these datasets [22, 80, 196].
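The distillation loop described above can be sketched as follows: a strong "teacher" model generates instruction-response pairs that are saved as training data for a weaker "student" model. This is a minimal illustration under assumed placeholders; call_teacher, the seed topics, and the output file name are not from any particular published system.

import json
from typing import List

SEED_TOPICS = ["explain photosynthesis", "write a bubble-sort function"]  # illustrative

def call_teacher(prompt: str) -> str:
    # Placeholder for a strong general model (GPT-4-class); canned reply here.
    return "Teacher response for: " + prompt

def distill_dataset(topics: List[str], per_topic: int = 2) -> List[dict]:
    # Ask the teacher to write an instruction, then answer it; keep both as a pair.
    samples = []
    for topic in topics:
        for i in range(per_topic):
            instruction = call_teacher(
                f"Write one clear, self-contained instruction about: {topic} (variant {i})."
            )
            response = call_teacher(f"Answer the instruction:\n{instruction}")
            samples.append({"instruction": instruction, "output": response})
    return samples

if __name__ == "__main__":
    data = distill_dataset(SEED_TOPICS)
    with open("distilled_sft.jsonl", "w", encoding="utf-8") as f:
        for row in data:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    print(f"wrote {len(data)} instruction-response pairs")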

2.2.2 Domain Model Distillation. Domain model distillation pertains to the utilization of models that are tailored to generate data within a particular domain. This approach is often necessary when general models fail to meet the specific needs of industry applications. For instance, in the context of code programming, domain model distillation can be employed to generate instructional data tailored to specific coding tasks [146, 215]. In the realm of mathematics, methods such as Minerva [108] and DeepSeekMath [220] are designed to generate solutions to mathematical problems while ensuring their accuracy and diversity. Additionally, industry data often presents barriers, such as limited data scale and the inaccessibility of data held within specific enterprises in the domain. These factors necessitate the adoption of domain-specific models that can effectively address the unique challenges posed by such scenarios.

2.2.3 Model Self-Improvement. Model self-improvement refers to the process where a model generates higher-quality data to enhance its own capabilities. For instance, leveraging existing instructions to adjust the model and prompting it to paraphrase documents on the web in specific styles, such as Wikipedia-style or QA-style, can be used to jointly pre-train LLMs for both authentic and synthetic paraphrasing tasks [150]. Self-Instruct [210] enhances LMs themselves by auto-generating and refining instructional data, boosting performance with minimal human intervention.
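The bootstrapping loop behind Self-Instruct-style self-improvement can be sketched as follows (a simplification, not the released pipeline): the model expands a small seed pool with new instructions, answers them itself, filters near-duplicates, and the surviving pairs become fine-tuning data. The call_model stub and the crude word-overlap filter are illustrative assumptions standing in for a real model call and ROUGE-based deduplication.

import random
from typing import List

def call_model(prompt: str) -> str:
    # Placeholder for the model being improved; returns a canned string here.
    return "Synthetic instruction derived from: " + prompt[:40]

def too_similar(candidate: str, pool: List[str], max_overlap: float = 0.7) -> bool:
    # Crude word-overlap filter standing in for ROUGE-based deduplication.
    cand = set(candidate.lower().split())
    for existing in pool:
        ref = set(existing.lower().split())
        if cand and len(cand & ref) / len(cand) > max_overlap:
            return True
    return False

def self_instruct(seed_instructions: List[str], rounds: int = 2) -> List[dict]:
    pool = list(seed_instructions)
    dataset = []
    for _ in range(rounds):
        examples = "\n".join(random.sample(pool, min(3, len(pool))))
        new_inst = call_model(f"Here are some task instructions:\n{examples}\nWrite a new one.")
        if too_similar(new_inst, pool):
            continue  # drop near-duplicate instructions
        answer = call_model(f"Complete this task:\n{new_inst}")
        pool.append(new_inst)
        dataset.append({"instruction": new_inst, "output": answer})
    return dataset

if __name__ == "__main__":
    seeds = ["Summarize a news article.", "Translate a sentence into French."]
    print(len(self_instruct(seeds)), "new instruction-response pairs generated")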

3 Data Synthesis and Augmentation in the Full Lifecycle of LLMs

From the perspective of the full lifecycle of LLMs, we divide the existing investigations into six stages: data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. The present section introduces relevant research in each stage.

3.1 Data Preparation

In the data preparation phase, data synthesis and augmentation aim to generate diverse and high-quality datasets for the training of LLMs, addressing the challenge of the scarcity of real-world data. According to the taxonomy discussed in Section 2, we divide the present subsection into general model distillation and data augmentation.


Figure 4: Illustration of the evolutionary steps in the development of data synthesis and augmentation techniques for large models.

3.1.1 General Model Distillation. This approach aims to leverage the powerful capabilities of general LLMs to distill high-quality data. According to the approach and data modality, we further divide general model distillation into five categories: synthesize from seeds, synthesize reasoning steps, synthesize with controllability, synthesize from scratch, and synthesize multimodal data.

Synthesize from Seeds. To synthesize datasets for specific tasks, prompting LLMs with a small number of relevant examples can effectively produce high-quality datasets at a low cost. For instance, to investigate "how small can an LLM be to achieve certain capabilities", TinyStories [53] is constructed by instructing an LLM to generate stories that combine three words randomly chosen from 1,500 basic words, and it can be used to train and evaluate language models. Based on the collected large-scale functions, Case2Code [180] incorporates LLMs to generate suitable inputs for these functions and utilizes the code interpreter to calculate the
