A Survey on Data Synthesis and Augmentation for Large Language Models

arXiv:2410.12896v1 [cs.CL] 16 Oct 2024

Ke Wang (onecall@), Hangzhou Innovation Institute, Beihang University
Jiahui Zhu (zhujh224@), Hangzhou Innovation Institute, Beihang University
Minjie Ren (rmj_rmj@), Hangzhou Innovation Institute, Beihang University
Zeming Liu (zmliu@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Shiwei Li (shiweili93@), Hangzhou Innovation Institute, Beihang University
Zongye Zhang (zhangzongye@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Chenkai Zhang (zhangchenkai@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Xiaoyu Wu (zf2306113@), Hangzhou Innovation Institute, Beihang University
Qiqi Zhan (zhanqiqi@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Qingjie Liu (qingjie.liu@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Yunhong Wang (yhwang@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

Abstract

The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.

1 Introduction

In recent years, large language models (LLMs) have demonstrated unparalleled capabilities across a wide array of tasks [9, 68, 166], firmly establishing themselves as the backbone of general artificial intelligence (AI) systems. These models achieve significant improvements in natural language processing [234, 262, 264], computer vision [100, 207, 239], and other research fields [36, 163, 229], consistently pushing the boundaries of what AI can achieve. The success of LLMs is largely attributed to their ability to capture intricate patterns and relationships within vast amounts of data, allowing them to perform complex tasks such as natural language inference [39, 134], visual question answering [151, 158], and vision-and-language navigation [125, 178] with remarkable proficiency.

However, the performance of LLMs is highly dependent on the quality and volume of the data they are trained on [2, 57, 58]. With the exponential growth in model size, now reaching billions or even trillions of parameters [105, 168, 268], there is an increasing demand for large-scale, diverse, and high-quality data to ensure robust generalization across various tasks and domains. Obtaining such data poses significant challenges due to the high costs of data collection and the problems introduced by privacy concerns. Additionally, the growth rate of high-quality data lags far behind the rapidly increasing size of training datasets. If this trend continues, the available data will eventually be depleted, implying that without significant improvements in data efficiency or the discovery of new data sources, the growth of LLMs may slow down considerably. Given these impending limitations, data synthesis and augmentation techniques become essential to extending the lifespan and generalization of LLMs. Traditional data synthesis and augmentation techniques [34, 98, 135, 194], such as image rotation, cropping, flipping, and rule-based natural language generation, have been widely used to address these data limitations. Although these approaches improve data diversity and address data scarcity to some extent, they still struggle to fully capture the complexities of real-world data [55], generate data at scale [233], and defend against adversarial examples [162], limiting their effectiveness for training LLMs.

Figure 1: Statistics of the publications related to LLM-oriented data synthesis and augmentation technologies, grouped by the publication year and venue.

To overcome these challenges, researchers have increasingly turned to LLM-oriented data synthesis and augmentation techniques, recognizing the ability of LLMs to model complex patterns from large datasets and generate synthetic data that closely mirrors real-world distributions while introducing valuable variations [37, 175, 260]. These studies reduce the reliance on manually curated datasets and enable the generation of high-quality, diverse data that meets the evolving demands of LLMs throughout their lifecycle and functions. To capture the breadth of these efforts, we collected papers related to LLM-oriented data synthesis and augmentation by searching Google Scholar using keywords such as "data synthesis," "data augmentation," and "large models." Figure 1 illustrates the publication trends by year and venue, reflecting the increasing interest in this field. As of October 2024, we identified 250 unique publications covering diverse research topics and venues. Summarizing these efforts provides critical insights into the progress and challenges that remain, offering a foundation for future research. Despite these advancements, several key challenges remain in LLM-oriented data synthesis and augmentation. The misuse of synthetic data poses risks, particularly in spreading misinformation and raising ethical concerns around manipulating public opinion. Additionally, synthetic data often introduces ambiguity when aligning AI models with human values, potentially leading to biased outcomes. Evaluating models trained on synthetic data is also complex, as traditional benchmarks may not fully capture the nuances of this data. Ensuring reliability is another concern, as biases and inaccuracies from original datasets can persist in synthetic data, limiting its generalization across domains. Moreover, the computational demands of LLMs, along with challenges in handling less common languages or novel instructions, complicate broader applications. Finally, the lack of a unified framework for organizing and comparing the methods proposed in both academia and industry remains a barrier for researchers navigating this rapidly evolving field.

This survey aims to address these gaps by providing a comprehensive overview of LLM-oriented data synthesis and augmentation techniques. As shown in Figure 2, unlike previous surveys [43, 140, 147, 214, 271], which primarily focus on applying these methods to support specific downstream tasks or particular stages of LLMs, our work emphasizes the direct role of LLM-oriented techniques in improving the overall performance of LLMs across various stages of their lifecycle and core functions. In contrast to the work [137], which focuses on practices for synthetic data generation to address challenges like data scarcity and privacy, our survey extends beyond practical guidance by categorizing methods aimed at improving LLM performance holistically. We examine not only data generation but also how these techniques enhance LLMs across all stages and functions, offering a more integrated, data-centric framework for advancing LLMs. Specifically, we systematically review and categorize existing research from two key perspectives: the lifecycle of LLMs (from pre-training to fine-tuning and application) and their core functions (understanding, logic, memory, and generation). By framing the discussion around these dual perspectives, we offer clearer insights into the development, interconnections, and practical applications of different approaches. Moreover, we identify critical challenges, explore emerging research directions, and highlight potential breakthroughs that could further drive advancements in LLM performance through data-centric methods.

The contributions of this survey are summarized as follows:

• First survey: To our knowledge, we present the first comprehensive survey focused on advancing LLMs through data synthesis and augmentation, systematically covering the entire lifecycle stages and core functions of LLMs. This survey provides an in-depth analysis of current methodologies and highlights the unique challenges at each stage.

• New taxonomy: We introduce an innovative organizational framework that categorizes existing research from two key perspectives: the lifecycle stages of LLMs and their core functions. This taxonomy offers a clearer understanding of the progression, interconnections, and applicability of different approaches, providing valuable insights into both the developmental and functional aspects of LLM-oriented data synthesis and augmentation.

• New frontiers: We identify critical challenges and explore emerging research directions and potential breakthroughs in LLM-oriented data synthesis and augmentation. This discussion aims to inspire future research and guide developments in data-centric techniques for LLM advancement.

• Abundant resources: We organize and maintain a dedicated repository to support ongoing research and collaboration in LLM-oriented data synthesis and augmentation. This resource includes a curated collection of related papers, multiple leaderboards tracking the latest advancements, and regular updates to foster innovation, guide future research directions, and accelerate breakthroughs in the field.

Figure 2: A comparison between existing surveys on data synthesis and augmentation techniques and our work. Previous surveys primarily focus on LLM-based data synthesis and augmentation methods aimed at supporting downstream tasks. In contrast, our work emphasizes LLM-oriented data synthesis and augmentation, systematically covering the full lifecycle of LLMs, from data preparation to applications, and addressing core LLM functions such as understanding and generation, with the ultimate goal of improving LLMs themselves through data-centric techniques.

By offering a comprehensive overview of LLM-oriented data synthesis and augmentation approaches, this survey aims to clarify the current state of the field and inspire future research directions that can further enhance LLM capabilities through data synthesis and augmentation methodologies.

We organize the remainder of this survey as follows: Section 2 categorizes the primary areas of LLM-oriented data synthesis and augmentation, providing an overview of the foundational techniques. Section 3 discusses the current LLM-oriented data synthesis and augmentation methods from the perspective of the full lifecycle of LLMs, detailing how these techniques are employed at different stages of model development. In Section 4, we review these methods from the viewpoint of core LLM functions, exploring how data synthesis and augmentation enhance key capabilities such as understanding, logic, memory, and generation. Section 5 delves into the evaluation strategies for LLM-oriented data synthesis and augmentation, addressing benchmarks, evaluation metrics, and leaderboards used to assess and compare the effectiveness of existing approaches. Finally, Section 6 provides insights into challenges and emerging trends in LLM-oriented data synthesis and augmentation, offering recommendations for future research directions that can contribute to the continued advancement of LLMs through data synthesis and augmentation methodologies.

2 Taxonomy

Data generation methods play a pivotal role in addressing data scarcity and imbalance, thereby improving model performance and generalization. As shown in Fig. 4, we summarize the development and evolution of data augmentation and synthesis techniques in recent years. This section primarily introduces the current classification of data generation methods, distinguishing between data augmentation, which enhances existing data samples through transformations, and data synthesis, which creates entirely new samples from scratch or based on generative models. The two methods differ in how they acquire data but share the aim of expanding datasets. Furthermore, data augmentation and synthesis methods can be categorized into subclasses along multiple dimensions. Each approach has unique strengths and applications, enabling researchers to tailor their data generation strategies to specific needs and goals.

Figure 3: The main content flow and categorization of this survey. (The original figure is a taxonomy tree covering Taxonomy (§2): data augmentation and data synthesis; the full lifecycle of LLMs (§3): data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications; functionality (§4): understanding, logic, memory, and generation; and challenges and limitations (§5), with representative methods cited under each branch.)

2.1 Data Augmentation

Data augmentation, a type of generation approach from data to data, generally involves manipulating the original data to increase its diversity and quantity without significantly altering its essential characteristics. Techniques used in data augmentation are designed to enhance the richness of existing data samples through transformations or perturbations. Across different modalities, data augmentation techniques often exhibit similarities. For instance, in image data, augmentation operations encompass mosaic [90], flipping [184], copy-pasting [61], adding noise [149], pairing [84], and so forth. Similarly, in text data, augmentation operations involve synonym replacement [95], copy-pasting [185], etc. Moreover, to cater to the demands of multimodal learning, existing research has addressed cross-modal information alignment during data augmentation. MixGen [75] generates new training samples by linearly interpolating images and concatenating text sequences from two existing image-text pairs, so that the semantic relationship within the newly generated image-text pair remains consistent and matched. Recently, in the rapidly advancing landscape of LLMs, data augmentation has emerged as a cornerstone for bolstering model performance through the diversification of training exemplars, circumventing the necessity for extensive additional data gathering. From a data-centric perspective, we systematically categorize existing research on data augmentation into three distinct categories: data labeling [3, 63, 94, 136, 198, 275], data reformation [45, 51, 143, 237], and co-annotation [11, 43, 116].
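To make the MixGen-style operation concrete, the following is a minimal sketch, assuming toy NumPy arrays stand in for real image tensors and a fixed mixing ratio; it illustrates the idea rather than the released MixGen implementation.

import numpy as np

def mixgen_pair(image_a, image_b, caption_a, caption_b, lam=0.5):
    # Linearly interpolate the two images and concatenate the two captions,
    # producing one new image-text training pair (MixGen-style).
    mixed_image = lam * image_a + (1.0 - lam) * image_b
    mixed_caption = f"{caption_a} {caption_b}"
    return mixed_image, mixed_caption

if __name__ == "__main__":
    # Toy 4x4 "images" stand in for real image tensors.
    img_a = np.zeros((4, 4), dtype=np.float32)
    img_b = np.ones((4, 4), dtype=np.float32)
    img, cap = mixgen_pair(img_a, img_b, "a black square.", "a white square.")
    print(img.mean(), cap)  # 0.5  a black square. a white square.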

2.1.1 Data Labeling. Data labeling endeavors to leverage the comprehensive language understanding capabilities of LLMs to annotate vast unlabeled datasets. This methodology is particularly beneficial in fields that possess a substantial unlabeled data corpus, encompassing domains such as cross-lingual processing and multimodal learning [3, 63, 275], where the automation of annotation can significantly expedite the data preparation process. Recent research studies the zero-shot annotation ability of LLMs, such as GPT-4's ability to label political Twitter data [198]. Moreover, Khan et al. [94] focus on visual question answering (VQA) tasks, generating pseudo-label data from unlabeled images with the SelTDA framework.
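The zero-shot labeling pattern described above can be sketched as follows; this is a minimal illustration, not a specific published pipeline, and call_llm is a placeholder for any general-purpose model API, with an illustrative sentiment label set.

from typing import List

LABELS = ["positive", "negative", "neutral"]  # illustrative label set

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call (e.g., a chat-completion endpoint).
    # It returns a fixed answer here so the script runs end to end.
    return "neutral"

def zero_shot_label(text: str) -> str:
    prompt = (
        "Classify the sentiment of the following text as one of "
        f"{', '.join(LABELS)}. Reply with the label only.\n\nText: {text}\nLabel:"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in LABELS else "unknown"  # guard against free-form replies

def label_corpus(texts: List[str]) -> List[dict]:
    # Annotate an unlabeled corpus with LLM-generated labels.
    return [{"text": t, "label": zero_shot_label(t)} for t in texts]

if __name__ == "__main__":
    print(label_corpus(["The new policy was announced today."]))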

2.1.2 Data Reformation. Data reformation involves transforming and restructuring existing data into a broader spectrum of variations, thereby facilitating more fine-grained data augmentation [45, 51]. This approach aims to enrich the training landscape with diverse yet pertinent examples, enhancing the model's robustness and generalization capabilities. Classic methods such as rotation [92], color channel transformation [64], and synonym replacement [95] are commonly used. Recently, approaches utilizing LLMs have also emerged. For example, Chen et al. [27] propose Disco, an approach that harnesses LLMs to produce large-scale, high-quality counterfactual data.

2.1.3 Co-Annotation. Co-annotation designates the collaborative effort between human annotators and LLMs in the annotation process [11]. By integrating the strengths of both annotation methodologies, co-annotation not only mitigates annotation costs but also concurrently enhances annotation performance, fostering a more efficient and effective approach to data annotation. Li et al. [116] introduce CoAnnotating, a framework that strategically assigns data points for annotation either to humans or to LLMs, based on an assessment of the LLM's annotation uncertainty.

2.2 Data Synthesis

Data synthesis, on the other hand, aims to create entirely new data, either from scratch or with generative models, such that the synthetic data is similar to the distribution of real data. In recent years, with the explosion and advancements in generative AI [13, 41, 42, 78, 139, 161, 169], there have been significant strides in the quality and generation efficiency of synthetic data. Based on the requirements of LMs, this paper categorizes data synthesis methods into three main types: general model distillation [22, 53, 120, 263, 266], domain model distillation [108, 145, 146, 215], and model self-improvement [54, 150, 210, 248].

2.2.1 General Model Distillation. Among these, general model distillation involves leveraging powerful general models, typically featuring larger parameter counts and superior performance, such as StableVicuna, ChatGPT, and GPT-4, to generate datasets that can enhance the capabilities of weaker models. There are various ways to employ these powerful models, such as using predefined templates to generate tiny stories [53] and leveraging the LLMs themselves to evaluate the quality of the generated data. Phi-1 and its successors [67, 120] have demonstrated that a small amount of high-quality data, obtained by comprehensively generating textbooks and exercises with GPT-3.5, can also train a powerful model. Other methods have achieved performance improvements by generating instruction datasets and fine-tuning models after improving the quality of these datasets [22, 80, 196].
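The distillation loop described above can be sketched as follows: a strong "teacher" model generates instruction-response pairs that are saved as training data for a weaker "student" model. This is a minimal illustration under assumed placeholders; call_teacher, the seed topics, and the output file name are not from any particular published system.

import json
from typing import List

SEED_TOPICS = ["explain photosynthesis", "write a bubble-sort function"]  # illustrative

def call_teacher(prompt: str) -> str:
    # Placeholder for a strong general model (GPT-4-class); canned reply here.
    return "Teacher response for: " + prompt

def distill_dataset(topics: List[str], per_topic: int = 2) -> List[dict]:
    # Ask the teacher to write an instruction, then answer it; keep both as a pair.
    samples = []
    for topic in topics:
        for i in range(per_topic):
            instruction = call_teacher(
                f"Write one clear, self-contained instruction about: {topic} (variant {i})."
            )
            response = call_teacher(f"Answer the instruction:\n{instruction}")
            samples.append({"instruction": instruction, "output": response})
    return samples

if __name__ == "__main__":
    data = distill_dataset(SEED_TOPICS)
    with open("distilled_sft.jsonl", "w", encoding="utf-8") as f:
        for row in data:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    print(f"wrote {len(data)} instruction-response pairs")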

2.2.2 Domain Model Distillation. Domain model distillation pertains to the utilization of models that are tailored to generate data within a particular domain. This approach is often necessary when general models fail to meet the specific needs of industry applications. For instance, in the context of code programming, domain model distillation can be employed to generate instructional data tailored to specific coding tasks [146, 215]. In the realm of mathematics, methods such as Minerva [108] and DeepSeekMath [220] are designed to generate solutions to mathematical problems while ensuring their accuracy and diversity. Additionally, industry data often presents barriers, such as limited data scale and the inaccessibility of data held within specific enterprises in the domain. These factors necessitate the adoption of domain-specific models that can effectively address the unique challenges posed by such scenarios.

2.2.3 Model Self-Improvement. Model self-improvement refers to the process where a model generates higher-quality data to enhance its own capabilities. For instance, leveraging existing instructions to adjust the model and prompting it to paraphrase documents on the web in specific styles, such as Wikipedia-style or QA-style, can be used to jointly pre-train LLMs for both authentic and synthetic paraphrasing tasks [150]. Self-Instruct [210] enhances LMs themselves by auto-generating and refining instructional data, boosting performance with minimal human intervention.
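The bootstrapping loop behind Self-Instruct-style self-improvement can be sketched as follows (a simplification, not the released pipeline): the model expands a small seed pool with new instructions, answers them itself, filters near-duplicates, and the surviving pairs become fine-tuning data. The call_model stub and the crude word-overlap filter are illustrative assumptions standing in for a real model call and ROUGE-based deduplication.

import random
from typing import List

def call_model(prompt: str) -> str:
    # Placeholder for the model being improved; returns a canned string here.
    return "Synthetic instruction derived from: " + prompt[:40]

def too_similar(candidate: str, pool: List[str], max_overlap: float = 0.7) -> bool:
    # Crude word-overlap filter standing in for ROUGE-based deduplication.
    cand = set(candidate.lower().split())
    for existing in pool:
        ref = set(existing.lower().split())
        if cand and len(cand & ref) / len(cand) > max_overlap:
            return True
    return False

def self_instruct(seed_instructions: List[str], rounds: int = 2) -> List[dict]:
    pool = list(seed_instructions)
    dataset = []
    for _ in range(rounds):
        examples = "\n".join(random.sample(pool, min(3, len(pool))))
        new_inst = call_model(f"Here are some task instructions:\n{examples}\nWrite a new one.")
        if too_similar(new_inst, pool):
            continue  # drop near-duplicate instructions
        answer = call_model(f"Complete this task:\n{new_inst}")
        pool.append(new_inst)
        dataset.append({"instruction": new_inst, "output": answer})
    return dataset

if __name__ == "__main__":
    seeds = ["Summarize a news article.", "Translate a sentence into French."]
    print(len(self_instruct(seeds)), "new instruction-response pairs generated")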

3 Data Synthesis and Augmentation in the Full Lifecycle of LLMs

From the perspective of the full lifecycle of LLMs, we divide the existing investigations into six stages: data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. The present section introduces relevant research in each stage.

3.1 Data Preparation

In the data preparation phase, data synthesis and augmentation aim to generate diverse and high-quality datasets for the training of LLMs, addressing the challenge of the scarcity of real-world data. According to the taxonomy discussed in Section 2, we divide the present subsection into general model distillation and data augmentation.


Figure 4: Illustration of the evolutionary steps in the development of data synthesis and augmentation techniques for large models.

3.1.1 General Model Distillation. This approach aims to leverage the powerful capabilities of general LLMs to distill high-quality data. According to the approach and data modality, we further divide general model distillation into five categories: synthesize from seeds, synthesize reasoning steps, synthesize with controllability, synthesize from scratch, and synthesize multimodal data.

Synthesize from Seeds. To synthesize datasets for specific tasks, prompting LLMs with a small number of relevant examples can effectively produce high-quality datasets at a low cost. For instance, to investigate "how small can an LLM be to achieve certain capabilities", TinyStories [53] is constructed by instructing an LLM to generate stories that combine three words randomly chosen from 1,500 basic words, and it can be used to train and evaluate language models. Based on the collected large-scale functions, Case2Code [180] incorporates LLMs to generate suitable inputs for these functions and utilizes the code interpreter to calculate the
