A Survey on Data Synthesis and Augmentation for Large Language Models

arXiv:2410.12896v1 [cs.CL] 16 Oct 2024

Ke Wang (onecall@), Hangzhou Innovation Institute, Beihang University
Jiahui Zhu (zhujh224@), Hangzhou Innovation Institute, Beihang University
Minjie Ren (rmj_rmj@), Hangzhou Innovation Institute, Beihang University
Zeming Liu (zmliu@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Shiwei Li (shiweili93@), Hangzhou Innovation Institute, Beihang University
Zongye Zhang (zhangzongye@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Chenkai Zhang (zhangchenkai@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Xiaoyu Wu (zf2306113@), Hangzhou Innovation Institute, Beihang University
Qiqi Zhan (zhanqiqi@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Qingjie Liu (qingjie.liu@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Yunhong Wang (yhwang@), State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
Abstract

The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies in the construction of LLMs, while providing valuable insights for future exploration.
1 Introduction

In recent years, large language models (LLMs) have demonstrated unparalleled capabilities across a wide array of tasks [9, 68, 166], firmly establishing themselves as the backbone of general artificial intelligence (AI) systems. These models achieve significant improvements in natural language processing [234, 262, 264], computer vision [100, 207, 239], and other research fields [36, 163, 229], consistently pushing the boundaries of what AI can achieve. The success of LLMs is largely attributed to their ability to capture intricate patterns and relationships within vast amounts of data, allowing them to perform complex tasks such as natural language inference [39, 134], visual question answering [151, 158], and vision-and-language navigation [125, 178] with remarkable proficiency.

However, the performance of LLMs is highly dependent on the quality and volume of the data they are trained on [2, 57, 58]. With the exponential growth in model size, now reaching billions or even trillions of parameters [105, 168, 268], there is an increasing demand for large-scale, diverse, and high-quality data to ensure robust generalization across various tasks and domains. Obtaining such data poses significant challenges due to the high costs of data collection and the problems introduced by privacy concerns. Additionally, the growth rate of high-quality data lags far behind the rapidly increasing size of training datasets. If this trend continues, the available data will eventually be depleted, implying that without significant improvements in data efficiency or the discovery of new data sources, the growth of LLMs may slow down considerably. Given these impending limitations, data synthesis and augmentation techniques become essential to extending the lifespan and generalization of LLMs. Traditional data synthesis and augmentation techniques [34, 98, 135, 194], such as image rotation, cropping, flipping, and rule-based natural language generation, have been widely used to address these data limitations. Although these approaches improve data diversity and address data scarcity to some extent, they still struggle to fully capture the complexities of real-world data [55], generate data at scale [233], and defend against adversarial examples [162], limiting their effectiveness for training LLMs.

Figure 1: Statistics of the publications related to LLM-oriented data synthesis and augmentation technologies, grouped by the publication year and venue.
To overcome these challenges, researchers have increasingly turned to LLM-oriented data synthesis and augmentation techniques, recognizing the ability of LLMs to model complex patterns from large datasets and generate synthetic data that closely mirror real-world distributions while introducing valuable variations [37, 175, 260]. These studies reduce the reliance on manually curated datasets and enable the generation of high-quality, diverse data that meets the evolving demands of LLMs throughout their lifecycle and functions. To capture the breadth of these efforts, we collected papers related to LLM-oriented data synthesis and augmentation by searching Google Scholar using keywords such as "data synthesis," "data augmentation," and "large models." Figure 1 illustrates the publication trends by year and venue, reflecting the increasing interest in this field. As of October 2024, we identified 250 unique publications covering diverse research topics and venues. Summarizing these efforts provides critical insights into the progress and challenges that remain, offering a foundation for future research. Despite these advancements, several key challenges remain in LLM-oriented data synthesis and augmentation. The misuse of synthetic data poses risks, particularly in spreading misinformation and raising ethical concerns around manipulating public opinion. Additionally, synthetic data often introduces ambiguity when aligning AI models with human values, potentially leading to biased outcomes. Evaluating models trained on synthetic data is also complex, as traditional benchmarks may not fully capture the nuances of this data. Ensuring reliability is another concern, as biases and inaccuracies from original datasets can persist in synthetic data, limiting its generalization across domains. Moreover, the computational demands of LLMs, along with challenges in handling less common languages or novel instructions, complicate broader applications. Finally, the lack of a unified framework for organizing and comparing the methods proposed in both academia and industry remains a barrier for researchers navigating this rapidly evolving field.
This survey aims to address these gaps by providing a comprehensive overview of LLM-oriented data synthesis and augmentation techniques. As shown in Figure 2, unlike previous surveys [43, 140, 147, 214, 271], which primarily focus on applying these methods to support specific downstream tasks or particular stages of LLMs, our work emphasizes the direct role of LLM-oriented techniques in improving the overall performance of LLMs across various stages of their lifecycle and core functions. In contrast to the work [137], which focuses on practices for synthetic data generation to address challenges like data scarcity and privacy, our survey extends beyond practical guidance by categorizing methods aimed at improving LLM performance holistically. We examine not only data generation but also how these techniques enhance LLMs across all stages and functions, offering a more integrated, data-centric framework for advancing LLMs. Specifically, we systematically review and categorize existing research from two key perspectives: the lifecycle of LLMs (from pre-training to fine-tuning and application) and their core functions (understanding, logic, memory, and generation). By framing the discussion around these dual perspectives, we offer clearer insights into the development, interconnections, and practical applications of different approaches. Moreover, we identify critical challenges, explore emerging research directions, and highlight potential breakthroughs that could further drive advancements in LLM performance through data-centric methods.

Figure 2: A comparison between existing surveys on data synthesis and augmentation techniques and our work. Previous surveys primarily focus on LLM-based data synthesis and augmentation methods aimed at supporting downstream tasks. In contrast, our work emphasizes LLM-oriented data synthesis and augmentation, systematically covering the full lifecycle of LLMs, from data preparation to applications, and addressing core LLM functions such as understanding and generation, with the ultimate goal of improving LLMs themselves through data-centric techniques.
The contributions of this survey are summarized as follows:

• First survey: To our knowledge, we present the first comprehensive survey focused on advancing LLMs through data synthesis and augmentation, systematically covering the entire lifecycle stages and core functions of LLMs. This survey provides an in-depth analysis of current methodologies and highlights the unique challenges at each stage.

• New taxonomy: We introduce an innovative organizational framework that categorizes existing research from two key perspectives: the lifecycle stages of LLMs and their core functions. This taxonomy offers a clearer understanding of the progression, interconnections, and applicability of different approaches, providing valuable insights into both developmental and functional aspects of LLM-oriented data synthesis and augmentation.

• New frontiers: We identify critical challenges and explore emerging research directions and potential breakthroughs in LLM-oriented data synthesis and augmentation. This discussion aims to inspire future research and guide developments in data-centric techniques for LLM advancement.

• Abundant resources: We organize and maintain a dedicated repository to support ongoing research and collaboration in LLM-oriented data synthesis and augmentation. This resource includes a curated collection of related papers, multiple leaderboards tracking the latest advancements, and regular updates to foster innovation, guide future research directions, and accelerate breakthroughs in the field.
By offering a comprehensive overview of LLM-oriented data synthesis and augmentation approaches, this survey aims to clarify the current state of the field and inspire future research directions that can further enhance LLM capabilities through data synthesis and augmentation methodologies.

We organize the remainder of this survey as follows: Section 2 categorizes the primary areas of LLM-oriented data synthesis and augmentation, providing an overview of the foundational techniques. Section 3 discusses the current LLM-oriented data synthesis and augmentation methods from the perspective of the full lifecycle of LLMs, detailing how these techniques are employed at different stages of model development. In Section 4, we review these methods from the viewpoint of core LLM functions, exploring how data synthesis and augmentation enhance key capabilities such as understanding, logic, memory, and generation. Section 5 delves into the evaluation strategies for LLM-oriented data synthesis and augmentation, addressing benchmarks, evaluation metrics, and leaderboards used to assess and compare the effectiveness of existing approaches. Finally, Section 6 provides insights into challenges and emerging trends in LLM-oriented data synthesis and augmentation, offering recommendations for future research directions that can contribute to the continued advancement of LLMs through data synthesis and augmentation methodologies.
2 Taxonomy

Data generation methods play a pivotal role in addressing data scarcity and imbalance, thereby improving model performance and generalization. As shown in Fig. 4, we summarize the development and evolution of data augmentation and synthesis techniques in recent years. This section primarily introduces the current classification of data generation methods, distinguishing between data augmentation, which enhances existing data samples through transformations, and data synthesis, which creates entirely new samples from scratch or based on generative models. Both methods differ in their approach to acquiring data but aim to expand datasets. Furthermore, data augmentation and synthesis methods can be categorized into subclasses from multiple dimensions. Each approach has unique strengths and applications, enabling researchers to tailor their data generation strategies to specific needs and goals.
Figure 3: The main content flow and categorization of this survey. (The original figure is a taxonomy tree listing representative methods under the survey's taxonomy of data augmentation and data synthesis, the full lifecycle stages of LLMs, the core functionalities, and the challenges, limitations, and future directions.)

2.1 Data Augmentation

Data augmentation, a type of generation approach from data to data, generally involves manipulating the original data to increase its diversity and quantity without significantly altering its essential characteristics. Techniques used in data augmentation are designed to enhance the richness of existing data samples through transformations or perturbations. Across different modalities, data augmentation techniques often exhibit similarities. For instance, in image data, augmentation operations encompass mosaic [90], flipping [184], copy-pasting [61], adding noise [149], pairing [84], and so forth. Similarly, in text data, augmentation operations involve synonym replacement [95], copy-pasting [185], etc. Moreover, to cater to the demands of multimodal learning, existing research has addressed cross-modal information alignment during data augmentation. MixGen [75] generates new training samples by linearly interpolating images and concatenating text sequences from two existing image-text pairs; the semantic relationship within the newly generated image-text pair remains consistent and matched. Recently, in the rapidly advancing landscape of LLMs, data augmentation has emerged as a cornerstone for bolstering model performance through the diversification of training exemplars, circumventing the necessity for extensive additional data gathering. From a data-centric perspective, we systematically categorize existing research on data augmentation into three distinct categories: data labeling [3, 63, 94, 136, 198, 275], data reformation [45, 51, 143, 237], and co-annotation [11, 43, 116].
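To make the MixGen-style cross-modal augmentation described above concrete, the following is a minimal Python sketch rather than the authors' implementation; the interpolation coefficient, the representation of images as same-shaped float arrays, and the simple caption concatenation are illustrative assumptions.

import numpy as np

def mixgen_pair(image_a, text_a, image_b, text_b, lam=0.5):
    """Create one augmented image-text pair from two existing pairs.

    Following the MixGen idea: the images are linearly interpolated and the
    text sequences are concatenated, so the new caption still describes
    content present in the mixed image.
    """
    # Linear interpolation in pixel space (images assumed same shape, floats in [0, 1]).
    mixed_image = lam * image_a + (1.0 - lam) * image_b
    # Concatenate the two captions into one text sequence.
    mixed_text = f"{text_a} {text_b}"
    return mixed_image, mixed_text

# Example usage with toy data.
img1, img2 = np.random.rand(224, 224, 3), np.random.rand(224, 224, 3)
new_img, new_txt = mixgen_pair(img1, "a dog on the grass", img2, "a red kite in the sky")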
2.1.1 Data Labeling. Data labeling endeavors to leverage the comprehensive language understanding capabilities of LLMs to annotate vast unlabeled datasets. This methodology is particularly beneficial in fields that possess a substantial unlabeled data corpus, encompassing domains such as cross-lingual processing and multimodal learning [3, 63, 275], where the automation of annotation can significantly expedite the data preparation process. Recent research studies the zero-shot annotation ability of LLMs, such as GPT-4 on labeling political Twitter data [198]. Moreover, Khan et al. [94] focus on visual question answering (VQA) tasks by generating pseudo-label data from unlabeled images by utilizing the SelTDA framework.
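As a rough illustration of zero-shot labeling with an LLM (not the procedure of any specific cited work), the sketch below wraps a generic chat-completion call; the call_llm callable, the label set, and the prompt wording are assumptions.

# Minimal sketch of zero-shot data labeling with an LLM.
# `call_llm(prompt) -> str` is a placeholder for any chat-completion API.
LABELS = ["positive", "negative", "neutral"]  # illustrative label set

def llm_label(text, call_llm):
    prompt = (
        "Classify the following text into exactly one of these labels: "
        f"{', '.join(LABELS)}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )
    answer = call_llm(prompt).strip().lower()
    # Fall back to a default label if the model answers outside the label set.
    return answer if answer in LABELS else "neutral"

def label_corpus(unlabeled_texts, call_llm):
    """Annotate a list of unlabeled texts, returning (text, label) pairs."""
    return [(t, llm_label(t, call_llm)) for t in unlabeled_texts]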
2.1.2 Data Reformation. Data reformation involves transforming and restructuring existing data into a broader spectrum of variations, thereby facilitating more fine-grained data augmentation [45, 51]. This approach aims to enrich the training landscape with diverse yet pertinent examples, enhancing the model's robustness and generalization capabilities. Classic methods such as rotation [92], color channel transformation [64], and synonym replacement [95] are commonly used. Recently, approaches utilizing LLMs have also emerged. For example, Chen et al. [27] propose Disco, an approach that harnesses LLMs to produce large-scale, high-quality counterfactual data.
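The classic synonym-replacement reformation mentioned above can be sketched as follows; the tiny synonym table and the replacement probability are illustrative assumptions, and a real pipeline would typically draw substitutes from WordNet or an LLM.

import random

# Illustrative synonym table; a real pipeline would use WordNet or an LLM.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
    "big": ["large", "huge"],
}

def synonym_replace(sentence, p=0.3, seed=None):
    """Return a reformed sentence where eligible words are swapped for synonyms
    with probability p, keeping the overall meaning intact."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_replace("the quick dog looks happy", p=1.0, seed=0))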
2.1.3 Co-Annotation. Co-annotation designates the collaborative effort between human annotators and LLMs in the annotation process [11]. By integrating the strengths of both annotation methodologies, co-annotation not only mitigates annotation costs but also concurrently enhances annotation performance, fostering a more efficient and effective approach to data annotation. Li et al. [116] introduce CoAnnotating, a framework that strategically assigns data points for annotation either to humans or to LLMs, based on an assessment of the LLM's annotation uncertainty.
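A simplified sketch of the human/LLM routing idea is given below; it estimates the LLM's annotation uncertainty from the disagreement among repeated samples and defers uncertain items to humans. The call_llm callable, the number of samples, and the entropy threshold are assumptions, and this is not the actual CoAnnotating implementation.

from collections import Counter
import math

def label_entropy(labels):
    """Shannon entropy of the empirical label distribution from repeated LLM calls."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def route_items(items, call_llm, n_samples=5, threshold=0.8):
    """Split items into (llm_annotated, for_humans) based on LLM self-consistency."""
    llm_annotated, for_humans = [], []
    for item in items:
        # Sample several labelings; disagreement serves as an uncertainty proxy.
        samples = [call_llm(f"Label this item: {item}") for _ in range(n_samples)]
        if label_entropy(samples) <= threshold:
            # Confident: keep the majority LLM label.
            llm_annotated.append((item, Counter(samples).most_common(1)[0][0]))
        else:
            # Uncertain: defer to human annotators.
            for_humans.append(item)
    return llm_annotated, for_humans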
2.2 Data Synthesis

Data synthesis, on the other hand, aims to create entirely new data, from scratch or based on generative models, that is similar to the distribution of real data. In recent years, with the explosion and advancements in generative AI [13, 41, 42, 78, 139, 161, 169], there have been significant strides in the quality and generation efficiency of synthetic data. Based on the requirements of LMs, this paper categorizes data synthesis methods into three main types: general model distillation [22, 53, 120, 263, 266], domain model distillation [108, 145, 146, 215], and model self-improvement [54, 150, 210, 248].
2.2.1 General Model Distillation. Among these, general model distillation involves leveraging powerful general models, typically featuring larger parameters and superior performance, such as StableVicuna, ChatGPT, and GPT-4, to generate datasets that can enhance the capabilities of weaker models. There are various ways to employ these powerful models, such as using predefined templates to generate tiny stories [53] and leveraging the LLMs themselves to evaluate the quality of the generated data. Phi-1 and its series [67, 120] have demonstrated that a small amount of high-quality data can also train a powerful model, by leveraging the comprehensive generation of textbooks and exercises from GPT-3.5. Some other methods have also achieved performance improvements by generating instruction datasets and fine-tuning models after improving the quality of these datasets [22, 80, 196].
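A minimal sketch of template-based general model distillation, in the spirit of the tiny-stories example above, follows; the basic-word list, the prompt template, and the call_strong_llm callable are assumptions for illustration.

import random

BASIC_WORDS = ["dog", "ball", "happy", "run", "tree", "small"]  # stand-in for a larger basic-word vocabulary

STORY_TEMPLATE = (
    "Write a short, simple story for young children that uses "
    "the words '{w1}', '{w2}', and '{w3}'."
)

def distill_stories(call_strong_llm, n_stories=3, seed=0):
    """Generate synthetic training stories by prompting a strong general model."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_stories):
        w1, w2, w3 = rng.sample(BASIC_WORDS, 3)
        prompt = STORY_TEMPLATE.format(w1=w1, w2=w2, w3=w3)
        dataset.append({"prompt": prompt, "story": call_strong_llm(prompt)})
    return dataset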
2.2.2 Domain Model Distillation. Domain model distillation pertains to the utilization of models that are tailored to generate data within a particular domain. This approach is often necessary when general models fail to meet the specific needs of industry applications. For instance, in the context of code programming, domain model distillation can be employed to generate instructional data tailored to specific coding tasks [146, 215]. In the realm of mathematics, methods such as Minerva [108] and DeepSeekMath [220] are designed to generate solutions to mathematical problems while ensuring their accuracy and diversity. Additionally, the realm of industry data often presents barriers, such as limited data scales and the inaccessibility of data within specific enterprises within the domain. These factors necessitate the adoption of domain-specific models that can effectively address the unique challenges posed by these scenarios.
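For domains such as code, one common way to enforce the accuracy of distilled data is to keep only generated solutions that pass automatic checks; the sketch below filters candidate programs by executing their unit tests and is an illustrative assumption rather than the procedure of any specific method cited above.

def passes_tests(program_src, test_src):
    """Execute a candidate program together with its tests in an isolated namespace.

    Note: exec() on model-generated code should only ever be run in a sandbox.
    """
    namespace = {}
    try:
        exec(program_src, namespace)   # define the candidate solution
        exec(test_src, namespace)      # raises AssertionError if a test fails
        return True
    except Exception:
        return False

def filter_synthetic_code(samples):
    """Keep only (problem, program, tests) triples whose program passes its tests."""
    return [s for s in samples if passes_tests(s["program"], s["tests"])]

# Example: a trivial generated sample that passes its test.
sample = {
    "problem": "Add two numbers.",
    "program": "def add(a, b):\n    return a + b",
    "tests": "assert add(2, 3) == 5",
}
print(filter_synthetic_code([sample]))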
2.2.3 Model Self-Improvement. Model self-improvement refers to the process where a model generates higher-quality data to enhance its capabilities. For instance, leveraging existing instructions to adjust the model and prompting it to paraphrase documents on the web in specific styles, such as Wikipedia-style or QA-style, can be used to jointly pre-train LLMs for both authentic and synthetic paraphrasing tasks [150]. Self-Instruct [210] enhances LMs themselves by autogenerating and refining instructional data, boosting performance with minimal human intervention.
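The self-improvement loop behind Self-Instruct can be roughly sketched as below: the model expands a small seed pool of instructions, and near-duplicates are filtered out before the pool is reused. The call_model callable, the similarity-based filter, and the prompt format are simplifying assumptions rather than the paper's exact pipeline.

import difflib

def is_near_duplicate(candidate, pool, cutoff=0.7):
    """Reject candidates that closely resemble an instruction already in the pool."""
    return any(
        difflib.SequenceMatcher(None, candidate.lower(), p.lower()).ratio() >= cutoff
        for p in pool
    )

def self_instruct(call_model, seed_instructions, rounds=2, per_round=5):
    """Grow an instruction pool by letting the model propose new instructions."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        examples = "\n".join(f"- {p}" for p in pool[-5:])
        prompt = (
            "Here are some example task instructions:\n"
            f"{examples}\n"
            f"Write {per_round} new, diverse task instructions, one per line."
        )
        candidates = [c.strip("- ").strip() for c in call_model(prompt).splitlines() if c.strip()]
        for c in candidates:
            if not is_near_duplicate(c, pool):
                pool.append(c)
    return pool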
3 Data Synthesis and Augmentation in the Full Lifecycle of LLMs

From the perspective of the full lifecycle of LLMs, we divide the existing investigations into six stages, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. The present section introduces relevant research in each stage.

3.1 Data Preparation

In the data preparation phase, data synthesis and augmentation aim to generate diverse and high-quality datasets for the training of LLMs, addressing the challenge of the scarcity of real-world data. According to the taxonomy discussed in Section 2, we divide the present subsection into general model distillation and data augmentation.
Figure 4: Illustration of the evolutionary steps in the development of data synthesis and augmentation techniques for large models.
3.1.1 General Model Distillation. This approach aims to leverage the powerful capabilities of general LLMs to distill high-quality data. According to the approach and data modality, we further divide general model distillation into five categories: synthesize from seeds, synthesize reasoning steps, synthesize with controllability, synthesize from scratch, and synthesize multimodal data.

Synthesize from Seeds. To synthesize datasets for specific tasks, prompting LLMs with a small number of relevant examples can effectively produce high-quality datasets at a low cost. For instance, to investigate "how small can an LLM be to achieve certain capabilities", TinyStories [53] is constructed by instructing an LLM to generate stories that combine three words randomly chosen from 1500 basic words, and it can be used to train and evaluate language models. Based on the collected large-scale functions, Case2Code [180] incorporates LLMs to generate suitable inputs for these functions and utilizes the code interpreter to calculate the