Google DeepMind
© 2024 Google. All rights reserved.

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Google1
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.
1. Introduction

We present our latest multimodal model from the Gemini line: Gemini 1.5 Pro. This is our first release from Gemini 1.5, a new family of highly-capable multimodal models which incorporates a novel mixture-of-experts architecture as well as major advances in training and serving infrastructure that allow it to push the boundary of efficiency, reasoning, and long-context performance. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost a day's worth of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train.
The ability to model data of increasingly longer contexts has tracked the development of more general and capable language models, from the now-toy 2-gram language model proposed by Shannon (1948), to the modern n-gram models of the 1990s and 2000s (Brants et al., 2007; Chen and Goodman, 1999; Jelinek, 1998; Kneser and Ney, 1995) typically constrained to 5 tokens of context, to recurrent neural network language models from the 2010s which could effectively condition on hundreds of tokens (Jozefowicz et al., 2016; Mikolov et al., 2010), to the modern Transformer (Vaswani et al., 2017) which can condition on hundreds of thousands of tokens (Anthropic, 2023). Gemini 1.5 Pro continues this trend by extending language model context lengths by over an order of magnitude. Scaling to millions of tokens, we find a continued improvement in predictive performance (Section 4.2.1), near-perfect recall (>99%) on synthetic retrieval tasks (Figure 1 and Section 4.2.1), and a host of surprising new capabilities like in-context learning from entire long documents (Section 4.2).
1 Please send correspondence to gemini-1_5-report@.
[Figure 1 panels: Video Haystack (x-axis: minutes), Audio Haystack (x-axis: minutes), and Text Haystack (x-axis: tokens, from 32k up to multiple millions); y-axis: needle depth.]
Figure 1 | Gemini 1.5 Pro achieves near-perfect "needle" recall (>99.7%) up to 1M tokens of "haystack" in all modalities, i.e., text, video and audio. It even maintains this recall performance when extending to 10M tokens in the text modality (approximately 7M words); 2M tokens in the audio modality (up to 22 hours); 2.8M tokens in the video modality (up to 3 hours). The x-axis represents the context window, and the y-axis the depth percentage of the needle placed for a given context length. The results are color-coded to indicate: green for successful retrievals and red for unsuccessful ones.
To measure the effectiveness of our model's long-context capabilities, we conduct experiments on both synthetic and real-world tasks. In synthetic "needle-in-a-haystack" tasks inspired by Kamradt (2023) that probe how reliably the model can recall information amidst distractor context, we find that Gemini 1.5 Pro achieves near-perfect (>99%) "needle" recall up to multiple millions of tokens of "haystack" in all modalities, i.e., text, video and audio, and even maintains this recall performance when extending to 10M tokens in the text modality. In more realistic multimodal long-context benchmarks which require retrieval and reasoning over multiple parts of the context (such as answering questions from long documents or long videos), we also see Gemini 1.5 Pro outperforming all competing models across all modalities, even when these models are augmented with external retrieval methods. Finally, we qualitatively showcase the in-context learning abilities of Gemini 1.5 Pro enabled by very long context: for example, learning to translate a new language from a single set of linguistic documentation. With only instructional materials (500 pages of linguistic documentation, a dictionary, and ≈400 parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua2, and which therefore has almost no online presence. Moreover, we find that the quality of its translations is comparable to that of a person who has learned from the same materials.

2 Kalamang Language: /lang/1891
Gemini 1.5 Pro | Relative to 1.0 Pro | Relative to 1.0 Ultra
Long-Context Text, Video & Audio | from 32k up to 10M tokens | from 32k up to 10M tokens
Core Capabilities | Win-rate: 87.1% (27/31 benchmarks) | Win-rate: 54.8% (17/31 benchmarks)
Text | Win-rate: 100% (13/13 benchmarks) | Win-rate: 77% (10/13 benchmarks)
Vision | Win-rate: 77% (10/13 benchmarks) | Win-rate: 46% (6/13 benchmarks)
Audio | Win-rate: 60% (3/5 benchmarks) | Win-rate: 20% (1/5 benchmarks)

Table 1 | Gemini 1.5 Pro compared to the Gemini 1.0 family. Gemini 1.5 Pro maintains high levels of performance even as its context window increases. Detailed results are presented in Table 7.
Importantly, this leap in long-context performance does not come at the expense of the core multimodal capabilities of the model.3 Overall, we find that Gemini 1.5 Pro greatly surpasses Gemini 1.0 Pro, performing better on the vast majority of benchmarks (i.e., 27/31), increasing the margin in particular for Math, Science and Reasoning (+28.9%), Multilinguality (+22.3%), Video Understanding (+11.2%) and Code (+8.9%) (see Table 7 for breakdowns). However, a more striking comparison is the one with Gemini 1.0 Ultra, a state-of-the-art model across many capabilities. Despite Gemini 1.5 Pro using significantly less training compute and being more efficient to serve, we find Gemini 1.5 Pro to perform better on more than half of the benchmarks (16/31), in particular on text benchmarks (10/13) and many of the vision benchmarks (6/13).
In the following sections, we provide an overview of the model architecture and present the results of large-scale quantitative evaluations comparing Gemini 1.5 Pro to other LLMs. We present detailed evaluations of the model's long-context capabilities followed by evaluations of its core capabilities, similar to the Gemini 1.0 technical report (Gemini-Team et al., 2023), covering well-studied benchmarks across text, code, image, video and audio. Finally, we discuss our approach to responsible deployment, including our process for impact assessment, developing model policies, evaluations, and mitigations of harm before deployment decisions.4
2. Model Architecture

Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer-based model that builds on Gemini 1.0's (Gemini-Team et al., 2023) research advances and multimodal capabilities. Gemini 1.5 Pro also builds on a much longer history of MoE research at Google (Clark et al., 2022; Du et al., 2022; Fedus et al., 2021; Lepikhin et al., 2020; Riquelme et al., 2021; Shazeer et al., 2017; Zoph et al., 2022) and language model research in the broader literature (Anil et al., 2023; Anthropic, 2023; Brown et al., 2020; Chowdhery et al., 2023; Hoffmann et al., 2022; Jiang et al., 2024; Kim et al., 2021; OpenAI, 2023; Rae et al., 2021; Raffel et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Touvron et al., 2023a,b; Vaswani et al., 2017). MoE models use a learned routing function to direct inputs to a subset of the model's parameters for processing. This form of conditional computation (Bengio et al., 2013; Davis and Arel, 2014; Jacobs et al., 1991) allows models to grow their total parameter count while keeping the number of parameters that are activated for any given input constant.
3 We define the core capabilities as those capabilities of the model that are primarily non-long-context (e.g., math, science, reasoning, multilinguality, code, etc.), similar to capabilities covered in the Gemini 1.0 technical report (Gemini-Team et al., 2023).
4 See the model card (Mitchell et al., 2019a) in Appendix Section 8.1.
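The report does not describe Gemini 1.5 Pro's routing mechanism, so the following is only a minimal sketch of the kind of top-k token routing used in the MoE literature cited above (e.g., Shazeer et al., 2017; Lepikhin et al., 2020). The expert count, value of k, and gating details are illustrative assumptions, not the Gemini architecture; the point is simply how conditional computation keeps the number of active parameters per token constant while the total parameter count grows with the number of experts.

```python
import numpy as np

def moe_layer(x, gate_w, expert_w, k=2):
    """Sketch of top-k mixture-of-experts routing for a batch of tokens.

    x:        [batch, d_model]            token representations
    gate_w:   [d_model, n_exp]            learned routing (gating) weights
    expert_w: [n_exp, d_model, d_model]   one weight matrix per expert
    Only the k selected experts run per token, so active parameters stay
    constant while total parameters grow with n_exp.
    """
    logits = x @ gate_w                             # [batch, n_exp]
    top_k = np.argsort(logits, axis=-1)[:, -k:]     # indices of the chosen experts
    sel = np.take_along_axis(logits, top_k, axis=-1)
    weights = np.exp(sel - sel.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over the selected experts

    out = np.zeros_like(x)
    for b in range(x.shape[0]):
        for j in range(k):
            e = top_k[b, j]
            out[b] += weights[b, j] * (x[b] @ expert_w[e])  # run only the chosen experts
    return out

# Toy usage: 8 experts, 2 active per token.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
gate_w = rng.normal(size=(16, 8))
expert_w = rng.normal(size=(8, 16, 16))
print(moe_layer(x, gate_w, expert_w).shape)  # (4, 16)
```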
A host of improvements made across nearly the entire model stack (architecture, data, optimization and systems) allows Gemini 1.5 Pro to achieve comparable quality to Gemini 1.0 Ultra (see Section 5), while using significantly less training compute and being significantly more efficient to serve. Gemini 1.5 Pro also incorporates a series of significant architecture changes that enable long-context understanding of inputs up to 10 million tokens without degrading performance. Translated into real-world data, this context length enables Gemini 1.5 Pro models to comfortably process almost a day of audio recordings (i.e., 22 hours), more than ten times the entirety of the 1440-page book (or 587,287 words) "War and Peace", the entire Flax (Heek et al., 2023) codebase (41,070 lines of code), or three hours of video at 1 frame-per-second. Further, since the model is natively multimodal and supports interleaving of data from different modalities, it can support a mix of audio, visual, text, and code inputs in the same input sequence. In Section 4.1, we highlight some of the novel capabilities enabled by these advances, including evaluations that yielded positive results on context lengths up to 10 million tokens. We note that understanding the limits of these capabilities and studying their exciting applications remains an area of continued research exploration.
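As a rough back-of-envelope check, the equivalences quoted above and in Figure 1 imply approximate per-modality token rates. The short calculation below simply inverts those published figures; the derived rates are estimates implied by the report's numbers, not documented tokenizer constants.

```python
# Approximate per-modality token rates implied by the figures quoted in the report.
audio_tokens, audio_hours = 2_000_000, 22          # Figure 1: 2M tokens ~ 22 hours of audio
video_tokens, video_hours = 2_800_000, 3           # Figure 1: 2.8M tokens ~ 3 hours at 1 FPS
text_tokens, text_words   = 10_000_000, 7_000_000  # Figure 1: 10M tokens ~ 7M words

print(f"audio: ~{audio_tokens / (audio_hours * 3600):.0f} tokens per second")  # ~25
print(f"video: ~{video_tokens / (video_hours * 3600):.0f} tokens per frame")   # ~259 at 1 FPS
print(f"text:  ~{text_tokens / text_words:.2f} tokens per word")               # ~1.43
```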
3. Training Infrastructure and Dataset

Like Gemini 1.0 Ultra and 1.0 Pro, Gemini 1.5 Pro is trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple datacenters, and on a variety of multimodal and multilingual data. Our pre-training dataset includes data sourced across many different domains, including web documents and code, and incorporates image, audio, and video content. For the instruction-tuning phase, we fine-tuned Gemini 1.5 Pro on a collection of multimodal data (containing paired instructions and appropriate responses), with further tuning based on human preference data. We refer readers to the Gemini 1.0 Technical Report (Gemini-Team et al., 2023) for further information.
4. Long-context Evaluation

Existing evaluations are increasingly strained by the new and rapidly advancing capabilities of large multimodal models. They typically focus on individual modalities and/or are restricted to tasks with shorter context lengths. Hence, there is a growing need for benchmarks which exemplify the nuanced requirements of real-world long mixed-modality use cases. Among these, we highlight the quantitative assessment of reasoning capabilities across long mixed-modality sequences as a key challenge.
With the challenges of evaluating increasingly capable models in mind, our evaluation of Gemini 1.5 Pro first focuses on understanding and evaluating its novel capabilities. Subsequently, we explore core benchmarks, covering capabilities studied in the Gemini 1.0 Technical Report (Gemini-Team et al., 2023). Specifically, we evaluate Gemini 1.5 Pro in three main categories:5
1. Qualitative long-context multimodal evaluations: manually probe and stress-test the model's long-context abilities, especially for novel capabilities where no quantitative benchmarks exist.
2. Quantitative long-context multimodal evaluations: measure the model's long-context abilities on both synthetic and real-world tasks with well-defined metrics.
3. Quantitative core evaluations: identify progress and regression in core capabilities (e.g., coding, math, science, multilinguality and instruction following).
5 We note that all the evaluations are from the same checkpoint of the Gemini 1.5 Pro model that is instruction-tuned post pre-training, unless otherwise stated.
4.1. Qualitative Examples of Multimodal Long-Context Capabilities

The ability to process multiple millions of tokens unlocks practical applications that were not possible before. In this section we demonstrate some surprising interactions we observed with Gemini 1.5 Pro across code, text and video.

As shown in Figure 2, Gemini 1.5 Pro is able to ingest entire large codebases such as JAX (746,152 tokens), and answer very specific queries about them. In Figure 3 we show Gemini 1.5 Pro's ability to learn a new language based solely on reference materials given in its input (see Section 4.2 for quantitative metrics for this use case). Additionally, we test Gemini 1.5 Pro's ability to answer an image query given the entire text of Les Misérables, and observe that being natively multimodal allows it to locate a famous scene from a hand-drawn sketch, as shown in Figure 4. Lastly, in Figure 5 we ask Gemini 1.5 Pro questions about an entire 45-minute movie, which the model answers seamlessly while retrieving moments and timestamps down to a second.6
Figure 2 | Given the entire 746,152-token JAX codebase in context, Gemini 1.5 Pro can identify the specific location of a core automatic differentiation method.
Figure 3 | Given a reference grammar book and a bilingual wordlist (dictionary), Gemini 1.5 Pro is able to translate from English to Kalamang with similar quality to a human who learned from the same materials.
6 For additional short videos demonstrating the long-context abilities of Gemini 1.5 Pro across video, text, and code, see https://deepmind.google/technologies/gemini/.
Figure 4 | With the entire text of Les Misérables in the prompt (1382 pages, 732k tokens), Gemini 1.5 Pro is able to identify and locate a famous scene from a hand-drawn sketch.
Figure 5 | When prompted with a 45-minute Buster Keaton movie, "Sherlock Jr." (1924) (2,674 frames at 1 FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. At bottom right, the model identifies a scene in the movie from a hand-drawn sketch.
4.2. Long-context Evaluations

For the past few years, LLM research has prioritized expanding the context window from which models can incorporate information (Anthropic, 2023; OpenAI, 2023). This emphasis stems from the recognition that a wider context window allows models to incorporate a larger amount of new, task-specific information not found in the training data at inference time, leading to improved performance in various natural language or multimodal tasks. Recent approaches to improving the long-context capabilities of models fall into a few categories, including novel architectural approaches (Ainslie et al., 2023; Gu and Dao, 2023; Guo et al., 2021; Orvieto et al., 2023; Zaheer et al., 2020), post-training modifications (Bertsch et al., 2023; Chen et al.; Press et al., 2021; Xiong et al., 2023), retrieval-augmented models (Guu et al., 2020; Izacard et al., 2022; Jiang et al., 2022; Karpukhin et al., 2020; Santhanam et al., 2021), memory-augmented models (Bulatov et al., 2022, 2023; Martins et al., 2022; Mu et al., 2023; Wu et al., 2022a,b; Zhong et al., 2022), and techniques for building more coherent long-context datasets (Shi et al., 2023c; Staniszewski et al., 2023). This activity has resulted in measurable improvements in the long-context capabilities of LLMs over the past several months, with the recent concurrent work of Liu et al. (2024) exploring context windows of 7B models up to 1M multimodal tokens. Notably, among the state-of-the-art LLMs, Anthropic has successfully extended the context of their text-only Claude 2 model to 100k tokens, while OpenAI has recently released GPT-4 Turbo reaching 128k tokens. Finally, the latest addition to the series was Claude 2.1, with a context window of 200k tokens.
Gemini 1.5 Pro significantly extends this context length frontier to multiple millions of tokens with almost no degradation in performance, making it possible to process significantly larger inputs. Compared to Claude 2.1 with a 200k token context window, Gemini 1.5 Pro achieves 100% recall at 200k tokens, surpassing Claude 2.1's 98%. This 100% recall is maintained up to 530k tokens, and recall is 99.7% at 1M tokens. When increasing from 1M tokens to 10M tokens, the model retains 99.2% recall. Moreover, Gemini 1.5 Pro's native multimodal capabilities enable the model to ingest multiple hours of audio and video recordings alongside or interleaved with text. Such recall capabilities are summarized in Figure 1.
Below we report results on long-context evaluations across all three modalities, i.e., text, vision and audio.

The evaluation methodology we followed to measure the long-context capability of Gemini 1.5 Pro consists of both diagnostic-focused probing of the long-context capabilities (e.g., perplexity over long sequences, needle-in-a-haystack retrieval studies) and realistic evaluations specifically designed for multimodal long-context tasks (e.g., long-document QA, long-context automatic speech recognition, learning to translate a new language from only one book, and long-context video QA). To provide a reference point, throughout this section we compare Gemini 1.5 Pro with the leading model available externally for each task. With the evaluation harness we developed for Gemini 1.5 Pro, we are able to quantify the quality of long-context understanding capabilities reliably all the way up to 10M tokens.
4.2.1. Diagnostic Long-Context Evaluations

Perplexity over Long Sequences

We start by reporting results on the text modality. To evaluate the ability of the models to make use of very long contexts to improve next-token prediction, which is the objective function used to train language models, we record the negative log-likelihood (NLL) of tokens at different positions in the input sequences from held-out text (i.e., not used in training). Here, a lower value implies an improved prediction. Typically, we expect tokens at the beginning of a sequence to have high NLL, as there is little to no context that the model can use to predict them, and tokens later in the sequence to have lower NLL as more information becomes available to the model.
[Figure 6 panels: cumulative average NLL for code and for long documents, plotted against sequence position for Gemini 1.0 Pro and Gemini 1.5 Pro, with power-law fits (r² = 0.998); lower is better.]
Figure 6 | Cumulative average negative log-likelihood (NLL) as a function of token position in long documents and code data. A lower value demonstrates better prediction. Gemini 1.5 Pro shows improved predictions up to 1M tokens for long documents and 10M tokens for code, whereas Gemini 1.0 Pro improves up to only 32K tokens. The NLL follows a power-law trend up until 1M tokens (documents) and 2M tokens (code), with a deviating trend at 10M tokens.
The shape of the resulting curve indicates the ability of models to reason over long context. A downward trend signifies models making use of long context to reduce their uncertainty. On the other hand, an upward trend signifies that models are unable to effectively use information from the previous context and may be deteriorating in prediction quality, highlighting the limitations in their long-context understanding capability.
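To make the metric concrete, here is a minimal sketch of how a cumulative average NLL curve like the one in Figure 6 can be computed from per-token log-likelihoods. The `token_nll` array and document batching are placeholders for whatever scoring harness is actually used, which the report does not specify.

```python
import numpy as np

def cumulative_average_nll(token_nll):
    """token_nll: [num_docs, seq_len] per-token negative log-likelihoods.

    Returns the cumulative average NLL at each token position, averaged over
    documents: the value at position i is the mean NLL of tokens 1..i.
    A downward-sloping curve means later tokens are predicted better,
    i.e., the model is making use of the long context.
    """
    token_nll = np.asarray(token_nll, dtype=np.float64)
    positions = np.arange(1, token_nll.shape[1] + 1)
    cum_avg = np.cumsum(token_nll, axis=1) / positions   # per-document curves
    return cum_avg.mean(axis=0)                          # average over documents

# Toy example: per-token NLL that slowly decreases with position, plus noise.
rng = np.random.default_rng(0)
fake_nll = 2.0 + 1.0 / np.sqrt(np.arange(1, 4097)) + 0.05 * rng.normal(size=(8, 4096))
curve = cumulative_average_nll(fake_nll)
print(curve[0], curve[-1])  # later positions have lower cumulative average NLL
```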
We perform this analysis on two data sources: (a) a dataset of long documents with up to 1 million tokens, and (b) a dataset of code repositories constructed by first randomly shuffling all the files and then concatenating them. The code dataset contains sequences longer than 1 million tokens with some natural form of semantic association (e.g., a whole repository), allowing for further evaluation of sequences of up to 10M tokens. Figure 6 shows the cumulative NLL up to a specific token index.7 We also fit a power law of the form L(x) = αx^β + γ to these data points (dashed line).
We find in Figure 6 that NLL decreases monotonically with sequence length, and thus prediction accuracy improves up to the tested sequence lengths (1M for long documents, and 10M for code), indicating that our models can make use of the whole input even at very long context lengths. This suggests that the model is able to improve its predictions by finding useful patterns in tokens, even if they occurred millions of tokens in the past, as in the case of code.
Finally, we see this improved prediction follows a regular power-law structure. While it is well known that language models follow a power law in terms of training compute to model performance (NLL) (Kaplan et al., 2020) up to a very large scale, we demonstrate that a power law can hold between log-loss and context length up to extremely long context lengths. We see the power-law fit is quite accurate up to 1M tokens for long documents and about 2M tokens for code. From inspecting longer code-token predictions closer to 10M tokens, we see a phenomenon of the increased context occasionally providing outsized benefit (e.g., due to repetition of code blocks), which may explain the power-law deviation. However, this deserves further study, and may be dependent on the exact dataset used.
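As an illustration of the kind of fit reported above, the sketch below fits L(x) = αx^β + γ to cumulative-NLL observations with SciPy. The sample points are synthetic stand-ins, since the underlying evaluation data is not released, and the initialization values are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, alpha, beta, gamma):
    # L(x) = alpha * x**beta + gamma, with beta < 0 when loss decreases with context.
    return alpha * np.power(x, beta) + gamma

# Synthetic stand-in for (context length, cumulative average NLL) observations.
lengths = np.array([256, 1_024, 4_096, 16_384, 65_536, 262_144, 1_048_576], dtype=float)
nll = 1.8 * lengths ** -0.12 + 0.45          # pretend measurements lying on a power law

params, _ = curve_fit(power_law, lengths, nll, p0=(1.0, -0.1, 0.5), maxfev=10_000)
alpha, beta, gamma = params
residuals = nll - power_law(lengths, *params)
r2 = 1 - np.sum(residuals**2) / np.sum((nll - nll.mean())**2)
print(f"alpha={alpha:.3f} beta={beta:.3f} gamma={gamma:.3f} r^2={r2:.4f}")
```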
7 We note that we are unable to obtain logits for other commercially available LLMs for comparison.
[Figure 7 panels: Gemini 1.5 Pro needle recall from 1k to 1M tokens and up to 10M tokens, and GPT-4 Turbo from 1k to 128k tokens; x-axis: context length in tokens, y-axis: needle depth (0–100%).]
Figure 7 | Text Haystack. This figure compares Gemini 1.5 Pro with GPT-4 Turbo for the text needle-in-a-haystack task. Green cells indicate the model successfully retrieved the secret number, gray cells indicate API errors, and red cells indicate that the model response did not contain the secret number. The top row shows results for Gemini 1.5 Pro, from 1k to 1M tokens (top left), and from 1M to 10M tokens (top right). The bottom row shows results for GPT-4 Turbo up to the maximum supported context length of 128k tokens.
Text Haystack

Next, we move to testing long-context recall using the recently introduced needle-in-a-haystack evaluation (Kamradt, 2023), which tests a model's ability to retrieve a text (i.e., the "needle") inserted at various positions into a sequence (i.e., the "haystack"). Following prior work (Dhinakaran, 2024), we use a set of concatenated and repeated essays written by Paul Graham8 to fill the desired context length. We insert a needle at linearly spaced intervals from the beginning to the end of the context, where the needle is of the form "The special magic {city} number is: {number}", with the city and number varied for each query, and prompt the model with "Here is the magic number:". We report whether the magic number was recalled correctly at various context lengths (x-axis, the haystack) as a function of the needle's position in the input sequence, expressed in terms of depth percentage (y-axis); e.g., a depth of 100% indicates a needle inserted at the very end of the input, whereas 0% is the very beginning.
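A minimal sketch of how such a text needle-in-a-haystack prompt can be assembled is shown below. The filler text, city list, and word-level depth placement are illustrative simplifications of the setup described above (the report fills the context with Paul Graham essays and measures length in tokens), not the exact harness used for Figure 7; `call_model` is a hypothetical model call.

```python
import random

def build_haystack_prompt(filler_words, num_words, depth_pct, city, number):
    """Insert a 'needle' sentence at depth_pct% into a haystack of filler text."""
    haystack = (filler_words * (num_words // len(filler_words) + 1))[:num_words]
    needle = f"The special magic {city} number is: {number}."
    insert_at = int(len(haystack) * depth_pct / 100)      # 0% = start, 100% = end
    words = haystack[:insert_at] + needle.split() + haystack[insert_at:]
    return " ".join(words) + "\n\nHere is the magic number:"

def grade(model_response, number):
    # Recall counts as correct if the secret number appears in the response.
    return str(number) in model_response

# Toy usage over a grid of context sizes and depths, mirroring Figure 7's axes.
filler = "the quick brown fox jumps over the lazy dog".split()  # stand-in for essay text
for num_words in (1_000, 10_000):
    for depth in (0, 50, 100):
        number = random.randint(1_000, 9_999)
        prompt = build_haystack_prompt(filler, num_words, depth, "Berlin", number)
        print(num_words, depth, len(prompt.split()), "words in prompt")
        # response = call_model(prompt)          # hypothetical model call
        # print(grade(response, number))
```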
As can be seen in Figure 7, Gemini 1.5 Pro achieves 100% recall up to 530k tokens and >99.7% recall up to 1M tokens. This task, while simple, provides a clear demonstration that Gemini 1.5 Pro is able to reliably retrieve information from long documents up to 1M tokens. For reference, we report results for GPT-4 Turbo up to the 128K sequence length supported by its API.